Today I noticed a problem in the RSS feed generated by the script that I wrote: the feed contained an invalid character (character code 0x93). The cause was that the page declares its character set as iso-8859-1, even though it contains characters from the cp1250 character set.
Here is the updated script. I also had to apply the patch described on this page, which allows the RSS module to handle multibyte characters.
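To illustrate the mismatch, here is a minimal sketch (not part of the feed script) that decodes the byte 0x93 under both character sets using the core Encode module. The variable names are only for illustration. In cp1250 the byte maps to a left double quotation mark, while in iso-8859-1 it maps to a C1 control character, which is not allowed in XML 1.0.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Decode the same raw byte under each character set.
my $as_cp1250 = decode( "cp1250",     "\x93" );
my $as_latin1 = decode( "iso-8859-1", "\x93" );

# Print the resulting Unicode code points.
printf "cp1250:     U+%04X\n", ord $as_cp1250;    # U+201C, left double quotation mark
printf "iso-8859-1: U+%04X\n", ord $as_latin1;    # U+0093, a C1 control character
```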
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;
use XML::RSS;
# Encode provides decode() and the FB_HTMLCREF fallback constant used below.
use Encode;
# First - LWP::Simple. Download the page using get();.
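# get() returns undef on failure and does not set $!, so die with our own message.
my $content = get( "http://www.caltrain.com/news.html" ) or die "Could not download the news page";
# Decode the raw bytes as cp1250 (not the iso-8859-1 the page claims) into a Perl character string.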
$content = decode("cp1250", $content, Encode::FB_HTMLCREF);
# Second - Create a TokeParser object, using our downloaded HTML.
my $stream = HTML::TokeParser->new( \$content ) or die $!;
# Finally - create the RSS object.
my $rss = XML::RSS->new( version => '0.9' );
# Prep the RSS.
$rss->channel(
    title       => "Caltrain news",
    link        => "http://www.caltrain.com/news.html",
    description => "Latest caltrain news");
# Declare variables.
my ($tag, $headline, $url);
# First indication of a headline - An <a> tag is present.
while ( $tag = $stream->get_tag("a") ) {
    # Inside this loop, $tag is at an <a> tag.
    # But does it have a class="newstitle" attribute, too?
    if ($tag->[1]{class} and $tag->[1]{class} eq 'newstitle') {
        # It does!
        # Now, we're at the <a> with the headline in.
        # We need to put the contents of the 'href' attribute in $url.
        $url = $tag->[1]{href} || "--";
        # Now we can grab $headline, by using get_trimmed_text
        # up to the close of the <a> tag.
        $headline = $stream->get_trimmed_text('/a');
        # We need to escape ampersands, as they start entity references in XML.
        $url =~ s/&/&amp;/g;
        # The <a> tags contain relative URLs - we need to qualify these.
        $url = 'http://www.caltrain.com/'.$url;
        # And that's it. We can add our pair to the RSS channel.
        $rss->add_item( title => $headline, link => $url);
    }
}
$rss->save("caltrain.rss");
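To try the headline-matching loop without downloading anything, here is a rough sketch that runs the same HTML::TokeParser logic against an inline HTML string. The sample markup, the paths in it, and the @items array are made up for illustration and are not taken from the real Caltrain page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup standing in for the downloaded news page.
my $html = <<'HTML';
<div><a class="newstitle" href="news/one.html"> First  headline </a></div>
<div><a href="about.html">About us</a></div>
<div><a class="newstitle" href="news/two.html">Second headline</a></div>
HTML

my $stream = HTML::TokeParser->new( \$html ) or die "Could not parse sample HTML";

my @items;
while ( my $tag = $stream->get_tag("a") ) {
    # $tag->[1] is a hash reference holding the tag's attributes.
    my $class = $tag->[1]{class};
    next unless defined $class and $class eq 'newstitle';

    # Qualify the relative URL the same way the script above does.
    my $url = 'http://www.caltrain.com/' . $tag->[1]{href};
    # Collect the link text up to the closing </a>, with whitespace trimmed.
    my $headline = $stream->get_trimmed_text('/a');
    push @items, { title => $headline, link => $url };
}

print "$_->{title} => $_->{link}\n" for @items;
```

Only the two links with class="newstitle" end up in @items; the "About us" link is skipped.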