Paul's Time Sink: Another fix for caltrain rss script

Wednesday, October 11, 2006

Another fix for caltrain rss script

Today I noticed a problem in the RSS feed generated by the script that I wrote: an invalid character (character code 0x93) was included in the feed. The problem was that the page declared its character set as iso-8859-1, even though it contains characters from the cp1250 character set. In cp1250, 0x93 is a left double quotation mark, while in iso-8859-1 the same byte falls in the range of control characters.
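To make the mismatch concrete, here is a minimal sketch that decodes the single byte 0x93 under both character sets using the core Encode module. The variable names are mine and purely illustrative; they are not part of the script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The byte that showed up in the feed.
my $byte = "\x93";

# Read as iso-8859-1, 0x93 maps to U+0093, a C1 control character.
my $as_latin1 = decode("iso-8859-1", $byte);

# Read as cp1250, the same byte is U+201C, a left double quotation mark.
my $as_cp1250 = decode("cp1250", $byte);

printf "iso-8859-1: U+%04X\n", ord($as_latin1);
printf "cp1250:     U+%04X\n", ord($as_cp1250);
```

Decoding with the character set the page really uses yields an ordinary punctuation character that can be written out safely, instead of a control character that feed readers reject.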

Here is the updated script. I also had to apply the patch described on this page that allows the RSS module to handle multi-byte characters.

#!/usr/bin/perl -w
use strict;
use Encode;
use LWP::Simple;
use HTML::TokeParser;
use XML::RSS;

# First - LWP::Simple. Download the page using get().
my $content = get( "http://www.caltrain.com/news.html" )
    or die "Could not download the news page";

# The page claims iso-8859-1 but contains cp1250 characters,
# so decode the bytes as cp1250 into a Perl character string.
$content = decode("cp1250", $content, Encode::FB_HTMLCREF);

# Second - Create a TokeParser object, using our downloaded HTML.
my $stream = HTML::TokeParser->new( \$content ) or die $!;

# Finally - create the RSS object.
my $rss = XML::RSS->new( version => '0.9' );

# Prep the RSS.
$rss->channel(
    title       => "Caltrain news",
    link        => "http://www.caltrain.com/news.html",
    description => "Latest caltrain news");

# Declare variables.
my ($tag, $headline, $url);

# First indication of a headline - an <a> tag is present.
while ( $tag = $stream->get_tag("a") ) {

    # Inside this loop, $tag is at an <a> tag.
    # But does it have a class="newstitle" attribute, too?
    if ($tag->[1]{class} and $tag->[1]{class} eq 'newstitle') {

        # It does!
        # Now, we're at the <a> with the headline in.
        # We need to put the contents of the 'href' attribute in $url.
        $url = $tag->[1]{href} || "--";

        # Now we can grab $headline, by using get_trimmed_text
        # up to the close of the <a> tag.
        $headline = $stream->get_trimmed_text('/a');

        # We need to escape ampersands, as they start entity references in XML.
        $url =~ s/&/&amp;/g;

        # The <a> tags contain relative URLs - we need to qualify these.
        $url = 'http://www.caltrain.com/'.$url;

        # And that's it. We can add our pair to the RSS channel.
        $rss->add_item( title => $headline, link => $url);
    }
}

$rss->save("caltrain.rss");
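As a quick illustration of what the two link fix-ups in the loop produce, here is a small sketch that applies them to a single value. The helper name `qualify_link` and the sample href are hypothetical and do not come from the Caltrain page or the script.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: escape
# ampersands for XML, then prefix the site's base URL.
sub qualify_link {
    my ($href, $base) = @_;
    $href =~ s/&/&amp;/g;    # a bare & would start an entity reference
    return $base . $href;
}

# Made-up relative link with a query string, for illustration only.
my $link = qualify_link('news.html?id=1&lang=en', 'http://www.caltrain.com/');
print "$link\n";
```

Simple concatenation only works when every href is relative to the site root, which is what the script assumes.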
