Tag Archive for 'feed'

Feeds updated !

Last night I changed the way feeds are handled on this blog. I’ve taken care of all redirections with a mix of WordPress plugins, Apache’s 301 redirects and FeedBurner’s “My Brand” service. So everything should be transparent from your (and your feed reader’s) point of view.

But since this blog has moved from one location to another over the past few years, I doubt everyone uses the right URLs. This update is a good opportunity to check that you are using the “official” feed URLs:

In the future, I plan to support these URLs only. So please update your feed aggregation settings ! :)

How to fix Ruby’s FeedTools latin-1 parsing

While playing with FeedTools, a Ruby library for parsing RSS (and other) feeds, I spotted a strange behavior that at first looked like a typical Unicode parsing issue. So I started by checking that the original feed was encoded in the right format and that its charset was clearly set to the right value. But I found nothing wrong… So I dug into the FeedTools source code, and what I found was particularly disappointing…

FeedTools does a really nice job of detecting the charset and handling the feed’s data. When it encounters HTML entities, it decodes them to plain text. That’s good, because in the end you get ready-to-use strings. Unfortunately, the method it uses, CGI::unescapeHTML, sticks too closely to the W3C specification, which states that some of the HTML entities (if not all) are the expression of latin-1 characters. Hence the presence of latin-1 characters in pure UTF-8 RSS feeds…
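To make the symptom concrete, here is a minimal sketch of what a latin-1 character inside UTF-8 text looks like at the byte level. It is written for a current Ruby (1.9 or later) and does not use FeedTools or CGI at all; the variable names are mine and purely illustrative.

```ruby
# "é" is code point 233 (0xE9), the one behind the &#233; entity.
code_point = 233

# Latin-1 representation: one single byte, 0xE9.
latin1_char = [code_point].pack("C")
# UTF-8 representation: two bytes, 0xC3 0xA9.
utf8_char = [code_point].pack("U")

puts latin1_char.bytes.inspect   # [233]
puts utf8_char.bytes.inspect     # [195, 169]

# Gluing the latin-1 byte into text that claims to be UTF-8 gives an invalid string.
mixed = "Multim".b + latin1_char + "dia".b
mixed.force_encoding("UTF-8")
puts mixed.valid_encoding?       # false

# The same text built with the UTF-8 bytes is fine.
clean = "Multim" + utf8_char + "dia"
puts clean.valid_encoding?       # true
```

The lone 0xE9 byte is what a feed reader ends up displaying as a broken character.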

To fix that, I’ve patched the FeedTools::HtmlHelper.unescape_entities() method to convert each HTML entity it encounters to pure Unicode. Here is the monkey patch I load by default from the environment.rb file of all my Ruby on Rails projects:

require 'feed_tools'

# Monkey patch FeedTools.
# Use case: mixed UTF-8 chars and HTML entities, e.g. <description>Téléchargements et Multim&#233;dia</description>
module FeedTools::HtmlHelper
  class << self

    # Force UTF-8 conversion of HTML entities with number lower than 256.
    # Based on CGI::unescapeHTML method.
    def convert_html_entities_to_unicode(string)
      string.gsub(/&(.*?);/n) do
        $KCODE = "UTF8"
        match = $1.dup
        case match
        when /\A#0*(\d+)\z/n       then
          if Integer($1) < 256
            [Integer($1)].pack("U")
          else
            "&##{$1};"
          end
        when /\A#x([0-9a-f]+)\z/ni then
          if $1.hex < 256
            [$1.hex].pack("U")
          else
            "&#x#{$1};"
          end
        else
          "&#{match};"
        end
      end
    end

    # Patch unescape_entities() method
    alias_method :unescape_entities_orig, :unescape_entities
    def unescape_entities(html)
      return unescape_entities_orig(convert_html_entities_to_unicode(html))
    end

  end
end

OK, so this fixes the issue.
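For readers who want to try the idea without FeedTools, here is a standalone sketch of the same conversion for a current Ruby, where $KCODE and the /n regexp flag are no longer needed. The method name low_entities_to_utf8 is hypothetical and is not part of FeedTools or CGI; treat it as an illustration of the technique rather than a drop-in replacement for the patch above.

```ruby
# Hypothetical standalone helper (not part of FeedTools or CGI).
# Converts numeric HTML entities below 256 to UTF-8 characters
# and leaves every other entity untouched.
def low_entities_to_utf8(string)
  string.gsub(/&#(?:x([0-9a-f]+)|0*(\d+));/i) do |entity|
    # $1 holds a hexadecimal reference, $2 a decimal one.
    code_point = $1 ? $1.hex : Integer($2, 10)
    code_point < 256 ? [code_point].pack("U") : entity
  end
end

puts low_entities_to_utf8("Téléchargements et Multim&#233;dia")
puts low_entities_to_utf8("&#xE9; &amp; &#8364;")
```

Named entities such as &amp; and numeric entities of 256 or more pass through unchanged, so they can still be handled by whatever decoder runs next, which is the role unescape_entities_orig plays in the patch.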

But I’m not comfortable with this problem not being solved cleanly. I still don’t have a clue which component should solve the problem definitively. But I have some ideas… Here are my proposals:

  1. Submit my monkey patch to the FeedTools project for integration, or
  2. Merge my monkey patch upstream into Ruby’s legacy CGI library, or
  3. Do not allow the use of HTML entities in feeds.

How to add proxy support to the Feedalizer Ruby library

Here is a little code snippet that monkey-patches Feedalizer to let it grab web content through an HTTP proxy:

# HTTP proxy settings
HTTP_PROXY_HOST = "123.456.78.90"
HTTP_PROXY_PORT = 8080

# Calculate proxy URL
HTTP_PROXY_URL = "http://#{HTTP_PROXY_HOST}:#{HTTP_PROXY_PORT}"

# Monkey patch feedalizer to support page grabbing through a proxy
require 'feedalizer'
class Feedalizer
  # Backup original grab_page method
  alias_method :grab_page_orig, :grab_page
  # Define new grab_page() method with proxy support
  def grab_page(url)
    open(url, :proxy => HTTP_PROXY_URL) { |io| Hpricot(io) }
  end
end

This fix, written for a Ruby on Rails-based project, lives in the environment.rb file, but I wonder if this is the right place and the right way of doing it… Anyway, it works for me ! :)

Update: A post from Matthew Higgins’ blog that answers my question above has just shown up in my feed aggregator. What is he telling us ? That I’m a naughty programmer :

Previous to 2.0, naughty developers pasted code at the bottom of environment.rb, and the config/initializer folder was a welcome convention to help organize this madness.

For the record, the code in this post is extracted from an “old” (prior to RoR 2.0) project, which explains my naughtiness… ;)
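Coming back to the patch itself: it keeps a grab_page_orig alias but never calls it. Here is a small sketch of how such an alias could serve as a fallback when no proxy is configured. To keep it runnable without Feedalizer installed, it uses a hypothetical stand-in class, PageGrabber, and a hypothetical proxy_url setting; neither exists in Feedalizer.

```ruby
# Hypothetical stand-in for a library class; this is NOT Feedalizer itself.
class PageGrabber
  def grab_page(url)
    "direct: #{url}"
  end
end

# Reopen the class, as a monkey patch does.
class PageGrabber
  class << self
    # Hypothetical setting; nil means "no proxy configured".
    attr_accessor :proxy_url
  end

  # Keep a handle on the original implementation before overriding it.
  alias_method :grab_page_orig, :grab_page

  def grab_page(url)
    proxy = self.class.proxy_url
    # Fall back to the untouched method when no proxy is set.
    return grab_page_orig(url) if proxy.nil?
    "via #{proxy}: #{url}"
  end
end

grabber = PageGrabber.new
puts grabber.grab_page("http://example.com/")
PageGrabber.proxy_url = "http://proxy.example:8080"
puts grabber.grab_page("http://example.com/")
```

With this shape, the same patched code can run unchanged in an environment that has no proxy.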

FeedBurner and e107 integration

In the context of my plan to move an e107-based website to WordPress, I need to take care of my RSS subscribers. To let people (and search engines) get my content via old URLs, I will use Apache redirections to do this transparently and permanently. My final goal is to have a WordPress website with all RSS feeds (blog posts and comments) managed by FeedBurner, to gather statistics about my audience.

e107 offers plenty of feed formats (RSS 1.0, RSS 2.0, Atom and RDF), and one feed can be accessed through multiple URLs. We will reduce this incredible mess by using RSS 2.0 feeds only and redirecting all the others to them.

First, check that the e107 RSS feed plugin is activated. Then create an account on FeedBurner and set up two feeds there, one for your website’s news and another one for comments. Based on default e107 parameters, your news feed URL looks like http://www.my-domain.com/e107_plugins/rss_menu/rss.php?1.2 and your comments feed like http://www.my-domain.com/e107_plugins/rss_menu/rss.php?5.2.

Then create (or edit) the http://www.my-domain.com/.htaccess file and add the following code:

RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} !FeedBurner [NC]
RewriteCond %{QUERY_STRING} ^(5|Comments)
RewriteRule e107_plugins/rss_menu/rss\.php http://feeds.feedburner.com/myfeed-comments? [R=301,L]

RewriteCond %{HTTP_USER_AGENT} !FeedBurner [NC]
RewriteCond %{QUERY_STRING} ^(1|News|.*)
RewriteRule e107_plugins/rss_menu/rss\.php http://feeds.feedburner.com/myfeed? [R=301,L]

This code is inspired by the one written by Mike Atlas, who had a similar issue and wanted to outsource his e107 forum RSS feeds to FeedBurner.

The first rewrite rule will redirect all URLs that start with http://www.my-domain.com/e107_plugins/rss_menu/rss.php?5 or http://www.my-domain.com/e107_plugins/rss_menu/rss.php?Comments to http://feeds.feedburner.com/myfeed-comments.

The second rewrite rule will redirect all other URLs that start with http://www.my-domain.com/e107_plugins/rss_menu/rss.php (including http://www.my-domain.com/e107_plugins/rss_menu/rss.php?1 and http://www.my-domain.com/e107_plugins/rss_menu/rss.php?News) to http://feeds.feedburner.com/myfeed.
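To double-check my reading of these two rules, here is a small Ruby sketch that mimics their decision logic. It is only a model of the .htaccess above, not something Apache or e107 runs; the feed_redirect helper and the two constants are hypothetical names introduced for the illustration.

```ruby
# Hypothetical model of the two rewrite rules; not part of Apache or e107.
COMMENTS_FEED = "http://feeds.feedburner.com/myfeed-comments"
NEWS_FEED     = "http://feeds.feedburner.com/myfeed"

# Returns the redirect target, or nil when the request is left to e107.
def feed_redirect(path, query_string, user_agent)
  # Both rules only apply to the e107 RSS script.
  return nil unless path =~ /e107_plugins\/rss_menu\/rss\.php/
  # FeedBurner itself must still reach the original feed
  # (case-insensitive match, like the [NC] flag).
  return nil if user_agent =~ /FeedBurner/i
  # First rule: query strings starting with "5" or "Comments".
  return COMMENTS_FEED if query_string =~ /\A(5|Comments)/
  # Second rule: everything else, since ^(1|News|.*) matches any query string.
  NEWS_FEED
end

puts feed_redirect("e107_plugins/rss_menu/rss.php", "5.2", "Mozilla/5.0")
puts feed_redirect("e107_plugins/rss_menu/rss.php", "1.2", "Mozilla/5.0")
puts feed_redirect("e107_plugins/rss_menu/rss.php", "5.2", "FeedBurner/1.0").inspect
```

Two details worth noting: the first pattern is only anchored at the start, so a query string such as 50.2 would also land on the comments feed, and the trailing ? in each RewriteRule target drops the original query string from the redirect.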

That’s all ! Thanks to this server-side redirection, nobody will notice that the feeds have moved and no subscriber will have to update their aggregator.

In my case, the only remaining task is to move my e107 website to WordPress and then install the FeedSmith plugin. But that’s another story… ;)