While playing with FeedTools , a ruby library to parse RSS (or other) feeds, I’ve spotted a strange behavior, that at first looks like typical unicode parsing issue. So I’ve started to check that the original feed was encoded in the right format, and that its charset was clearly set to the right value. But I found nothing wrong… So I dug in the FeedTools source code , and what I found is particularly disappointing…
FeedTools do a really nice job to detect the charset and handle feed’s data. So when it encounter HTML entities , it decode them to plain text. That’s good as at the end you get ready-to-use strings. Unfortunately, the method it use, CGI::unescapeHTML , stick too much to the W3C specification , which state that some of the HTML entities (if not all) are the expression of latin-1 characters. Hence the presence of latin-1 characters in pure UTF-8 RSS feeds…
To fix that, I’ve recoded the
FeedTools::HtmlHelper.unescape_entities()
method to convert each HTML entity it encounter to pure unicode. Here is the monkey patch I call by default from the
environment.rb
file of all my
Ruby on Rails
projects:
require 'feed_tools'
# Monkey patch feed tool.
# Use case mixed UTF-8 chars and html entities: <description>Téléchargements et Multimédia</description>
module FeedTools::HtmlHelper
class << self
# Force UTF-8 conversion of HTML entities with number lower than 256.
# Based on CGI::unescapeHTML method.
def convert_html_entities_to_unicode(string)
string.gsub(/&(.*?);/n) do
$KCODE = "UTF8"
match = $1.dup
case match
when /\A#0*(\d+)\z/n then
if Integer($1) < 256
[Integer($1)].pack("U")
else
"&##{$1};"
end
when /\A#x([0-9a-f]+)\z/ni then
if $1.hex < 256
[$1.hex].pack("U")
else
"&#x#{$1};"
end
else
"&#{match};"
end
end
end
# Patch unescape_entities() method
alias_method :unescape_entities_orig, :unescape_entities
def unescape_entities(html)
return unescape_entities_orig(convert_html_entities_to_unicode(html))
end
end
end
Ok, so this fix the issue.
But I’m not comfortable about this problem not solved cleanly. I still don’t have a clue about which component should solve the problem definitively. But I have some ideas… Here are my propositions:
-
Submit my monkey patch to FeedTools project for integration, or
-
Merge my monkey patch upstream in legacy ruby CGI library, or
-
Do not allow usage of HTML entities in feeds.