<?xml version="1.0" encoding="UTF-8"?> <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" ><channel><title>Kevin Deldycke &#187; FeedTools</title> <atom:link href="http://kevin.deldycke.com/tag/feedtools/feed/" rel="self" type="application/rss+xml" /><link>http://kevin.deldycke.com</link> <description>Free software engineer &#38; wannabe videomaker</description> <lastBuildDate>Fri, 03 Feb 2012 19:08:27 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3.1</generator> <item><title>How to fix Ruby&#8217;s FeedTools Latin-1 parsing</title><link>http://kevin.deldycke.com/2008/07/how-to-fix-rubys-feedtools-latin-1-parsing/</link> <comments>http://kevin.deldycke.com/2008/07/how-to-fix-rubys-feedtools-latin-1-parsing/#comments</comments> <pubDate>Thu, 31 Jul 2008 18:48:22 +0000</pubDate> <dc:creator>Kev</dc:creator> <category><![CDATA[English]]></category> <category><![CDATA[feed]]></category> <category><![CDATA[FeedTools]]></category> <category><![CDATA[monkey patch]]></category> <category><![CDATA[parsing]]></category> <category><![CDATA[patch]]></category> <category><![CDATA[RSS]]></category> <category><![CDATA[ruby]]></category> <category><![CDATA[Ruby on Rails]]></category> <category><![CDATA[Snippet]]></category> <category><![CDATA[Web]]></category><guid isPermaLink="false">http://kevin.deldycke.com/?p=236</guid> <description><![CDATA[While playing with FeedTools, a Ruby library for parsing RSS (and other) feeds, I spotted a strange behavior that at first looked like a typical Unicode parsing issue.
So I started by checking that the original feed was encoded in the &#8230; <a href="http://kevin.deldycke.com/2008/07/how-to-fix-rubys-feedtools-latin-1-parsing/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p><img src="http://kevin.deldycke.com/wp-content/uploads/2008/07/feedtools-logo-150x150.png" alt="" title="feedtools-logo" width="150" height="150" class="alignleft size-thumbnail wp-image-237" /></p><p>While playing with <a href="http://sporkmonger.com/projects/feedtools/">FeedTools</a>, a Ruby library for parsing RSS (and other) feeds, I spotted a strange behavior that at first looked like a typical Unicode parsing issue. So I started by checking that the original feed was encoded in the right format and that its charset was clearly set to the right value. But I found nothing wrong&#8230; So I dug into the <a href="http://feedtools.rubyforge.org/svn/trunk/">FeedTools source code</a>, and what I found was particularly disappointing&#8230;</p><p>FeedTools does a really nice job of detecting the charset and handling the feed&#8217;s data. When it encounters <a href="http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references">HTML entities</a>, it decodes them to plain text. That&#8217;s good, as in the end you get ready-to-use strings. Unfortunately, the method it uses, <a href="http://www.noobkit.com/show/ruby/ruby/standard-library/cgi/unescapehtml.html">CGI::unescapeHTML</a>, sticks too closely to the <a href="http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">W3C specification</a>, which states that some of the HTML entities (if not all) are the expression of Latin-1 characters.
Hence the presence of Latin-1 characters in pure UTF-8 RSS feeds&#8230;</p><p>To fix that, I rewrote the <a href="http://rubyfurnace.com/docs/feedtools-0.2.26/classes/FeedTools/HtmlHelper.html#M007308">FeedTools::HtmlHelper.unescape_entities()</a> method so that it first converts every numeric HTML entity below 256 to UTF-8 before handing the string over to the original method. Here is the monkey patch I load by default from the <code>environment.rb</code> file of all my <a href="http://www.rubyonrails.org">Ruby on Rails</a> projects:</p><pre class="brush: ruby; title: ; notranslate">
require 'feed_tools'

# Monkey patch FeedTools.
# Use case: a feed mixing UTF-8 characters and HTML entities, e.g.
#   &lt;description&gt;Téléchargements et Multim&amp;#233;dia&lt;/description&gt;
module FeedTools::HtmlHelper
  class &lt;&lt; self

    # Force UTF-8 conversion of HTML entities with number lower than 256.
    # Based on CGI::unescapeHTML method.
    def convert_html_entities_to_unicode(string)
      string.gsub(/&amp;(.*?);/n) do
        $KCODE = &quot;UTF8&quot;
        match = $1.dup
        case match
        when /\A#0*(\d+)\z/n       then
          if Integer($1) &lt; 256
            [Integer($1)].pack(&quot;U&quot;)
          else
            &quot;&amp;##{$1};&quot;
          end
        when /\A#x([0-9a-f]+)\z/ni then
          if $1.hex &lt; 256
            [$1.hex].pack(&quot;U&quot;)
          else
            &quot;&amp;#x#{$1};&quot;
          end
        else
          &quot;&amp;#{match};&quot;
        end
      end
    end

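    # Patch unescape_entities(): convert numeric entities below 256 to UTF-8
    # first, then delegate to the original method for everything else.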
    alias_method :unescape_entities_orig, :unescape_entities
    def unescape_entities(html)
      return unescape_entities_orig(convert_html_entities_to_unicode(html))
    end

  end
end
</pre><p>OK, so this fixes the issue.</p><p>But I&#8217;m not comfortable with this problem not being solved cleanly. I still don&#8217;t have a clue about which component should solve the problem definitively. But I have some ideas&#8230; Here are my proposals:</p><ol><li>Submit my monkey patch to the FeedTools project for integration, or</li><li>Merge my monkey patch upstream into Ruby&#8217;s legacy CGI library, or</li><li>Disallow the use of HTML entities in feeds.</li></ol> ]]></content:encoded> <wfw:commentRss>http://kevin.deldycke.com/2008/07/how-to-fix-rubys-feedtools-latin-1-parsing/feed/</wfw:commentRss> <slash:comments>4</slash:comments> </item> </channel> </rss>
