<?xml version="1.0" encoding="UTF-8"?> <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" ><channel><title>Kevin Deldycke &#187; parsing</title> <atom:link href="http://kevin.deldycke.com/tag/parsing/feed/" rel="self" type="application/rss+xml" /><link>http://kevin.deldycke.com</link> <description>Free software engineer &#38; wannabe videomaker</description> <lastBuildDate>Fri, 03 Feb 2012 19:08:27 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3.1</generator> <item><title>How-to fix ruby&#8217;s FeedTools latin-1 parsing</title><link>http://kevin.deldycke.com/2008/07/how-to-fix-rubys-feedtools-latin-1-parsing/</link> <comments>http://kevin.deldycke.com/2008/07/how-to-fix-rubys-feedtools-latin-1-parsing/#comments</comments> <pubDate>Thu, 31 Jul 2008 18:48:22 +0000</pubDate> <dc:creator>Kev</dc:creator> <category><![CDATA[English]]></category> <category><![CDATA[feed]]></category> <category><![CDATA[FeedTools]]></category> <category><![CDATA[monkey patch]]></category> <category><![CDATA[parsing]]></category> <category><![CDATA[patch]]></category> <category><![CDATA[RSS]]></category> <category><![CDATA[ruby]]></category> <category><![CDATA[Ruby on Rails]]></category> <category><![CDATA[Snippet]]></category> <category><![CDATA[Web]]></category><guid isPermaLink="false">http://kevin.deldycke.com/?p=236</guid> <description><![CDATA[While playing with FeedTools, a ruby library to parse RSS (or other) feeds, I&#8217;ve spotted a strange behavior, that at first looks like typical unicode parsing issue. 
So I&#8217;ve started to check that the original feed was encoded in the &#8230; <a href="http://kevin.deldycke.com/2008/07/how-to-fix-rubys-feedtools-latin-1-parsing/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p><img src="http://kevin.deldycke.com/wp-content/uploads/2008/07/feedtools-logo-150x150.png" alt="" title="feedtools-logo" width="150" height="150" class="alignleft size-thumbnail wp-image-237" /></p><p>While playing with <a href="http://sporkmonger.com/projects/feedtools/">FeedTools</a>, a Ruby library to parse RSS (or other) feeds, I spotted a strange behavior that at first looked like a typical Unicode parsing issue. So I started by checking that the original feed was encoded in the right format, and that its charset was clearly set to the right value. But I found nothing wrong&#8230; So I dug into the <a href="http://feedtools.rubyforge.org/svn/trunk/">FeedTools source code</a>, and what I found is particularly disappointing&#8230;</p><p>FeedTools does a really nice job of detecting the charset and handling the feed&#8217;s data. So when it encounters <a href="http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references">HTML entities</a>, it decodes them to plain text. That&#8217;s good, as in the end you get ready-to-use strings. Unfortunately, the method it uses, <a href="http://www.noobkit.com/show/ruby/ruby/standard-library/cgi/unescapehtml.html">CGI::unescapeHTML</a>, sticks too closely to the <a href="http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">W3C specification</a>, which states that some of the HTML entities (if not all) are the expression of latin-1 characters. 
Hence the presence of latin-1 characters in pure UTF-8 RSS feeds&#8230;</p><p>To fix that, I&#8217;ve wrapped the <a href="http://rubyfurnace.com/docs/feedtools-0.2.26/classes/FeedTools/HtmlHelper.html#M007308">FeedTools::HtmlHelper.unescape_entities()</a> method so that numeric HTML entities with a code below 256 are converted to UTF-8 before the original method runs. Here is the monkey patch I load by default from the <code>environment.rb</code> file of all my <a href="http://www.rubyonrails.org">Ruby on Rails</a> projects:</p><pre class="brush: ruby; title: ; notranslate">
require 'feed_tools'

# Monkey patch FeedTools.
# Use case: UTF-8 characters mixed with HTML entities, e.g. &lt;description&gt;Téléchargements et Multim&amp;#233;dia&lt;/description&gt;
module FeedTools::HtmlHelper
  class &lt;&lt; self

    # Force UTF-8 conversion of HTML entities with number lower than 256.
    # Based on CGI::unescapeHTML method.
    def convert_html_entities_to_unicode(string)
      string.gsub(/&amp;(.*?);/n) do
        $KCODE = &quot;UTF8&quot;
        match = $1.dup
        case match
        when /\A#0*(\d+)\z/n       then
          if Integer($1) &lt; 256
            [Integer($1)].pack(&quot;U&quot;)
          else
            &quot;&amp;##{$1};&quot;
          end
        when /\A#x([0-9a-f]+)\z/ni then
          if $1.hex &lt; 256
            [$1.hex].pack(&quot;U&quot;)
          else
            &quot;&amp;#x#{$1};&quot;
          end
        else
          &quot;&amp;#{match};&quot;
        end
      end
    end

    # Patch unescape_entities() method
    alias_method :unescape_entities_orig, :unescape_entities
    def unescape_entities(html)
      return unescape_entities_orig(convert_html_entities_to_unicode(html))
    end

  end
end
</pre><p>OK, so this fixes the issue.</p><p>But I&#8217;m not comfortable leaving this problem without a clean solution. I still don&#8217;t have a clue about which component should solve it definitively. But I have some ideas&#8230; Here are my proposals:</p><ol><li>Submit my monkey patch to the FeedTools project for integration, or</li><li>Merge my monkey patch upstream into the legacy Ruby CGI library, or</li><li>Do not allow usage of HTML entities in feeds.</li></ol> ]]></content:encoded> <wfw:commentRss>http://kevin.deldycke.com/2008/07/how-to-fix-rubys-feedtools-latin-1-parsing/feed/</wfw:commentRss> <slash:comments>4</slash:comments> </item> <item><title>How-to add proxy support to Feedalizer ruby library</title><link>http://kevin.deldycke.com/2008/07/how-to-add-proxy-support-to-feedalizer-ruby-library/</link> <comments>http://kevin.deldycke.com/2008/07/how-to-add-proxy-support-to-feedalizer-ruby-library/#comments</comments> <pubDate>Wed, 16 Jul 2008 20:40:53 +0000</pubDate> <dc:creator>Kev</dc:creator> <category><![CDATA[English]]></category> <category><![CDATA[feed]]></category> <category><![CDATA[feedalizer]]></category> <category><![CDATA[hpricot]]></category> <category><![CDATA[HTTP]]></category> <category><![CDATA[monkey patch]]></category> <category><![CDATA[parsing]]></category> <category><![CDATA[proxy]]></category> <category><![CDATA[RSS]]></category> <category><![CDATA[ruby]]></category> <category><![CDATA[Ruby on Rails]]></category> <category><![CDATA[Snippet]]></category> <category><![CDATA[Web]]></category><guid isPermaLink="false">http://kevin.deldycke.com/?p=233</guid> <description><![CDATA[Here is a little code snippet that monkey-patches Feedalizer to let it grab web content through an HTTP proxy: This fix, written for a Ruby on Rails-based project, lives in the environment.rb file, but I wonder if this is the &#8230; <a href="http://kevin.deldycke.com/2008/07/how-to-add-proxy-support-to-feedalizer-ruby-library/">Continue reading <span 
class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p><img src="http://kevin.deldycke.com/wp-content/uploads/2008/07/feedalizer-150x32.png" alt="" title="feedalizer" width="150" height="32" class="alignleft size-thumbnail wp-image-234" /></p><p>Here is a little code snippet that <a href="http://en.wikipedia.org/wiki/Monkey_patch">monkey-patches</a> <a href="http://termos.vemod.net/feedalizer">Feedalizer</a> to let it grab web content through an HTTP proxy:</p><pre class="brush: ruby; title: ; notranslate">
# HTTP proxy settings (placeholder values: replace with your proxy's host and port)
HTTP_PROXY_HOST = &quot;123.456.78.90&quot;
HTTP_PROXY_PORT = 8080

# Calculate proxy URL
HTTP_PROXY_URL = &quot;http://#{HTTP_PROXY_HOST}:#{HTTP_PROXY_PORT}&quot;

# Monkey patch feedalizer to support page grabbing through a proxy
require 'feedalizer'
class Feedalizer
  # Backup original grab_page method
  alias_method :grab_page_orig, :grab_page
  # Define new grab_page() method with proxy support
  def grab_page(url)
    open(url, :proxy =&gt; HTTP_PROXY_URL) { |io| Hpricot(io) }
  end
end
</pre><p>This fix, written for a <a href="http://www.rubyonrails.org">Ruby on Rails</a>-based project, lives in the <code>environment.rb</code> file, but I wonder if this is the right place and the right way of doing it&#8230; Anyway, it works for me! <img src='http://kevin.deldycke.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /></p><p><ins datetime="2008-07-16T20:49:58+00:00"><strong>Update</strong>: A <a href="http://www.strictlyuntyped.com/2008/06/rails-where-to-put-other-files.html">post from Matthew Higgins&#8217; blog that answers my question</a> above has just shown up in my feed aggregator. What is he telling us? That I&#8217;m a naughty programmer:</p><blockquote><p>Previous to 2.0, naughty developers pasted code at the bottom of <code>environment.rb</code>, and the <code>config/initializer</code> folder was a welcome convention to help organize this madness.</p></blockquote><p>For your information, the code in this post is extracted from an &#8220;old&#8221; (prior to RoR 2.0) project, thus explaining my naughtiness&#8230; <img src='http://kevin.deldycke.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /><br /> </ins></p> ]]></content:encoded> <wfw:commentRss>http://kevin.deldycke.com/2008/07/how-to-add-proxy-support-to-feedalizer-ruby-library/feed/</wfw:commentRss> <slash:comments>6</slash:comments> </item> <item><title>Ultimate Regular Expression for HTML tag parsing with PHP</title><link>http://kevin.deldycke.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/</link> <comments>http://kevin.deldycke.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/#comments</comments> <pubDate>Fri, 23 Mar 2007 22:27:09 +0000</pubDate> <dc:creator>Kev</dc:creator> <category><![CDATA[English]]></category> <category><![CDATA[HTML]]></category> <category><![CDATA[parsing]]></category> <category><![CDATA[PCRE]]></category> <category><![CDATA[PHP]]></category> 
<category><![CDATA[regexp]]></category><guid isPermaLink="false">http://kevin.deldycke.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/</guid> <description><![CDATA[Disclaimer: this is a dirty hack! To parse HTML or XML, use a dedicated library. Tonight I found the ultimate regex to get HTML tags out of a string. It was written a year ago by Phil Haack on &#8230; <a href="http://kevin.deldycke.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p><em><strong>Disclaimer</strong>: this is a dirty hack! To parse HTML or XML, <a href="#comment-4740">use a dedicated library</a>.</em></p><p>Tonight I found the ultimate <a href="http://en.wikipedia.org/wiki/Regular_expression">regex</a> to get HTML tags out of a string. It was <a href="http://haacked.com/archive/2005/04/22/Matching_HTML_With_Regex.aspx">written a year ago by Phil Haack on his blog</a>. His regex is quite bullet-proof: it&#8217;s able to parse HTML tags written on multiple lines which contain any sort of attributes (with or without a value, with single or double quotes).</p><p>Unfortunately his regular expression was designed for Microsoft .NET, so I spent some time converting it to PHP. Here is the result:</p><pre class="brush: php; title: ; notranslate">
$regex = &quot;/&lt;\/?\w+((\s+\w+(\s*=\s*(?:\&quot;.*?\&quot;|'.*?'|[^'\&quot;&gt;\s]+))?)+\s*|\s*)\/?&gt;/i&quot;;
</pre><p>And finally, my version based on the one above:</p><pre class="brush: php; title: ; notranslate">
$regex = &quot;/&lt;\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\&quot;.*?\&quot;|'.*?'|[^'\&quot;&gt;\s]+))?)+\s*|\s*)\/?&gt;/i&quot;;
</pre><p>The latter includes the following enhancement:</p><ul><li>accepts hyphens in the middle of attribute names (<a href="http://kevin.deldycke.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/#comment-3167">thanks Ged</a>)</li></ul> ]]></content:encoded> <wfw:commentRss>http://kevin.deldycke.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/feed/</wfw:commentRss> <slash:comments>31</slash:comments> </item> </channel> </rss>
