Monthly Archive for July, 2008

How-to fix ruby’s FeedTools latin-1 parsing

While playing with FeedTools, a ruby library to parse RSS (or other) feeds, I’ve spotted a strange behavior, that at first looks like typical unicode parsing issue. So I’ve started to check that the original feed was encoded in the right format, and that its charset was clearly set to the right value. But I found nothing wrong… So I dug in the FeedTools source code, and what I found is particularly disappointing…

FeedTools do a really nice job to detect the charset and handle feed’s data. So when it encounter HTML entities, it decode them to plain text. That’s good as at the end you get ready-to-use strings. Unfortunately, the method it use, CGI::unescapeHTML, stick too much to the W3C specification, which state that some of the HTML entities (if not all) are the expression of latin-1 characters. Hence the presence of latin-1 characters in pure UTF-8 RSS feeds…

To fix that, I’ve recoded the FeedTools::HtmlHelper.unescape_entities() method to convert each HTML entity it encounter to pure unicode. Here is the monkey patch I call by default from the environment.rb file of all my Ruby on Rails projects:

require 'feed_tools'

# Monkey patch feed tool.
# Use case mixed UTF-8 chars and html entities: <description>Téléchargements et Multim&#233;dia</description>
module FeedTools::HtmlHelper
  class << self

    # Force UTF-8 conversion of HTML entities with number lower than 256.
    # Based on CGI::unescapeHTML method.
    def convert_html_entities_to_unicode(string)
      string.gsub(/&(.*?);/n) do
        $KCODE = "UTF8"
        match = $1.dup
        case match
        when /\A#0*(\d+)\z/n       then
          if Integer($1) < 256
            [Integer($1)].pack("U")
          else
            "&##{$1};"
          end
        when /\A#x([0-9a-f]+)\z/ni then
          if $1.hex < 256
            [$1.hex].pack("U")
          else
            "&#x#{$1};"
          end
        else
          "&#{match};"
        end
      end
    end

    # Patch unescape_entities() method
    alias_method :unescape_entities_orig, :unescape_entities
    def unescape_entities(html)
      return unescape_entities_orig(convert_html_entities_to_unicode(html))
    end

  end
end

Ok, so this fix the issue.

But I’m not comfortable about this problem not solved cleanly. I still don’t have a clue about which component should solve the problem definitively. But I have some ideas… Here are my propositions:

  1. Submit my monkey patch to FeedTools project for integration, or
  2. Merge my monkey patch upstream in legacy ruby CGI library, or
  3. Do not allow usage of HTML entities in feeds.

Heroic journey to RAID-5 data recovery

Last week there was a power grid failure which break down my server’s RAID array. I have no UPS (as I’m a skinflint) and no automatic email alerts (because I’m too lazy to set it up). As a result, for 5 days, my 3-disk RAID-5 array was relying on only 2 disks until I noticed the issue…

By using a combination of following commands, I was soon aware of the gravity of the situation:

cat /proc/mdstat
mdadm --examine /dev/sda1

My /dev/sda1 disk was kicked out of the array, so I did the right stuff which consisted of reconstructing the array:

mdadm /dev/md0 -a /dev/sda1

Then, in an unlucky combination of cosmic ray bombardment, spooky action at a distance and astrological misalignment, half-way to the end of the rebuilding process (which can take up to 5 hours), another disk failed ! It was late, I was tired and utterly worried about losing 1.5 To of precious data. In such a bad shape, I was afraid to worsen the situation. So I decided to shutdown the server and sleep on the problem.

The next day I tried to boot my server to find it (surprise !) stuck in the middle of the boot process, with the famous message:

hit control-D to continue or give root password to fix manually

This is “normal” as my server tried to mount the ext3 filesystem from the /dev/md0 partition that was just assembled by mdadm. Of course md0, if assembled and available to the system, was not running because only one disk, out of three, was in a clean state.

I skip here the epic substory in which I wasted days in a search of a working keyboard, but I let you imagine how such adventures makes my week…

Eventually, I was able to analyze the situation in details. My first reflex ? Check that disks are not physically dead:

fdisk -l /dev/sda
fdisk -l /dev/sdb
fdisk -l /dev/sdc

“Linux raid partitions” (type code “fd“) are still there. Good. I assumed here that disks where not physically damaged. Maybe I should have looked at S.M.A.R.T. datas and statistics (via smartmontools). But remember, I’m lazy (and a bit crazy).

The next step was to get informations about the RAID array itself using:

mdadm --detail /dev/md0

which output the status table below (probably inaccurate as I reconstructed it afterwards):

Number   Major   Minor   RaidDevice State
   0       0        0        0      removed
   1       0        0        1      faulty removed
   2       8       33        2      active sync   /dev/sdc1
   3       8       17        3      spare

What this table told us ?

  • The array is up, but not running. One of its device (sdc1) was clean and active, but it’s not enough to get a working RAID-5.
  • My first attempt to rebuild the array lead to an unexpected result: it added sda1 as a spare device (in slot #3).
  • It confirm that sdb1 unexpectedly failed and is now in a bad state (“faulty removed“).

Then I stopped the array and tried to fearlessly (re)assemble it using 3 differents methods:

mdadm -S /dev/md0
mdadm -A /dev/md0
mdadm --assemble /dev/md0 --verbose /dev/sd[abc]1
mdadm --assemble --force --scan /dev/md0 --verbose

It always failed with messages like:

mdadm: failed to RUN_ARRAY /dev/md0: Input/output error
mdadm: /dev/md0 assembled from 1 drives and 1 spare - not enough to start the array.

So I examined each drive from mdadm‘s point of view:

mdadm -E /dev/sda1
mdadm -E /dev/sdb1
mdadm -E /dev/sdc1
mdadm -E /dev/sd[abc]1 | grep Event

The lastest command compare the “Event” attribute of all devices. It output something like:

Events : 0.53120
Events : 0.53108
Events : 0.53120

which indicate that sda1 and sdc1 are somewhat synced (share the same number) and sdb1 “late” (lower number).

Here I’ve got the idea of recreating the raid array without sdb1, relying only on sda1 and sdc1, by using the “magic” (hence dangerous) --assume-clean option. The latter doesn’t build, erase or initialize a new array. It just try to assemble it “as is”. Here is the command:

mdadm --create /dev/md0 --assume-clean --level=5 --verbose --raid-devices=3 /dev/sda1 missing /dev/sdc1

And it worked ! :D

I mounted the md0 partition and cleaned it up:

fsck.ext3 -v /dev/md0
mount /dev/md0

I updated my mdadm configuration before rebooting my server:

mdadm --detail --scan >> /etc/mdadm/mdadm.conf
vi /etc/mdadm/mdadm.conf
reboot

But history repeat itself, and again, the system hang up during boot. Except this time I knew what was happening: the boot process detected the remaining sdb1 device as part of the old array (the one before the regeneration I did above) and tried to run it. Remembering my last year post, I zero-ized the superblock of sdb1:

mdadm -S /dev/md0
mdadm --zero-superblock /dev/sdb1

A server reboot proved I was right and my md0 partition was automagically mounted in altered state:

localhost:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[3] sda1[0] sdc1[2]
      1465143808 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]

unused devices: <none>

I just had to re-add sdb1 to fill the available slot and update the mdadm configuration to get back my array in its initial state:

mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
vi /etc/mdadm/mdadm.conf

How-to add proxy support to Feedalizer ruby library

Here is a little code snippet that monkey-patch Feedalizer to let it grab web content through a HTTP proxy:

# HTTP proxy settings
HTTP_PROXY_HOST = "123.456.78.90"
HTTP_PROXY_PORT = 8080

# Calculate proxy URL
HTTP_PROXY_URL = "http://#{HTTP_PROXY_HOST}:#{HTTP_PROXY_PORT}"

# Monkey patch feedalizer to support page grabbing through a proxy
require 'feedalizer'
class Feedalizer
  # Backup original grab_page method
  alias_method :grab_page_orig, :grab_page
  # Define new grab_page() method with proxy support
  def grab_page(url)
    open(url, :proxy => HTTP_PROXY_URL) { |io| Hpricot(io) }
  end
end

This fix, written for a Ruby on Rails-based project, lay in the environment.rb file, but I wonder if this is the right place and the right way of doing it… Anyway, it works for me ! :)

Update: A post from Matthew Higgins’ blog that answer my question above has just shown up in my feed aggregator. What’s he telling us ? That I’m a naughty programmer :

Previous to 2.0, naughty developers pasted code at the bottom of environment.rb, and the config/initializer folder was a welcome convention to help organize this madness.

For your instance, the code in this post is extracted from an “old” (prior to RoR 2.0) project, thus explaining my naughtyness… ;)

Proxad / Free.fr killed my RPM repository !

My ISP, Proxad/Free, hardened its policy and do not allow any longer the use of the 10 GiB hosting (that they gracefully provide for free to their customer) for things other than “pure” website. I think my RPM repository, which I moved there last year, fall in the “static storage” category, leading them to erase it some days ago.

Fortunately I have backups and I recently get a RPS from OVH. I’ve spent the majority of the week-end restoring and relocating the repository. Good news everyone, everything seems OK now. If you still use my repositories, please check that it responds.

Python ultimate regular expression to catch HTML tags

1 year and 3 months ago I’ve came with a PHP regexp to parse HTML tag soup. Here is an improved version, in Python (my favorite language so far), that is normally much prone to detect strange HTML tags. It also support attributes without value so it’s closer to the HTML specification, but doesn’t strictly stick to it in order to catch tag soup and malformatted tags.

ultimate_regexp = "(?i)<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"

And here is it applied in a trivial example (in a python shell):

>>> import re
>>>
>>> content = """This is the <strong>content</strong> in which we want to
<em>find</em> <a href="http://en.wikipedia.org/wiki/Html">HTML</a> tags."""
>>>
>>> ultimate_regexp = "(?i)<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"
>>>
>>> for match in re.finditer(ultimate_regexp, content):
...   print repr(match.group())
...
'<strong>'
'</strong>'
'<em>'
'</em>'
'<a href="http://en.wikipedia.org/wiki/Html">'
'</a>'
>>>