How-to generate PDF from Markdown

Pandoc

The first tool you can use to convert a Markdown file to PDF is Pandoc.

To install Pandoc and all its dependencies on my Ubuntu 11.04, I used the following command:

$ aptitude install pandoc nbibtex texlive-latex-base texlive-latex-recommended texlive-latex-extra preview-latex-style dvipng texlive-fonts-recommended

Then I applied the PDF transformation on the README.md file from my openerp.buildout GitHub project:

$ wget https://raw.github.com/kdeldycke/openerp.buildout/master/README.md
$ markdown2pdf README.md -o readme-pandoc.pdf
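To convert several Markdown files in one go, the same command can be driven from a small Ruby script. This is only a sketch: the helper name markdown2pdf_command and the "-pandoc" output suffix are assumptions made for illustration, and it relies on nothing more than the markdown2pdf invocation shown above.

```ruby
# Build the argument list for the markdown2pdf call used above.
# The helper name and the default suffix are illustrative assumptions.
def markdown2pdf_command(markdown_path, suffix = "-pandoc")
  base = File.basename(markdown_path, File.extname(markdown_path))
  ["markdown2pdf", markdown_path, "-o", "#{base.downcase}#{suffix}.pdf"]
end

# Convert every file given on the command line, one after the other.
ARGV.each do |path|
  command = markdown2pdf_command(path)
  # Passing an array to system() avoids any shell quoting issues.
  system(*command) or warn "Conversion failed: #{command.join(' ')}"
end
```

Saved under a name of your choice and called with README.md as its argument, the script builds the same readme-pandoc.pdf file name as the manual command. As far as I can tell, newer Pandoc releases no longer ship markdown2pdf; there the equivalent call is pandoc README.md -o readme-pandoc.pdf.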

The result is good, but not perfect. For example, code blocks with long lines are not wrapped at the right edge of the page.
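Before converting, it can help to know which code lines are likely to overflow. The following is a rough diagnostic sketch, not a fix: the helper name overlong_code_lines and the 80-column limit are assumptions, and code is detected with a simple heuristic (lines inside ``` or ~~~ fences, or lines indented by four spaces or a tab).

```ruby
# Return the line numbers of code lines longer than max_width.
# Heuristic only: indented list continuations are also counted as code.
def overlong_code_lines(markdown, max_width = 80)
  in_fence = false
  hits = []
  markdown.each_line.with_index(1) do |line, number|
    text = line.chomp
    if text =~ /\A(```|~~~)/
      # A fence marker toggles the "inside a code block" state.
      in_fence = !in_fence
      next
    end
    is_code = in_fence || text =~ /\A( {4}|\t)/
    hits << number if is_code && text.length > max_width
  end
  hits
end

if ARGV.first
  puts overlong_code_lines(File.read(ARGV.first)).inspect
end
```

Long lines reported this way can then be shortened by hand in the Markdown source before generating the PDF.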

While trying to solve this issue, I stumbled upon another tool…

Gimli

Gimli is a utility that was explicitly written with GitHub in mind.

Gimli is written in Ruby, so let’s install it the Ruby way:

$ aptitude install rubygems wkhtmltopdf
$ gem install gimli

Then we can convert our Markdown file to a PDF. The following will generate a README.pdf file in the current folder:

$ /var/lib/gems/1.8/bin/gimli -f ./README.md
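The full path in the command above depends on where RubyGems installs executables on a given system. Rather than hard-coding it, a short Ruby sketch can ask RubyGems for that directory. The helper name gem_executable is an assumption for illustration; Gem.bindir is the RubyGems call that reports the executable directory.

```ruby
require 'rubygems'

# Return the path where RubyGems would place the executable of a gem.
# The helper name is hypothetical; only Gem.bindir comes from RubyGems.
def gem_executable(name)
  File.join(Gem.bindir, name)
end

puts gem_executable("gimli")
```

The same directory is listed as EXECUTABLE DIRECTORY in the output of gem environment. Adding it to your PATH lets you call gimli without the full path.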

The resulting PDF is really close to how GitHub renders Markdown content on its website. And it solves the bad code block styling of Pandoc.

Feed Tracking Tool released under an Open-Source license

I’ve just open-sourced the Feed Tracking Tool project (aka “FTT”), my first (and only) Ruby on Rails experience.

This tool was developed within Uperto, the company I currently work for, for its internal needs. The project had an ancestor written in 2006 that was based on Pylons. It was a prototype and was barely working. Iterating over the abandoned Python code base was considered a waste of time. So in summer 2007, it was decided to rewrite this application from scratch.

As my co-worker was available and had already played with Ruby on Rails, he was tasked with creating the initial code base. I joined the project early on, as it was a great opportunity to play with the (then really trendy) Ruby on Rails framework.

In the end, FTT was essentially a test project to explore Ruby on Rails. It was never deployed on a production server and was never used.

After letting it rot for more than 3 years, and since it represents absolutely no business value in itself, I decided to release it under a GPLv2 license (with Uperto’s approval, of course). My intention with this open-source release is to share knowledge and code back with the community.

FTT was living in a private Subversion repository at Uperto, but we unfortunately lost it. During the last few weeks I tried to rebuild the code history from old and partial backups. I then used my Git-based reconstruction method to consolidate everything in a Git repository. The code is now available on GitHub.

I don’t plan to maintain this project. But I may reboot it in the future if I need feed-related features, or if I need an excuse to play with Ruby on Rails again. But for now beware: the code is quite outdated and only runs on the old Rails 1.2.x. This project should be considered an ugly legacy code base. So please be indulgent while looking at FTT’s code: it was the work of inexperienced RoR developers! ;)

How-to fix Ruby’s FeedTools Latin-1 parsing

While playing with FeedTools, a Ruby library to parse RSS (or other) feeds, I spotted a strange behavior that at first looked like a typical Unicode parsing issue. So I started by checking that the original feed was encoded in the right format, and that its charset was clearly set to the right value. But I found nothing wrong… So I dug into the FeedTools source code, and what I found was particularly disappointing…

FeedTools does a really nice job of detecting the charset and handling the feed’s data. When it encounters HTML entities, it decodes them to plain text. That’s good, as in the end you get ready-to-use strings. Unfortunately, the method it uses, CGI::unescapeHTML, sticks too closely to the W3C specification, which states that some of the HTML entities (if not all) are the expression of Latin-1 characters. Hence the presence of Latin-1 characters in pure UTF-8 RSS feeds…
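To make the symptom concrete, here is a small sketch of what goes wrong at the byte level. It does not reproduce the CGI internals; it only compares, for the code point behind &#233;, the single Latin-1 byte with the two-byte UTF-8 sequence that a UTF-8 feed expects. It assumes Ruby 1.9 or later for force_encoding.

```ruby
code_point = 233  # "é", written as &#233; in the feed

# "C" packs the number as one raw byte, which is the Latin-1 form.
latin1_bytes = [code_point].pack("C").unpack("C*")
# "U" packs the number as a UTF-8 encoded character.
utf8_bytes = [code_point].pack("U").unpack("C*")

puts "Latin-1 bytes: #{latin1_bytes.inspect}"  # [233]
puts "UTF-8 bytes:   #{utf8_bytes.inspect}"    # [195, 169]

# The lone Latin-1 byte is not a valid UTF-8 sequence on its own.
raw = [code_point].pack("C").force_encoding("UTF-8")
puts "Valid UTF-8?   #{raw.valid_encoding?}"   # false
```

A string that mixes genuine UTF-8 text with such lone bytes is what shows up as broken characters once the feed is displayed.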

To fix that, I patched the FeedTools::HtmlHelper.unescape_entities() method so that it first converts each numeric HTML entity below 256 to its UTF-8 form. Here is the monkey patch I call by default from the environment.rb file of all my Ruby on Rails projects:

require 'feed_tools'

# Monkey patch FeedTools.
# Use case: UTF-8 characters mixed with HTML entities, e.g.
# <description>Téléchargements et Multim&#233;dia</description>
module FeedTools::HtmlHelper
  class << self

    # Force UTF-8 conversion of HTML entities with number lower than 256.
    # Based on CGI::unescapeHTML method.
    def convert_html_entities_to_unicode(string)
      string.gsub(/&(.*?);/n) do
        $KCODE = "UTF8"
        match = $1.dup
        case match
        when /\A#0*(\d+)\z/n       then
          if Integer($1) < 256
            [Integer($1)].pack("U")
          else
            "&##{$1};"
          end
        when /\A#x([0-9a-f]+)\z/ni then
          if $1.hex < 256
            [$1.hex].pack("U")
          else
            "&#x#{$1};"
          end
        else
          "&#{match};"
        end
      end
    end

    # Patch unescape_entities() method
    alias_method :unescape_entities_orig, :unescape_entities
    def unescape_entities(html)
      return unescape_entities_orig(convert_html_entities_to_unicode(html))
    end

  end
end

OK, so this fixes the issue.

But I’m not comfortable leaving this problem without a clean solution. I still don’t have a clue about which component should solve it definitively. But I have some ideas… Here are my proposals:

  1. Submit my monkey patch to the FeedTools project for integration, or
  2. Merge my monkey patch upstream in the legacy Ruby CGI library, or
  3. Do not allow usage of HTML entities in feeds.
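Wherever the fix ends up, the conversion itself can be tried without FeedTools installed. The following is a minimal standalone sketch, assuming Ruby 1.9 or later, where strings carry their own encoding and neither $KCODE nor the /n regexp flag is needed. The function name numeric_entities_to_utf8 is hypothetical and is not part of FeedTools or CGI.

```ruby
# encoding: utf-8

# Replace numeric HTML entities below 256 with their UTF-8 character.
# Named entities and larger code points are left untouched, as in the
# monkey patch. Hypothetical helper, not part of FeedTools.
def numeric_entities_to_utf8(string)
  string.gsub(/&#(?:x([0-9a-f]+)|0*(\d+));/i) do |entity|
    # $1 holds a hexadecimal value, $2 a decimal one.
    code_point = $1 ? $1.hex : $2.to_i
    code_point < 256 ? [code_point].pack("U") : entity
  end
end

puts numeric_entities_to_utf8("Téléchargements et Multim&#233;dia")
```

Unlike the patch, this version does not hand the result to another unescaping step, so entities such as &amp; stay as they are.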

How-to add proxy support to the Feedalizer Ruby library

Here is a little code snippet that monkey-patches Feedalizer to let it grab web content through an HTTP proxy:

# HTTP proxy settings (the host below is a placeholder, not a valid address)
HTTP_PROXY_HOST = "123.456.78.90"
HTTP_PROXY_PORT = 8080

# Calculate proxy URL
HTTP_PROXY_URL = "http://#{HTTP_PROXY_HOST}:#{HTTP_PROXY_PORT}"

# Monkey patch feedalizer to support page grabbing through a proxy
require 'feedalizer'
class Feedalizer
  # Back up the original grab_page method
  alias_method :grab_page_orig, :grab_page
  # Define new grab_page() method with proxy support
  def grab_page(url)
    open(url, :proxy => HTTP_PROXY_URL) { |io| Hpricot(io) }
  end
end

This fix, written for a Ruby on Rails-based project, lives in the environment.rb file, but I wonder if this is the right place and the right way of doing it… Anyway, it works for me! :)

Update: A post from Matthew Higgins’ blog that answers my question above has just shown up in my feed aggregator. What is he telling us? That I’m a naughty programmer:
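The patch keeps a copy of the original method under grab_page_orig but never calls it. A possible use for that backup is to fall back to the unpatched behavior when no proxy is configured. The sketch below shows the pattern on a stand-in class so that it runs without the gem: PageGrabber, its proxy_url setting and the returned strings are all assumptions for illustration, not Feedalizer’s actual API.

```ruby
# Hypothetical stand-in for Feedalizer, so the sketch runs on its own.
class PageGrabber
  def grab_page(url)
    "direct fetch of #{url}"
  end
end

# Reopen the class, as the monkey patch above does.
class PageGrabber
  class << self
    # nil means "no proxy configured" (assumed convention).
    attr_accessor :proxy_url
  end

  # Keep the original implementation reachable under another name.
  alias_method :grab_page_orig, :grab_page

  def grab_page(url)
    proxy = self.class.proxy_url
    # Without a proxy, defer to the original method.
    return grab_page_orig(url) if proxy.nil?
    "fetch of #{url} through #{proxy}"
  end
end

grabber = PageGrabber.new
puts grabber.grab_page("http://example.com/")
PageGrabber.proxy_url = "http://proxy.example.com:8080"
puts grabber.grab_page("http://example.com/")
```

In the real patch, the last line of grab_page would be the open(url, :proxy => ...) call, and the same code base could then run with or without a proxy.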

Previous to 2.0, naughty developers pasted code at the bottom of environment.rb, and the config/initializer folder was a welcome convention to help organize this madness.

For the record, the code in this post is extracted from an “old” (prior to RoR 2.0) project, thus explaining my naughtiness… ;)