How-to: e107 autogallery to Zenphoto migration

Over the past few days I have been moving the Cool Cavemen’s photo gallery to a shiny new one powered by Zenphoto. In this post I will roughly describe how I did it, code and commands included.

The old gallery was based on autogallery, an e107 plugin. We assume here that both e107 and Zenphoto are properly installed and configured at the root of your web hosting space (/www in this case).

The first step is to copy the autogallery album structure, with all its content, to Zenphoto:

cd /www
cp -ax ./e107_plugins/autogallery/Gallery/* ./zenphoto/albums/

Then we delete all previews, thumbnails and XML metadata, so that only the original assets remain in Zenphoto:

find ./zenphoto/albums/ -iname "*.xml" -print0 | xargs -0 rm -f
find ./zenphoto/albums/ -iname "pv_*" -print0 | xargs -0 rm -f
find ./zenphoto/albums/ -iname "th_*" -print0 | xargs -0 rm -f
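
Before deleting anything, it can be reassuring to list what the three patterns above would match. Here is a minimal Python 3 sketch of such a dry run. The function name find_derived_files is made up for this example, and I assume the only generated files are the ones matching these three patterns:

```python
import fnmatch
import os

# Patterns of the files autogallery generates next to the originals
# (same patterns as the find commands; nothing else is assumed).
DERIVED_PATTERNS = ("*.xml", "pv_*", "th_*")

def find_derived_files(album_root):
    """Return the sorted paths of generated files found under album_root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(album_root):
        for name in filenames:
            lowered = name.lower()  # mimic the case-insensitive -iname test
            if any(fnmatch.fnmatchcase(lowered, p) for p in DERIVED_PATTERNS):
                found.append(os.path.join(dirpath, name))
    return sorted(found)

if __name__ == "__main__":
    # Path taken from the commands above; adjust it to your installation.
    for path in find_derived_files("/www/zenphoto/albums"):
        print(path)
```

Nothing is removed here: the script only prints candidates, so you can review the list and then run the find commands.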

By now, you should be able to play with your media using Zenphoto’s admin interface.

But if you’re as unlucky as I was, you will hit a strange bug which breaks drag’n’drop album sorting. The fix I found was to remove, from photo filenames, the numerical prefix (and the dot that follows it) set by autogallery to define the sort order. This operation should be performed before the copy from autogallery to Zenphoto (i.e. before the first command in this post). By the way, if you know a one-liner to do this, please, please… share! :)
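
Not a one-liner, but here is a rough Python 3 sketch of that renaming step. It assumes the prefix looks like "12." in front of the real name (as in "12.photo.jpg"), which you should check against your own gallery. The names strip_sort_prefix and rename_photos are invented for this example:

```python
import os
import re

# Assumed autogallery naming: digits, a dot, then the real file name.
# The lookahead requires another dot further on, so that a photo simply
# called "2008.jpg" is left alone.
PREFIX = re.compile(r"^\d+\.(?=.+\..+$)")

def strip_sort_prefix(name):
    """Return name without its leading "<digits>." sort prefix, if any."""
    return PREFIX.sub("", name, count=1)

def rename_photos(album_root, dry_run=True):
    """Strip the sort prefix from every file under album_root.

    Returns the list of (source, target) pairs; files are only renamed
    when dry_run is False.
    """
    renames = []
    planned = set()
    for dirpath, _dirnames, filenames in os.walk(album_root):
        for name in sorted(filenames):
            new_name = strip_sort_prefix(name)
            if new_name == name:
                continue
            source = os.path.join(dirpath, name)
            target = os.path.join(dirpath, new_name)
            # Never overwrite an existing file or an already planned target.
            if os.path.exists(target) or target in planned:
                continue
            planned.add(target)
            renames.append((source, target))
            if not dry_run:
                os.rename(source, target)
    return renames
```

Calling rename_photos(path) only reports what would change; pass dry_run=False once the list looks right. The XML files sitting next to the photos get the same treatment, so each photo keeps its metadata file.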

I have no automatic solution for migrating comments. I chose to do it manually, editing the database by hand. In my case it was the quickest way, as I only had a dozen comments to migrate.

And last but not least, if you care about measuring the popularity of your photos, you should consider migrating the view counter associated with each of your media files. Don’t worry, this time I wrote a script to take care of it automagically. It generates a bunch of SQL statements that you’ll have to execute on your Zenphoto MySQL database. Here is my “e107 autogallery to Zenphoto hit counter migration script” (nice name, isn’t it? ;) ) that does the job:

#!/usr/bin/python

##############################################################################
#
# Copyright (C) 2008 Kevin Deldycke <kevin@deldycke.com>
#
# This program is Free Software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
#
##############################################################################

"""
  Last update: 2008 aug 21
"""

########### User config ###########

AUTOGALLERY_ALBUM_PATH = "/www/e107_plugins/autogallery/Gallery"
ZENPHOTO_ALBUM_PATH    = "/www/zenphoto/albums"
ZENPHOTO_TABLE_PREFIX  = "zenphoto_"

######## End of user config #######

import os, hashlib
import xml.etree.ElementTree as ET

# Calculate hash of a given file
def getHash(path):
  # Calculate the hash from file raw data
  if not os.path.isfile(path):
    return None
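  try:
    # Read in binary mode so that image data is hashed byte for byte
    file_object = open(path, 'rb')
    try:
      data = file_object.read()
    finally:
      file_object.close()
  except (IOError, OSError):
    return None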
  if not len(data):
    return None
  return hashlib.sha224(data).hexdigest()

# Associate each autogallery photo having a hitcounter greater than 0 with its SHA-224 hash
def populateHashTable(arg, dirname, names):
  global hash_table
  for name in names:
    file_path = os.path.join(dirname, name)
    # print "Get hit count for %s" % file_path
    # Check that the file has a positive hit counter associated with it
    xml_file_path = "%s.xml" % file_path
    if not os.path.isfile(xml_file_path):
      continue
    try:
      tree = ET.parse(xml_file_path)
    except:
      continue
    node = tree.find("viewhits")
    if node is None:
      continue
    try:
      hits = int(node.text)
    except:
      continue
    if not hits > 0:
      continue
    # Update hash table with data we care about
    file_hash = getHash(file_path)
    if file_hash is None:
      continue
    hash_table[file_hash] = hits + hash_table.get(file_hash, 0)

# Generate hitcount SQL request for each matching file
def generateSQL(arg, dirname, names):
  global sql
  for name in names:
    file_path = os.path.join(dirname, name)
    # print "Search hitcounter matching file %s" % file_path
    file_hash = getHash(file_path)
    if file_hash is None:
      continue
    if file_hash in hash_table:
      sql += "UPDATE `%simages` SET `hitcounter`=`hitcounter`+%d WHERE `filename`=%r;\n" % (ZENPHOTO_TABLE_PREFIX, hash_table[file_hash], name)

# Core of the script
hash_table = {}
sql        = ""
# Normalize path
source_path = os.path.abspath(AUTOGALLERY_ALBUM_PATH)
dest_path   = os.path.abspath(ZENPHOTO_ALBUM_PATH)

os.path.walk(source_path, populateHashTable, None)
# print repr(hash_table)
os.path.walk(dest_path, generateSQL, None)
print sql

I think the code and comments are self-explanatory. Do not forget to update the constants at the top of the script to match your installation paths and your database’s table prefix.
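
The script above targets Python 2: os.path.walk and the print statement are gone in Python 3. If you only have Python 3 at hand, the hash-matching idea can be sketched roughly as follows. This is not a drop-in replacement: the function names are mine, and it stops at the list of (file name, hits) pairs instead of printing the SQL statements.

```python
import hashlib
import os
import xml.etree.ElementTree as ET

def file_hash(path):
    """Return the SHA-224 hex digest of a file, or None if unreadable or empty."""
    try:
        with open(path, "rb") as handle:
            data = handle.read()
    except OSError:
        return None
    return hashlib.sha224(data).hexdigest() if data else None

def read_hits(xml_path):
    """Return the positive <viewhits> value stored in xml_path, or 0."""
    try:
        node = ET.parse(xml_path).find("viewhits")
        hits = int(node.text) if node is not None else 0
    except (OSError, ET.ParseError, TypeError, ValueError):
        return 0
    return max(hits, 0)

def collect_hits(source_root):
    """Map the hash of each viewed autogallery photo to its total hit count."""
    table = {}
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            hits = read_hits(path + ".xml")
            digest = file_hash(path) if hits else None
            if digest:
                table[digest] = table.get(digest, 0) + hits
    return table

def match_hits(dest_root, table):
    """Return (file name, hits) pairs for Zenphoto files whose hash is known."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(dest_root):
        for name in sorted(filenames):
            digest = file_hash(os.path.join(dirpath, name))
            if digest in table:
                matches.append((name, table[digest]))
    return matches
```

Matching on content hashes rather than on names is what lets the migration survive renamed files, such as photos whose numerical prefix was stripped.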

And finally, for your information, I tested all of this on the following versions:

  • e107 0.7.11
  • autogallery 2.61
  • Zenphoto 1.2
  • Python 2.5.2
  • Linux server

How-to add Google Analytics tracking to Zenphoto

This is the patch I apply to each Zenphoto I install or upgrade. This little hack adds Google Analytics tracking for all users except administrators.

Why? As you can see in ticket #441 of the Zenphoto bug tracker, there is no intention of adding GA support to Zenphoto, even as an optional plugin. Hence my tiny hack. As for excluding administrators, I like having unbiased statistics: on low-audience websites, administrators can generate more traffic than legitimate users (if not all of it…).

Here is the downloadable patch file, and its content:

diff -ru ./zenphoto-orig/zp-core/template-functions.php ./zenphoto/zp-core/template-functions.php
--- ./zenphoto-orig/zp-core/template-functions.php  2008-08-15 07:43:05.000000000 +0200
+++ ./zenphoto/zp-core/template-functions.php 2008-08-16 17:08:03.000000000 +0200
@@ -147,7 +147,16 @@

    echo "<li><a href=\"".$zf."/admin.php?logout$redirect\">".gettext("Logout")."</a></li>\n";
    echo "</ul></div>\n";
- }
+ } else {
+    echo "<script type=\"text/javascript\">
+var gaJsHost = ((\"https:\" == document.location.protocol) ? \"https://ssl.\" : \"http://www.\");
+document.write(unescape(\"%3Cscript src='\" + gaJsHost + \"google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E\"));
+</script>
+<script type=\"text/javascript\">
+var pageTracker = _gat._getTracker(\"UA-XXXXXX-Y\");
+pageTracker._trackPageview();
+</script>";
+  }
 }

 /**

This patch was generated against Zenphoto v1.2 and will likely not work with any other version.

Do not forget to replace the dummy Google Analytics account ID above (UA-XXXXXX-Y) with yours.

And finally, to apply the patch, invoke the classic patch command:

patch -p0 < ./google-analytics-tracking-for-non-admin-users.patch

How-to fix ruby’s FeedTools latin-1 parsing

While playing with FeedTools, a Ruby library for parsing RSS (and other) feeds, I spotted a strange behavior that at first looked like a typical Unicode parsing issue. So I started by checking that the original feed was encoded in the right format and that its charset was clearly set to the right value. But I found nothing wrong… So I dug into the FeedTools source code, and what I found was particularly disappointing…

FeedTools does a really nice job of detecting the charset and handling the feed’s data. When it encounters HTML entities, it decodes them to plain text. That’s good, as in the end you get ready-to-use strings. Unfortunately, the method it uses, CGI::unescapeHTML, sticks too closely to the W3C specification, which states that some of the HTML entities (if not all) represent latin-1 characters. Hence the presence of latin-1 characters in pure UTF-8 RSS feeds…

To fix that, I’ve wrapped the FeedTools::HtmlHelper.unescape_entities() method so that the numeric HTML entities it encounters (those below 256) are first converted to pure Unicode. Here is the monkey patch I call by default from the environment.rb file of all my Ruby on Rails projects:

require 'feed_tools'

# Monkey patch feed tool.
# Use case mixed UTF-8 chars and html entities: <description>Téléchargements et Multim&#233;dia</description>
module FeedTools::HtmlHelper
  class << self

    # Force UTF-8 conversion of HTML entities with number lower than 256.
    # Based on CGI::unescapeHTML method.
    def convert_html_entities_to_unicode(string)
      string.gsub(/&(.*?);/n) do
        $KCODE = "UTF8"
        match = $1.dup
        case match
        when /\A#0*(\d+)\z/n       then
          if Integer($1) < 256
            [Integer($1)].pack("U")
          else
            "&##{$1};"
          end
        when /\A#x([0-9a-f]+)\z/ni then
          if $1.hex < 256
            [$1.hex].pack("U")
          else
            "&#x#{$1};"
          end
        else
          "&#{match};"
        end
      end
    end

    # Patch unescape_entities() method
    alias_method :unescape_entities_orig, :unescape_entities
    def unescape_entities(html)
      return unescape_entities_orig(convert_html_entities_to_unicode(html))
    end

  end
end

OK, so this fixes the issue.
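
To make the conversion rule easier to follow outside of Ruby, here is the same idea as a small Python 3 sketch. It is an illustration only, not part of FeedTools: the function name is made up, and unlike the Ruby version it leaves entities at or above 256 strictly untouched (leading zeros included).

```python
import re

# Matches decimal (&#233;) and hexadecimal (&#xE9;) numeric entities.
NUMERIC_ENTITY = re.compile(r"&#(?:0*(\d+)|x([0-9a-f]+));", re.IGNORECASE)

def convert_low_entities(text):
    """Replace numeric entities below 256 with the matching Unicode character."""
    def replace(match):
        decimal, hexa = match.groups()
        code = int(decimal) if decimal is not None else int(hexa, 16)
        # Higher code points and named entities are left for the regular
        # unescaping step, as in the monkey patch.
        return chr(code) if code < 256 else match.group(0)
    return NUMERIC_ENTITY.sub(replace, text)

if __name__ == "__main__":
    # Use case from the Ruby comment: UTF-8 text mixed with an entity.
    print(convert_low_entities("Téléchargements et Multim&#233;dia"))
    # The same character is one byte in latin-1 but two bytes in UTF-8,
    # which is why mixing both encodings corrupts the string.
    print("é".encode("latin-1"), "é".encode("utf-8"))
```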

But I’m not comfortable leaving this problem without a clean solution. I still don’t know which component should fix it for good, but I have some ideas. Here are my proposals:

  1. Submit my monkey patch to the FeedTools project for integration, or
  2. Merge my monkey patch upstream into Ruby’s legacy CGI library, or
  3. Do not allow usage of HTML entities in feeds.

How-to add proxy support to Feedalizer ruby library

Here is a little code snippet that monkey-patches Feedalizer to let it grab web content through an HTTP proxy:

# HTTP proxy settings
HTTP_PROXY_HOST = "123.456.78.90"
HTTP_PROXY_PORT = 8080

# Calculate proxy URL
HTTP_PROXY_URL = "http://#{HTTP_PROXY_HOST}:#{HTTP_PROXY_PORT}"

# Monkey patch feedalizer to support page grabbing through a proxy
require 'feedalizer'
class Feedalizer
  # Backup original grab_page method
  alias_method :grab_page_orig, :grab_page
  # Define new grab_page() method with proxy support
  def grab_page(url)
    open(url, :proxy => HTTP_PROXY_URL) { |io| Hpricot(io) }
  end
end

This fix, written for a Ruby on Rails-based project, lives in the environment.rb file, but I wonder if this is the right place and the right way of doing it… Anyway, it works for me! :)

Update: a post from Matthew Higgins’ blog that answers my question above has just shown up in my feed aggregator. What is he telling us? That I’m a naughty programmer:

Previous to 2.0, naughty developers pasted code at the bottom of environment.rb, and the config/initializer folder was a welcome convention to help organize this madness.

For the record, the code in this post was extracted from an “old” (prior to RoR 2.0) project, which explains my naughtiness… ;)

Python ultimate regular expression to catch HTML tags

One year and three months ago I came up with a PHP regexp to parse HTML tag soup. Here is an improved version, in Python (my favorite language so far), that should be much better at detecting strange HTML tags. It also supports attributes without a value, so it’s closer to the HTML specification, but it doesn’t strictly stick to it, in order to catch tag soup and malformed tags.

ultimate_regexp = r"(?i)<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"

And here it is, applied to a trivial example (in a Python shell):

>>> import re
>>>
>>> content = """This is the <strong>content</strong> in which we want to
<em>find</em> <a href="http://en.wikipedia.org/wiki/Html">HTML</a> tags."""
>>>
>>> ultimate_regexp = r"(?i)<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"
>>>
>>> for match in re.finditer(ultimate_regexp, content):
...   print repr(match.group())
...
'<strong>'
'</strong>'
'<em>'
'</em>'
'<a href="http://en.wikipedia.org/wiki/Html">'
'</a>'
>>>
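
The session above is a Python 2 one (print is a statement there). For what it’s worth, here is a small sketch of the same regexp wrapped in a helper for Python 3, exercised on the less tidy cases mentioned earlier: an attribute without a value, a self-closing tag and an unquoted attribute value. The helper name find_tags is made up for this example:

```python
import re

# Same pattern as above, written as a raw string so the backslashes reach
# the regular expression engine unchanged under Python 3.
ULTIMATE_REGEXP = re.compile(
    r"(?i)<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"
)

def find_tags(content):
    """Return every HTML tag found in content, in order of appearance."""
    return [match.group() for match in ULTIMATE_REGEXP.finditer(content)]

if __name__ == "__main__":
    sample = 'A <input disabled> field, a <br/> and an <img src=logo.png alt="Logo" />.'
    for tag in find_tags(sample):
        print(repr(tag))
```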