Fixing messed-up encoding in MySQL

Currently working on my e107 Importer plugin, I was confronted today with badly-encoded data coming from my databases.

e107 migrated to full UTF-8 years ago, but I must have messed the upgrade process at the time. That was my conclusion when I took a close look to my tables: all of them seems to be set to Latin-1 but contain UTF-8 data. Here are screenshots from SQLBuddy (a great light-weight MySQL manager) showing just that:

To fix this, I first tried to use the following command I found on the web:

mysql --database=e107db -B -N -e "SHOW TABLES"  | awk '{print "ALTER TABLE", $1, "CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;"}' | mysql --database=e107db

But this doesn’t work, as it not only change the encoding of the table, but also transcode the data inside the table.

Let’s try something else. First, we’ll export the database to a dump file, of which the encoding is forced to Latin-1:

mysqldump -a -c -e --no-create-db --add-drop-table --default-character-set=latin1 --databases 'e107db' > ./e107-data.sql

Now the trick is to change the CHARSET parameter of all CREATE TABLE directives to UTF-8:

sed -i 's/CHARSET=latin1/CHARSET=utf8/g' ./e107-data.sql

We’ll also change the NAMES directive to force MySQL to handle imported data as UTF-8:

sed -i 's/SET NAMES latin1/SET NAMES utf8/g' ./e107-data.sql

Then we’re free to import the result in a new UTF-8 database:

sed -i 's/USE `e107db`;/#USE `e107db`;/g' ./e107-data.sql
mysql --execute="CREATE DATABASE e107db_new CHARACTER SET=utf8"
mysql --database=e107db_new < ./e107-data.sql

And now, accentuated characters appears as they should in our database, meaning we’ve fixed all the mess ! :)


PS: I found another alternative method (look at the end of the linked page) which consists of temporarily handling TEXT fields as BLOB, to have MySQL treat them as binary content (thus skipping character transcoding). Haven’t tested this but sounds tricky.

Python commands

  • Add a Python’s debugger break point:
    import pdb; pdb.set_trace()
    
  • Replace accentuated characters by their ASCII equivalent in a unicode string:
    import unicodedata
    unicodedata.normalize('NFKD', u"éèàçÇÉȲ³¼ÀÁÂÃÄÅËÍÑÒÖÜÝåïš™").encode('ascii', 'ignore')
    
  • Lambda function to transform a string to a URL-friendly ID:
    getSafeURL = lambda s: '-'.join([w for w in ''.join(1).split('-') if w])
    
  • Sort a list of dicts by dict-key (source):
    import operator
    [dict(a=1, b=2, c=3), dict(a=2, b=2, c=2), dict(a=3, b=2, c=1)].sort(key=operator.itemgetter('c'))
    
  • Set urllib2 timeout (source):
    import socket
    socket.setdefaulttimeout(10)
    
  • Start a dumb HTTP server on port 8000 (source):
    python -m SimpleHTTPServer 8000
    
  • Generate a binary distribution of the current package:
    python ./setup.py sdist
    
  • Register, generate and upload to PyPi the current package as a source package, an egg and a dumb binary:
    python ./setup.py register sdist bdist_egg bdist_dumb upload
    
  • Here is how my ~/.pypirc looks like:
    [pypirc]
    servers = pypi
    [server-login]
    username:kdeldycke
    password:XXXXXXX
    

A fix to xkcd’s emoticons + parentheses problem

xkcd-emoticons-with-parentheses-problem-exposed xkcd comic #541 (titled “TED Talk”) expose a problem I’ve long been aware of: how do you end parenthetical statements with emoticons ?

Randall reveal two possible solutions:

  1. … Linux (or BSD :) would…
  2. … Linux (or BSD :) ) would…

Of course, none of these are acceptable ! But today, I think I’ve found the definitive solution…

As you may know, unicode comes with loads of weird characters. The weirdest are called dingbats. Beside this family, you’ll find the weirdest of the weirdest, which contain 2 interesting symbols:

So using the latest, our statement become:

… Linux (or BSD ☺) would…

Problem solved ! :D