Currently working on my e107 Importer plugin, I was confronted today with badly-encoded data coming from my databases.

e107 migrated to full UTF-8 years ago, but I must have messed the upgrade process at the time. That was my conclusion when I took a close look to my tables: all of them seems to be set to Latin-1 but contain UTF-8 data. Here are screenshots from SQLBuddy (a great light-weight MySQL manager) showing just that:

To fix this, I first tried to use the following command I found on the web:

$ mysql --database=e107db -B -N -e "SHOW TABLES"  | awk '{print "ALTER TABLE", $1, "CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;"}' | mysql --database=e107db

But this doesn’t work, as it not only change the encoding of the table, but also transcode the data inside the table.

Let’s try something else. First, we’ll export the database to a dump file, of which the encoding is forced to Latin-1:

$ mysqldump -a -c -e --no-create-db --add-drop-table --default-character-set=latin1 --databases 'e107db' > ./e107-data.sql

Now the trick is to change the CHARSET parameter of all CREATE TABLE directives to UTF-8:

$ sed -i 's/CHARSET=latin1/CHARSET=utf8/g' ./e107-data.sql

We’ll also change the NAMES directive to force MySQL to handle imported data as UTF-8:

$ sed -i 's/SET NAMES latin1/SET NAMES utf8/g' ./e107-data.sql

Then we’re free to import the result in a new UTF-8 database:

$ sed -i 's/USE `e107db`;/#USE `e107db`;/g' ./e107-data.sql
$ mysql --execute="CREATE DATABASE e107db_new CHARACTER SET=utf8"
$ mysql --database=e107db_new < ./e107-data.sql

And now, accentuated characters appears as they should in our database, meaning we’ve fixed all the mess ! :)

PS: I found another alternative method (look at the end of the linked page) which consists of temporarily handling TEXT fields as BLOB, to have MySQL treat them as binary content (thus skipping character transcoding). Haven’t tested this but sounds tricky.

Related content