Convert WordPress MySQL Tables

Verify the MySQL Client and Server Are Using UTF-8

* On old server
mysql> show variables like 'char%';
+---------------------------+----------------------------+
| Variable_name             | Value                      |
+---------------------------+----------------------------+
| character_set_client      | latin1                     |
| character_set_connection  | latin1                     |
| character_set_database    | latin1                     |
| character_set_filesystem  | binary                     |
| character_set_results     | latin1                     |
| character_set_server      | latin1                     |
| character_set_system      | utf8                       |
| character_sets_dir        | /usr/share/mysql/charsets/ |
+---------------------------+----------------------------+
8 rows in set (0.01 sec)
* On new server
mysql> show variables like 'char%';
+---------------------------+----------------------------+
| Variable_name             | Value                      |
+---------------------------+----------------------------+
| character_set_client      | utf8                       |
| character_set_connection  | utf8                       |
| character_set_database    | utf8                       |
| character_set_filesystem  | binary                     |
| character_set_results     | utf8                       |
| character_set_server      | utf8                       |
| character_set_system      | utf8                       |
| character_sets_dir        | /usr/share/mysql/charsets/ |
+---------------------------+----------------------------+
8 rows in set (0.00 sec)
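
If you can't change the server configuration right away, you can at least force an individual session to speak UTF-8. SET NAMES adjusts character_set_client, character_set_connection, and character_set_results for the current connection only:

mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)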

* Sample my.cnf file to get the above results (note that the two SET statements must share a single init_connect line, since a repeated option in my.cnf would override the earlier one)
[mysqld]
init_connect='SET collation_connection = utf8_general_ci; SET NAMES utf8'
default-character-set=utf8
character-set-server=utf8
collation-server=utf8_general_ci

[client]
default-character-set=utf8
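
These settings are only read at daemon startup, so restart mysqld and then re-run the show variables query above to confirm the new values took effect:

# service mysql restart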

Make sure Apache is using UTF-8 as its default character set

* Edit httpd.conf:
AddDefaultCharset UTF-8

Make sure PHP uses UTF-8

* Edit php.ini:
default_charset = "utf-8"
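
Since we're converting a WordPress database, it's also worth confirming that WordPress itself asks for UTF-8 when it connects. Assuming a standard install, wp-config.php should contain:

define('DB_CHARSET', 'utf8');
define('DB_COLLATE', '');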

Convert your Latin-1 collated tables to UTF-8

Make a backup

Since this step is the first place we're actually going to change data within the database, the very first thing to do here is make a backup. Note that when a database is in MySQL character set hell, a mysqldump of it is likely to have syntax errors in it, so I recommend making a backup of the raw database files instead. In order for this backup to contain clean data, you'll need to shut down your mysql daemon while making it. If your server is a RedHat- or CentOS-like operating system using vendor-supplied RPMs for the MySQL packages (and default paths in your my.cnf), the following set of commands ought to do this for you:

# service mysql stop
# cd /var/lib
# tar zcvf mysql-backup.tar.gz mysql
# service mysql start
You’ll want to make sure your database server has enough space under /var/lib on it to store the above backup, of course.
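
A quick way to check (assuming the default /var/lib/mysql data directory) is to compare the size of the data directory against the free space on that filesystem:

# du -sh /var/lib/mysql
# df -h /var/lib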

Convert those tables

Please note that if your data set does not contain a mix of single- and double-encoded UTF-8 characters in a Latin-1 character set (i.e. you have either one or the other), then there are safer conversion techniques than what I'm about to document below. (Specifically, see this page: http://dev.mysql.com/doc/refman/5.1/en/charset-conversion.html ). This step assumes that your data set contains tables with a mix of single- and double- (and triple- and quadruple-) encoded UTF-8 characters in a single Latin-1 collated table.

The procedure for converting your tables to UTF-8 collation will vary depending on which sources you read, but the way that worked for me was to do a carefully-crafted mysqldump, alter the dump slightly, then re-import the dump back over the database itself. To minimize the chances of clobbering data in tables that are already properly UTF-8 collated, this only needs to be done on tables using the Latin-1 collation to begin with. To discover which tables these are, you can run:

mysql> show table status where collation like '%latin1%' \G
*************************** 1. row ***************************
           Name: UserInfo
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 2
 Avg_row_length: 8192
    Data_length: 16384
Max_data_length: 0
   Index_length: 65536
      Data_free: 0
 Auto_increment: 3
    Create_time: 2008-06-01 22:48:11
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment: InnoDB free: 0 kB
1 row in set (0.00 sec)

mysql>

(Yes, "latin1_swedish_ci" is MySQL's default collation for the Latin-1 character set.) You'll need to run the above for all databases on the MySQL server, and compile a list of affected tables. For each of these tables, run:

1. # mysqldump -u dbuser -pdbpass --opt --skip-set-charset --default-character-set=latin1 --skip-extended-insert databasename --tables tablename > database.table.sql
2. # perl -i.bak -pe 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/' database.table.sql
3. # cat database.table.sql | mysql -u dbuser -pdbpass databasename
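
Once the import completes, re-run the earlier status query to confirm the table now reports a UTF-8 collation; you should see something like:

mysql> show table status where name = 'tablename' \G
...
      Collation: utf8_general_ci
...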

Clean up double-encoded characters in your current data set

Before going any further in these instructions, I highly recommend taking a close look at your application. At this point, everything everywhere should be speaking UTF-8, and all your tables should have the UTF-8 collation. In an ideal world, especially if you started with a relatively clean data set, this may be all you need to do to make everything peachy in your application. So take a moment to look at your application (and specifically content generated from rows with non-ASCII characters in them) and verify you really need to go further in this how-to.

Ok, so from here on out, we’re going to assume that you do still have problems displaying characters as they should be in your application, and that this is due to having double-encoded characters in your UTF-8 collated tables. (Note that if your terminal and mysql command-line client are speaking UTF-8, you can verify the problems lie with the data itself with a few SELECT statements targeted at rows containing the data.)
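
For example, on a UTF-8 terminal a double-encoded row shows two-character sequences like "Ã©" where you'd expect a single "é". Something like the following makes the problem easy to spot (mytable, myfield, and the id value are placeholder names, defined in the next section):

mysql> SELECT myfield FROM mytable WHERE id = 42;
+----------+
| myfield  |
+----------+
| CafÃ©    |
+----------+
1 row in set (0.00 sec)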

Caveats

A few final notes before moving on, just to drive the point home:

* The remainder of these instructions will not work if you are not speaking UTF-8 everywhere, or if your tables are not using the UTF-8 collation.

* If you have a relatively small number of rows affected by character encoding issues at this point, it may just be simpler and safer to fix these by hand.
* If you have a relatively small number of different types of double-encoded UTF-8 characters, it may be simpler to convert these characters directly using the techniques detailed by Derek Sivers in his excellent blog article here: http://www.oreillynet.com/lpt/wlg/9022 (a minimal sketch of that approach follows this list).
* These instructions assume you won't have single- and double-encoded UTF-8 characters in the same field of a single row of any given table. This is a relatively safe assumption, unless you've already partially completed fixing the data by hand as described in the previous point. If you do have such fields in your data set, these instructions will definitely corrupt them. (i.e. you're better off either finishing your conversion using Derek's techniques, or reverting the data back to the state it was in before attempting them.)
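
For reference, Derek's direct-replacement approach boils down to one targeted UPDATE per mangled sequence. A minimal sketch, using the placeholder names defined in the next section and the "Ã©" → "é" pair purely as an example:

mysql> UPDATE mytable SET myfield = REPLACE(myfield, 'Ã©', 'é') WHERE myfield LIKE '%Ã©%';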

Scope of these instructions
The instructions in this section detail how to fix double- (and triple- and quadruple-) encoded character issues for one column in all rows of one table in the database. These steps will need to be repeated for every affected column in every table in your database. Scripting this is left as an exercise for the reader (or as an incentive to become a customer of Blue Box Group. 😉 )

For the examples below, I'll use the following naming conventions. Obviously, you'll need to replace these with the actual names that apply to your database:

Table: mytable
Field / Column: myfield
Type of the above field / column: mytype
Temporary table (name is arbitrarily chosen): temptable

Find tables with fields containing multi-byte characters in them

At this point, only those rows which have non-ASCII characters in text-containing fields (like CHAR, VARCHAR, TEXT, etc.) should have problems with double-encoded UTF-8 characters in them. The really nice part about UTF-8 here (and tables containing UTF-8 characters) is that any non-ASCII characters are by design multi-byte characters. We can exploit this fact to find tables and fields with rows containing non-ASCII characters.

mysql> select count(*) from mytable where LENGTH(myfield) != CHAR_LENGTH(myfield);
+----------+
| count(*) |
+----------+
|      669 |
+----------+
1 row in set (0.28 sec)
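
This works because LENGTH() counts bytes while CHAR_LENGTH() counts characters, and the two only differ when a multi-byte character is present. A quick illustration on a UTF-8 connection:

mysql> SELECT LENGTH('é'), CHAR_LENGTH('é');
+-------------+------------------+
| LENGTH('é') | CHAR_LENGTH('é') |
+-------------+------------------+
|           2 |                1 |
+-------------+------------------+
1 row in set (0.00 sec)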

If the count(*) query against your table returns zero, this means one of the following:

* There are no non-ASCII characters in any rows of the table for this field. Therefore, this field of this table does not need any correction and can be ignored.
* Your table is not using UTF-8 collation (which is bad, because if you’ve been following my instructions thus far it should be).
* Non-ASCII characters in this field in this table have already been munged by previous attempts at correcting character encoding issues and have been turned into single-byte characters (more than likely a question mark). If you have any rows with fields like this, the information needed to automatically convert them back into the multi-byte characters they should be has already been lost, and these rows will need to be fixed by hand.

Anyway, you will want to go through all text-like fields in all tables of your database and find any which contain rows that have multi-byte characters in them like this. These rows can potentially have double-encoded UTF-8 characters in them and will need conversion as detailed below. I suggest compiling these into a list you can automatically parse with a script which will do the remainder of the steps below.
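
One way to build that list (assuming MySQL 5.0 or later, which provides information_schema) is to ask the server directly for every text-like column outside the system databases:

mysql> SELECT table_schema, table_name, column_name
    -> FROM information_schema.columns
    -> WHERE data_type IN ('char', 'varchar', 'tinytext', 'text', 'mediumtext', 'longtext')
    -> AND table_schema NOT IN ('mysql', 'information_schema');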

Also note that it is very important at this stage to make note of the field type and any associated parameters (e.g. the length of the VARCHAR, etc.). I suggest taking a close look at the table's description to get this information:

mysql> describe mytable;
+----------+-------------+------+-----+---------+----------------+
| Field    | Type        | Null | Key | Default | Extra          |
+----------+-------------+------+-----+---------+----------------+
| msno     | char(3)     | YES  |     | NULL    |                |
| title    | varchar(50) | YES  |     | NULL    |                |
| subtitle | text        | YES  |     | NULL    |                |
| author   | varchar(30) | YES  |     | NULL    |                |
| pubdate  | date        | YES  |     | NULL    |                |
| series   | varchar(50) | YES  |     | NULL    |                |
| journal  | varchar(50) | YES  |     | NULL    |                |
| arttext  | text        | YES  |     | NULL    |                |
| id       | smallint(6) | NO   | PRI | NULL    | auto_increment |
+----------+-------------+------+-----+---------+----------------+
9 rows in set (0.00 sec)

mysql> select count(*) from mytable where LENGTH(myfield) != CHAR_LENGTH(myfield);
+----------+
| count(*) |
+----------+
|       63 |
+----------+
1 row in set (0.01 sec)

Convert double-encoded UTF-8 characters to proper UTF-8 characters

Here's where I stopped following the instructions linked above. I couldn't get the commands listed there to work with my version of MySQL, so I used phpMyAdmin instead.

1. Using phpMyAdmin, convert the table's collation to latin1_swedish_ci
2. Change the column's type from whatever it is (e.g. text) to BLOB
3. Change the column's type back to whatever it was (e.g. text) and the collation to utf8_general_ci
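
If you'd rather stay on the command line, the equivalent ALTER TABLE statements should look something like the sketch below. This assumes myfield is a TEXT column; substitute the real column type (mytype) from the describe output you collected earlier:

mysql> ALTER TABLE mytable MODIFY myfield TEXT CHARACTER SET latin1;
mysql> ALTER TABLE mytable MODIFY myfield BLOB;
mysql> ALTER TABLE mytable MODIFY myfield TEXT CHARACTER SET utf8;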

Let me explain what the above does:

* The first alter table command tells mysql to convert the UTF-8 encoded characters into Latin-1. At first, this appears to be the exact opposite of what we were trying to accomplish above by making everything everywhere speak UTF-8. But please remember that by “double-encoded” UTF-8 characters, I really mean properly-encoded UTF-8 characters that were run through MySQL’s Latin-1 to UTF-8 conversion algorithm one too many times. By converting the UTF-8 collated field to Latin-1, this effectively tells MySQL to run those characters through the reverse algorithm: Convert UTF-8 encoded characters to Latin-1. This has the effect of taking those double-encoded UTF-8 characters and turning them into single-encoded or properly-encoded UTF-8 characters.

So what about those rows which already had properly-encoded UTF-8 characters in them? Well, that's what some of the warnings MySQL emits during that first conversion were about. But we'll get to that in the next point.
* The second alter table command just converts the textual field into a BLOB (i.e. raw binary) field. This makes no change to the actual binary data underneath, and no conversion algorithm is run.
* The third alter table command takes that BLOB field and converts it back to a textual field with the UTF-8 encoding. Since MySQL has no character conversion algorithms defined when going from a BLOB to any other data type, these last two steps effectively let us fool MySQL into converting the field from Latin-1 to UTF-8 collation while skipping the character set conversion algorithm that would otherwise re-double-encode our now-pristine, properly-encoded UTF-8 characters. Pretty sneaky, eh?
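
After converting a column, a quick sanity check is to re-run the multi-byte query from earlier and eyeball a few of the affected rows; on a UTF-8 terminal they should now display correctly:

mysql> SELECT myfield FROM mytable WHERE LENGTH(myfield) != CHAR_LENGTH(myfield) LIMIT 5;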
