After upgrading a client's Rails application and all its gems, it was finally time to export the MySQL database. For everyone's surprise, characters beyond ASCII were completely broken, even though the original version of the client's application was perfectly working with characters with diacritics, chinese or even cyrillic.
This post talks about the real problem going underneath the cushy MySQL cover, and how we solved it.
I advise you to avoid jumping directly into the solution and to be perfectly sure that this is also your problem. A "diagnosing method" is described at the end of this document.
Everyone that has struggled with character encodings, whether while exporting MySQL databases or simply while dealing with exotic strings in Python (the famous
'ascii' codec can't encode character), knows how hellish it can be: from a simple encoding mismatch to double-encoded data, the possibilities are endless. And the
gibberish is usually useless to target the source of the problem.
And when it comes to MySQL charset and collation problems... It becomes even worse. Because MySQL
* ... has a default database charset and collation. * ... has a per table charset and collation. * ... has a per text field charset and collation.
Obviously, this degree of the specification provides MySQL with a great yet troublesome power.
As more and more web applications are built with a global user base in mind, UTF-8 has achieved the status of de facto standard when it comes to choosing the encoding to use; however, for applications already out there, the solution usually implies migrating the database from a charset such as
latin1, and, in particular, its collation
latin1_swedish_ci. Why? Because
latin1_swedish_ci is the default collation for the
latin1 character set, which in turn is the default character set for MySQL (as of MySQL 5.0).
Most databases are created using these defaults. So was our client's database. But
latin1 is a single-byte charset, so it didn't make much sense that our client's app was rendering an enormous range of alphabets correctly.
As a sidenote, I should clarify that MySQL's
latin1 is not ISO-8859-1 as one may think, but is in fact Windows-1252. This choice is justified in MySQL Reference Manual West European Character Sets entry:
This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.
The real problem was that UTF-8 data had been inserted inside
latin1 fields. Not a single row, obviously: a major part of the database was, in fact, UTF-8. But even when we came to this conclusion, some questions remained: how could UTF-8 data have gotten there in the first place? And how could UTF-8 data inside
latin1 fields be rendered correctly (which made this encoding mismatch silent)?
The answer is simple: Rails did it. Well, not Rails per se. Imagine the following sequence of events:
The database is created using MySQL default options: a database with a
latin1charset is born;
A Rails application is developed using the already existent database, where international users post UTF-8 data;
UTF-8 data is treated as a multibyte character sequence, but erroneously sent to MySQL as
latin1(due to a misconfiguration on the charset Rails used to handle the database connection);
MySQL receives the "supposedly"
latin1data, handling it as a single-byte multicharacter sequence, that is stored as is (meaning it suffers no conversion), since the data is supposedly using the same charset as the database.
This also explains why the user could see UTF-8 data: since it is also retrieved as
latin1by Rails, the supposedly single-byte multicharacter sequence will be correctly interpreted as the UTF-8 multibyte character it represents (since no conversion is performed), ending with an UTF-8 webpage.
mysqldump with multiple flags, overriding charsets, but nothing really worked. Some solutions worked partially (showing diacritics but not alphabets other than the latin one). The solutions about to be described are the combined result of several solutions with our specific needs.
Update: We were told through Twitter by @fhe that MySQL's
utf8 charset was breaking emojis and that he had to use
utf8mb4. It turns out MySQL's
utf8 is NOT UTF-8. While valid UTF-8 multi-byte sequences may use up to 4 8-bit bytes MySQL's
utf8 charset supports a maximum of 3 bytes per sequence.
Please replace every occurence of
utf8 in the commands below with
utf8mb4 for full UTF-8 support.
He also pointed out you may also need to use tables with dynamic row format and use large index prefixes since
utf8mb4 uses more space than plain
Mathias Bynens wrote a really nice article on how to fully support UTF-8 on MySQL.
If you are 100% sure this is your problem, you may be wondering how to fix it (and I'm sure you know by now that simply doing a
mysqldump won't do the trick). Let's consider your options:
You want to do it during the dumping process, since you can't afford to do it in production (or you're simply migrating and it's just good timing);
You want to do it on the database itself (be sure to have a full backup just in case).
Since we couldn't afford to do it in production, we chose to do it while exporting the database.
Fixing data on the database itself (inside of the damaged database)
We chose not to perform this kind of solution on our client's database, but most of the solutions seen the online point on this direction. Wordpress Codex has one of the best documented solutions I've seen of this.
The reasoning behind it is that data is actually on a different charset than MySQL thinks it is, and changing it will lead MySQL into an erroneous attempt of conversion into the new charset. By previously changing the field type to a binary field type (which has no charset), we avoid this attempt so that our data is left intact to be correctly interpreted according to its new charset.
It is worth mentioning that this solution needs to be applied for every
latin1 field, which on our case would be impractical given the number of fields and tables.
Fixing data outside of Production (outside of the damaged database)
The reasoning behind this solution is related to why the import of a simple
- The dump includes both the schema and data; the schema will create the tables as
latin1, while data will be UTF-8! We'll end up where we started!
- MySQL will try to convert your data if it thinks it should! Several solutions pointed to the same command sequence:
--skip-set-charset --default-character-set=latin1flags, to avoid MySQL attempt of reconversion and setting a charset.
- Replacing every occurrence of
CHARSET=utf8(whether by using
perlor any text editor that correctly supports UTF-8 [some editors may break the dump]). This ensures the tables are created with the correct charset.
- Restoring the dump.
With this rationale in mind, a combination of these solutions was achieved, with a small adjustment. Since the replacing step should be done on a file or through pipelining, I believe the best way to do it is by separating the schema dump from the data itself. That way, the replace can be performed on the schema dump only while pipelining the data directly to its destiny, without running
sed over the data (which should be greater than the schema).
This led to the final sequence:
mysqldumpof the database schema, either to a file or pipelined with the next 2 steps.
- Replace every occurrence of
CHARSET=utf8on the schema dump.
- Restoring the schema dump on a newly created database (with UTF-8 as default!).
mysqldumpof the data with
--skip-set-charset --default-character-set=latin1flags, directly pipelined to the future's destination
which translates into:
# Login into your future database host to create a new database with an UTF-8 charset $ mysql -h FUTURE_HOST -u FUTURE_USER -p mysql> CREATE DATABASE `FUTURE_DB` DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci; # Flush the current database schema on the future host, replacing all CHARSET=latin1 occurrences along the way mysqldump -h CURRENT_HOST -u CURRENT_USER -p CURRENT_DB --no-data --skip-set-charset --default-character-set=latin1 \ | sed 's/CHARSET=latin1/CHARSET=utf8/g' \ | mysql -h FUTURE_HOST -u FUTURE_USER -p FUTURE_DB --default-character-set=utf8 # Flush the current database data on the future host mysqldump -h CURRENT_HOST -u CURRENT_USER -p --no-create-db --no-create-info --skip-set-charset --default-character-set=latin1 CURRENT_DB \ | mysql -h FUTURE_HOST -u FUTURE_USER -p FUTURE_DB --default-character-set=utf8
I left out usual performance flags such as
--opt, as well as
--max_allowed_packet, but you should consider combining these with your dumps/imports.
I recommend reading Blue Box's post on this matter: Blue Box Blog: Getting out of MySQL Character Set Hell: they analyze the problem in fine detail and suggest several solutions, even addressing the problem of double and triple encoding. Socialcast blogs: Rails 2, MySQL, and Character Sets is also a nice reading on the subject.
This post should be seen as a mere attempt of providing the reader with a simple explanation, a diagnose method and two solutions, according to the specific needs.
Diagnosing the problem (a.k.a. "How to tell if this is happening to your database")
Querying the database for data will usually produce garbled text; if that is the case, try using the following command on a
mysql session, before querying the database again.
SET SESSION CHARACTER_SET_RESULTS = latin1;
If you're now seeing your data correctly, this is a strong indicator that your
latin1 database suffers from this disease; since the results are now being interpreted as
latin1, no conversion is performed and UTF-8 data is now being shown on your terminal.
Attention: I advise you to use the
mysql client for querying the data and avoid using phpMyAdmin or any elaborate client for this, since these may override the client's and results' charsets, hiding the problem.
Reproducing the problem
Run these commands on a
CREATE DATABASE `latin1_test_db` DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci; USE `latin1_test_db`; CREATE TABLE IF NOT EXISTS `latin1_test_tbl` ( `id` int(11) NOT NULL AUTO_INCREMENT, `field` text NOT NULL, PRIMARY KEY (`id`) ) DEFAULT CHARSET=latin1 AUTO_INCREMENT=1; SHOW VARIABLES LIKE 'collation_database'; SHOW VARIABLES LIKE 'character_set%'; /* The commands below will allow your mysql client to act as a misconfigured Rails application */ SET SESSION CHARACTER_SET_CLIENT = latin1; SET SESSION CHARACTER_SET_RESULTS = latin1; INSERT INTO `latin1_test_tbl` ( `id`, `field` ) VALUES ( NULL, 'Coimbra diz Olá! 日本語 is Japanese in Japanese, obviously...' ); SELECT * FROM `latin1_test_tbl`;
Your data was correctly inserted, as the
SELECT command shows. Now end your MySQL client session, login again and query the table for its data (using the last command from above), but this time without setting the session's character set. Is that garbish familiar?
- Blue Box Blog: Getting out of MySQL Character Set Hell
- Socialcast blogs: Rails 2, MySQL, and Character Sets
- Collation chart for latin1_swedish_ci (MySQL 6.0.4)
- Oneiroi: Converting mySQL Latin1 to Utf8
- Character encoding hell: Converting MySQL latin1 to utf8 for use in Django