Tuesday, January 13, 2015

The case of the missing smily face


For an application that I work on I encountered the problem that when users added an emoji character it did not get stored properly in the database. It turns out that MySQL 5.6 has a limitation on what characters can be stored in utf-8 fields. It only accepts characters that can be stored in 3 bytes. Anything that required 4 bytes needs the character set of utf8mb4. That worked great in tests, but when I tried to store values using OurSQL and SQLAlchemy. After some research I found that had to make two workarounds to get things working.

There are two parts to the solution. For SQLAlchemy you need to add the charset argument to the connection string.

connetion_string = 'mysql+oursql://<user>:<password>@localhost/<dbname>?charset=utf8mb4'


The second step is to register a function that return the normal utf8 codec when an app looks up utf8mb4. This is needed for OurSQL otherwise it will complain that it does not recognize utf8bm4. NOTE: After pasting this code I realize that the else should return codecs.lookup(name), but I have not tested this so I will leave this.

import codecs
codecs.register(lambda name: codecs.lookup('utf8') if name == 'utf8mb4' else None) 

As I understand it (not a MySQL person) in the future MySQL will support 4 byte characters with plain old UTF8. I will make an update when that happens.