Emoji for Unicode: Open Source Data for the Encoding Proposal

By Markus Scherer, Google Internationalization Engineering

Emoji (絵文字), or "picture characters", the graphical versions of :-) and its friends, are widely used and especially popular among Japanese cell phone users. Just last month, they became available in Gmail ― see the team's announcement: A picture is worth a thousand words.

These symbols are encoded as custom (carrier-specific) symbol characters and sent as part of text messages, emails, and web pages. In theory, they are confined to each cell phone carrier's network unless there is an agreement and a converter in place between two carriers. In practice, however, people expect emoji just to work - what they put into a message will get to all the recipients; what they see on a web page will be seen by others; if they search for a character they'll find it. For that to really work well, these symbol characters need to be part of the Unicode Standard (the universal character set used in modern computing).

There are active, on-going efforts to standardize a complete set of emoji as regular symbols characters in Unicode. This involves determining which symbols are already covered in Unicode, and which new symbols would be needed. We're trying to help this effort along by sharing all of our mapping data and tools in the form of the "emoji4unicode" open source project. The goal is more effective collaboration with other members of the Unicode Consortium and review by the cell phone carriers and other interested parties. By making these tools and mappings available, we hope to assist and accelerate the encoding process. Take a look at the documentation, browse the data and tools and let us know what you think.

opensource.google.com

Google Open Source Blog

Emoji for Unicode: Open Source Data for the Encoding Proposal

Wednesday, November 26, 2008