Which languages convey the most information in the least space? Introducing the Unimorph dataset.

Monday, August 8, 2016

Several years ago a science journalist asked me which languages could pack the most information into a 140-character Tweet. Because Twitter defines a character roughly as a single Unicode code point, this turns out to be an easy question to answer. Chinese almost certainly rates as the most “compact” language from that point of view because a single Chinese character represents a whole morpheme (in linguistic terminology, a minimal unit of meaning), whereas an English letter represents only part of a morpheme. The Chinese equivalent of “I don’t eat meat”, which in English takes 16 characters including spaces, is 我不吃肉, which takes just four.

But this question relates to a broader question that as a linguist I have often been asked: which languages are the most “efficient” at conveying information? Or, which languages can convey the same information in the smallest amount of space? Untethered by the idiosyncrasies of Twitter, this question becomes quite difficult to answer. What do you mean by “space”? Number of characters? Number of bytes? Number of syllables? Each of these has its own problems. And perhaps more crucially, what do you mean by “information”? The Shannon notion of information does not straightforwardly apply here.
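To make the point about “space” concrete, here is a minimal sketch that measures the two example sentences above under two of these definitions: Unicode code points and UTF-8 bytes. The function name `size_in_units` is hypothetical and not part of the released data, the English sentence is typed with a plain ASCII apostrophe, and UTF-8 is an assumed encoding chosen only for illustration.

```python
def size_in_units(text):
    """Measure a string under two possible definitions of 'space'."""
    return {
        # Python's len() on a str counts Unicode code points.
        "code_points": len(text),
        # Encoding to UTF-8 gives the size in bytes under that encoding.
        "utf8_bytes": len(text.encode("utf-8")),
    }


english = "I don't eat meat"  # ASCII apostrophe (an assumption of this sketch)
chinese = "我不吃肉"

for label, sentence in (("English", english), ("Chinese", chinese)):
    print(label, size_in_units(sentence))
```

Counted in code points, the Chinese sentence is a quarter the length of the English one (4 versus 16), but counted in UTF-8 bytes the gap narrows to 12 versus 16, since each of these Chinese characters occupies three bytes. The ranking of languages can therefore depend heavily on which unit is chosen.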

A group of us at Google set out to answer this question, or at least to provide the form that an answer would have to take. We had the resources and experience needed to annotate data in multiple languages, and we were able to divert some of those resources to this task. The results were published in a paper presented at the 2014 International Conference on Language Resources and Evaluation in Reykjavík, Iceland.

We are now releasing the data on GitHub. The data consist of 85 sentences typical of the kinds of sentences generated by Google Now, translated into eight typologically diverse languages: English, French, Italian, German, Russian, Arabic, Korean, and Chinese. These include both highly inflected and uninflected languages, and cover various types of morphology, including inflectional and agglutinative. The data were annotated by one to three annotators, depending on the language, with morphological information, counts of the marked features, and other information. The main data file is in HTML, color coded by language, which makes it easy to browse but also easy to extract into other formats.

Since the basic information conveyed by each sentence can be assumed to be the same across languages, the main focus of the research was on the additional information that each language marks, and cannot avoid marking. For example, the English sentence:

Use my location for the search results and other services.

has the French translation:

Utilisez ma position pour les résultats de recherche et d'autres services.

The verb ending -ez (in Utilisez above) marks “addressee respect”, a bit of information that is missing from the English original. One could have used a different ending on the French verb, but that would not avoid this bit of information; it would instead be choosing to mark lack of respect, or familiarity with the addressee.

In the paper we tried various ways of measuring the differing information content of the languages relative to various definitions of “space”. Considering all the factors together, we concluded that the languages that conveyed the most information in a given amount of space were highly inflected languages like Russian, with uninflected languages like Chinese actually being the “least efficient” at conveying information.

We don’t expect this to be the final answer, which is why we are releasing the data as open source, in the hope that others will find it useful and perhaps extend it to more sentences or a wider variety of languages. Ultimately, though, any answer to the question of which languages convey the most information in the smallest amount of space must seriously address what is meant by “information”, and must pay heed to the famous maxim by the Russian linguist Roman Jakobson (1959) that “languages differ essentially in what they must convey and not in what they may convey.”

By Richard Sproat, Research Scientist