Budou: Automatic Japanese line breaking tool

Friday, October 21, 2016

Today we are pleased to introduce Budou, an automatic line breaking tool for Japanese. What is a line breaking tool and why is it necessary? English uses spacing and hyphenation as cues to allow for beautiful, aka more legible, line breaks. Japanese, which has none of these, is notoriously more difficult. Breaks occur randomly, usually in the middle of a word.

This is a long standing issue in Japanese typography on the web, and results in degradation of readability. We can specify the place which line breaks can occur with CSS coding, but this is a non-trivial manual process which requires Japanese vocabulary and knowledge of grammar.

Budou automatically translates Japanese sentences into organized HTML code with meaningful chunks wrapped in non-breaking markup so as to semantically control line breaks. Budou uses Cloud Natural Language API to analyze the input sentence, and it concatenates proper words in order to produce meaningful chunks utilizing PoS (part-of-speech) tagging and syntactic information. Budou outputs HTML code by wrapping the chunks in a SPAN tag. By specifying their display property as inline-block in CSS, semantic units will no longer be split at the end of a line.

Budou is a simple Python script that runs each sentence through the Cloud Natural Language API. It can easily be extended as a custom filter for template engines, or as a task for runners such as Grunt and Gulp. The latest version also caches the response so no duplicate requests are sent. If you are using Budou for a static website, you can process your HTML code before deployment.

Budou is aimed to be used in relatively short sentences such as titles and headings. Screen readers may read a sentence by splitting the chunks wrapped by SPAN tag or split by WBR tag, so it is discouraged to use Budou for body paragraphs.

As of October 2016, the Cloud Natural Language API supports English, Spanish, and Japanese, and Budou currently only supports Japanese. Support for other Asian languages with line break issues, such as Chinese and Thai, will be added as the API adds support.

Any comments and suggestions are welcome. You can find us on GitHub.

By Shuhei Iitsuka, UX Engineer