Google Open Source Blog: November 2009

Posts from November 2009

The Apertium Project's First Google Summer of Code

Friday, November 27, 2009

The Apertium Project works on open-source machine translation and language technology. We try to focus our efforts on lesser-resourced and marginalized languages, but also work with larger languages. To date, we have released translators for 21 language pairs, covering languages spoken by 1.1 billion people, ranging from English (est. 500m speakers) to Aranese (est. 4,000 speakers). A similar number of additional language pairs are in development. The Apertium software is licensed under the GPL, but in addition (a rarer situation in the machine translation field) so is the data for all these language pairs. This means that the data can be re-used by other language projects (e.g. in developing spelling or grammar checkers, thesauri, etc).

This was our first year in Google Summer of Code and we were very fortunate to receive nine student slots. We filled them with some great students and are pleased to report that out of the nine projects, eight were successful.

The completed projects were:

A translator for Norwegian Bokmål (nb) and Norwegian Nynorsk (nn)

This project was accepted as part of our "adopt a language pair" idea from our ideas page. Some work had already been done on the translator but it was a long way from finished. Kevin Unhammer from the University of Bergen was mentored by Trond Trosterud from the University of Tromsø. The final result, after an epic effort, is a working translator (and the first free software translator for nb-nn) that makes a mistake in only 11 words out of every 100 translated, which makes it feasible to use the system for post-editing.
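The post does not say how the "11 words out of every 100" figure was measured. One common way to express such a figure is word error rate: the word-level edit distance between the raw machine translation and a corrected reference, divided by the length of the reference. The sketch below assumes that definition; the function name and the placeholder sentences are illustrative only and are not taken from Apertium's own evaluation tools.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # reference word missing from hypothesis
                            curr[j - 1] + 1,       # extra word in hypothesis
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens standing in for a post-edited reference and raw MT output.
reference = "a b c d e f g h i j"
hypothesis = "a b x d e f g h i j"
print(word_error_rate(reference, hypothesis))  # one substitution in ten words
```

Under this definition, an 11% error rate means a post-editor would need to change roughly 11 words per 100 to reach the reference.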

One of the key aspects of Kevin's work was the re-use and adaptation of existing open source resources. Much of the bilingual dictionary was statistically inferred from the existing translations in KDE, using ReTraTos and GIZA++ (created by Franz Och). In addition to this, Kevin used the Oslo-Bergen Constraint Grammar, contributing fixes not only to that, but to the VISL CG3 software itself. Since the GSoC deadline, Kevin has continued his work, including incorporating some changes based on feedback from the Nynorsk Wikipedia.

A translator for Swedish (sv) to Danish (da)

This was another language pair adoption. Michael Kristensen, who had previously done some work on this translator, was mentored by Jacob Nordfalk, the author of our English to Esperanto translator. As there are very few free linguistic resources for Swedish and Danish, the work was pretty much started from scratch, although we took great advantage of the Swedish Wiktionary. The translator is only unidirectional, from Swedish to Danish, and it has an error rate of around 20%.

The completion of this translator is something of a triumph for Apertium. Begun back in 2005, the project had been neglected for many years. This was the first translator for the Apertium platform that focused on non-Romance languages.

Multi-engine machine translation (MEMT)

Gabriel Synnaeve was mentored by Francis Tyers to work on a module to improve the quality of machine translation by taking translations from different systems and merging their strengths and discarding their weaknesses. The two systems focused on in the initial prototype are Apertium (rule-based MT) and Moses (statistical MT) but it can easily be extended to more. The idea behind the system is that for some languages there is often not one MT system which is better than all others, but some are better at some phrases and some are better at others. Thus, if we can combine the output of two or more systems with different strengths/weaknesses, we can make better translations.
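As a rough illustration of the idea, and not a description of how the MEMT module itself works, the sketch below picks the best candidate for each segment from several systems' outputs. The segment alignment, the `toy_score` function, and the sample sentences are all hypothetical; a real system would need a far better quality estimate, such as a language model score.

```python
from typing import Callable, Dict, List


def combine_translations(candidates: Dict[str, List[str]],
                         score: Callable[[str], float]) -> List[str]:
    """For each segment, keep the candidate translation with the highest score."""
    if not candidates:
        raise ValueError("at least one system is required")
    systems = list(candidates)
    n_segments = len(candidates[systems[0]])
    if any(len(candidates[s]) != n_segments for s in systems):
        raise ValueError("all systems must translate the same segments")
    combined = []
    for i in range(n_segments):
        options = [candidates[s][i] for s in systems]
        # On a tie, max() keeps the first option, i.e. the first system listed.
        combined.append(max(options, key=score))
    return combined


# Hypothetical scorer: the fraction of words found in a small known vocabulary.
KNOWN_WORDS = {"the", "cat", "sat", "on", "mat"}


def toy_score(segment: str) -> float:
    words = segment.lower().split()
    return sum(w in KNOWN_WORDS for w in words) / len(words) if words else 0.0


outputs = {
    "rule_based": ["the cat sat", "on teh mat"],
    "statistical": ["cat the sat", "on the mat"],
}
print(combine_translations(outputs, toy_score))
```

Choosing whole segments is the simplest possible combination strategy; merging at the phrase level, as the paragraph above suggests, would require aligning the outputs of the different systems first.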

Perhaps the most exciting aspect of the MEMT project is its potential for use as a research platform for future work on hybrid machine translation, by allowing the researcher to focus only on the algorithms they wish to implement. During the project, Gabriel was joined by Francis in person for a 'mini-hackathon', which, despite something of a farcical start involving requests made on IRC for phone calls across Europe on behalf of two people who were in the same city, led to a greater degree of functionality and modularization in the code.

Highly scalable web service architecture for Apertium

Víctor Manuel Sánchez Cartagena worked with mentor Juan Antonio Pérez-Ortiz on a highly scalable web service architecture, or, Apertium for Cloud computing. Initially targeting Amazon's EC2, as well as standalone servers, the scalable web service allows multiple translation services to run on multiple physical or virtual servers, scaling to meet the translation demands of users, behind a single user-facing service that implements the Google Language API.

The core of the system is the translation router, which controls the flow between user and translation server based on a variety of factors, including the availability of the language pair and the current load on the server. It also provides a framework that allows these factors to be given different priorities on a per-user basis, and it takes into account the cost of each translation request. The project is a complete package: as well as the router, it includes a translation daemon and convenience scripts to ease the rollout of server instances.
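To make the routing decision concrete, here is a minimal sketch of how a router might weigh these factors. It is not the project's actual implementation: the `Server` fields, the weighted-sum formula, and the parameter names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class Server:
    name: str
    pairs: Set[str]   # language pairs the server can translate, e.g. "sv-da"
    load: float       # hypothetical load measure: 0.0 (idle) to 1.0 (saturated)
    cost: float       # hypothetical relative cost of sending a request here


def pick_server(servers: Iterable[Server], pair: str,
                load_weight: float = 1.0,
                cost_weight: float = 1.0) -> Optional[Server]:
    """Return the eligible server with the lowest weighted load-plus-cost score.

    The weights stand in for per-user priorities: a user who cares mostly
    about price would be given a high cost_weight, for example.
    """
    eligible = [s for s in servers if pair in s.pairs]
    if not eligible:
        return None  # no server offers this language pair
    return min(eligible,
               key=lambda s: load_weight * s.load + cost_weight * s.cost)


servers = [
    Server("a", {"sv-da"}, load=0.9, cost=0.1),
    Server("b", {"sv-da", "nb-nn"}, load=0.2, cost=0.5),
    Server("c", {"nb-nn"}, load=0.0, cost=0.0),
]
chosen = pick_server(servers, "sv-da")
print(chosen.name if chosen else "no server available")
```

With the default weights the lightly loaded server wins; raising `cost_weight` shifts the choice towards the cheaper but busier one.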

In addition to his work on his project, Víctor is also serving as an organiser for the FreeRBMT workshop.

Conversion of Anubadok

Abu Zaher was mentored by Kevin Donnelly and Francis Tyers to convert Anubadok, an open-source MT system for English to Bengali, to work with the Apertium engine. This was an ambitious project and not all of the goals were realised, but we were able to make the first wide-coverage morphological analyser / generator for Bengali and a substantial amount of lexical transfer, so the project was a great success.

Zaher is also looking at improving the Ankur spell checker with information from his analyser / generator, so the work done is already being reused. There is also interest in using the data to create a Bengali stemmer, for more efficient searching/indexing of Bengali texts. A number of tools which were created to model the various aspects of Bengali inflection will certainly prove useful in other areas of NLP for Bengali.

Apertium going SOA

Pasquale Minervini's work was motivated by the needs of Informatici senza Frontiere to have a translation engine that would fit into a Service-Oriented architecture. To this end, Pasquale, mentored by Jimmy O'Regan, designed an XML-RPC-based server that efficiently contains the Apertium pipeline, and layered it with JSON (still under development), SOAP, and CORBA services, which, as well as making Apertium more buzzword compliant, gives a greater range of options to programmers wishing to integrate Apertium's translation services into a wider range of architectures. This is undoubtedly a popular project idea: Alexa's keywords for Apertium show 'apertium going soa' and 'deadbeef apertium' (deadbeef is Pasquale's IRC nick) in 2nd and 4th place for search keywords leading to Apertium.

Because of the potential overlap between their projects, in the first weeks of their GSoC work, Pasquale and Víctor agreed on the Google Language API as a standard for their projects to communicate; Pasquale took this agreement one step further by implementing the 'language detection' feature of the API - something previously unavailable in Apertium. In addition to that, Pasquale also contributed memory leak checks against the Apertium platform, as well as other fixes, and has helped another (non-GSoC) student in the goal of porting Apertium to Windows.

Trigram part-of-speech tagging

Zaid Md. Abdul Wahab Sheikh was mentored by Felipe Sánchez Martínez to improve our part-of-speech tagging module to use trigrams instead of bigrams, as well as implementing changes to the training tools to create data for it.

Apertium was originally designed for closely related languages, but is growing to meet the challenges of translating between more distant languages. One of the unique aspects of Dr. Sánchez's work on part-of-speech tagging is the use of target language information, which allows an accurate tagger to be trained using much less data than usual. Zaid's work builds on Dr. Sánchez's work with first-order Hidden Markov Models, extending it to second-order HMMs, similarly to TnT. This enables more accurate translation between more distant languages, using the same methods, so that the rest of the Apertium system can continue to grow.
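The difference between the two model orders lies in the context used to predict each tag: a first-order (bigram) HMM conditions a tag on the one preceding tag, while a second-order (trigram) HMM conditions it on the two preceding tags. The sketch below estimates trigram transition probabilities by simple relative frequency from hand-tagged sequences. This is only an illustration of the trigram context. It is not how the Apertium tagger is trained, since the post describes training that exploits target language information, and the tag names and sample sequences are invented.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Trigram = Tuple[str, str, str]


def trigram_transition_probs(
        tag_sequences: Iterable[List[str]]) -> Dict[Trigram, float]:
    """Estimate P(t3 | t1, t2) by relative frequency over tag sequences."""
    trigram_counts: Counter = Counter()
    context_counts: Counter = Counter()
    for tags in tag_sequences:
        # Two start symbols give the first real tag a full two-tag context.
        padded = ["<s>", "<s>"] + list(tags)
        for t1, t2, t3 in zip(padded, padded[1:], padded[2:]):
            trigram_counts[(t1, t2, t3)] += 1
            context_counts[(t1, t2)] += 1
    return {tri: count / context_counts[tri[:2]]
            for tri, count in trigram_counts.items()}


# Invented tag sequences, one list per sentence.
sequences = [
    ["det", "noun", "verb"],
    ["det", "noun", "noun"],
    ["det", "adj", "noun"],
]
probs = trigram_transition_probs(sequences)
print(probs[("det", "noun", "verb")])
```

Because the number of possible contexts grows from the number of tags to its square, a trigram model needs more training data or smoothing than a bigram model, which is why changes to the training tools were part of the project.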

Java port of lttoolbox

Raphaël Laurent worked with Sergio Ortiz Rojas to port lttoolbox to Java. lttoolbox is the core component of the Apertium system; as well as providing morphological analysis and generation, it also provides pattern matching and dictionary lookup to the rest of Apertium, so a Java port is the first step towards a version of Apertium for Java-based devices. Raphaël finished an earlier line-for-line port contributed by Nic Cotrell, first making it work; then making it binary compatible.

As it stands currently, lttoolbox-java can be integrated into other Java-based tools, facilitating the re-use of our software and our extensive repository of morphological analysers. Tools such as LanguageTool, the open source proofreading tool, also make extensive use of morphological analysis, while OmegaT, the open source CAT tool, could use it for dictionary look-up of inflected words. It could even be used with our own apertium-morph tool: a plugin for Lucene that allows linguistically-rich document indexing.

FreeRBMT

On the 2nd and 3rd of November, we held the first FreeRBMT workshop, which was heavily inspired by the Google Summer of Code program, both as a way for students and mentors to meet in person, and to provide the students with an opportunity to present peer-reviewed papers about the work they completed during the program. The entire proceedings are available from the University of Alicante; in particular, we would like to highlight the papers which were successfully presented by the students who took part in GSoC:

Apertium goes SOA: an efficient and scalable service based on the Apertium rule-based machine translation platform; Minervini, Pasquale

Development of a morphological analyser for Bengali; Faridee, Abu Zaher Md.; Tyers, Francis M.

An open-source highly scalable web service architecture for the Apertium machine translation engine; Sánchez-Cartagena, Víctor M.; Pérez-Ortiz, Juan Antonio

Reuse of free resources in machine translation between Nynorsk and Bokmål; Unhammer, Kevin; Trosterud, Trond

A trigram part-of-speech tagger for the Apertium free/open-source machine translation platform; Sheikh, Zaid Md Abdul Wahab; Sánchez-Martínez, Felipe

In addition, the following paper was presented by the mentors of a successful project (Michael, the student, was unfortunately too busy to participate in its writing):

Shallow-transfer rule-based machine translation for Swedish to Danish; Tyers, Francis M.; Nordfalk, Jacob

We would like to thank Google for providing us with the opportunity to participate in the Summer of Code program; in particular, Leslie, Cat, and Ellen, for making it run so smoothly. We would also like to make special mention of two students: Ankitha Rao and Daniel Beck, who, despite being unsuccessful in their applications, continued to work on their proposed projects (an English to Hindi translator, and a module for multi-word units, respectively). Finally, we would like to thank all of the students, mentors, and administrators who contributed their time and skill to Apertium.

By Francis Tyers and Jimmy O'Regan, Summer of Code Mentors for the Apertium Project

SWIG's Second Summer of Code

Monday, November 23, 2009

SWIG is a programmer's tool designed to make it easier to use C and C++ code from other popular programming languages such as Python, Perl, Ruby, PHP, Java, and C#. 2009 was SWIG's second Summer of Code, and this year we mentored five projects related to SWIG. All five students were very active over the summer period and produced some great new features. In no particular order:

Matevž Jekovec has been busy working at the coal face of SWIG to add support for C++0x, the forthcoming C++ standard. Matevž has managed to achieve close to full support for C++0x. The C++0x Wikipedia article details the numerous planned new features and Matevž has put together a SWIG C++0x page documenting the new SWIG support for each of these. In summary, the enhanced C++ language can now be parsed by SWIG, which in itself is a great step. There is much more than just this though, as most of the information parsed is used to create useful wrappers of C++0x code. The work can be tried out on the C++0x branch which should be merged fairly soon into a forthcoming release.

Miklos Vajna has been working on SWIG's PHP support to implement an advanced SWIG feature already supported for most other target languages, but not PHP. The feature is called "directors" and allows cross-language polymorphism - wrapped C++ classes can be subclassed in PHP and virtual method calls work in the natural way, whether they're made from PHP or C++ code. You can read more in the new PHP Director documentation. Miklos made such great progress that we were able to merge this support into SWIG 1.3.40, which was released even before the Summer of Code finished. Miklos also spent some time working on improving SWIG's test suite for PHP, and fixing bugs in the PHP support.

Ashish Sharma spent the summer adding support for Objective-C as a new target language. Objective-C is a major language on the Mac OS X platform. This means that now SWIG can be used to generate Objective-C wrappers over C++ code. In particular the wrappers include proxy classes, which preserve the class hierarchy from the C++ code. Ultimately this means that from the user's perspective, proxy objects look no different to objects originally written in Objective-C. Adding a new target language is quite a considerable task and Ashish is keen to add plenty more improvements over the coming months. Ashish's work is in Subversion and can be accessed in the ashishs99 branch.

Baozeng Ding has also added a new target language, in this case for the Scilab language, a free numerical computing package. He has coded up support for all the C features: variables, functions, constants, enums, structs, unions, pointers and arrays and also intends to develop it further in the near future. Documentation for SWIG and Scilab can be viewed online direct from Baozeng's Subversion branch.

Kosei Moriyama has been working on Perl bindings for the Xapian library using SWIG, to replace some existing bindings implemented by hand. He's achieved almost complete compatibility with the API of the existing bindings (the only real omission is callbacks which are waiting for completion of director support for Perl in SWIG). He has also wrapped features which weren't previously accessible from Perl. You can view Kosei's work online in his Subversion branch.

Finally, many thanks to Google for sponsoring the Summer of Code and a special thanks for all the hard work done by the students, mentors and Olly Betts, the co-administrator.

By William Fulton, SWIG administrator

Chromium OS Now Open Sourced

Wednesday, November 18, 2009

In July we announced that we were working on a project called Google Chrome OS, an open source operating system based on the Google Chrome browser and built for today's web. For the past few months we have been working hard on developing a solid foundation and today we are excited to announce the Chromium OS open source project.

You can read more about our open source announcement at the Chromium Blog, or get involved directly at chromium.org. We look forward to working with the open source community to help shape the future of personal computing.

By Martin Bligh, Software Engineer

Hey! Ho! Let's Go!

Tuesday, November 10, 2009

Here at Google, we believe programming should be fast, productive, and most importantly, fun. That's why we're excited to open source an experimental new language called Go. Go combines the development speed of working in a dynamic language like Python with the performance and safety of a compiled language like C or C++. Typical builds feel instantaneous; even large binaries compile in just a few seconds. And the compiled code runs close to the speed of C. Go lets you move fast.

Go is a great language for systems programming with support for multi-processing, a fresh and lightweight take on object-oriented design, plus some cool features like true closures and reflection.

Want to write a server with thousands of communicating threads? Want to spend less time reading blogs while waiting for builds? Feel like whipping up a prototype of your latest idea? Go is the way to go! Check out the video for more information or visit golang.org.

By Robert Griesemer, Rob Pike, Ken Thompson, Ian Taylor, Russ Cox, Jini Kim and Adam Langley - The Go Team

London Open Source Jam 14

Wednesday, November 4, 2009

We held the 14th Google London Open Source Jam at our Victoria HQ on September 24th. The topic this time was "Video and Sound", and our Jammers had some real treats to share.

Steven Goodwin told us how his open source SGX 3D graphics engine deals with three key problems of other computer game engines. On a similar theme, Themis Bourdenas discussed the vine engine, a modular game engine for 2d and 3d games.

Borys Musielak presented Filmaster, an open source film recommendation engine. Neil Harris told us about an attempt by the Kendra Initiative to foster a common meta data format for content discovery on the semantic web.

In an Open Source Jam first, Jagannathan gave a performance of his Din software musical instrument. Din is designed for playing live Indian music, is based on Bezier curves and really has to be heard to be fully appreciated.

Sam Mbale gave us an update on his projects to help Africans build online communities using open source. Mike Mahemoff discussed some web tools frameworks for intranets, bookmarklets and trails in Scrumptious.

The UK government has plans to introduce a law to allow content-owners to force ISPs to disconnect the internet connection of users suspected of file sharing, without any proof. Glyn Wintle gave us an overview of how the proposed law will affect us, how the Open Rights Group is campaigning against it, and how we can help.

Douglas Squirrel talked about the difficulty blind people have in finding information on websites, and presented BlindPages.com - a new project to reformat the web in a screen-reader friendly way. He also demoed a prototype telephone interface to the service.

Much pizza was eaten and free beer drunk, and we all ended up in the pub next door to continue our discussions. A big thank you to all our speakers and attendees, and we hope to see you at the next Jam!

By Matt Godbolt, Mobile Engineering