Posts from November 2008

Emoji for Unicode: Open Source Data for the Encoding Proposal

Wednesday, November 26, 2008

Emoji (絵文字), or "picture characters", the graphical versions of :-) and its friends, are widely used and especially popular among Japanese cell phone users. Just last month, they became available in Gmail ― see the team's announcement: A picture is worth a thousand words.

These symbols are encoded as custom (carrier-specific) symbol characters and sent as part of text messages, emails, and web pages. In theory, they are confined to each cell phone carrier's network unless there is an agreement and a converter in place between two carriers. In practice, however, people expect emoji just to work - what they put into a message will get to all the recipients; what they see on a web page will be seen by others; if they search for a character they'll find it. For that to really work well, these symbol characters need to be part of the Unicode Standard (the universal character set used in modern computing).

There are active, on-going efforts to standardize a complete set of emoji as regular symbols characters in Unicode. This involves determining which symbols are already covered in Unicode, and which new symbols would be needed. We're trying to help this effort along by sharing all of our mapping data and tools in the form of the "emoji4unicode" open source project. The goal is more effective collaboration with other members of the Unicode Consortium and review by the cell phone carriers and other interested parties. By making these tools and mappings available, we hope to assist and accelerate the encoding process. Take a look at the documentation, browse the data and tools and let us know what you think.

WHOPR - A scalable whole program optimizer for GCC

Thursday, November 20, 2008

Traditional compilation proceeds one file at a time. The compiler optimizes and generates code for each file in isolation and then the final executable is created by linking all the individual files together.

This model of compilation has the advantage that all the files can be compiled concurrently, which greatly reduces build time. This is particularly useful with current multiprocessor machines and applications consisting of hundreds and even thousands of files. However, this model presents a fairly significant barrier for optimization.

Consider this program:

for (;;) {
x += g (f (i, j), f (j, i));

float f(float i, float j)
return i * (i - j);

float g(float x, float y)
return x - y;

From an optimization perspective, inlining f and g inside foo is likely going to provide significant performance improvements. However, the compiler never sees both files at the same time. When it is compiling foo.c it does not have access to the functions in bar.c and vice-versa. Therefore, inlining will never take place.

One could get around this problem by moving the bodies of f and g to a common header file and declaring them inline. But that is not always desirable or even possible. So, several optimizing compilers introduce a new feature called Link-Time Optimization (LTO) that allows these kinds of cross-file manipulations by the compiler.

The LTO model essentially splits code generation in two major phases

1. Generation of Intermediate Representation (IR). The original source code is parsed as usual and an intermediate representation (IR) for the program is generated. The IR is a succinct representation of the original program, symbols and types. This contains all the information that the compiler needs to generate final code, except that instead of using it to generate final code for each file, the compiler saves it to an intermediate file for later processing.

2. Once the IR has been emitted for all the source files, the intermediate files generated in the previous step are loaded in memory and the whole set is analyzed and optimized at once.

To visualize this process, imagine that you simply concatenated all the source files together and compiled the result. Since the compiler has visibility over every function in every compilation unit, decisions that would normally use conservative estimates can instead be based on data-flow information crossing file boundaries. Additionally, the compiler is able to perform cross-language optimizations that are not possible when the compilation scope is restricted to individual files.

The LTO model of compilation is useful but it has severe scalability issues. A basic implementation has the potential to incur massive memory consumption during compilation. Since every function body in every file may be needed in memory, only relatively small programs will be able to be compiled in whole program mode.

At Google, we deal with several massively large applications, so we are working on a scalable alternative to traditional LTO called WHOPR (WHOle Program optimizeR), which introduces parallelism and distribution to be able to handle arbitrarily large programs. The basic observation is that to do many whole program optimizations, the compiler rarely needs to have all the functions loaded in memory, and final code generation can be parallelized by partitioning the program into independent sets.

WHOPR then proceeds in 3 phases

1. Local Generation (LGEN). This is the same as traditional LTO. Every source file is parsed and its IR saved to disk. This phase is trivially parallelizable using make -j or distcc or any other similar technique.

2. Whole program analysis (WPA). After all the IR files are generated, they are sent to the linker, but in this case the linker will not know what to do with them (there is no object code in them). So, the linker turns around and passes them back to the compiler which will collect summary information from every function in every file. This per-function summary information contains things like number of instructions, symbols accessed, functions called, functions that call it, etc. It is used to decide what optimizations to apply, but no optimizations are applied at this time. The compiler simply decides what to do and partitions the input files into new files that contain the original IR plus an optimization plan for each new file.

3. Local transformations (LTRANS). The new IR files generated by the previous phase are now compiled to object code using the optimization plan decided by WPA. Since each file contains everything needed to apply the optimization plan, it can also proceed in parallel.

This diagram shows the process. The only sequential step during optimization is the WPA phase, which does not need to operate on too much data and it is not computationally expensive. Everything else proceeds in parallel, suitable for multiprocessors or distributed machines.

After all the LTRANS processes are finished, the final object files are returned to the linker and the final executable is generated.

We are currently in the initial stages of implementation. The work is being implemented in the LTO branch in GCC. We expect to have an initial prototype by summer 2009. The branch can currently deal with some applications, but there are the usual rough spots. You can read more information about the project at

Serious Geeking Going on in Oxford Over Online Publishing

Thursday, November 13, 2008

On Wednesday 22 October, over a hundred geeks attended the ninth Oxford Geek Night, upstairs at the Jericho Tavern. After the musical theme of the previous OGN, this one had a distinct flavour of online publishing.

Jeremy Ruston of BT Osmosoft demonstrated TiddlyWiki (an open-source wiki application that works offline) and revealed its offshoot Project Cecily, a prototype ZUI (Zooming User Interface). Adrian Hon of Six to Start then explained the ideas and tech behind We Tell Stories, a complex Django-based site of interactive fiction, built for publishers Penguin UK.

Continuing the Django-ish theme, Rami Chowdhury discussed WSGI—the server/application web standard—in one of the more technical microslot talks (five minutes each, from local volunteers). In another, David Sheldon took us through the steps required to hack a CurrentCost electricity meter, to get at the regular XML packets it emits from a serial port.

In the microslot sessions we also covered moving your business mail to Google Mail, protection—or otherwise—against socially engineered virus vectors, and how to use an interlocking stack of Python, Ruby on Rails and Java to crawl the web for comparisons of mobile-phone tariffs. We also had a short talk from the Oxfordshire branch of the British Computing Society about their forthcoming IT-industry events.

As usual, the evening was rounded off by a book raffle, this time courtesy of Pearson Education. Many of the night’s talks—especially the keynotes and the microslot on antivirus protection—had generated heated debate among the geeks in the room, and this carried on for some time after proceedings had officially finished.

The Oxford Geek Nights are free events, thanks to Torchbox and the Google open-source team. But even the generosity of our sponsors couldn’t prevent the upstairs bar staff from tapping their watches, as we all headed downstairs into the main room of the pub to continue arguing.

Mixxx's Google Summer of Code 2008 Roundup Report

Wednesday, November 12, 2008

Google Summer of Code 2008 has been a great opportunity to bring fresh new talent into the Mixxx development team. For those not familiar with Mixxx, it's software that allows DJs to create live beatmixes. This year, Mixxx was supported by four students, each with a new project to help improve some aspect of Mixxx. As this was our second Summer of Code, we helped plan our students' projects better this year, which led to our students producing more maintainable code with clear paths for integration into our trunk.

Zach Elko worked on session saving and crash recovery. Session saving allows a DJ to save various aspects of Mixxx's state (such as the knob positions) for easy recall later. Early on in his project, Zach realized that a good starting point for a crash recovery system would be to allow Mixxx sessions to be saved and restored. His project's focus was shifted towards creating a rock solid session saving system, and Zach has made significant inroads toward this goal.

Russell Ryan rewrote Mixxx's waveform viewer widget. The waveform viewer widget renders a song's waveform in realtime and scrolls through it as playback proceeds. It also allows a user to seek through a song by dragging the widget. Russell's new waveform widget provides improved performance, better modularity, and is much more extensible than our previous widget. The new waveform viewer was merged into trunk in late July and was featured in our recent 1.6.0 final release.

Tom Care worked on improving Mixxx's support for hardware MIDI controllers, which are popular with DJs. MIDI controllers are hardware control devices that mimic the look and feel of real DJ mixers, and can make mixing much easier. Tom's work has yielded an easy MIDI binding interface so DJs can use any MIDI device with Mixxx, as well as overall improvements to the structure and modularity of our MIDI code.

And finally, Wesley Stessens continued his project to add Shoutcasting capabilities, which he began earlier in the year. Shoutcast support allows DJs to broadcast their mixes live through internet radio stations. Unfortunately, due to personal circumstances Wesley had to leave GSoC at the midterm. We're hopeful that he will rejoin our development team in the future.

By the end of August, the three remaining projects were in good shape. The two projects which haven't yet been merged into trunk have time-lines for being merged, and we're pleased with the outcome of these projects. We'd like to thank Google for their gracious support through Summer of Code this year. It's been a fantastic experience for us, and we're happy that we were able to introduce some students to open source development.

plop: Probabilistic Learning of Programs

Tuesday, November 11, 2008

Cross-posted with the Google Research blog

Traditional machine learning systems work with relatively flat, uniform data representations, such as feature vectors, time-series, and probabilistic context-free grammars. However, reality often presents us with data which are best understood in terms of relations, types, hierarchies, and complex functional forms. The best representational scheme we computer scientists have for coping with this sort of complexity is computer programs. Yet there are comparatively few machine learning methods that operate on programmatic representations, due to the extreme combinatorial explosions involved and the semantic complexities of programs.

The plop project, launched early Monday, November 10th, is unusual in addressing these issues directly - its long-goals are quite ambitious! Plop is being implemented in Common Lisp, an equally unusual programming language that is uniquely suited to constructing and transforming complex programmatic representations.

C++ Standards Meeting Finalizes Feature-Complete Draft Standard

Monday, November 10, 2008

In late September, Google hosted the 44th meeting of the ISO C++ Standards Committee in San Francisco, California. Approximately 50 members from seven countries met six days non-stop to nail down details of the new standard.

The new standard, "C++0x", will be a major upgrade to the language—the first major upgrade since C++ first became an International Standard in 1998. It will include support for concurrent programming, better abstraction power and efficiency, simpler programming, enhanced functional programming, upgraded generic programming, optional garbage collection, significant new library components (including TR1), and many other additions and cleanups. C++0x will still be recognizably the same language as today's C++, and it will be almost 100% compatible, but working programmers will find the new standard a much improved tool for serious application development.

All of the features in C++0x have been on the table for years, but this meeting was the one when the committee finally voted to commit to them. Among the long-awaited features added at this meeting were user-defined literals, symbol attributes, simplified iteration with Python-like for loops, library thread safety, and improved generic programming with "concepts".

This was an unusually busy meeting, and it achieved a major milestone: this is the meeting where the committee voted to advance the draft standard to Committee Draft. Or, in less bureaucratic language, we've shipped our beta. The language is now feature complete. The committee will still fix some bugs before the final version is officially released in 2010, and some features might get tweaked or even dropped, but you shouldn't expect major changes. Interested programmers can try the partial g++ implementation.

SWIG's First Google Summer of Code

Friday, November 7, 2008

SWIG is a programmers tool for semi-automating the calls to C or C++ code from almost any other programming language. The idea is to feed C/C++ header files into SWIG and SWIG then generates the 'glue' code so that your C/C++ library can be used from another language such as Python, Java, C#, Ruby, Perl etc. In fact there are implementations for supporting over 20 different of these target languages. The participating students have had a productive summer and have extended the number of languages and features supported in SWIG's first Google Summer of Code™.

Haoyu Bai has added support for the upcoming Python 3 release. Python is the most popular target language amongst SWIG users and no doubt this addition will be much appreciated by those who are thinking of upgrading to Python 3. Also Haoyu has provided new Python 3 features which make coding faster and simpler when using Python extension code. The main features added are function annotations, buffer interfaces and abstract base classes and are outlined in more detail here: Python 3 Support.

Jan Jezabek has added a new 'language' module providing Windows Component Object Model (COM) support. This new module makes it possible for any COM enabled language to easily call into C or C++ libraries. The COM module in SWIG is more powerful than most as it ultimately provides support for more than one language as there are numerous languages that can call into COM libraries. Compiled languages such as Visual Basic and scripting languages, such as JScript, VBA and VBScript that can run on the Windows Scripting Host are probably the most popular to benefit. A great use will be the ease of making C/C++ libraries available in applications supporting the various Basic dialects, such as and Microsoft Office. SWIG makes it easy to utilise more advanced C++ code, such as templates, and the COM module is no different here as Jan has added in very comprehensive coverage of the C and C++ languages, full details here: SWIG COM Module.

Maciej Drwal has added a module for calling C++ code from C code. It is now possible to automatically create a flattened API of C++ classes so that the C++ functionality is available in the form of easy to use C structs and global functions. For example, features such as C++ template classes / functions are easily callable from C. One cool part of this project is the graceful handling of C++ exceptions in the calling C code. Some introductory documentation is available here: SWIG C Module.

Cheryl Foil has added an interesting feature to improve code documentation in the target language. This works when C/C++ code is documented using the industry standard Doxygen tool for annotating methods, classes, variables etc. The new feature extracts the Doxygen comments from the code for use by one of the many target languages. Cheryl has added initial support for Java so that the Doxygen comments are turned into JavaDoc comments embedded into the generated Java wrappers, see Doxygen support in SWIG for details.

Lastly, a great big thanks to the other mentors involved in making this happen, Ian Appru, Olly Betts, and Richard Boulton and finally to Google for funding a great programme.

OpenCog and GSoC

Thursday, November 6, 2008

This summer OpenCog was chosen by Google to participate in the Google Summer of Code™ project: Google funded 11 students from around the world to work under the supervision of experienced mentors associated with the OpenCog project, and the associated OpenBiomind project.

OpenCog is a large AI software project with hugely ambitious goals (you can't get much more ambitious than "creating powerful AI at the human level and beyond") and a lot of "moving parts" -- and the most successful OpenCog GSoC projects seemed to be the ones that successfully split off "summer sized chunks" from the whole project, which were meaningful and important in themselves, and yet also formed part of the larger OpenCog endeavor ... moving toward greater and greater general intelligence.

Many of the GSoC projects were outstanding but perhaps the most dramatically successful (in my own personal view) was Filip Maric's project (mentored by Predrag Janicic) which involved pioneering an entirely new approach to natural language parsing technology. The core parsing algorithm of the link parser, a popular open-source English parser (that is used within OpenCog's RelEx language processing subsystem), was replaced with a novel parsing algorithm based on a Boolean satisfaction solver: and the good news is, it actually works ... getting the best parses of a sentence faster than the old, standard parsing algorithm; and, most importantly, providing excellent avenues for future integration of NL parsing with semantic analysis and other aspects of language-utilizing AI systems. This work was very successful but needs a couple more months effort to be fully wrapped up and Filip, after a brief break, has resumed working on it recently and will continue throughout November and December.

Cesar Maracondes, working with Joel Pitt, made a lot of progress on porting the code of the Probabilistic Logic Networks (PLN) probabilistic reasoning system from a proprietary codebase to the open-source OpenCog codebase, resolving numerous software design issues along the way. This work was very important as PLN is a key aspect of OpenCog's long-term AI plans. Along the way Cesar helped with porting OpenCog to MacOS.

There were also two extremely successful projects involving OpenBiomind, a sister project to OpenCog:

* Bhavesh Sanghvi (working with Murilo Queiroz) designed and implemented a Java user interface to the OpenBiomind bioinformatics toolkit, an important step which should greatly increase the appeal of the toolkit within the biological community (not all biologists are willing to use command-line tools, no matter how powerful)
* Paul Cao (working with Lucio Coelho) implemented a new machine learning technique within OpenBiomind, in which recursive feature selection is combined with OpenBiomind's novel "model ensemble based important features analysis." The empirical results on real bio datasets seem good. This is novel scientific research embodied in working open-source code, and should be a real asset to scientists doing biological data analysis.

And the list goes on and on: in this short post I can't come close to doing justice to all that was done, but please see our site for more details!

All in all, we are very grateful to Google for creating the GSoC program and including us in it. Thanks to Google, and most of all to the students and mentors involved.

Nmap's Fourth GSoC: Success Stories and Lessons Learned

Wednesday, November 5, 2008

The Nmap Security Scanner Project was honored to participate in our fourth Google Summer of Code(tm)! The pencils-down date was two months ago, but so much code was produced that we're just now finishing the integration process. I finally have time to reflect on these last four years, what GSoC has brought us, and the lessons it has taught us.

In 2005 (detailed writeup), 70% (7 out of 10) students succeeded, and they tackled some wonderful projects! This year we begin work on our new Zenmap GUI (then named Umit), Ncat network communication utility, and 2nd generation OS detection system. Doug Hoyte first made major contributions that summer, and continues helping to this day. I was the mentor for all 10 students, and I had them all send me patches rather than providing SVN access. Nmap didn't even have a public SVN tree back then.

In 2006 (full writeup), I had a better idea of what works and what doesn't and was able to improve the success rate to 80% (8 out of 10). Perhaps the most exciting project was the Nmap Scripting Engine (NSE), which has become one of Nmap's most compelling features. It allows users to write (and share) simple scripts to automate a wide variety of networking tasks. We also finished and integrated the 2nd generation OS detection system, and Zenmap (Umit) continued to improve. I again mentored the students myself without providing SVN access.

In 2007 (full writeup), our success rate grew again to 83% (5 of 6)! I attribute part of the success to me being less of a control freak. For example, I took only 4 students compared to 10 the previous year. The remaining two 2006 students were mentored by Diman Todorov, who created NSE as a 2006 SoC student. I also made the Nmap SVN server public and provided commit access to the students. This year we formally integrated Zenmap into the Nmap build system and packages, making massive improvements along the way. This Summer also introduced David Fifield to the Nmap project and was the first SoC for Kris Katterjohn. Both of them have been prolific developers ever since then.

Enough with the history—let's take a look at our 2008 results! I'm happy to report that we had an 86% (6 out of 7) success rate. In other words, our success rate has increased every single year! I like to credit improved processes and interaction based on what we've learned before, but it also helps that we invite the best students back in later years. We've never had a 2nd year (or more) student fail. This year we expanded to three mentors, all of whom (except for me) were former SoC students. Now let's look in detail at our 2008 SoC accomplishments:

  • Vladimir Mitrovic spent the summer improving the Zenmap GUI, under David Fifield's expert mentorship. They made huge usability and stability improvements, but the pinnacle of their summer achievement was clearly the scan aggregation and topology features! Scan aggregation allows you to conduct multiple scans at different times and add them seamlessly to your existing results. Topology draws a beautiful interactive diagram like this of the discovered network:

  • Jurand Nogiec also worked with David on Zenmap, and was responsible for many key UI improvements which now seem obvious in hindsight. For example, he added a cancel button for aborting a scan in progress without clearing the Nmap output, and he added context-sensitive help to the many dozens of options in the Profile Editor. He also made numerous improvements to the command entry field for people who like to type Nmap command directly, while still benefiting from Zenmap's visual and searchable presentation of results.

  • Patrick Donnelly made substantial NSE infrastructure improvements. He added mutex support and an NSE Standard Library, fixed some serious bugs, and rewrote and optimized a substantial amount of code (particularly the nse_init system). But his crowning accomplishment was the NSEDoc system, which uses special comments and variables in script and library code to generate a comprehensive documentation portal.

  • Kris Katterjohn, who already had hundreds of useful Nmap patches to his name, returned for 2008 to write hundreds more! There is no way I can list everything he did here, particularly as his contributions ranged all over the map from writing NSE libraries (such as the username/password module and the standardized communication library) to improving Windows support (adding IPv6 and OpenSSL). His biggest project has been finishing up Ncat, our advanced Netcat replacement (which began as a 2005 SoC project by Chris Gibson). Ncat is now integrated with Nmap in our latest SVN revision.

  • Michael Pattrick was David's third student, and he accomplished a wide variety of tasks. For example, he created a new OSAssist application for testing and integrating the thousands of Nmap OS detection submissions sent in by Nmap users all over the world. With OSAssist, integration is more accurate and much less tedious. Michael also built two prototypes (one in Perl and then another in C++) for an Ndiff application which compares two or more scan output files and prints out any changes. The prototypes proved so popular that David wrote a final version in Python which is now integrated with Nmap in our latest SVN revision.

  • Philip Pickering spent the summer working on NSE scripts and libraries. We've already incorporated his libraries for binary data manipulation, DNS queries, Base64 encoding, SNMP, POP3, and cryptographic hashes. We've also incorporated several scripts he wrote utilizing these new libraries.

In addition to these core Nmap projects, 5 students were sponsored to work on the UMIT Nmap GUI (now a separate project led by Adriano Marques). Four of their five students passed, as described here.

Please join me in congratulating all these students for their excellent work! I'm particularly pleased that many of the SoC students have continued contributing even though the summer has ended. I'm looking forward to GSoC 2009 (assuming it is held again and they invite us), but 2008 will be a tough year to top!

GitTogether '08

Tuesday, November 4, 2008

Last week Google played host to the first Git developer conference at its Mountain View headquarters. The 3-day conference was well attended, with almost 25 major contributors and users coming out to discuss the past and future of the Git distributed version control system.

Several major topics were presented, leading to some highly interesting new topics starting on the Git mailing list. A true Git library is now being planned, to provide native bindings into scripting languages such as Perl and Python. Major user interface improvements to git send-email and the overall user experience were also introduced and are well under way. A Google Tech Talk, Contributing With Git, was also given by Johannes Schindelin, and is now available to the public on YouTube.

More details about the sessions, including slides and notes, are available on the git wiki.

A big thanks to Google for supporting open source projects by offering meeting space for the conference attendees.

Gerrit and Repo, the Android Source Management Tools

Monday, November 3, 2008

A couple weeks ago, we announced the Android open source release. Beside it, we silently released the tools that we wrote in order to make handling a large multi-repository project manageable in git. If you had a chance to look through the Android open source website, you'll notice references to a tool called repo. Why did we write this? With approximately 8.5 million lines of code (not including things like the Linux Kernel!), keeping this all in one git tree would've been problematic for a few reasons:

* We want to delineate access control based on location in the tree.
* We want to be able to make some components replaceable at a later date.
* We needed trivial overlays for OEMs and other projects who either aren't ready or aren't able to embrace open source.
* We don't want our most technical people to spend their time as patch monkeys.

The repo tool uses an XML-based manifest file describing where the upstream repositories are, and how to merge them into a single working checkout. repo will recurse across all the git subtrees and handle uploads, pulls, and other needed items. repo has built-in knowledge of topic branches and makes working with them an essential part of the workflow.

The gerrit code review tool is based off of rietveld. Gerrit is itself split into two components: Half that runs on Google App Engine to provide front-end web service, and half that runs on a machine to handle attempted merges into the "upstream" branch, and the various code review branches. When we integrate the auto-builders into the system, that will also be handled by Gerrit.

We have a workflow diagram that shows how code gets into the system for Android. If you're looking to switch to git, but don't want to lose the ability for multiple people to commit into an upstream tree, this is one solution for you to consider. Interested? Find us at