Training a Toy Elephant with Google Summer of Code

Monday, March 22, 2010

Google has redefined many things. It has redefined scalability. When the entire world was racing towards high performance computing, Google came up with MapReduce and the Google File System, which allowed it to process the whole web in a matter of hours across thousands of cheap computers. Through its educational material on MapReduce and its contributions to Open Source in the form of code, infrastructure, and innovative initiatives like the Google Summer of Code™, Google has taken openness to a whole new level. Through its initiatives, Google also allows you to export your private data outside Google's servers. Google also liberates public user data like the MapMaker annotations, which were exported within hours of the Chilean earthquake. When you are inspired by the technology, the data liberation, and Open Source, and you have two amazing years in Google Summer of Code, you end up with a great open tool like Apache Mahout. I talked about the different algorithms in Mahout and was thrilled by the enthusiasm of the students there.

Mahout is an Apache Software Foundation project that aims to create scalable machine-learning libraries using a variety of techniques, including leveraging Apache Hadoop. Unlike other Open Source machine-learning libraries, Mahout was built with one thing in mind: the ability to scale over large data sets. We are not talking about the whole Internet here, just a small fraction of it, but one large enough that processing it is nearly impossible on one machine. Mahout is becoming more and more relevant in a world where gigabytes and terabytes of data are coming into the hands of the public. The latest release of Mahout has solid and scalable implementations of recommendation, clustering, classification, pattern mining, and genetic algorithms.

Two years ago, I had a chance to join the project along with Deneche Abdel Hakim and David Hall when we were selected for the Google Summer of Code program. With help from our mentors and other committers on Mahout, we were able to contribute a lot of algorithms to the project. After two amazing years in Google Summer of Code and on the verge of the third one, the project looks like it's about to break free. We have more contributors coming in, more algorithms, and improvements in quality and performance. Mahout is also being made a top-level project under Apache. The latest release of Mahout contains the Colt high performance collections, which have given a great boost to the performance of the core data structures. Mahout can create vectors from all of the articles in the English Wikipedia in under an hour on an 8-node Hadoop cluster. This is just the beginning: more interesting things are being planned for future releases, and I see a big role for Summer of Code students in them. Mahout is a great platform for students and professors in universities to use in their machine-learning research to get results quickly on large data sets.

Recently, I went to the India Hadoop Summit in Bangalore, India to help spread awareness of Apache Mahout and Google Summer of Code. I had the good fortune of presenting Mahout at the un-conference to a big group of cloud computing lovers from India.

Many people, including cloud computing adopters and students, were hearing about the Google Summer of Code program for the very first time, and I am happy that I helped spread awareness of it.

Mahout has grown, and so have I: from a Google Summer of Code student to a committer at Mahout, to a Googler, and hopefully to being a mentor this year. I am also co-authoring a book on Mahout with Manning Publications. Google Summer of Code has opened up many doors for me. It helped me hone my coding skills, get in touch with cutting-edge research work, and find great peers in the Open Source community whose help I will always cherish. Many thanks to Google and the Google Summer of Code program for giving me this opportunity, for helping thousands of students and hundreds of Open Source projects worldwide, and for ensuring that the world and its information stay open.

You can find out more about the Mahout project and the usage of its various algorithms on the Mahout wiki. If you are a student interested in implementing a data-mining or machine-learning algorithm, Mahout is the right place to be this summer. Take a look at our GSoC project ideas here and please come and discuss your proposal with us on the Mahout mailing list.

Pictures courtesy of Dave Nielsen, Co-Founder, CloudCamp