We have all watched with excitement as Google unveiled the Google Knowledge Graph, giving insight into answers for questions that we never thought to ask. Similar "knowledge graph" initiatives from researchers in academia and industry have been underway to develop a global graph of Linked Data, where structured data on the Web is directly available for programmatic access in standard ways.
One of the most prominent Linked Data sources is DBpedia, a data set built by sharing (as structured data) facts extracted from Wikipedia. DBpedia has been serving as a nucleus for this evolving Web of Linked Data, connecting cross-domain information from numerous data sources on the Web, including Freebase.com and, by transitivity, the Google Knowledge Graph.
DBpedia Spotlight is a tool for connecting this new Web of structured information to the good old Web of documents. It takes plain text (or HTML) as input, and looks for 3.8M things of 360 different types, interconnecting structured data in 111 different languages in DBpedia. The output is a set of links where ambiguous phrases such as "Washington" are automatically "disambiguated" to their unambiguous identifiers (URIs), for example Washington, D.C. or George Washington.
During Google Summer of Code 2012, we had the pleasure and honor to work with four students to improve DBpedia Spotlight's speed and accuracy and to add new functionality.
The core model we use for automatic disambiguation is based on a large vector space model of words. One student project, by Chris Hokamp, included processing all the data on Hadoop, as well as analyzing the dimensions of this model using techniques such as Latent Semantic Analysis and Explicit Semantic Analysis.
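To give a feel for what a vector space model of words means for disambiguation, here is a minimal sketch in Python. It is not DBpedia Spotlight's actual code: the candidate word counts, the `cosine` and `disambiguate` helpers, and the identifiers are all made up for illustration. The idea is simply that each candidate is represented by the words that tend to appear around it, and the candidate whose vector is closest to the input text wins.

```python
import math
from collections import Counter

# Hypothetical toy data: words assumed to co-occur with each candidate entity.
# A real model would be learned from Wikipedia and be vastly larger.
CANDIDATES = {
    "Washington,_D.C.": Counter({"capital": 3, "city": 3, "congress": 2, "museum": 1}),
    "George_Washington": Counter({"president": 3, "general": 2, "army": 2, "born": 1}),
}


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context, candidates=CANDIDATES):
    """Return the best-scoring candidate and all scores for the given context text."""
    words = Counter(context.lower().split())
    scores = {uri: cosine(words, vec) for uri, vec in candidates.items()}
    return max(scores, key=scores.get), scores
```

Calling `disambiguate("Washington was the first president and a general of the army")` picks the `George_Washington` entry, because only that candidate shares words with the sentence. Techniques such as Latent Semantic Analysis would then operate on the dimensions of vectors like these rather than on the raw word counts.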
Joachim Daiber implemented a probabilistic interpretation of the disambiguation model and provided a key-value store implementation that allows for efficiency and flexibility in modifying the scoring techniques.
Dirk Weissenborn spent his summer developing topical classification in our model and live updating/training of the models as Wikipedia changes (or news items are released) so that DBpedia Spotlight can be kept up to date with the world, as soon as events happen.
Finally, the fourth project, by Liu Zhengzhong, provided an implementation of collective disambiguation. In this approach, each of the things found in the input text contributes to finding the meaning of the other things in the same text through graph algorithms that benefit from the structure of our knowledge base.
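The intuition behind collective disambiguation can be shown with a small Python sketch. This is an illustration under our own assumptions, not the algorithm implemented in the project: the links, mentions, and function names are hypothetical, and a brute-force search over all combinations stands in for the real graph algorithms. The goal is to pick one candidate per mention so that the chosen entities are as well connected to each other as possible.

```python
from itertools import product

# Hypothetical undirected links between entities in a toy knowledge base.
LINKS = {
    frozenset({"George_Washington", "Continental_Army"}),
    frozenset({"George_Washington", "Mount_Vernon"}),
    frozenset({"Washington,_D.C.", "United_States_Capitol"}),
}

# Hypothetical candidate entities for each phrase spotted in a text.
MENTIONS = {
    "Washington": ["Washington,_D.C.", "George_Washington"],
    "Mount Vernon": ["Mount_Vernon"],
    "army": ["Continental_Army"],
}


def coherence(entities, links=LINKS):
    """Count how many pairs of the chosen entities are linked to each other."""
    return sum(
        1
        for i, a in enumerate(entities)
        for b in entities[i + 1:]
        if frozenset({a, b}) in links
    )


def collective_disambiguate(mentions, links=LINKS):
    """Try every combination of candidates and keep the most coherent one."""
    surface_forms = list(mentions)
    best = max(
        product(*(mentions[s] for s in surface_forms)),
        key=lambda choice: coherence(choice, links),
    )
    return dict(zip(surface_forms, best))
```

With this toy data, `collective_disambiguate(MENTIONS)` resolves "Washington" to `George_Washington`, because that reading is linked to both of the other entities in the text. Exhaustive search only works for a handful of mentions, which is one reason a real system relies on graph algorithms instead.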
Together, these four projects will greatly enhance DBpedia Spotlight towards achieving its objective of serving as a flexible tool that can cater to many different applications interested in connecting documents to structured data. By the way, through links between DBpedia and Freebase you can use DBpedia Spotlight to obtain and use links from Web documents to the Google Knowledge Graph. How exciting is that?
By Pablo Mendes and Max Jakob, DBpedia Spotlight co-creators and Google Summer of Code 2012 Organization Administrators
Sigmah is a free software project for the integrated management of humanitarian projects, run by an open group of eleven NGOs facilitated by Groupe URD. Sigmah was created following a needs assessment carried out in 2008-2009, commissioned by a group of French NGOs who, like many, were suffering from infoxication (information overload).
Sigmah has continued to grow and in 2012, through the Google Summer of Code, some of its goals are going to be met:
Sigmah v1.0, released in June 2011, was solely a solution to enter and structure your data. With the highly skillful help of Google Summer of Code student Sherzod Muratov, we will have a new feature, as part of Sigmah’s core, to export data in spreadsheet format (.xls/.ods). With this increased capability to analyze all information collected in Sigmah, humanitarian workers will be able to more easily learn lessons from their experiences and improve the quality of their work.
The Sigmah project is young and its community continues to grow. The website needed to be improved in many ways. Sharada Mohanty has tackled a couple of Sigmah’s immediate needs: improving tools for inner governance and deploying the means to enforce a community-driven culture for the user guide.
With all of this work, our project is getting stronger by responding to the needs of our users, and our community is attracting more users to take part in the project. Icing on the cake: both our students have expressed interest in continuing to contribute to our young project, which aims to make humanitarian project management easier. For Sigmah the Google Summer of Code has been the best part of 2012!
By Olivier Sarrat, Sigmah project facilitator
Twitter is a simple real-time information network where the unit of currency is 140-character messages called Tweets. Twitter connects you to the latest stories, ideas, opinions and news about what you find interesting. To run this service, we produce and consume a lot of open source software. Last year, we established our Open Source Office (@TwitterOSS) to support a variety of open source organizations that are important to us. We’re grateful to the open source community for their contributions, and want to maintain a healthy, reciprocal relationship.
We were thrilled to have a chance to participate in Google Summer of Code this year. We had three students work on a variety of projects:
Federico Brubacher spent time adding machine learning capabilities to Twitter Storm.
Kirill Lashuk added more internationalization and localization capabilities to Ruby via the TwitterCLDR project. This should help anyone in the Ruby community who needs to provide robust internationalization support for their application.
Ruben Oanta worked on adding MySQL codec support to Finagle, a network stack for the JVM that provides a set of protocol-independent tools.
What was unique this year for us is that we also worked with Blake Matheny from Tumblr on mentoring the Finagle MySQL project. From my perspective, it’s great to see multiple companies helping students get involved with open source. Thanks again to Google for providing a medium to do so.
By Chris Aniszczyk (@cra), Manager of Open Source at Twitter