Google Summer of Code wrap-up: Apache Flink (previously Stratosphere)

Friday, September 26, 2014

We continue our Friday Google Summer of Code wrap-up series with Apache Flink (previously Stratosphere), which was a first-time participant in the program. Organization Administrator Robert Metzger talks below about their two successful student participants as well as their project’s transition to the Apache Software Foundation incubator program.
Apache Flink is a system for expressive, declarative, fast, and efficient data analysis. Flink combines the scalability and programming flexibility of distributed MapReduce-like platforms with the efficiency, out-of-core execution, and query optimization capabilities found in parallel databases.

We were accepted to this year’s Google Summer of Code (GSoC) under our former project name, “Stratosphere”. During the summer, however, our project entered the incubator of the Apache Software Foundation (ASF). Incubation is the process through which new projects join the ASF. As part of that process, our project name was changed from Stratosphere to Flink.

Our move to the ASF also meant quite a few changes for us and our students during the course of their projects. Mentors and students learned the new processes required by the ASF together, and in the end the transition worked out quite well for everyone involved.

The acceptance of our project into GSoC was a huge, exciting accomplishment for all of the Flink / Stratosphere developers, and especially thrilling for a first-time organization. We had two students this summer: Artem Tsikiridis and Frank Wu.

Artem worked on a full Hadoop MapReduce compatibility layer for Flink. Both Hadoop and Flink are distributed systems for processing huge amounts of data. Hadoop is an open source implementation of the MapReduce programming model published by Google. It is widely used for a broad range of data-intensive computing applications. Flink offers a broad range of operators and can be used to execute MapReduce-style applications.
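
To make "MapReduce-style" concrete, here is a minimal word-count sketch in plain Python. It uses neither Hadoop's nor Flink's API and runs on a single machine; the names `map_words`, `reduce_counts`, and `run_map_reduce` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum all the counts collected for one word."""
    return word, sum(counts)


def run_map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterable[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Illustrative single-process driver (an assumption, not a real engine):
    apply the mapper, group values by key, then apply the reducer per key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_map_reduce(["to be or not to be"], map_words, reduce_counts))
```

In a distributed system such as Hadoop or Flink, the map and reduce functions keep this shape, while the engine takes care of partitioning the input and grouping the keys across many machines.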

Artem’s summer project concerned the implementation of a compatibility layer that exposes exactly the same APIs as Apache Hadoop. This feature allows existing Hadoop users to run their Hadoop jobs with Flink. Consequently, users are now able to utilize a faster execution engine for their existing code! Artem worked closely with the community and succeeded in bringing his changes into our main code line. His work will be available with the 0.7-incubating release of Apache Flink.

Frank Wu, our second GSoC student, worked on a large sub-project of Flink called Support for Streaming (Stratosphere Streaming). Frank initiated the development of the mini-batch processing API of Stratosphere Streaming, enabling operations on windows of tuples. Additionally, he contributed to both the iterative and stateful streaming solutions, two of the most challenging applications of streaming. Frank also provided numerous code examples for the topics he was working on. Like Artem’s, Frank’s work will be available with the 0.7-incubating release of Apache Flink.
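
As a rough illustration of what "operations on windows of tuples" can mean, the sketch below groups a stream into fixed-size, non-overlapping mini-batches and applies an aggregate to each one. It is plain Python, not the Stratosphere Streaming API; the function names `tumbling_windows` and `window_sums` are hypothetical, and a count-based window is only one possible windowing policy assumed here.

```python
from typing import Iterable, Iterator, List, Tuple


def tumbling_windows(
    stream: Iterable[Tuple[str, int]], size: int
) -> Iterator[List[Tuple[str, int]]]:
    """Split a stream of tuples into consecutive windows of `size` elements.
    The last window may be smaller if the stream ends early."""
    if size <= 0:
        raise ValueError("window size must be positive")
    window: List[Tuple[str, int]] = []
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield window
            window = []  # start a fresh window
    if window:
        yield window  # emit the trailing partial window


def window_sums(stream: Iterable[Tuple[str, int]], size: int) -> List[int]:
    """Example window operation: sum the numeric field of each window."""
    return [sum(value for _, value in w) for w in tumbling_windows(stream, size)]


if __name__ == "__main__":
    readings = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]
    print(window_sums(readings, 2))
```

Because `tumbling_windows` is a generator, it consumes the input lazily, which mirrors how a streaming system processes tuples as they arrive rather than waiting for the whole data set.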

I would like to thank the mentors, Fabian Hüske and Marton Balassi, as well as our second organization administrator, Ufuk Celebi, for their help with Stratosphere/Flink’s GSoC participation in the summer of 2014.

By Robert Metzger, Organization Administrator for Apache Flink