Google Open Source Blog: November 2017


Google Summer of Code 2017 Mentor Summit

Thursday, November 30, 2017

This year Google brought over 320 mentors from all over the world (33 countries!) to Google's offices in Sunnyvale, California for the 2017 Google Summer of Code Mentor Summit. This year 149 organizations were represented, which provided the perfect opportunity to meet like-minded open source enthusiasts and discuss ways to make open source better and more sustainable.

Group photo by Dmitry Levin used under a CC BY-SA 4.0 license.

The Mentor Summit is run as an unconference in which attendees create and join sessions based on their interests. “I liked the unconference sessions, that they were casual and discussion based and I got a lot out of them. It was the place I connected with the most people,” said Cassie Tarakajian, attending on behalf of the Processing Foundation.

Attendees quickly filled the schedule boards with interesting sessions. One theme in this year’s session schedule was the challenging topic of failing students. Derk Ruitenbeek, part of the phpBB contingent, had this to say:

“This year our organisation had a high failure rate of 3 out of 5 accepted students. During the Mentor Summit I attended multiple sessions about failing students and rating proposals and got a lot [of] useful tips. Talking with other mentors about this really helped me find ways to improve student selection for our organisation next time.”

This year was the largest Mentor Summit ever – with the exception of our 10 Year Reunion in 2014 – and had the best gender diversity yet. Katarina Behrens, a mentor who worked with LibreOffice, observed:

“I was pleased to see many more women at the summit than last time I participated. I'm also beyond happy that now not only women themselves, but also men engage in increasing (not only gender) diversity of their projects and teams.”

We've held the Mentor Summit for the past 10+ years as a way to meet some of the thousands of mentors whose generous work for the students makes the program successful, and to give some of them and the projects they represent a chance to meet. This year was the first Mentor Summit for 52% of the attendees, giving us a lot of fresh perspectives to learn from!

We love hosting the Mentor Summit and attendees enjoy it, as well, especially the opportunity to meet each other. In fact, some attendees met in person for the first time at the Mentor Summit after years of collaborating remotely! According to Aveek Basu, who mentored for The Linux Foundation, the event was an excellent opportunity for “networking with like minded people from different communities. Also it was nice to know about people working in different fields from bioinformatics to robotics, and not only hard core computer science.”

You can browse the event website and read through some of the session notes that attendees took to learn a bit more about this year’s Mentor Summit.

Now that Google Summer of Code 2017 and the Mentor Summit have come to a close, our team is busy gearing up for the 2018 program. We hope to see you then!

By Maria Webb, Google Open Source

Google Code-in contest for teenagers starts today!

Tuesday, November 28, 2017

Today marks the start of the 8th consecutive year of Google Code-in (GCI). It’s the biggest contest ever and we hope you’ll come along for the ride!

The Basics

What is Google Code-in?

Our global, online contest introducing students to open source development. The contest runs for 7 weeks until January 17, 2018.

Who can register?

Pre-university students ages 13-17 who have their parent or guardian’s permission to register for the contest.

How do students register?

Students can register for the contest beginning today at g.co/gci. Once students have registered and the parental consent form has been submitted, students can choose which task they want to work on first. Students choose the task they find interesting from a list of hundreds of available tasks created by 25 participating open source organizations. Tasks take an average of 3-5 hours to complete. The task categories are:

Coding
Documentation/Training
Outreach/Research
Quality Assurance
User Interface

Why should students participate?

Students not only have the opportunity to work on a real open source software project, thus gaining invaluable experience, but they also have the opportunity to be a part of the open source community. Mentors are readily available to help answer their questions while they work through the tasks.

Google Code-in is a contest so there are prizes! Complete one task and receive a digital certificate. Complete three tasks and you’ll also get a fun Google t-shirt. Finalists get a hoodie. Grand Prize winners receive an all-expenses-paid trip to Google headquarters in California!

Details

Over the last 7 years, more than 4,500 students from 99 countries have successfully completed over 23,000 tasks in GCI. Intrigued? Learn more about GCI by checking out our rules and FAQs. And please visit our contest site and read the Getting Started Guide.

Teachers, if you are interested in getting your students involved in Google Code-in we have resources available to help you get started.

By Stephanie Taylor, Google Open Source

Adopting a Community-Oriented Approach to Open Source License Compliance

Monday, November 27, 2017

Today Google joins Red Hat, Facebook, and IBM alongside the Linux Kernel Community in increasing the predictability of open source license compliance and enforcement.

We are taking an approach to compliance enforcement that is consistent with the Principles of Community-Oriented GPL Enforcement. We hope that this will encourage greater collaboration on open source projects, and foster discussion on how we can all continue to work closely together.

You can learn more about today’s announcement in Red Hat’s press release and in our GPL Enforcement Statement.

By Chris DiBona, Director of Open Source

Introducing container-diff, a tool for quickly comparing container images

Thursday, November 16, 2017

The Google Container Tools team originally built container-diff, a new project to help uncover differences between container images, to aid our own development with containers. We think it can be useful for anyone building containerized software, so we’re excited to release it as open source to the development community.

Containers and the Dockerfile format help make customization of an application’s runtime environment more approachable and easier to understand. While this is a great advantage of using containers in software development, a major drawback is that it can be hard to visualize what changes in a container image will result from a change in the respective Dockerfile. This can lead to bloated images and make tracking down issues difficult.

Imagine a scenario where a developer is working on an application built on a runtime image maintained by a third party. During development someone releases a new version of that base image with updated system packages. The developer rebuilds their application and picks up the latest version of the base image, and suddenly their application stops working; it depended on a previous version of one of the installed system packages, but which one? What version was it on before? With no existing tool to easily determine what changed between the two base image versions, this totally stalls development until the developer can track down the package version incompatibility.

Introducing container-diff

container-diff helps users investigate image changes by computing semantic diffs between images. This means that container-diff figures out at a low level what data changed, and then combines this with an understanding of package manager information to present the result in a format that’s actually readable to users. The tool can find differences in system packages, language-level packages, and files in a container image.
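To make the idea of a package-level diff concrete, here is a minimal Python sketch. It is not container-diff's implementation: the `diff_packages` helper and the package names and versions are made up for illustration, and it assumes each image's package list has already been extracted into a name-to-version mapping.

```python
def diff_packages(old, new):
    """Compare two {package name: version} mappings (hypothetical helper)."""
    added = {name: ver for name, ver in new.items() if name not in old}
    removed = {name: ver for name, ver in old.items() if name not in new}
    changed = {
        name: (old[name], new[name])
        for name in old.keys() & new.keys()
        if old[name] != new[name]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Made-up package data standing in for two versions of a base image.
base_v1 = {"pkg-a": "1.0", "pkg-b": "2.3", "pkg-c": "0.9"}
base_v2 = {"pkg-a": "1.0", "pkg-b": "2.4", "pkg-d": "5.1"}

report = diff_packages(base_v1, base_v2)
for kind, entries in report.items():
    print(kind, entries)
```

Reporting "pkg-b went from 2.3 to 2.4" rather than a list of changed bytes is what makes such a diff useful when tracking down a version incompatibility.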

Users can specify images in several formats: from a local Docker daemon (using the prefix `daemon://` on the image path), from a remote registry (using the prefix `remote://`), or as a .tar file in the format exported by the "docker save" command. You can also combine these formats to compute the diff between a local version of an image and a remote version. This can be useful when experimenting with new builds of an image that you might not be quite ready to push yet. container-diff supports image tarballs and the registry protocol natively, enabling it to run in environments without a Docker daemon.
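The prefix convention above can be sketched in a few lines of Python. This is only an illustration of how such references might be told apart, not the tool's own parsing logic; `classify_image_source`, the example image names, and the "unspecified" fallback for references without a prefix are all assumptions.

```python
# Prefixes taken from the text above; everything else here is hypothetical.
PREFIXES = {
    "daemon://": "local Docker daemon",
    "remote://": "remote registry",
}


def classify_image_source(reference):
    """Return (source kind, image path) for an image reference string."""
    for prefix, source in PREFIXES.items():
        if reference.startswith(prefix):
            return source, reference[len(prefix):]
    if reference.endswith(".tar"):
        return "tarball", reference
    # The text does not say how un-prefixed references are handled.
    return "unspecified", reference


for ref in ["daemon://example/app:dev", "remote://example/app:1.0", "app.tar"]:
    print(ref, "->", classify_image_source(ref))
```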

Examples and Use Cases

Consider a basic Dockerfile that installs Python inside our Debian base image. By running container-diff on the base image and the new one with Python, users can see all the apt packages that were installed as dependencies of Python.

Next, consider a Dockerfile that inherits from our Python base runtime image and then installs the mock and six packages inside of it. By running container-diff with the pip differ, users can see all the Python packages that have been either installed or changed as a result.

This can be especially useful when it’s unclear which packages might have been installed or changed incidentally as a result of dependency management of Python modules.

These are just a few examples. The tool currently has support for Python and Node.js packages installed via pip and npm, respectively, as well as comparison of image filesystems and Docker history. In the future, we’d like to see support added for additional runtime and language differs, including Java, Go, and Ruby. External contributions are welcome! For more information on contributing to container-diff, see this how-to guide.

Now that we’ve seen container-diff compare two images in action, it’s easy to imagine how the tool may be integrated into larger workflows to aid in development:

Changelog generation: Given container-diff’s capacity to facilitate investigation of filesystem and package modifications, it can do most of the heavy lifting in discerning changes for automatic changelog generation for new releases of an image.
Continuous integration: As part of a CI system, users can leverage container-diff to catch potentially breaking filesystem changes resulting from a Dockerfile change in their builds.

container-diff’s default output mode is “human-readable,” but the tool also supports output to JSON, allowing for easy automated parsing and processing.
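As a rough sketch of the changelog and CI ideas above, the following Python turns a machine-readable diff into changelog lines and a pass/fail signal. The JSON shape used here (`added`, `removed`, `changed`) is a simplified, hypothetical structure chosen for the example and is not container-diff's actual output schema; the helper names and package data are likewise made up.

```python
import json


def changelog_lines(diff_json):
    """Render a hypothetical JSON package diff as changelog entries."""
    diff = json.loads(diff_json)
    lines = []
    for name, version in sorted(diff.get("added", {}).items()):
        lines.append(f"Added {name} {version}")
    for name, version in sorted(diff.get("removed", {}).items()):
        lines.append(f"Removed {name} {version}")
    for name, (old, new) in sorted(diff.get("changed", {}).items()):
        lines.append(f"Changed {name}: {old} -> {new}")
    return lines


def should_fail_build(diff_json):
    """CI policy assumed for this sketch: removals or version changes fail."""
    diff = json.loads(diff_json)
    return bool(diff.get("removed") or diff.get("changed"))


# Made-up diff in the assumed shape.
sample = json.dumps({
    "added": {"pkg-d": "5.1"},
    "removed": {},
    "changed": {"pkg-b": ["2.3", "2.4"]},
})

print("\n".join(changelog_lines(sample)))
print("fail build:", should_fail_build(sample))
```

A real integration would read the tool's JSON output and adapt the parsing to whatever structure it actually emits.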

Single Image Analysis

In addition to comparing two images, container-diff has the ability to analyze a single image on its own. This can enable users to get a quick glance at information about an image, such as its system and language-level package installations and filesystem contents.

Let’s take a look at our Debian base image again. We can use the tool to easily view a list of all packages installed in the image, along with each one’s installed version and size.

We could use this to verify compatibility with an application we’re building, or maybe sort the packages by size in another one of our images and see which ones are taking up the most space.
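For instance, once the package list is available as data, finding the largest packages is a short sort. The sketch below assumes the list has already been parsed into records; the `Package` type, the `largest_packages` helper, and all names, versions, and sizes are invented for illustration.

```python
from typing import NamedTuple


class Package(NamedTuple):
    name: str
    version: str
    size_bytes: int


def largest_packages(packages, limit):
    """Return the `limit` biggest packages, largest first."""
    return sorted(packages, key=lambda p: p.size_bytes, reverse=True)[:limit]


# Made-up single-image analysis results.
installed = [
    Package("pkg-a", "1.0", 120_000),
    Package("pkg-b", "2.4", 9_500_000),
    Package("pkg-c", "0.9", 2_300_000),
]

for pkg in largest_packages(installed, limit=2):
    print(f"{pkg.name} {pkg.version}: {pkg.size_bytes / 1e6:.1f} MB")
```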

For more information about this tool, including a breakdown of its uses and inner workings with examples, please take a look at the documentation on our GitHub page. Happy diffing!

Special thanks to Colette Torres and Abby Tisdale, our software engineering interns who helped build the tool from the ground up.

By Nick Kubala, Container Tools team

Tangent: Source-to-Source Debuggable Derivatives

Wednesday, November 8, 2017

Crossposted on the Google Research Blog

Tangent is a new, free, and open source Python library for automatic differentiation. In contrast to existing machine learning libraries, Tangent is a source-to-source system, consuming a Python function f and emitting a new Python function that computes the gradient of f. This allows much better user visibility into gradient computations, as well as easy user-level editing and debugging of gradients. Tangent comes with many more features for debugging and designing machine learning models.

This post gives an overview of the Tangent API. It covers how to use Tangent to generate gradient code in Python that is easy to interpret, debug and modify.

Neural networks (NNs) have led to great advances in machine learning models for images, video, audio, and text. The fundamental abstraction that lets us train NNs to perform well at these tasks is a 30-year-old idea called reverse-mode automatic differentiation (also known as backpropagation), which comprises two passes through the NN. First, we run a “forward pass” to calculate the output value of each node. Then we run a “backward pass” to calculate a series of derivatives to determine how to update the weights to increase the model’s accuracy.
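The two passes can be written out by hand for a very small model. The sketch below is an illustrative example rather than anything from Tangent: it assumes a single neuron with a tanh activation and a squared-error loss, and the starting values and learning rate are arbitrary.

```python
import math


def forward(w, b, x, target):
    """Forward pass: compute the value of each node in turn."""
    z = w * x + b
    y = math.tanh(z)
    loss = (y - target) ** 2
    return z, y, loss


def backward(x, target, y):
    """Backward pass: apply the chain rule from the loss back to w and b."""
    dloss_dy = 2.0 * (y - target)
    dy_dz = 1.0 - y * y          # derivative of tanh(z), written using y
    dloss_dz = dloss_dy * dy_dz
    dloss_dw = dloss_dz * x      # because z = w * x + b
    dloss_db = dloss_dz
    return dloss_dw, dloss_db


w, b, x, target = 0.5, -0.1, 2.0, 1.0
z, y, loss = forward(w, b, x, target)
dw, db = backward(x, target, y)

# One gradient-descent step on the weights, then re-evaluate the loss.
learning_rate = 0.1
w_new, b_new = w - learning_rate * dw, b - learning_rate * db
_, _, new_loss = forward(w_new, b_new, x, target)
print(f"loss before: {loss:.4f}, after one step: {new_loss:.4f}")
```

Note that the backward pass reuses `y`, a value computed during the forward pass, which is why the forward pass has to run first.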

Training NNs, and doing research on novel architectures, requires us to compute these derivatives correctly, efficiently, and easily. We also need to be able to debug these derivatives when our model isn’t training well, or when we’re trying to build something new that we do not yet understand. Automatic differentiation, or just “autodiff,” is a technique to calculate the derivatives of computer programs that denote some mathematical function, and nearly every machine learning library implements it.
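One common way to debug a derivative is to compare it against a finite-difference estimate. The sketch below shows that check on a hand-written example; the function names, the deliberately buggy derivative, and the tolerance are assumptions made for illustration.

```python
def numerical_derivative(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def matches_numerical(df, f, x, tol=1e-4):
    """True if the candidate derivative agrees with the numerical estimate."""
    return abs(df(x) - numerical_derivative(f, x)) < tol


def f(x):
    return x ** 3 + 2.0 * x


def df_correct(x):
    return 3.0 * x ** 2 + 2.0


def df_buggy(x):
    return 3.0 * x ** 2  # the "+ 2.0" term was forgotten


print("correct derivative passes:", matches_numerical(df_correct, f, 1.5))
print("buggy derivative passes:  ", matches_numerical(df_buggy, f, 1.5))
```

A check like this tells you that a derivative is wrong, but not where; being able to read the derivative code itself helps with the second part.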

Existing libraries implement automatic differentiation by tracing a program’s execution (at runtime, like TF Eager, PyTorch and Autograd) or by building a data-flow graph and then differentiating the graph (ahead-of-time, like TensorFlow). In contrast, Tangent performs ahead-of-time autodiff on the Python source code itself, and produces Python source code as its output.
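For contrast, the tracing approach can be sketched with operator overloading: each arithmetic operation records itself on a tape while the program runs, and the tape is replayed backwards afterwards. This is a toy written for this explanation, not how any of the libraries named above are implemented; the `Var` class and `TAPE` list are hypothetical.

```python
TAPE = []  # every Var created during execution, in creation order


class Var:
    """A number that records how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (input Var, local derivative)
        self.grad = 0.0
        TAPE.append(self)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Replay the tape in reverse, accumulating gradients into the inputs."""
    output.grad = 1.0
    for node in reversed(TAPE):
        for parent, local_derivative in node.parents:
            parent.grad += node.grad * local_derivative


x, y = Var(3.0), Var(4.0)
z = x * y + x            # running this line is what builds the tape
backward(z)
print("dz/dx =", x.grad, " dz/dy =", y.grad)
```

The derivative computation here exists only as objects in memory at runtime, so there is no derivative source code to read or edit, which is the gap a source-to-source system aims to fill.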

As a result, you can finally read your automatic derivative code just like the rest of your program. Tangent is useful to researchers and students who not only want to write their models in Python, but also read and debug automatically-generated derivative code without sacrificing speed and flexibility.

You can easily inspect and debug your models written in Tangent, without special tools or indirection. Tangent works on a large and growing subset of Python, provides extra autodiff features other Python ML libraries don’t have, is high-performance, and is compatible with TensorFlow and NumPy.

Automatic differentiation of Python code

How do we automatically generate derivatives of plain Python code? Math functions like tf.exp or tf.log have derivatives, which we can compose to build the backward pass. Similarly, pieces of syntax, such as subroutines, conditionals, and loops, also have backward-pass versions. Tangent contains recipes for generating derivative code for each piece of Python syntax, along with many NumPy and TensorFlow function calls.
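To give a feel for what a backward-pass version of a loop looks like, here is a hand-written sketch for a function that multiplies by `x` repeatedly. It is not output generated by Tangent; the function names and the use of an explicit stack to save intermediate values are choices made for this example.

```python
def power(x, n):
    """Compute x ** n with a loop."""
    y = 1.0
    for _ in range(n):
        y = y * x
    return y


def dpower_dx(x, n):
    """Hand-written derivative of power() with respect to x."""
    # Forward pass: same loop, but save each value of y before it is overwritten.
    stack = []
    y = 1.0
    for _ in range(n):
        stack.append(y)
        y = y * x

    # Backward pass: run the loop body's derivative once per saved value,
    # popping them in the opposite order to the forward pass.
    dy, dx = 1.0, 0.0
    for _ in range(n):
        y_before = stack.pop()
        dx += dy * y_before   # contribution through the x operand
        dy = dy * x           # contribution through the y operand
    return dx


print(power(2.0, 3), dpower_dx(2.0, 3))
```

The backward loop visits the iterations last to first, mirroring the forward loop.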

Tangent has a one-function API:

import tangent
df = tangent.grad(f)

[Animated graphic: what happens when tangent.grad is called on a Python function.]

If you want to print out your derivatives, you can run

import tangent
df = tangent.grad(f, verbose=1)

Under the hood, tangent.grad first grabs the source code of the Python function you pass it. Tangent has a large library of recipes for the derivatives of Python syntax, as well as TensorFlow Eager functions. The function tangent.grad then walks your code in reverse order, looks up the matching backward-pass recipe, and adds it to the end of the derivative function. This reverse-order processing gives the technique its name: reverse-mode automatic differentiation.
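The reverse walk described above can be imitated with a toy generator that emits Python source for a derivative. This is a simplified sketch, not Tangent's code: the recipe tables, the `make_grad_source` helper, and the tuple-based program representation are all invented here, and it assumes a straight-line program in which every variable is assigned exactly once.

```python
import math

# Hypothetical recipes: a forward template and backward templates per operation.
FORWARD = {
    "mul": "{out} = {a} * {b}",
    "sin": "{out} = math.sin({a})",
}
BACKWARD = {
    "mul": ["d{a} += d{out} * {b}", "d{b} += d{out} * {a}"],
    "sin": ["d{a} += d{out} * math.cos({a})"],
}


def make_grad_source(func_name, arg, program, result):
    """Emit source for d(result)/d(arg) of a straight-line program."""
    # Forward pass: the original statements, unchanged.
    body = [FORWARD[op].format(out=out, **operands)
            for out, op, operands in program]
    # One gradient accumulator per variable, with the output seeded to 1.
    variables = [arg] + [out for out, _, _ in program]
    body += [f"d{name} = 0.0" for name in variables]
    body.append(f"d{result} = 1.0")
    # Walk the statements in reverse, appending each matching backward recipe.
    for out, op, operands in reversed(program):
        body += [line.format(out=out, **operands) for line in BACKWARD[op]]
    body.append(f"return d{arg}")
    header = f"def {func_name}({arg}):\n"
    return header + "\n".join("    " + line for line in body) + "\n"


# f(x) = sin(x * x), written as two primitive statements.
program = [
    ("t", "mul", {"a": "x", "b": "x"}),
    ("y", "sin", {"a": "t"}),
]

source = make_grad_source("dfdx", "x", program, "y")
print(source)  # the derivative is ordinary, readable Python

namespace = {"math": math}
exec(source, namespace)
dfdx = namespace["dfdx"]
print(dfdx(1.5), 2 * 1.5 * math.cos(1.5 ** 2))
```

Because the result is plain source text, it can be printed, edited, or stepped through in a debugger like any other function.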

The function df above only works for scalar (non-array) inputs. Tangent also supports

Although we started with TensorFlow Eager support, Tangent isn’t tied to one numeric library or another—we would gladly welcome pull requests adding PyTorch or MXNet derivative recipes.

Next Steps

Tangent is open source now at github.com/google/tangent. Go check it out for download and installation instructions. Tangent is still an experiment, so expect some bugs. If you report them to us on GitHub, we will do our best to fix them quickly.

We are working to add support in Tangent for more aspects of the Python language (e.g., closures, inline function definitions, classes, more NumPy and TensorFlow functions). We also hope to add more advanced automatic differentiation and compiler functionality in the future, such as automatic trade-off between memory and compute (Griewank and Walther 2000; Gruslys et al., 2016), more aggressive optimizations, and lambda lifting.

We intend to develop Tangent together as a community. We welcome pull requests with fixes and features. Happy deriving!

By Alex Wiltschko, Research Scientist, Google Brain Team

Acknowledgments

Bart van Merriënboer contributed immensely to all aspects of Tangent during his internship, and Dan Moldovan led TF Eager integration, infrastructure and benchmarking. Also, thanks to the Google Brain team for their support of this post and special thanks to Sanders Kleinfeld and Aleks Haecky for their valuable contributions to the technical aspects of the post.