Introducing container-diff, a tool for quickly comparing container images

Thursday, November 16, 2017

The Google Container Tools team originally built container-diff, a new project to help uncover differences between container images, to aid our own development with containers. We think it can be useful for anyone building containerized software, so we’re excited to release it as open source to the development community.

Containers and the Dockerfile format help make customization of an application’s runtime environment more approachable and easier to understand. While this is a great advantage of using containers in software development, a major drawback is that it can be hard to visualize what changes in a container image will result from a change in the respective Dockerfile. This can lead to bloated images and make tracking down issues difficult.

Imagine a scenario where a developer is working on an application, built on a runtime image maintained by a third-party. During development someone releases a new version of that base image with updated system packages. The developer rebuilds their application and picks up the latest version of the base image, and suddenly their application stops working; it depended on a previous version of one of the installed system packages, but which one? What version was it on before? With no currently existing tool to easily determine what changed between the two base image versions, this totally stalls development until the developer can track down the package version incompatibility.

Introducing container-diff

container-diff helps users investigate image changes by computing semantic diffs between images. What this means is that container-diff figures out on a low-level what data changed, and then combines this with an understanding of package manager information to output this information in a format that’s actually readable to users. The tool can find differences in system packages, language-level packages, and files in a container image.

Users can specify images in several formats - from local Docker daemon (using the prefix `daemon://` on the image path), a remote registry (using the prefix `remote://`), or a file in the .tar in the format exported by "docker save" command. You can also combine these formats to compute the diff between a local version of an image and a remote version. This can be useful when experimenting with new builds of an image that you might not be quite ready to push yet. container-diff supports image tarballs and the registry protocol natively, enabling it to run in environments without a Docker daemon.

Examples and Use Cases

Here is a basic Dockerfile that installs Python inside our Debian base image. Running container-diff on the base image and the new one with Python, users can see all the apt packages that were installed as dependencies of Python.

And below is a Dockerfile that inherits from our Python base runtime image, and then installs the mock and six packages inside of it. Running container-diff with the pip differ, users can see all the Python packages that have either been installed or changed as a result of this:

This can be especially useful when it’s unclear which packages might have been installed or changed incidentally as a result of dependency management of Python modules.

These are just a few examples. The tool currently has support for Python and Node.js packages installed via pip and npm, respectively, as well as comparison of image filesystems and Docker history. In the future, we’d like to see support added for additional runtime and language differs, including Java, Go, and Ruby. External contributions are welcome! For more information on contributing to container-diff, see this how-to guide.

Now that we’ve seen container-diff compare two images in action, it’s easy to imagine how the tool may be integrated into larger workflows to aid in development:
  • Changelog generation: Given container-diff’s capacity to facilitate investigation of filesystem and package modifications, it can do most of the heavy lifting in discerning changes for automatic changelog generation for new releases of an image.
  • Continuous integration: As part of a CI system, users can leverage container-diff to catch potentially breaking filesystem changes resulting from a Dockerfile change in their builds.
container-diff’s default output mode is “human-readable,” but also supports output to JSON, allowing for easy automated parsing and processing by users.

Single Image Analysis

In addition to comparing two images, container-diff has the ability to analyze a single image on its own. This can enable users to get a quick glance at information about an image, such as its system and language-level package installations and filesystem contents.

Let’s take a look at our Debian base image again. We can use the tool to easily view a list of all packages installed in the image, along with each one’s installed version and size:

We could use this to verify compatibility with an application we’re building, or maybe sort the packages by size in another one of our images and see which ones are taking up the most space.

For more information about this tool as well as a breakdown with examples, uses, and inner workings of the tool, please take a look at documentation on our GitHub page. Happy diffing!

Special thanks to Colette Torres and Abby Tisdale, our software engineering interns who helped build the tool from the ground up.

By Nick Kubala, Container Tools team

Tangent: Source-to-Source Debuggable Derivatives

Wednesday, November 8, 2017

Crossposted on the Google Research Blog

Tangent is a new, free, and open source Python library for automatic differentiation. In contrast to existing machine learning libraries, Tangent is a source-to-source system, consuming a Python function f and emitting a new Python function that computes the gradient of f. This allows much better user visibility into gradient computations, as well as easy user-level editing and debugging of gradients. Tangent comes with many more features for debugging and designing machine learning models.
This post gives an overview of the Tangent API. It covers how to use Tangent to generate gradient code in Python that is easy to interpret, debug and modify.

Neural networks (NNs) have led to great advances in machine learning models for images, video, audio, and text. The fundamental abstraction that lets us train NNs to perform well at these tasks is a 30-year-old idea called reverse-mode automatic differentiation (also known as backpropagation), which comprises two passes through the NN. First, we run a “forward pass” to calculate the output value of each node. Then we run a “backward pass” to calculate a series of derivatives to determine how to update the weights to increase the model’s accuracy.

Training NNs, and doing research on novel architectures, requires us to compute these derivatives correctly, efficiently, and easily. We also need to be able to debug these derivatives when our model isn’t training well, or when we’re trying to build something new that we do not yet understand. Automatic differentiation, or just “autodiff,” is a technique to calculate the derivatives of computer programs that denote some mathematical function, and nearly every machine learning library implements it.

Existing libraries implement automatic differentiation by tracing a program’s execution (at runtime, like TF Eager, PyTorch and Autograd) or by building a dynamic data-flow graph and then differentiating the graph (ahead-of-time, like TensorFlow). In contrast, Tangent performs ahead-of-time autodiff on the Python source code itself, and produces Python source code as its output.
As a result, you can finally read your automatic derivative code just like the rest of your program. Tangent is useful to researchers and students who not only want to write their models in Python, but also read and debug automatically-generated derivative code without sacrificing speed and flexibility.

You can easily inspect and debug your models written in Tangent, without special tools or indirection. Tangent works on a large and growing subset of Python, provides extra autodiff features other Python ML libraries don’t have, is high-performance, and is compatible with TensorFlow and NumPy.

Automatic differentiation of Python code

How do we automatically generate derivatives of plain Python code? Math functions like tf.exp or tf.log have derivatives, which we can compose to build the backward pass. Similarly, pieces of syntax, such as  subroutines, conditionals, and loops, also have backward-pass versions. Tangent contains recipes for generating derivative code for each piece of Python syntax, along with many NumPy and TensorFlow function calls.

Tangent has a one-function API:
import tangent
df = tangent.grad(f)
Here’s an animated graphic of what happens when we call tangent.grad on a Python function:
If you want to print out your derivatives, you can run
import tangent
df = tangent.grad(f, verbose=1)
Under the hood, tangent.grad first grabs the source code of the Python function you pass it. Tangent has a large library of recipes for the derivatives of Python syntax, as well as TensorFlow Eager functions. The function tangent.grad then walks your code in reverse order, looks up the matching backward-pass recipe, and adds it to the end of the derivative function. This reverse-order processing gives the technique its name: reverse-mode automatic differentiation.

The function df above only works for scalar (non-array) inputs. Tangent also supports
Although we started with TensorFlow Eager support, Tangent isn’t tied to one numeric library or another—we would gladly welcome pull requests adding PyTorch or MXNet derivative recipes.

Next Steps

Tangent is open source now at Go check it out for download and installation instructions. Tangent is still an experiment, so expect some bugs. If you report them to us on GitHub, we will do our best to fix them quickly.

We are working to add support in Tangent for more aspects of the Python language (e.g., closures, inline function definitions, classes, more NumPy and TensorFlow functions). We also hope to add more advanced automatic differentiation and compiler functionality in the future, such as automatic trade-off between memory and compute (Griewank and Walther 2000; Gruslys et al., 2016), more aggressive optimizations, and lambda lifting.

We intend to develop Tangent together as a community. We welcome pull requests with fixes and features. Happy deriving!

By Alex Wiltschko, Research Scientist, Google Brain Team


Bart van Merriënboer contributed immensely to all aspects of Tangent during his internship, and Dan Moldovan led TF Eager integration, infrastructure and benchmarking. Also, thanks to the Google Brain team for their support of this post and special thanks to Sanders Kleinfeld and Aleks Haecky for their valuable contribution for the technical aspects of the post.

Welcoming 25 mentor organizations for Google Code-in 2017

Thursday, October 26, 2017

We’re thrilled to introduce 25 open source organizations that are participating in Google Code-in 2017. The contest, now in its eighth year, offers 13-17 year old pre-university students an opportunity to learn and practice their skills while contributing to open source projects.

Google Code-in officially starts for students on November 28. Students are encouraged to learn about the participating organizations ahead of time and can get started by clicking on the links below:

  • Apertium: rule-based machine translation platform
  • BRL-CAD: computer graphics, 2D and 3D geometry modeling and computer-aided design (CAD)
  • Catrobat: visual programming for creating mobile games and animations
  • CCExtractor: open source tools for subtitle generation
  • CloudCV: building platforms for reproducible AI research
  • coala: a unified interface for linting and fixing code, regardless of the programming languages used
  • Drupal: content management platform
  • FOSSASIA: developing communities across all ages and borders to form a better future with Open Technologies and ICT
  • Haiku: operating system specifically targeting personal computing
  • JBoss Community: a community of projects around JBoss Middleware
  • LibreHealth: aiming to bring open source healthcare IT to all of humanity
  • Liquid Galaxy: an interactive, panoramic and immersive visualization tool
  • MetaBrainz: builds community maintained databases
  • Mifos Initiative: transforming the delivery of financial services to the poor and the unbanked
  • MovingBlocks: a Minecraft-inspired open source game
  • OpenMRS: open source medical records system for the world
  • OpenWISP: build and manage low cost networks such as public wifi
  • OSGeo: building open source geospatial tools
  • Sugar Labs: learning platform and activities for elementary education
  • SCoRe: research lab seeking sustainable solutions for problems faced by developing countries
  • Systers: community for women involved in technical aspects of computing
  • Ubuntu: an open source operating system
  • Wikimedia: non-profit foundation dedicated to bringing free content to the world, operating Wikipedia
  • XWiki: a web platform for developing collaborative applications using the wiki paradigm
  • Zulip: powerful, threaded open source group chat with apps for every major platform

These mentor organizations are hard at work creating thousands of tasks for students to work on, including code, documentation, user interface, quality assurance, outreach, research and training tasks. The contest officially starts for students on Tuesday, November 28th at 9:00am PST.

You can learn more about Google Code-in on the contest site where you’ll find Contest Rules, Frequently Asked Questions and Important Dates. There you’ll also find flyers and other helpful information including the Getting Started Guide. Our discussion mailing list is a great way to talk with other students, mentors and organization administrators about the contest.

By Josh Simmons, Google Open Source