Announcing the S2 Library: Geometry on the Sphere

Tuesday, December 5, 2017

Google has always embraced new approaches to organizing all the world's information, and this includes all the world's geography. Today we are announcing the open source release of Google's S2 library, the core geometric library on which Google's global geographic database is built.

A unique feature of the S2 library is that unlike traditional geographic information systems, which represent data as flat two-dimensional projections (similar to an atlas), the S2 library represents all data on a three-dimensional sphere (similar to a globe). This makes it possible to build a worldwide geographic database with no seams or singularities, using a single coordinate system, and with low distortion everywhere compared to the true shape of the Earth. While the Earth is not quite spherical, it is much closer to being a sphere than it is to being flat!
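
To make the globe-versus-atlas distinction concrete, here is a minimal Python sketch that treats points as unit vectors on a sphere. It does not use the S2 API; the names `to_unit_vector` and `angle_between` are made up for this illustration. Two points on either side of the 180° meridian sit at opposite edges of a typical flat map, yet on the sphere they are close neighbors and no special case is needed.

```python
import math

def to_unit_vector(lat_deg, lng_deg):
    """Convert latitude/longitude in degrees to a 3D unit vector."""
    lat, lng = math.radians(lat_deg), math.radians(lng_deg)
    return (math.cos(lat) * math.cos(lng),
            math.cos(lat) * math.sin(lng),
            math.sin(lat))

def angle_between(a, b):
    """Angle in radians between two unit vectors, via atan2(|a x b|, a . b)."""
    cross = (a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0])
    dot = sum(x * y for x, y in zip(a, b))
    return math.atan2(math.sqrt(sum(c * c for c in cross)), dot)

# Two equatorial points straddling the antimeridian (hypothetical sample data).
west = to_unit_vector(0.0, 179.5)
east = to_unit_vector(0.0, -179.5)
separation_deg = math.degrees(angle_between(west, east))
```

Here `separation_deg` comes out at about one degree, whereas naively subtracting the two longitudes would suggest the points are 359 degrees apart.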

Notable features of the library include:
  • Flexible support for spatial indexing, including the ability to approximate arbitrary regions as collections of discrete S2 cells. This feature makes it easy to build large distributed spatial indexes. (The image above illustrates the S2 space-filling curve, an important tool used for spatial indexing.)
  • Fast in-memory spatial indexing of collections of points, polylines, and polygons.
  • Robust constructive operations (such as intersection, union, and simplification) and boolean predicates (such as testing for containment).
  • Efficient query operations for finding nearby objects, measuring distances, computing centroids, etc.
  • A flexible and robust implementation of snap rounding (a geometric technique that allows operations to be implemented 100% robustly while using small and fast coordinate representations).
  • A collection of efficient yet exact mathematical predicates for testing relationships among geometric primitives.
  • Extensive testing on Google's vast collection of geographic data.
  • Flexible Apache 2.0 license.
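
As a rough illustration of why a space-filling curve is useful for spatial indexing, the sketch below maps positions along a textbook 2D Hilbert curve to cells of a square grid. It is not S2's cell-numbering code, which may differ in detail, and `hilbert_d2xy` is a name chosen for this example. The point is that consecutive positions on the curve always fall in adjacent cells, so a contiguous range of one-dimensional keys covers a compact patch of space.

```python
def hilbert_d2xy(order, d):
    """Map position d along a Hilbert curve to (x, y) on a 2**order grid."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the quadrant so sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

order = 3  # 8 x 8 grid, 64 cells
cells = [hilbert_d2xy(order, d) for d in range(4 ** order)]
# Manhattan distance between each pair of consecutive cells on the curve.
steps = [abs(x1 - x0) + abs(y1 - y0)
         for (x0, y0), (x1, y1) in zip(cells, cells[1:])]
```

Every entry in `steps` is 1, and `cells` visits each grid cell exactly once, which is the locality property a sorted index of curve positions relies on.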

The reference implementation of the S2 library is written in C++, and subsets have been ported to Go, Java, and Python. An early version of the code was released in 2011, but today's announcement represents a major update along with a commitment to maintain the library going forward. The code is under active development and new features will be released regularly. (The Java port is based on the 2011 code and does not have the same robustness, performance, or features as the current C++ version.)

Our C++ code repository is here:
And check out our documentation here:

To learn more, start by reading the overview and quick start documents, then explore the documentation site. The library also has extensive documentation in the header files, which is where the most authoritative information can be found. More introductions and tutorials will be added over time - contributions are welcome!

The S2 library was written primarily by Eric Veach. Other significant contributors include Jesse Rosenstock, Eric Engle (Java port lead), Robert Snedegar (Go port lead), Julien Basch, and Tom Manshreck.

By Eric Veach, Software Engineer

DeepVariant: Highly Accurate Genomes With Deep Neural Networks

Monday, December 4, 2017

Crossposted on the Google Research Blog

Across many scientific disciplines, but in particular in the field of genomics, major breakthroughs have often resulted from new technologies. From Sanger sequencing, which made it possible to sequence the human genome, to the microarray technologies that enabled the first large-scale genome-wide experiments, new instruments and tools have allowed us to look ever more deeply into the genome and apply the results broadly to health, agriculture and ecology.

One of the most transformative new technologies in genomics was high-throughput sequencing (HTS), which first became commercially available in the early 2000s. HTS allowed scientists and clinicians to produce sequencing data quickly, cheaply, and at scale. However, the output of HTS instruments is not the genome sequence of the individual being analyzed, which for humans is 3 billion paired bases (guanine, cytosine, adenine and thymine) organized into 23 pairs of chromosomes. Instead, these instruments generate ~1 billion short sequences, known as reads. Each read represents just 100 of the 3 billion bases, and per-base error rates range from 0.1% to 10%. Processing the HTS output into a single, accurate and complete genome sequence is a major outstanding challenge. The importance of this problem, for biomedical applications in particular, has motivated efforts such as the Genome in a Bottle Consortium (GIAB), which produces high confidence human reference genomes that can be used for validation and benchmarking, as well as the precisionFDA community challenges, which are designed to foster innovation that will improve the quality and accuracy of HTS-based genomic tests.
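
The figures in this paragraph imply a useful back-of-the-envelope number: how many reads, on average, cover each position in the genome. The short calculation below uses only the approximate values quoted above and assumes, purely for simplicity, that reads are spread uniformly across the genome.

```python
genome_bases = 3_000_000_000   # bases in a human genome (from the text)
num_reads = 1_000_000_000      # approximate number of reads (from the text)
read_length = 100              # bases per read (from the text)

# Average number of reads covering each position, assuming uniform coverage.
average_depth = num_reads * read_length / genome_bases

# Expected number of erroneous bases per position at the quoted error rates.
errors_low = average_depth * 0.001   # 0.1% per-base error rate
errors_high = average_depth * 0.10   # 10% per-base error rate
```

This gives an average depth of roughly 33 reads per position, with anywhere from about 0.03 to about 3.3 of the bases at a typical position expected to be wrong. That is why a single mismatching read cannot be taken at face value.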

CAPTION: For any given location in the genome, there are multiple reads among the ~1 billion that include a base at that position. Each read is aligned to a reference, and then each of the bases in the read is compared to the base of the reference at that location. When a read includes a base that differs from the reference, it may indicate a variant (a difference in the true sequence), or it may be an error.
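
The comparison described in the caption can be sketched in a few lines of Python. This is a toy example with hypothetical data, not code from DeepVariant: `pileup_at` is an illustrative name, and reads are assumed to align without gaps so that a read's start coordinate is enough to locate each of its bases.

```python
from collections import Counter

def pileup_at(reads, position):
    """Count the bases that aligned reads place at a reference position.

    reads: list of (start, sequence) pairs, where start is the 0-based
    reference coordinate of the read's first base (gap-free alignment).
    """
    counts = Counter()
    for start, seq in reads:
        offset = position - start
        if 0 <= offset < len(seq):
            counts[seq[offset]] += 1
    return counts

# Hypothetical reference and reads; two reads carry a T at position 5.
reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTATGT"), (3, "TATGTA"), (4, "ACGTAC")]

counts = pileup_at(reads, 5)
mismatches = {base: n for base, n in counts.items() if base != reference[5]}
```

At position 5 the reference base is C, two reads agree with it, and two reads show T. Whether those two T bases reflect a real variant or sequencing errors is exactly the question variant calling has to answer.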

Today, we announce the open source release of DeepVariant, a deep learning technology to reconstruct the true genome sequence from HTS sequencer data with significantly greater accuracy than previous classical methods. This work is the product of more than two years of research by the Google Brain team, in collaboration with Verily Life Sciences. DeepVariant transforms the task of variant calling, as this reconstruction problem is known in genomics, into an image classification problem well-suited to Google's existing technology and expertise.

CAPTION: Each of the four images above is a visualization of actual sequencer reads aligned to a reference genome. A key question is how to use the reads to determine whether there is a variant on both chromosomes, on just one chromosome, or on neither chromosome. There is more than one type of variant, with SNPs and insertions/deletions being the most common. A: a true SNP on one chromosome pair, B: a deletion on one chromosome, C: a deletion on both chromosomes, D: a false variant caused by errors. It's easy to see that these look quite distinct when visualized in this manner.
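
For contrast with the learned approach described in this post, the question of "both chromosomes, one chromosome, or neither" can be approximated with a hand-written rule on the fraction of reads that disagree with the reference. The sketch below is only an illustration of that idea; the function name and the 0.2 and 0.8 cutoffs are arbitrary assumptions, not values taken from DeepVariant or any other tool.

```python
def naive_genotype(ref_count, alt_count, low=0.2, high=0.8):
    """Classify a site from read counts using fixed allele-fraction cutoffs.

    The cutoffs are arbitrary illustrative values, not from any real caller.
    """
    total = ref_count + alt_count
    if total == 0:
        return "no-call"
    alt_fraction = alt_count / total
    if alt_fraction < low:
        return "neither chromosome (likely errors)"
    if alt_fraction > high:
        return "both chromosomes"
    return "one chromosome"

# Hypothetical read counts at three sites.
site_a = naive_genotype(ref_count=15, alt_count=16)
site_c = naive_genotype(ref_count=1, alt_count=30)
site_d = naive_genotype(ref_count=29, alt_count=2)
```

A fixed rule like this ignores base qualities, alignment context, and systematic error patterns, which is the kind of information a classifier trained on the full pileup image can take into account.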

We started with GIAB reference genomes, for which there is high-quality ground truth (or the closest approximation currently possible). Using multiple replicates of these genomes, we produced tens of millions of training examples in the form of multi-channel tensors encoding the HTS instrument data, and then trained a TensorFlow-based image classification model to identify the true genome sequence from the experimental data produced by the instruments. Although the resulting deep learning model, DeepVariant, had no specialized knowledge about genomics or HTS, within a year it had won the highest SNP accuracy award at the precisionFDA Truth Challenge, outperforming state-of-the-art methods. Since then, we've further reduced the error rate by more than 50%.
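
To give a feel for what "multi-channel tensors encoding the HTS instrument data" might look like, here is a toy encoding in plain Python. The channel layout (base identity, scaled base quality, mismatch flag), the `BASE_CODE` values, and the function name are all assumptions made for this sketch; DeepVariant's actual encoding is not described in this post and is likely different.

```python
# Hypothetical numeric codes for the four bases.
BASE_CODE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}

def encode_pileup(reference, reads, start, width):
    """Encode reads over reference[start:start+width] as rows x width x 3.

    Channels (all hypothetical): base identity, base quality scaled to
    [0, 1], and 1.0 where the read disagrees with the reference.
    reads: list of (read_start, sequence, qualities), gap-free alignment.
    """
    tensor = []
    for read_start, seq, quals in reads:
        row = []
        for pos in range(start, start + width):
            offset = pos - read_start
            if 0 <= offset < len(seq):
                base = seq[offset]
                row.append([BASE_CODE[base],
                            min(quals[offset], 40) / 40.0,
                            1.0 if base != reference[pos] else 0.0])
            else:
                row.append([0.0, 0.0, 0.0])  # read does not cover this column
        tensor.append(row)
    return tensor

# Hypothetical reference, reads, and per-base quality scores.
reference = "ACGTACGTAC"
reads = [(2, "GTATGT", [30, 30, 30, 20, 30, 30]),
         (4, "ACGTAC", [40, 40, 40, 40, 40, 40])]
tensor = encode_pileup(reference, reads, start=3, width=5)
```

The result has one row per read, one column per reference position, and three channels per cell, so it can be handed to an image classifier in the same way as a small multi-channel picture.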

DeepVariant is being released as open source software to encourage collaboration and to accelerate the use of this technology to solve real world problems. To further this goal, we partnered with Google Cloud Platform (GCP) to deploy DeepVariant workflows on GCP, available today, in configurations optimized for low-cost and fast turnarounds using scalable GCP technologies like the Pipelines API. This paired set of releases provides a smooth ramp for users to explore and evaluate the capabilities of DeepVariant in their current compute environment while providing a scalable, cloud-based solution to satisfy the needs of even the largest genomics datasets.

DeepVariant is the first of what we hope will be many contributions that leverage Google's computing infrastructure and ML expertise both to better understand the genome and to provide deep learning-based genomics tools to the community. This is all part of a broader goal to apply Google technologies to healthcare and other scientific applications, and to make the results of these efforts broadly accessible.

By Mark DePristo and Ryan Poplin, Google Brain Team

Google Summer of Code 2017 Mentor Summit

Thursday, November 30, 2017

This year Google brought over 320 mentors from all over the world (33 countries!) to Google's offices in Sunnyvale, California for the 2017 Google Summer of Code Mentor Summit. With 149 organizations represented, the summit provided the perfect opportunity to meet like-minded open source enthusiasts and discuss ways to make open source better and more sustainable.
Group photo by Dmitry Levin used under a CC BY-SA 4.0 license.
The Mentor Summit is run as an unconference in which attendees create and join sessions based on their interests. “I liked the unconference sessions, that they were casual and discussion based and I got a lot out of them. It was the place I connected with the most people,” said Cassie Tarakajian, attending on behalf of the Processing Foundation.

Attendees quickly filled the schedule boards with interesting sessions. One theme in this year’s session schedule was the challenging topic of failing students. Derk Ruitenbeek, part of the phpBB contingent, had this to say:
“This year our organisation had a high failure rate of 3 out of 5 accepted students. During the Mentor Summit I attended multiple sessions about failing students and rating proposals and got a lot [of] useful tips. Talking with other mentors about this really helped me find ways to improve student selection for our organisation next time.”
This year was the largest Mentor Summit ever – with the exception of our 10 Year Reunion in 2014 – and had the best gender diversity yet. Katarina Behrens, a mentor who worked with LibreOffice, observed:
“I was pleased to see many more women at the summit than last time I participated. I'm also beyond happy that now not only women themselves, but also men engage in increasing (not only gender) diversity of their projects and teams.”
We've held the Mentor Summit for the past 10+ years as a way to meet some of the thousands of mentors whose generous work for the students makes the program successful, and to give some of them and the projects they represent a chance to meet. For 52% of the attendees, this was their first Mentor Summit, giving us a lot of fresh perspectives to learn from!

We love hosting the Mentor Summit, and attendees enjoy it as well, especially the opportunity to meet each other. In fact, some attendees met in person for the first time at the Mentor Summit after years of collaborating remotely! According to Aveek Basu, who mentored for The Linux Foundation, the event was an excellent opportunity for “networking with like minded people from different communities. Also it was nice to know about people working in different fields from bioinformatics to robotics, and not only hard core computer science.”

You can browse the event website and read through some of the session notes that attendees took to learn a bit more about this year’s Mentor Summit.

Now that Google Summer of Code 2017 and the Mentor Summit have come to a close, our team is busy gearing up for the 2018 program. We hope to see you then!

By Maria Webb, Google Open Source