opensource.google.com

Posts from 2021

Actuating Google Production: How Google’s Site Reliability Engineering Team Uses Go

Tuesday, April 13, 2021

Google runs a small number of very large services. Those services are powered by a global infrastructure covering everything a developer needs: storage systems, load balancers, network, logging, monitoring, and much more. Nevertheless, it is not a static system—it cannot be. Architecture evolves, new products and ideas are created, new versions must be rolled out, configs pushed, database schema updated, and more. We end up deploying changes to our systems dozens of times per second.

Because of this scale and critical need for reliability, Google pioneered Site Reliability Engineering (SRE), a role that many other companies have since adopted. “SRE is what you get when you treat operations as if it’s a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services with an ever-watchful eye on their availability, latency, performance, and capacity.” - Site Reliability Engineering (SRE).

Credit to Renee French for the Go Gopher

In 2013-2014, Google’s SRE team realized that our approach to production management was not cutting it anymore in many ways. We had advanced far beyond shell scripts, but our scale had so many moving pieces and complexities that a new approach was needed. We determined that we needed to move toward a declarative model of our production, called "Prodspec," driving a dedicated control plane, called "Annealing."

When we started those projects, Go was just becoming a viable option for critical services at Google. Most engineers were more familiar with Python and C++, either of which would have been a valid choice. Nevertheless, Go captured our interest. The appeal of novelty was certainly a factor. But, more importantly, Go promised a sweet spot between performance and readability that neither of the other languages was able to offer. We started a small experiment with Go for some initial parts of Annealing and Prodspec. As the projects progressed, those initial parts written in Go found themselves at the core. We were happy with Go: its simplicity grew on us, the performance was there, and the concurrency primitives would have been hard to replace.

At no point was there ever a mandate or requirement to use Go, but we had no desire to return to Python or C++. Go grew organically in Annealing and Prodspec. It was the right choice, and thus is now our language of choice. Now the majority of Google production is managed and maintained by our systems written in Go.

The power of having a simple language in those projects is hard to overstate. There have been cases where some feature was indeed missing, such as the ability to enforce in the code that some complex structure should not be mutated. But for each one of those cases, there have undoubtedly been tens or hundreds of cases where the simplicity helped.

For example, Annealing impacts a wide variety of teams and services, which meant that we relied heavily on contributions from across the company. The simplicity of Go made it possible for people outside our team to see why some part or another was not working for them, and often to provide fixes or features themselves. This allowed us to grow quickly.

Prodspec and Annealing are in charge of some quite critical components. Go’s simplicity means that the code is easy to follow, whether it is to spot bugs during review or when trying to determine exactly what happened during a service disruption.

Go performance and concurrency support have also been key for our work. As our model of production is declarative, we tend to manipulate a lot of structured data, which describes what production is and what it should be. We have large services so the data can grow large, often making purely sequential processing not efficient enough.

We are manipulating this data in many ways and many places. It is not a matter of having a smart person come up with a parallel version of our algorithm. It is a matter of casual parallelism, finding the next bottleneck and parallelising that code section. And Go enables exactly that.
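As an illustration of what "casual parallelism" can look like, here is a minimal sketch in Go. The `ServiceSpec` type and the `validate` function are hypothetical stand-ins invented for this example; they are not part of Prodspec or Annealing. The sketch assumes the per-item work is independent, so a sequential loop can be turned into a concurrent one with a `sync.WaitGroup` and little else.

```go
package main

import (
	"fmt"
	"sync"
)

// ServiceSpec is a hypothetical stand-in for one piece of declarative
// production data; it is not a Prodspec type.
type ServiceSpec struct {
	Name     string
	Replicas int
}

// validate is a placeholder for per-item work that does not depend on
// any other item.
func validate(s ServiceSpec) bool {
	return s.Name != "" && s.Replicas > 0
}

// validateAll runs validate over every spec concurrently. Each goroutine
// writes only to its own index of the result slice, so no mutex is needed.
func validateAll(specs []ServiceSpec) []bool {
	results := make([]bool, len(specs))
	var wg sync.WaitGroup
	for i, s := range specs {
		wg.Add(1)
		go func(i int, s ServiceSpec) {
			defer wg.Done()
			results[i] = validate(s)
		}(i, s)
	}
	wg.Wait()
	return results
}

func main() {
	specs := []ServiceSpec{
		{Name: "frontend", Replicas: 3},
		{Name: "", Replicas: 1},
		{Name: "backend", Replicas: 0},
	}
	fmt.Println(validateAll(specs)) // [true false false]
}
```

For very large inputs, one goroutine per item may be more than is needed; bounding the number of workers is a common refinement that this sketch leaves out.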

As a result of our success with Go, we now use Go for every new development for Prodspec and Annealing. In addition to the SRE team, engineering teams across Google have adopted Go in their development process. Read about how the Core Data Solutions, Firebase Hosting, and Chrome teams use Go to build fast, reliable, and efficient software at scale.

By Pierre Palatin, Site Reliability Engineer

Logica: organizing your data queries, making them universally reusable and fun

Monday, April 12, 2021

We present Logica, a novel open source logic programming language. A successor to Yedalog (a language developed at Google earlier), it is a Datalog-like logic programming language. Logica code compiles to SQL and runs on Google BigQuery (with experimental support for PostgreSQL and SQLite), but it is much more concise and supports the clean and reusable abstraction mechanisms that SQL lacks. It supports modules and imports, it can be used from an interactive Python notebook, and it even makes testing your queries natural and easy.

“Data is the new oil”, they say, and SQL is so far the lingua franca for working with data. When SQL (or “Structured English Query Language”, as it was first named) was invented in the 1970s, its authors might not have imagined the popularity that it would reach half a century later. Today, systems ranging from tiny smart watch applications to enterprise IT solutions read and write their data using SQL. Even the browser that you are using to read this post now might have a working built-in SQL database in it.

Despite the widespread adoption, SQL is not flawless. Constructing statements from long chains of English words (which are often capitalized to keep the old-fashioned COBOL spirit of the 70s alive!) can be very verbose—a single query spanning hundreds of lines is a routine occurrence. The main flaw of SQL, however, lies in its very limited support for abstraction.

Good programming is about creating small, understandable, reusable pieces of logic that can be tested, given names, and organized into packages which can later be used to construct more useful pieces of logic. SQL resists this workflow. Although you can encapsulate certain repeated computations into views and functions, the syntax and support for these can vary among implementations, the notions of packages and imports are generally nonexistent, and higher-level constructions (e.g. passing a function to a function) are impossible.

This inherent resistance to decomposition of logic into bite-sized pieces is what leads to the contrived, lengthy queries, the copy-pasted chunks of code and, eventually, unmaintainable, unstructured (note the irony) SQL codebases. To make things worse, SQL code is rarely tested, because “testing SQL queries” sounds rather esoteric to most engineers, at best. Because of that, a number of alternative query languages and libraries have been developed. Of those, systems based on logic programming perhaps come the closest to addressing SQL’s limitations.

Logic programming languages address these problems of SQL by using the syntax of mathematical propositional logic rather than natural English language. The language of formal logic was designed by mathematicians specifically to make expression of complex statements easier, and it suits this purpose much better than natural language. Logica extends classical logic programming syntax further, most notably with aggregation, hence the name, which stands for

Logica = Logic + Aggregation.

Let us see how it all works. SQL operates with relations, which are sets of rows. In logic programming the analog of a relation is a predicate. While a predicate is a set of rows, we think of it as a logical condition, which describes the rows of a relation. Here is, for example, the definition of a simple predicate:

MagicNumber(x: 2);
MagicNumber(x: 3);
MagicNumber(x: 5);

The definition claims that the condition MagicNumber(x) must hold when x is precisely either 2, 3, or 5. That means, if we were to query this predicate (i.e. request all values of x that satisfy it), the output should be a “relation” with a single column x and rows 2, 3, and 5. The SQL equivalent would be:

SELECT 2 AS x
UNION ALL
SELECT 3 AS x
UNION ALL
SELECT 5 AS x;

Rather than listing the individual values, we could have defined the predicate by encoding a logical condition upon x as follows:

MagicNumber(x:) :-
  x in [2, 3, 5];

Now, here is where the magic starts. Firstly, any table in your database is itself already a predicate, so the following definition:

MagicComment(comment_text:) :-
  `comments`(user_id:, comment_text:),
  user_id == 5;

defines a predicate MagicComment, which includes precisely those comment_text values that are present in the comments table where user_id == 5. In SQL this would read:

SELECT comment_text FROM comments WHERE user_id = 5;

Observe what happens if we replace the condition “user_id == 5” in our predicate with MagicNumber(x: user_id):

MagicComment(comment_text:) :-
  `comments`(user_id:, comment_text:),
  MagicNumber(x: user_id);

Here, we are querying for comments of users whose ID is one of the “magic numbers” we just defined above. Note how easily we could reuse a previously defined piece of code without having to copy anything around. We could now even extract the MagicNumber to a common module and import it in wherever it is needed:

import my_project.magic.MagicNumber;
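For readers more at home in a general-purpose language, the composition above can be mimicked in Go. This is only a rough analogy under simplifying assumptions, not how Logica evaluates anything (Logica compiles to SQL): the `Comment` type, `magicNumber`, and `magicComment` are names made up for this sketch, and SQL details such as duplicate handling are ignored. It shows the one idea that matters here: a named condition is defined once and passed to whatever query needs it.

```go
package main

import "fmt"

// Comment mirrors one row of the comments table used in the examples.
type Comment struct {
	UserID int
	Text   string
}

// magicNumber plays the role of the MagicNumber predicate: it holds
// exactly for 2, 3, and 5.
func magicNumber(x int) bool {
	switch x {
	case 2, 3, 5:
		return true
	}
	return false
}

// magicComment returns the text of every row whose user ID satisfies
// the given predicate, like the body of MagicComment.
func magicComment(comments []Comment, pred func(int) bool) []string {
	var out []string
	for _, c := range comments {
		if pred(c.UserID) {
			out = append(out, c.Text)
		}
	}
	return out
}

func main() {
	rows := []Comment{{1, "first"}, {5, "fifth"}, {7, "seventh"}}
	fmt.Println(magicComment(rows, magicNumber)) // [fifth]
}
```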

As a final example, let us mock the comments table in a unit test of a query.

import my_project.magic.MagicComment;

MockComments(user_id: 1, comment_text: "Hello");
MockComments(user_id: 2, comment_text: "Logic");
MockComments(user_id: 3, comment_text: "Programming");

MagicCommentTest := MagicComment(`comments`: MockComments);

If we query the MagicCommentTest predicate here, it will not try to read the comments table in the database. Instead, it will use the MockComments predicate we just defined, thus letting us verify the query's correctness by testing the output (it must include two rows, “Logic” and “Programming”). Observe how natural and frictionless many of the good programming practices become with Logica, and compare that to what you would have to do to achieve the same using bare SQL.
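The substitution used in this test is close in spirit to dependency injection. As a loose analogy only, the Go sketch below treats the source of comment rows as a parameter, so a test can hand in fixed rows instead of a database table. `CommentSource`, `magicComment`, and `mockComments` are hypothetical names chosen for this illustration and have no counterpart in Logica itself.

```go
package main

import "fmt"

// Comment mirrors one row of the comments table.
type Comment struct {
	UserID int
	Text   string
}

// CommentSource stands in for "wherever the comments rows come from":
// the real table in production, a fixed slice in a test.
type CommentSource func() []Comment

// magicComment keeps the text of rows whose user ID is 2, 3, or 5.
func magicComment(src CommentSource) []string {
	magic := map[int]bool{2: true, 3: true, 5: true}
	var out []string
	for _, c := range src() {
		if magic[c.UserID] {
			out = append(out, c.Text)
		}
	}
	return out
}

// mockComments returns the same three rows as MockComments above.
func mockComments() []Comment {
	return []Comment{
		{1, "Hello"},
		{2, "Logic"},
		{3, "Programming"},
	}
}

func main() {
	// The query runs against the mock rows, never touching a database.
	fmt.Println(magicComment(mockComments)) // [Logic Programming]
}
```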

There is much more to Logica, so make sure you give it a try—chances are, you will love it! Start with this tutorial to learn Logica. Even if you do not end up using it in your next project, learning a new powerful language may open your mind to new ideas and perspectives on data processing and computing in general.

The simple examples above are only a small sample of how much more concise Logica code can be than SQL for complex queries. In particular, we did not even touch the topic of aggregations in this article. For all of this, see the examples section of the Logica open source repository.

We also hope that some of the readers consider contributing to Logica development. That’s what open source is all about!

By Konstantin Tretyakov and Evgeny Skvortsov – Logica Open Source Project

Announcing the First Group of Google Open Source Peer Bonus winners in 2021!

Thursday, April 8, 2021

 

The Google Open Source Peer Bonus program is designed to reward external open source contributors nominated by Googlers for their exceptional contributions to open source. We are very excited to announce our first group of winners in 2021!

Our current winners have contributed to a wide range of projects including Apache Beam, Kubernetes, Tekton and many others. We reward open source enthusiasts not only for their code contributions, but also community work, documentation, mentorships and other types of engagement.

We have award recipients from 25 countries all over the world: Austria, Canada, China, Cyprus, Denmark, Finland, France, Germany, India, Isle of Man, Italy, Japan, Korea, Netherlands, Norway, Russia, Singapore, Spain, Sweden, Switzerland, Taiwan, Uganda, Ukraine, United Kingdom, and the United States.

Open source encourages innovation through collaboration, and our modern world and the technology we rely on wouldn’t be the same without you, the contributors, who are in many cases volunteers. We would like to thank you for your hard work and congratulate you on receiving this award!

Below is the list of current winners who gave us permission to thank them publicly:

Winner | Project
Kashyap Jois | Android FHIR SDK
David Allison | AnkiDroid
Chad Dombrova | Apache Beam
Jeff Klukas | Apache Beam
Steve Niemitz | Apache Beam
Yoshiki Obata | Apache Beam
Jaskirat Singh | CHAOSS - Community Health Analytics Open Source Software
Eric Amorde | CocoaPods
Subrata Banik | coreboot
Ned Batchelder | Coverage.py & related CPython internals
Matthew Bryant | CursedChrome
Simon Legner | devdocs.io
Dmitry Gutov | Emacs/company-mode
Brian Jost | Firebase
Joe Hinkle | Firebase iOS SDK
Lorenzo Fiamigo | Firebase iOS SDK
Mike Gerasymenko | Firebase iOS SDK
Morten Bek Ditlevsen | Firebase iOS SDK
Angel Pons | Flashrom
Ole André Vadla Ravnås | Frida
Junegunn Choi | fzf
Alex Saveau | Gradle Play Publisher
Nate Graham | KDE
Amit Sagtani | KDE Community
Niklas Hansson | Kubeflow Pipelines
William Teo | Kubeflow Pipelines
Antonio Ojea | Kubernetes
Dan Mangum | Kubernetes
Jian Zeng | Kubernetes
Darrell Commander | libjpeg-turbo
James (purpleidea) | mgmt
Kareem Ergawy | MLIR
Lily Ballard | Nix / Fish
Eelco Dolstra | Nix, NixOS, Nixpkgs
Samuel Dionne-Riel | NixOS
Dmitry Demensky | Open source TypeScript definitions for Google Maps Platform
Kay Williams | OpenSSF
Hassan Kibirige | plotnine
Henry Schreiner | pybind11
Paul Moore | Python 'pip' project
Tzu-ping Chung | Python 'pip' project
Alex Grönholm | Python 'wheel' project
Ramon Santamaria | raylib
Alexander Weiss | restic
Michael Eischer | restic
Ben Lesh | rxjs
Takeshi Nakatani | s3fs
Daniel Wee Soong Lim | SymbiFlow
Unai Martinez-Corral | SymbiFlow, Surelog, Verible, more
Andrea Frittoli | Tekton
Priti Desai | Tekton
Vincent Demeester | Tekton
Chengyu Zhang | testsmt & testsmt/yinyang
Dominik Winterer | testsmt & testsmt/yinyang
Tom Rini | U-Boot

Thank you for your contributions to open source!

By Maria Tabak — Google Open Source Programs Office

Analyzing genomic data in families with deep learning

Wednesday, April 7, 2021

The Genomics team at Google Health is excited to share our latest expansion to DeepVariant: DeepTrio.

First released in 2017, DeepVariant is an open source tool that enables researchers and clinicians to analyze an individual’s genome sequencing data and identify genetic variants, such as those that may cause disease. Our continued work on DeepVariant has been recognized for its top-of-class accuracy. With DeepTrio, we have expanded DeepVariant to be able to consider the genetic variants in the sequence data of a mother-father-child trio.

Humans are diploid organisms, carrying two copies of the human genome. Every individual inherits one copy of the genome from their mother, and the other from their father. Parental inheritance informs analysis of traits and diseases that follow Mendelian inheritance. DeepTrio learns to use the properties of Mendelian inheritance directly from sequencing data in order to more accurately identify genetic variants in cases when parent and child samples can be co-analyzed.

Modifying DeepVariant to analyze trio samples

DeepVariant learns to classify positions in a genome as reference or variant using representations of data similar to the “genome browser” which experts use in analysis. “Improving the Accuracy of Genomic Analysis with DeepVariant 1.0” provides a good overview.

DeepVariant receives data as a window of the genome centered on a candidate variant which it is asked to classify as either reference (no variant), heterozygous (one copy of a variant) or homozygous (both copies are variant). DeepVariant sees the sequence evidence as channels representing features of the data (see: “Looking through DeepVariant’s eyes” for a deeper explanation).

For DeepTrio, we modified DeepVariant to represent the sequence data from a trio in a single image, with a fixed height for each sample and the child in the middle. Using gold standard samples from NIST Genome in a Bottle for truth labels, we train one model to call variants in the child and another to call variants in the top parent. To call both parents, we flip the position of the parent samples.

Figure 1. (top) An image of 4 of the channels that DeepTrio uses in classification (these and 4 other channels are shown in a stack). (bottom) Conceptual schematic of how trio files are used to create examples, which are then called by DeepTrio.

Measuring DeepTrio’s improved accuracy

We show that DeepTrio is more accurate than DeepVariant for both parent and child variant detection, with an especially pronounced advantage at lower coverages. This enables researchers to either analyze samples at higher accuracy, or to maintain comparable accuracy at a substantially reduced expense.

To assess the accuracy of DeepTrio, we compare its accuracy to DeepVariant using extensively characterized gold standards made available by NIST Genome in a Bottle. In order to have an evaluation dataset which is never seen in training, we exclude chromosome 20 from training and perform evaluations on chromosome 20.
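The hold-out scheme described here amounts to partitioning labeled examples by chromosome. The Go sketch below shows that partition under simple assumptions: the `Example` type and the `splitByChrom` function are hypothetical and are not taken from the DeepVariant or DeepTrio code.

```go
package main

import "fmt"

// Example is a hypothetical labeled example tagged with the chromosome
// it was drawn from.
type Example struct {
	Chrom string
	Pos   int
}

// splitByChrom puts every example on the held-out chromosome into eval
// and everything else into train, so evaluation data is never trained on.
func splitByChrom(all []Example, heldOut string) (train, eval []Example) {
	for _, e := range all {
		if e.Chrom == heldOut {
			eval = append(eval, e)
		} else {
			train = append(train, e)
		}
	}
	return train, eval
}

func main() {
	all := []Example{{"chr1", 100}, {"chr20", 200}, {"chr2", 300}, {"chr20", 400}}
	train, eval := splitByChrom(all, "chr20")
	fmt.Println(len(train), len(eval)) // 2 2
}
```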

We train DeepVariant and DeepTrio for sequencing data from two different instruments, Illumina and Pacific Biosciences (PacBio). For more information on the differences between these technologies, please see our previous blog. These sequencers both randomly sample the genome in an error-prone manner. To accurately analyze a genome, the same region needs to be sampled repeatedly. The depth of sampling at a position is called coverage. The cost of sequencing grows approximately linearly with coverage. This often forces trade-offs between cost, accuracy, and the number of samples sequenced. As a result, in trios parents are often sequenced at lower depth.
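To make the notion of coverage concrete, here is a small Go sketch that counts how many reads overlap a given position. It assumes a deliberately simplified model in which a read is just a half-open interval on one chromosome; the `Read` type and `coverageAt` function are illustrative names, not part of any sequencing toolkit.

```go
package main

import "fmt"

// Read is a simplified aligned read covering the half-open interval
// [Start, End) on a single chromosome.
type Read struct {
	Start, End int
}

// coverageAt counts how many reads overlap the given position.
func coverageAt(reads []Read, pos int) int {
	n := 0
	for _, r := range reads {
		if r.Start <= pos && pos < r.End {
			n++
		}
	}
	return n
}

func main() {
	reads := []Read{{0, 100}, {50, 150}, {90, 190}, {120, 220}}
	fmt.Println(coverageAt(reads, 95))  // 3
	fmt.Println(coverageAt(reads, 130)) // 3
	fmt.Println(coverageAt(reads, 200)) // 1
}
```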

In the charts below, we plot the accuracy of DeepTrio and DeepVariant across a range of coverages.

Figure 2. F1-score for DeepTrio (solid line) and DeepVariant (dashed line) on a child sample (top) and a parent sample (bottom), sequenced with an Illumina (blue) and PacBio (black) instrument. F1 is measured for all types of small variants on chromosome 20, across samples with a range of sequencing coverage (x-axis).
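For reference, the F1-score is the harmonic mean of precision and recall. The Go sketch below computes it from counts of true positives, false positives, and false negatives. The counts in `main` are made up for illustration and are not results from this post.

```go
package main

import "fmt"

// f1 returns the harmonic mean of precision and recall, given counts of
// true positives (tp), false positives (fp), and false negatives (fn).
func f1(tp, fp, fn int) float64 {
	if tp == 0 {
		return 0 // avoids dividing by zero when nothing was called correctly
	}
	precision := float64(tp) / float64(tp+fp)
	recall := float64(tp) / float64(tp+fn)
	return 2 * precision * recall / (precision + recall)
}

func main() {
	// Made-up counts: precision 0.90, recall 0.75.
	fmt.Printf("%.4f\n", f1(90, 10, 30)) // 0.8182
}
```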

DeepTrio’s performance on de novo variants

Each individual has roughly 5 million variants relative to the human reference genome. The overwhelming majority of these are inherited from their parents. A small number, around 100, are new (referred to as de novo), due to copying errors during DNA replication. We demonstrate that DeepTrio substantially reduces false positives for de novo variants. For Illumina data, this comes with a smaller decrease in recovery of true positives, while for PacBio data, this trade-off does not occur.

To assess accuracy we analyzed sites where both parents are called as non-variant, but the child is called as heterozygous variant. We observe that DeepTrio is more reluctant to call a variant as de novo, which is similar to how a human would require a higher level of evidence for sites violating Mendelian inheritance. This results in a much lower false positive rate for these de novo variants, but a slightly lower recall rate in DeepTrio Illumina. Usually when this occurs, the child is still called as a variant, but the parents are given “no-call” (the classifier is not confident enough to make a call).
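The site selection described above can be written down as a simple filter over trio calls. The following Go sketch is an assumption-laden illustration: the `Genotype` and `TrioCall` types are invented for this example, and the “no-call” outcome mentioned above is not modeled.

```go
package main

import "fmt"

// Genotype is the three-way call described earlier in the post.
type Genotype int

const (
	Ref Genotype = iota // reference: no variant
	Het                 // heterozygous: one copy of a variant
	Hom                 // homozygous: both copies are variant
)

// TrioCall holds the calls at one site across a trio (hypothetical type).
type TrioCall struct {
	Pos                   int
	Child, Mother, Father Genotype
}

// isDeNovoCandidate reports whether both parents are called as reference
// while the child is called as a heterozygous variant.
func isDeNovoCandidate(c TrioCall) bool {
	return c.Mother == Ref && c.Father == Ref && c.Child == Het
}

func main() {
	calls := []TrioCall{
		{Pos: 101, Child: Het, Mother: Het, Father: Ref}, // consistent with inheritance
		{Pos: 202, Child: Het, Mother: Ref, Father: Ref}, // de novo candidate
		{Pos: 303, Child: Hom, Mother: Het, Father: Het}, // consistent with inheritance
	}
	for _, c := range calls {
		if isDeNovoCandidate(c) {
			fmt.Println(c.Pos) // prints 202
		}
	}
}
```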

Figure 3. Accuracy on de novo calls (child heterozygous variant, parents reference call) for recall of true de novo events (top) and false positive de novo events (bottom) for DeepTrio (solid line) and DeepVariant (dashed line) on Illumina (blue) and PacBio (black). Accuracy is measured on chromosome 20, across samples with a range of sequencing coverage (x-axis).

Contributing to rare disease research

By releasing DeepTrio as open source software, we hope to improve analysis of genomic data by allowing scientists to more accurately analyze samples. We hope this will enable research and clinical pipelines, leading to better resolution of rare disease cases and improved development of therapeutics.

In addition to the release of DeepTrio’s code as open source, we have also released the sequencing data that we generated in order to train these models. That data is described in our pre-print “An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development”. By releasing both this production model, and the data required to train models of similar complexity, we hope to contribute to methods development by the genomics community.

By Andrew Carroll, Product Lead Genomics and Howard Yang, Program Manager Genomics — Google Health
