Open Sourcing the Hunt for Exoplanets

Friday, March 9, 2018



Recently, we discovered two exoplanets by training a neural network to analyze data from NASA’s Kepler space telescope and accurately identify the most promising planet signals. And while this was only an initial analysis of ~700 stars, we consider this a successful proof-of-concept for using machine learning to discover exoplanets, and more generally another example of using machine learning to make meaningful gains in a variety of scientific disciplines (e.g. healthcare, quantum chemistry, and fusion research).

Today, we’re excited to release our code for processing the Kepler data, training our neural network model, and making predictions about new candidate signals. We hope this release will prove a useful starting point for developing similar models for other NASA missions, like K2 (Kepler’s second mission) and the upcoming Transiting Exoplanet Survey Satellite mission. As well as announcing the release of our code, we’d also like to take this opportunity to dig a bit deeper into how our model works.

A Planet Hunting Primer

First, let’s consider how data collected by the Kepler telescope is used to detect the presence of a planet. The plot below is called a light curve, and it shows the brightness of the star (as measured by Kepler’s photometer) over time. When a planet passes in front of the star, it temporarily blocks some of the light, which causes the measured brightness to decrease and then increase again shortly thereafter, causing a “U-shaped” dip in the light curve.
A light curve from the Kepler space telescope with a “U-shaped” dip that indicates a transiting exoplanet.
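
To make the transit idea concrete, here is a minimal sketch that fakes such a signal with synthetic data (all numbers are illustrative, not Kepler measurements):

import numpy as np

# Synthetic light curve: constant brightness plus photometric noise.
time = np.linspace(0.0, 30.0, 3000)                  # days
flux = 1.0 + np.random.normal(0.0, 1e-4, time.size)  # normalized brightness

# A transiting planet blocks a small, fixed fraction of the starlight once
# per orbit, carving a periodic dip into the light curve.
period, t0, duration, depth = 7.0, 2.5, 0.2, 5e-4    # days, days, days, fraction
phase = (time - t0) % period
in_transit = (phase < duration / 2) | (phase > period - duration / 2)
flux[in_transit] -= depth
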
However, other astronomical and instrumental phenomena can also cause the measured brightness of a star to decrease, including binary star systems, starspots, cosmic ray hits on Kepler’s photometer, and instrumental noise.
The first light curve has a “V-shaped” pattern that tells us that a very large object (i.e. another star) passed in front of the star that Kepler was observing. The second light curve contains two places where the brightness decreases, which indicates a binary system with one bright and one dim star: the larger dip is caused by the dimmer star passing in front of the brighter star, and vice versa. The third light curve is one example of the many other non-planet signals where the measured brightness of a star appears to decrease.
To search for planets in Kepler data, scientists use automated software (e.g. the Kepler data processing pipeline) to detect signals that might be caused by planets, and then manually follow up to decide whether each signal is a planet or a false positive. To avoid being overwhelmed with more signals than they can manage, the scientists apply a cutoff to the automated detections: those with signal-to-noise ratios above a fixed threshold are deemed worthy of follow-up analysis, while all detections below the threshold are discarded. Even with this cutoff, the number of detections is still formidable: to date, over 30,000 detected Kepler signals have been manually examined, and about 2,500 of those have been validated as actual planets!
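
To make the cutoff concrete, here is a rough sketch of filtering detections by an approximate signal-to-noise ratio. The detections themselves are made up; 7.1 is the threshold used by the Kepler pipeline.

import numpy as np

def transit_snr(depth, noise_per_point, points_in_transit):
    # Rough signal-to-noise of a transit: dip depth relative to the per-point
    # photometric scatter, boosted by the number of in-transit points.
    return depth / noise_per_point * np.sqrt(points_in_transit)

# Hypothetical detections: (depth, per-point noise, number of in-transit points).
detections = [(5e-4, 1e-4, 120), (8e-5, 1e-4, 40), (2e-4, 1e-4, 300)]

THRESHOLD = 7.1  # the signal-to-noise cutoff used by the Kepler pipeline
worth_following_up = [d for d in detections if transit_snr(*d) >= THRESHOLD]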

Perhaps you’re wondering: does the signal-to-noise cutoff cause some real planet signals to be missed? The answer is, yes! However, if astronomers need to manually follow up on every detection, it’s not really worthwhile to lower the threshold, because as the threshold decreases the rate of false positive detections increases rapidly and actual planet detections become increasingly rare. However, there’s a tantalizing incentive: it’s possible that some potentially habitable planets like Earth, which are relatively small and orbit around relatively dim stars, might be hiding just below the traditional detection threshold — there might be hidden gems still undiscovered in the Kepler data!

A Machine Learning Approach

The Google Brain team applies machine learning to a diverse variety of data, from human genomes to sketches to formal mathematical logic. Considering the massive amount of data collected by the Kepler telescope, we wondered what we might find if we used machine learning to analyze some of the previously unexplored Kepler data. To find out, we teamed up with Andrew Vanderburg at UT Austin and developed a neural network to help search the low signal-to-noise detections for planets.
We trained a convolutional neural network (CNN) to predict the probability that a given Kepler signal is caused by a planet. We chose a CNN because they have been very successful in other problems with spatial and/or temporal structure, like audio generation and image classification.
Luckily, we had 30,000 Kepler signals that had already been manually examined and classified by humans. We used a subset of around 15,000 of these signals, of which around 3,500 were verified planets or strong planet candidates, to train our neural network to distinguish planets from false positives. The inputs to our network are two separate views of the same light curve: a wide view that allows the model to examine signals elsewhere on the light curve (e.g., a secondary signal caused by a binary star), and a zoomed-in view that enables the model to closely examine the shape of the detected signal (e.g., to distinguish “U-shaped” signals from “V-shaped” signals).
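
Below is a minimal sketch of this two-view idea as a 1D convolutional network in Keras. It is not the released model (the real architecture, input lengths, and training setup are in the code we’re open sourcing today); the layer sizes here are purely illustrative.

import tensorflow as tf
from tensorflow.keras import layers

# Two views of the same light curve: the full curve and a zoom on the transit.
global_view = tf.keras.Input(shape=(2001, 1), name="global_view")
local_view = tf.keras.Input(shape=(201, 1), name="local_view")

def conv_column(x, filters):
    # A small stack of 1D convolutions and poolings for one view.
    for f in filters:
        x = layers.Conv1D(f, 5, activation="relu", padding="same")(x)
        x = layers.MaxPooling1D(5, strides=2)(x)
    return layers.Flatten()(x)

g = conv_column(global_view, [16, 32, 64])
l = conv_column(local_view, [16, 32])

# Combine both views and predict the probability that the signal is a planet.
x = layers.concatenate([g, l])
x = layers.Dense(512, activation="relu")(x)
output = layers.Dense(1, activation="sigmoid", name="planet_probability")(x)

model = tf.keras.Model([global_view, local_view], output)
model.compile(optimizer="adam", loss="binary_crossentropy")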

Once we had trained our model, we investigated the features it learned about light curves to see if they matched our expectations. One technique we used (originally suggested in this paper) was to systematically occlude small regions of the input light curves to see whether the model’s output changed. Regions that are particularly important to the model’s decision will change the output prediction if they are occluded, but occluding unimportant regions will not have a significant effect. Below is a light curve from a binary star that our model correctly predicts is not a planet. The points highlighted in green are the points that most change the model’s output prediction when occluded, and they correspond exactly to the secondary “dip” indicative of a binary system. When those points are occluded, the model’s output prediction changes from ~0% probability of being a planet to ~40% probability of being a planet. So, those points are part of the reason the model rejects this light curve, but the model uses other evidence as well; for example, zooming in on the centered primary dip shows that it’s actually “V-shaped”, which is also indicative of a binary system.
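
The occlusion technique itself is straightforward to sketch. Assuming a trained model wrapped in a hypothetical predict_fn that maps a light curve to a planet probability, it looks roughly like this:

import numpy as np

def occlusion_sensitivity(light_curve, predict_fn, window=10):
    # Baseline prediction on the unmodified light curve.
    baseline = predict_fn(light_curve)
    fill = np.median(light_curve)
    importance = np.zeros_like(light_curve, dtype=float)
    for start in range(0, len(light_curve) - window + 1):
        occluded = light_curve.copy()
        occluded[start:start + window] = fill  # mask out one small region
        # Regions whose occlusion changes the output the most mattered most.
        importance[start:start + window] += abs(predict_fn(occluded) - baseline)
    return importance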

Searching for New Planets

Once we were confident in our model’s predictions, we tested its effectiveness by searching for new planets in a small set of 670 stars. We chose these stars because they were already known to have multiple orbiting planets, and we believed that some of these stars might host additional planets that had not yet been detected. Importantly, we allowed our search to include signals that were below the signal-to-noise threshold that astronomers had previously considered. As expected, our neural network rejected most of these signals as spurious detections, but a handful of promising candidates rose to the top, including our two newly discovered planets: Kepler-90 i and Kepler-80 g.

Find your own Planet(s)!

Let’s take a look at how the code released today can help (re-)discover the planet Kepler-90 i. The first step is to train a model by following the instructions on the code’s home page. It takes a while to download and process the data from the Kepler telescope, but once that’s done, it’s relatively fast to train a model and make predictions about new signals. One way to find new signals to show the model is to use an algorithm called Box Least Squares (BLS), which searches for periodic “box shaped” dips in brightness (see below). The BLS algorithm will detect “U-shaped” planet signals, “V-shaped” binary star signals and many other types of false positive signals to show the model. There are various freely available software implementations of the BLS algorithm, including VARTOOLS and LcTools. Alternatively, you can even look for candidate planet transits by eye, like the Planet Hunters.
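
If you’d rather stay in Python, recent releases of astropy also include a BoxLeastSquares implementation. Here is a rough sketch, assuming you have already loaded and detrended a Kepler light curve into time and flux arrays elsewhere:

import numpy as np
from astropy.timeseries import BoxLeastSquares  # available in recent astropy releases

# `time` (days) and `flux` (normalized brightness) are assumed to hold a
# detrended light curve loaded elsewhere, e.g. for Kepler-90 (KIC 11442793).
bls = BoxLeastSquares(time, flux)
periods = np.linspace(1.0, 30.0, 10000)   # trial orbital periods, in days
durations = np.linspace(0.05, 0.3, 10)    # trial transit durations, in days
result = bls.power(periods, durations)

# Report the period, duration, and epoch of the strongest box-shaped signal.
best = np.argmax(result.power)
print("period:", result.period[best],
      "duration:", result.duration[best],
      "t0:", result.transit_time[best])
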
A low signal-to-noise detection in the light curve of the Kepler 90 star detected by the BLS algorithm. The detection has period 14.44912 days, duration 2.70408 hours (0.11267 days) beginning 2.2 days after 12:00 on 1/1/2009 (the year the Kepler telescope launched).
To run this detected signal through our trained model, we simply execute the following command:
python predict.py --kepler_id=11442793 --period=14.44912 --t0=2.2 \
  --duration=0.11267 --kepler_data_dir=$HOME/astronet/kepler \
  --output_image_file=$HOME/astronet/kepler-90i.png \
  --model_dir=$HOME/astronet/model
The output of the command is prediction = 0.94, which means the model is 94% certain that this signal is a real planet. Of course, this is only a small step in the overall process of discovering and validating an exoplanet: the model’s prediction is not proof one way or the other. The process of validating this signal as a real exoplanet requires significant follow-up work by an expert astronomer — see Sections 6.3 and 6.4 of our paper for the full details. In this particular case, our follow-up analysis validated this signal as a bona fide exoplanet, and it’s now called Kepler-90 i!
Our work here is far from done. We’ve only searched 670 stars out of 200,000 observed by Kepler — who knows what we might find when we turn our technique to the entire dataset. Before we do that, though, we have a few improvements we want to make to our model. As we discussed in our paper, our model is not yet as good at rejecting binary stars and instrumental false positives as some more mature computer heuristics. We’re hard at work improving our model, and now that it’s open sourced, we hope others will do the same!


By Chris Shallue, Senior Software Engineer, Google Brain Team

If you’d like to learn more, Chris is featured on the latest episode of This Week In Machine Learning & AI discussing his work.

A year full of new open source at Catrobat

Thursday, March 8, 2018

This is a guest post from Catrobat, an open source organization that participated in both Google Summer of Code and Google Code-in last year.


Catrobat was selected to participate in Google Summer of Code (GSoC) for the sixth time and Google Code-in (GCI) for the first time in 2017, which helped us reach new students and keep our mentors busy.

We tried something new in 2017 by steering GSoC students toward refactoring and performance, rather than developing new features. Implementing a crash tracking and analysis system, modularizing existing code, and rewriting our tests resulted in more lines of code being deleted than added – and we’re really happy about that!

This improved the quality and stability of our software, and both students and mentors could see progress immediately. The immediacy of the results kept students engaged; some weeks it almost seemed as if they had been working 24/7 (they weren’t :)! And we’re happy to say that most are still motivated to contribute after GSoC, and now they’re adding code more often than they are deleting it.

Although new features are exciting, we found that working on existing code offers a smooth entry for GSoC students. This approach helped students assimilate into the community and project more quickly, as well as receive rapid rewards for their work.

The quality improvements made by GSoC students also made things smoother for the younger, often less experienced GCI students. Several dozen students completed hundreds of tasks, spreading the love of open source and coding in their communities. It was our first time working with so many young contributors and it was fun!

We faced challenges in the beginning – such as language barriers and students’ uncertainty in their work – and quickly learned how to adapt our processes to meet the needs (and extraordinary motivation) of these new young contributors. We introduced them to open source through our project’s app Pocket Code, allowing them to program games and apps with a visual mobile coding framework and then share them under an open license. Students had a lot of fun starting this way and mentors enjoyed reviewing so many colorful and exciting games.

Students even asked how they could improve on quality work that we had already accepted, if they could do more work on it, and if they could share their projects with their friends. This was a great first experience of GCI for our organization and, as one of our mentors mentioned in the final evaluation phase, we would totally be up for doing it again!

By Matthias Mueller, Catrobat Org Admin

How Google uses Census internally

Wednesday, March 7, 2018

This post is the first in a series about OpenCensus, a set of open source instrumentation libraries based on what we use inside Google. This series will cover the benefits of OpenCensus for developers and vendors, Google’s interest in open sourcing instrumentation tools, how to get started with OpenCensus, and our long-term vision.

If you’re new to distributed tracing and metrics, we recommend Adrian Cole’s excellent talk on the subject: Observability Three Ways.

Gaining Observability into Planet-Scale Computing

Google adopted or invented new technologies, including distributed tracing (Dapper) and metrics processing, in order to operate some of the world’s largest web services. However, building analysis systems didn’t solve the difficult problem of instrumenting and extracting data from production services. This is what Census was created to do.

The Census project provides uniform instrumentation across most Google services, capturing trace spans, app-level metrics, and other metadata like log correlations from production applications. One of the biggest benefits of uniform instrumentation to developers inside of Google is that it’s almost entirely automatic: any service that uses gRPC automatically collects and exports basic traces and metrics.

OpenCensus offers these capabilities to developers everywhere. Today we’re sharing how we use distributed tracing and metrics inside of Google.
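
If you want to try the same kind of instrumentation in your own services, here is a minimal sketch using the opencensus-python tracing API (exporter and sampler configuration are omitted, and module paths may differ slightly between releases):

from opencensus.trace.tracer import Tracer

# Default in-process configuration; a real deployment would also configure
# a sampler and an exporter that sends spans to a tracing backend.
tracer = Tracer()

with tracer.span(name="handle_request") as span:
    span.add_attribute("customer_id", "example")  # metadata captured with the trace
    with tracer.span(name="query_database"):
        pass  # the nested work you want to see in the distributed trace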

Incident Management

When latency problems or new errors crop up in a highly distributed environment, visibility into what’s happening is critical. For example, when the latency of a service crosses expected boundaries, we can view distributed traces in Dapper to find where things are slowing down. Or when a request is returning an error, we can look at the chain of calls that led to the error and examine the metadata captured during a trace (typically logs or trace annotations). This is effectively a bigger stack trace. In rare cases, we enable custom trigger-based sampling which allows us to focus on specific kinds of requests.

Once we know there’s a production issue, we can use Census data to determine the regions, services, and scope (one customer vs many) of a given problem. You can use service-specific diagnostics pages, called “z-pages,” to monitor problems and the results of solutions you deploy. These pages are hosted locally on each service and provide a firehose view of recent requests, stats, and other performance-related information.

Performance Optimization

At Google’s scale, we need to be able to instrument and attribute costs for services. We use Census to help us answer questions like:
  • How much CPU time does my query consume?
  • Does my feature consume more storage resources than before?
  • What is the cost of a particular user operation at a particular layer of the stack?
  • What is the total cost of a particular user operation across all layers of the stack?
We’re obsessed with reducing the tail latency of all services, so we’ve built sophisticated analysis systems that process traces and metrics captured by Census to identify regressions and other anomalies.

Quality of Service

Google also improves performance dynamically depending on the source and type of traffic. Using Census tags, traffic can be directed to more appropriate shards, or we can do things like load shedding and rate limiting.
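
In OpenCensus these tags are plain key-value pairs that travel with a request. The sketch below uses the opencensus-python tags API; the key names are made up, and the exact module layout may vary between releases:

from opencensus.tags import tag_key, tag_map, tag_value

# Tags describing where a request came from and what kind of work it is;
# downstream systems can shard, shed, or rate-limit traffic based on them.
key_origin = tag_key.TagKey("origin")
key_type = tag_key.TagKey("request_type")

tags = tag_map.TagMap()
tags.insert(key_origin, tag_value.TagValue("mobile"))
tags.insert(key_type, tag_value.TagValue("batch"))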

Next week we’ll discuss Google’s motivations for open sourcing Census, then we’ll shift the focus back onto the open source project itself.

By Pritam Shah and Morgan McLean, Census team

The Building Blocks of Interpretability

Tuesday, March 6, 2018

Cross-posted on the Google Research Blog.

In 2015, our early attempts to visualize how neural networks understand images led to psychedelic images. Soon after, we open sourced our code as DeepDream and it grew into a small art movement producing all sorts of amazing things. But we also continued the original line of research behind DeepDream, trying to address one of the most exciting questions in Deep Learning: how do neural networks do what they do?

Last year in the online journal Distill, we demonstrated how those same techniques could show what individual neurons in a network do, rather than just what is “interesting to the network” as in DeepDream. This allowed us to see how neurons in the middle of the network are detectors for all sorts of things — buttons, patches of cloth, buildings — and see how those build up to be more and more sophisticated over the network’s layers.
Visualizations of neurons in GoogLeNet. Neurons in higher layers represent higher level ideas.
While visualizing neurons is exciting, our work last year was missing something important: how do these neurons actually connect to what the network does in practice?

Today, we’re excited to publish “The Building Blocks of Interpretability,” a new Distill article exploring how feature visualization can be combined with other interpretability techniques to understand aspects of how networks make decisions. We show that these combinations allow us to “stand in the middle of a neural network” and see some of the decisions being made at that point, and how they influence the final output. For example, we can see how a network detects a floppy ear, and how that detection increases the probability it gives to the image being a “Labrador retriever” or “beagle”.

We explore techniques for understanding which neurons fire in the network. Normally, if we ask which neurons fire, we get something meaningless like “neuron 538 fired a little bit,” which isn’t very helpful even to experts. Our techniques make things more meaningful to humans by attaching visualizations to each neuron, so we can see things like “the floppy ear detector fired”. It’s almost a kind of MRI for neural networks.
We can also zoom out and show how the entire image was “perceived” at different layers. This allows us to really see the transition from the network detecting very simple combinations of edges, to rich textures and 3d structure, to high-level structures like ears, snouts, heads and legs.
These insights are exciting by themselves, but they become even more exciting when we can relate them to the final decision the network makes. So not only can we see that the network detected a floppy ear, but we can also see how that increases the probability of the image being a labrador retriever.
In addition to our paper, we’re also releasing Lucid, a neural network visualization library building off our work on DeepDream. It allows you to make the sort of lucid feature visualizations we see above, in addition to more artistic DeepDream images.
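
Getting started with Lucid takes only a few lines. The sketch below mirrors the quickstart in the Lucid tutorial notebook (it’s intended to run in a notebook environment); the layer and channel chosen are just an example:

# Feature visualization with Lucid: optimize an input image that maximally
# activates a chosen channel of GoogLeNet (InceptionV1).
import lucid.modelzoo.vision_models as models
from lucid.optvis import render

model = models.InceptionV1()
model.load_graphdef()

# Visualize channel 476 of the "mixed4a" layer (an arbitrary choice).
render.render_vis(model, "mixed4a_pre_relu:476")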

We’re also releasing colab notebooks. These notebooks make it extremely easy to use Lucid to reproduce visualizations in our article! Just open the notebook, click a button to run code — no setup required!
In colab notebooks you can click a button to run code, and see the result below.
This work only scratches the surface of the kind of interfaces that we think it’s possible to build for understanding neural networks. We’re excited to see what the community will do — and we’re excited to work together towards deeper human understanding of neural networks.

By Chris Olah, Research Scientist and Arvind Satyanarayan, Visiting Researcher, Google Brain Team

Congratulating the latest Open Source Peer Bonus winners

Thursday, March 1, 2018

We’re pleased to introduce 2018’s first round of Open Source Peer Bonus winners. First started by the Google Open Source team seven years ago, this program encourages Google employees to express their gratitude to open source contributors.

Twice a year Googlers nominate open source contributors outside of the company for their contributions to open source projects, including those used by Google. Nominees are reviewed by a team of volunteers and the winners receive our heartfelt thanks with a token of our appreciation.

So far more than 600 contributors from dozens of countries have received Open Source Peer Bonuses for volunteering their time and talent to over 400 open source projects. You can find some of the previous winners in these blog posts.

We’d like to recognize the latest round of winners and the projects they worked on. Listed below are the individuals who gave us permission to thank them publicly:

Adrien Devresse – Abseil C++
Weston Ruter – AMP Plugin for WordPress
Thierry Muller – AMP Plugin for WordPress
Adam Silverstein – AMP Project
Levi Durfee – AMP Project
Fabian Wiles – Angular
Paul King – Apache Groovy
Eric Eide – C-Reduce
John Regehr – C-Reduce
Yang Chen – C-Reduce
Ajith Kumar Velutheri – Chromium
Orta Therox – CocoaPods
Idwer Vollering – coreboot
Paul Ganssle – dateutil
Zach Leatherman – Eleventy
Daniel Stone – freedesktop.org
Sergiu Deitsch – glog
Jonathan Bluett-Duncan – Guava
Karol Szczepański – Heracles.ts
Paulus Schoutsen – Home Assistant
Nathaniel Welch – Fog for Google
Shannon Coen – Istio
Max Beatty – jsPerf
Friedel Ziegelmayer – Karma
Davanum Srinivas – Kubernetes
Jennifer Rondeau – Kubernetes
Jessica Yao – Kubernetes
Qiming Teng – Kubernetes
Zachary Corleissen – Kubernetes
Reinhard Nägele – Kubernetes Charts
Erez Shinan – Lark
Alex Gaynor – Mercurial
Anna Henningsen – Node.js
Michaël Zasso – Node.js
Michael Dalessio – Nokogiri
Gina Häußge – OctoPrint
Michael Stramel – Polymer
La Vesha Parker – Progressive HackNight
Ian Stapleton Cordasco – Python Code Quality Authority
Fabian Henneke – Secure Shell
Rob Landley – Toybox
Peter Wong – V8
Timothy Gu – Web platform & Node.js
Ola Hugosson – WebM
Dominic Symes – WebM & AOMedia

To each and every one of you: thank you for your contributions to the open source community and congratulations!

By Maria Webb, Google Open Source