
Posts from July 2017

Professors from Around the World Get Their Students into HFOSS

Friday, July 21, 2017

Over the last four years, instructors from around the world have gathered for the Professors’ Open Source Software Experience (POSSE) workshop to integrate open source concepts into their curricula. At each event, professors make further progress toward providing students with hands-on experience via contributions to humanitarian free and open source software (HFOSS).

This year Google was proud not only to host a workshop at our San Francisco office in April, but also to collaborate with the organizers to bring a POSSE workshop to Europe for the first time.
POSSE workshop leaders, from left to right: Clif Kussmaul (Muhlenberg College), Lori Postner (Nassau Community College), Stoney Jackson (Western New England University), Heidi Ellis (Western New England University), Greg Hislop (Drexel University), and Darci Burdge (Nassau Community College).
The workshop in Italy was led by Dr. Gregory Hislop of Drexel University and Drs. Heidi Ellis and Stoney Jackson of Western New England University. It brought together 20 instructors from Germany, Hungary, India, Italy, Macedonia, Qatar, Spain, Swaziland, the United Kingdom, and the United States, making it the most geographically diverse workshop to date!
Group photos in San Francisco, USA on April 22, 2017 (left) and Bologna, Italy on July 1, 2017 (right).
What’s next for POSSE? University instructors from institutions in the US can apply now to participate in the next workshop, November 16-18 in Raleigh, NC, and join their peers in the community of instructors weaving HFOSS into their curricula.

By Helen Hu, Google Open Source

Facets: An Open Source Visualization Tool for Machine Learning Training Data

Monday, July 17, 2017

Cross-posted on the Google Research Blog

Getting the best results out of a machine learning (ML) model requires that you truly understand your data. However, ML datasets can contain hundreds of millions of data points, each consisting of hundreds (or even thousands) of features, making it nearly impossible to understand an entire dataset in an intuitive fashion. Visualization can help unlock nuances and insights in large datasets. A picture may be worth a thousand words, but an interactive visualization can be worth even more.

Working with the PAIR initiative, we’ve released Facets, an open source visualization tool to aid in understanding and analyzing ML datasets. Facets consists of two visualizations that allow users to see a holistic picture of their data at different granularities. Get a sense of the shape of each feature of the data using Facets Overview, or explore a set of individual observations using Facets Dive. These visualizations allow you to debug your data, which, in machine learning, is as important as debugging your model. They can easily be used inside of Jupyter notebooks or embedded into webpages. In addition to the open source code, we've also created a Facets demo website, which allows anyone to visualize their own datasets directly in the browser without any software installation or setup, and without the data ever leaving their computer.
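For instance, here is a sketch of the Jupyter flow along the lines of the project’s README: it assumes the facets_overview Python helpers from the Facets repository are on your path, the facets-jupyter nbextension is installed, and train_df and test_df are pandas DataFrames you supply.

```python
import base64
from IPython.core.display import display, HTML
from generic_feature_statistics_generator import GenericFeatureStatisticsGenerator

# Compute summary statistics for the two datasets and serialize
# them into the protocol buffer the visualization consumes.
proto = GenericFeatureStatisticsGenerator().ProtoFromDataFrames(
    [{'name': 'train', 'table': train_df},
     {'name': 'test', 'table': test_df}])
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")

# Hand the serialized stats to the <facets-overview> web component.
HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
<facets-overview id="elem"></facets-overview>
<script>
  document.querySelector("#elem").protoInput = "{protostr}";
</script>"""
display(HTML(HTML_TEMPLATE.format(protostr=protostr)))
```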

Facets Overview

Facets Overview automatically gives users a quick understanding of the distribution of values across the features of their datasets. Multiple datasets, such as a training set and a test set, can be compared on the same visualization. Common data issues that can hamper machine learning are pushed to the forefront, such as unexpected feature values, features with high percentages of missing values, features with unbalanced distributions, and feature distribution skew between datasets.
Facets Overview visualization of the six numeric features of the UCI Census datasets[1]. The features are sorted by non-uniformity, with the feature with the most non-uniform distribution at the top. Numbers in red indicate possible trouble spots, in this case numeric features with a high percentage of values set to 0. The histograms at right allow you to compare the distributions between the training data (blue) and test data (orange).
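Outside of Facets, a rough pandas sketch of the kinds of per-feature checks Overview automates (the file paths are hypothetical; any train/test split with shared columns works):

```python
import pandas as pd

# Hypothetical CSVs of the UCI Census train/test split.
train = pd.read_csv("adult_train.csv")
test = pd.read_csv("adult_test.csv")

for col in train.select_dtypes(include="number").columns:
    # Missing values and suspicious zeros, per feature.
    print(f"{col}: {train[col].isna().mean():.1%} missing, "
          f"{(train[col] == 0).mean():.1%} zeros")
    # A large train/test gap in simple statistics hints at the
    # distribution skew Overview visualizes with paired histograms.
    print(f"  mean: train={train[col].mean():.2f}, test={test[col].mean():.2f}")
```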

Facets Overview visualization showing two of the nine categorical features of the UCI Census datasets[1]. The features are sorted by distribution distance, with the feature with the biggest skew between the training (blue) and test (orange) datasets at the top. Notice in the “Target” feature that the label values differ between the training and test datasets, due to a trailing period in the test set (“<=50K” vs “<=50K.”). This can be seen in the chart for the feature and also in the entries in the “top” column of the table. This label mismatch would prevent a model trained and tested on this data from being evaluated correctly.
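Once spotted, the repair is a one-liner; a minimal sketch, assuming the data is in a pandas DataFrame with the label in a “Target” column:

```python
# Strip the trailing period so test labels ("<=50K.", ">50K.")
# match the training labels ("<=50K", ">50K").
test_df["Target"] = test_df["Target"].str.rstrip(".")
```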

Facets Dive

Facets Dive provides an easy-to-customize, intuitive interface for exploring the relationships between data points across the different features of a dataset. With Facets Dive, you control the position, color, and visual representation of each data point based on its feature values. If the data points have images associated with them, the images can be used as the visual representations.
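Embedding Dive in a notebook mirrors the Overview flow above; a sketch following the project’s README, with df an assumed pandas DataFrame and the same nbextension installed:

```python
from IPython.core.display import display, HTML

# Dive consumes the data as a JSON array of records.
jsonstr = df.to_json(orient="records")

HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
<facets-dive id="elem" height="600"></facets-dive>
<script>
  document.querySelector("#elem").data = {jsonstr};
</script>"""
display(HTML(HTML_TEMPLATE.format(jsonstr=jsonstr)))
```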
Facets Dive visualization showing all 16,281 data points in the UCI Census test dataset[1]. The animation shows a user coloring the data points by one feature (“Relationship”), faceting in one dimension by a continuous feature (“Age”) and then faceting in another dimension by a discrete feature (“Marital Status”).
Facets Dive visualization of a large number of face drawings from the “Quick, Draw!” dataset, showing the relationship between the number of strokes and points in the drawings and the ability of the “Quick, Draw!” classifier to correctly categorize them as faces.

Fun Fact: In large datasets, such as the CIFAR-10 dataset[2], a small human labeling error can easily go unnoticed. We inspected the CIFAR-10 dataset with Dive and were able to catch a frog-cat: an image of a frog that had been incorrectly labeled as a cat!
Exploration of the CIFAR-10 dataset using Facets Dive. Here we facet the ground truth labels by row and the predicted labels by column. This produces a confusion matrix view, allowing us to drill into particular kinds of misclassifications. In this particular case, the ML model incorrectly labels a small percentage of true cats as frogs. Putting the real images in the confusion matrix reveals something interesting: one of these “true cats” that the model predicted to be a frog is, on visual inspection, actually a frog. With Facets Dive, we can determine that this wasn’t a true misclassification by the model, but incorrectly labeled data in the dataset.
Can you spot the frog-cat?
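The same confusion-matrix drill-down can be sketched outside Facets; a minimal example using scikit-learn, where y_true and y_pred are assumed to be the CIFAR-10 test labels and a trained model’s predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# The ten CIFAR-10 classes, in their conventional order.
classes = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]

# y_true and y_pred are assumed: one integer class id per test image.
cm = confusion_matrix(y_true, y_pred)

# Drill into one cell: images labeled "cat" but predicted "frog",
# the cell where the mislabeled frog-cat was hiding.
cat, frog = classes.index("cat"), classes.index("frog")
suspects = np.flatnonzero((np.asarray(y_true) == cat) &
                          (np.asarray(y_pred) == frog))
print(f"{cm[cat, frog]} cat-labeled images predicted as frog:", suspects)
```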
We’ve gotten great value out of Facets inside of Google and are excited to share the visualizations with the world. We hope they can help you discover new and interesting things about your data that lead you to create more powerful and accurate machine learning models. And since they are open source, you can customize the visualizations for your specific needs or contribute to the project to help us all better understand our data. If you have feedback about your experience with Facets, please let us know what you think.

By James Wexler, Senior Software Engineer, Google Big Picture Team

Acknowledgments

This work is a collaboration between Mahima Pushkarna, James Wexler and Jimbo Wilson, with input from the entire Big Picture team. We would also like to thank Justine Tunney for providing us with the build tooling.

References

[1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/datasets/Census+Income]. Irvine, CA: University of California, School of Information and Computer Science.

[2] Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images.

After a "close call," a coding champion

Thursday, July 13, 2017

Cross-posted on The Keyword

Eighteen-year-old Cameroon resident Nji Collins had just put the finishing touches on his final submission for the Google Code-in competition when his entire town lost internet access. It stayed dark for two months.

“That was a really, really close call,” Nji, who prefers to be called Collins, tells The Keyword, adding that he traveled to a neighboring town every day to check his email and the status of the contest. “It was stressful.”

Google’s annual Code-in contest, an effort to introduce teenagers to the world of open source, invites high school students from around the world to compete. It’s part of our mission to encourage and inspire the next generation of computer scientists, and in turn, the contest allows these young people to play a role in building real technologies.

Over the course of the competition, participants complete open source coding and design “tasks” administered by an array of organizations like Wikimedia and OpenMRS. Tasks range from editing webpages to updating databases to making videos; one of Collins’ favorites, for example, was making the OpenMRS home page sensitive to keystrokes. This year, more than 1,300 entrants from 62 countries completed nearly 6,400 assignments.

While Google sponsors and runs the contest, the participating organizations, which work most closely with the students, choose the winners. Those who finish the most tasks are named finalists, and the organizations each select two winners from that group. Those winners are then flown to San Francisco, CA, for an action-packed week involving talks at the Googleplex in Mountain View, office tours, Segway journeys through the city, and a sunset cruise on the SF Bay.
Collins with some of the other winners from Google Code-in 2016
“It’s really fun to watch these kids come together and thrive,” says Stephanie Taylor, Code-in’s program manager. “Bringing together students from, say, Thailand and Poland because they have something in common: a shared love of computer science. Lifelong friendships are formed on these trips.”

Indeed, many Code-in winners say the community is their main motivator for joining the competition. “The people are what brought me here and keep me here,” says Sushain Cherivirala, a Carnegie Mellon computer science major and former Code-in winner who now serves as a program mentor. Mentors work with Code-in participants throughout the competition to help them complete tasks and interface with the organizations.
Google Code-in winners on the Google campus
Code-in also acts as an accessible introduction to computer science and the open source world. Mira Yang, a 17-year-old from New Jersey, learned how to code for the first time this year. She says she never would have considered studying computer science before she dabbled in a few Code-in tasks. Now, she plans to major in it.
Google Code-in winners Nji Collins and Mira Yang

“Code-in changed my view on computer sciences,” she says. “I was able to learn that I can do this. There’s definitely a stigma for girls in CS. But I found out that people will support you, and there’s a huge network out there.”

That network extended to Cameroon, where Collins’ patience and persistence paid off as he waited out his town’s internet blackout. One afternoon, while checking his email a few towns away, he discovered he’d been named a Code-In winner. He had been a finalist the year prior, when he was the only student from his school to compete. This year, he’d convinced a handful of classmates to join in.

“It wasn’t fun doing it alone; I like competition,” Collins, who learned how to code by doing his older sister’s computer science homework assignments alongside her, says. “It pushes me to work harder.”

Learn more about the annual Code-in competition.

By Carly Schwartz, Editor-in-Chief, Google Internal News