GitHub on BigQuery: Analyze all the code

Wednesday, June 29, 2016

Google, in collaboration with GitHub, is releasing an incredible new open dataset on Google BigQuery. So far you've been able to monitor and analyze GitHub's pulse since 2011 (thanks GitHub Archive project!) and today we're adding the perfect complement to this. What could you do if you had access to analyze all the open source software in the world, with just one SQL command?

The Google BigQuery Public Datasets program now offers a full snapshot of the content of more than 2.8 million open source GitHub repositories in BigQuery. Thanks to our new collaboration with GitHub, you'll have access to analyze the source code of almost 2 billion files with a simple (or complex) SQL query. This will open the doors to all kinds of new insights and advances that we're just beginning to envision.

For example, let's say you're the author of a popular open source library. Now you'll be able to find every open source project on GitHub that's using it. Even more, you'll be able to guide the future of your project by analyzing how it's being used, and improve your APIs based on what your users are actually doing with it.
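As a rough illustration of the "find every project using my library" idea, here is a minimal Python sketch that only assembles a query string; it never contacts BigQuery. The table names (`bigquery-public-data.github_repos.files` and `.contents`), the columns (`repo_name`, `path`, `id`, `content`), the standard-SQL dialect, and the helper name `build_usage_query` are assumptions made for this example, not details taken from the announcement, so check them against the dataset's schema before running anything.

```python
# Sketch only: builds a standard-SQL query string; it does not contact BigQuery.
# Table and column names below are assumptions about the public dataset's layout.

FILES_TABLE = "bigquery-public-data.github_repos.files"
CONTENTS_TABLE = "bigquery-public-data.github_repos.contents"


def build_usage_query(import_line: str, path_suffix: str, limit: int = 100) -> str:
    """Return SQL listing repositories whose files contain `import_line`."""

    def quote(value: str) -> str:
        # Escape backslashes and single quotes so the literal stays a valid SQL string.
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

    return (
        "SELECT f.repo_name, COUNT(*) AS matching_files\n"
        f"FROM `{FILES_TABLE}` AS f\n"
        f"JOIN `{CONTENTS_TABLE}` AS c ON f.id = c.id\n"
        f"WHERE ENDS_WITH(f.path, {quote(path_suffix)})\n"
        f"  AND STRPOS(c.content, {quote(import_line)}) > 0\n"
        "GROUP BY f.repo_name\n"
        "ORDER BY matching_files DESC\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Hypothetical example: Python files that import the `requests` library.
    print(build_usage_query("import requests", ".py"))
```

The printed SQL could be pasted into the BigQuery console. A plain substring match like this will also catch commented-out imports and vendored copies, so treat the counts as an upper bound rather than a precise usage figure.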

On the security side, we've seen how the most popular open source projects benefit from having multiple eyes and hands working on them. This visibility helps projects get hardened and buggy code cleaned up. What if you could search for errors with similar patterns in every other open source project? Would you notify their authors and send them pull requests? Well, now you can.

Some concepts to keep in mind while working with BigQuery and the GitHub contents dataset:
To learn more, read GitHub's announcement and try some sample queries. Share your queries and findings in our and Hacker News posts. The ideas are endless, and I'll start collecting tips and links to other articles on this post on Medium.

More statistics from Google Summer of Code 2016

Tuesday, June 28, 2016

Google Summer of Code (GSoC) 2016 is officially at its halfway point. Mentors and students have just completed their midterm evaluations and it’s time for our second stats post. This time we take a closer look at our participating students.

First, we’d like to highlight the universities with the most student participants. Congratulations are due to the International Institute of Information Technology - Hyderabad for claiming the top spot for the third consecutive year!

| Country | School | 2016 Accepted Students | 2015 Accepted Students | 12 Year Total |
|---|---|---|---|---|
| India | International Institute of Information Technology - Hyderabad | 50 | 62 | 252 |
| Sri Lanka | University of Moratuwa | 29 | 44 | 320 |
| Romania | University POLITEHNICA of Bucharest | 24 | 14 | 155 |
| India | Birla Institute of Technology and Science Pilani, Goa Campus | 22 | 15 | 110 |
| India | Birla Institute of Technology and Science, Pilani Campus | 22 | 18 | 116 |
| India | Indian Institute of Technology, Bombay | 18 | 13 | 75 |
| India | Indian Institute of Technology, Kharagpur | 15 | 8 | 92 |
| India | Indian Institute of Technology, Roorkee | 15 | 8 | 57 |
| India | Indraprastha Institute of Information Technology Delhi | 15 | 7 | 27 |
| India | Amrita School of Engineering, Amrita University, Amritapuri Campus | 13 | 5 | 33 |
| India | Indian Institute of Technology, Guwahati | 13 | 5 | 38 |
| Cameroon | University of Buea | 12 | 10 | 26 |
| India | Delhi Technological University | 12 | 9 | 60 |
| India | Indian Institute of Technology BHU Varanasi | 12 | 12 | 37 |
| Germany | TU Munich | 11 | 7 | 45 |

Next, we are proud to announce that 2016 marks the largest number of female GSoC participants to date: 12% of accepted students are female, up 2.2% from 2015. This is good progress, but we are certain we can do better in the future to diversify our program. The Google Open Source team will continue our outreach to many organizations, for example, Grace Hopper and Black Girls Code, to increase this number even more in 2017. If you have any suggestions of organizations we should work with, please let us know in the comments.

Finally, each year we like to look at the majors of students. As expected, the most common area of study for our participants is Computer Science (approximately 78%), but this year we have a wide variety of studies including Linguistics, Law, Music Technology and Psychology. The majority of our students this year are undergraduates (67%), followed by Master's (23%) and then PhD students (9%).

Although reviewing GSoC statistics each year is great fun, we want to stress that being “first place” is not the point of the program. Our goal is to get more and more students involved in creating free and open source software. We hope Google Summer of Code encourages contributions to projects that have the potential to make a difference worldwide. Congratulations to the students from all over the globe and keep up the good work!

By Mary Radomile, Open Source Programs Office