RProtoBuf & HistogramTools: Statistical Analysis Tools for Large Data Sets

Thursday, October 10, 2013

At Google, building, managing, and securing some of the world’s largest storage systems requires complex analysis of filesystem metadata. This analysis is an important part of making sure that the information stored within those systems is quickly accessible and always secure. We're always looking for ways to make our data storage systems more efficient, and this often requires understanding the age, size, and access patterns of the stored data, the failure rates of servers and disks, and more. You can imagine how complex this becomes with each new data center added.

Given the number of files and servers that are relevant for this performance analysis, we bin the metadata into a compact histogram form. We use these output histograms for many purposes, such as (i) building Markov models of data availability, (ii) statistical forecasting of resource usage, and (iii) formulating and solving optimization problems to determine optimal allocation of flash devices.
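To make the binning step concrete, here is a minimal sketch of reducing raw per-file values to bucket counts. It is written in plain Python purely for illustration; the function name `bin_values`, the sample file sizes, and the power-of-two bucket boundaries are all assumptions made for this example and are not taken from our internal tooling or from the R packages described below.

```python
from bisect import bisect_right


def bin_values(values, breaks):
    """Count how many values fall in each half-open bucket [breaks[i], breaks[i+1]).

    `breaks` must be sorted in increasing order. Hypothetical helper for illustration.
    """
    counts = [0] * (len(breaks) - 1)
    for v in values:
        if not breaks[0] <= v < breaks[-1]:
            raise ValueError(f"value {v} is outside the histogram range")
        # bisect_right returns the index of the first break strictly greater
        # than v, so the bucket containing v is one position to the left.
        counts[bisect_right(breaks, v) - 1] += 1
    return counts


# Made-up file sizes in bytes and made-up bucket boundaries.
file_sizes = [512, 900, 1500, 3000, 3500, 7000, 100, 4096]
breaks = [0, 1024, 2048, 4096, 8192]

counts = bin_values(file_sizes, breaks)
print(counts)
```

Only the boundaries and the counts need to be kept afterwards, so the size of the summary depends on the number of buckets rather than on the number of files.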

We rely on several open source tools to make our work easier. The most common tool we use for statistical analysis of the performance, availability, and resource needs of our internal systems is the R programming language. We’ve released two package updates that make R particularly suitable for interacting with distributed systems.
  • RProtoBuf is an R package for Google’s Protocol Buffer library that allows one to define simple data structures with intuitive getter and setter methods. These data structures can be serialized into an extremely compact format for sending to other distributed systems. Recent releases include improved support for 64-bit integers, protocol buffer extensions, and more.
  • HistogramTools is a new R package I have released that uses RProtoBuf to read in a compact protocol buffer representation of binned data and includes a number of helpful functions for manipulating, plotting, and measuring the statistical information loss due to the binning. In addition to protocol buffers, it also supports importing aggregate performance data directly from DTrace output.
Both packages are available on CRAN and include extensive documentation and examples.
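One way to reason about the information lost by binning is to note that the true cumulative distribution must lie between two bounds: one that places every observation at the left edge of its bucket, and one that places every observation at the right edge. The sketch below, in plain Python, computes the largest vertical gap between those two bounds and the normalized area between them. The function name `cdf_bounds_gap` and the sample histogram are assumptions for illustration only; this is not the HistogramTools API, and the package's own measures may be defined differently, so consult its documentation for the exact definitions.

```python
def cdf_bounds_gap(counts, breaks):
    """Return (max_gap, area) between the upper and lower bounds on the CDF.

    Within bucket i the two bounds differ by exactly counts[i] / total,
    because the upper bound has already counted the bucket's observations
    (assumed at the left edge) and the lower bound has not (assumed at the
    right edge). Hypothetical helper for illustration.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    # Largest vertical distance between the bounds: the fullest bucket.
    max_gap = max(counts) / total
    # Area between the bounds, normalized by the histogram's range so that
    # the result lies between 0 and 1.
    span = breaks[-1] - breaks[0]
    area = sum(
        (c / total) * (hi - lo)
        for c, lo, hi in zip(counts, breaks, breaks[1:])
    ) / span
    return max_gap, area


# Made-up histogram: four buckets with power-of-two boundaries.
counts = [3, 1, 2, 2]
breaks = [0, 1024, 2048, 4096, 8192]

max_gap, area = cdf_bounds_gap(counts, breaks)
print(max_gap, area)
```

Both numbers shrink as buckets become narrower or more evenly filled, which gives a rough guide for choosing bucket boundaries when the raw data cannot be retained.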

If you're interested in learning more, we have shared some of our research findings at conferences such as OSDI, USENIX ATC, and JSM.

By Murray Stokely, Storage Analytics Team Lead