Given the number of files and servers involved in this performance analysis, we bin the metadata into a compact histogram form. We use these output histograms for many purposes, such as (i) building Markov models of data availability, (ii) statistically forecasting resource usage, and (iii) formulating and solving optimization problems to determine the optimal allocation of flash devices.
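The key point is that a compact histogram only needs to record bucket boundaries and per-bucket counts, rather than one record per file. The snippet below is a minimal sketch of that idea in base R with made-up file sizes, not our production pipeline:

```
# Illustrative file sizes in bytes (hypothetical values, not real metadata).
file.sizes <- c(512, 2048, 4096, 65536, 1048576, 8388608)

# Bin into power-of-two buckets; only the breaks and counts need to be kept.
h <- hist(file.sizes, breaks = 2^(0:31), plot = FALSE)
h$breaks
h$counts
```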
We rely on several open source tools to make our work easier. The tool we use most often for statistical analysis of the performance, availability, and resource needs of our internal systems is the R programming language. We've recently made releases of two R packages that make R particularly well suited for interacting with other distributed systems:
- RProtoBuf is an R interface to Google's Protocol Buffers library that lets you define simple data structures with intuitive getter and setter methods. These data structures can be serialized into an extremely compact binary format for exchange with other distributed systems (see the first sketch after this list). Recent releases include improved support for 64-bit integers, protocol buffer extensions, and more.
- HistogramTools is a new R package I have released that uses RProtoBuf to read a compact protocol buffer representation of binned data and provides a number of helpful functions for manipulating and plotting histograms and for measuring the statistical information loss due to the binning (see the second sketch below). In addition to protocol buffers, it also supports importing aggregate performance data directly from DTrace output.
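To make the RProtoBuf item concrete, here is a minimal sketch of the getter/setter and serialization workflow. It uses the addressbook.proto example definition that ships with the package, so the tutorial.Person message and its field values below come from that example rather than from any internal schema:

```
library(RProtoBuf)

# Load the example message definitions bundled with the package.
readProtoFiles(system.file("proto", "addressbook.proto", package = "RProtoBuf"))

# Create a message and use the intuitive getter and setter syntax.
p <- new(tutorial.Person, id = 1, name = "Murray")
p$email <- "murray@example.com"   # setter
p$name                            # getter

# Serialize to a compact binary payload, then parse it back into an R object.
payload <- p$serialize(NULL)
q <- read(tutorial.Person, payload)
cat(as.character(q))
```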
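And here is a similarly minimal HistogramTools sketch, using simulated data in place of real metadata. It relies on the as.Message()/as.histogram() converters and the MergeBuckets(), KSDCC(), and EMDCC() helpers described in the package documentation, so treat the exact calls as illustrative rather than definitive:

```
library(RProtoBuf)
library(HistogramTools)

# Bin simulated "file sizes" into power-of-two buckets with base R.
x <- 2^runif(1000, 0, 20)
h <- hist(x, breaks = 2^(0:20), plot = FALSE)

# Round-trip the histogram through its compact protocol buffer form.
hist.msg <- as.Message(h)             # R histogram -> protocol buffer message
payload  <- hist.msg$serialize(NULL)  # compact wire format for shipping
h2 <- as.histogram(hist.msg)          # ...and back to an R histogram

# Coarsen the buckets and bound the information lost to the binning
# via distances between the binned empirical CDFs.
h.coarse <- MergeBuckets(h, adj.buckets = 2)
KSDCC(h.coarse)   # Kolmogorov-Smirnov distance of the cumulative curves
EMDCC(h.coarse)   # earth mover's distance of the cumulative curves
```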
If you're interested in learning more, we have shared some of our research findings at conferences such as OSDI, USENIX ATC, and JSM.
By Murray Stokely, Storage Analytics Team Lead