Last week we open sourced the TensorFlow Recorder project (also known as TFRecorder), which makes it possible for data scientists, data engineers, or AI/ML engineers to create image based TFRecords with just a few lines of code. Using TFRecords is incredibly important for creating efficient TensorFlow ML pipelines, but until now they haven’t been so easy to create. Before TFRecorder, in order to create TFRecords at scale you would have had to write a data pipeline that parsed your structured data, loaded images from storage, and serialized the results into the TFRecord format. TFRecorder allows you to write TFRecords directly from a Pandas dataframe or CSV without writing any complicated code.
You can see an example of TFRecoder below, but first let’s talk about some of the specific advantages of TFRecords.
How TFRecords Can Help
Using the TFRecord file format allows you to store your data in sets of files, each containing a sequence of protocol buffers serialized as a binary record that can be read very efficiently, which will help reduce the data loading bottleneck mentioned above.Data loading performance can be further improved by implementing prefetching and parallel interleave along with using the TFRecord format. Prefetching reduces the time of each model training step(s) by fetching the data for the next training step while your model is executing training on the current step. Parallel interleave allows you to read from multiple TFRecords shards (pieces of a TFRecord file) and apply preprocessing of those interleaved data streams. This reduces the latency required to read a training batch and is especially helpful when reading data from the network.
Using TensorFlow Recorder
Creating a TFRecord using TFRecorder requires only a few lines of code. Here’s how it works.TFRecorder currently expects data to be in the same format as Google AutoML Vision.
This format looks like a pandas dataframe or CSV formatted as:
Where:
- split can take on the values TRAIN, VALIDATION, and TEST
- image_uri specifies a local or google cloud storage location for the image file.
- label can be either a text-based label that will be integerized or an integer
While this example would work well to convert a few thousand images into TFRecords, it probably wouldn’t scale well if you have millions of images. To scale up to huge datasets, TensorFlow Recorder provides connectivity with Google Cloud Dataflow, which is a serverless Apache Beam pipeline runner. Scaling up to DataFlow requires only a little bit more configuration.
What’s next?
We’d love for you to try out TensorFlow Recorder. You can get it from GitHub or simply pip install tfrecorder. Tensorflow Recorder is very new and we’d greatly appreciate your feedback, suggestions, and pull requests.By Mike Bernico and Carlos Ezequiel, Google Cloud AI Engineers