Advanced TPU optimization with XProf: Continuous profiling, utilization insights, and LLO bundles
In our previous post, we introduced the updated XProf and the Cloud Diagnostics XProf library, which are designed to help developers identify model bottlenecks and optimize memory usage. As machine learning workloads on TPUs continue to grow in complexity—spanning both massive training runs and large-scale inference—developers require even deeper visibility into how their code interacts with the underlying hardware.
Today, we are exploring three advanced capabilities designed to provide "flight recorder" visibility and instruction-level insights: Continuous Profiling Snapshots, the Utilization Viewer, and LLO Bundle Visualization.
Continuous Profiling Snapshots: The "Flight Recorder" for ML
Standard profiling often relies on "sampling mode," where users manually trigger high-fidelity traces for short, predefined durations. While effective for general optimization, this traditional approach can miss transient anomalies, intermittent stragglers, or unexpected performance regressions that occur during long-running training jobs.
To address this visibility gap, XProf is introducing Continuous Profiling Snapshots. This feature functions as an "always-on" flight recorder for your TPU workloads.
How it works: Continuous Profiling Snapshots (see the accompanying Google Colab) operate quietly in the background with minimal CPU overhead (approximately 7µs per packet). The feature uses a host-side circular buffer of roughly 2GB to retain the last ~90 seconds of performance data. This architecture lets developers snapshot performance data programmatically at the exact moment an anomaly occurs, avoiding the overhead and unpredictability of traditional one-shot profiling.
Key technical features include:
- Circular Buffer Management: Continuously holds recent trace data to ensure you can capture the exact moments leading up to an anomaly or regression.
- Out-of-band State Tracking: A lightweight service polls hardware registers for P-state (voltage and frequency) and trace-drop counters, ensuring the snapshot contains the necessary environmental context for accurate analysis.
- Context Reconstruction: The system safely decouples state capture from the trace stream. This ensures that any arbitrary snapshot retains the ground truth required for precise, actionable debugging.
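The buffering scheme above can be sketched in plain Python. This is a toy illustration, not XProf's actual implementation: the `FlightRecorder` class, its packet-count capacity, and the `snapshot` method are hypothetical stand-ins for the ~2GB host-side ring buffer described above.

```python
from collections import deque

class FlightRecorder:
    """Toy model of a host-side circular trace buffer.

    Hypothetical stand-in for XProf's ring buffer: old trace packets
    are evicted automatically once capacity is reached, so the buffer
    always holds only the most recent window of activity.
    """

    def __init__(self, capacity_packets):
        # A deque with maxlen gives ring-buffer semantics: appending to
        # a full deque silently drops the oldest element.
        self._buffer = deque(maxlen=capacity_packets)

    def record(self, packet):
        self._buffer.append(packet)  # O(1), minimal per-packet overhead

    def snapshot(self):
        # Freeze the current window, e.g. when an anomaly is detected.
        return list(self._buffer)

recorder = FlightRecorder(capacity_packets=4)
for step in range(10):
    recorder.record({"step": step, "duration_us": 120 + step})

# Only the last 4 packets survive -- the "last ~90 seconds" analogue.
print([p["step"] for p in recorder.snapshot()])  # → [6, 7, 8, 9]
```

Because the eviction happens on the host as packets arrive, taking a snapshot is just copying the buffer, which is why the capture itself adds no instrumentation cost to the device.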
Visualizing Hardware Efficiency with the Utilization Viewer
Raw performance counters are powerful, but interpreting thousands of hardware metrics by hand is daunting and time-consuming. The new Utilization Viewer bridges the gap between raw data streams and actionable optimization strategies.
This tool translates raw performance counter values into easily understandable utilization percentages for specific hardware components, such as the TensorCore (TC), SparseCore (SC), and High Bandwidth Memory (HBM).
Figure 3: Deriving actionable insights from raw performance counters.
From Counters to Insights: Instead of requiring developers to manually analyze a raw list of event counts, the Utilization Viewer automatically derives high-level metrics. For example, it can translate raw bus activity into a clear utilization percentage (e.g., displaying an average MXU bus utilization of 7.3%). This immediate clarity allows you to determine at a glance whether your model is compute-bound or memory-bound.
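The derivation from counters to a percentage is straightforward arithmetic over a sampling window. The sketch below is illustrative only: the counter names and values are made up to reproduce the 7.3% figure mentioned above, and do not reflect XProf's real counter schema.

```python
def utilization_pct(busy_cycles, total_cycles):
    """Translate raw cycle counters into a utilization percentage."""
    if total_cycles == 0:
        return 0.0
    return 100.0 * busy_cycles / total_cycles

# Hypothetical counter deltas sampled over one profiling window.
counters = {
    "mxu_bus_busy_cycles": 73_000,
    "total_cycles": 1_000_000,
}

mxu_bus_util = utilization_pct(counters["mxu_bus_busy_cycles"],
                               counters["total_cycles"])
print(f"MXU bus utilization: {mxu_bus_util:.1f}%")  # → 7.3%
```

A low compute-unit utilization paired with high HBM utilization is the classic signature of a memory-bound model; the reverse suggests a compute-bound one.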
Inspecting the Metal: Low-Level Operations (LLO) Bundles
For advanced users and kernel developers utilizing Pallas, we are now exposing Low-Level Operations (LLO) bundle data. LLO bundles represent the specific machine instructions issued to the TPU's functional units during every clock cycle.
This feature is critical for "Instruction Scheduling" verification—ensuring that the compiler is honoring your programming intentions and correctly re-ordering instructions to maximize hardware performance.
New Visualizations via Trace View Integration: You can now visualize LLO bundles directly within the trace viewer. Through dynamic instrumentation, XProf inserts traces exactly when a bundle executes. This provides exact execution times and block utilization metrics, rather than relying on static compiler estimates.
Why it matters: Accessing this level of granularity enables hyper-specific bottleneck analysis. For instance, developers can now identify idle cycles within the Matrix Multiplication Unit (MXU) pipeline, making it easier to spot and resolve latency between vmatmul and vpop instructions.
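Spotting idle cycles from bundle timing data amounts to gap analysis over consecutive execution intervals. A minimal sketch, assuming each bundle record carries a start cycle and a duration; this record layout and the cycle counts are hypothetical, not the real LLO trace schema.

```python
def find_idle_gaps(bundles, min_gap_cycles=1):
    """Return (gap_start, gap_length) pairs where the unit sat idle.

    `bundles` is a list of (start_cycle, duration_cycles) tuples for
    one functional unit (e.g. the MXU), sorted by start cycle.
    """
    gaps = []
    for (s0, d0), (s1, _) in zip(bundles, bundles[1:]):
        gap = s1 - (s0 + d0)
        if gap >= min_gap_cycles:
            gaps.append((s0 + d0, gap))
    return gaps

# Illustrative MXU timeline: two back-to-back bundles, then a stall
# before the next one issues (cycle counts invented for the example).
mxu_bundles = [(0, 8), (8, 8), (24, 4)]   # unit idle from cycle 16 to 24
print(find_idle_gaps(mxu_bundles))         # → [(16, 8)]
```

Because the trace records exact execution times rather than static compiler estimates, a gap like the one above points directly at a scheduling stall between specific instructions.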
Conclusion
Whether you are trying to capture a fleeting performance regression with Continuous Profiling, verifying kernel efficiency with LLO Bundles, or assessing overall hardware saturation with the Utilization Viewer, these new features bring internal-grade Google tooling directly to the open-source community. These tools are engineered to provide the absolute transparency required to optimize high-scale ML workloads.
Get started by checking out the updated resources:
- XProf GitHub Repository: https://github.com/openxla/XProf
- Official Documentation: https://openxla.org/XProf