As machine learning workloads scale to unprecedented heights, developers are increasingly writing highly specialized Tensor Processing Unit (TPU) kernels using frameworks like Pallas, Mosaic, and Triton to maximize hardware performance.
However, customizing high-performance kernels has historically introduced a major engineering challenge: optimization blind spots. To legacy performance profilers, custom compilation paths appear as opaque execution paths. Developers are left with single, massive execution blocks in their trace captures, lacking granular visibility into what is actually occurring inside the chip's internal components. Did a vector processing instruction stall? Was matrix math idle due to data loading bottlenecks?
Traditional profiling relies heavily on compile-time static cost models to estimate kernel efficiency. While helpful for standard operations, these models cannot capture dynamic runtime realities like instruction execution stalls, memory subsystem congestion, or hardware scheduling conflicts.
To open this opaque execution path, we are excited to introduce the Kernel Profiling suite in XProf—a low-level hardware debugging suite engineered specifically for Pallas kernel authoring and optimization on Google TPUs. By combining static compilation tracking with dynamic, sub-microsecond hardware telemetry, XProf Kernel provides the deep transparency required to optimize high-scale ML workloads.
Deep visibility: HLO Graphs & MLIR Inspection
The first step in debugging any custom kernel is understanding how your high-level code is translated by the compiler. When compiling a JAX or PyTorch model, the compiler generates a High-Level Optimizer (HLO) graph. Previously, custom calls inside these graphs remained completely obscured.
XProf's updated Graph Viewer resolves this by exposing the internal compilation logic of these custom regions directly. To unlock this deep visibility, developers must pass the appropriate debug flags to the XLA compilation environment.
--xla_enable_custom_call_region_trace=true
--xla_xprof_register_llo_debug_info=true
Once these flags are active, any trace captured via XProf includes comprehensive compiler metadata. In the XProf Graph Viewer, clicking on a custom-call block reveals an interactive panel titled "Custom Call Text." This displays the raw, lowered MLIR (Multi-Level Intermediate Representation) code generated by the compiler.
By displaying the MLIR text side-by-side with high-level source-code representations, developers can immediately verify whether the compiler is correctly fusing operations and structuring memory tiles as intended.
Tracing Instrumented Low-Level Operations (LLO) Analysis
To provide cycle-level execution visibility, XProf exposes Low-Level Operations (LLO) bundle data directly inside the Trace Viewer. An LLO bundle represents the actual machine instructions issued to the TPU core's functional units during every clock cycle.
Through dynamic instrumentation, XProf inserts hardware markers exactly when a LLO bundle region executes. Within the Trace Viewer, this manifests as dedicated, time-aligned execution tracks representing the TPU bundle's slot utilization metrics from static analysis:
- MXU (Matrix Multiply Unit): Tracks active, busy cycles of high-throughput matrix-multiplication pipelines.
- Scalar and Vector ALUs: Displays the execution profile of mathematical operations, letting you spot pipeline imbalances.
- Vector Fills, Loads, Spills, and Stores: Exposes HBM-to-register data movement, critical for identifying bandwidth-throttling bottlenecks.
- XLU (Cross-Lane Unit): Monitors collective communications and data shuffling across physical TPU cores.
Runtime Performance Counter Sampling
While static analysis effectively verifies instruction counts or vector store logic, it remains detached from the dynamic realities of runtime execution. To bridge this gap, XProf introduces fine-grained, periodic performance counter sampling—available starting with TPU v7 (Ironwood). This capability empowers developers to move beyond static estimation and measure precisely how hardware blocks are utilized in real-time, providing the empirical ground truth needed to identify whether compute units are truly active or stalled by memory subsystems.
Consider the optimization of a tiled matrix multiplication (Matmul) kernel. While a static trace might indicate a logically perfect sequence of operations, real-world performance often falters if the Matrix Multiply Unit (MXU) sits idle while awaiting data from High-Bandwidth Memory (HBM). To diagnose and resolve such bottlenecks, developers can utilize a structured three-step profiling workflow:
- Set up the Profiling Environment: Configure the TPU v7 (Ironwood) runtime by defining specific hardware counters—such as scalar issues or synchronization waits.
- Capture a Kernel Profile: Use the XProf request interface to capture fine-grained performance counters, which can then be visualized as a time-series within the Trace Viewer.
- Interpret the Data: Analyze the resulting counters to distinguish between a Memory-Bound Scenario (characterized by massive spikes in
sync_wait) and an Optimized Scenario. For instance, implementing triple buffering to overlap memory loads with MXU compute can reduce runtime from 125.5µs to 88µs—a ~30% performance gain validated by a drastic reduction in synchronization events.
By shifting from static code inspection to empirical runtime telemetry, hardware behavior explicitly validates optimization strategies, ensuring every cycle on the silicon is spent productively. For a hands-on example to check out these techniques, please explore our Pallas Matmul w/ Perf Counters demo.
Visualizing the "Utilization Gap"
This dynamic tracking exposes the significant gap left by traditional static analysis tools. A static tool analyzes instructions linearly, completely ignoring time. It might flag an MXU instruction block as "100% Utilized."
In contrast, XProf plots actual hardware execution over time. You might discover that a long-running Scalar ALU operation is stalling the entire execution pipeline, leaving the powerful MXU completely idle. By visualizing these temporal idle gaps, developers can adjust data shapes, memory alignments, and instruction sequencing to maximize compute density.
STATIC ESTIMATION:
[========== Block Execution: MXU Flagged 100% Utilized ==========]
XPROF REAL-WORLD TIMELINE:
├─ [Scalar ALU (Active)] ─┼─ [MXU (Active)] ─┼── [MXU (Idle / Memory Stall)] ──┤
│ Stalling pipeline... │ Compute phase │ Starved; waiting for HBM Load │
Overall Utilization from Performance Counters
Navigating profiling metrics can be daunting. Relying on metrics calculated via compile-time cost models often misrepresents performance when applied to custom compilation paths. To solve this, XProf establishes a clear Hierarchy of Trust:
┌───────────────────────────────┐
│ Absolute Ground Truth │
│ (HBM, Hardware Registers, │ (100% Trustworthy)
│ TPO Metrics, CSRs) │
└───────────────┬───────────────┘
▼
┌───────────────────────────────┐
│ Estimated Metrics │
│ (Program Optimal FLOPs, │ (Requires caution with
│ Goodput Efficiency) │ custom compiling paths)
└───────────────────────────────┘
- The Absolute Ground Truth (100% Trustworthy): Metrics derived directly from physical hardware registers (HBM utilization, TPO metrics, unprivileged hardware stats). When profiling custom kernels, these represent physical reality and should be your primary optimization anchors.
- Estimated Metrics (Use with Caution): Metrics like "Compared to program optimal FLOPS" or "Goodput efficiency" rely on XLA cost models. Because custom compilation paths bypass standard passes, these metrics can be highly skewed or outright non-functional.
For the unvarnished truth, XProf exposes the Perf Counters View, providing direct, tabular access to over 16,000 raw hardware counters read straight from the TPU silicon.
Understanding Trace Tracks: The height of a trace track does not represent a normalized 0-100% percentage. It represents the maximum raw counter value observed in that interval. For example, if a counter increments by 100 cycles over a 500-nanosecond trace window (roughly 1,000 clock cycles on a 2.0 GHz core), it indicates exactly 10% physical utilization of that unit.
To configure and profile the runtime performance counters sampling method, please follow the instructions from <openxla.org/xprof/kernel-profiling.html>.
Advanced Sampling: Event-Triggered Profiling
Previously, dynamic capturing was limited to Periodic Sampling Mode—polling counters based on a host-level timer, which hit a physical resolution floor of 1 microsecond.
CORE 0 CORE 1 CORE 2 CORE 3
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ 28 Counters │ │ 28 Counters │ │ 28 Counters │ │ 28 Counters │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
└─────────────────────────────────────────────────────────────────┘
4 x 28 Sparse Matrix
To capture lightning-fast hardware cycles, XProf now supports External Event-Triggered Mode. The dynamic sampler intercepts physical TPU trace instructions and boundary triggers (such as entering/exiting custom call scopes), allowing for sub-microsecond capture latency and precise attribution.
Developers can configure up to 28 hardware counters per core, distributed across up to four active SparseCores, creating a 4 x 28 profiling matrix that maximizes data variety while protecting workload performance.
Activating this is straightforward via standard JAX JIT profilers:
options = jax.profiler.ProfileOptions()
# Example request for externally triggered collection
options.advanced_configuration = {
"tpu_enable_periodic_counter_sampling" : True,
"tpu_tc_perf_counter_sampling_options" : (
'is_external_trigger:true scaling:0 counter_size_bits:1 indices:10 indices:11 indices:56 indices:57 indices:58'
),
}
# For periodic sampling, please use interval_us instead of is_external_trigger.
Getting Started
Ready to transition from guessing performance to measuring and optimizing the physical limits of your ML silicon? Explore these open-source resources to get started with XProf Kernel today:
- XProf GitHub Repository: github.com/openxla/xprof
- Official XProf Documentation: openxla.org/xprof
- JAX Profiling Guide: jax.readthedocs.io/en/latest/profiling.html

