opensource.google.com

Menu

W3C Trace Context Specification: What it Means for You

Wednesday, December 11, 2019

Since the first days of Google Cloud Platform (GCP), Google has been at the forefront of making your applications more observable. Beyond Stackdriver, our most visible impact in this space is OpenTelemetry, which we initiated in 2017 (as OpenCensus) and has grown into a huge community that includes the majority of APM / monitoring vendors and cloud platforms.

While OpenTelemetry allows developers to easily capture distributed traces and metrics from their own services, there’s also a need to trace requests as they propagate through components that developers don’t directly control, like managed services, load balancers, network hardware, etc. To solve this we co-defined a prototype HTTP header that these components can rely on, gathered partners, and moved the work into the W3C.

This work is now complete, and the W3C Trace Context format is now an official standard. Once implemented in GCP, this will make our services even easier to manage, both with Stackdriver and other third party distributed tracing tools. We explain more in the
official post on the W3C blog, which I’ve copied below:

The W3C Distributed Tracing working group has moved the Trace Context specification to the next maturity level. The specification is already being adopted and implemented by many platforms and SDKs. This article describes the Trace Context specification and how it improves troubleshooting and monitoring of modern distributed apps.

W3C Trace Context specification defines the format for propagating distributed tracing context between services. Distributed tracing makes it easy for developers to find the causes of issues in highly-distributed microservices applications by tracking how a single interaction was processed across multiple services. Each step of a trace is correlated through an ID that is passed between services, and W3C Trace Context now defines a standard for these context propagation headers.

Until now, different tracing systems have defined their own headers. Examples include Zipkin’s B3 format and X-Google-Cloud-Trace. Adopting a common context propagation format has been long desired by developers, APM vendors, and cloud platform hosts, as compatibility provides numerous benefits:
  • Web and RPC frameworks that use this standard to provide context propagation out of the box will also offer cross-service log correlation, even for developers who haven’t set up distributed tracing.
  • API producers can record the trace IDs of requests from API consumers and provide additional spans or metadata to their customers for a given traced request. Producers can also correlate customer trace IDs to internal traces when debugging technical issues raised by consumers.
  • Networking infrastructure (proxies, load balancers, routers, etc.) can both ensure that context propagation headers are not removed from requests passing through them, and can record spans or logs for a given trace, without having to support multiple vendor-specific formats. Potential examples of these include router appliances, cloud load balancers, and sidecar proxies like Envoy.
  • Instrumentation can be further decoupled from a developer’s choice of APM vendor. For example, using both OpenTelemetry and a given vendor’s agents, a developer can instrument different services in an application, and traces will flow through the system and be processed correctly by the vendor’s backend.
  • Web browsers and other clients can use these identifiers to correlate their telemetry with traces collected from backend services. This functionality is currently being defined.
To address this effort, a group of cloud providers, open source contributors, and APM vendors started defining a standard HTTP context propagation header that would replace their homegrown formats. This specification has been discussed and iterated on over the past two years, and the group working on it has grown significantly over that time. Sponsors include Google, Microsoft, Dynatrace, and New Relic (W3C members), and the group was officially moved into the W3C in 2018 for the work to proceed under the guidance of an official standards body and to spur even greater adoption.

TraceContext has since been adopted by OpenTelemetry (which enables it by default and also serves as the reference implementation), Azure services, Dynatrace, Elastic, Google Cloud Platform, Lightstep, and New Relic. We are tracking adoption in this list.

This first phase of work has focused on HTTP, as it is commonly used and has no built-in affordances for trace context propagation (gRPC and some newer RPC systems do). The same group of committee members are also working to define trace context propagation in other formats, starting with AMQP and MQTT for IoT; other upcoming topics include context propagation from clients and web browsers.

By Morgan McLean, OpenTelemetry + Stackdriver
.