Distributed Tracing: Notes

Functional

  • Trace = tree of spans across services.
  • Each span: (trace_id, span_id, parent_id, service, op, ts, duration, attrs).
  • Query by trace ID, service, operation, time range.
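
A minimal sketch of the span record and of rebuilding the tree for one trace. The field names follow the tuple above; the units (seconds), the in-memory list, and the helper names `build_tree` and `render` are assumptions for illustration, not part of any particular tracing backend.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span (assumed convention)
    service: str
    op: str
    ts: float        # start time in seconds (assumed unit)
    duration: float  # seconds (assumed unit)
    attrs: dict = field(default_factory=dict)


def build_tree(spans):
    """Index one trace's spans by parent; return (root, children_by_parent)."""
    children = defaultdict(list)
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            children[s.parent_id].append(s)
    for kids in children.values():
        kids.sort(key=lambda s: s.ts)  # siblings in start-time order
    return root, children


def render(span, children, depth=0):
    """Return one indented line per span, depth-first."""
    lines = [f"{'  ' * depth}{span.service}:{span.op} ({span.duration * 1000:.0f} ms)"]
    for child in children.get(span.span_id, []):
        lines.extend(render(child, children, depth + 1))
    return lines


# Hypothetical trace: one request fanning out across four services.
spans = [
    Span("t1", "c", "a", "payments", "charge", 0.040, 0.070),
    Span("t1", "a", None, "gateway", "GET /checkout", 0.000, 0.120),
    Span("t1", "d", "c", "db", "INSERT", 0.050, 0.010),
    Span("t1", "b", "a", "cart", "load_cart", 0.005, 0.030),
]
root, children = build_tree(spans)
lines = render(root, children)
print("\n".join(lines))
```

Spans can arrive in any order, which is why the tree is rebuilt from `parent_id` at read time rather than assumed from arrival order.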

Non-functional

  • Ingestion: 100k+ spans/s.
  • Storage scales with the sampled rate, not the ingest rate; head sampling at 1-10% is typical.
  • p99 trace lookup < 1 s.
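
A back-of-the-envelope sketch of what these numbers imply for storage. The 100k spans/s and 1-10% figures come from the list above; the 500 bytes per stored span is purely an assumed figure for illustration, and real span sizes vary widely with attribute count and encoding.

```python
SECONDS_PER_DAY = 86_400


def stored_gb_per_day(spans_per_sec, sample_rate, bytes_per_span):
    """Stored volume in GB/day = ingest rate x sampling rate x span size x seconds per day."""
    return spans_per_sec * sample_rate * bytes_per_span * SECONDS_PER_DAY / 1e9


# bytes_per_span=500 is an assumption, not a measured value.
for rate in (0.01, 0.10):
    gb = stored_gb_per_day(100_000, rate, 500)
    print(f"sample rate {rate:.0%}: about {gb:.1f} GB/day")
```

Under these assumptions, 1% sampling stores roughly 43 GB/day and 10% roughly 432 GB/day, so the sampling rate is the main lever on storage cost.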

Trade-offs

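  • Head sampling is simpler: the keep/drop decision is made when the trace starts. Tail sampling decides after the trace completes, so it can keep all error and slow traces, at the cost of buffering spans.
  • OpenTelemetry is winning as the standard; proprietary SDKs are declining.
  • Exemplars attach a trace ID to a metric sample, linking metrics → exact traces.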
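
The head versus tail trade-off can be sketched as follows. This is an illustration under stated assumptions, not how any specific collector implements it: spans are plain dicts with `trace_id`, `duration`, and an optional `error` flag, and `head_keep`, `TailSampler`, `slow_threshold_s`, and `baseline_rate` are hypothetical names.

```python
import hashlib
from collections import defaultdict


def head_keep(trace_id: str, rate: float) -> bool:
    """Head sampling: decide from the trace ID alone, before any span is seen.

    Hashing the ID makes every service reach the same decision for a trace.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # in [0, 1)
    return bucket < rate


class TailSampler:
    """Tail sampling: buffer spans per trace, decide once the trace is complete."""

    def __init__(self, slow_threshold_s: float, baseline_rate: float):
        self.slow_threshold_s = slow_threshold_s
        self.baseline_rate = baseline_rate
        self.buffer = defaultdict(list)  # trace_id -> spans held in memory

    def add(self, span: dict) -> None:
        self.buffer[span["trace_id"]].append(span)

    def finish(self, trace_id: str) -> list:
        """Call when the trace is considered complete; returns the spans to store."""
        spans = self.buffer.pop(trace_id, [])
        has_error = any(s.get("error") for s in spans)
        is_slow = any(s["duration"] >= self.slow_threshold_s for s in spans)
        # Keep every error or slow trace, plus a small baseline of normal ones.
        if has_error or is_slow or head_keep(trace_id, self.baseline_rate):
            return spans
        return []


sampler = TailSampler(slow_threshold_s=1.0, baseline_rate=0.0)
sampler.add({"trace_id": "t-ok", "duration": 0.02})
sampler.add({"trace_id": "t-err", "duration": 0.03, "error": True})
sampler.add({"trace_id": "t-slow", "duration": 2.5})
kept = {tid: len(sampler.finish(tid)) for tid in ("t-ok", "t-err", "t-slow")}
print(kept)
```

The buffer is the cost of tail sampling: every span must be held until its trace finishes, whereas the head sampler needs no state at all.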

Refs

  • OpenTelemetry, Jaeger, Tempo, and Zipkin docs.
  • Honeycomb / Lightstep blog series.
  • Google Dapper paper (the original).