Distributed Tracing — Notes
Functional
- Trace = tree of spans across services.
- Each span: (trace_id, span_id, parent_id, service, op, ts, duration, attrs).
- Query by trace ID, service, operation, time range.
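
A minimal sketch of the span record and the "tree of spans" idea, assuming Python dataclasses. Field names follow the tuple above; the units (seconds), the use of parent_id = None to mark the root, and the helper names build_tree and render are assumptions for illustration, not part of any tracing library.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span (assumption)
    service: str
    op: str
    ts: float        # start time in seconds (unit assumed)
    duration: float  # seconds (unit assumed)
    attrs: dict = field(default_factory=dict)


def build_tree(spans):
    """Index one trace's spans by parent_id; return (root, children)."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children[span.parent_id].append(span)
    # Order siblings by start time so the tree reads chronologically.
    for kids in children.values():
        kids.sort(key=lambda s: s.ts)
    return root, children


def render(root, children, depth=0):
    """Return the tree as indented 'service:op' lines, depth-first."""
    lines = ["  " * depth + f"{root.service}:{root.op}"]
    for child in children.get(root.span_id, []):
        lines.extend(render(child, children, depth + 1))
    return lines


# Spans arrive in arbitrary order; the tree is rebuilt from parent_id links.
spans = [
    Span("t1", "c", "b", "db", "SELECT", 0.020, 0.010),
    Span("t1", "a", None, "gateway", "GET /cart", 0.000, 0.050),
    Span("t1", "b", "a", "cart", "load_cart", 0.005, 0.040),
]
root, children = build_tree(spans)
print("\n".join(render(root, children)))
```

A query by trace ID would fetch all spans sharing a trace_id and pass them to build_tree; queries by service, operation, or time range would filter on the service, op, and ts fields first.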
Non-functional
- Ingestion: 100k+ spans/s.
- Storage scales with the sampled rate, not the ingest rate; head sampling typically keeps 1-10% of traces.
- p99 trace lookup < 1 s.
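
A rough sketch of how head sampling ties ingest rate to storage rate, assuming the keep/drop decision is derived from a hash of the trace ID so every service reaches the same verdict for the same trace. The function names and the choice of SHA-256 are assumptions for illustration; real SDKs use their own schemes.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at trace start whether to keep a trace, using only its ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def stored_spans_per_second(ingest_rate: float, sample_rate: float) -> float:
    """Back-of-envelope storage write rate after sampling."""
    return ingest_rate * sample_rate


# 100k spans/s at 1% and 10% sampling: about 1k and 10k stored spans/s.
for rate in (0.01, 0.10):
    print(rate, stored_spans_per_second(100_000, rate))
```

Because the decision depends only on trace_id, a trace is either kept whole or dropped whole, which avoids storing partial trees.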
Trade-offs
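- Head sampling is simpler (decision made at trace start); tail sampling decides after the trace completes, so it can keep all error and slow traces at the cost of buffering spans.
- OpenTelemetry is winning as the standard; proprietary SDKs are declining.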
- Exemplars link metrics → exact traces.
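
A minimal sketch of a tail-sampling decision, to contrast with head sampling: the whole trace is buffered, then kept if it contains an error or is slow, and otherwise kept at a small baseline rate. The function name tail_keep, the "error" attribute key, the 1 s threshold, and the 1% baseline are all assumptions for illustration.

```python
import random


def tail_keep(spans, slow_threshold_s=1.0, baseline_rate=0.01, rng=random.random):
    """Decide after a trace completes whether to keep it.

    spans: list of dicts with 'duration' (seconds) and 'attrs' keys.
    The longest span stands in for total trace latency, since the
    root span normally covers the whole request.
    """
    has_error = any(s["attrs"].get("error") for s in spans)
    is_slow = any(s["duration"] >= slow_threshold_s for s in spans)
    if has_error or is_slow:
        return True  # always keep interesting traces
    return rng() < baseline_rate  # keep a small share of normal traces


ok_trace = [{"duration": 0.05, "attrs": {}}]
error_trace = [{"duration": 0.05, "attrs": {"error": True}}]
slow_trace = [{"duration": 2.5, "attrs": {}}]

print(tail_keep(error_trace), tail_keep(slow_trace))
```

The cost hinted at in the trade-off is visible here: the function needs every span of the trace before it can run, so spans must be held in memory until the trace is considered complete.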
Refs
- OpenTelemetry docs; Jaeger, Tempo, Zipkin papers.
- Honeycomb / Lightstep blog series.
- Google Dapper paper (the original).