Observability — Notes
Why three pillars (and what's missing)
- Metrics: cheap, aggregable, weak detail.
- Logs: rich detail, costly, hard to query for trends.
- Traces: causal chains across services.
- 4th: Continuous profiling (Pyroscope/Parca) — code-level resource attribution.
- 5th, emerging: eBPF events (network, syscalls) for kernel-side observability.
Cost rule of thumb
- Logs are usually by far the most expensive signal to ingest, index, and retain.
- Sample logs and traces aggressively in steady state.
- Always keep error traces / error logs.
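The sampling rule above can be sketched as a single keep/drop decision. This is a minimal illustration, not a production sampler: the function name keep_event, its parameters, and the 1% default rate are assumptions made for the example. Hashing the trace id (instead of calling a random number generator) means every service reaches the same decision for the same trace.

```python
import hashlib


def keep_event(trace_id: str, is_error: bool, sample_rate: float = 0.01) -> bool:
    """Decide whether to keep a log line or trace (hypothetical helper)."""
    if is_error:
        # Errors are never sampled out.
        return True
    # Map the trace id to a stable 64-bit integer so the decision is
    # deterministic per trace across processes and restarts.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the event if it falls in the lowest `sample_rate` share of buckets.
    return bucket < int(sample_rate * 2**64)
```

With sample_rate=0.01 roughly one healthy trace in a hundred survives, while every error is retained regardless of the rate.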
Cardinality budget (Prometheus / Mimir)
- Labels multiply: user × route × method × status × dc.
- Drop user-level labels; aggregate them server-side if needed.
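A quick worked calculation shows why the user label dominates the budget. The distinct-value counts below are made-up numbers for illustration only, and the product is an upper bound, since not every combination of label values occurs in practice.

```python
from math import prod
from typing import Mapping

# Hypothetical distinct-value counts per label; substitute real figures.
label_values = {"user": 50_000, "route": 40, "method": 5, "status": 8, "dc": 3}


def series_upper_bound(labels: Mapping[str, int]) -> int:
    """Worst-case number of time series for one metric name."""
    return prod(labels.values())


with_user = series_upper_bound(label_values)
without_user = series_upper_bound(
    {name: count for name, count in label_values.items() if name != "user"}
)

print(with_user)     # 240000000
print(without_user)  # 4800
```

Under these assumed counts, dropping the single user label shrinks the worst case by a factor of 50,000.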
What to instrument by default
- RED: Request rate, Errors, Duration (per route).
- USE: Utilization, Saturation, Errors (per resource).
- Golden Signals: latency, traffic, errors, saturation (SRE book).
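As a rough sketch of what RED per route means in code, the in-memory recorder below tracks the three numbers for each route. It is illustrative only: the names RedRecorder and RouteStats are hypothetical, treating status >= 500 as an error is an assumption, and a real service would more likely use a metrics client library with counters and histograms than keep raw durations in a list.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RouteStats:
    requests: int = 0
    errors: int = 0
    durations: List[float] = field(default_factory=list)


class RedRecorder:
    """Tracks Rate, Errors, Duration per route (hypothetical helper)."""

    def __init__(self) -> None:
        self.routes: Dict[str, RouteStats] = defaultdict(RouteStats)

    def observe(self, route: str, status: int, seconds: float) -> None:
        stats = self.routes[route]
        stats.requests += 1
        if status >= 500:  # assumption: only 5xx responses count as errors
            stats.errors += 1
        stats.durations.append(seconds)

    def summary(self, route: str, window_seconds: float) -> Dict[str, Optional[float]]:
        stats = self.routes[route]
        ordered = sorted(stats.durations)
        # Nearest-rank p99 using integer math: ceil(0.99 * n) - 1.
        index = max(0, -(-99 * len(ordered) // 100) - 1)
        return {
            "rate_per_s": stats.requests / window_seconds,
            "error_ratio": stats.errors / stats.requests if stats.requests else 0.0,
            "p99_s": ordered[index] if ordered else None,
        }
```

Calling observe once per handled request and summary once per reporting window yields the three RED numbers for a route.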
Trace propagation
- W3C traceparent header in HTTP/gRPC.
- Across queues, propagate via message metadata (Kafka record headers, SQS message attributes).
- Server logs include trace_id so they can be joined with traces.
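The propagation steps above can be sketched with a small parser for the traceparent value. The layout follows version 00 of W3C Trace Context (version, 32-hex trace id, 16-hex parent id, 2-hex flags, joined by dashes). The helper names parse_traceparent and child_traceparent are hypothetical, and this sketch deliberately rejects anything that is not the version-00 shape. In practice an OpenTelemetry SDK propagator would normally handle this.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace-id - parent-id - flags, all lowercase hex
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent value; return None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Build the header for an outgoing call: same trace id, new span id."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
if ctx is not None:
    outgoing = child_traceparent(ctx)
    # Putting the same trace_id on the log line is what enables the join.
    log_line = f"level=info msg=handled trace_id={ctx.trace_id}"
```

The same outgoing value can be attached to an HTTP header, gRPC metadata, or a queue message attribute; only the carrier changes.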
Practical wins
- Exemplars on histogram metrics let you jump from a p99 latency data point directly to a slow trace.
- Tail sampling in the OpenTelemetry Collector keeps every error trace without overloading the backend.
- Service map auto-generated from traces gives free architecture visibility.
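To make the tail-sampling idea concrete, here is a toy model of the decision: spans are buffered per trace, and the keep/drop choice is made only once the whole trace is visible, so an error in any span keeps the entire trace. The Span and TailSampler types are hypothetical, and deciding when the root span arrives is a simplification; a real collector typically waits for a configurable time window instead.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Span(NamedTuple):
    trace_id: str
    name: str
    is_error: bool
    is_root: bool


class TailSampler:
    """Buffers spans per trace and decides after the root span ends."""

    def __init__(self, keep_healthy: Callable[[str], bool]) -> None:
        self._pending: Dict[str, List[Span]] = defaultdict(list)
        self._keep_healthy = keep_healthy  # policy for traces without errors
        self.exported: List[Span] = []

    def on_span_end(self, span: Span) -> None:
        self._pending[span.trace_id].append(span)
        if span.is_root:  # simplification: the root span finishes last
            self._decide(span.trace_id)

    def _decide(self, trace_id: str) -> None:
        spans = self._pending.pop(trace_id)
        # Any error anywhere in the trace keeps all of its spans.
        if any(s.is_error for s in spans) or self._keep_healthy(trace_id):
            self.exported.extend(spans)

    def pending_traces(self) -> int:
        return len(self._pending)
```

Passing keep_healthy=lambda trace_id: False drops every healthy trace, which makes the error-retention behaviour easy to see in isolation.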
Refs
- Google SRE Book + SRE Workbook (SLO/SLI chapters).
- OpenTelemetry docs (https://opentelemetry.io).
- "Distributed Systems Observability" — Cindy Sridharan.
- Honeycomb blog series on tail sampling.
- Prometheus / Grafana / Loki / Tempo docs.