Observability: Notes

Why three pillars (and what's missing)

  • Metrics: cheap, aggregable, weak detail.
  • Logs: rich detail, costly, hard to query for trends.
  • Traces: causal chains across services.
  • 4th: Continuous profiling (Pyroscope/Parca) — code-level resource attribution.
  • 5th, emerging: eBPF events (network, syscalls) for kernel-side observability.

Cost rule of thumb

  • Logs are usually by far the most expensive signal: ingest, indexing, and retention costs all grow with raw volume.
  • Sample logs and traces aggressively in steady state.
  • Always keep error traces / error logs.
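A minimal sketch of the rule above, assuming records carry a trace ID and an error flag. The function name `keep_record`, the 5% rate, and the record shape are hypothetical, chosen only for illustration. Hashing the trace ID keeps the decision consistent for every record of the same trace.

```python
import hashlib

def keep_record(trace_id: str, is_error: bool, sample_rate: float) -> bool:
    """Decide whether to keep a log or trace record (illustrative only)."""
    if is_error:
        # Errors bypass sampling entirely.
        return True
    # Hash the trace ID so all records of one trace get the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64

# Hypothetical steady-state traffic: 1000 records, 1 in 50 is an error.
records = [
    {"trace_id": f"trace-{i}", "is_error": i % 50 == 0}
    for i in range(1000)
]
kept = [r for r in records if keep_record(r["trace_id"], r["is_error"], 0.05)]
print(f"kept {len(kept)} of {len(records)} records")
```

All 20 error records survive regardless of the rate; only the healthy ones are thinned out.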

Cardinality budget (Prometheus / Mimir)

  • Labels multiply: the worst-case series count is the product of distinct values per label (user × route × method × status × dc).
  • Drop user-level labels; aggregate them server-side if needed.
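The multiplication is easy to underestimate, so here is the arithmetic as a short sketch. The per-label counts are made-up numbers, not measurements, and the result is an upper bound since not every combination occurs in practice.

```python
from math import prod

# Hypothetical distinct-value counts per label; real values depend on the system.
label_cardinality = {
    "user": 100_000,
    "route": 50,
    "method": 5,
    "status": 10,
    "dc": 3,
}

def worst_case_series(cardinality: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(cardinality.values())

with_user = worst_case_series(label_cardinality)
without_user = worst_case_series(
    {name: n for name, n in label_cardinality.items() if name != "user"}
)
print(f"with user label:    {with_user:,}")     # 750,000,000
print(f"without user label: {without_user:,}")  # 7,500
```

Dropping the single user label shrinks the bound by five orders of magnitude in this example.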

What to instrument by default

  • RED: Request rate, Errors, Duration (per route).
  • USE: Utilization, Saturation, Errors (per resource).
  • Golden Signals: latency, traffic, errors, saturation (SRE book).
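As a rough illustration of RED per route, the sketch below keeps counts and raw durations in memory. `RedRecorder` and `RouteStats` are hypothetical names, and treating only 5xx statuses as errors is an assumption. A real service would more likely use counters and histograms from a metrics library than store every duration.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class RouteStats:
    requests: int = 0
    errors: int = 0
    durations_s: list[float] = field(default_factory=list)

class RedRecorder:
    """In-memory RED bookkeeping per route (illustrative, not production code)."""

    def __init__(self) -> None:
        self.by_route: dict[str, RouteStats] = defaultdict(RouteStats)

    def observe(self, route: str, status: int, duration_s: float) -> None:
        stats = self.by_route[route]
        stats.requests += 1
        if status >= 500:  # assumption: only 5xx counts as an error
            stats.errors += 1
        stats.durations_s.append(duration_s)

    def summary(self, route: str, window_s: float) -> dict[str, float]:
        stats = self.by_route[route]
        ordered = sorted(stats.durations_s)
        # Nearest-rank p99 over the raw samples.
        rank = math.ceil(len(ordered) * 99 / 100)
        p99 = ordered[max(0, rank - 1)] if ordered else 0.0
        return {
            "rate_per_s": stats.requests / window_s,
            "error_ratio": stats.errors / stats.requests if stats.requests else 0.0,
            "p99_s": p99,
        }

recorder = RedRecorder()
for i in range(1, 101):
    recorder.observe("/checkout", 500 if i % 10 == 0 else 200, i / 1000)
summary = recorder.summary("/checkout", window_s=10.0)
print(summary)
```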

Trace propagation

  • W3C traceparent header in HTTP/gRPC.
  • Across queues, propagate the same context in message metadata (Kafka record headers, SQS message attributes).
  • Server logs should include trace_id so they can be joined with traces.
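A small sketch of handling the traceparent value by hand, assuming only the version 00 layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The helper names are hypothetical. In practice an OpenTelemetry SDK propagator would normally do this, so treat it as an illustration of the format rather than a replacement.

```python
import re
import secrets
from typing import Optional

# Version 00 only: 00-<trace-id>-<parent-id>-<flags>, lowercase hex.
_TRACEPARENT_V00 = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")

def parse_traceparent(header: str) -> Optional[dict[str, str]]:
    """Return trace_id, parent_id, and flags, or None if the header is invalid."""
    match = _TRACEPARENT_V00.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}

def child_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Headers for an outgoing call or queue message: same trace, new span ID.

    Assumes header names in `incoming` are already lowercased.
    """
    parent = parse_traceparent(incoming.get("traceparent", ""))
    span_id = secrets.token_hex(8)
    if parent is None:
        # No valid context: start a new, sampled trace.
        return {"traceparent": f"00-{secrets.token_hex(16)}-{span_id}-01"}
    return {"traceparent": f"00-{parent['trace_id']}-{span_id}-{parent['flags']}"}

incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
context = parse_traceparent(incoming["traceparent"])
outgoing = child_headers(incoming)
# Put trace_id on the log line so logs can be joined with traces later.
print(f"trace_id={context['trace_id']} msg=\"forwarding request\"")
```

The same `outgoing` dictionary can be attached as message metadata when the next hop is a queue rather than an HTTP call.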

Practical wins

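  • Exemplars on histogram metrics let you jump from a p99 latency bucket directly to a slow trace.
  • Tail sampling in the Collector decides after a trace completes, so it can keep every error trace while dropping most healthy ones and avoiding backend overload.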
  • Service map auto-generated from traces gives free architecture visibility.
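To show why the service map comes almost for free, here is a sketch that derives service-to-service edges from parent/child span links. The span dictionaries and service names are invented for illustration; real tracing backends work from their own span schema.

```python
from collections import Counter
from typing import Optional, TypedDict

class Span(TypedDict):
    span_id: str
    parent_id: Optional[str]
    service: str

# Hypothetical spans from a single trace.
spans: list[Span] = [
    {"span_id": "a", "parent_id": None, "service": "gateway"},
    {"span_id": "b", "parent_id": "a", "service": "orders"},
    {"span_id": "c", "parent_id": "b", "service": "orders"},
    {"span_id": "d", "parent_id": "b", "service": "payments"},
    {"span_id": "e", "parent_id": "b", "service": "inventory"},
    {"span_id": "f", "parent_id": "d", "service": "payments-db"},
]

def service_edges(spans: list[Span]) -> Counter:
    """Count caller -> callee edges where a child span runs in a different service."""
    service_of = {span["span_id"]: span["service"] for span in spans}
    edges: Counter = Counter()
    for span in spans:
        parent_service = service_of.get(span["parent_id"])
        # Skip root spans and spans that stay inside the same service.
        if parent_service is not None and parent_service != span["service"]:
            edges[(parent_service, span["service"])] += 1
    return edges

edges = service_edges(spans)
for (caller, callee), calls in sorted(edges.items()):
    print(f"{caller} -> {callee} ({calls} call(s))")
```

Aggregating these edge counts over many traces yields the dependency graph without any manual architecture diagram.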

Refs

  • Google SRE Book + SRE Workbook (SLO/SLI chapters).
  • OpenTelemetry docs (https://opentelemetry.io).
  • "Distributed Systems Observability" — Cindy Sridharan.
  • Honeycomb blog series on tail sampling.
  • Prometheus / Grafana / Loki / Tempo docs.