Observability — Detailed
flowchart TB
subgraph App[Application]
INST([OpenTelemetry SDK<br/>auto + manual instrumentation])
SLO[SLI / SLO definitions]
EXEM[Exemplar trace ids on metrics]
end
subgraph Pipelines[Collection]
OTEL[OTel Collector<br/>receivers, processors, exporters]
AGENT[Per-host agent<br/>FluentBit / Vector / Promtail]
SCRAPE[Prometheus scrape]
end
subgraph Metrics[Metrics tier]
PROM[Prometheus / Thanos / Mimir / VictoriaMetrics]
DD[Datadog / NewRelic]
REC[Recording rules / aggregations]
ALERT[Alertmanager]
end
subgraph Logs[Logs tier]
LOKI[Loki / Elasticsearch / OpenSearch / Splunk]
PARSE[Structured parsing<br/>JSON]
REDACT[PII redaction]
INDEX[Indexing strategy: labels + content]
ARCH[Cold archive S3]
end
subgraph Traces[Traces tier]
JAEG[Jaeger / Tempo / Honeycomb]
SAMPL[Sampling head + tail]
SPAN[Spans, links, baggage]
PROP[W3C traceparent propagation]
end
subgraph Profiles[Continuous Profiling]
PROF[Pyroscope / Parca / Pixie]
CPU[CPU / heap / lock / off-cpu]
end
subgraph SLO_Stack[SLO & error budget]
BURN[Burn rate alerts]
MWMW[Multi-window multi-burn]
OBJ[Targets: 99.9% etc]
end
subgraph UX[Dashboards & UX]
DASH[Grafana / Kibana]
NOTI[PagerDuty / Opsgenie]
INCID[Incident commander / runbook]
end
INST --> OTEL
INST --> SCRAPE
AGENT --> OTEL
OTEL --> Metrics
OTEL --> Logs
OTEL --> Traces
SCRAPE --> PROM
PROM --> REC --> ALERT
ALERT --> NOTI
Metrics --> DASH
Logs --> DASH
Traces --> DASH
Profiles --> DASH
EXEM -. link metric -> trace .-> Traces
SLO --> BURN --> ALERT
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class AGENT,REC,PARSE,REDACT,INDEX,SAMPL,SPAN,PROF,CPU,MWMW,OBJ,INCID service;
class ARCH storage;
class INST,SLO,EXEM,OTEL,SCRAPE,PROM,DD,ALERT,LOKI,JAEG,PROP,BURN,DASH,NOTI obs;
SLI / SLO basics
- SLI = signal (e.g., "fraction of requests < 300 ms").
- SLO = target (e.g., 99.9% over 28 days).
- Error budget = `1 - SLO`. Spend it on shipping.
- Burn-rate alerts: page on fast burn (1 hr / 5%), warn on slow burn (6 hr / 10%).
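The arithmetic behind the budget and the burn-rate thresholds can be sketched as follows. This is a minimal illustration, not a definitive alerting setup: the function names are hypothetical, and it assumes that "1 hr / 5%" means 5% of the budget consumed within one hour (likewise for 6 hr / 10%), measured against the 28-day period from the SLO example.

```python
PERIOD_HOURS = 28 * 24  # assumed SLO period: 28 days, as in the SLO example


def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail: 1 - SLO."""
    return 1.0 - slo


def budget_minutes(slo: float, period_hours: float = PERIOD_HOURS) -> float:
    """The budget expressed as minutes of total outage over the period."""
    return error_budget(slo) * period_hours * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio relative to the budget; 1.0 spends it exactly on schedule."""
    return error_ratio / error_budget(slo)


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = PERIOD_HOURS) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within `window_hours`."""
    return budget_fraction * period_hours / window_hours


FAST_BURN = burn_rate_threshold(0.05, 1)  # page: 5% of budget in 1 hr (assumed reading)
SLOW_BURN = burn_rate_threshold(0.10, 6)  # warn: 10% of budget in 6 hr (assumed reading)

if __name__ == "__main__":
    print(f"99.9% over 28 days allows ~{budget_minutes(0.999):.1f} min of full outage")
    print(f"fast-burn threshold: {FAST_BURN:.1f}x, slow-burn threshold: {SLOW_BURN:.1f}x")
    print(f"2% errors against a 99.9% SLO burns at {burn_rate(0.02, 0.999):.0f}x")
```

Under these assumptions, a 99.9% SLO over 28 days leaves about 40 minutes of full outage, and the two thresholds work out to burn rates of 33.6 and 11.2.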
Metric types (Prometheus model)
- Counter: monotonic; use `rate()` for per-second values.
- Gauge: value at a moment.
- Histogram: bucketed; allows `histogram_quantile`.
- Summary: pre-computed quantiles, not aggregatable.
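To make the counter and histogram semantics concrete, here is a rough Python sketch of what a per-second rate and a bucket-based quantile estimate compute. It is a simplification under stated assumptions, not Prometheus' implementation: the real `rate()` also extrapolates to the window edges, and the function names and input shapes here are hypothetical.

```python
import math


def counter_rate(samples):
    """Per-second increase over (timestamp_seconds, value) samples.

    Tolerates counter resets; assumes at least two samples in time order.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so `cur` is all new increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes buckets are sorted by bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                return lower_bound  # no finite bucket to interpolate within
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    scraped = [(0, 100), (15, 130), (30, 10), (45, 40)]  # reset between t=15 and t=30
    print(counter_rate(scraped))
    latency = [(0.1, 50), (0.3, 90), (1.0, 99), (math.inf, 100)]
    print(estimate_quantile(0.95, latency))
```

Because bucket counts are plain counters, they can be summed across instances before the quantile is estimated; pre-computed summary quantiles cannot be combined that way.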
Sampling
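- Head sampling: decide at the root span, before the outcome is known (e.g., keep a random N% of traces).
- Tail sampling: decide after the full trace has been collected (e.g., keep all errors and slow traces).
- Adaptive sampling: adjust the rate per route so that every route keeps enough signal.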
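The difference between the head and tail decisions can be sketched in a few lines. This is an illustrative sketch only: the `Span` type, the function names, the 300 ms slow threshold, and the 1% baseline are assumptions, and trace ids are assumed to be uniformly random 32-character hex strings as in W3C `traceparent`.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str          # 32 hex characters
    duration_ms: float
    is_error: bool = False


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at the root span using only the trace id.

    Deterministic, so every service that sees the same id reaches the same decision.
    """
    # Compare the low 64 bits of the id against a threshold proportional to `ratio`.
    return int(trace_id[-16:], 16) < ratio * 2**64


def tail_sample(spans, slow_ms=300.0, baseline_ratio=0.01, rng=random.random) -> bool:
    """Decide after all spans of a trace have arrived (assumes `spans` is non-empty)."""
    if any(span.is_error for span in spans):
        return True                                   # keep every trace with an error
    if max(span.duration_ms for span in spans) >= slow_ms:
        return True                                   # keep slow traces
    return rng() < baseline_ratio                     # thin out the healthy majority


if __name__ == "__main__":
    tid = "4bf92f3577b34da6a3ce929d0e0e4736"
    print(head_sample(tid, 0.10))
    print(tail_sample([Span(tid, 12.0), Span(tid, 450.0)]))
```

The trade-off is visible in the signatures: the head decision needs only the id and is cheap, while the tail decision needs the whole trace buffered somewhere (typically in a collector) before it can run.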
Logging discipline
- Emit structured JSON with severity, request id, user id (hashed), and trace id.
- Sample noisy lines; reserve INFO for state changes and DEBUG for diagnostics only.
- Don't log PII or secrets; redact at the agent.
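A structured log line with these fields might be produced as in the sketch below, which uses only the Python standard library. The field names, the `SENSITIVE_KEYS` deny-list, and the salt are hypothetical. Redaction is shown in-process only to keep the example self-contained; the same key-based rule would normally run at the agent, as recommended above.

```python
import hashlib
import json
import logging

SENSITIVE_KEYS = {"password", "token", "email"}  # hypothetical deny-list


def hash_user_id(user_id: str, salt: str = "example-salt") -> str:
    """Stable pseudonym for a user id; the salt here is a placeholder, not a real secret."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:16]


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, masking values of sensitive keys."""

    def format(self, record: logging.LogRecord) -> str:
        fields = getattr(record, "fields", {})  # set through logging's `extra` argument
        payload = {"severity": record.levelname, "message": record.getMessage()}
        for key, value in fields.items():
            payload[key] = "[REDACTED]" if key in SENSITIVE_KEYS else value
        return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("orders")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("order placed", extra={"fields": {
        "request_id": "req-123",
        "user_id": hash_user_id("alice"),
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "token": "should-never-appear",
    }})
```

Carrying the trace id in every line is what lets a dashboard jump from a log entry to the matching trace.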
Pitfalls
- Cardinality explosion in Prometheus — beware unbounded labels (user id).
- Logs as primary metric source — slow and expensive.
- Alerting on causes rather than symptoms: alert on user impact, not CPU%.
- "Alert fatigue" — page only on user-visible breakage.
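The cardinality pitfall is easiest to see with a little arithmetic: the number of time series for one metric is bounded by the product of the distinct values of each label. The label names and counts below are made up for illustration, and the product is a worst-case bound that assumes every combination actually occurs.

```python
from math import prod


def max_series(distinct_values_per_label: dict) -> int:
    """Upper bound on time series for one metric: product of per-label value counts."""
    return prod(distinct_values_per_label.values())


bounded = {"route": 50, "method": 5, "status": 6}      # hypothetical bounded labels
with_user_id = {**bounded, "user_id": 100_000}         # one unbounded label added

if __name__ == "__main__":
    print(max_series(bounded))        # bounded labels only
    print(max_series(with_user_id))   # same metric with a per-user label
```

With these example numbers the bound grows from 1,500 series to 150 million, which is why per-user detail belongs in logs or traces rather than in metric labels.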
Glossary & fundamentals
Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.
| Tag | Concept | What it is | Page |
|---|---|---|---|
| HLD | Observability | metrics, logs, traces, SLOs | observability |
| LLD | Testing strategy | pyramid, doubles, TDD, contracts | testing-strategy |
| LLD | Behavioural patterns | Strategy, Observer, State, Command, Chain | behavioral-patterns |