Observability — Detailed

flowchart TB
  subgraph App[Application]
    INST([OpenTelemetry SDK<br/>auto + manual instrumentation])
    SLO[SLI / SLO definitions]
    EXEM[Exemplar trace ids on metrics]
  end

  subgraph Pipelines[Collection]
    OTEL[OTel Collector<br/>receivers, processors, exporters]
    AGENT[Per-host agent<br/>FluentBit / Vector / Promtail]
    SCRAPE[Prometheus scrape]
  end

  subgraph Metrics[Metrics tier]
    PROM[Prometheus / Thanos / Mimir / VictoriaMetrics]
    DD[Datadog / NewRelic]
    REC[Recording rules / aggregations]
    ALERT[Alertmanager]
  end

  subgraph Logs[Logs tier]
    LOKI[Loki / Elasticsearch / OpenSearch / Splunk]
    PARSE[Structured parsing<br/>JSON]
    REDACT[PII redaction]
    INDEX[Indexing strategy: labels + content]
    ARCH[Cold archive S3]
  end

  subgraph Traces[Traces tier]
    JAEG[Jaeger / Tempo / Honeycomb]
    SAMPL[Sampling head + tail]
    SPAN[Spans, links, baggage]
    PROP[W3C traceparent propagation]
  end

  subgraph Profiles[Continuous Profiling]
    PROF[Pyroscope / Parca / Pixie]
    CPU[CPU / heap / lock / off-cpu]
  end

  subgraph SLO_Stack[SLO & error budget]
    BURN[Burn rate alerts]
    MWMW[Multi-window multi-burn]
    OBJ[Targets: 99.9% etc]
  end

  subgraph UX[Dashboards & UX]
    DASH[Grafana / Kibana]
    NOTI[PagerDuty / Opsgenie]
    INCID[Incident commander / runbook]
  end

  INST --> OTEL
  INST --> SCRAPE
  AGENT --> OTEL
  OTEL --> Metrics
  OTEL --> Logs
  OTEL --> Traces
  SCRAPE --> PROM
  PROM --> REC --> ALERT
  ALERT --> NOTI
  Metrics --> DASH
  Logs --> DASH
  Traces --> DASH
  Profiles --> DASH
  EXEM -.->|link metric to trace| Traces
  SLO --> BURN --> ALERT

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class AGENT,REC,PARSE,REDACT,INDEX,SAMPL,SPAN,PROF,CPU,MWMW,OBJ,INCID service;
    class ARCH storage;
    class INST,SLO,EXEM,OTEL,SCRAPE,PROM,DD,ALERT,LOKI,JAEG,PROP,BURN,DASH,NOTI obs;

SLI / SLO basics

  • SLI = signal (e.g., "fraction of requests < 300 ms").
  • SLO = target (e.g., 99.9% over 28 days).
  • Error budget = 1 - SLO. Spend it on shipping.
  • Burn-rate alerts: page on fast burn (e.g., 5% of the budget spent in 1 hr), warn on slow burn (e.g., 10% of the budget spent in 6 hr).
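
The arithmetic behind these bullets can be sketched in a few lines. This is a minimal illustration, not a production alerting rule: it reads the thresholds as "fraction of the error budget spent within the alert window", which is one interpretation of the figures above, and the names `Slo`, `burn_rate`, `burn_rate_threshold`, and `should_alert` are hypothetical. The short-window check follows the multi-window idea from the diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    target: float        # e.g. 0.999 for 99.9%
    window_hours: float  # e.g. 28 * 24 for a 28-day window

    @property
    def error_budget(self) -> float:
        # Error budget = 1 - SLO, as a fraction of all requests.
        return 1.0 - self.target

    def budget_minutes(self) -> float:
        # The same budget expressed as minutes of total outage in the window.
        return self.error_budget * self.window_hours * 60


def burn_rate(error_ratio: float, slo: Slo) -> float:
    # 1.0 means the budget is spent exactly at the end of the SLO window.
    return error_ratio / slo.error_budget


def burn_rate_threshold(budget_fraction: float, alert_hours: float, slo: Slo) -> float:
    # Burn rate that spends `budget_fraction` of the budget in `alert_hours`.
    return budget_fraction * slo.window_hours / alert_hours


def should_alert(long_ratio: float, short_ratio: float, threshold: float, slo: Slo) -> bool:
    # Multi-window check: both the long and the short window must exceed
    # the threshold, so the alert clears soon after the problem stops.
    return (burn_rate(long_ratio, slo) >= threshold
            and burn_rate(short_ratio, slo) >= threshold)


slo = Slo(target=0.999, window_hours=28 * 24)
fast = burn_rate_threshold(0.05, 1, slo)  # about 33.6x: page
slow = burn_rate_threshold(0.10, 6, slo)  # about 11.2x: warn
```

For a 99.9% target over 28 days, `slo.budget_minutes()` comes to roughly 40 minutes of full outage, which is a useful sanity check when choosing thresholds.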

Metric types (Prometheus model)

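  • Counter: monotonically increasing (resets only on restart); use rate() to get a per-second rate.
  • Gauge: a value at a moment in time; can go up or down.
  • Histogram: observations counted in cumulative buckets; allows histogram_quantile() and can be aggregated across instances.
  • Summary: quantiles pre-computed on the client; cannot be aggregated across instances.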
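
To make the histogram bullet concrete, here is a rough Python sketch of how a quantile can be estimated from cumulative buckets by linear interpolation, in the spirit of Prometheus's `histogram_quantile()`. It is a simplified approximation, not the server's actual implementation; the function names and the sample bucket data are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets.

    The last bucket is expected to have upper bound +Inf, as in Prometheus.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Quantile falls in the +Inf bucket (or an empty one):
                # fall back to the highest finite bound seen so far.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def fraction_at_or_below(bound: float, buckets: list[tuple[float, float]]) -> float:
    # SLI-style ratio: share of observations at or below a bucket boundary.
    counts = dict(buckets)
    return counts[bound] / counts[math.inf]


# Hypothetical latency histogram in seconds: 100 requests in total.
latency = [(0.1, 50), (0.3, 90), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency)
sli = fraction_at_or_below(0.3, latency)  # fraction of requests <= 300 ms
```

Because the estimate interpolates inside a bucket, its accuracy depends on where the bucket boundaries sit. Placing a boundary exactly at the SLI threshold (300 ms here) makes the SLI ratio exact rather than interpolated.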

Sampling

  • Head sampling: decide at root span (random N%).
  • Tail sampling: decide after the full trace has been collected (e.g., keep all errors and all slow traces).
  • Adaptive sampling: adjust the rate per route so that every route keeps enough signal.
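
The head and tail decisions above can be sketched as two small functions. This is an illustrative sketch only: `Span`, `head_sample`, and `tail_sample` are hypothetical names, the 1000 ms slow threshold is an arbitrary assumption, and the head decision hashes the trace id instead of drawing a random number so that every service reaches the same verdict for the same trace.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    duration_ms: float
    error: bool = False


def head_sample(trace_id: str, percent: float) -> bool:
    # Decide at the root span, before anything is known about the outcome.
    # Hashing the trace id maps it to a stable bucket in [0, 10000).
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


def tail_sample(trace_id: str, spans: list[Span],
                slow_ms: float = 1000.0, baseline_percent: float = 1.0) -> bool:
    # Decide after the whole trace is available.
    if any(span.error for span in spans):
        return True  # keep every trace that contains an error
    if max((span.duration_ms for span in spans), default=0.0) >= slow_ms:
        return True  # keep every slow trace
    # Otherwise keep a small baseline of ordinary traffic.
    return head_sample(trace_id, baseline_percent)
```

Tail sampling needs all spans of a trace buffered in one place before the decision, which is why it is usually done in the collector and not in the application.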

Logging discipline

  • Structured JSON, severity, request id, user id (hashed), trace id.
  • Sample noisy lines; reserve INFO for state changes and DEBUG for diagnostics only.
  • Don't log PII or secrets; redact at agent.
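
As a minimal sketch of these rules using only the Python standard library, the formatter below emits one JSON object per line with severity, request id, hashed user id, and trace id. The field names, the `example-salt` value, and the email-only redaction pattern are assumptions for illustration. The in-process redaction is a second line of defence and does not replace redaction at the agent.

```python
import hashlib
import json
import logging
import re
from datetime import datetime, timezone

# Deliberately simple pattern; real PII detection needs more than emails.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_user_id(user_id: str, salt: str = "example-salt") -> str:
    # Hash so that log lines can be correlated without exposing the raw id.
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()[:16]


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "severity": record.levelname,
            "message": EMAIL_RE.sub("[REDACTED]", record.getMessage()),
            # These are supplied by the caller through `extra=`.
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "order state changed to PAID",
    extra={"request_id": "req-1", "user_id": hash_user_id("42"), "trace_id": "abc123"},
)
```

Carrying the trace id on every line is what lets a dashboard jump from a log entry to the matching trace.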

Pitfalls

  • Cardinality explosion in Prometheus — beware unbounded labels (user id).
  • Logs as primary metric source — slow and expensive.
  • Alerting on causes instead of symptoms; user impact matters more than CPU%.
  • "Alert fatigue" — page only on user-visible breakage.
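
The cardinality pitfall is easiest to see with rough arithmetic: in the worst case, the number of time series for one metric is the product of the distinct values of each label. The sketch below uses hypothetical label counts and helper names to compare a bounded label set with one that adds a user id.

```python
from math import prod


def estimated_series(label_cardinalities: dict[str, int]) -> int:
    # Worst-case upper bound: every combination of label values occurs.
    return prod(label_cardinalities.values())


def status_class(status_code: int) -> str:
    # Collapse individual HTTP status codes into a few bounded classes.
    return f"{status_code // 100}xx"


# Hypothetical label value counts for a single request-latency metric.
bounded = {"route": 50, "method": 5, "status_class": 5}
unbounded = {**bounded, "user_id": 100_000}

print(estimated_series(bounded))
print(estimated_series(unbounded))
```

Per-user detail fits better in logs or traces, where high cardinality is expected, with an exemplar trace id linking the metric to a specific request.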

Glossary & fundamentals

Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.

Tag | Concept | What it is | Page
HLD | Observability | metrics, logs, traces, SLOs | observability
LLD | Testing strategy | pyramid, doubles, TDD, contracts | testing-strategy
LLD | Behavioural patterns | Strategy, Observer, State, Command, Chain | behavioral-patterns