Metrics & Monitoring: Notes

Functional

  • Pull-based (Prometheus) or push-based (Datadog/StatsD) metric collection.
  • Multi-tenant ingestion, retention, dashboards.
  • Alerting with deduplication and routing.
  • SLOs & error budgets.
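
The error-budget idea can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `error_budget`, the 99.9% target, and the request counts are assumptions, not values taken from these notes.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int):
    """Return (allowed_failures, remaining_failures, fraction_consumed).

    slo_target is the availability objective as a fraction, e.g. 0.999.
    """
    # The budget is whatever share of requests the SLO permits to fail.
    allowed = (1.0 - slo_target) * total_requests
    remaining = allowed - failed_requests
    # Fraction of the budget already spent; infinite if the SLO allows no failures.
    consumed = failed_requests / allowed if allowed > 0 else float("inf")
    return allowed, remaining, consumed


# Hypothetical window: 1M requests, 250 failures, 99.9% objective.
allowed, remaining, consumed = error_budget(0.999, 1_000_000, 250)
print(f"allowed failures: {allowed:.0f}")
print(f"remaining budget: {remaining:.0f}")
print(f"budget consumed:  {consumed:.0%}")
```

An alerting rule would typically fire when the consumed fraction grows faster than the window elapses, rather than on any single failure.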

Non-functional

  • Scale to as many as 100M active series per cluster.
  • p99 dashboard query latency under 1 s for common time ranges.
  • Long-term retention of 1+ year on object storage.

Capacity

  • ~1 byte per sample after compression.
  • 10M active series × 1 sample per 15 s ≈ 667k samples/s (~700k rounded up).
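
The two capacity figures combine into a rough storage estimate. This is a back-of-envelope sketch: the series count, scrape interval, and bytes per sample come from the notes above, while the 365-day retention is an assumption based on the 1+ year target, and index overhead and replication are ignored.

```python
ACTIVE_SERIES = 10_000_000   # from the notes
SCRAPE_INTERVAL_S = 15       # from the notes
BYTES_PER_SAMPLE = 1.0       # from the notes (compressed)
RETENTION_DAYS = 365         # assumption: the 1-year retention target
SECONDS_PER_DAY = 86_400


def ingest_rate(series: int, interval_s: float) -> float:
    """Samples per second when every series is sampled once per interval."""
    return series / interval_s


def storage_bytes(rate: float, bytes_per_sample: float, days: float) -> float:
    """Raw sample bytes accumulated over the given number of days."""
    return rate * bytes_per_sample * SECONDS_PER_DAY * days


rate = ingest_rate(ACTIVE_SERIES, SCRAPE_INTERVAL_S)
per_day = storage_bytes(rate, BYTES_PER_SAMPLE, 1)
per_year = storage_bytes(rate, BYTES_PER_SAMPLE, RETENTION_DAYS)

print(f"{rate:,.0f} samples/s")           # 666,667 samples/s
print(f"{per_day / 1e9:.1f} GB/day")      # 57.6 GB/day
print(f"{per_year / 1e12:.1f} TB/year")   # 21.0 TB/year
```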

Trade-offs

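  • Pull vs. push: pull pairs naturally with service discovery and makes a failed scrape a health signal; push suits short-lived jobs that may exit before they are scraped.
  • Cardinality is the killer: every distinct label-value combination creates a new series, so guard label values strictly (no user IDs, request IDs, or raw URLs).
  • Federation vs. one large cluster: two scaling patterns. Federation splits load across independent instances and aggregates upward; a single cluster centralizes queries but must shard ingestion and storage internally.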
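
One way to guard label values is to cap the number of distinct values accepted per label and collapse everything beyond the cap into a sentinel. The sketch below is a hypothetical in-process guard, not the mechanism of any particular system; the class name `LabelGuard`, the limit, and the `__other__` sentinel are all assumptions.

```python
from collections import defaultdict


class LabelGuard:
    """Caps distinct values per label name; overflow values collapse to a sentinel."""

    def __init__(self, max_values_per_label: int = 100, overflow: str = "__other__"):
        self.max_values = max_values_per_label
        self.overflow = overflow
        self._seen = defaultdict(set)  # label name -> values admitted so far

    def sanitize(self, labels: dict) -> dict:
        """Return a copy of labels with over-limit values replaced by the sentinel."""
        out = {}
        for name, value in labels.items():
            seen = self._seen[name]
            if value in seen or len(seen) < self.max_values:
                seen.add(value)
                out[name] = value
            else:
                # A new value past the cap would create a new series; fold it away.
                out[name] = self.overflow
        return out


guard = LabelGuard(max_values_per_label=2)
for route in ["/a", "/b", "/c", "/a"]:
    print(guard.sanitize({"route": route}))
```

With a cap of 2, the third distinct route is folded into `__other__`, while previously admitted values keep passing through unchanged. A production guard would also need to expire old values and share state across ingesters, which this sketch leaves out.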

Refs

  • Prometheus, Cortex, Mimir, Thanos, and VictoriaMetrics documentation.
  • Google SRE book on SLOs and SLIs.
  • "Observability Engineering" by Charity Majors, Liz Fong-Jones, and George Miranda.