Metrics & Monitoring — Notes
Functional
- Pull (Prometheus) or push (Datadog/StatsD) metric collection.
- Multi-tenant ingestion, retention, dashboards.
- Alerting with deduplication and routing.
- SLOs & error budgets.
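An error budget is simply the failure allowance an SLO leaves over. A minimal sketch of that arithmetic follows. The function name, the request-based SLI, and the example numbers are assumptions for illustration, not taken from any particular tool.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute a request-based error budget for one SLO window.

    slo_target is the success ratio promised by the SLO, e.g. 0.999.
    All names here are hypothetical; real systems usually derive these
    counts from recorded metrics rather than passing them in directly.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the share of requests that are allowed to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": allowed_failures - failed_requests,
        "consumed_fraction": failed_requests / allowed_failures,
    }


if __name__ == "__main__":
    # Assumed example: 99.9% SLO, 1M requests in the window, 250 failures.
    print(error_budget(0.999, 1_000_000, 250))
```

With a 99.9% target over 1M requests the budget is about 1,000 failed requests, so 250 failures consume roughly a quarter of it. A negative "remaining" value means the budget is exhausted for the window.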
Non-functional
- Up to 100M active series per cluster.
- p99 dashboard query < 1 s for common ranges.
- Long-term retention to 1+ year on object storage.
Capacity
- ~1 B/sample compressed.
- 10M active series × 1 sample/15 s ≈ 667k samples/s (rounded to ~700k); the 100M-series ceiling would be 10× that.
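The capacity figures can be checked with a short sketch. The 10M series, 15 s interval, and ~1 B/sample come from these notes; the function names are hypothetical, and the estimate assumes no replication, index, or metadata overhead.

```python
SECONDS_PER_DAY = 86_400


def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Ingest rate when every active series emits one sample per interval."""
    return active_series / scrape_interval_s


def bytes_per_day(rate: float, bytes_per_sample: float = 1.0) -> float:
    """Raw compressed sample volume per day (assumes no replication or index)."""
    return rate * bytes_per_sample * SECONDS_PER_DAY


if __name__ == "__main__":
    rate = samples_per_second(10_000_000, 15)
    daily = bytes_per_day(rate)
    print(f"{rate:,.0f} samples/s")             # 666,667 samples/s
    print(f"{daily / 1e9:.1f} GB/day")          # 57.6 GB/day
    print(f"{daily * 365 / 1e12:.1f} TB/year")  # 21.0 TB/year
```

At these assumed inputs a year of raw samples is about 21 TB, which is why long-term retention is pushed to object storage.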
Trade-offs
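- Pull vs push: pull pairs naturally with service discovery and turns a failed scrape into a health signal; push suits short-lived jobs that may exit before they are scraped.
- Cardinality is the killer: every distinct label-value combination is its own series, so keep unbounded values (user IDs, request IDs, raw URLs) out of labels.
- Federation vs single big cluster: federation scales by splitting series across smaller clusters and aggregating upward; a single cluster keeps queries simple but concentrates load and failure blast radius.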
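One way to guard label values is to cap the number of distinct values each label may take at ingestion time. The sketch below illustrates the idea; the class name, the per-label limit, and the reject-on-overflow policy are assumptions, not the behavior of any specific system.

```python
from collections import defaultdict


class LabelCardinalityGuard:
    """Rejects samples that would push a label past its distinct-value budget.

    Hypothetical helper for illustration; production systems typically also
    expire stale values and apply limits per tenant.
    """

    def __init__(self, max_values_per_label: int = 100):
        self.max_values = max_values_per_label
        # Maps (metric name, label name) to the set of values seen so far.
        self._seen = defaultdict(set)

    def admit(self, metric: str, labels: dict[str, str]) -> bool:
        # First pass: check every label without recording anything, so a
        # rejected sample does not consume any budget.
        for name, value in labels.items():
            seen = self._seen[(metric, name)]
            if value not in seen and len(seen) >= self.max_values:
                return False
        # Second pass: the sample is accepted, so record its label values.
        for name, value in labels.items():
            self._seen[(metric, name)].add(value)
        return True


if __name__ == "__main__":
    guard = LabelCardinalityGuard(max_values_per_label=2)
    for status in ("200", "500", "404"):
        ok = guard.admit("http_requests_total", {"status": status})
        print(status, "admitted" if ok else "rejected")
```

Rejecting outright is the strictest policy; a gentler variant under the same assumptions would rewrite overflow values to a fixed placeholder so the sample is kept but the series count stays bounded.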
Refs
- Prometheus, Cortex, Mimir, Thanos, VictoriaMetrics docs.
- Google SRE book on SLO/SLI.
- "Observability Engineering" by Charity Majors, Liz Fong-Jones, and George Miranda.