Distributed Logging — Notes
Functional
- Collect logs from every host / service.
- Parse, enrich, route by tags.
- Index for free-text search.
- Dashboards + alerts.
- Tiered retention (hot / warm / cold).
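Sketch of the parse / enrich / route step, for illustration only. The routing table, destination names, and field names (`tags`, `host`, `service`, `ingested_at`) are assumptions made up for this example, not part of any particular agent's schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical routing table: tag -> destination name.
ROUTES = {"audit": "cold-archive", "app": "search-index"}
DEFAULT_ROUTE = "search-index"


def parse(line: str) -> dict:
    """Parse a JSON log line; wrap anything else as a raw message."""
    try:
        event = json.loads(line)
        if isinstance(event, dict):
            return event
    except json.JSONDecodeError:
        pass
    return {"message": line.rstrip("\n"), "parse_error": True}


def enrich(event: dict, host: str, service: str) -> dict:
    """Attach collection metadata without overwriting fields the source set."""
    enriched = dict(event)
    enriched.setdefault("host", host)
    enriched.setdefault("service", service)
    enriched.setdefault("ingested_at", datetime.now(timezone.utc).isoformat())
    return enriched


def route(event: dict) -> str:
    """Return the destination for the first tag that has a known route."""
    for tag in event.get("tags", []):
        if tag in ROUTES:
            return ROUTES[tag]
    return DEFAULT_ROUTE


if __name__ == "__main__":
    raw = '{"message": "user login", "tags": ["audit"]}'
    event = enrich(parse(raw), host="web-1", service="auth")
    print(route(event), event)
```

Unparseable lines are kept and flagged rather than dropped, so they still reach the default destination and can be found later.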
Non-functional
- 100k+ events/s for big estates.
- p99 indexing latency < 30 s.
- 99.9% availability for ingest.
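Rough arithmetic for the targets above: what 99.9% ingest availability allows in downtime, and how many events agents would need to buffer if ingest were down for that whole budget at 100k events/s. The 30-day window and the worst-case "all downtime in one outage" reading are assumptions for this sketch.

```python
def downtime_budget_seconds(availability: float, days: int = 30) -> float:
    """Seconds of allowed unavailability over a window of `days` days."""
    return (1.0 - availability) * days * 24 * 60 * 60


def events_to_buffer(events_per_second: int, outage_seconds: float) -> int:
    """Events that arrive during an outage and must be held at the agents."""
    return round(events_per_second * outage_seconds)


if __name__ == "__main__":
    budget = downtime_budget_seconds(0.999)  # roughly 43 minutes per 30 days
    print(f"downtime budget: {budget / 60:.1f} min")
    print(f"events to buffer: {events_to_buffer(100_000, budget):,}")
```

This is one reason agents usually need a disk-backed buffer: at this rate a full-budget outage is a few hundred million events.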
Capacity
- Logs are the most expensive observability pillar; budget by team.
- ES hot tier: ~1 KB/event x 100M events/day = ~100 GB/day raw per tenant; replicas multiply this.
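The sizing above as a small calculator. Only the ~1 KB/event and 100M events/day figures come from these notes; the retention window, replica count, and index-overhead factor are placeholder assumptions to be replaced with measured values.

```python
def daily_volume_gb(events_per_day: int, bytes_per_event: int) -> float:
    """Raw daily volume in decimal gigabytes (1 GB = 1e9 bytes)."""
    return events_per_day * bytes_per_event / 1e9


def hot_tier_gb(daily_gb: float, retention_days: int, replicas: int,
                index_overhead: float = 1.0) -> float:
    """Disk needed for the hot tier: primary plus replica copies.

    index_overhead is an assumed multiplier for on-disk index size
    relative to raw payload; 1.0 means "same size as raw".
    """
    return daily_gb * retention_days * (1 + replicas) * index_overhead


if __name__ == "__main__":
    daily = daily_volume_gb(100_000_000, 1_000)      # figures from the notes
    avg_rate = 100_000_000 / 86_400                  # average events per second
    print(f"daily raw volume: {daily:.0f} GB")
    print(f"average rate: {avg_rate:.0f} events/s")
    # Assumed: 7 days hot, 1 replica.
    print(f"hot tier: {hot_tier_gb(daily, 7, 1):.0f} GB")
```

Note that 100M events/day averages out to only about 1.2k events/s, so the 100k events/s target describes a much larger estate or peak load, not this single tenant.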
Trade-offs
- ES inverted index = great search, expensive disk.
- Loki indexes labels only = cheap storage, weaker search (queries select streams by label, then grep/regex over the chunk data).
- Push vs pull at sources (the CDC-vs-polling question): agents always push.
- Structured JSON logs vs free text: enforce JSON + redaction.
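One way to enforce "JSON + redaction" at the source, sketched with the standard `logging` module. The sensitive key names, the e-mail pattern, and the `fields` attribute are assumptions for this example; a real deployment would use its own list and likely redact again in the pipeline.

```python
import json
import logging
import re

# Assumed deny-list of field names; adjust per organisation.
SENSITIVE_KEYS = {"password", "token", "authorization"}
# Deliberately simple e-mail pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(value):
    """Recursively mask sensitive keys and e-mail addresses."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]"
            if isinstance(k, str) and k.lower() in SENSITIVE_KEYS
            else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[EMAIL]", value)
    return value


class JsonFormatter(logging.Formatter):
    """Emit one redacted JSON object per log record."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Structured fields passed via extra={"fields": {...}}.
            "fields": getattr(record, "fields", {}),
        }
        return json.dumps(redact(payload))


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("auth")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("login for %s", "bob@example.com",
             extra={"fields": {"password": "hunter2", "user_id": 7}})
```

Redacting in the formatter means secrets never leave the process, which is cheaper and safer than scrubbing them after they have been indexed.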
Refs
- ELK / EFK stack docs; Loki design doc.
- "Honeycomb: How we built our datastore" blogs.
- Vector + OpenTelemetry Collector docs.