Batch & Stream — Notes
Choosing freshness
- ≤ 100 ms — in-process (Kafka Streams, Flink low-latency mode).
- ≤ 5 s — mainstream stream processing.
- ≤ 5 min — micro-batches (Spark Structured Streaming with 30 s triggers).
- ≥ 15 min — batch jobs.
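
A minimal sketch of the tiers above as a lookup, assuming the input is the maximum staleness a consumer can tolerate. The function name and tier labels are illustrative. The notes leave the 5 to 15 min range open; this sketch assigns it to micro-batch because a batch cadence of 15 min or more would miss the target.

```python
from datetime import timedelta


def pick_tier(max_staleness: timedelta) -> str:
    """Map a freshness requirement to the processing tiers listed above."""
    if max_staleness <= timedelta(milliseconds=100):
        return "in-process"
    if max_staleness <= timedelta(seconds=5):
        return "stream"
    # Assumption: requirements between 5 and 15 min still go to micro-batch.
    if max_staleness < timedelta(minutes=15):
        return "micro-batch"
    return "batch"


if __name__ == "__main__":
    for need in (timedelta(milliseconds=50), timedelta(seconds=2),
                 timedelta(minutes=3), timedelta(hours=1)):
        print(need, "->", pick_tier(need))
```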
Schema management
- Use Avro / Protobuf with a Schema Registry.
- Forward + backward compatibility rules: add new fields only with defaults; never remove existing fields.
- Use log-compacted topics (cleanup.policy=compact) for "latest value per key" semantics.
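
Two rules above can be sketched in plain Python, with no registry or broker involved. `compat_violations` applies the "add with defaults, never remove" rule to Avro-style field lists; `compact` models only the end state of a compacted topic (last value per key, with a `None` value treated as a tombstone). Both function names and the record shapes are assumptions for illustration; a real Schema Registry runs its own compatibility checks, and real compaction happens asynchronously on the broker.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def compat_violations(old_fields: List[dict], new_fields: List[dict]) -> List[str]:
    """Flag changes that break the 'add with defaults, never remove' rule."""
    old = {f["name"]: f for f in old_fields}
    new = {f["name"]: f for f in new_fields}
    problems = []
    for name in sorted(old.keys() - new.keys()):
        problems.append(f"removed field: {name}")
    for name in sorted(new.keys() - old.keys()):
        if "default" not in new[name]:
            problems.append(f"added field without default: {name}")
    return problems


def compact(records: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, str]:
    """Return the latest value per key; a None value deletes the key."""
    latest: Dict[str, str] = {}
    for key, value in records:
        if value is None:
            latest.pop(key, None)  # tombstone: drop the key
        else:
            latest[key] = value    # later records overwrite earlier ones
    return latest


if __name__ == "__main__":
    v1 = [{"name": "id", "type": "string"}]
    v2 = v1 + [{"name": "email", "type": "string"}]
    print(compat_violations(v1, v2))
    print(compact([("a", "1"), ("b", "2"), ("a", "3"), ("b", None)]))
```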
Cost levers
- Compact / dedupe early to shrink downstream.
- Tier storage in the warehouse: keep hot data on fast storage, move cold data to cheaper tiers.
- Time-travel features (Snowflake / Iceberg): convenient, but retained snapshots add storage cost.
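
As a sketch of "dedupe early": a generator that drops repeated event IDs before they reach downstream stages. It assumes each event is a dict with an `event_id` field and that duplicates arrive close together, so a bounded memory of recent IDs is enough. The field name and the `max_keys` bound are illustrative choices.

```python
from collections import OrderedDict
from typing import Iterable, Iterator


def dedupe(events: Iterable[dict], key: str = "event_id",
           max_keys: int = 10_000) -> Iterator[dict]:
    """Yield each event once, remembering at most max_keys recent IDs."""
    seen: "OrderedDict[object, None]" = OrderedDict()
    for event in events:
        k = event[key]
        if k in seen:
            seen.move_to_end(k)       # refresh recency, drop the duplicate
            continue
        seen[k] = None
        if len(seen) > max_keys:
            seen.popitem(last=False)  # evict the least recently seen ID
        yield event


if __name__ == "__main__":
    stream = [{"event_id": 1}, {"event_id": 2}, {"event_id": 1}]
    print(list(dedupe(stream)))
```

The bound trades correctness for memory: a duplicate that arrives after its ID has been evicted passes through, so downstream writes should still be idempotent.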
Operational checklist
- Consumer lag monitor (Burrow / Kafka exporter).
- Watermark monitor for stream jobs.
- Backfill SOP: pause downstream, replay, validate, resume.
- Schema-evolution playbook.
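
A rough sketch of two checklist items. `consumer_lag` computes per-partition lag as log end offset minus committed offset, which is the quantity lag monitors report; treating a missing committed offset as 0 is an assumption of this sketch. `run_backfill` encodes the SOP order (pause, replay, validate, resume) with four hypothetical callbacks standing in for whatever the pipeline actually exposes.

```python
from typing import Callable, Dict


def consumer_lag(end_offsets: Dict[int, int],
                 committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag = log end offset - committed offset (floored at 0)."""
    return {p: max(end - committed.get(p, 0), 0)
            for p, end in end_offsets.items()}


def run_backfill(pause: Callable[[], None], replay: Callable[[], None],
                 validate: Callable[[], bool],
                 resume: Callable[[], None]) -> None:
    """Run the backfill steps in order; resume only if validation passes."""
    pause()
    replay()
    if not validate():
        # Downstream deliberately stays paused so bad data does not propagate.
        raise RuntimeError("backfill validation failed; downstream still paused")
    resume()


if __name__ == "__main__":
    print(consumer_lag({0: 120, 1: 80}, {0: 100, 1: 80}))
    run_backfill(lambda: print("pause"), lambda: print("replay"),
                 lambda: True, lambda: print("resume"))
```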
Refs
- "Streaming Systems" — Akidau, Chernyak, Lax (Google).
- Apache Flink docs, Kafka Streams DSL.
- Jay Kreps: "Questioning the Lambda Architecture" (origin of Kappa).
- "Designing Data-Intensive Applications" — ch.10-11.