Batch & Stream — Notes

Choosing freshness

  • ≤ 100 ms — in-process (Kafka Streams, Flink low-latency mode).
  • ≤ 5 s — mainstream stream processing.
  • ≤ 5 min — micro-batches (Spark Structured Streaming with 30 s triggers).
  • ≥ 15 min — batch jobs.
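
The thresholds above can be read as a lookup from a freshness target to a processing style. The following is a minimal sketch, assuming a hypothetical helper named pick_tier and shorthand tier labels. The list leaves the range between 5 and 15 minutes open, so the sketch reports that range as a judgment call rather than picking a side.

```python
from datetime import timedelta

# Upper bounds taken from the list above; the tier labels are shorthand.
TIERS = [
    (timedelta(milliseconds=100), "in-process"),
    (timedelta(seconds=5), "stream processing"),
    (timedelta(minutes=5), "micro-batch"),
]
BATCH_FLOOR = timedelta(minutes=15)


def pick_tier(freshness: timedelta) -> str:
    """Return a processing style for a target end-to-end freshness."""
    for upper_bound, tier in TIERS:
        if freshness <= upper_bound:
            return tier
    if freshness >= BATCH_FLOOR:
        return "batch"
    # The notes leave 5-15 min unspecified: either style may fit.
    return "micro-batch or batch (judgment call)"
```

For example, pick_tier(timedelta(seconds=2)) returns "stream processing", and pick_tier(timedelta(hours=1)) returns "batch".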

Schema management

  • Use Avro / Protobuf with a Schema Registry.
  • Forward + backward compatibility: add fields only with defaults; never remove (or rename) existing fields.
  • Use log-compacted topics (cleanup.policy=compact) for "latest value per key" semantics.
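
Two of the rules above are easy to show in plain Python. This is a rough sketch, not a real registry or broker API: compat_violations and compact are hypothetical helpers, schemas are modelled as simple dicts of field name to spec, and a None value is assumed to act as a tombstone, as it does in Kafka log compaction.

```python
def compat_violations(old_fields: dict, new_fields: dict) -> list[str]:
    """Check the 'add with defaults, never remove' rule.

    Each argument maps field name -> spec dict; a spec containing a
    "default" key counts as having a default. This is a simplified
    stand-in for a registry's compatibility check, not its real API.
    """
    problems = []
    for name in old_fields:
        if name not in new_fields:
            problems.append(f"removed field: {name}")
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            problems.append(f"added field without default: {name}")
    return problems


def compact(records):
    """Fold (key, value) records into the latest value per key.

    A value of None is treated as a tombstone that deletes the key.
    """
    latest = {}
    for key, value in records:
        if value is None:
            latest.pop(key, None)
        else:
            latest[key] = value
    return latest
```

An empty list from compat_violations means the new schema follows both rules. compact shows the end state a consumer would rebuild by reading a compacted topic from the start.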

Cost levers

  • Compact / dedupe early to shrink downstream data volume.
  • Tier hot vs cold storage in the warehouse.
  • Time-travel features (Snowflake / Iceberg): fast to query or restore from, but retained history adds storage cost.
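
As an illustration of deduping early, the sketch below drops repeated events near the source so later stages never see them. It assumes each event is a dict with an "id" field and uses a bounded memory of recent keys. The function name dedupe and the max_keys parameter are illustrative, not taken from any particular framework.

```python
from collections import OrderedDict


def dedupe(events, key=lambda e: e["id"], max_keys=100_000):
    """Yield each event once, remembering at most max_keys recent keys.

    Memory is bounded, so a duplicate that arrives after its key has
    been evicted will pass through; size max_keys to the replay window.
    """
    seen = OrderedDict()
    for event in events:
        k = key(event)
        if k in seen:
            seen.move_to_end(k)  # refresh recency, drop the duplicate
            continue
        seen[k] = True
        if len(seen) > max_keys:
            seen.popitem(last=False)  # evict the least recently seen key
        yield event
```

The bounded window trades exactness for predictable memory. A pipeline that needs strict exactly-once results would still need an idempotent sink or a keyed state store downstream.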

Operational checklist

  • Consumer lag monitor (Burrow / Kafka exporter).
  • Watermark monitor for stream jobs.
  • Backfill SOP: pause downstream, replay, validate, resume.
  • Schema-evolution playbook.
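
The first three checklist items reduce to small pieces of arithmetic and ordering. A minimal sketch follows, assuming offsets and timestamps have already been fetched by whatever monitor is in use. The function names and the four callables passed to run_backfill are hypothetical hooks, not part of Burrow, Kafka, or Flink.

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag = log end offset - committed offset, floored at 0.

    Partitions with no committed offset are treated as behind from
    offset 0, which is an assumption of this sketch.
    """
    return {
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def watermark_lag_seconds(watermark_ts: float, now_ts: float) -> float:
    """How far the job's event-time watermark trails the wall clock."""
    return max(now_ts - watermark_ts, 0.0)


def run_backfill(pause, replay, validate, resume):
    """Run the backfill SOP: pause downstream, replay, validate, resume.

    If validation fails, downstream is deliberately left paused so that
    unvalidated data does not propagate; the caller decides what next.
    """
    pause()
    replay()
    if not validate():
        raise RuntimeError("backfill validation failed; downstream left paused")
    resume()
```

Leaving downstream paused on a failed validation is a design choice of this sketch. A team that prefers availability over strictness might resume and alert instead.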

Refs

  • "Streaming Systems" — Akidau, Chernyak, Lax (Google).
  • Apache Flink docs, Kafka Streams DSL.
  • Jay Kreps: "Questioning the Lambda Architecture" (origin of Kappa).
  • "Designing Data-Intensive Applications" — ch.10-11.