Skip to content

Batch & Stream Processing — Detailed#

flowchart TB
  subgraph Sources
    APP[App events]
    DB[DB CDC]
    LOGS[Logs]
  end

  subgraph Bus[Streaming bus]
    K[Kafka / Kinesis / Pub-Sub]
  end

  subgraph Lake[Storage]
    DL[(Data lake - S3 / Parquet)]
    DWH[(Warehouse - Snowflake / BigQuery / Redshift)]
  end

  subgraph Batch[Batch processing]
    BJ[Airflow / Spark / dbt]
    SCHED[Cron-style schedule]
  end

  subgraph Stream[Stream processing]
    FL[Flink / Kafka Streams / Spark Structured Streaming]
    STATE[(State store - RocksDB)]
    WIN[Tumbling / hopping / session windows]
    WM[Watermarks for late events]
  end

  Sources --> K
  K --> Stream
  K --> DL
  DL --> Batch
  Batch --> DWH
  Stream --> DWH
  Stream --> RT[(Real-time serving store)]

  classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
  class BJ,FL,WIN,WM,SCHED,STATE compute;

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class APP,WIN,WM service;
    class DB,DWH,STATE,RT datastore;
    class K,FL queue;
    class BJ,SCHED compute;
    class DL storage;
    class LOGS obs;

When each wins#

Need Pick
Daily / hourly reports Batch
Heavy backfills, joins on TB Batch
Real-time dashboards Stream
Real-time ML features Stream
Anomaly / fraud detection Stream
Strict reproducibility Batch (deterministic re-run)
Latency-sensitive personalization Stream

Lambda architecture#

flowchart LR
  Ev[Events] --> Bf[Batch: full re-compute]
  Ev --> Sf[Stream: incremental]
  Bf --> Serve[Serving layer]
  Sf --> Serve
  Serve --> User

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class Ev,Bf,Serve service;
    class Sf queue;

Two code paths, two storage paths — heavy maintenance.

Kappa architecture#

flowchart LR
  Ev[Events forever-log] --> Sf[Stream: one path]
  Sf --> Serve[Serving layer]
  Serve --> User
  Ev -.replay.-> Sf

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class Ev,Serve service;
    class Sf queue;

One code path. Reprocess history by rewinding the stream.

Stream-processing fundamentals#

Time#

  • Event time: when the event happened (clock on the device).
  • Ingestion time: when it entered the system.
  • Processing time: when the operator processed it.
  • Watermark: "we've seen all events with event-time ≤ T" — triggers window close.

Windows#

  • Tumbling — fixed, non-overlapping (every 1 min).
  • Hopping / sliding — fixed size, overlap (5 min window, slide 1 min).
  • Session — gap-defined (close after 30 min idle).

State#

  • Stream operators maintain state (counts, joins, sessions).
  • Backed by an embedded store (Flink + RocksDB).
  • Checkpointed to durable storage for recovery (every 1-30 s).

Exactly-once#

  • Source: replayable, dedupable offsets (Kafka).
  • Operator: deterministic + idempotent + transactional sinks (Flink + Kafka tx).
  • Sink: upsert by key.

OLAP store choices#

Store Best for Trade-off
Snowflake / BigQuery / Redshift analyst SQL, federated per-query cost
ClickHouse / Druid / Pinot sub-second OLAP at scale ops complexity
Iceberg / Delta / Hudi on S3 data-lake table format governance + ETL friction

ETL vs ELT#

  • ETL (transform before load) — older; precomputed shape.
  • ELT (transform inside warehouse with dbt) — modern; raw data preserved.

Common pitfalls#

  • Cardinality blow-up in stream aggregations (group by user_id × second).
  • Watermark stuck because one slow source — partition the bus, run shadow watermarks.
  • Reprocessing in Lambda — code drift between batch and stream branches.
  • Backfills: stream platforms struggle with multi-year historical data; bring batch back for that.

Glossary & fundamentals#

Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.

Tag Concept What it is Page
HLD Pub/Sub & message brokers topics, consumer groups, delivery semantics pub-sub-pattern
HLD LSM vs B-Tree engines WAL, memtable, SSTables, compaction storage-engines-lsm-btree
HLD Change Data Capture WAL/binlog tailing, outbox publishing change-data-capture
HLD Batch & stream processing Lambda vs Kappa, watermarks, windows batch-stream-processing