Batch & Stream Processing — Detailed#
flowchart TB
subgraph Sources
APP[App events]
DB[DB CDC]
LOGS[Logs]
end
subgraph Bus[Streaming bus]
K[Kafka / Kinesis / Pub-Sub]
end
subgraph Lake[Storage]
DL[(Data lake - S3 / Parquet)]
DWH[(Warehouse - Snowflake / BigQuery / Redshift)]
end
subgraph Batch[Batch processing]
BJ[Airflow / Spark / dbt]
SCHED[Cron-style schedule]
end
subgraph Stream[Stream processing]
FL[Flink / Kafka Streams / Spark Structured Streaming]
STATE[(State store - RocksDB)]
WIN[Tumbling / hopping / session windows]
WM[Watermarks for late events]
end
Sources --> K
K --> Stream
K --> DL
DL --> Batch
Batch --> DWH
Stream --> DWH
Stream --> RT[(Real-time serving store)]
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
class BJ,FL,WIN,WM,SCHED,STATE compute;
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class APP,WIN,WM service;
class DB,DWH,STATE,RT datastore;
class K,FL queue;
class BJ,SCHED compute;
class DL storage;
class LOGS obs;
When each wins#
| Need | Pick |
|---|---|
| Daily / hourly reports | Batch |
| Heavy backfills, joins over terabyte-scale data | Batch |
| Real-time dashboards | Stream |
| Real-time ML features | Stream |
| Anomaly / fraud detection | Stream |
| Strict reproducibility | Batch (deterministic re-run) |
| Latency-sensitive personalization | Stream |
Lambda architecture#
flowchart LR
Ev[Events] --> Bf[Batch: full re-compute]
Ev --> Sf[Stream: incremental]
Bf --> Serve[Serving layer]
Sf --> Serve
Serve --> User
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class Ev,Bf,Serve service;
class Sf queue;
Two code paths, two storage paths — heavy maintenance.
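A minimal sketch of how a Lambda serving layer might combine the two paths, assuming the batch view covers events up to a cutoff and the stream view holds increments after it. The names `batch_recompute`, `stream_increment`, and `serve_count` are hypothetical and do not come from any framework.

```python
from collections import Counter

def batch_recompute(events, cutoff):
    """Batch path: full re-compute over every event up to the cutoff."""
    return Counter(e["user"] for e in events if e["ts"] <= cutoff)

def stream_increment(view, event):
    """Stream path: incremental update for one event after the cutoff."""
    view[event["user"]] += 1

def serve_count(user, batch_view, speed_view):
    """Serving layer: merge the batch view with the real-time delta."""
    return batch_view[user] + speed_view[user]

events = [
    {"user": "a", "ts": 1},
    {"user": "b", "ts": 2},
    {"user": "a", "ts": 3},
    {"user": "a", "ts": 4},
]
cutoff = 2  # assumption: the last batch run covered ts <= 2

batch_view = batch_recompute(events, cutoff)
speed_view = Counter()
for e in events:
    if e["ts"] > cutoff:
        stream_increment(speed_view, e)

print(serve_count("a", batch_view, speed_view))  # 3
```

The counting rule is written twice, once per path. Keeping both copies in agreement is the maintenance cost noted above.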
Kappa architecture#
flowchart LR
Ev[Events forever-log] --> Sf[Stream: one path]
Sf --> Serve[Serving layer]
Serve --> User
Ev -.replay.-> Sf
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class Ev,Serve service;
class Sf queue;
One code path. Reprocess history by rewinding the stream.
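As a rough illustration of rewinding, the sketch below models the retained log as an in-memory list and reprocesses it from offset 0 with a changed job. `EventLog`, `total_v1`, and `total_v2` are hypothetical names; a real deployment would read from a retained topic instead.

```python
class EventLog:
    """Append-only log; a stand-in for a topic with long retention."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        """Replay every record starting at the given offset."""
        return iter(self._records[offset:])

def total_v1(log, start=0):
    """Original job: sum every amount."""
    return sum(r["amount"] for r in log.read_from(start))

def total_v2(log, start=0):
    """Revised job: same code path, but refunds (negative amounts) are ignored."""
    return sum(r["amount"] for r in log.read_from(start) if r["amount"] > 0)

log = EventLog()
for amount in (10, -3, 5):
    log.append({"amount": amount})

print(total_v1(log))  # 12
print(total_v2(log))  # 15, obtained by replaying the same log from offset 0
```

Reprocessing here means deploying the new job and reading the log again from the start. There is no separate batch implementation to update.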
Stream-processing fundamentals#
Time#
- Event time: when the event happened (clock on the device).
- Ingestion time: when it entered the system.
- Processing time: when the operator processed it.
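- Watermark: the system's estimate that all events with event time ≤ T have arrived. When the watermark passes a window's end, the window closes; events that show up afterwards with event time ≤ T are late.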
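One common heuristic for producing a watermark is bounded out-of-orderness: the watermark trails the largest event time seen so far by a fixed delay. The sketch below assumes that heuristic and integer timestamps. The class name and `is_late` helper are illustrative, not an engine API.

```python
class BoundedOutOfOrdernessWatermark:
    """Watermark = max event time seen so far - max_delay (assumed heuristic)."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_event_time = None

    def observe(self, event_time):
        """Record one event and return the updated watermark."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return self.current()

    def current(self):
        if self.max_event_time is None:
            return float("-inf")  # nothing seen yet
        return self.max_event_time - self.max_delay

def is_late(event_time, watermark):
    """An event is late if the watermark already covers its event time."""
    return event_time <= watermark

gen = BoundedOutOfOrdernessWatermark(max_delay=5)
late = []
for event_time in [10, 12, 8, 20, 14]:
    if is_late(event_time, gen.current()):
        late.append(event_time)
    gen.observe(event_time)

print(gen.current(), late)  # 15 [14]
```

The event at time 8 arrives out of order but is still accepted, because the watermark was only at 7. The event at time 14 arrives after the watermark reached 15, so it is late.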
Windows#
- Tumbling — fixed, non-overlapping (every 1 min).
- Hopping / sliding — fixed size, overlap (5 min window, slide 1 min).
- Session — gap-defined (close after 30 min idle).
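The three window types can be sketched as plain functions over integer timestamps in seconds. This is an illustrative sketch, not an engine API: windows are half-open `[start, end)`, and a session is assumed to continue only while the gap between consecutive events is strictly less than `gap`.

```python
def tumbling_window(ts, size):
    """The single fixed window that contains ts."""
    start = ts - ts % size
    return (start, start + size)

def hopping_windows(ts, size, slide):
    """Every window of length `size`, starting on a multiple of `slide`, that contains ts."""
    windows = []
    start = ts - ts % slide  # latest window start at or before ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by at least `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1][1] = ts  # extend the open session
        else:
            sessions.append([ts, ts])  # start a new session
    return [(first, last) for first, last in sessions]

print(tumbling_window(1000, 60))            # (960, 1020)
print(len(hopping_windows(1000, 300, 60)))  # 5
print(session_windows([0, 600, 1200, 4000, 4100], gap=1800))
# [(0, 1200), (4000, 4100)]
```

With a 5 min window sliding every 1 min, each event lands in five overlapping windows. That multiplies the state and output of a hopping aggregation compared with a tumbling one.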
State#
- Stream operators maintain state (counts, joins, sessions).
- Backed by an embedded store (for example RocksDB in Flink).
- Checkpointed periodically to durable storage for recovery (for example every 1-30 s; the interval is tunable).
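A toy sketch of the checkpoint-and-restore cycle, assuming the operator snapshots its state together with its input position to a local JSON file. `CountingOperator` is hypothetical; a real engine coordinates snapshots across operators and writes them to durable storage.

```python
import json
import os
import tempfile

class CountingOperator:
    """Keeps per-key counts plus the next input offset to read."""

    def __init__(self):
        self.counts = {}
        self.offset = 0

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1

    def checkpoint(self, path):
        """Write state and offset together, then atomically replace the old snapshot."""
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"counts": self.counts, "offset": self.offset}, f)
        os.replace(tmp, path)

    @classmethod
    def restore(cls, path):
        op = cls()
        with open(path) as f:
            snapshot = json.load(f)
        op.counts, op.offset = snapshot["counts"], snapshot["offset"]
        return op

events = ["a", "b", "a", "c", "a"]
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "checkpoint.json")
    op = CountingOperator()
    for key in events[:3]:
        op.process(key)
    op.checkpoint(path)
    op.process(events[3])  # work after the checkpoint is lost in the simulated crash

    recovered = CountingOperator.restore(path)
    for key in events[recovered.offset:]:  # resume from the checkpointed offset
        recovered.process(key)

print(recovered.counts)  # {'a': 3, 'b': 1, 'c': 1}
```

Because the offset is stored in the same snapshot as the counts, the event processed after the checkpoint is replayed once after recovery instead of being dropped or double counted.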
Exactly-once#
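- Source: replayable from a recorded offset, so input can be re-read after a failure (Kafka).
- Operator: deterministic processing, with operator state and source offsets checkpointed together.
- Sink: idempotent (upsert by key) or transactional (Flink with Kafka transactions), so replayed output is not applied twice.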
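To see why an upsert-by-key sink tolerates replays, the sketch below delivers one record twice, as can happen after a restart, and compares an append sink with an upsert sink. It assumes the operator is deterministic, so the redelivered record carries the same value. The record layout and function names are made up for illustration.

```python
def append_sink(table, record):
    """Non-idempotent: every delivery adds a row."""
    table.append(record)

def upsert_sink(table, record):
    """Idempotent: a redelivery overwrites the row with the same key."""
    table[record["key"]] = record["value"]

deliveries = [
    {"key": ("user-1", "12:00"), "value": 3},
    {"key": ("user-2", "12:00"), "value": 1},
    {"key": ("user-1", "12:00"), "value": 3},  # redelivered after a restart
]

appended = []
upserted = {}
for record in deliveries:
    append_sink(appended, record)
    upsert_sink(upserted, record)

print(len(appended))  # 3, the duplicate is visible downstream
print(len(upserted))  # 2, the duplicate collapsed onto the same key
```

Delivery is still at-least-once in this sketch. The exactly-once effect comes from the sink making repeated writes harmless.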
OLAP store choices#
| Store | Best for | Trade-off |
|---|---|---|
| Snowflake / BigQuery / Redshift | analyst SQL, federated queries | cost scales with usage (compute time or data scanned) |
| ClickHouse / Druid / Pinot | sub-second OLAP at scale | ops complexity |
| Iceberg / Delta / Hudi on S3 | data-lake table format | governance + ETL friction |
ETL vs ELT#
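- ETL (transform before load): the older pattern; the warehouse receives only the precomputed shape, so a new shape requires re-extracting from the source.
- ELT (load raw, then transform inside the warehouse, e.g. with dbt): the modern pattern; raw data is preserved, so transformations can be changed and re-run.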
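A small ELT sketch, using an in-memory SQLite database as a stand-in for the warehouse. The table, view, and column names are invented for the example, and the view plays the role a dbt model would play in a real project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: raw records go in untouched.
conn.execute(
    "CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [
        (1, "ann", 1250, "paid"),
        (2, "bob", 400, "refunded"),
        (3, "ann", 750, "paid"),
        (4, "bob", 999, "paid"),
    ],
)

# Transform: runs inside the "warehouse" as SQL over the raw table.
conn.execute(
    """
    CREATE VIEW customer_revenue AS
    SELECT customer, SUM(amount_cents) / 100.0 AS revenue
    FROM raw_orders
    WHERE status = 'paid'
    GROUP BY customer
    """
)

rows = conn.execute(
    "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]

print(rows)
print(raw_count)  # 4, the refunded order is still available in the raw table
```

In an ETL version of the same pipeline, the filter and aggregation would run before the insert, and the refunded order would never reach the warehouse.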
Common pitfalls#
- Cardinality blow-up in stream aggregations (group by user_id × second).
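- Watermark stuck because of one slow or idle source: the operator's watermark is the minimum across its inputs, so one lagging input holds back every window. Partition the bus, track watermarks per partition, run shadow watermarks.
- Reprocessing in Lambda: logic drifts between the batch and stream branches, so their results stop agreeing.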
- Backfills: stream platforms struggle with multi-year historical data; bring batch back for that.
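The stuck-watermark pitfall can be illustrated with a few lines: the combined watermark is the minimum over input partitions, so one quiet partition pins it. The sketch below assumes one possible mitigation, ignoring partitions that have been idle longer than a timeout. The function name, data layout, and numbers are illustrative only.

```python
def combined_watermark(partitions, now, idle_timeout):
    """Minimum watermark over partitions that produced an event recently.

    partitions maps a partition name to (watermark, wall-clock time of its last event).
    Returns None when every partition is idle.
    """
    active = [
        watermark
        for watermark, last_seen in partitions.values()
        if now - last_seen <= idle_timeout
    ]
    return min(active) if active else None

partitions = {
    "p0": (100, 995),
    "p1": (98, 999),
    "p2": (40, 700),  # slow source: no events for 300 s
}

# Without idleness handling, the slow partition pins the watermark.
print(combined_watermark(partitions, now=1000, idle_timeout=float("inf")))  # 40

# Treating partitions idle for more than 60 s as inactive lets it advance.
print(combined_watermark(partitions, now=1000, idle_timeout=60))  # 98
```

The trade-off is that events arriving later from the excluded partition may be behind the watermark and will be handled as late data.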
Glossary & fundamentals#
Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.
| Tag | Concept | What it is | Page |
|---|---|---|---|
| HLD | Pub/Sub & message brokers | topics, consumer groups, delivery semantics | pub-sub-pattern |
| HLD | LSM vs B-Tree engines | WAL, memtable, SSTables, compaction | storage-engines-lsm-btree |
| HLD | Change Data Capture | WAL/binlog tailing, outbox publishing | change-data-capture |
| HLD | Batch & stream processing | Lambda vs Kappa, watermarks, windows | batch-stream-processing |