Change Data Capture — Detailed#
flowchart TB
subgraph Sources[Source DBs]
PG[(Postgres - logical WAL,<br/>wal2json / pgoutput)]
MY[(MySQL - binlog ROW)]
MG[(MongoDB - oplog / change streams)]
DY[(DynamoDB Streams)]
OR[(Oracle - LogMiner / GoldenGate)]
SQ[(SQL Server CDC tables)]
end
subgraph Capture[Capture]
DEB([Debezium connectors])
DMS[AWS DMS / GCP Datastream]
MAX[Maxwell / Canal]
SNAP[Initial snapshot + ongoing tail]
PUB[(Replication slot / publication)]
end
subgraph Bus[Streaming Bus]
KAF[[Kafka topics<br/>one per table]]
SR[Schema Registry<br/>Avro / Protobuf]
COMPACT[Log compaction<br/>by PK]
end
subgraph Sinks[Sinks / consumers]
ES[(Search index<br/>Elasticsearch)]
CACHE[Cache invalidator]
DWH[(Snowflake / BigQuery / Redshift)]
LAKE[(Data lake / Iceberg / Hudi)]
ML[Feature store]
AUDIT[Audit immutable log]
INV[Inverse: SCD2 dim]
OUT[[Outbox-based event publisher]]
end
subgraph Patterns
OUTB[[Outbox pattern<br/>transactional event emission]]
DEDU[Dedup at sink via offset/LSN]
BACK([Backfill bootstrap])
EVOL[Schema evolution]
DLQ[[DLQ on transform errors]]
SAGA[Sagas powered by CDC]
end
PG --> DEB
MY --> DEB
MG --> DEB
DY --> DMS
OR --> DMS
SQ --> DEB
MY --> MAX
MAX --> KAF
DEB --> KAF
DMS --> KAF
KAF --> SR
KAF --> COMPACT
KAF --> ES
KAF --> CACHE
KAF --> DWH
KAF --> LAKE
KAF --> ML
KAF --> AUDIT
OUTB --- DEB
SNAP --> KAF
DLQ --- DEB
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class DMS,MAX,SNAP,SR,COMPACT,CACHE,INV,DEDU,EVOL,SAGA service;
class PG,MY,MG,DY,OR,SQ,PUB,ES,DWH,LAKE,ML,AUDIT datastore;
class KAF,OUT,OUTB,DLQ queue;
class DEB,BACK compute;
CDC mechanics#
- Log-based: tail the DB's WAL/binlog → low overhead, captures every change.
- Trigger-based: row triggers write to a shadow table → higher write overhead, simpler to set up.
- Polling: scan a timestamp/version column → misses hard deletes and intermediate updates, high lag.
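The difference between polling and log tailing is easier to see on a toy table. The following is a minimal sketch, not a real connector: `Table`, `poll_changes`, and `tail_log` are hypothetical names, and the in-memory `log` list only stands in for a WAL or binlog.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    value: str
    updated_at: int  # logical timestamp, bumped on every write


class Table:
    """Toy table that also keeps an append-only change log (WAL stand-in)."""

    def __init__(self):
        self.rows = {}
        self.log = []  # entries are (op, id, value)
        self.clock = 0

    def upsert(self, id, value):
        self.clock += 1
        op = "u" if id in self.rows else "c"
        self.rows[id] = Row(id, value, self.clock)
        self.log.append((op, id, value))

    def delete(self, id):
        self.clock += 1
        del self.rows[id]
        self.log.append(("d", id, None))


def poll_changes(table, since):
    """Polling: scan rows whose updated_at is newer than the last poll."""
    return sorted(
        (r.id, r.value) for r in table.rows.values() if r.updated_at > since
    )


def tail_log(table, position):
    """Log-based: read every entry appended after the given log position."""
    return table.log[position:]


table = Table()
table.upsert(1, "a")
table.upsert(2, "b")
table.upsert(1, "a2")  # the intermediate value "a" is overwritten
table.delete(2)        # row 2 is gone before the poller runs

polled = poll_changes(table, since=0)
tailed = tail_log(table, position=0)
```

Here the poller only sees the final state of row 1 and never learns that row 2 existed, while the log reader sees all four changes, including the delete.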
Snapshot + tail#
- On first connect, snapshot the table(s) to the topic with `__op=r` (read).
- Then attach to the log position; events arrive as `c`/`u`/`d` (create/update/delete).
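The handoff from snapshot to tail can be sketched as a single generator. This is an illustration under simplifying assumptions: the log is a plain list, `snapshot_position` is the log index at which the snapshot was taken, and every field name other than `__op` is hypothetical.

```python
def snapshot_then_tail(snapshot_rows, log, snapshot_position):
    """Emit snapshot rows as op 'r', then log entries from snapshot_position on.

    snapshot_rows     -- dict pk -> row, the table state at snapshot time
    log               -- list of (op, pk, row) in commit order
    snapshot_position -- entries before this index are already in the snapshot
    """
    for pk, row in sorted(snapshot_rows.items()):
        yield {"__op": "r", "pk": pk, "after": row}
    for position in range(snapshot_position, len(log)):
        op, pk, row = log[position]
        yield {"__op": op, "pk": pk, "after": row, "position": position}


log = [
    ("c", 1, {"name": "a"}),
    ("c", 2, {"name": "b"}),
    ("u", 1, {"name": "a2"}),
    ("d", 2, None),
]
# State of the table after the first two log entries were applied.
snapshot_rows = {1: {"name": "a"}, 2: {"name": "b"}}

events = list(snapshot_then_tail(snapshot_rows, log, snapshot_position=2))
ops = [e["__op"] for e in events]
```

Recording the log position at snapshot time is what keeps the two phases from overlapping or leaving a gap; a real connector gets that position from the database rather than from a list index.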
Ordering & exactly-once#
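- Per-table topic → events for the same PK stay ordered when `partition.key = pk` (ordering holds within a partition, not across the topic).
- Sinks track `offset` or `(lsn, op)` for idempotent upserts, so replayed events have no extra effect.
- Schema registry enforces backward-compatible changes.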
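A sink that tolerates replay might look like the sketch below. It assumes each event carries a monotonically increasing `lsn`; `IdempotentSink`, `partition_for`, and the event field names are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash the producer's partitioner actually uses.

```python
import zlib


def partition_for(pk, num_partitions):
    """Stable hash so every event for one PK lands in the same partition."""
    return zlib.crc32(str(pk).encode("utf-8")) % num_partitions


class IdempotentSink:
    """Applies change events at most once per (pk, lsn)."""

    def __init__(self):
        self.rows = {}     # pk -> current row
        self.applied = {}  # pk -> highest LSN already applied

    def apply(self, event):
        pk, lsn = event["pk"], event["lsn"]
        if lsn <= self.applied.get(pk, -1):
            return False  # duplicate or replayed event, skip it
        if event["op"] == "d":
            self.rows.pop(pk, None)
        else:
            self.rows[pk] = event["after"]
        # Remember the LSN even for deletes so a replayed older
        # update cannot resurrect the row.
        self.applied[pk] = lsn
        return True


events = [
    {"op": "c", "pk": 1, "lsn": 10, "after": {"v": "a"}},
    {"op": "u", "pk": 1, "lsn": 11, "after": {"v": "b"}},
    {"op": "d", "pk": 1, "lsn": 12, "after": None},
    {"op": "c", "pk": 2, "lsn": 13, "after": {"v": "x"}},
]

sink = IdempotentSink()
first_pass = [sink.apply(e) for e in events]
replay = [sink.apply(e) for e in events]  # simulate redelivery after a restart
```

Combined with at-least-once delivery, this gives an effectively-once result at the sink: the second pass changes nothing.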
Real uses#
- Cache invalidation: bust Redis key on row change.
- Search index sync: Debezium topics feed ES through a Kafka Connect Elasticsearch sink.
- Analytics: Postgres → Kafka → Snowflake/BigQuery (near-real-time).
- Service decomposition: extract a microservice consuming legacy DB CDC.
- Audit log: append-only journal.
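As one illustration of these uses, a cache invalidator can be a very small consumer. The sketch below uses a plain dict in place of Redis; `invalidate_on_change`, the `user:` key prefix, and the event fields are hypothetical.

```python
def invalidate_on_change(cache, event, key_prefix="user:"):
    """Drop the cached entry for a changed row; later reads repopulate it."""
    if event["op"] not in ("c", "u", "d"):
        return None  # snapshot reads ('r') do not change the row
    key = f"{key_prefix}{event['pk']}"
    cache.pop(key, None)  # no error if the key was never cached
    return key


cache = {"user:1": {"name": "old"}, "user:2": {"name": "kept"}}

removed = invalidate_on_change(cache, {"op": "u", "pk": 1})
ignored = invalidate_on_change(cache, {"op": "r", "pk": 2})
```

Deleting the key rather than writing the new value keeps the handler idempotent, so a redelivered event is harmless.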
Pitfalls#
- Schema changes (DDL) need handling; some connectors halt on ALTER.
- Initial snapshot of huge tables: use a chunked (incremental) snapshot mode so the snapshot can be paused and resumed.
- Replication slots in Postgres can pin WAL and fill disk if consumers lag.
- Connector failure semantics: Debezium is at-least-once and resumes from stored offsets, so events can be redelivered after a restart; design sinks for replay.
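The replication-slot pitfall can be monitored by comparing the server's current WAL position with a slot's `restart_lsn` (in Postgres these come from `pg_current_wal_lsn()` and the `pg_replication_slots` view). The sketch below only does the arithmetic on the textual LSN form; the function names and the threshold are hypothetical.

```python
def parse_lsn(text):
    """Convert a Postgres LSN such as '16/B374D848' to a byte position.

    The text form is two hexadecimal numbers: the high 32 bits and
    the low 32 bits of a 64-bit WAL position.
    """
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def retained_wal_bytes(current_lsn, restart_lsn):
    """Bytes of WAL the server must keep for a slot that is this far behind."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)


def slot_is_at_risk(current_lsn, restart_lsn, max_retained_bytes):
    """True when the slot pins more WAL than the chosen budget allows."""
    return retained_wal_bytes(current_lsn, restart_lsn) > max_retained_bytes


lag = retained_wal_bytes("1/10", "0/FFFFFFF0")
at_risk = slot_is_at_risk("1/10", "0/FFFFFFF0", max_retained_bytes=16)
```

An alert on this value, with a budget sized to the free disk space, gives warning before a lagging consumer fills the volume.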
Glossary & fundamentals#
Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.
| Tag | Concept | What it is | Page |
|---|---|---|---|
| HLD | Cache strategies | cache-aside, read/write-through, eviction | caching-strategies |
| HLD | Pub/Sub & message brokers | topics, consumer groups, delivery semantics | pub-sub-pattern |
| HLD | Leader/follower replication | sync/semi-sync/async replication, failover | replication-leader-follower |
| HLD | LSM vs B-Tree engines | WAL, memtable, SSTables, compaction | storage-engines-lsm-btree |
| HLD | Distributed transactions | 2PC, TCC, sagas, outbox/inbox | distributed-transactions |
| HLD | Change Data Capture | WAL/binlog tailing, outbox publishing | change-data-capture |
| LLD | Testing strategy | pyramid, doubles, TDD, contracts | testing-strategy |
| LLD | Immutability | immutable types, persistent collections | immutability |