Change Data Capture: Notes
Why CDC
- Single source of truth for "what changed" without dual-writes.
- Dual-writes (write DB + write to Kafka in app) are unsafe: no atomicity across the two systems, so one write can succeed while the other fails.
- CDC + outbox closes the gap.
Outbox pattern (paired with CDC)
- App writes `domain_row` and `outbox(event_id, payload)` in same DB tx.
- CDC tails outbox and publishes events.
- Mark or delete outbox rows post-publish.
This decouples the app from broker availability: the tx commits even when the broker is down, and events are published once it is back.
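
Sketch of the outbox flow above, using Python's stdlib `sqlite3` as a stand-in database. Assumptions (not from these notes): the `orders` table, the `seq` ordering column, and the `place_order`, `relay_once`, and `publish` names are hypothetical, and a polling relay stands in for a log-based CDC connector.

```python
import json
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    """In-memory DB with a hypothetical domain table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            seq      INTEGER PRIMARY KEY AUTOINCREMENT,
            event_id TEXT NOT NULL UNIQUE,
            payload  TEXT NOT NULL
        );
    """)
    return conn


def place_order(conn: sqlite3.Connection, order_id: int) -> str:
    """Write the domain row and its outbox event in one transaction."""
    event_id = str(uuid.uuid4())
    payload = json.dumps({"type": "order_placed", "order_id": order_id})
    with conn:  # commits both inserts together, or rolls both back on error
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (event_id, payload),
        )
    return event_id


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Publish pending outbox rows in order; delete each row after its publish."""
    rows = conn.execute(
        "SELECT seq, event_id, payload FROM outbox ORDER BY seq"
    ).fetchall()
    for seq, event_id, payload in rows:
        publish(event_id, payload)  # if this raises, the row stays for a retry
        with conn:
            conn.execute("DELETE FROM outbox WHERE seq = ?", (seq,))
    return len(rows)
```

A crash between `publish` and the `DELETE` re-sends the event on the next pass, so delivery is at-least-once and consumers should deduplicate on `event_id`.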
Bootstrap modes
- Snapshot-only (one-time export).
- Snapshot + streaming (initial backfill + ongoing).
- Streaming-only (rebuild from log retention).
- Chunked / parallel snapshot for huge tables.
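
Sketch of a chunked snapshot read via keyset pagination, one common way to split a large table into bounded reads. Assumptions: the `items(id, name)` table and the `snapshot_chunks` name are hypothetical, keys are non-negative integers, and concurrent writes are ignored (a real snapshot + streaming setup has to reconcile chunks with the change stream).

```python
import sqlite3
from typing import Iterator


def snapshot_chunks(
    conn: sqlite3.Connection, chunk_size: int
) -> Iterator[list[tuple]]:
    """Yield the items table in primary-key order, chunk_size rows at a time."""
    last_id = -1  # assumption: keys are non-negative integers
    while True:
        rows = conn.execute(
            "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the highest key seen so far
```

Because each chunk is addressed by a key range rather than an offset, disjoint ranges could be handed to parallel workers, and an interrupted snapshot can resume from the last recorded `last_id`.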
Sink design rules
- Idempotent upserts (`MERGE` or `ON CONFLICT`).
- Track last applied `(lsn, op_seq)` per partition.
- Handle tombstones (deletes) → either physical delete or soft-flag.
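
Sketch combining the three rules above in one apply function, again with stdlib `sqlite3` (its upsert syntax needs SQLite 3.24+). Assumptions: the `target` and `applied` tables, the `apply_event` name, integer `lsn` values, and the choice of soft-flag over physical delete are all illustrative.

```python
import sqlite3
from typing import Optional


def open_sink() -> sqlite3.Connection:
    """In-memory sink: a target table plus a per-partition position table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE target (
            id      INTEGER PRIMARY KEY,
            value   TEXT,
            deleted INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE applied (
            part   INTEGER PRIMARY KEY,
            lsn    INTEGER NOT NULL,
            op_seq INTEGER NOT NULL
        );
    """)
    return conn


def apply_event(conn: sqlite3.Connection, part: int, lsn: int, op_seq: int,
                key: int, value: Optional[str]) -> bool:
    """Apply one change event; value=None is a tombstone. False means skipped."""
    with conn:  # row change and position update commit atomically
        pos = conn.execute(
            "SELECT lsn, op_seq FROM applied WHERE part = ?", (part,)
        ).fetchone()
        if pos is not None and (lsn, op_seq) <= pos:
            return False  # at or behind the last applied position: a replay
        if value is None:
            # Tombstone handled as a soft-flag; a physical DELETE also works.
            conn.execute("UPDATE target SET deleted = 1 WHERE id = ?", (key,))
        else:
            conn.execute(
                "INSERT INTO target (id, value, deleted) VALUES (?, ?, 0) "
                "ON CONFLICT(id) DO UPDATE "
                "SET value = excluded.value, deleted = 0",
                (key, value),
            )
        conn.execute(
            "INSERT INTO applied (part, lsn, op_seq) VALUES (?, ?, ?) "
            "ON CONFLICT(part) DO UPDATE "
            "SET lsn = excluded.lsn, op_seq = excluded.op_seq",
            (part, lsn, op_seq),
        )
    return True
```

Storing the position in the same transaction as the data change is what makes redelivery after a restart harmless: a replayed event compares at or below the stored `(lsn, op_seq)` and is skipped.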
Watch out
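- Postgres replication slot retention: a lagging or abandoned slot pins WAL → disk full risk. Set `max_slot_wal_keep_size` (Postgres 13+) to cap retained WAL; a slot that exceeds the cap is invalidated and needs a re-snapshot.
- MySQL binlog format: must be ROW (not STATEMENT) for usable CDC.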
- DDL events: most tools keep a schema history topic; coordinate schema changes with downstream consumers.
- PII redaction at capture (column filter) if downstream is less trusted.
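
Sketch of a capture-side column filter for the PII bullet above. Assumptions: the column names, the drop/hash policy, the `REDACTION_KEY` value, and the `redact` name are hypothetical; in practice the key would come from a secret store, and many CDC tools offer their own column include/exclude settings instead.

```python
import hashlib
import hmac

# Hypothetical policy: columns removed entirely vs. replaced by a keyed hash.
DROP_COLUMNS = {"ssn"}
HASH_COLUMNS = {"email"}
REDACTION_KEY = b"replace-with-a-real-secret"  # placeholder, not a real key


def redact(row: dict) -> dict:
    """Return a copy of a captured row with PII columns dropped or hashed."""
    out = {}
    for col, val in row.items():
        if col in DROP_COLUMNS:
            continue  # never leaves the capture side
        if col in HASH_COLUMNS and val is not None:
            digest = hmac.new(
                REDACTION_KEY, str(val).encode("utf-8"), hashlib.sha256
            )
            out[col] = digest.hexdigest()  # stable token, still joinable
        else:
            out[col] = val
    return out
```

A keyed hash keeps equal inputs equal, so downstream joins on the column still work, while guessing the original value requires the key.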
Refs
- Debezium docs and connector reference.
- Martin Kleppmann: "Turning the database inside out", "Online event processing".
- Confluent blog series on CDC.
- AWS DMS, GCP Datastream docs.