Change Data Capture: Notes

Why CDC

  • Single source of truth for "what changed" without dual-writes.
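  • Dual-writes (app writes to the DB, then writes to Kafka) are unsafe: the two writes are not atomic, so a crash or failure between them leaves the DB and the topic inconsistent.
  • CDC + outbox closes the gap: the event commits in the same DB transaction as the data and is published from the log afterwards.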

Outbox pattern (paired with CDC)

  1. App writes domain_row and outbox(event_id, payload) in same DB tx.
  2. CDC tails outbox and publishes events.
  3. Mark or delete outbox rows post-publish.

This decouples the app from broker availability: a broker outage delays publishing but does not fail the app's transaction.
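
The three steps above can be sketched as follows. This is a minimal illustration, not a production relay: it uses an in-process SQLite database, a hypothetical `orders` table as the domain row, and a polling function `relay_once` standing in for the CDC connector that would normally tail the outbox from the log. The `publish` callback is an assumed placeholder for a broker client.

```python
import json
import sqlite3
import uuid


def init_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one domain table plus the outbox table.
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            seq      INTEGER PRIMARY KEY AUTOINCREMENT,  -- publish order
            event_id TEXT NOT NULL UNIQUE,
            payload  TEXT NOT NULL
        );
        """
    )


def place_order(conn: sqlite3.Connection, order_id: int) -> str:
    """Step 1: write the domain row and the outbox row in one transaction."""
    event_id = str(uuid.uuid4())
    payload = json.dumps({"type": "order_placed", "order_id": order_id})
    with conn:  # commits both inserts together, or rolls both back on error
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (event_id, payload),
        )
    return event_id


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Steps 2-3: publish pending outbox rows in order, then delete them.

    A row is deleted only after publish() returns, so a crash in between
    leads to a re-publish (at-least-once delivery). Consumers are expected
    to deduplicate on event_id.
    """
    rows = conn.execute(
        "SELECT seq, event_id, payload FROM outbox ORDER BY seq"
    ).fetchall()
    for seq, event_id, payload in rows:
        publish(event_id, payload)
        with conn:
            conn.execute("DELETE FROM outbox WHERE seq = ?", (seq,))
    return len(rows)
```

Because the relay deletes a row only after a successful publish, the delivery guarantee is at-least-once, which is why the sink rules further down ask for idempotent writes.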

Bootstrap modes

  • Snapshot-only (one-time export).
  • Snapshot + streaming (initial backfill + ongoing).
  • Streaming-only (no snapshot; state can be rebuilt only as far back as the log is retained).
  • Chunked / parallel snapshot for huge tables.
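
One common way to chunk a snapshot is keyset pagination on the primary key. The sketch below assumes a hypothetical `items(id, name)` table with an integer key and reads it chunk by chunk. It deliberately leaves out how the snapshot is reconciled with changes arriving from the stream at the same time, which a real tool has to handle.

```python
import sqlite3
from typing import Iterator, List, Optional, Tuple


def snapshot_chunks(
    conn: sqlite3.Connection, chunk_size: int = 1000
) -> Iterator[List[Tuple]]:
    """Yield rows of the hypothetical `items` table in primary-key order.

    Each query resumes after the last key seen (keyset pagination), so the
    cost per chunk does not grow with the offset into the table.
    """
    last_id: Optional[int] = None
    while True:
        if last_id is None:
            cur = conn.execute(
                "SELECT id, name FROM items ORDER BY id LIMIT ?", (chunk_size,)
            )
        else:
            cur = conn.execute(
                "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, chunk_size),
            )
        rows = cur.fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume point for the next chunk
```

Disjoint key ranges could be handed to separate workers for a parallel snapshot; persisting `last_id` would let an interrupted snapshot resume instead of restarting.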

Sink design rules

  • Idempotent upserts (MERGE or ON CONFLICT).
  • Track last applied (lsn, op_seq) per partition.
  • Handle tombstones (deletes) → either physical delete or soft-flag.
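
The three rules above can be combined in one apply function. This is a sketch under stated assumptions: the event shape (`partition`, `lsn`, `op_seq`, `op`, `id`, `email`), the `customers` table, and the `applied_offsets` table are all hypothetical, and SQLite (3.24 or later, for `ON CONFLICT ... DO UPDATE`) stands in for the real sink.

```python
import sqlite3


def init_sink(conn: sqlite3.Connection) -> None:
    # Hypothetical sink schema: target table plus per-partition position.
    conn.executescript(
        """
        CREATE TABLE customers (
            id      INTEGER PRIMARY KEY,
            email   TEXT,
            deleted INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE applied_offsets (
            part   INTEGER PRIMARY KEY,
            lsn    INTEGER NOT NULL,
            op_seq INTEGER NOT NULL
        );
        """
    )


def apply_event(
    conn: sqlite3.Connection, event: dict, soft_delete: bool = False
) -> bool:
    """Apply one change event; return False if it was already applied."""
    pos = (event["lsn"], event["op_seq"])
    with conn:  # data change and position update commit together
        row = conn.execute(
            "SELECT lsn, op_seq FROM applied_offsets WHERE part = ?",
            (event["partition"],),
        ).fetchone()
        if row is not None and pos <= tuple(row):
            return False  # replayed event: skip it
        if event["op"] == "delete":  # tombstone
            if soft_delete:
                conn.execute(
                    "UPDATE customers SET deleted = 1 WHERE id = ?",
                    (event["id"],),
                )
            else:
                conn.execute(
                    "DELETE FROM customers WHERE id = ?", (event["id"],)
                )
        else:
            # Idempotent upsert: inserting or re-applying gives the same row.
            conn.execute(
                "INSERT INTO customers (id, email, deleted) VALUES (?, ?, 0) "
                "ON CONFLICT(id) DO UPDATE SET "
                "email = excluded.email, deleted = 0",
                (event["id"], event["email"]),
            )
        conn.execute(
            "INSERT INTO applied_offsets (part, lsn, op_seq) VALUES (?, ?, ?) "
            "ON CONFLICT(part) DO UPDATE SET "
            "lsn = excluded.lsn, op_seq = excluded.op_seq",
            (event["partition"], pos[0], pos[1]),
        )
    return True
```

Storing the position in the same transaction as the data change is the design choice that matters here: after a crash the sink can replay from an earlier point and the position check discards anything already applied.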

Watch out

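  • Postgres replication slot retention: a stalled or abandoned slot pins WAL → disk full risk. Set max_slot_wal_keep_size (PostgreSQL 13+) to cap it; a slot that falls behind the cap is invalidated and its consumer must re-snapshot.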
  • MySQL binlog format: must be ROW (not STATEMENT) for usable CDC.
  • DDL events: some tools (e.g. Debezium) record schema changes in a schema history topic; coordinate schema changes with downstream consumers.
  • PII redaction at capture (column filter) if downstream is less trusted.
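
For the PII point above, a capture-side column filter might look like the sketch below. The policy tables `DROP_COLUMNS` and `HASH_COLUMNS`, the table and column names, and the `key` parameter are all assumptions made for illustration; real CDC tools usually expose this as connector configuration rather than application code.

```python
import hashlib
import hmac

# Hypothetical policy: table name -> columns to drop entirely or to
# replace with a keyed hash (still joinable downstream, not readable).
DROP_COLUMNS = {"customers": {"ssn"}}
HASH_COLUMNS = {"customers": {"email"}}


def redact(table: str, row: dict, key: bytes) -> dict:
    """Return a copy of `row` with sensitive columns dropped or hashed."""
    drop = DROP_COLUMNS.get(table, set())
    hashed = HASH_COLUMNS.get(table, set())
    out = {}
    for col, value in row.items():
        if col in drop:
            continue  # never leaves the capture side
        if col in hashed and value is not None:
            # Keyed HMAC rather than a bare hash, so values cannot be
            # recovered by hashing guessed inputs without the key.
            value = hmac.new(
                key, str(value).encode("utf-8"), hashlib.sha256
            ).hexdigest()
        out[col] = value
    return out
```

Applying the filter before the event is published means the sensitive values never reach the broker, so a less trusted downstream system cannot recover them from topic retention.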

Refs

  • Debezium docs and connector reference.
  • Martin Kleppmann: "Turning the database inside out", "Online event processing".
  • Confluent blog series on CDC.
  • AWS DMS, GCP Datastream docs.