Skip to content

Change Data Capture — Detailed#

flowchart TB
  subgraph Sources[Source DBs]
    PG[(Postgres - logical WAL,<br/>wal2json / pgoutput)]
    MY[(MySQL - binlog ROW)]
    MG[(MongoDB - oplog / change streams)]
    DY[(DynamoDB Streams)]
    OR[(Oracle - LogMiner / GoldenGate)]
    SQ[(SQL Server CDC tables)]
  end

  subgraph Capture[Capture]
    DEB([Debezium connectors])
    DMS[AWS DMS / GCP Datastream]
    MAX[Maxwell / Canal]
    SNAP[Initial snapshot + ongoing tail]
    PUB[(Replication slot / publication)]
  end

  subgraph Bus[Streaming Bus]
    KAF[[Kafka topics<br/>one per table]]
    SR[Schema Registry<br/>Avro / Protobuf]
    COMPACT[Log compaction<br/>by PK]
  end

  subgraph Sinks[Sinks / consumers]
    ES[(Search index<br/>Elasticsearch)]
    CACHE[Cache invalidator]
    DWH[(Snowflake / BigQuery / Redshift)]
    LAKE[(Data lake / Iceberg / Hudi)]
    ML[Feature store]
    AUDIT[Audit immutable log]
    INV[Inverse: SCD2 dim]
    OUT[[Outbox-based event publisher]]
  end

  subgraph Patterns
    OUTB[[Outbox pattern<br/>transactional event emission]]
    DEDU[Dedup at sink via offset/LSN]
    BACK([Backfill bootstrap])
    EVOL[Schema evolution]
    DLQ[[DLQ on transform errors]]
    SAGA[Sagas powered by CDC]
  end

  PG --> DEB
  MY --> DEB
  MG --> DEB
  DY --> DMS
  OR --> DMS
  SQ --> MAX
  DEB --> KAF
  DMS --> KAF
  KAF --> SR
  KAF --> COMPACT
  KAF --> ES
  KAF --> CACHE
  KAF --> DWH
  KAF --> LAKE
  KAF --> ML
  KAF --> AUDIT
  OUTB --- DEB
  SNAP --> KAF
  DLQ --- DEB

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class DMS,MAX,SNAP,SR,COMPACT,CACHE,INV,DEDU,EVOL,SAGA service;
    class PG,MY,MG,DY,OR,SQ,PUB,ES,DWH,LAKE,ML,AUDIT datastore;
    class KAF,OUT,OUTB,DLQ queue;
    class DEB,BACK compute;

CDC mechanics#

  • Log-based: tail the DB's WAL/binlog → low overhead, captures every change.
  • Trigger-based: row triggers write to shadow table → higher overhead, simpler.
  • Polling: timestamp/CDC column scan → loses deletes, high lag.

Snapshot + tail#

  • On first connect, snapshot table(s) to topic with __op=r (read).
  • Then attach to log position; events arrive as c/u/d (create/update/delete).

Ordering & exactly-once#

  • Per-table topic → ordered by PK with partition.key = pk.
  • Sinks track offset or (lsn, op) for idempotent upserts.
  • Schema registry enforces backward-compat changes.

Real uses#

  • Cache invalidation: bust Redis key on row change.
  • Search index sync: ES via Debezium sink.
  • Analytics: Postgres → Kafka → Snowflake/BigQuery (near-real-time).
  • Service decomposition: extract a microservice consuming legacy DB CDC.
  • Audit log: append-only journal.

Pitfalls#

  • Schema changes (DDL) need handling; some connectors halt on ALTER.
  • Initial snapshot of huge tables — use chunked snapshot mode.
  • Replication slots in Postgres can pin WAL and fill disk if consumers lag.
  • Connector failure semantics — Debezium "at-least-once" with offsets; design sinks for replay.

Glossary & fundamentals#

Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.

Tag Concept What it is Page
HLD Cache strategies cache-aside, read/write-through, eviction caching-strategies
HLD Pub/Sub & message brokers topics, consumer groups, delivery semantics pub-sub-pattern
HLD Leader/follower replication sync/semi-sync/async replication, failover replication-leader-follower
HLD LSM vs B-Tree engines WAL, memtable, SSTables, compaction storage-engines-lsm-btree
HLD Distributed transactions 2PC, TCC, sagas, outbox/inbox distributed-transactions
HLD Change Data Capture WAL/binlog tailing, outbox publishing change-data-capture
LLD Testing strategy pyramid, doubles, TDD, contracts testing-strategy
LLD Immutability immutable types, persistent collections immutability