Skip to content

Key-Value Store — Detailed (Dynamo / Cassandra style)#

flowchart TB
  subgraph Clients
    CL([App with smart driver])
  end

  subgraph Coordination
    GOSS[Gossip - SWIM]
    RING[Consistent hash ring<br/>tokens / vnodes]
    SEED[Seed nodes]
    REG[(Cluster meta)]
  end

  subgraph Node[Storage Node x N]
    direction TB
    COO[Coordinator]
    PART[Partitioner<br/>hash key]
    REP[Replicator<br/>N replicas, R/W quorum]
    MEM[Memtable<br/>sorted in-memory]
    WAL[[Commit log / WAL]]
    SS[(SSTable on disk<br/>LSM-tree)]
    COMP[Compaction<br/>size-tiered / leveled]
    BF[Bloom filter per SSTable]
    SUM[Index summary]
    HH[Hinted handoff]
    AE[Anti-entropy<br/>Merkle tree repair]
    RR[Read repair]
  end

  subgraph Consistency[Consistency Levels]
    ONE[ONE]
    QUO[QUORUM<br/>R+W>N]
    ALL[ALL]
    LWT[Lightweight Tx<br/>Paxos / EPaxos]
    CRDT2[CRDTs - opt]
  end

  subgraph Ops[Ops & Tooling]
    SNAP[Snapshots / Backups]
    REPAIR([Scheduled repair])
    BAL[Token rebalancer]
    MET[Metrics: read latency,<br/>pending compactions]
  end

  CL --> COO
  PART --> RING
  COO --> PART
  COO --> REP --> Node
  Node --> WAL --> MEM
  MEM -. flush .-> SS
  SS --> BF --> SUM
  SS --> COMP
  COO --> HH
  HH -.replay.-> Node
  AE --- Node
  RR --- COO
  GOSS --> RING
  SEED --> GOSS
  Consistency --- COO
  Ops --- Node

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class CL client;
    class GOSS,RING,SEED,COO,PART,REP,COMP,SUM,HH,AE,RR,ONE,QUO,ALL,LWT,CRDT2,SNAP,BAL service;
    class REG,MEM,SS,BF datastore;
    class WAL queue;
    class REPAIR compute;
    class MET obs;

Write path#

  1. Client → coordinator node (any peer).
  2. Coordinator hashes key → owners on ring, picks N replicas.
  3. Forwards write; each replica appends to WAL, applies to memtable.
  4. Acknowledged when W replicas confirm.
  5. Memtable flushes to SSTable; background compaction merges.

Read path#

  1. Client → coordinator.
  2. Coordinator queries R replicas in parallel.
  3. Compares versions (vector clock / timestamp), returns latest.
  4. On divergence → read repair.

LSM internals#

  • Memtable = skiplist / red-black tree.
  • SSTables immutable, sorted by key.
  • Bloom filter avoids disk reads for missing keys.
  • Compaction styles: size-tiered (Cassandra default), leveled (RocksDB / LevelDB).

Anti-entropy#

  • Merkle tree per range — replicas compare hashes, repair diffs.
  • Hinted handoff buffers writes destined for a down replica.

Failure handling#

  • Node down: gossip propagates "DN" state in seconds.
  • New node: streams ranges from peers; reads from old + new during bootstrap.
  • Quorum survives (N-1)/2 failures.

Glossary & fundamentals#

Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.

Tag Concept What it is Page
HLD Consistent hashing key placement with minimal remap consistent-hashing
HLD Raft / Paxos consensus replicated state machine via majority quorum consensus-raft-paxos
HLD Leader/follower replication sync/semi-sync/async replication, failover replication-leader-follower
HLD LSM vs B-Tree engines WAL, memtable, SSTables, compaction storage-engines-lsm-btree
HLD Probabilistic data structures Bloom, HLL, Count-Min, MinHash, t-digest probabilistic-data-structures
HLD Logical clocks Lamport, vector clocks, HLC, TrueTime logical-clocks
HLD Observability metrics, logs, traces, SLOs observability
LLD Testing strategy pyramid, doubles, TDD, contracts testing-strategy
LLD Immutability immutable types, persistent collections immutability