Key-Value Store — Detailed (Dynamo / Cassandra style)#
flowchart TB
subgraph Clients
CL([App with smart driver])
end
subgraph Coordination
GOSS[Gossip - SWIM]
RING[Consistent hash ring<br/>tokens / vnodes]
SEED[Seed nodes]
REG[(Cluster meta)]
end
subgraph Node[Storage Node x N]
direction TB
COO[Coordinator]
PART[Partitioner<br/>hash key]
REP[Replicator<br/>N replicas, R/W quorum]
MEM[Memtable<br/>sorted in-memory]
WAL[[Commit log / WAL]]
SS[(SSTable on disk<br/>LSM-tree)]
COMP[Compaction<br/>size-tiered / leveled]
BF[Bloom filter per SSTable]
SUM[Index summary]
HH[Hinted handoff]
AE[Anti-entropy<br/>Merkle tree repair]
RR[Read repair]
end
subgraph Consistency[Consistency Levels]
ONE[ONE]
QUO[QUORUM<br/>R+W>N]
ALL[ALL]
LWT[Lightweight Tx<br/>Paxos / EPaxos]
CRDT2[CRDTs - opt]
end
subgraph Ops[Ops & Tooling]
SNAP[Snapshots / Backups]
REPAIR([Scheduled repair])
BAL[Token rebalancer]
MET[Metrics: read latency,<br/>pending compactions]
end
CL --> COO
PART --> RING
COO --> PART
COO --> REP --> Node
Node --> WAL --> MEM
MEM -. flush .-> SS
SS --> BF --> SUM
SS --> COMP
COO --> HH
HH -.replay.-> Node
AE --- Node
RR --- COO
GOSS --> RING
SEED --> GOSS
Consistency --- COO
Ops --- Node
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class CL client;
class GOSS,RING,SEED,COO,PART,REP,COMP,SUM,HH,AE,RR,ONE,QUO,ALL,LWT,CRDT2,SNAP,BAL service;
class REG,MEM,SS,BF datastore;
class WAL queue;
class REPAIR compute;
class MET obs;
Write path#
- Client → coordinator node (any peer).
- Coordinator hashes key → owners on ring, picks N replicas.
- Forwards write; each replica appends to WAL, applies to memtable.
- Acknowledged when W replicas confirm.
- Memtable flushes to SSTable; background compaction merges.
Read path#
- Client → coordinator.
- Coordinator queries R replicas in parallel.
- Compares versions (vector clock / timestamp), returns latest.
- On divergence → read repair.
LSM internals#
- Memtable = skiplist / red-black tree.
- SSTables immutable, sorted by key.
- Bloom filter avoids disk reads for missing keys.
- Compaction styles: size-tiered (Cassandra default), leveled (RocksDB / LevelDB).
Anti-entropy#
- Merkle tree per range — replicas compare hashes, repair diffs.
- Hinted handoff buffers writes destined for a down replica.
Failure handling#
- Node down: gossip propagates "DN" state in seconds.
- New node: streams ranges from peers; reads from old + new during bootstrap.
- Quorum survives
(N-1)/2failures.
Glossary & fundamentals#
Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.
| Tag | Concept | What it is | Page |
|---|---|---|---|
HLD |
Consistent hashing | key placement with minimal remap | consistent-hashing |
HLD |
Raft / Paxos consensus | replicated state machine via majority quorum | consensus-raft-paxos |
HLD |
Leader/follower replication | sync/semi-sync/async replication, failover | replication-leader-follower |
HLD |
LSM vs B-Tree engines | WAL, memtable, SSTables, compaction | storage-engines-lsm-btree |
HLD |
Probabilistic data structures | Bloom, HLL, Count-Min, MinHash, t-digest | probabilistic-data-structures |
HLD |
Logical clocks | Lamport, vector clocks, HLC, TrueTime | logical-clocks |
HLD |
Observability | metrics, logs, traces, SLOs | observability |
LLD |
Testing strategy | pyramid, doubles, TDD, contracts | testing-strategy |
LLD |
Immutability | immutable types, persistent collections | immutability |