HLD
Storage Engines — Detailed
flowchart TB
subgraph Common[Common building blocks]
WAL[Write-Ahead Log<br/>group commit, fsync]
PC[Page / Block Cache<br/>buffer pool]
CKP[Checkpoint / fsync interval]
REC[Crash recovery<br/>replay WAL]
end
subgraph BTree[B+ Tree engine - InnoDB, Postgres heap+btree, BoltDB]
direction TB
BTROOT[Root node]
BTINT[Internal nodes]
BTLEAF[Leaf pages<br/>sorted, linked]
BTINS[Insert: locate leaf,<br/>split if full O log n]
BTGET[Point read O log n,<br/>1-2 disk hops with cache]
BTSCAN[Range scan via leaf links]
BTFREE[Free list / FSM]
end
subgraph LSM[LSM Tree engine - RocksDB, Cassandra, LevelDB, ScyllaDB]
direction TB
LMEM[Memtable<br/>skiplist / rbtree]
IMMU[Immutable memtable]
L0[L0 SSTables<br/>overlap possible]
L1[L1 SSTables<br/>non-overlapping]
LN[L2...Ln<br/>10x size each]
BLOOM[Bloom filter per SSTable<br/>avoid disk for missing keys]
INDEX[Sparse index + summary]
COMPS[Compaction strategies<br/>size-tiered / leveled / universal]
TS[Tombstones for deletes]
SNAP[Snapshot via seqno]
end
subgraph Tradeoffs[Read vs Write amplification]
RA[Read amp: levels touched per read]
WA[Write amp: bytes rewritten / bytes written]
SA[Space amp: live / stored]
end
WAL --> LMEM
LMEM -. seal .-> IMMU -. flush .-> L0
COMPS --> L0
COMPS --> L1
COMPS --> LN
BLOOM --- L0
BLOOM --- L1
INDEX --- L1
TS --- COMPS
WAL --- BTree
BTROOT --> BTINT --> BTLEAF
PC --- BTree
PC --- LSM
REC --- WAL
Tradeoffs --- LSM
Tradeoffs --- BTree
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class WAL,CKP,REC,BTROOT,BTINT,BTLEAF,BTINS,BTGET,BTSCAN,BTFREE,LN,INDEX,COMPS,TS,SNAP,RA,WA,SA service;
class LMEM,IMMU,L0,L1,BLOOM datastore;
class PC cache;
When each wins
Property
B+ Tree
LSM
Write throughput
medium (in-place page writes)
high (sequential append)
Read latency
low, predictable
higher tail (multi-level lookup)
Range scan
very fast
fast (with leveled)
Write amp
low
high (compaction)
Space amp
low
higher (multiple copies during compaction)
Concurrency
latch-coupling complex
simpler (immutable SSTables)
Best for
OLTP, mixed
write-heavy, log-like, embedded KV
Compaction strategies (LSM)
Size-tiered : merge same-sized SSTables → low write amp, high space amp.
Leveled : bounded levels, non-overlapping per level → low space amp, high write amp.
Universal / hybrid : tune knobs for workload.
WAL & durability
Group commit batches fsyncs; trade latency for throughput.
sync vs async mode (Postgres fsync=off → fast & dangerous).
Read path (LSM)
Check memtable.
Check immutable memtable.
For each level: bloom filter → SSTable index → block cache → disk.
Merge values, dedupe by seqno, return latest.
Glossary & fundamentals
Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.
Tag
Concept
What it is
Page
HLD
LSM vs B-Tree engines
WAL, memtable, SSTables, compaction
storage-engines-lsm-btree
HLD
Probabilistic data structures
Bloom, HLL, Count-Min, MinHash, t-digest
probabilistic-data-structures
LLD
State machines
FSM, HSM, transitions, guards
state-machines
LLD
Testing strategy
pyramid, doubles, TDD, contracts
testing-strategy
LLD
Immutability
immutable types, persistent collections
immutability