Skip to content

Storage Engines — Detailed#

flowchart TB
  subgraph Common[Common building blocks]
    WAL[Write-Ahead Log<br/>group commit, fsync]
    PC[Page / Block Cache<br/>buffer pool]
    CKP[Checkpoint / fsync interval]
    REC[Crash recovery<br/>replay WAL]
  end

  subgraph BTree[B+ Tree engine - InnoDB, Postgres heap+btree, BoltDB]
    direction TB
    BTROOT[Root node]
    BTINT[Internal nodes]
    BTLEAF[Leaf pages<br/>sorted, linked]
    BTINS[Insert: locate leaf,<br/>split if full O log n]
    BTGET[Point read O log n,<br/>1-2 disk hops with cache]
    BTSCAN[Range scan via leaf links]
    BTFREE[Free list / FSM]
  end

  subgraph LSM[LSM Tree engine - RocksDB, Cassandra, LevelDB, ScyllaDB]
    direction TB
    LMEM[Memtable<br/>skiplist / rbtree]
    IMMU[Immutable memtable]
    L0[L0 SSTables<br/>overlap possible]
    L1[L1 SSTables<br/>non-overlapping]
    LN[L2...Ln<br/>10x size each]
    BLOOM[Bloom filter per SSTable<br/>avoid disk for missing keys]
    INDEX[Sparse index + summary]
    COMPS[Compaction strategies<br/>size-tiered / leveled / universal]
    TS[Tombstones for deletes]
    SNAP[Snapshot via seqno]
  end

  subgraph Tradeoffs[Read vs Write amplification]
    RA[Read amp: levels touched per read]
    WA[Write amp: bytes rewritten / bytes written]
    SA[Space amp: live / stored]
  end

  WAL --> LMEM
  LMEM -. seal .-> IMMU -. flush .-> L0
  COMPS --> L0
  COMPS --> L1
  COMPS --> LN
  BLOOM --- L0
  BLOOM --- L1
  INDEX --- L1
  TS --- COMPS
  WAL --- BTree
  BTROOT --> BTINT --> BTLEAF
  PC --- BTree
  PC --- LSM
  REC --- WAL
  Tradeoffs --- LSM
  Tradeoffs --- BTree

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class WAL,CKP,REC,BTROOT,BTINT,BTLEAF,BTINS,BTGET,BTSCAN,BTFREE,LN,INDEX,COMPS,TS,SNAP,RA,WA,SA service;
    class LMEM,IMMU,L0,L1,BLOOM datastore;
    class PC cache;

When each wins#

Property B+ Tree LSM
Write throughput medium (in-place page writes) high (sequential append)
Read latency low, predictable higher tail (multi-level lookup)
Range scan very fast fast (with leveled)
Write amp low high (compaction)
Space amp low higher (multiple copies during compaction)
Concurrency latch-coupling complex simpler (immutable SSTables)
Best for OLTP, mixed write-heavy, log-like, embedded KV

Compaction strategies (LSM)#

  • Size-tiered: merge same-sized SSTables → low write amp, high space amp.
  • Leveled: bounded levels, non-overlapping per level → low space amp, high write amp.
  • Universal / hybrid: tune knobs for workload.

WAL & durability#

  • Group commit batches fsyncs; trade latency for throughput.
  • sync vs async mode (Postgres fsync=off → fast & dangerous).

Read path (LSM)#

  1. Check memtable.
  2. Check immutable memtable.
  3. For each level: bloom filter → SSTable index → block cache → disk.
  4. Merge values, dedupe by seqno, return latest.

Glossary & fundamentals#

Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.

Tag Concept What it is Page
HLD LSM vs B-Tree engines WAL, memtable, SSTables, compaction storage-engines-lsm-btree
HLD Probabilistic data structures Bloom, HLL, Count-Min, MinHash, t-digest probabilistic-data-structures
LLD State machines FSM, HSM, transitions, guards state-machines
LLD Testing strategy pyramid, doubles, TDD, contracts testing-strategy
LLD Immutability immutable types, persistent collections immutability