Skip to content

Storage Engines — Notes#

The fundamental trade-off#

Disk loves sequential IO, hates random. B-trees do in-place updates (random IO on writes); LSMs sequentialize writes via append + later compaction.

Engines in the wild#

Engine Model Used in
InnoDB B+ tree MySQL
Postgres heap + btree heap files + B-tree index Postgres
BoltDB / LMDB mmap'd B+ tree etcd v3, Mongo legacy
RocksDB / LevelDB LSM TiKV, CockroachDB, Kafka Streams, MyRocks
WiredTiger hybrid (B-tree default, LSM opt) MongoDB
Cassandra LSM (custom) Cassandra
ScyllaDB LSM (Seastar) Scylla

Tunable knobs#

  • Block size (4 KB–64 KB).
  • Compression (Snappy, Zstd, LZ4) — trades CPU for IO.
  • Cache size — biggest single performance lever.
  • Bloom filter bits/key (10 = 1% FPR typical).
  • Compaction threads / IO scheduler.

Failures & recovery#

  • WAL fsync absolutely required for ACID-D.
  • On crash: replay WAL into memtable; redo / undo for B-tree.
  • Torn writes → use checksums per page; double-write buffer (InnoDB).

When to choose#

  • Heavy random write, append-mostly, time-series → LSM.
  • Mixed OLTP with strong range and update → B-tree.
  • Tiny embedded need with mmap → LMDB.

Refs#

  • Designing Data-Intensive Applications ch.3.
  • "The Log-Structured Merge-Tree" (O'Neil et al. 1996).
  • RocksDB wiki, ScyllaDB blog series on LSM, Postgres docs.