Storage Engines — Notes#
The fundamental trade-off#
Disk loves sequential IO, hates random. B-trees do in-place updates (random IO on writes); LSMs sequentialize writes via append + later compaction.
Engines in the wild#
| Engine | Model | Used in |
|---|---|---|
| InnoDB | B+ tree | MySQL |
| Postgres heap + btree | heap files + B-tree index | Postgres |
| BoltDB / LMDB | mmap'd B+ tree | etcd v3, Mongo legacy |
| RocksDB / LevelDB | LSM | TiKV, CockroachDB, Kafka Streams, MyRocks |
| WiredTiger | hybrid (B-tree default, LSM opt) | MongoDB |
| Cassandra | LSM (custom) | Cassandra |
| ScyllaDB | LSM (Seastar) | Scylla |
Tunable knobs#
- Block size (4 KB–64 KB).
- Compression (Snappy, Zstd, LZ4) — trades CPU for IO.
- Cache size — biggest single performance lever.
- Bloom filter bits/key (10 = 1% FPR typical).
- Compaction threads / IO scheduler.
Failures & recovery#
- WAL fsync absolutely required for ACID-D.
- On crash: replay WAL into memtable; redo / undo for B-tree.
- Torn writes → use checksums per page; double-write buffer (InnoDB).
When to choose#
- Heavy random write, append-mostly, time-series → LSM.
- Mixed OLTP with strong range and update → B-tree.
- Tiny embedded need with mmap → LMDB.
Refs#
- Designing Data-Intensive Applications ch.3.
- "The Log-Structured Merge-Tree" (O'Neil et al. 1996).
- RocksDB wiki, ScyllaDB blog series on LSM, Postgres docs.