Distributed Key-Value Store — Notes
Goals (Dynamo paper)
- Always writable (AP).
- Symmetric, decentralized nodes.
- Incremental scalability.
- Heterogeneous hardware support.
Data model
- key → opaque blob (or column family).
- Sometimes secondary indexes (materialized views in Cassandra).
Quorum tuning (N, R, W)
- N = replicas, R = read consensus, W = write consensus.
R + W > N → linearizability-like guarantee for last-writer wins.
- Common: N=3, R=W=2 → strong-ish, tolerate 1 down.
- Lower W=1 for fast writes (less durable).
Conflict resolution
- Last-writer-wins (server clock) — easy, can lose data.
- Vector clocks (Dynamo) — keep siblings, app reconciles.
- CRDT counters / sets — automatic merge.
Capacity
- 1 PB across 100 nodes ≈ 10 TB/node.
- 100k ops/s/node common with SSD.
- Compaction can double IO; size accordingly.
Trade-offs
- Tunable consistency is great but operationally complex.
- Wide rows in Cassandra → range scans cheap but hotspot risk.
- TTL columns → tombstones → compaction pressure.
- LWTs (Paxos) much slower than normal write.
Refs
- Dynamo paper (SOSP '07), Cassandra design doc, ScyllaDB blog,
Riak vs Cassandra comparisons, RocksDB tuning guide.