Consensus — Notes#
Problem#
Replicate a state machine across N nodes so all apply the same ordered commands, even with crashes & message loss. Tolerate f failures with 2f+1 nodes.
Safety properties#
- Election safety: at most one leader per term.
- Log matching: if two logs share a
(term,index), all prior entries match. - Leader completeness: a committed entry is in every future leader's log.
- State machine safety: never applies different commands at same index.
Liveness#
- Requires majority of nodes reachable.
- Randomized election timeout prevents split votes.
- Pre-vote phase avoids term inflation by partitioned nodes.
Optimizations#
- Batching + pipelining AppendEntries.
- Read-index / lease reads: leader serves reads without log append by confirming it's still leader.
- Learner / non-voting members to scale reads or join slowly.
- Witness / co-located: 2 voting + 1 witness for cheap quorum.
Common pitfalls#
- Linearizable reads through stale leader — fix with read-index.
- Disk fsync absolutely required for
currentTerm/votedFor. - Snapshot install must not regress committed entries.
Refs#
- Diego Ongaro, John Ousterhout: "In Search of an Understandable Consensus Algorithm" (Raft, USENIX ATC '14).
- Lamport: "The Part-Time Parliament" (1998); "Paxos Made Simple" (2001).
- Howard et al.: "Flexible Paxos: Quorum intersection revisited".
- Heidi Howard's PhD thesis; Jepsen analyses of etcd/Consul.