Skip to content

Consensus — Notes#

Problem#

Replicate a state machine across N nodes so all apply the same ordered commands, even with crashes & message loss. Tolerate f failures with 2f+1 nodes.

Safety properties#

  • Election safety: at most one leader per term.
  • Log matching: if two logs share a (term,index), all prior entries match.
  • Leader completeness: a committed entry is in every future leader's log.
  • State machine safety: never applies different commands at same index.

Liveness#

  • Requires majority of nodes reachable.
  • Randomized election timeout prevents split votes.
  • Pre-vote phase avoids term inflation by partitioned nodes.

Optimizations#

  • Batching + pipelining AppendEntries.
  • Read-index / lease reads: leader serves reads without log append by confirming it's still leader.
  • Learner / non-voting members to scale reads or join slowly.
  • Witness / co-located: 2 voting + 1 witness for cheap quorum.

Common pitfalls#

  • Linearizable reads through stale leader — fix with read-index.
  • Disk fsync absolutely required for currentTerm/votedFor.
  • Snapshot install must not regress committed entries.

Refs#

  • Diego Ongaro, John Ousterhout: "In Search of an Understandable Consensus Algorithm" (Raft, USENIX ATC '14).
  • Lamport: "The Part-Time Parliament" (1998); "Paxos Made Simple" (2001).
  • Howard et al.: "Flexible Paxos: Quorum intersection revisited".
  • Heidi Howard's PhD thesis; Jepsen analyses of etcd/Consul.