Consensus: Raft / Paxos - Detailed
```mermaid
flowchart TB
    subgraph Roles
        LR[Leader]
        FOL[Followers]
        CAND[Candidate]
    end
    subgraph Phases[Raft Phases]
        LE[Leader Election<br/>RequestVote RPC]
        LR2[Log Replication<br/>AppendEntries RPC]
        SAFE[Safety: Election restriction +<br/>commit only current-term entries]
        SNAP[Log Compaction /<br/>Snapshots]
        CFG[Membership Change<br/>joint consensus]
    end
    subgraph State[Persistent State]
        PT[currentTerm]
        PV[votedFor]
        LOG[(log entries<br/>term, index, cmd)]
    end
    subgraph Replication[Replication]
        Q[Majority quorum]
        COMMIT[Commit Index]
        APPLY[Apply to State Machine]
    end
    subgraph Variants[Variants]
        BP[Basic Paxos]
        MP[Multi-Paxos]
        EPX[EPaxos - leaderless]
        FPX[Fast Paxos]
        ZB[Zab - ZooKeeper]
        VR[Viewstamped Replication]
    end
    subgraph Failures
        SP[Split vote]
        NP[Network partition]
        DL[Delayed RPC]
    end
    Client[Client] --> LR
    LR --> LR2
    LR2 --> FOL
    FOL -.ack.-> LR
    LR --> Q --> COMMIT --> APPLY
    LR -. heartbeat .-> FOL
    FOL -. timeout .-> CAND --> LE --> LR
    LR --> SNAP
    Variants --- LE
    Failures --- LE
    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class Client client;
    class LR,FOL,CAND,LE,SNAP,CFG,PT,PV,Q,COMMIT,APPLY,BP,MP,EPX,FPX,ZB,VR,SP,NP,DL service;
    class LR2,SAFE,LOG datastore;
```
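The roles, persistent state, and timeout/election edges in the diagram can be sketched as a small in-memory state holder. This is a minimal sketch without networking or storage: `Node`, `Role`, `LogEntry`, and the `on_*` method names are illustrative assumptions rather than any library's API, and a real implementation would write `current_term`, `voted_for`, and the log to stable storage before answering an RPC.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


@dataclass
class LogEntry:
    term: int
    index: int
    cmd: str


@dataclass
class Node:
    node_id: str
    cluster_size: int
    # Persistent state from the diagram; must survive restarts.
    current_term: int = 0
    voted_for: Optional[str] = None
    log: List[LogEntry] = field(default_factory=list)
    # Volatile state (hypothetical field names).
    role: Role = Role.FOLLOWER
    votes: int = 0

    def on_election_timeout(self) -> None:
        """No heartbeat arrived in time: start a new election."""
        if self.role is Role.LEADER:
            return  # a leader does not run an election against itself
        self.current_term += 1
        self.role = Role.CANDIDATE
        self.voted_for = self.node_id
        self.votes = 1  # the candidate votes for itself

    def on_vote_granted(self) -> None:
        """Count a granted RequestVote reply; a majority wins the term."""
        if self.role is not Role.CANDIDATE:
            return
        self.votes += 1
        if self.votes > self.cluster_size // 2:
            self.role = Role.LEADER

    def on_higher_term(self, term: int) -> None:
        """Any message carrying a newer term forces a step-down."""
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.role = Role.FOLLOWER
            self.votes = 0
```

In a 3-node cluster, one timeout followed by one granted vote is enough to become leader, and seeing a higher term in any later message returns the node to follower.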
Raft cheat sheet
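- Term: monotonically increasing logical clock; at most one leader per term (a split vote can leave a term with none).
- RequestVote: candidate asks for votes; a voter grants its vote only if it has not already voted for someone else in that term and the candidate's log is at least as up-to-date as its own (compare last log term, then last log index).
- AppendEntries: leader replicates entries; serves as heartbeat when empty.
- Commit rule: an entry from the leader's current term is committed once it is stored on a majority. Entries from earlier terms are never committed by counting replicas; they become committed indirectly when a later current-term entry commits.
- Membership change: joint consensus. The cluster moves through the joint configuration C_old,new and then to C_new; while C_old,new is in effect, elections and commits need separate majorities of both the old and the new configuration.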
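The vote, commit, and joint-configuration rules in this list can be sketched as three pure functions. This is a sketch under stated assumptions, not a complete Raft implementation: all function and parameter names are hypothetical, log indexes are 1-based, and `match_index` is assumed to hold the highest replicated index per follower, with the leader counted implicitly.

```python
from typing import Dict, List, Optional, Set


def log_is_up_to_date(cand_last_term: int, cand_last_index: int,
                      my_last_term: int, my_last_index: int) -> bool:
    """Election restriction: compare last terms first, then last indexes."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index


def should_grant_vote(req_term: int, candidate_id: str,
                      cand_last_term: int, cand_last_index: int,
                      current_term: int, voted_for: Optional[str],
                      my_last_term: int, my_last_index: int) -> bool:
    if req_term < current_term:
        return False  # stale candidate
    if req_term == current_term and voted_for not in (None, candidate_id):
        return False  # already voted for someone else in this term
    return log_is_up_to_date(cand_last_term, cand_last_index,
                             my_last_term, my_last_index)


def advance_commit_index(log_terms: List[int], match_index: Dict[str, int],
                         commit_index: int, current_term: int) -> int:
    """Return the new commit index for a leader.

    log_terms[i - 1] is the term of the leader's entry at index i.
    Only entries from current_term are committed by counting replicas.
    """
    cluster_size = len(match_index) + 1  # followers plus the leader
    for n in range(len(log_terms), commit_index, -1):
        if log_terms[n - 1] != current_term:
            break  # terms never decrease along a log, so older entries follow
        replicas = 1 + sum(1 for m in match_index.values() if m >= n)
        if replicas > cluster_size // 2:
            return n
    return commit_index


def joint_quorum_reached(acks: Set[str], old: Set[str], new: Set[str]) -> bool:
    """Under C_old,new a decision needs a majority of each configuration."""
    def majority(members: Set[str]) -> bool:
        return len(acks & members) > len(members) // 2
    return majority(old) and majority(new)
```

For example, with `log_terms=[1, 1, 2]` and `current_term=2` in a 5-node cluster, index 2 stays uncommitted even when three nodes hold it, because it belongs to term 1; once index 3 reaches a majority, the commit index jumps to 3 and covers the earlier entries.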
Paxos vs Raft
- Paxos: the classical protocol; hard to implement correctly; decouples roles (proposer/acceptor/learner).
- Multi-Paxos: Paxos run per log slot under a stable elected leader, which makes it roughly equivalent to Raft.
- Raft simplifies via strong leader and contiguous log.
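To make the decoupled roles concrete, here is a minimal sketch of the acceptor side of single-decree (Basic) Paxos, plus the rule a proposer uses to pick its value after phase 1. The class and function names are illustrative assumptions, proposal numbers are assumed to be unique non-negative integers, and networking, persistence, and learners are left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                    # highest proposal number promised
    accepted_n: int = -1                  # number of the accepted proposal
    accepted_value: Optional[str] = None  # value of the accepted proposal

    def on_prepare(self, n: int) -> Tuple[bool, int, Optional[str]]:
        """Phase 1b: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, self.accepted_n, self.accepted_value

    def on_accept(self, n: int, value: str) -> bool:
        """Phase 2b: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def choose_value(promises: List[Tuple[int, Optional[str]]],
                 my_value: str) -> str:
    """Proposer rule: adopt the value of the highest-numbered proposal
    reported by the promising acceptors; otherwise use its own value."""
    best_n, best_value = max(promises, key=lambda p: p[0], default=(-1, None))
    if best_n >= 0 and best_value is not None:
        return best_value
    return my_value
```

Raft folds these roles into one server and replaces per-slot proposal numbers with terms and a contiguous log, which is the simplification the list above refers to.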
Performance
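- Latency: a write commits after one round trip from the leader to a majority. With 5 nodes the leader needs acks from 2 of its 4 followers, so commit latency tracks the slowest member of the fastest 3-node quorum (leader included), not the slowest node overall.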
- Throughput bound by leader fsync.
- Optimizations: pipelined AppendEntries, batched fsync, read leases / read-index.
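The latency point can be checked with a little arithmetic. The sketch below assumes the leader sends AppendEntries to all followers in parallel and ignores disk and processing time; `commit_latency_ms` is a hypothetical helper name.

```python
from typing import List


def commit_latency_ms(follower_rtts_ms: List[float]) -> float:
    """Commit latency as seen by the leader, given one RTT per follower.

    The leader counts itself toward the majority, so it waits only for
    the fastest (majority - 1) followers.
    """
    cluster_size = len(follower_rtts_ms) + 1
    majority = cluster_size // 2 + 1
    acks_needed = majority - 1
    if acks_needed == 0:
        return 0.0  # single-node cluster: no network round trip
    return sorted(follower_rtts_ms)[acks_needed - 1]
```

With follower round trips of 1, 2, 50, and 80 ms in a 5-node cluster, the function returns 2: the two slow followers do not affect commit latency as long as the two fast ones stay healthy.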
Where it's used
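- Raft: etcd, Consul, CockroachDB, TiKV, Kafka KRaft, RethinkDB; MongoDB replica sets use a Raft-like protocol.
- Paxos: Google Chubby, Spanner (Paxos per group).
- Aurora: quorum-based replication in its storage layer rather than a full Raft/Paxos log.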
Glossary & fundamentals
Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.
| Tag | Concept | What it is | Page |
|---|---|---|---|
| HLD | Pub/Sub & message brokers | topics, consumer groups, delivery semantics | pub-sub-pattern |
| HLD | Raft / Paxos consensus | replicated state machine via majority quorum | consensus-raft-paxos |
| HLD | Leader/follower replication | sync/semi-sync/async replication, failover | replication-leader-follower |
| HLD | LSM vs B-Tree engines | WAL, memtable, SSTables, compaction | storage-engines-lsm-btree |
| LLD | State machines | FSM, HSM, transitions, guards | state-machines |
| LLD | Testing strategy | pyramid, doubles, TDD, contracts | testing-strategy |
| LLD | Behavioural patterns | Strategy, Observer, State, Command, Chain | behavioral-patterns |