Distributed Consensus: From Paxos to Raft
How distributed systems agree on a single value — and why it's fundamentally hard.
The consensus problem is one of the most important challenges in distributed computing. Given a set of nodes that can fail independently, how do you get them to agree on a single value? This seemingly simple question has occupied computer scientists for decades.
Leslie Lamport introduced the Paxos algorithm in 1989 (published in 1998), and it became the gold standard for consensus. Paxos proceeds in two phases: in the first, a proposer selects a proposal number and sends a Prepare request to a majority of acceptors, which respond with Promises not to accept lower-numbered proposals; in the second, if the proposer has received promises from a majority, it sends an Accept request with the proposed value. The elegance of Paxos lies in its safety guarantees — it never produces inconsistent results — but its liveness depends on having a stable leader.
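The two phases can be sketched in a few dozen lines. This is a minimal, in-memory illustration of the core rules, not a full implementation: there is no networking, no proposal-number allocation scheme, and all the names (`Acceptor`, `propose`, and so on) are invented for this sketch.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1      # highest proposal number promised so far
        self.accepted_n = -1    # proposal number of the accepted value, if any
        self.accepted_v = None  # the accepted value itself

    def prepare(self, n):
        """Phase 1: promise not to accept proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject", None, None)

    def accept(self, n, v):
        """Phase 2: accept the value unless a higher number was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_v = v
            return "accepted"
        return "reject"


def propose(acceptors, n, value):
    """One proposer run: gather promises from a majority, then send Accepts."""
    majority = len(acceptors) // 2 + 1
    promises = [r for r in (a.prepare(n) for a in acceptors) if r[0] == "promise"]
    if len(promises) < majority:
        return None  # could not get a majority of promises
    # Safety rule: if any acceptor already accepted a value, the proposer
    # must adopt the one with the highest proposal number.
    prior = max(promises, key=lambda p: p[1])
    chosen = prior[2] if prior[1] >= 0 else value
    acks = sum(1 for a in acceptors if a.accept(n, chosen) == "accepted")
    return chosen if acks >= majority else None
```

The "adopt the highest-numbered accepted value" rule is what makes Paxos safe: a later proposer with a higher number cannot overwrite a value that a majority may already have chosen.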
The problem with Paxos is understandability. Lamport himself noted that many researchers found it difficult to grasp, and implementations diverged wildly. Google's Chubby lock service used Paxos internally, but the engineering team reported that the gap between the algorithm on paper and a production system was enormous.
Enter Raft, introduced by Diego Ongaro and John Ousterhout in 2014. Raft achieves the same safety properties as Paxos but decomposes the problem into three sub-problems: leader election, log replication, and safety. Each sub-problem is understandable in isolation, making the overall algorithm far more approachable.
In Raft, one node is elected leader and is responsible for managing the replicated log. Clients send all requests to the leader, which appends entries to its log and replicates them to followers. An entry is considered committed once a majority of nodes have written it. If the leader fails, a new election occurs — nodes time out, increment their term number, and request votes from peers.
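The leader's commit rule — an entry is committed once a majority of servers have stored it — reduces to a median-like computation over replication progress. A small sketch, using an invented `commit_index` helper (in real Raft the leader tracks a `matchIndex` per follower; Raft also only commits entries from the leader's current term this way, a subtlety omitted here):

```python
def commit_index(match_index, leader_last_index):
    """Highest log index stored on a majority of servers.

    match_index: highest replicated index on each follower;
    the leader's own log counts toward the majority too.
    """
    indices = sorted(match_index + [leader_last_index], reverse=True)
    majority = len(indices) // 2 + 1
    return indices[majority - 1]
```

With five servers, if the leader is at index 5 and followers are at 5, 5, 3, and 2, then three servers hold everything through index 5, so index 5 is committed even though two followers lag behind.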
The key insight of Raft's leader election is the use of randomized timeouts to avoid split votes. Each node picks a random election timeout between 150ms and 300ms. The first node to time out starts an election, and in practice, split votes are rare. This simple mechanism sidesteps the extra machinery — such as a distinguished proposer — that Paxos-based systems need to guarantee liveness.
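A toy simulation of one election round, assuming no network delay and no competing candidates (both simplifications; the function name and structure are invented for illustration):

```python
import random

ELECTION_TIMEOUT_MS = (150, 300)  # range suggested by the Raft paper

def run_election_round(node_ids, current_term):
    """Whichever node's randomized timer fires first becomes the candidate.

    It increments the term and, in this simplified model with no rival
    candidates, wins the election unopposed.
    """
    timeouts = {n: random.uniform(*ELECTION_TIMEOUT_MS) for n in node_ids}
    candidate = min(timeouts, key=timeouts.get)  # first timer to expire
    return candidate, current_term + 1
```

Because the timeouts are drawn independently, two nodes firing at nearly the same instant is unlikely; when a split vote does occur, every candidate simply times out again with a fresh random value and retries.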
Real-world implementations include etcd (used by Kubernetes for cluster state), CockroachDB, and TiKV. Understanding consensus is essential for anyone building or operating distributed systems, because every distributed database, every coordination service, and every replicated state machine depends on some variant of these ideas.
The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proves that no deterministic consensus algorithm can guarantee termination in an asynchronous system with even one faulty process. This is why all practical consensus algorithms rely on partial synchrony assumptions or randomization. The impossibility is not a barrier to building systems — it's a reminder that distributed computing requires careful thinking about failure models.