Surveyed Papers

  1. Time, Clocks, and the Ordering of Events in a Distributed System, Leslie Lamport

  2. Paxos Made Simple, Leslie Lamport

  3. In Search of an Understandable Consensus Algorithm, Diego Ongaro and John Ousterhout

Abstract:

Distributed systems consist of multiple computers networked together to solve a common computational task, aiming to achieve better performance, reliability, and scalability. In this computational model, each node in the network has its own physical CPU, memory, and storage. By distributing the computational workload across multiple machines, the overall processing power of the system is increased, allowing large data sets and complex tasks to be processed faster and more efficiently. However, building distributed systems presents significant challenges, such as managing concurrency, coordinating across multiple nodes, and ensuring fault tolerance.

Aim

This survey focuses on the development of time, clocks, and event ordering in distributed systems, as well as the emergence of the Paxos and Raft consensus algorithms and the differences between them. The strengths and weaknesses of these algorithms are analyzed in detail, and their performance is compared to provide a clear understanding of their respective merits. Finally, the survey concludes by highlighting the key takeaways of these papers, noting where they fall short, and suggesting avenues for future research in this domain.

Introduction

In the past decades, the proliferation of continuous data streams has made it increasingly difficult for a single computer to handle large-scale computational tasks on its own, even with advanced hardware. This explosion of data generation has accelerated both the study and the adoption of distributed systems to address these problems. Distributed systems are often used for applications requiring large-scale data processing, such as fluid simulations, weather forecasting, financial modeling, and training artificial intelligence models. They have also found use in fault-tolerant systems, such as distributed databases, telecommunication networks, real-time process control, and parallel computation.

Background

Leslie Lamport's paper "Time, Clocks, and the Ordering of Events in a Distributed System" tackles the challenge of designing distributed systems that provide a consistent view of events across multiple nodes. In a distributed system, each process performs computations, communicates with other processes, and has its own logical clock that timestamps events based on their order of occurrence. Lamport explains that a system is considered distributed if the message transmission delay is significant compared to the time between events in a single process. Readers might think of a multiprocessing system on a single computer as similar to a distributed system, since multiprocessing raises comparable problems: no guarantees can be made about the order in which threads will execute and results will be obtained.

Lamport presents a method for clock synchronization, but cautions that it may not be possible to definitively determine the order in which events occur in a distributed system. Additionally, the paper describes the upper bound on the degree of clock synchronization in a distributed system. The concepts presented in Lamport's paper have had a significant impact on the development of distributed systems and continue to be relevant in the design and implementation of modern distributed systems.

Synchronization of Systems

Logical relationship between processes

Lamport describes a distributed system as a collection of processes, where each process is made up of a sequence of events. It is assumed that the events within a process form a sequence in which one event occurs before another. For example, if event 'a' happens before event 'b', it is denoted as a -> b. In a distributed system, receiving a message is considered an event in the process.

Distributed systems, as described by Lamport, lack perfectly synchronized physical clocks. Instead, they rely on establishing happened-before and happened-after relationships without reference to physical time; these relationships are determined by the execution order of processes.

The notation a -> b is used to represent happened before relationships between events in the same process as well as between events in different processes. For instance, if event 'a' involves sending a message from one process and event 'b' involves receiving the message in another process, the relationship between them is also denoted as a -> b. Furthermore, if a -> b and b -> c, it can be deduced that a -> c.

In contrast to the happened-before and happened-after relationships, two events are considered concurrent if neither event has a causal effect on the other, even if one occurs before the other in physical time. This is represented by a -x-> b and b -x-> a, where 'a' and 'b' are concurrent events.

In summary, synchronization in distributed systems relies on establishing happened-before and happened-after relationships without using physical clocks, as explained by Lamport. These relationships are based on the execution order of processes, with events within a process forming a sequence.

Introduction of Logical Clocks into Distributed Systems:

Logical clocks are introduced into distributed systems by assigning numbers to events, representing the time at which the event occurs. Each process, denoted as Pi, has its own clock Ci that assigns a number Ci{a} to events in the process. Since these clocks don't correspond to physical time, they are more akin to logical ordering mechanisms than actual clocks.

What, then, does it mean for these clocks to be "correct" in a distributed system? Since they lack a timekeeping mechanism, correctness is based on the ordering of events in the system relative to each other. An intuitive definition of the happened-before relationship in terms of clocks is: if a -> b, then C{a} < C{b}.

Two conditions need to hold for the Clock Condition to be satisfied:

Condition C1: If events a and b are in the same process Pi and a comes before b, then Ci{a} < Ci{b}

Condition C2: If event a is sending a message from process Pi to Pj and b is the event of receiving the message, then Ci{a} < Cj{b}

Lamport assumes that a clock for a specific process will "tick" through every number, with ticks occurring between the events of that process. For example, if process Pi has consecutive events a and b with Ci{a} = 4 and Ci{b} = 7, then clock ticks 5, 6, and 7 occur between these two events. Lamport uses dashed "tick lines" to connect same-numbered ticks from different processes. Condition C1 implies that there must be a tick line between any two events on a process line, while Condition C2 dictates that every message line must cross a tick line. Readers can refer to the visual interpretation provided by Lamport for a clear representation of the two conditions that satisfy the Clock Condition [1][Figure 3].

More formally, the overall system of clocks is given as a function C that assigns to any event b the number C{b}, where C{b} is denoted Cj{b} if b is an event in process Pj. At this stage, no relationship is established between the numbers Ci{a} and the physical time of the system, so the clocks are still considered "logical clocks" rather than physical clocks.

In this context, processes can be thought of as algorithms and events as steps in the execution of those algorithms. If Pi is the process being executed, its clock is represented by a register Ci, and Ci{a} is the value contained in Ci during the execution of event a. The value of Ci changes between events, but this change is not itself an event. Clocks "tick" only when events occur, rather than advancing periodically with physical time.

To satisfy Conditions C1 and C2, clocks must adhere to specific rules. Condition C1 reflects the ordering of events within a process and is satisfied by each process incrementing its clock between successive events. Condition C2 is more complex, necessitating that each message m carry a timestamp Tm, representing the time at which the message was sent. The recipient must then advance its clock to be at or later than Tm. For example, if event a in process Pi sends a message m to process Pj with timestamp Tm = Ci{a}, then Pj must set Cj to be greater than or equal to Tm.

Note: semantically, receiving the message m is considered to occur after Cj has been set. However, this rule is only of notational importance and has no implementation significance.
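The two clock rules above can be sketched as a small logical clock. This is an illustrative implementation, not code from the paper; the class and method names are our own:

```python
class LamportClock:
    """Minimal sketch of a Lamport logical clock for one process."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock between successive local events (Condition C1)."""
        self.time += 1
        return self.time

    def send(self):
        """A send is an event: tick, and use the new value as the timestamp Tm."""
        return self.tick()

    def receive(self, tm):
        """On receipt, jump past the message timestamp Tm (Condition C2)."""
        self.time = max(self.time, tm) + 1
        return self.time
```

With two such clocks, a message from one process always carries a timestamp that the receiver's clock strictly exceeds after the receive event, which is exactly what C2 requires.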

Total ordering of events

The Clock Condition can be utilized to establish a total ordering of all events within the system. To break ties in event ordering, the author introduces an arbitrary total ordering < of the processes. A new relation => is then defined such that if a is an event in process Pi and b is an event in process Pj, a => b holds if and only if either Ci{a} < Cj{b}, or Ci{a} = Cj{b} and Pi < Pj. Naturally, the happened-before relation a -> b implies a => b. Thus, => is a total ordering of events, complementing the "happened before" partial ordering ->.

It is crucial to note that the total ordering representation => is not unique and depends on the system of clocks Ci. Different choices of clocks can satisfy Conditions C1 and C2, yielding distinct total ordering representations. The total ordering extends the partial ordering, and the clock conditions ensure this total ordering. Uniqueness exists only at the partial ordering level, determined by the system of events.
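As a sketch of how the tie-breaking works, events can be compared as (clock value, process id) pairs, with the process id standing in for the arbitrary process ordering <. The pair layout and names here are our own illustration, not notation from the paper:

```python
def total_order_key(event):
    """Sortable key implementing =>: compare clock values first,
    then break ties with the process ordering (here, process id)."""
    timestamp, process_id = event
    return (timestamp, process_id)

# Three events as (clock value, process id) pairs; two share a timestamp.
events = [(3, 2), (3, 1), (1, 5)]
events.sort(key=total_order_key)
# events is now [(1, 5), (3, 1), (3, 2)]
```

Sorting by this key yields one total ordering consistent with the happened-before partial order; a different process ordering would yield a different, equally valid total order.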

In distributed systems, total event ordering can be beneficial for implementing various functionalities. A correct system of logical clocks is required to achieve such ordering. One practical application of total ordering is demonstrated through the mutual exclusion problem.

The mutual exclusion problem arises when a group of processes shares a single resource, and only one process can use the resource at a time. To prevent conflicts, processes must synchronize their resource usage. The objective is to devise an algorithm that grants the resource to a process while satisfying three conditions:

  1. A process granted the resource must release it before it can be granted to another process.
  2. Requests for the resource must be granted in the order they are made.
  3. If every process granted the resource eventually releases it, then every request is eventually granted.

Lamport first highlights that using a central scheduling process to grant requests in the order received is insufficient without additional assumptions. Lamport's proposed solution is implementing a clock system and defining a total ordering of all events. The proposed algorithm comprises five rules outlining the interaction of each process:

  1. To request the resource, a process sends a timestamped request message to every other process and adds the message to its own request queue.
  2. Upon receiving a request message, a process places it in its request queue and sends a timestamped acknowledgment to the sender.
  3. To release the resource, a process removes its request message from its queue and informs all other processes through a timestamped release message.
  4. Upon receiving a release message, a process removes the corresponding request message from its queue.
  5. A process is granted the resource when two conditions are met: its own request message is ordered (by the total ordering =>) before every other request in its queue, and it has received a message from every other process timestamped later than its request.

The paper assumes that messages between processes are received in the same order they are sent and that every message is eventually received. Note that both conditions of Rule 5 can be tested locally by each process.
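The local test of Rule 5 might be sketched as follows. The data structures (a queue of (timestamp, pid) request pairs and a map of the latest timestamp seen from each process) are our own simplification of Lamport's algorithm, not his presentation:

```python
def may_use_resource(my_request, request_queue, latest_seen):
    """Rule 5's local test.

    my_request:    (timestamp, pid) of this process's own request.
    request_queue: all outstanding requests known locally, as (timestamp, pid).
    latest_seen:   pid -> latest timestamp received from that process.
    """
    # Condition (i): my request precedes every other request in the
    # total ordering => (tuple comparison breaks timestamp ties by pid).
    if any(req < my_request for req in request_queue if req != my_request):
        return False
    # Condition (ii): a message timestamped later than my request has
    # arrived from every other process.
    ts, my_pid = my_request
    return all(t > ts for pid, t in latest_seen.items() if pid != my_pid)
```

For example, a process whose request is earliest in the queue and who has heard later-stamped messages from everyone may proceed; any earlier competing request blocks it.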

Adding Physical Clocks To The Mix

Lamport's paper then delves into techniques for synchronizing physical clocks, viewed in a space-time framework, to guarantee accurate time measurement in distributed systems. Lamport introduces two conditions that must be satisfied for proper clock synchronization. The first condition, PC1, ensures that each clock runs at approximately the correct rate. The second condition, PC2, mandates that the clocks are synchronized with one another so that the variation in height of a single tick line is less than a sufficiently small constant e. Additionally, the paper emphasizes the need for an algorithm to prevent clocks from drifting apart, and notes that e must be small enough to avert anomalous behavior.

Lamport then presents an algorithm for synchronizing clocks in the distributed system, aiming to keep the clocks of different processes consistent. The algorithm is based on two rules, IR1 and IR2, and accounts for the delay between sending and receiving a message; timestamps carried in messages are used to synchronize the clocks. Lamport also models the communication lines between processes as a directed graph and uses the graph's diameter to bound the time it takes for the clocks to become synchronized. The author concludes that, provided specific conditions are met, the algorithm keeps clock readings consistent within a certain margin of error.

It is important to note that Lamport acknowledges the significant amount of prior work on clock synchronization. However, the approach presented in this paper, in which clocks are never set backwards, distinguishes it from previous studies, adding a new perspective to the clock synchronization work done to date.

To clarify the scope of Lamport's paper, however, it should be noted that it primarily focuses on logical clocks, their synchronization, and the ordering of events in a distributed system. The directed graph and related concepts are used to explain and visualize these ideas rather than to synchronize physical clocks directly.

Emergence of Consensus Algorithms

In his influential "Time, Clocks, and the Ordering of Events in a Distributed System," Lamport lays the foundation for understanding the partial ordering of events (the happened-before relationship) and the total ordering of events in distributed systems, which are essential for maintaining consistency and achieving consensus among distributed nodes. In his later work, "Paxos Made Simple," Lamport introduces the Paxos consensus algorithm, which provides a robust mechanism for ensuring that a collection of distributed nodes can agree on a single value, even in the face of failures and network partitions.

Paxos is a widely adopted consensus algorithm and is used in many large-scale distributed systems. Lamport's logical clocks and the happened-before relationship are not directly applied in the Paxos algorithm, but they form essential background for understanding distributed systems and their challenges, such as consensus.

"In Search of an Understandable Consensus Algorithm" by Ongaro and Ousterhout introduces the Raft consensus algorithm as an alternative to Lamport's Paxos. The authors observe that even experts in the field have a difficult time grasping Paxos, and state that their goal was to design a consensus algorithm that is more understandable and easier to implement than Paxos while providing similar guarantees of safety and liveness. Raft achieves this by decomposing the consensus problem into smaller subproblems and by using a leader-based approach to manage state replication among distributed nodes.

In "Paxos Made Simple," Lamport sets out to debunk the claim that the Paxos fault-tolerant distributed algorithm is difficult to understand, a reputation he attributes to its original presentation. He argues that the algorithm is actually simple, even obvious, and is based on a consensus algorithm he calls the "Synod" algorithm. The full Paxos algorithm is obtained by applying this consensus algorithm to the state-machine approach for building distributed systems.

In a distributed system in which a number of processes can propose values, how will the system reach consensus? The requirements of such a system are:

  1. Only a value that has been proposed can be chosen
  2. Only a single value is chosen
  3. A process never learns that a value has been chosen unless it actually has been.

The Paxos algorithm satisfies the three requirements outlined above. There are three roles in the consensus algorithm, performed by three classes of agents: proposers, acceptors, and learners. These agents communicate by sending messages, in the message-passing model of distributed systems discussed in the prior paper. Precise liveness requirements are not specified, but the goal is that some proposed value is eventually chosen and that processes can then learn the chosen value.

The failure model is as follows: agents operate at arbitrary speed, may fail by stopping, and may restart. Since all agents may fail, an agent must be able to remember its state after failing and restarting for a solution to be possible. Messages can take arbitrarily long to reach their intended targets, can be duplicated, and can even be lost, but they are not corrupted.

Mechanism for choosing a proposed value: Lamport describes that, to distinguish among multiple proposals, each proposal is assigned a proposal number. The proposed solution uses multiple acceptor agents and a majority-vote rule: a value is chosen when a proposal with that value has been accepted by a majority of the acceptors.

To ensure progress even in the absence of failure or message loss, an acceptor must accept the first proposal it receives. However, this can lead to a situation where several values are proposed at roughly the same time and no single value is accepted by a majority of acceptors. To solve this problem, an acceptor must be allowed to accept more than one proposal, which is made possible by assigning a unique number to each proposal. A value is chosen when a single proposal with that value has been accepted by a majority of the acceptors, and all chosen proposals must have the same value to ensure safety.

To guarantee safety and progress, several conditions must be satisfied, including the requirement that every higher-numbered proposal that is chosen has the same value as the first chosen proposal, and the requirement that if a proposal with a value is chosen, then every higher-numbered proposal accepted by any acceptor must also have that value. These conditions ensure that only a single value is chosen and that progress is made even in the presence of multiple proposals being issued at the same time.
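An acceptor's side of these conditions can be sketched as the familiar two-phase exchange. This is our own simplified illustration with invented names, not Lamport's presentation, and it omits the durable storage a real acceptor needs to survive restarts:

```python
class Acceptor:
    """Simplified Paxos acceptor; state would be persisted in practice."""

    def __init__(self):
        self.promised = -1    # highest proposal number promised so far
        self.accepted = None  # (number, value) of highest accepted proposal

    def prepare(self, n):
        """Phase 1b: promise to ignore proposals numbered below n,
        reporting any previously accepted proposal so the proposer
        must reuse its value (preserving safety)."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        """Phase 2b: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "rejected"
```

The returned previously-accepted proposal in phase 1b is what forces every higher-numbered proposal that gets chosen to carry the already-chosen value.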

Lamport then describes how to implement a distributed system using a collection of servers, and how the Paxos consensus algorithm ensures that all servers execute the same sequence of state machine commands. The system is modeled as a deterministic state machine that performs client commands in a certain order.

During normal operation, a single server is elected as the leader and acts as the distinguished proposer in all instances of the consensus algorithm. Clients send commands to the leader, who decides where each command should appear in the sequence. The leader proposes the command as the value of the corresponding instance of the consensus algorithm, and if successful, the command is added to the sequence of chosen commands.

If the previous leader fails, a new leader is selected, who should know most of the chosen commands. The new leader then executes phase 1 of the consensus algorithm for the missing instances, proposes the values for the chosen instances, and fills any gaps with special "no-op" commands. Once all missing commands have been chosen, the new leader can propose the next client command.

However, a gap in the sequence of chosen commands can arise if a leader fails before its proposed commands are chosen: since a leader can get up to n commands ahead, a gap of up to n-1 commands may appear. As described above, the new leader executes phase 1 for the missing instances, proposes the missing values, and fills any gaps with no-op commands. Lamport also discusses potential problems and solutions, including handling multiple leaders, maintaining consistency across servers, and recovering from failures.

An Understandable Consensus Algorithm

Raft's introduction arises from the need for a more understandable consensus algorithm that provides a better foundation for building practical distributed systems. Ongaro and Ousterhout claim that Raft produces results equivalent to multi-Paxos and is just as efficient, but with a different structure that makes it more understandable. Raft makes this possible by separating the key elements of consensus.

Consensus algorithms are essential for building reliable large-scale software systems, and Paxos had dominated the space for the decade before this paper was published in 2014. However, Paxos falls short when it comes to the practical implementation of such systems, and its architecture requires complex changes to support them. In addition to these implementation challenges, it is hard to understand. The authors therefore set out to design a new consensus algorithm that is significantly easier to learn than Paxos and that facilitates the development of the intuitions essential for system builders. Their primary goal was understandability: to define a consensus algorithm for practical systems for which it is obvious why it works.

The Raft algorithm uses techniques like decomposition and state-space reduction to improve understandability. It has several novel features, including a stronger form of leadership, randomized timers for electing leaders, and a joint-consensus approach to changing the set of servers in the cluster. The authors consider Raft superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation, because it is simpler and more understandable, is described completely enough to meet the needs of a practical system, has open-source implementations and is used by several companies, has formally specified and proven safety properties, and has efficiency comparable to other algorithms.

The authors claim that Paxos has two significant drawbacks, one of which has already been mentioned: the difficulty of understanding it. They claim that even seasoned researchers struggle with Paxos, and that it took them almost a year to understand it themselves. The opaqueness of Paxos stems from its choice of the single-decree subset as its foundation, which is dense and subtle, making it difficult to develop intuitions about why the protocol works. Second, Paxos does not provide a good foundation for building practical implementations: there is no widely agreed-upon algorithm for multi-Paxos, and the Paxos architecture is not suited to practical systems. As a result, practical systems bear little resemblance to Paxos; each implementation begins with Paxos and then develops a significantly different architecture. The authors designed an alternative consensus algorithm, Raft, in response to these problems.

The most important goal of the designers was understandability, so that a large audience could comfortably comprehend the algorithm. To achieve this goal, the designers used problem decomposition to divide problems into separate pieces that could be solved independently, and simplified the state space by reducing the number of states to consider, making the system more coherent, and eliminating nondeterminism where possible. In some cases, nondeterminism was actually used to improve understandability, such as in the randomized approach used to simplify the Raft leader election algorithm. Overall, the designers of Raft evaluated alternative approaches based on understandability and chose the approach that was easiest to explain and understand.

Raft implements consensus by electing a leader responsible for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. Raft decomposes the consensus problem into three subproblems: leader election, log replication, and safety. Leader election is the process of choosing a new leader when the existing one fails. Log replication involves replicating log entries across the cluster and ensuring the logs agree. Safety refers to the State Machine Safety Property: if a server has applied a log entry at a given index, no other server will ever apply a different command for that index. Raft ensures this property by adding a restriction to the election mechanism.
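Leader election in Raft relies on randomized timeouts so that repeated split votes are unlikely: each follower waits a fresh random interval before becoming a candidate. A minimal sketch, assuming the 150-300 ms range the paper discusses (the function name and defaults are illustrative):

```python
import random

def election_timeout(low_ms=150, high_ms=300):
    """Pick a fresh random election timeout for a follower.
    The follower whose timeout expires first becomes a candidate,
    so ties between simultaneous candidates rarely recur."""
    return random.uniform(low_ms, high_ms)

# Five followers each draw their own timeout; the minimum would
# start the next election first.
timeouts = [election_timeout() for _ in range(5)]
```

Because timeouts are re-randomized for every election, two candidates that split one vote will almost surely not collide again in the next round.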

Safety in Raft

In leader-based consensus algorithms, the leader must eventually store all committed log entries. In some algorithms, a leader can be elected even if it does not initially contain all the committed entries, and additional mechanisms are then needed to identify and transmit the missing entries to the new leader. Raft takes a simpler approach: it guarantees that all committed entries from previous terms are present on each new leader from the moment of its election, without transferring them. Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. If a candidate's log is at least as up-to-date as any other log in the majority it gathers votes from, it holds all the committed entries; a voter denies its vote if its own log is more up-to-date than the candidate's.
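The "at least as up-to-date" voting check compares the term and index of the last entry in each log: a later last term wins, and equal last terms are broken by log length. A sketch with illustrative parameter names:

```python
def candidate_log_ok(cand_last_term, cand_last_index,
                     voter_last_term, voter_last_index):
    """Return True if the candidate's log is at least as up-to-date
    as the voter's, so the voter may grant its vote."""
    if cand_last_term != voter_last_term:
        # Logs whose last entry has a later term are more up-to-date.
        return cand_last_term > voter_last_term
    # Same last term: the longer log is more up-to-date.
    return cand_last_index >= voter_last_index
```

A candidate rejected by this check cannot assemble a majority, which is how Raft keeps leaders that lack committed entries from being elected.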

The paper presents a new algorithm called Raft that is designed to be more understandable than Paxos, which has been a challenging algorithm for students and developers for many years. The primary design goal of Raft was to prioritize understandability, which changed the approach to the design process. The designers found themselves repeatedly reusing techniques such as decomposing the problem and simplifying the state space, which not only improved the understandability of Raft but also made it easier to verify its correctness. The authors believe that Raft provides a better foundation for system building than Paxos.

Strengths And Weaknesses

Time, Clocks, and the Ordering of Events in a Distributed System

The approach described by Lamport involves using a state machine consisting of a set of possible commands and states, which can implement any desired form of synchronization for a distributed multiprocess system. Each process independently simulates the execution of the state machine, using commands issued by all processes, and synchronization is achieved by ordering the commands according to their timestamps. The resulting algorithm requires the active participation of all processes, and failure is a difficult problem to address: without physical time, it is hard to distinguish a failed process from one that is merely pausing between events.

The article discusses how a resource scheduling algorithm orders requests based on the total ordering, which can result in anomalous behavior when requests are issued externally. The article gives the example of a nationwide system of interconnected computers in which a person issues a request A on computer A and then telephones a friend in another city to have him issue a request B on a different computer B. Because the system has no knowledge of the precedence of A and B, it is possible for request B to receive a lower timestamp and be ordered before request A.

To avoid anomalous behavior, the article suggests two possible ways. The first way is to explicitly introduce the necessary information about ordering into the system. In the provided example, the person issuing request A could receive the timestamp TA of that request from the system, and when issuing request B, his friend could specify that B be given a timestamp later than TA. This gives the user the responsibility for avoiding anomalous behavior.

The second approach is to construct a system of clocks that satisfies the Strong Clock Condition. The Strong Clock Condition states that for any events a, b in the system, if a happens before b, then the clock time of a is less than the clock time of b. This condition is stronger than the ordinary Clock Condition because it relates events that happen outside the system. The article explains that logical clocks do not generally satisfy the Strong Clock Condition, and physical clocks are needed to eliminate anomalous behavior.

The article concludes by noting what Lamport calls one of the mysteries of the universe: it is possible to construct a system of physical clocks which, running independently of one another, satisfies the Strong Clock Condition. This implies that physical clocks can be used to eliminate anomalous behavior in a system.

A scenario in which multiple proposers keep issuing proposals that are never chosen can occur in Paxos's proposed solution. This happens when one proposer completes phase 1 for a proposal number and another proposer then completes phase 1 for a greater proposal number; the acceptors ignore the first proposer's phase 2 accept requests because of the second proposer's higher-numbered proposal, and so on in a cycle that can continue indefinitely with no progress made. To ensure progress, a distinguished proposer, analogous to the distinguished learner, is selected as the only agent that issues proposals. If this proposer can communicate successfully with a majority of the acceptors and uses a proposal number greater than any previously issued number, it will eventually succeed in getting a proposal accepted.

Although the authors claim that Raft's performance is similar to that of other consensus algorithms and suggest ways it could be improved, the lack of detail remains a weak point of the paper. How similar is the performance? Is it worse? What kind of improvements can the reader expect from the suggested optimizations?

Conclusion

The three papers introduced in this survey delve into the fundamentals of distributed systems and the challenges researchers and students face in understanding them, implementing them, and ensuring their reliability. The first paper, "Time, Clocks, and the Ordering of Events in a Distributed System" by Leslie Lamport, proposed a partial ordering of events using logical clocks to resolve the lack of a global clock in a distributed system.

The second paper, "Paxos Made Simple," also authored by Leslie Lamport, introduced the Paxos algorithm, which describes how to achieve consensus in a distributed system. Paxos is widely used in distributed systems due to its effectiveness, but its complexity and difficulty have made it challenging for many developers to implement and deploy. In addition to its complexity, Diego Ongaro and John Ousterhout, the authors of the third paper, "In Search of an Understandable Consensus Algorithm," claim that engineers have to deviate from Lamport's proposed algorithm because it is not complete, and many problems arise when implementing it in a real system.

The designers of the Raft algorithm have acknowledged the importance and contributions of Paxos, while also noting its complexity and the difficulty of understanding it. They observe that while Raft draws many parallels with Paxos, it represents a distinct approach to achieving consensus.

The third paper, "In Search of an Understandable Consensus Algorithm" by Diego Ongaro and John Ousterhout, introduces the Raft consensus algorithm, an alternative to Paxos. Raft was designed with understandability in mind, making it easier to implement and maintain than Paxos. Raft provides similar guarantees to Paxos in terms of consistency and fault tolerance but achieves these guarantees through a different approach.

In conclusion, these three papers are essential reading for anyone interested in learning more about distributed systems and the basics of consensus algorithms. The first paper provides the necessary background for understanding the second and third papers, which introduce two consensus algorithms: Paxos and Raft. Paxos, while powerful and effective, comes with a steep learning curve, making it difficult for developers to implement and deploy. In contrast, Raft was designed to be more understandable and easier to implement and maintain, making it a popular choice for many developers implementing a consensus algorithm.

References:

  1. Lamport, L. "Time, Clocks, and the Ordering of Events in a Distributed System." http://lamport.azurewebsites.net/pubs/time-clocks.pdf

  2. Lamport, L. "Paxos Made Simple." http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/past/03F/notes/paxos-simple.pdf

  3. Ongaro, D., and Ousterhout, J. "In Search of an Understandable Consensus Algorithm." https://www.usenix.org/system/files/conference/atc14/atc14-paper-ongaro.pdf