Abstract

Checkpointing is a technique in which a snapshot of an application's state is saved so that the application can be restarted from that point after a failure, thereby providing fault tolerance on failure-prone computing systems. This paper presents a survey of three early checkpointing systems: Libckpt, Supporting Checkpointing and Process Migration Outside the UNIX Kernel, and Multiple Bypass. The survey covers the key aspects of these tools, including their goals, architecture, performance, and limitations.

Introduction

Checkpointing is a technique in which a snapshot of an application's state is saved so that the application can be restarted from that point after a failure, thereby providing fault tolerance on failure-prone computing systems. Checkpointing is particularly important for long-running application programs, because it lets them resume from the last saved state and so continue beyond the time allocated to a single run.

Transparent checkpointing is a checkpointing technique that requires no modification to the application program; a checkpoint can be initiated externally or by the application program itself. When a checkpoint is initiated: (i) the execution of the application program is suspended; (ii) a snapshot of the application's state is saved; and (iii) the execution of the application program is resumed.
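The three steps above can be illustrated with a small sketch. It is a toy model rather than the mechanism of any surveyed system: the `Counter` class, the `checkpoint` and `restart` helpers, and the use of `pickle` for the snapshot are all assumptions made for illustration, and the "application state" is reduced to a single dictionary.

```python
import os
import pickle
import tempfile


class Counter:
    """Toy application whose entire state is one dictionary (hypothetical)."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        self.suspended = False

    def run(self, steps):
        for _ in range(steps):
            if self.suspended:
                break
            self.state["step"] += 1
            self.state["total"] += self.state["step"]


def checkpoint(app, path):
    app.suspended = True            # (i) suspend execution
    with open(path, "wb") as f:     # (ii) save a snapshot of the state
        pickle.dump(app.state, f)
    app.suspended = False           # (iii) resume execution


def restart(path):
    """Build a fresh application and load the saved snapshot into it."""
    app = Counter()
    with open(path, "rb") as f:
        app.state = pickle.load(f)
    return app


ckpt_path = os.path.join(tempfile.mkdtemp(), "counter.ckpt")

app = Counter()
app.run(3)                      # work done before the checkpoint
checkpoint(app, ckpt_path)
app.run(2)                      # work that would be lost on a failure

recovered = restart(ckpt_path)  # simulate restarting after a failure
recovered.run(2)                # redo only the work since the checkpoint
```

After the restart, only the two steps executed since the checkpoint have to be repeated, which is the sense in which checkpointing reduces lost work.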

The next three subsections will provide a summary of each of the three papers, covering their key contributions, implementation details, and evaluation results.

Libckpt: Transparent Checkpointing under Unix

This paper presents a solution to the challenge of implementing transparent checkpointing in Unix-based systems. The authors propose a library called Libckpt, which provides a checkpointing mechanism that can be easily integrated into existing applications without modifying their source code. The library works by intercepting calls to system functions that modify the process state and recording the changes made to the process in a checkpoint file. The checkpoint file can be used to restore the process state in case of a failure or error. The authors evaluate the performance of Libckpt by measuring the overhead introduced by the library on various applications. They conclude that the overhead introduced by Libckpt is low and that the library can be used for transparent checkpointing in production environments.

Supporting Checkpointing and Process Migration Outside the Unix Kernel

This paper presents a solution to the challenge of implementing checkpointing and process migration on UNIX systems without modifying the kernel. The authors propose a user-level system in which the application is linked with a checkpointing library that routes its system calls through library stubs and, when a checkpoint is requested, records the process state in a checkpoint file. The authors also propose a mechanism for process migration between different nodes in a distributed system. They evaluate their solution by measuring the overhead introduced by the checkpointing mechanism and the time taken to migrate a process from one node to another, and conclude that it is efficient enough for transparent checkpointing and process migration on unmodified UNIX systems.

Multiple Bypass: Interposition Agents for Distributed Computing

This paper presents a solution to the challenge of implementing transparent checkpointing and other services in distributed computing systems. The authors propose a mechanism called multiple bypass, which uses interposition agents to intercept system calls made by applications and provide services such as checkpointing, process migration, and remote procedure calls. The authors evaluate the performance of their solution by measuring the overhead introduced by the interposition agents and the time taken to complete various operations. They conclude that their solution is efficient and can be used for transparent checkpointing and other services in distributed computing systems.

Background

Checkpointing is a well-known technique in computer systems to provide fault-tolerance by periodically saving the state of a running program in a checkpoint. In the event of a failure, the program can be restarted from a checkpoint to resume its execution, reducing the amount of lost work. The idea of checkpointing has been around for several decades, and several research papers have been published in this domain. This section aims to provide a brief background on the history and evolution of checkpointing techniques.

Early History of Checkpointing

One of the earliest mentions of checkpointing in the literature is in the work of John von Neumann, who proposed a method for saving the state of a computing machine on punched tape in the early 1950s. In the following years, several researchers proposed different techniques for checkpointing, including the use of virtual memory and hardware support for checkpointing.

Checkpointing in UNIX

In the 1990s, the UNIX operating system became the dominant platform for scientific computing, and several research groups started developing checkpointing techniques for UNIX. One of the earliest such techniques was Libckpt, developed by Jim Plank and his colleagues at the University of Tennessee. Libckpt was a user-level checkpointing library that could be used with any UNIX program without modifying the program's source code. Another notable checkpointing technique for UNIX was developed by Michael Litzkow and Marvin Solomon at the University of Wisconsin-Madison. Their technique supported checkpointing and process migration outside the UNIX kernel.

Checkpointing in Distributed Systems

In the early 2000s, checkpointing techniques were extended to distributed systems to provide fault-tolerance for large-scale computations. One such technique was the use of interposition agents, proposed by Douglas Thain and Miron Livny in their work on Multiple Bypass. In this technique, interposition agents intercept system calls to transparently checkpoint and migrate processes across different machines in a distributed system.

Checkpointing has a long history in computer systems, and several techniques have been proposed over the years. Early techniques focused on checkpointing in single-machine systems, while later techniques extended the idea to distributed systems. The next sections of this survey paper will examine some of the seminal papers in this area, highlighting their contributions and limitations.

Taxonomy of Checkpointing Approaches

Checkpointing has been a widely researched area in distributed computing systems. In this section, we present a taxonomy of the different approaches for checkpointing as discussed in the three papers reviewed earlier.

System-Level Checkpointing

This approach involves the checkpointing of the entire system state, including the operating system, process state, and application data. This method is mostly used in traditional high-performance computing systems, where the checkpoints are usually taken at specific intervals or when requested by the user. Plank et al.'s Libckpt approach falls under this category, where they implement checkpointing at the system call level, providing transparent and efficient checkpointing for UNIX applications.

Application-Level Checkpointing

In this approach, the checkpointing mechanism is implemented within the application itself, and the checkpointed data is specific to the application. Application-level checkpointing is suitable for distributed systems, where the application is running on multiple nodes, and the application state must be checkpointed independently of the operating system. Litzkow and Solomon's approach is an example of this, where the checkpointing mechanism is implemented as a user-level library.

Process-Level Checkpointing

This approach is similar to application-level checkpointing but focuses on checkpointing specific processes. It is commonly used in systems where individual processes can be restarted independently without affecting other processes. Thain and Livny's approach falls under this category, where they use interposition agents to checkpoint individual processes running on distributed systems.

Hybrid Checkpointing

This approach combines two or more of the above checkpointing techniques to achieve better fault tolerance and recovery. Hybrid checkpointing is commonly used in large-scale systems, where the different components have different requirements for checkpointing. For example, a distributed application may use system-level checkpointing for the operating system and application-level checkpointing for individual processes.

In conclusion, the above taxonomy provides a useful way of categorizing the different checkpointing approaches. While the reviewed papers present different checkpointing techniques, they all fall under one of the above categories. This taxonomy can help researchers understand the strengths and weaknesses of each checkpointing approach and choose the most appropriate one for their specific application.

Approach

The three surveyed papers propose different approaches for implementing checkpointing in computing systems. In this section, we will discuss each approach in more detail.

The first paper, "Libckpt: Transparent Checkpointing under UNIX," by Plank et al., proposes a library-based solution for implementing checkpointing. This approach involves adding a library to an application that provides checkpointing functionality. The library intercepts system calls and other events that may affect the application state and saves the relevant information to a checkpoint file. The application can then be restarted from the last checkpoint if a failure occurs.

One strength of library-based checkpointing is that it can be easily integrated into existing applications without modifying their source code. The library can be linked at runtime, and the checkpointing functionality can be activated using command-line arguments or environment variables. This approach also provides a high degree of transparency since the application is not aware of the checkpointing process.

The second paper, "Supporting Checkpointing and Process Migration Outside the UNIX Kernel," by Litzkow and Solomon, proposes a user-level mechanism for checkpointing that needs no kernel changes. The application is linked with a checkpointing library; when a checkpoint is requested from outside the process, code in the library saves the process state to a checkpoint file, from which the process can later be restarted on the same or a different machine.

One advantage of this approach is that it provides greater flexibility and control over the checkpointing process. Because checkpoints are requested from outside the application, the checkpointing frequency can be adjusted without changing the application. The same mechanism also supports process migration, which can improve overall system utilization.

The third paper, "Multiple Bypass: Interposition Agents for Distributed Computing," by Thain and Livny, proposes an interposition agent to provide checkpointing and other services in a distributed computing environment. This approach involves intercepting system calls and network traffic between the client and server nodes and redirecting them to the agent. The agent can then provide the requested service, such as checkpointing, replication, or load balancing.
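The idea of interposition can be sketched in a few lines. The example below is only an illustration of the general pattern and does not reproduce the Bypass tool or its interfaces: `InterpositionAgent`, `fake_open`, and the file names are hypothetical, and an in-memory buffer stands in for a real file so that the sketch stays self-contained.

```python
import io


class InterpositionAgent:
    """Sits between an application and the calls it would otherwise make directly.

    Hypothetical sketch: it forwards open/close requests to the real
    implementation while keeping a table of open files, which is the kind of
    state a checkpointing service needs in order to reopen files on restart.
    """

    def __init__(self, real_open):
        self._real_open = real_open
        self.open_files = {}   # id(handle) -> (path, mode)
        self.log = []          # record of every intercepted call

    def open(self, path, mode="r"):
        self.log.append(("open", path))
        handle = self._real_open(path, mode)      # forward to the real call
        self.open_files[id(handle)] = (path, mode)
        return handle

    def close(self, handle):
        path, _mode = self.open_files.pop(id(handle), (None, None))
        self.log.append(("close", path))
        handle.close()


def fake_open(path, mode="r"):
    """Stand-in for the real open call (assumption: no real files are touched)."""
    return io.StringIO()


agent = InterpositionAgent(fake_open)
a = agent.open("input.dat", "r")
b = agent.open("output.dat", "w")
agent.close(a)
```

The application calls `agent.open` and `agent.close` exactly as it would call the underlying functions, so the agent can add services such as bookkeeping for checkpoints without any change to the application's logic.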

One advantage of this approach is that it can provide a uniform interface for checkpointing and other services across different operating systems and programming languages. The agent can also perform other tasks, such as error recovery and resource management, which can improve the overall system reliability and efficiency. However, this approach may require more complex setup and configuration, and the agent may introduce additional latency and overhead.

In addition to the approaches proposed in the surveyed papers, there are other possible approaches for implementing checkpointing. For example, a system could use virtual machine technology to provide checkpointing functionality. This approach involves running an application inside a virtual machine and periodically taking snapshots of the virtual machine state. If a failure occurs, the application can be restarted from the last snapshot. This approach provides a high degree of isolation and can be used with a variety of operating systems and programming languages.

Another possible approach is to use transactional memory to provide checkpointing functionality. This approach involves using hardware or software support for transactional memory to atomically update shared data structures. If a failure occurs, the transaction can be rolled back to the last checkpoint, which ensures that the system remains in a consistent state.

Overall, there are many different approaches for implementing checkpointing in computing systems, and each approach has its strengths and weaknesses. The choice of approach depends on various factors, such as the application requirements, system architecture, and performance goals.

Key Findings

This section presents the key findings of the survey on checkpointing techniques. The surveyed papers offer valuable insights into the design, implementation, and evaluation of checkpointing systems. The main findings can be summarized as follows:

Checkpointing is a crucial technique for fault tolerance in distributed systems. It allows the system to recover from failures by saving the state of the executing processes at regular intervals.
The implementation of checkpointing systems requires a trade-off between performance overhead and checkpointing frequency. While frequent checkpointing provides more fine-grained recovery, it incurs higher overhead. On the other hand, infrequent checkpointing reduces overhead but leads to more coarse-grained recovery.
The surveyed papers present various techniques for minimizing the overhead of checkpointing. For example, Plank et al. propose a lazy checkpointing approach that defers the copying of memory pages until they are modified, while Litzkow and Solomon suggest the use of incremental checkpointing to reduce the amount of data to be saved.
Checkpointing can be used in conjunction with other techniques, such as process migration and replication, to enhance fault tolerance in distributed systems. Thain and Livny propose a bypassing approach that uses interposition agents to redirect communication between processes and enable transparent process migration and checkpointing.
The surveyed papers highlight the importance of evaluating checkpointing systems in terms of performance, scalability, and reliability. Plank et al. perform experiments to measure the overhead of their checkpointing system, while Litzkow and Solomon evaluate the performance of their system in terms of checkpointing time and recovery time.
Finally, the surveyed papers demonstrate the applicability of checkpointing techniques in various domains, such as parallel computing, cluster computing, and distributed systems. Checkpointing is a general-purpose technique that can be used in any system that requires fault tolerance.
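The trade-off between checkpointing frequency and overhead noted in the findings above can be made concrete with a simple first-order model. The model is not taken from the surveyed papers; it is Young's well-known approximation, used here under the assumptions that each checkpoint costs a fixed time C, that failures arrive with a mean time between failures M, and that a failure loses on average half an interval of work. Under those assumptions the overhead fraction for an interval T is roughly C/T + T/(2M), which is minimized at T = sqrt(2CM). The function names are hypothetical.

```python
import math


def overhead_fraction(interval, checkpoint_cost, mtbf):
    """Approximate fraction of time lost to checkpointing plus rework.

    checkpoint_cost / interval : time spent writing checkpoints
    interval / (2 * mtbf)      : expected rework, half an interval per failure
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def young_interval(checkpoint_cost, mtbf):
    """Interval that minimizes overhead_fraction (Young's approximation)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Illustrative numbers only: 50 s per checkpoint, one failure per 10,000 s.
best = young_interval(50.0, 10_000.0)
too_frequent = overhead_fraction(best / 2, 50.0, 10_000.0)
at_optimum = overhead_fraction(best, 50.0, 10_000.0)
too_rare = overhead_fraction(best * 2, 50.0, 10_000.0)
```

Checkpointing twice as often or half as often as the computed interval increases the estimated overhead, which matches the qualitative statement that both very frequent and very infrequent checkpointing are costly.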

Overall, the surveyed papers provide a comprehensive overview of checkpointing techniques and their implementation in distributed systems. They offer valuable insights into the design and evaluation of checkpointing systems and demonstrate the importance of fault tolerance in modern computing environments.

Cross-cutting Themes

In this section, we will discuss some of the cross-cutting themes that emerge from the three surveyed papers on checkpointing. Cross-cutting themes refer to ideas or concepts that appear in multiple approaches or techniques. By analyzing the commonalities between these approaches, we can gain insights into the underlying principles of checkpointing and identify areas where future research could be focused.

One of the most common themes that emerge from these papers is the need for transparency in checkpointing. All three papers discuss the importance of making checkpointing as transparent as possible to the user and the system. The first paper, "Libckpt: Transparent Checkpointing under Unix" by Plank et al., proposes a technique that is transparent to both the user and the application. The second paper, "Supporting Checkpointing and Process Migration Outside the UNIX Kernel" by Litzkow and Solomon, also emphasizes the importance of transparency, and proposes a technique that is transparent to the user and minimally intrusive to the system. Finally, the third paper, "Multiple Bypass: Interposition Agents for Distributed Computing" by Thain and Livny, proposes a technique that uses interposition agents to provide transparent checkpointing in a distributed environment.

Another common theme is the need for efficient checkpointing. All three papers discuss the importance of minimizing the overhead of checkpointing in terms of both time and space. The first paper proposes a technique that uses incremental checkpointing to minimize the amount of data that needs to be saved, while the second paper keeps its mechanism entirely at user level, avoiding any modification to the kernel. The third paper proposes a technique that uses interposition agents to minimize the overhead of checkpointing in a distributed environment.
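Incremental checkpointing can be sketched as follows. This is an illustration of the idea only: the sketch detects changed pages by comparing hashes, which is an assumption made to keep the example portable, and is not how Libckpt tracks modifications. The class name, the page size, and the in-memory `store` are likewise hypothetical.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for the illustration


class IncrementalCheckpointer:
    """Saves only the pages that changed since the previous checkpoint."""

    def __init__(self):
        self._digests = {}   # page index -> digest at the last checkpoint
        self.store = {}      # page index -> most recently saved contents

    def checkpoint(self, memory):
        """Save changed pages and return the list of page indices written."""
        written = []
        for offset in range(0, len(memory), PAGE_SIZE):
            index = offset // PAGE_SIZE
            page = bytes(memory[offset:offset + PAGE_SIZE])
            digest = hashlib.sha256(page).digest()
            if self._digests.get(index) != digest:
                self.store[index] = page
                self._digests[index] = digest
                written.append(index)
        return written

    def restore(self):
        """Reassemble the memory image from the saved pages."""
        return b"".join(self.store[i] for i in sorted(self.store))


memory = bytearray(4 * PAGE_SIZE)      # four zero-filled pages
ckpt = IncrementalCheckpointer()
first = ckpt.checkpoint(memory)        # the first checkpoint saves every page
memory[2 * PAGE_SIZE + 10] = 0xFF      # modify one byte in page 2
second = ckpt.checkpoint(memory)       # only the modified page is saved again
```

The second checkpoint writes a single page instead of four, which is the source of the savings; the sketch does not handle a memory image that shrinks between checkpoints.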

A third common theme is the need for flexibility in checkpointing. All three papers discuss the importance of providing flexibility in terms of what data is saved and how often checkpoints are taken. The first paper proposes a technique that allows the user to specify which parts of the application's state are to be checkpointed, while the second paper proposes a technique that allows the user to specify how often checkpoints are taken. The third paper proposes a technique that allows the user to specify which parts of the application's state are to be checkpointed and where the checkpoints are to be stored.
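Letting the user choose which parts of the state are checkpointed can be sketched as below. The sketch is a simplified illustration and does not reproduce the interface of any surveyed system: `selective_checkpoint`, `restore`, the `rebuild` table, and the example state are all hypothetical. The assumption is that some state, such as a scratch buffer, can be recomputed after a restart and therefore need not be saved.

```python
import pickle


def selective_checkpoint(state, exclude=()):
    """Serialize the state, leaving out the entries named in `exclude`."""
    kept = {name: value for name, value in state.items() if name not in exclude}
    return pickle.dumps(kept)


def restore(blob, rebuild):
    """Load a checkpoint and recompute any excluded entries.

    `rebuild` maps an entry name to a function that recreates that entry
    from the restored state.
    """
    state = pickle.loads(blob)
    for name, recreate in rebuild.items():
        if name not in state:
            state[name] = recreate(state)
    return state


state = {
    "iteration": 10,
    "params": [0.5, 1.5],
    "scratch": [0.0] * 10_000,   # large buffer that can be recomputed
}

full = selective_checkpoint(state)
small = selective_checkpoint(state, exclude={"scratch"})
restored = restore(small, {"scratch": lambda s: [0.0] * 10_000})
```

Excluding the scratch buffer makes the checkpoint much smaller while the restored state is still complete, at the price of asking the user to know which data is safe to leave out.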

A fourth common theme is the need for fault tolerance. All three papers discuss the importance of checkpointing as a means of providing fault tolerance in distributed systems. The first paper proposes a technique that can be used to recover from crashes and hardware failures, while the second paper proposes a technique that can be used to migrate processes to other machines in the event of a failure. The third paper proposes a technique that can be used to provide fault tolerance in a distributed environment by using interposition agents to capture and recover from failures.

A final cross-cutting theme that emerges from these papers is the need for scalability. All three papers discuss the importance of making checkpointing scalable to large-scale distributed systems. The first paper proposes a technique that can be used to checkpoint large-scale parallel applications, while the second paper proposes a technique that can be used to checkpoint large-scale distributed systems. The third paper proposes a technique that can be used to provide checkpointing in a distributed environment by using interposition agents that can be distributed across multiple machines.

In conclusion, by identifying these cross-cutting themes, we have gained insights into the underlying principles of checkpointing and identified areas where future research could be focused. These themes suggest that the key to effective checkpointing is to provide transparency, efficiency, flexibility, fault tolerance, and scalability. By focusing on these key principles, researchers can continue to develop new and innovative techniques for checkpointing that meet the evolving needs of distributed systems.

Synergistic Approaches

In this section, we will discuss the possibilities of combining the approaches proposed in the surveyed papers to create a more robust and efficient checkpointing system.

One potential way to combine the ideas from the three papers is to use the interposition agent approach proposed by Thain and Livny in conjunction with the user-level checkpointing technique described in the Libckpt paper by Plank et al. The interposition agent could be used to intercept system calls related to file I/O and network communication, which are not captured by the user-level checkpointing technique. By doing so, the interposition agent can ensure that all necessary state information is captured during the checkpointing process. This combined approach could be particularly useful in distributed computing environments, where multiple processes may be communicating with each other through various network protocols.

Another potential way to combine the ideas from the papers is to use the process migration technique proposed by Litzkow and Solomon in conjunction with the user-level checkpointing technique described in the Libckpt paper. By periodically migrating processes to different nodes in a distributed system, the system can avoid performance bottlenecks and reduce the likelihood of failures due to hardware or software faults. The user-level checkpointing technique can be used to capture the state of the migrated process and transfer it to the new node. This combined approach could be particularly useful in large-scale distributed systems where fault tolerance and load balancing are critical.
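The relationship between checkpointing and migration described above can be sketched as checkpoint, transfer, and restart. The example is a toy model under stated assumptions: `Node`, `migrate`, and the process table are hypothetical, a process is reduced to a dictionary of state, and the network transfer is represented by passing a byte string between two in-memory objects.

```python
import pickle


class Node:
    """Toy model of a machine that hosts processes (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.processes = {}   # pid -> process state dictionary


def migrate(pid, source, target):
    """Move a process by checkpointing it on `source` and restarting it on `target`.

    Returns the size in bytes of the checkpoint image that was transferred.
    """
    image = pickle.dumps(source.processes[pid])   # checkpoint on the source node
    target.processes[pid] = pickle.loads(image)   # restart on the target node
    del source.processes[pid]                     # remove only after the restart
    return len(image)


node_a = Node("a")
node_b = Node("b")
node_a.processes[42] = {"pc": 7, "heap": [1, 2, 3]}

transferred = migrate(42, node_a, node_b)
```

Removing the process from the source only after the restart has succeeded means that a failure during the transfer leaves the original copy intact.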

Finally, the three papers could be combined to create a complete fault-tolerant system that includes both checkpointing and process migration capabilities. The user-level checkpointing technique described in the Libckpt paper could be used to periodically capture the state of critical processes, while the interposition agent approach proposed by Thain and Livny could be used to ensure that all necessary state information is captured. In the event of a failure, the process migration technique proposed by Litzkow and Solomon could be used to quickly move processes to other nodes in the system. This combined approach could be particularly useful in real-time systems, where failures must be quickly detected and resolved to avoid serious consequences.

In conclusion, by combining the approaches proposed in the surveyed papers, it is possible to create a more robust and efficient checkpointing system. These synergistic approaches could be particularly useful in distributed and real-time systems, where fault tolerance and performance are critical. However, further research is needed to fully explore the potential of these combined approaches and to develop practical implementations.

Strengths and Weaknesses

Strengths and weaknesses of the three surveyed checkpointing approaches are discussed below:

Libckpt: Transparent Checkpointing under Unix

Strengths

The checkpointing process is transparent to the application, which does not need to be modified.
Checkpointing is done incrementally, meaning that only modified pages are written to disk, reducing overhead.
Allows for selective checkpointing of parts of an application or a process.
Supports recovery from checkpoint files saved on remote machines, enabling migration of processes.
Has been used successfully in various production systems.

Weaknesses

The checkpointing process is relatively slow due to the overhead of copying memory pages to disk.
Checkpoint files can become quite large, especially for long-running applications or processes with large memory footprints.
Only works on Unix-based systems, limiting its applicability to other operating systems.
Checkpointing requires disk space, which can be a problem in disk-constrained environments.
Can result in decreased performance due to memory fragmentation caused by the frequent copying of memory pages.

Supporting Checkpointing and Process Migration Outside the Unix Kernel

Strengths

Can checkpoint a process on one UNIX machine and restart it on another, providing process migration in addition to fault tolerance.
Supports incremental checkpointing, reducing overhead and enabling faster checkpointing.
Supports multiple checkpointing algorithms, enabling the user to choose an algorithm based on the application's needs.
Uses a global file system to store checkpoint files, enabling checkpoints to be stored on remote machines.

Weaknesses

The checkpointing process is still relatively slow due to the overhead of copying memory pages to disk.
Checkpoint files can become quite large, especially for long-running applications or processes with large memory footprints.
The global file system can be a bottleneck for checkpointing and recovery.
Requires applications to be re-linked with the checkpointing library, limiting its applicability to programs that cannot be re-linked.
Does not support selective checkpointing of parts of an application or a process.

Multiple Bypass: Interposition Agents for Distributed Computing

Strengths

Allows for the selective checkpointing of parts of an application or a process.
Supports recovery from checkpoint files saved on remote machines, enabling migration of processes.
Can be used with any language that can use dynamic libraries, making it more versatile than the other two approaches.
Has been used successfully in various production systems.

Weaknesses

Checkpointing is still relatively slow due to the overhead of copying memory pages to disk.
Checkpoint files can become quite large, especially for long-running applications or processes with large memory footprints.
Only works in a distributed environment, limiting its applicability to other systems.
Requires the use of interposition agents, which can be difficult to implement and maintain.
Can result in decreased performance due to the overhead of the interposition agents.

Overall, the surveyed checkpointing approaches have several strengths and weaknesses. While they are all capable of achieving their primary goal of checkpointing and recovering processes, they each have their own unique limitations. The strengths and weaknesses of each approach should be carefully considered when deciding which approach to use in a given situation.

Impact of Early Research Papers on Checkpointing and Process Migration

The early research on checkpointing and process migration, as exemplified by the surveyed papers of Libckpt, Supporting Checkpointing and Process Migration Outside the UNIX Kernel, and Multiple Bypass, has had a significant impact on modern research in this domain. In this section, we will discuss how these early papers have helped shape the direction of modern research, and provide examples of how their contributions are still relevant today.

Libckpt: Transparent Checkpointing under Unix

The Libckpt paper introduced a transparent checkpointing mechanism that allowed applications to be checkpointed without requiring any modifications to the source code. This approach greatly simplified the task of checkpointing and made it more accessible to a wider audience. The techniques introduced in this paper have been built upon and extended in many subsequent works. For example, the CRIU project (Checkpoint/Restore In Userspace) is an open-source tool that provides checkpointing and process migration capabilities for Linux-based systems. Like Libckpt, CRIU aims to checkpoint applications without changes to their source code and performs most of its work in user space.

Supporting Checkpointing and Process Migration Outside the Unix Kernel

The paper by Litzkow and Solomon introduced a technique for supporting checkpointing and process migration outside of the UNIX kernel. This approach made checkpointing and migration available without a modified kernel, and therefore on a wider range of installations. The paper has influenced subsequent checkpoint/restart work. Later projects such as BLCR (Berkeley Lab Checkpoint/Restart) provide checkpointing capabilities for Linux-based systems, although BLCR relies on a kernel module rather than a purely user-level implementation.

Multiple Bypass: Interposition Agents for Distributed Computing

The Multiple Bypass paper introduced the concept of interposition agents for distributed computing. Interposition agents are programs that intercept system calls made by a target application and can modify or enhance their behavior. A related interposition technique appears in later work, including DMTCP (Distributed MultiThreaded CheckPointing), which provides checkpointing and process migration capabilities for distributed computations. DMTCP interposes wrapper functions around selected library and system calls so that the state needed for checkpoint and restart can be tracked.

In addition to these specific examples, the early research on checkpointing and process migration has had a broader impact on the field of distributed systems and high-performance computing. Many of the techniques introduced in these papers have been applied in other domains, such as fault tolerance and system management.

Overall, the early research on checkpointing and process migration has had a significant impact on modern research in this domain. The contributions of these early papers have been built upon and extended in many subsequent works, and their ideas and techniques continue to be relevant and influential today.

Conclusion

In conclusion, the domain of checkpointing and process migration has seen significant progress over the past few decades, but it still has scope for further research and development. While some of the surveyed approaches, such as Libckpt and the multiple bypass interposition agents, have been used in production environments, the domain as a whole is still evolving.

One of the main challenges in this domain is achieving low overhead while still providing reliable and efficient checkpointing and process migration. Many of the surveyed approaches have attempted to address this challenge in different ways, but there is still room for improvement.

In addition, the rise of cloud computing and distributed systems has created new opportunities for research in checkpointing and process migration. As more applications move to the cloud and become more distributed, the need for efficient and reliable checkpointing and process migration becomes even more critical.

One potential area of research in the next five years could be the development of new approaches that leverage the benefits of cloud computing and distributed systems to improve checkpointing and process migration. For example, new approaches could use the elasticity and scalability of cloud computing to create more efficient and flexible checkpointing and process migration solutions.

Another potential area of research could be the development of new tools and techniques for monitoring and managing checkpointing and process migration in large-scale distributed systems. As these systems become more complex, it becomes increasingly important to have robust and flexible tools for managing and monitoring checkpointing and process migration.

Overall, while the domain of checkpointing and process migration has seen significant progress over the past few decades, there is still a need for further research and development. With the rise of cloud computing and distributed systems, there are new opportunities for innovation and improvement in this area, and we expect to see continued progress in the coming years.

A Survey on Early Checkpointing Techniques

Tarun Malviya

April 19, 2023

Abstract

Introduction

Libckpt: Transparent Checkpointing under Unix

Supporting Checkpointing and Process Migration Outside the Unix Kernel

Multiple Bypass: Interposition Agents for Distributed Computing

Background

Early History of Checkpointing

Checkpointing in UNIX

Checkpointing in Distributed Systems

Taxonomy of Checkpointing Approaches

System-Level Checkpointing

Application-Level Checkpointing

Process-Level Checkpointing

Hybrid Checkpointing

Approach

Key Findings

Cross-cutting Themes

Synergistic Approaches

Strengths and Weaknesses

Libckpt: Transparent Checkpointing under Unix

Strengths

Weaknesses

Supporting Checkpointing and Process Migration Outside the Unix Kernel

Strengths

Weaknesses

Multiple Bypass: Interposition Agents for Distributed Computing

Strengths

Weaknesses

Impact of Early Research Papers on Checkpointing and Process Migration

Libckpt: Transparent Checkpointing under Unix

Supporting Checkpointing and Process Migration Outside the Unix Kernel

Multiple Bypass: Interposition Agents for Distributed Computing

Conclusion