1 Introduction

To get the maximum performance out of today's machines, one must exploit parallel and distributed programming as much as possible. Many years have passed since the introduction of multi-core hardware, so code that uses only one core seems terribly inefficient, especially for tasks that demand the best performance. However, writing code that takes advantage of concurrency is hard and can lead to many subtle bugs that cannot be found using traditional debugging tools.

Our standard debugging tools cannot tackle concurrency-related bugs, so it is no surprise that many software bugs today stem from concurrency-related mistakes. Model checkers, which can diagnose concurrency-related issues, have been popular in formal verification since the advent of multi-core hardware, and developing tools that help debug such issues remains one of the top priorities in the field.

To ensure efficient code and debugging, it is crucial to diagnose bugs quickly. However, detecting a bug in multi-threaded code is a complex process that can take significant time if the model checker must examine all possible execution interleavings. To address this challenge, researchers have developed algorithms that reduce the search space and provide the necessary correctness checks more efficiently. As a result, the fastest model checkers are themselves parallel software: they identify concurrency-related issues quickly and effectively while avoiding brute-force exploration.

In this paper, we examine three different approaches to turning a single-threaded model checker into a concurrent one, each introduced in a separate paper.

The outline of the rest of the paper is as follows. In section 2, we cover the background and motivation of all approaches in the paper. In section 3, we explain the core of each approach along with the reasoning behind it. In section 4, we provide an overview of the weaknesses of each approach studied in this paper. In section 5, we provide suggestions for the future of each of the approaches, along with propositions for cooperation between them.

2 Background And Motivation

In this section, we review the background and motivation for each of the approaches that are going to be explored in the next section.

2.1 Bounded Model Checking

Bounded model checking (BMC) is a technique that reduces reachability problems to propositional satisfiability (SAT) or related formalisms such as SMT. As demonstrated by respected works, combining lazy sequentialization with SAT-based BMC is a powerful way to analyze complex multi-threaded programs. This variant of BMC has been shown to be more effective than other mature techniques and can handle programs that were previously difficult to model check.

Although effective, BMC has one critical flaw: the lack of parallelism. Without it, the BMC approach cannot take advantage of increasingly parallel multi-core hardware. And as we mentioned in the introduction, as programmers we want our debugging tools to be as efficient as our software, making this missing feature BMC's true weakness. Sequentialization can be used to extend BMC to concurrent programs by converting the concurrent program into a sequential one that retains all possible execution traces within a given bound.

To address the lack of parallelism in BMC, Inverso and Trubiani introduce parallelization to bounded model checking of multi-threaded programs. This is achieved by partitioning the set of execution traces of the concurrent program into symbolic subsets, each of which can then be analyzed by a separate instance of the same decision procedure running in parallel. Their approach enables efficient exploration of the state space and improves the overall performance of the BMC analysis.

Parallelization of BMC is achieved by using the non-parallel BMC analysis to generate several simple variations of the propositional formula. If any of the obtained formulas is satisfiable, a bug is present in the program; if all of them are unsatisfiable, the absence of bugs within the given bounds is proven. A separate solver can therefore be run for each formula, and the overall procedure ends as soon as one of the solvers discovers a satisfying assignment or when all of them have completed. Since the solvers only communicate after termination and not while running, the formulas are solved independently, making the model checker run in parallel.
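This portfolio-style control flow can be made concrete with a minimal Python sketch. This is not the authors' implementation: a toy brute-force check stands in for a real SAT solver, and part1/part2 are hypothetical sub-formulas obtained by splitting on one scheduler variable. The point is the structure: one solver per partition, terminating as soon as any partition is satisfiable.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product

def brute_force_sat(formula, n_vars):
    """Toy stand-in for a SAT solver: try every assignment of n_vars booleans."""
    for assignment in product([False, True], repeat=n_vars):
        if formula(assignment):
            return assignment  # satisfiable: a bug witness within the bound
    return None  # unsatisfiable: this partition contains no bug

def check_partitions_in_parallel(sub_formulas, n_vars):
    """One solver per sub-formula; stop as soon as any of them is satisfiable."""
    with ThreadPoolExecutor(max_workers=len(sub_formulas)) as pool:
        futures = [pool.submit(brute_force_sat, f, n_vars) for f in sub_formulas]
        for fut in as_completed(futures):
            witness = fut.result()
            if witness is not None:
                return witness  # bug found in some partition
    return None  # every partition is UNSAT: no bug within the bounds

# Hypothetical split of a formula on variable x0:
# partition 1 assumes x0 = False, partition 2 assumes x0 = True.
base = lambda a: a[1] and not a[2]
part1 = lambda a: (not a[0]) and base(a)
part2 = lambda a: a[0] and base(a)

witness = check_partitions_in_parallel([part1, part2], n_vars=3)
```

Because the solvers share nothing while running, whichever partition finishes first with a satisfying assignment ends the whole analysis, mirroring the early-termination behavior described above.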

2.2 Shared Hash Tables in Parallel Model Checking (SHT)

In the early stages of parallel model checking, most tools followed distributed-memory models. Since the introduction of multi-core hardware, however, researchers have focused more on the shared-memory paradigm. Several mature algorithms exist that transform a distributed-memory model checker into a shared-memory one. SHT aims to extend these algorithms, with a specific focus on shared-memory extensions that use a single shared storage.

The "One Way Catch Them Young" algorithm serves as the base for these extensions. To better understand the approach, we explain at a high level how the algorithm works. The algorithm removes vertices that cannot lie on an accepting cycle using two rules: the first removes a vertex from the graph if it has no successors in the graph, while the second removes a vertex if it cannot reach an accepting vertex. Vertex removal continues until no vertices satisfy either condition. If any vertices remain, the original graph had an accepting cycle; if no vertices remain, it did not. In the following section, we explain how this algorithm is incorporated into the overall approach.
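As an illustration, the two elimination rules can be sketched on an explicit graph in a few lines of Python. This is our own simplification (the actual algorithm operates on an implicitly defined product graph and is far more involved), and the function names are ours.

```python
def backward_closure(graph, vertices, targets):
    """All surviving vertices that can reach some vertex in `targets`."""
    rev = {v: set() for v in vertices}
    for v in vertices:
        for s in graph[v]:
            if s in vertices:
                rev[s].add(v)
    seen, frontier = set(targets), list(targets)
    while frontier:
        u = frontier.pop()
        for p in rev[u]:
            if p not in seen:
                seen.add(p)
                frontier.append(p)
    return seen

def has_accepting_cycle(graph, accepting):
    """Repeatedly remove vertices that cannot lie on an accepting cycle:
    rule 1 drops vertices with no remaining successors, rule 2 drops
    vertices that cannot reach a surviving accepting vertex.
    An accepting cycle existed iff some vertices survive."""
    vertices = set(graph)
    changed = True
    while changed:
        changed = False
        reachers = backward_closure(graph, vertices, accepting & vertices)
        for v in list(vertices):
            succs = [s for s in graph[v] if s in vertices]
            if not succs or v not in reachers:
                vertices.remove(v)
                changed = True
    return bool(vertices)
```

For instance, in the graph a→b→a with a accepting, every vertex survives both rules, so an accepting cycle is reported; in a→b→c with c a non-accepting self-loop and b accepting, the graph eliminates down to nothing.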

The traditional approach in distributed algorithms is to partition the state space statically, using a partition function that assigns each state to exactly one of the cores. The same approach is used in shared-memory algorithms, where each thread has its own private hash table and private memory. However, the authors point out that in shared memory, all threads share a single block of memory with uniform accessibility from all CPU cores. Therefore, there are better ways to partition the state space than the standard established in distributed algorithms and later adopted by their shared-memory successors. The first is that, if several hash tables are used, threads can access tables they do not own. The second is to have a single shared hash table used by all threads. A high-level overview of how this is achieved is given in the next section, where we discuss the core of the approach.

2.3 Parallel Model Checking Using Abstraction

In order to generate a list of all potential states for an asynchronous system, many popular model checkers adopt an approach where the configuration space is represented as an implicitly defined graph. Each node within the graph corresponds to a unique system state, and the edges between nodes represent the potential transitions that each component can make. By following a path through this graph, a possible order of actions that the system may perform can be identified. An exhaustive search algorithm can then be used to explore all possible system states to detect any that violate specific properties. However, as is typical with implicit graphs, the sheer number of nodes can be significant, resulting in a search process that requires an excessive amount of time or memory.

The model checking community, as well as the automated planning and heuristic search communities, have made significant strides in developing new search frameworks that leverage the capabilities of modern multi-core processors. By doing so, they have managed to improve their algorithm's performance. They have successfully decreased the memory requirements of graph searches by shifting a significant portion of the load to external storage devices such as hard disks. However, while heuristic search and planning communities have achieved considerable success with their techniques in other search problems, their effectiveness has yet to be tested for model checking. Given their track record, there is an interest in comparing these approaches to the conventional parallelization methods used in model checking.

On that note, the authors have incorporated two techniques (1) hash-distributed breadth-first search (HD-BFS) and (2) parallel structured duplicate detection (PSDD) into the Spin model checker for parallelizing breadth-first search. The first technique, HD-BFS, uses a common approach for parallel model checking that involves distributing states among different searching threads using a hash function. The second technique, PSDD, originates from the heuristic search and planning communities.

They have found that PSDD offers significant advantages over both HD-BFS and Spin's built-in multi-core depth-first search for model checking. Firstly, HD-BFS employs delayed duplicate detection and therefore has to temporarily store duplicate search nodes while they are being communicated between threads, whereas PSDD detects duplicate states immediately and eliminates the need for extra memory to store them. Secondly, PSDD preserves Spin's partial-order reduction capability, a technique used by model checkers to reduce the size of the search space, while HD-BFS and Spin's built-in multi-core depth-first search have to be more conservative when applying it. Due to its ability to explore a smaller search space, PSDD can therefore utilize multiple threads more effectively than either alternative.

3 Approaches

In this section, we provide the core understanding of the approaches along with the reasoning behind them.

3.1 Parallelizing Bounded Model Checking (PBMC)

As we saw in the previous section, the key to parallelizing BMC is decomposing the set of execution traces of a concurrent program into symbolic subsets. It is therefore crucial to explore this concept to get a better understanding of the approach.

To start, let's consider a simple two-thread program and suppose that we are interested in analyzing all its possible executions up to one execution context. To split the set of all traces into two subsets, we put the traces whose schedule starts with the first thread into subset one, and all others into subset two. Since each subset is explored separately, the analysis can be performed in parallel on two computational units.

To generalize this to any number of threads and execution contexts, an assumption is made: all threads are created before any of them can be scheduled in the initial configuration of the program. Let T denote the set of all threads, and let Tʹ and Tʺ partition T into the threads with odd and even identifiers, respectively. The set of all execution traces can then be partitioned according to whether the first scheduled thread belongs to Tʹ or Tʺ. Further splitting can be done by restricting a thread to one of the partitions rather than to the whole set of threads. Having seen how the set of all traces can be divided into smaller subsets for any number of threads and execution contexts, we next explore how parallelization is achieved.
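The odd/even split can be illustrated with an explicit enumeration. Note that this is only a sketch under a strong simplification: the actual approach encodes the partitions symbolically, as constraints over the scheduler variables, rather than listing schedules.

```python
from itertools import product

def partition_schedules(thread_ids, contexts):
    """Enumerate every schedule of `contexts` execution contexts over the
    given thread identifiers, and split the set by whether the first
    scheduled thread has an odd identifier (T') or an even one (T'')."""
    t_odd, t_even = [], []
    for sched in product(thread_ids, repeat=contexts):
        (t_odd if sched[0] % 2 == 1 else t_even).append(sched)
    return t_odd, t_even

# Four threads, two execution contexts: the 16 schedules split 8/8.
t_odd, t_even = partition_schedules([1, 2, 3, 4], contexts=2)
```

Each of the two resulting subsets can then be handed to an independent solver instance, and the same splitting rule can be re-applied inside a subset to produce finer partitions.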

To see how parallelization is achieved, we focus for a moment on the high-level flow of non-parallel BMC. When a program we want to model check is run under non-parallel BMC, it is transformed into a sequential program equivalent to the original up to the specified bounds. Bounded model checking is then performed on this sequential program, encoding it into a propositional formula that is checked for satisfiability. If the formula turns out to be satisfiable within the initially defined bounds, the error trace is mapped back to the original program.

It turns out that the flow of parallel BMC looks remarkably similar. The main difference lies in the step where the sequential program is encoded into a propositional formula: in the parallel case, multiple propositional sub-formulae are generated instead of a single formula. Logically, each sub-formula corresponds to a different symbolic partition of the execution traces of the original program. Each sub-formula is then checked for satisfiability independently of the others, and thus parallelism is achieved.

As we've seen, this approach to bounded model checking of multi-threaded programs is based on partitioning and parallel exploration of the possible traces. The technique represents a significant advancement in the field, as it is the first to be specifically designed for concurrent programs, and it has the potential to be extended to distributed systems. The experimental results have demonstrated promising speedups and scalability, even on complex programs, for both reachable and unreachable error conditions. The structure-aware method relies on a straightforward but efficient partitioning scheme and is more practical than state-of-the-art general-purpose parallel solvers.

3.2 Shared Hash Tables Via Dynamic Partitioning (SHD)

It is well known that when multiple threads read from and write to a table, locking is needed to ensure data correctness. In this approach the tables are accessed frequently, making it impractical to lock the entire table on each access. To overcome this performance drawback, a more efficient method is needed that locks only the region of the table that will be accessed. Another concern is then how to grow the table concurrently while some regions might be locked. Two solutions are proposed: in the first, a fixed number of locks is used and the locking unit grows linearly with the hash table size; in the second, both the region size and the number of locks are increased at a square-root rate of growth.
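A minimal sketch of the first solution, region locking with a fixed number of locks, might look as follows. This is our own illustration, not the paper's implementation, and it omits concurrent resizing entirely.

```python
import threading

class StripedHashTable:
    """Hash table with region (stripe) locking: a fixed number of locks,
    each guarding a contiguous slice of buckets, so concurrent accesses
    to different regions never contend."""

    def __init__(self, n_buckets=1024, n_locks=16):
        self.buckets = [[] for _ in range(n_buckets)]
        self.locks = [threading.Lock() for _ in range(n_locks)]
        self.n_locks = n_locks

    def _lock_for(self, idx):
        # With a fixed number of locks, the locking unit (buckets per lock)
        # grows linearly with the table size.
        return self.locks[idx * self.n_locks // len(self.buckets)]

    def insert_if_absent(self, state):
        """Return True if `state` was new (inserted), False if seen before."""
        idx = hash(state) % len(self.buckets)
        with self._lock_for(idx):
            if state in self.buckets[idx]:
                return False
            self.buckets[idx].append(state)
            return True

table = StripedHashTable()
```

Only the stripe covering the accessed bucket is locked, so the whole-table lock that the text identifies as impractical is avoided.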

We recall from the previous section that the original shared-memory implementation used static partitioning adopted from distributed-memory implementations. In this scheme, each state is assigned to a thread, meaning that every time a state is generated it is assigned to the same thread. Each thread can therefore keep a private hash table storing all states it owns, and can access and modify that table without worrying about race conditions. Because each state is owned exclusively by one thread, this approach also uses the processor cache efficiently, since each private table is much smaller than a single shared hash table would be.

Nevertheless, the static partitioning approach has a big performance drawback: the ownership of states must be communicated between threads. The impact of this is significant, as cross-transitions grow very fast as more threads are added to the pool. As a solution, the authors propose dynamic partitioning, which uses a single shared table instead of a table per thread.

Dynamic partitioning allows threads to interact with the table and its states more efficiently, as no rules are necessary to ensure that each state is assigned to exactly one thread. The thread looking at a transition can decide whether to process it itself or hand it off to another thread, thus avoiding the performance drawbacks of cross-transitions. Three approaches to implementing dynamic partitioning are presented, discussed in more detail in 3.2.1, 3.2.2, and 3.2.3.

3.2.1 Modular Partitioning

Modular partitioning is a technique used in parallel state space exploration algorithms that divides the search space among multiple threads by making every n-th transition a cross-transition. Cross-transitions, in which a transition is sent from one thread to another, are time-consuming operations, and their abundance is the main drawback of static partitioning. By allowing a cross-transition to occur only on every n-th transition, modular partitioning significantly reduces their number, which results in faster execution times and improved performance, especially for larger search spaces.

However, modular partitioning also has some drawbacks. Since states are no longer bound to specific threads, all threads work on a single, large table, which reduces cache efficiency and can slow down performance compared to the small private tables of static partitioning. Additionally, modular partitioning can result in uneven load balancing between threads, as some parts of the search space may contain more transitions than others; some threads may then finish their work earlier, causing delays and decreased performance. Despite these drawbacks, modular partitioning remains a viable approach to parallel state space exploration, especially for medium to large search spaces.
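The effect of the every-n-th rule on cross-transition counts can be sketched with two toy counting functions. These are our own illustration; the static figure is simply the expected value under a uniform partition function, not a measurement from the paper.

```python
def cross_transitions_modular(n_transitions, n):
    """Under modular partitioning, only every n-th transition crosses threads."""
    return sum(1 for i in range(1, n_transitions + 1) if i % n == 0)

def cross_transitions_static(n_transitions, n_threads):
    """Under static partitioning, a transition crosses whenever the successor
    hashes to another thread: on average (n_threads - 1) / n_threads of them."""
    return n_transitions * (n_threads - 1) // n_threads

# For 1000 transitions: modular with n = 10 yields 100 cross-transitions,
# while uniform static hashing over 8 threads yields roughly 875.
modular = cross_transitions_modular(1000, 10)
static = cross_transitions_static(1000, 8)
```

The order-of-magnitude gap between the two counts is what makes modular partitioning attractive despite its cache and load-balancing drawbacks.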

3.2.2 Handoff Partitioning

Handoff partitioning is a technique used in parallel state space exploration algorithms that is suited to DFS-based analysis. It divides the search space among multiple threads by sending transitions to the next thread when a certain "handoff depth" is reached; the handoff depth determines the number of levels in the search tree a thread explores before handing the remaining search space off to another thread. The advantage of handoff partitioning is that it efficiently reduces the number of cross-transitions between threads, which are time-consuming operations: cross-transitions occur only at every N-th level in the search tree (N being the handoff depth), resulting in fewer of them and improved performance.

Compared to modular partitioning, handoff partitioning raises the chance of a state being visited by a particular thread several times: the remaining search space is sent to another thread only when the handoff depth is reached, allowing a thread to explore deeper levels of the search tree. This locality results in a smaller working set per thread that hurts cache efficiency less, which can improve performance.

However, handoff partitioning can still suffer from cache inefficiencies compared to static partitioning, and it still requires some form of synchronization between threads, which can result in increased overhead and decreased performance. Despite these drawbacks, handoff partitioning remains a viable approach to parallel state space exploration, especially for smaller search spaces.
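The handoff rule itself can be sketched sequentially. In this simplification of ours, the handed-off nodes are merely collected in a list rather than sent to real worker threads, and the depth counter is not reset for the receiving worker.

```python
def handoff_dfs(graph, root, handoff_depth):
    """DFS that expands nodes locally until the handoff depth is reached;
    successors generated at that depth are handed off to other workers
    (collected in `handed_off` here) instead of being expanded locally."""
    visited, handed_off = set(), []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        for succ in graph.get(node, []):
            if depth + 1 == handoff_depth:
                handed_off.append(succ)  # the only cross-transitions
            else:
                stack.append((succ, depth + 1))
    return visited, handed_off

# On the chain a -> b -> c -> d with handoff depth 2, the local worker
# explores a and b, then hands c (and its subtree) off.
local, crossed = handoff_dfs({'a': ['b'], 'b': ['c'], 'c': ['d']}, 'a', 2)
```

Every transition below the handoff depth stays local, which is exactly where the reduction in cross-transitions comes from.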

3.2.3 Shared Queue Partitioning

Shared queue partitioning is a technique used in parallel breadth-first search algorithms. Unlike approaches that use a partition function to divide the search space among threads, it places all states in a single shared breadth-first search (BFS) queue. The advantage of this approach is reasonable cache efficiency: because all threads access the same shared queue, cache misses are reduced and the likelihood of finding already-cached data increases, which can result in faster execution times and improved performance.

However, shared queue partitioning has more overhead when it comes to locking and thread contention than both modular and handoff partitioning approaches. This is because all threads compete for access to the same shared queue, which can result in contention for locks and decreased performance due to waiting times. Locking is used to prevent two or more threads from accessing the same resource simultaneously.

In the case of shared queue partitioning, locking is required to prevent multiple threads from simultaneously accessing the same queue. Despite this overhead in locking and thread contention, shared queue partitioning remains a viable approach to parallel breadth-first search: its cache efficiency can result in improved performance, especially for small to medium-sized search spaces. For larger search spaces, however, the other dynamic partitioning approaches may be more suitable, as they offer better scalability and less overhead in locking and thread contention.
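The scheme, and the lock contention it brings, can be sketched as follows. This is a deliberately naive version of ours with one lock guarding both the queue and the visited set; real implementations would use finer-grained synchronization and a proper termination check.

```python
import threading
from collections import deque

def shared_queue_bfs(graph, root, n_threads=4):
    """Parallel BFS in which all workers share a single queue and a single
    visited set, both guarded by one lock (the contention point)."""
    queue, visited = deque([root]), {root}
    lock = threading.Lock()

    def worker():
        while True:
            with lock:
                if not queue:
                    # Naive exit: safe here only because any worker that
                    # appends states is still alive and will drain them.
                    return
                state = queue.popleft()
            successors = graph.get(state, [])  # expansion happens unlocked
            with lock:
                for s in successors:
                    if s not in visited:
                        visited.add(s)
                        queue.append(s)

    workers = [threading.Thread(target=worker) for _ in range(n_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return visited

reached = shared_queue_bfs({'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}, 'a')
```

Every pop and every insert goes through the same lock, which is precisely the locking and contention overhead the text attributes to this scheme.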

3.3 Hash-Distributed Breadth-First Search (HD-BFS)

In the previous section, we saw that the authors choose the breadth-first search algorithm for HD-BFS because of the counterexamples it produces. If a model violates a safety property, using breadth-first search for model checking will find the shortest counterexample. This is crucial because safety properties are significant in debugging an asynchronous system, and understanding the counterexample provided by the model checker is necessary to determine the reason behind the system's undesired behavior. A depth-first search, on the other hand, does not consider the number of steps required to reach a node in the state space, and may therefore produce a longer counterexample than necessary, which can be difficult to interpret. For instance, it was observed that a deadlock in one model is detected by a depth-first search with a 9,992-step trace, while a breadth-first search finds the smallest possible trace of 42 steps. However, many respected papers discuss the difficulties of parallelizing the breadth-first search algorithm.

As a solution, the authors propose hash-distributed search as a method of parallelizing the breadth-first search algorithm. We explore here the high-level idea of how this is achieved. In HD-BFS, each thread uses two queues to represent the search frontier. One queue, the current queue, holds nodes assigned to the thread at the current search depth, while the other, the next queue, holds nodes assigned to the thread at the next search depth. Each thread also has a hash table containing all previously expanded nodes to prevent the search from expanding duplicates.

During searching, each thread expands nodes from its current queue one at a time. If a successor node is assigned to a different thread by the hash function, it must be sent there through inter-thread communication. However, if the successor node is assigned to the same thread, it is immediately checked for membership in the local hash table. If it is not a duplicate, it is added to the next queue for the local thread without communication. Hash-distributed breadth-first search (HD-BFS) uses a communication scheme to send nodes between threads asynchronously using shared-memory queues.

In HD-BFS, when a receiving thread receives a new node from a different thread, it checks whether the node is a duplicate by testing it for membership in its local hash table. If the node is not a duplicate, it is placed on the thread's next queue. If all current queues are empty and no nodes are in transit between threads, the current depth layer has been fully expanded. All threads synchronously swap their next queue with their current queue and begin searching nodes at the next depth. If all current queues are empty after swapping, the search space has been exhausted, and the algorithm terminates.
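The layer-by-layer mechanics described above can be sketched with a sequential simulation. This is our own illustration: the threads are simulated in a round-robin loop, "sending" a node is modeled as appending to the destination's next queue, and the duplicate check against the destination's table stands in for the check the receiving thread performs on receipt (so the delay of real delayed duplicate detection is not modeled).

```python
def hd_bfs(graph, root, n_threads):
    """Sequential simulation of hash-distributed BFS: every simulated thread
    owns a current queue, a next queue, and a private hash table; a successor
    that hashes to another thread is a cross-transition."""
    owner = lambda state: hash(state) % n_threads  # the distributing hash
    current = [[] for _ in range(n_threads)]
    nxt = [[] for _ in range(n_threads)]
    tables = [set() for _ in range(n_threads)]
    current[owner(root)].append(root)
    tables[owner(root)].add(root)
    expanded = []
    while any(current):
        for tid in range(n_threads):
            for node in current[tid]:
                expanded.append(node)
                for succ in graph.get(node, []):
                    dest = owner(succ)  # dest != tid models an inter-thread send
                    if succ not in tables[dest]:
                        tables[dest].add(succ)
                        nxt[dest].append(succ)
        # The depth layer is exhausted: all threads swap queues in lockstep.
        current, nxt = nxt, [[] for _ in range(n_threads)]
    return expanded

order = hd_bfs({'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}, 'a', 2)
```

On the diamond a→{b,c}→d, node d is expanded exactly once and only in the last layer, reflecting both duplicate detection and the synchronous layer swap.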

3.4 Parallel Structured Duplicate Detection (PSDD)

To address the shortcomings of the previous approach, HD-BFS, the authors present a new approach, Parallel Structured Duplicate Detection (PSDD). Rather than assigning nodes to threads in advance using a hash function, PSDD enables threads to dynamically allocate search workloads. It relies on a homomorphic abstraction to associate nodes in the search graph with nodes in an abstract representation of the search graph.

In the PSDD algorithm, an abstraction function is used to create a simplified representation of the state-space graph. This is necessary because the state space is too large to be stored in memory, so a smaller representation is needed to exploit the local structure of the graph.

The abstraction function works by removing some state information from each search node, creating a many-to-one mapping: the image of a search node under the abstraction is the abstract node to which it maps. For this representation to be usable, the abstraction function must be computable on each node.

The abstraction used in this implementation of PSDD considers, for any given state, only a fixed subset of process IDs and their automaton states. For example, if a state has seven processes numbered 0-6, the abstraction may consider only the first two process IDs, projecting away the remaining process IDs 2-6 and leading to a much smaller set of abstract nodes.
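In code, such a projection is essentially a one-liner; the state representation and automaton-state names below are our own hypothetical example, not taken from the paper.

```python
def abstract(state, kept_pids=(0, 1)):
    """Project a concrete state (one automaton state per process) onto a
    fixed subset of process IDs; everything else is abstracted away,
    giving a many-to-one mapping from states to abstract nodes."""
    return tuple(state[pid] for pid in kept_pids)

# Two 7-process states that differ only in the projected-away processes
# 2-6 map to the same abstract node.
s1 = ('q0', 'q3', 'q1', 'q2', 'q0', 'q1', 'q4')
s2 = ('q0', 'q3', 'q9', 'q8', 'q7', 'q6', 'q5')
node1, node2 = abstract(s1), abstract(s2)
```

All concrete states sharing an abstract node form one duplicate-detection scope, which is the unit of work that PSDD hands out to threads.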

To determine predecessor and successor relations in the abstract graph, the transitions of the finite automaton are used. In PSDD, only the state of a single component automaton transitions between a node and its successors. Therefore, the successors in the abstract graph consist of all possible single transitions of the process IDs that haven't been removed through abstraction. To optimize the process, the abstract graph is generated on-demand during the search. This approach yields the benefit of constructing only those portions of the graph that are utilized and constructing it concurrently with the search rather than sequentially as a preprocessing step.

4 Approach Weaknesses

In this section, we provide the overview of the weaknesses of each approach studied in the previous section.

4.1 Weaknesses of Dynamic Partitioning

By implementing proper locking, shared hash tables can be used without modification in all distributed algorithms. However, the dynamic partitioning schemes used by each algorithm place additional requirements on the visit order: the handoff technique requires a DFS stack, while the shared queue is specific to BFS. As many model checkers employ distributed algorithms, each requiring a distinct visit order when using dynamic partitioning techniques, modifying them to use shared hash tables would be a challenging undertaking.

4.2 Weaknesses of Shared Hash Tables

As we saw in the previous section, the shared-hash-tables approach has been surpassed by better methods such as PSDD. The results have demonstrated that Parallel Structured Duplicate Detection tackles the state-space explosion problem far more effectively: compared to the shared-hash-tables approach, it has shown better parallel speedup and significantly lower memory usage. Specifically, the immediate detection of duplicate nodes in PSDD has proven more efficient than the delayed detection employed by variants of the shared-hash-tables approach. Additionally, external PSDD has shown promise in further reducing the memory requirements of model checking by taking advantage of secondary storage devices such as hard disks.

4.3 Weaknesses of Hash-Distributed Breadth-First Search (HD-BFS)

The findings indicate that hash-distributed search, when implemented for model checking, suffers from two primary drawbacks. Firstly, it can delay the detection of duplicate nodes while they are communicated between threads: these nodes are placed on the receiving thread's queue, where they sit until they are eventually checked against the receiving thread's hash table. This delay can cause the search to consume more memory, as the duplicates remain in the receiving queue instead of having their memory freed for reuse immediately. Secondly, hash-distributed search is limited in its ability to apply partial-order reduction, a model checking technique employed to reduce the search graph's size; when a node is expanded with partial-order reduction, only a portion of its successors is taken into account and the remainder is discarded. That being the case, the authors came up with the PSDD approach, addressing the drawbacks of HD-BFS.

5 Suggestions For The Future of the Approaches

In this section, we provide suggestions for the future of each of the approaches. This involves suggestions regarding the advancement of each approach itself, as well as suggestions for cooperation where the approaches could nicely complement each other.

5.1 Suggestions for Parallelizing Bounded Model Checking

It would be interesting to test the technique on large computing clusters with more advanced program transformations and various assumption strategies for partitioning the analysis. A fascinating modification to the approach would be to dynamically assign sub-partitions of the execution traces to the cores. Comparing the static partitioning method to dynamic variations through experiments could provide valuable insights into cooperative parallelization.

In regard to cooperative parallelization, it could be useful to look at the paper on the shared hash tables approach, which uses both static and dynamic partitioning. The experiments conducted there use four different approaches to measure reachability timings: BFS with static partitioning, BFS with modular partitioning, DFS with handoff partitioning, and BFS with shared queue partitioning. The figures indicate that the dynamic partitioning schemes perform better for a small number of cores but are consistently outperformed by the statically partitioned BFS as cores are added. Surprisingly, the modular partitioning scheme performs better than expected in some cases but is still the slowest and least scalable approach. This resource could therefore be a nice complement in the future.

In terms of performance analysis, it would be valuable to examine more closely the runtime discrepancies among the solvers on the various sub-formulae, as well as the fluctuations in speedup as the number of cores is adjusted. Moreover, it would be beneficial to investigate how different variable selection heuristics affect load balancing. To further improve understanding, the authors could employ performance models and probabilistic heuristics to connect the effects of splitting choices to the program's data or control flow. Additionally, advanced visualization techniques depicting the solver's behavior could help in gaining insights into its performance.

Previous research has shown that symbolic partial order reduction has yielded positive results when applied to round-bounded lazy sequentialization. Partial order reduction could potentially benefit the context-bounded analysis even more in this approach. This is because the round-bounded scheduler always executes threads in a fixed order, resulting in limited opportunities to reduce the search space. In addition, experiments with both static and dynamic partial order reduction at the level of the decision procedure can be useful.

Expanding the scope of this methodology to other classes of complex programs, beyond concurrent programs, could be a valuable research direction. Such generalization would require appropriate program transformations that preserve the relevant structure, along with suitable assumptions and selection heuristics for the splitting variables. The development of general selection heuristics for splitting variables could also enhance the performance of sequential analysis.

5.2 Suggestions for Shared Hash Tables Via Dynamic Partitioning

As a suggestion for this paper: in the previous sub-section (5.1) we outlined the benefits that research could gain from comparing the static partitioning method to dynamic variations through experiments. For this paper, we suggest in turn referring to the PBMC work for applying these techniques to other classes of complex programs, including sequential analysis, using appropriate program transformations and general selection heuristics for the splitting variables.

5.3 Suggestions for Parallel Structured Duplicate Detection (PSDD)

In terms of future research, the authors should aim to apply PSDD to other model checkers to demonstrate its versatility and efficacy in accelerating search with complete partial-order reduction. With this said, it could be interesting to do so with a model checker built on the PBMC approach discussed in section 3.1. The authors of PBMC state that their implementation for concurrent programs outperforms commonly used state-of-the-art parallel solvers. Hence, improving the performance of the PBMC implementation through PSDD would be a valuable and worthwhile challenge.