This web page will be extended into multiple research tracks, perhaps on the Course Wiki when it becomes available.
The dining philosophers problem for multi-threaded programming is well known.
If a mutex is used to "protect" each individual fork, then a deadlock
can result. Such a dining philosophers program with just two threads
(two philosophers) was run through the (simple) McMini model checker,
and it reported a trace that leads to deadlock. So, one of the
developers of McMini (Luka Jovanovic, of this class) showed the thread
schedule to an LLM (GPT4), and asked it why the deadlock occurs. GPT4's
response follows.
At the end of this note, you'll find a suggested research track.
User: Why is there a deadlock, what happened:

About to exec into ./test/deadlock_program/philosophers_mutex_deadlock
Trace 1, *** DEADLOCK DETECTED ***
THREAD BACKTRACE
thread 0: starts
thread 0: pthread_mutex_init(1, _)
thread 0: pthread_mutex_init(2, _)
thread 0: pthread_create(1)
thread 0: pthread_create(2)
thread 1: starts
thread 1: pthread_mutex_lock(1)
thread 2: starts
thread 2: pthread_mutex_lock(2)
0, 0, 0, 0, 0, 1, 1, 2, 2,
END
THREAD STATES
thread 0: pthread_join(1, _)
thread 1: pthread_mutex_lock(2)
thread 2: pthread_mutex_lock(1)
END
***** Model checking completed! *****
*** DEADLOCK DETECTED *** (Trace number: 1)
Number of transitions: 34
Number of traces: 3

GPT4: The given trace provides information about a simple example of a classic deadlock problem involving two mutexes (or locks) and two threads. Here's a breakdown of what's happening:

1. Thread 0 initializes two mutexes:
   - `pthread_mutex_init(1, _)`
   - `pthread_mutex_init(2, _)`
2. Thread 0 creates two threads (Thread 1 and Thread 2):
   - `pthread_create(1)`
   - `pthread_create(2)`
3. Both threads start running:
   - `thread 1: starts`
   - `thread 2: starts`
4. Thread 1 locks Mutex 1:
   - `thread 1: pthread_mutex_lock(1)`
5. Thread 2 locks Mutex 2:
   - `thread 2: pthread_mutex_lock(2)`
6. Now, we hit the deadlock:
   - Thread 1 tries to lock Mutex 2, but Mutex 2 is already held by Thread 2.
     - `thread 1: pthread_mutex_lock(2)`
   - Thread 2 tries to lock Mutex 1, but Mutex 1 is already held by Thread 1.
     - `thread 2: pthread_mutex_lock(1)`
7. Both threads are stuck, waiting for the other to release a mutex. This creates a deadlock situation, as neither thread can proceed.

The main problem here is the order in which the threads attempt to acquire the mutexes. If both threads always tried to acquire the mutexes in the same order (either 1-2 or 2-1), this deadlock would not occur. However, in the given trace, the threads attempt to acquire the mutexes in opposite orders, leading to the deadlock.

To fix this deadlock, you'd typically ensure that both threads always try to acquire the mutexes in the same order (e.g., always lock Mutex 1 before Mutex 2). Another common solution is to use a try-lock mechanism (like `pthread_mutex_trylock()`) which attempts to lock the mutex and immediately returns if it's already locked, allowing the thread to either retry or perform another action without blocking.
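For concreteness, here is a minimal C sketch of the kind of two-philosopher program that produces such a trace. It is an illustration only: the actual `./test/deadlock_program/philosophers_mutex_deadlock` test in McMini may differ in its details, and the names `fork1`, `fork2`, `philosopher1`, and `philosopher2` are invented here. The two threads acquire the mutexes in opposite orders, which is exactly the cycle that McMini's trace exposes.

```c
/* Hypothetical two-philosopher deadlock sketch (not the actual McMini test).
 * Each "fork" is a pthread mutex; the two threads lock the forks in
 * opposite orders, so the schedule reported above can deadlock. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t fork1, fork2;

static void *philosopher1(void *arg) {
    pthread_mutex_lock(&fork1);   /* thread 1: pthread_mutex_lock(1) */
    pthread_mutex_lock(&fork2);   /* blocks if thread 2 already holds fork2 */
    printf("philosopher 1 eats\n");
    pthread_mutex_unlock(&fork2);
    pthread_mutex_unlock(&fork1);
    return NULL;
}

static void *philosopher2(void *arg) {
    pthread_mutex_lock(&fork2);   /* thread 2: pthread_mutex_lock(2) */
    pthread_mutex_lock(&fork1);   /* blocks if thread 1 already holds fork1 */
    printf("philosopher 2 eats\n");
    pthread_mutex_unlock(&fork1);
    pthread_mutex_unlock(&fork2);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_mutex_init(&fork1, NULL);   /* thread 0: pthread_mutex_init(1, _) */
    pthread_mutex_init(&fork2, NULL);   /* thread 0: pthread_mutex_init(2, _) */
    pthread_create(&t1, NULL, philosopher1, NULL);
    pthread_create(&t2, NULL, philosopher2, NULL);
    pthread_join(t1, NULL);             /* thread 0: pthread_join(1, _) */
    pthread_join(t2, NULL);
    return 0;
}
```

Following GPT4's suggestion, swapping the two lock calls in `philosopher2` so that both threads acquire `fork1` before `fork2` imposes a single global lock order and removes the deadlock; alternatively, one of the acquisitions could be replaced by `pthread_mutex_trylock()` with a retry.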
In formal verification, an oracle is an add-on module that can directly provide correct answers within a limited domain. For example, consider a machine learning system that wishes to prove theorems about algebra. Given only the laws of arithmetic, it could eventually prove that 1001 + 1000 > 2000. But if an arithmetic oracle is available, it can immediately determine that the statement is true.
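As a toy illustration of that distinction (hypothetical code, not part of McMini or any prover): a system restricted to deriving arithmetic facts step by step reaches the answer only after many small steps, while an arithmetic oracle decides the same statement in one step.

```c
/* Toy illustration (hypothetical): deriving 1001 + 1000 > 2000 step by step
 * versus asking an "arithmetic oracle" that answers immediately. */
#include <stdio.h>

/* Step-by-step "derivation": build the sum by repeatedly adding 1,
 * as a prover working only from successor-style axioms might. */
static int prove_by_steps(int a, int b, int bound) {
    int sum = a;
    for (int i = 0; i < b; i++)   /* one "proof step" per increment */
        sum = sum + 1;
    return sum > bound;
}

/* Oracle: decides the same statement in a single step. */
static int arithmetic_oracle(int a, int b, int bound) {
    return a + b > bound;
}

int main(void) {
    printf("by steps: %d\n", prove_by_steps(1001, 1000, 2000));    /* 1 (true) */
    printf("oracle:   %d\n", arithmetic_oracle(1001, 1000, 2000)); /* 1 (true) */
    return 0;
}
```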
RESEARCH IDEA: Suppose that in a program we want to start with condition 'A' and end with condition 'Z'. An LLM can provide a summary of the issues, the model checker can perform local checks, and the model checker can then instruct the LLM on how to correct those issues. Further, the simple model checker, McMini, can be instrumented to provide additional information that might help the LLM.
More succinctly:
"Can a model checker be used as an oracle for an LLM that uses few-shot training to reason about code and summarize program traces?"(One imagines a "conversation" between the LLM and the model checker. This is analogous to how Code LLMs are used today with a compiler/interpreter as the oracle. The oracle tells the LLM if some proposed source code will satisfy some given input-output constraints.)
RELATED RESOURCES AND TOPICS: Maxwell Pirtle (in this class) and Luka Jovanovic have developed a simple but performant model checker, on which the text above is based. Its simplicity allows us to experiment with perturbing its behavior, search order, hints, etc., with knowledge gained from LLMs or other sources. As resources, you can find:
The literature on AI-based programming and automatic development of unit tests (see the Suggested Readings) should also be relevant.
Francesca Lucchetti (in this class) will be presenting some combination of the following three papers/tech reports on Mechanistic Interpretability. We can review these papers, along with the suggested course readings, for ideas about how to develop this theme. The three papers follow:
Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as a building block for research in programming languages and software engineering. However, the quality of code produced by a Code LLM varies significantly by programming language. Code LLMs produce impressive results on programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages, like OCaml and Racket. These languages enjoy dedicated communities (programming languages research; finance) but are not as well represented in training data as more broadly popular languages.

This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach generates high-quality datasets for low-resource languages, which can then be used to fine-tune any pretrained Code LLM. Our approach, called MultiPL-T, translates training data from high-resource languages into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize tests for commented code from a high-resource language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate code to a target low-resource language, and use tests to validate the translation. We apply this approach to generate tens of thousands of new, validated training items for Racket, OCaml, and Lua from Python. Moreover, we use an open dataset (The Stack) and model (StarCoderBase), which allow us to decontaminate benchmarks and train models on this data without violating the model license.
With MultiPL-T generated data, we present fine-tuned versions of StarCoderBase that achieve state-of-the-art performance for Racket, OCaml, and Lua on benchmark problems. For Lua, our fine-tuned model achieves the same performance as StarCoderBase does on Python (a very high-resource language) on the MultiPL-E benchmarks. For Racket and OCaml, we double their performance on MultiPL-E, bringing their performance close to higher-resource languages such as Ruby and C#. The MultiPL-T approach is easy to apply to new languages and can immediately be used on any of the 18+ languages that MultiPL-E supports. Moreover, as we show, it is significantly more efficient and effective than alternatives such as training longer.