This web page will be extended into multiple research tracks, perhaps on the Course Wiki when it becomes available.
The dining philosophers problem for multi-threaded programming is well known.
If a mutex is used to "protect" each individual fork, then a deadlock
can result. Such a dining philosophers program with just two threads
(two philosophers) was run through the (simple) McMini model checker,
and it reported a trace that leads to deadlock. So, one of the
developers of McMini (Luka Jovanovic, of this class) showed the thread
schedule to an LLM (GPT4), and asked it why the deadlock occurs. GPT4's
response follows.
At the end of this note, you'll find a suggested research track.
User: Why is there a deadlock, what happened:

About to exec into ./test/deadlock_program/philosophers_mutex_deadlock
Trace 1, *** DEADLOCK DETECTED ***
THREAD BACKTRACE
thread 0: starts
thread 0: pthread_mutex_init(1, _)
thread 0: pthread_mutex_init(2, _)
thread 0: pthread_create(1)
thread 0: pthread_create(2)
thread 1: starts
thread 1: pthread_mutex_lock(1)
thread 2: starts
thread 2: pthread_mutex_lock(2)
0, 0, 0, 0, 0, 1, 1, 2, 2,
END
THREAD STATES
thread 0: pthread_join(1, _)
thread 1: pthread_mutex_lock(2)
thread 2: pthread_mutex_lock(1)
END
***** Model checking completed! *****
*** DEADLOCK DETECTED *** (Trace number: 1)
Number of transitions: 34
Number of traces: 3

GPT4: The given trace provides information about a simple example of a classic deadlock problem involving two mutexes (or locks) and two threads. Here's a breakdown of what's happening:

1. Thread 0 initializes two mutexes:
   - `pthread_mutex_init(1, _)`
   - `pthread_mutex_init(2, _)`
2. Thread 0 creates two threads (Thread 1 and Thread 2):
   - `pthread_create(1)`
   - `pthread_create(2)`
3. Both threads start running:
   - `thread 1: starts`
   - `thread 2: starts`
4. Thread 1 locks Mutex 1:
   - `thread 1: pthread_mutex_lock(1)`
5. Thread 2 locks Mutex 2:
   - `thread 2: pthread_mutex_lock(2)`
6. Now, we hit the deadlock:
   - Thread 1 tries to lock Mutex 2, but Mutex 2 is already held by Thread 2.
     - `thread 1: pthread_mutex_lock(2)`
   - Thread 2 tries to lock Mutex 1, but Mutex 1 is already held by Thread 1.
     - `thread 2: pthread_mutex_lock(1)`
7. Both threads are stuck, waiting for the other to release a mutex. This creates a deadlock situation, as neither thread can proceed.

The main problem here is the order in which the threads attempt to acquire the mutexes. If both threads always tried to acquire the mutexes in the same order (either 1-2 or 2-1), this deadlock would not occur. However, in the given trace, the threads attempt to acquire the mutexes in opposite orders, leading to the deadlock.

To fix this deadlock, you'd typically ensure that both threads always try to acquire the mutexes in the same order (e.g., always lock Mutex 1 before Mutex 2). Another common solution is to use a try-lock mechanism (like `pthread_mutex_trylock()`) which attempts to lock the mutex and immediately returns if it's already locked, allowing the thread to either retry or perform another action without blocking.
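For concreteness, here is a minimal C sketch of the kind of two-philosopher program that produces such a trace. It is an illustration only: the actual `./test/deadlock_program/philosophers_mutex_deadlock` test in McMini may differ in its details, and the names `fork1`, `fork2`, `philosopher1`, and `philosopher2` are invented here. The two threads acquire the mutexes in opposite orders, which is exactly the cycle that McMini's trace exposes.

```c
/* Hypothetical two-philosopher deadlock sketch (not the actual McMini test).
 * Each "fork" is a pthread mutex; the two threads lock the forks in
 * opposite orders, so the schedule reported above can deadlock. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t fork1, fork2;

static void *philosopher1(void *arg) {
    pthread_mutex_lock(&fork1);   /* thread 1: pthread_mutex_lock(1) */
    pthread_mutex_lock(&fork2);   /* blocks if thread 2 already holds fork2 */
    printf("philosopher 1 eats\n");
    pthread_mutex_unlock(&fork2);
    pthread_mutex_unlock(&fork1);
    return NULL;
}

static void *philosopher2(void *arg) {
    pthread_mutex_lock(&fork2);   /* thread 2: pthread_mutex_lock(2) */
    pthread_mutex_lock(&fork1);   /* blocks if thread 1 already holds fork1 */
    printf("philosopher 2 eats\n");
    pthread_mutex_unlock(&fork1);
    pthread_mutex_unlock(&fork2);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_mutex_init(&fork1, NULL);   /* thread 0: pthread_mutex_init(1, _) */
    pthread_mutex_init(&fork2, NULL);   /* thread 0: pthread_mutex_init(2, _) */
    pthread_create(&t1, NULL, philosopher1, NULL);
    pthread_create(&t2, NULL, philosopher2, NULL);
    pthread_join(t1, NULL);             /* thread 0: pthread_join(1, _) */
    pthread_join(t2, NULL);
    return 0;
}
```

Following GPT4's suggestion, swapping the two lock calls in `philosopher2` so that both threads acquire `fork1` before `fork2` imposes a single global lock order and removes the deadlock; alternatively, one of the acquisitions could be replaced by `pthread_mutex_trylock()` with a retry.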
In formal verification, an oracle is an add-on module that can directly provide correct answers within a limited domain. For example, consider a machine learning system that wishes to prove theorems about algebra. Given only the laws of arithmetic, it could eventually prove that 1001 + 1000 > 2000. But if an arithmetic oracle is available, it can immediately determine that the statement is true.
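As a toy illustration of that distinction (hypothetical code, not part of McMini or any prover): a system restricted to deriving arithmetic facts step by step reaches the answer only after many small steps, while an arithmetic oracle decides the same statement in one step.

```c
/* Toy illustration (hypothetical): deriving 1001 + 1000 > 2000 step by step
 * versus asking an "arithmetic oracle" that answers immediately. */
#include <stdio.h>

/* Step-by-step "derivation": build the sum by repeatedly adding 1,
 * as a prover working only from successor-style axioms might. */
static int prove_by_steps(int a, int b, int bound) {
    int sum = a;
    for (int i = 0; i < b; i++)   /* one "proof step" per increment */
        sum = sum + 1;
    return sum > bound;
}

/* Oracle: decides the same statement in a single step. */
static int arithmetic_oracle(int a, int b, int bound) {
    return a + b > bound;
}

int main(void) {
    printf("by steps: %d\n", prove_by_steps(1001, 1000, 2000));    /* 1 (true) */
    printf("oracle:   %d\n", arithmetic_oracle(1001, 1000, 2000)); /* 1 (true) */
    return 0;
}
```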
RESEARCH IDEA: Suppose that in a program we want to start with condition 'A' and end with condition 'Z'. An LLM can provide a summary of the issues, the model checker can perform local checks, and the model checker can then instruct the LLM on how to correct those issues. Further, the simple model checker, McMini, can be instrumented to provide additional information that might help the LLM.
More succinctly:
"Can a model checker be used as an oracle for an LLM that uses few-shot training to reason about code and summarize program traces?"(One imagines a "conversation" between the LLM and the model checker. This is analogous to how Code LLMs are used today with a compiler/interpreter as the oracle. The oracle tells the LLM if some proposed source code will satisfy some given input-output constraints.)
RELATED RESOURCES AND TOPICS: Maxwell Pirtle (in this class) and Luka Jovanovic have developed a simple but performant model checker, on which the text above is based. Its simplicity allows us to experiment with perturbing its behavior, search order, hints, etc., with knowledge gained from LLMs or other sources. As resources, you can find:
The literature on AI-based programming and automatic development of unit tests (see the Suggested Readings) should also be relevant.
Francesca Lucchetti (in this class) will be presenting some combination of the following three papers/tech reports on Mechanistic Interpretability. We can review these papers, along with the suggested course readings, for ideas about how to develop this theme. The three papers follow:
Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as a building block for research in programming languages and software engineering. However, the quality of code produced by a Code LLM varies significantly by programming language. Code LLMs produce impressive results on programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages, like OCaml and Racket. These languages enjoy dedicated communities (programming languages research; finance) but are not as well represented in training data as more broadly popular languages.

This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach generates high-quality datasets for low-resource languages, which can then be used to fine-tune any pretrained Code LLM. Our approach, called MultiPL-T, translates training data from high-resource languages into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize tests for commented code from a high-resource language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate code to a target low-resource language, and use tests to validate the translation. We apply this approach to generate tens of thousands of new, validated training items for Racket, OCaml, and Lua from Python. Moreover, we use an open dataset (The Stack) and model (StarCoderBase), which allow us to decontaminate benchmarks and train models on this data without violating the model license.
With MultiPL-T generated data, we present fine-tuned versions of StarCoderBase that achieve state-of-the-art performance for Racket, OCaml, and Lua on benchmark problems. For Lua, our fine-tuned model achieves the same performance as StarCoderBase does on Python (a very high-resource language) on the MultiPL-E benchmarks. For Racket and OCaml, we double their performance on MultiPL-E, bringing their performance close to higher-resource languages such as Ruby and C#. The MultiPL-T approach is easy to apply to new languages and can immediately be used on any of the 18+ languages that MultiPL-E supports. Moreover, as we show, it is significantly more efficient and effective than alternatives such as training longer.