-
A Survey on In-context Learning,
Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, Baobao Chang, Xu Sun, Lei Li, Zhifang Sui
(October, 2024)
-
With the increasing capabilities of large language models (LLMs), in-context learning (ICL) has emerged as a new paradigm for natural language processing (NLP), where LLMs make predictions based on contexts augmented with a few examples. It has been a significant trend to explore ICL to evaluate and extrapolate the ability of LLMs. In this paper, we aim to survey and summarize the progress and challenges of ICL. We first present a formal definition of ICL and clarify its correlation to related studies. Then, we organize and discuss advanced techniques, including training strategies, prompt designing strategies, and related analysis. Additionally, we explore various ICL application scenarios, such as data engineering and knowledge updating. Finally, we address the challenges of ICL and suggest potential directions for further research. We hope that our work can encourage more research on uncovering how ICL works and improving ICL.
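EXAMPLE: To make "contexts augmented with a few examples" concrete, here is a minimal sketch of how an ICL prompt might be assembled. A few labeled demonstrations are concatenated ahead of the query, and the model is expected to continue the pattern. The function name, the template wording, and the sentiment examples are illustrative assumptions, not taken from the survey.

```python
from typing import List, Tuple


def build_icl_prompt(demos: List[Tuple[str, str]], query: str) -> str:
    """Concatenate (input, label) demonstrations followed by the unlabeled query."""
    parts = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    # The final block leaves the label empty so the model fills it in.
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)


# Hypothetical demonstrations for a sentiment task.
demos = [
    ("Delightful from start to finish.", "positive"),
    ("A tedious, overlong mess.", "negative"),
]
prompt = build_icl_prompt(demos, "Surprisingly moving.")
print(prompt)
```

No model parameters change here; the only "learning" signal is the demonstrations placed in the prompt.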
NOTE:
Khoury Seminar Talk: Lei Li (CMU LTI), “The Science of Evaluation and Alignment for Large Language Models”
When: Monday, November 4th, 2024 @ 11am-12pm ET / 8am-9am PT
Where: EXP 610 (EXP is the new Science and Engineering building)
Faculty Host: Weiyan Shi
-
LogAnomaly: Unsupervised Detection of Sequential and Quantitative Anomalies in Unstructured Logs,
Meng, Weibin and Liu, Ying and Zhu, Yichen and Zhang, Shenglin and Pei, Dan and Liu, Yuqing and Chen, Yihao and Zhang, Ruizhi and Tao, Shimin and Sun, Pei and Zhou, Rong
Proceedings of the Twenty-Eighth International Joint Conference on
Artificial Intelligence (IJCAI-19),
Vol. 19, pp. 4739--4745, 2019
-
Recording runtime status via logs is common for almost every computer system,
and detecting anomalies in logs is crucial for timely identifying
malfunctions of systems. However, manually detecting anomalies for logs
is time-consuming, error-prone, and infeasible. Existing automatic log
anomaly detection approaches, using indexes rather than semantics of
log templates, tend to cause false alarms. In this work, we propose
LogAnomaly, a framework to model a log stream as a natural language
sequence. Empowered by template2vec, a novel, simple yet effective
method to extract the semantic information hidden in log templates,
LogAnomaly can detect both sequential and quantitative log anomalies
simultaneously, which has not been done by any previous work. Moreover,
LogAnomaly can avoid the false alarms caused by the newly appearing
log templates between periodic model retrainings. Our evaluation on two
public production log datasets shows that LogAnomaly outperforms existing
log-based anomaly detection methods.
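EXAMPLE: The idea of handling a newly appearing template by its meaning rather than its index can be sketched as a nearest-neighbor lookup over template vectors. The code below is a toy under stated assumptions: a bag-of-words count vector stands in for template2vec (which is not reproduced here), and the templates and function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def bow(template: str) -> Counter:
    """Toy stand-in for template2vec: a bag-of-words count vector."""
    return Counter(template.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_known(new_template: str, known: List[str]) -> str:
    """Map an unseen template to the most similar template seen in training."""
    return max(known, key=lambda t: cosine(bow(new_template), bow(t)))


known_templates = [
    "Interface * changed state to down",
    "Connection from * closed",
]
match = nearest_known("Vlan-interface * changed state to down", known_templates)
print(match)
```

An index-based detector would treat the unseen template as an unknown event; mapping it to its closest known template lets the existing model score it without retraining.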
-
Interpretable online log analysis using large language models with prompt strategies,
Liu, Yilun and Tao, Shimin and Meng, Weibin and Wang, Jingyu and Ma, Wenbing and Chen, Yuhang and Zhao, Yanqing and Yang, Hao and Jiang, Yanfei
in Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension,
pp. 35--46, 2024
-
Automated log analysis is crucial in modern software-intensive systems for facilitating program comprehension throughout software maintenance and engineering life cycles. Existing methods perform tasks such as log parsing and log anomaly detection by providing a single prediction value without interpretation. However, given the increasing volume of system events, the limited interpretability of analysis results hinders analysts' comprehension of program status and their ability to take appropriate actions. Moreover, these methods require substantial in-domain training data, and their performance declines sharply (by up to 62.5%) in online scenarios involving unseen logs from new domains, a common occurrence due to rapid software updates. In this paper, we propose LogPrompt, a novel interpretable log analysis approach for online scenarios. LogPrompt employs large language models (LLMs) to perform online log analysis tasks via a suite of advanced prompt strategies tailored for log tasks, which enhances LLMs' performance by up to 380.7% compared with simple prompts. Experiments on nine publicly available evaluation datasets across two tasks demonstrate that LogPrompt, despite requiring no in-domain training, outperforms existing approaches trained on thousands of logs by up to 55.9%. We also conduct a human evaluation of LogPrompt's interpretability with six practitioners possessing over 10 years of experience, who rated the generated content highly in terms of usefulness and readability (4.42/5 on average). LogPrompt also exhibits remarkable compatibility with open-source and smaller-scale LLMs, making it flexible for practical deployment. Code of LogPrompt is available at https://github.com/lunyiliu/LogPrompt.
-
LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration,
Zhao, Jun and Zu, Can and Xu, Hao and Lu, Yi and He, Wei and Ding, Yiwen and Gui, Tao and Zhang, Qi and Huang, Xuanjing
(March, 2024)
-
Large language models (LLMs) have demonstrated impressive performance in understanding language and executing complex reasoning tasks. However, LLMs with long context windows have been notorious for their expensive training costs and high inference latency. Even the most advanced models such as GPT-4 and Claude2 often make mistakes when processing inputs of over 100k tokens, a phenomenon also known as "lost in the middle". In this paper, we propose LongAgent, a method based on multi-agent collaboration, which scales LLMs (e.g., LLaMA) to a context of 128K and demonstrates potential superiority in long-text processing compared to GPT-4. In LongAgent, a leader is responsible for understanding user intent and directing team members to acquire information from documents. Due to members' hallucinations, it is non-trivial for a leader to obtain accurate information from the responses of dozens to hundreds of members. To address this, we develop an inter-member communication mechanism to resolve response conflicts caused by hallucinations through information sharing. Our experimental results indicate that LongAgent offers a promising alternative for long-text processing. The agent team instantiated with LLaMA-7B achieves significant improvements in tasks such as 128k-long text retrieval and multi-hop question answering, compared to GPT-4.
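EXAMPLE: The leader/member arrangement and the idea of resolving conflicts by sharing information can be sketched with a toy in which each member sees one chunk of a long document. Everything here is an assumption made for illustration: the members are plain functions rather than LLMs, a fixed string "guess" stands in for a hallucinated answer, and the pairwise merge rule is a simplification, not the paper's protocol.

```python
from typing import List


def member_answer(chunk: str, key: str) -> str:
    """Toy member: reads the value if its chunk holds the key, otherwise guesses."""
    for token in chunk.split():
        if token.startswith(key + "="):
            return token.split("=", 1)[1]
    return "guess"  # stands in for a hallucinated answer


def leader(document: str, key: str, chunk_words: int) -> str:
    """Split the document among members, then reconcile conflicting answers."""
    words = document.split()
    chunks: List[str] = [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]
    answers = [member_answer(c, key) for c in chunks]
    # While members disagree, two conflicting members share their chunks
    # and answer again over the combined text.
    while len(set(answers)) > 1:
        j = next(idx for idx, a in enumerate(answers) if a != answers[0])
        merged = chunks[0] + " " + chunks[j]
        shared = member_answer(merged, key)
        chunks = [merged] + [c for idx, c in enumerate(chunks) if idx not in (0, j)]
        answers = [shared] + [a for idx, a in enumerate(answers) if idx not in (0, j)]
    return answers[0]


document = "alpha beta gamma code=42 delta epsilon zeta eta"
result = leader(document, "code", chunk_words=3)
print(result)
```

Each merge removes one chunk, so the loop always terminates; a member that guessed is corrected once it sees the chunk that actually contains the answer.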
-
LLM-Powered Test Case Generation for Detecting Tricky Bugs,
Kaibo Liu, Yiyang Liu, Zhenpeng Chen, Jie M. Zhang, Yudong Han, Yun Ma, Ge Li, Gang Huang
(April, 2024)
-
Conventional automated test generation tools struggle to generate test oracles and tricky bug-revealing test inputs. Large Language Models (LLMs) can be prompted to produce test inputs and oracles for a program directly, but the precision of the tests can be very low for complex scenarios (only 6.3% based on our experiments). To fill this gap, this paper proposes AID, which combines LLMs with differential testing to generate fault-revealing test inputs and oracles targeting plausibly correct programs (i.e., programs that have passed all the existing tests). In particular, AID selects test inputs that yield diverse outputs on a set of program variants generated by LLMs, then constructs the test oracle based on the outputs. We evaluate AID on two large-scale datasets with tricky bugs: TrickyBugs and EvalPlus, and compare it with three state-of-the-art baselines. The evaluation results show that the recall, precision, and F1 score of AID outperform the state-of-the-art by up to 1.80x, 2.65x, and 1.66x, respectively.
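EXAMPLE: The selection step described above (prefer inputs on which the program variants disagree, then derive an oracle from their outputs) can be sketched as follows. This is a toy under stated assumptions: the variants are hand-written functions rather than LLM-generated programs, and taking the majority output as the oracle is an assumption of this sketch, not necessarily AID's rule.

```python
from collections import Counter
from typing import Callable, List, Tuple

Program = Callable[[int], int]


def select_test(variants: List[Program], inputs: List[int]) -> Tuple[int, int]:
    """Pick the input with the most distinct outputs; oracle = majority output."""
    def outputs(x: int) -> List[int]:
        return [v(x) for v in variants]

    best = max(inputs, key=lambda x: len(set(outputs(x))))
    oracle, _ = Counter(outputs(best)).most_common(1)[0]
    return best, oracle


# Hypothetical variants of an absolute-value routine; the last one is buggy.
variants = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,
    lambda x: x,  # wrong for negative inputs
]
test_input, expected = select_test(variants, [3, 0, -4])
print(test_input, expected)
```

Inputs on which every variant agrees cannot reveal a fault, so the input that splits the variants is the one worth turning into a test.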
-
The Fact Selection Problem in LLM-Based Program Repair,
Nikhil Parasaram, Huijie Yan, Boyu Yang, Zineb Flahy, Abriele Qudsi, Damian Ziaber, Earl Barr, Sergey Mechtaev
(August, 2024)
-
Recent research has shown that incorporating bug-related facts, such as stack traces and GitHub issues, into prompts enhances the bug-fixing capabilities of large language models (LLMs). Considering the ever-increasing context window of these models, a critical question arises: what and how many facts should be included in prompts to maximise the chance of correctly fixing bugs? To answer this question, we conducted a large-scale study, employing over 19K prompts featuring various combinations of seven diverse facts to rectify 314 bugs from open-source Python projects within the BugsInPy benchmark. Our findings revealed that each fact, ranging from simple syntactic details like code context to semantic information previously unexplored in the context of LLMs such as angelic values, is beneficial. Specifically, each fact aids in fixing some bugs that would remain unresolved or only be fixed with a low success rate without it. Importantly, we discovered that the effectiveness of program repair prompts is non-monotonic over the number of used facts; using too many facts leads to subpar outcomes. These insights led us to define the fact selection problem: determining the optimal set of facts for inclusion in a prompt to maximise LLM's performance on a given task instance. We found that there is no one-size-fits-all set of facts for bug repair. Therefore, we developed a basic statistical model, named Maniple, which selects facts specific to a given bug to include in the prompt. This model significantly surpasses the performance of the best generic fact set. To underscore the significance of the fact selection problem, we benchmarked Maniple against the state-of-the-art zero-shot, non-conversational LLM-based bug repair methods. On our testing dataset of 157 bugs, Maniple repairs 88 bugs, 17% above the best configuration.
-
Mokav: Execution-driven Differential Testing with LLMs,
Khashayar Etemadi, Bardia Mohammadi, Zhendong Su, Martin Monperrus
(June, 2024)
-
It is essential to detect functional differences in various software engineering tasks, such as automated program repair, mutation testing, and code refactoring. The problem of detecting functional differences between two programs can be reduced to searching for a difference exposing test (DET): a test input that results in different outputs on the subject programs. In this paper, we propose Mokav, a novel execution-driven tool that leverages LLMs to generate DETs. Mokav takes two versions of a program (P and Q) and an example test input. When successful, Mokav generates a valid DET, a test input that leads to different outputs on P and Q. Mokav iteratively prompts an LLM with a specialized prompt to generate new test inputs. At each iteration, Mokav provides execution-based feedback regarding previously generated tests until the LLM produces a DET. We evaluate Mokav on 1,535 pairs of Python programs collected from the Codeforces competition platform and 32 pairs of programs from the QuixBugs dataset. Our experiments show that Mokav outperforms the state-of-the-art, Pynguin and Differential Prompting, by a large margin. Mokav can generate DETs for 81.7% (1,255/1,535) of the program pairs in our benchmark (versus 4.9% for Pynguin and 37.3% for Differential Prompting). We demonstrate that all components in our system, including the iterative and execution-driven approaches, contribute to its high effectiveness.
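EXAMPLE: The iterative, execution-driven search for a difference-exposing test (DET) can be sketched as a loop that runs both program versions on each proposed input and feeds the results back to the proposer. In this toy, the `propose` function is a hypothetical stand-in for prompting an LLM, and the subject programs P and Q are made up.

```python
from typing import Callable, List, Optional, Tuple

Program = Callable[[int], int]
Feedback = List[Tuple[int, int, int]]  # (input, output of P, output of Q)


def search_det(p: Program, q: Program,
               propose: Callable[[Feedback], int],
               max_iters: int = 10) -> Optional[int]:
    """Ask `propose` for inputs until P and Q disagree or the budget runs out."""
    feedback: Feedback = []
    for _ in range(max_iters):
        x = propose(feedback)        # stand-in for prompting an LLM
        out_p, out_q = p(x), q(x)    # execute both versions
        if out_p != out_q:
            return x                 # x is a difference-exposing test
        feedback.append((x, out_p, out_q))
    return None


# Hypothetical subject programs: Q mishandles negative numbers.
def p(x: int) -> int:
    return x * x


def q(x: int) -> int:
    return x * abs(x)


# Toy proposer: tries 1, 0, -1, ... depending on how many attempts failed so far.
def propose(feedback: Feedback) -> int:
    return 1 - len(feedback)


det = search_det(p, q, propose)
print(det)
```

The feedback list is what makes the search execution-driven: each failed attempt records what both programs actually returned, which a real proposer could use to steer the next input.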
-
Time Series Data Augmentation for Deep Learning: A Survey,
Qingsong Wen, Liang Sun, Fan Yang, Xiaomin Song, Jingkun Gao, Xue Wang, Huan Xu
Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence
2021, pp. 4653--4660, International Joint Conferences on Artificial Intelligence (IJCAI-21)
-
Deep learning has recently performed remarkably well on many time series analysis tasks. The superior performance of deep neural networks relies heavily on a large amount of training data to avoid overfitting. However, the labeled data of many real-world time series applications may be limited, such as classification in medical time series and anomaly detection in AIOps. As an effective way to enhance the size and quality of the training data, data augmentation is crucial to the successful application of deep learning models on time series data. In this paper, we systematically review different data augmentation methods for time series. We propose a taxonomy for the reviewed methods, and then provide a structured review for these methods by highlighting their strengths and limitations. We also empirically compare different data augmentation methods for different tasks including time series classification, anomaly detection, and forecasting. Finally, we discuss and highlight five future directions to provide useful research guidance.
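EXAMPLE: Three simple time-domain transforms of the kind such a review covers (jittering, scaling, and window slicing) can be sketched in a few lines. The function names, parameter values, and sample series are illustrative assumptions; the survey's taxonomy is much broader than these three.

```python
import random
from typing import List


def jitter(series: List[float], sigma: float, rng: random.Random) -> List[float]:
    """Add independent Gaussian noise to every point."""
    return [x + rng.gauss(0.0, sigma) for x in series]


def scale(series: List[float], factor: float) -> List[float]:
    """Multiply the whole series by a constant factor."""
    return [x * factor for x in series]


def window_slice(series: List[float], length: int, rng: random.Random) -> List[float]:
    """Crop a random contiguous window of the given length."""
    start = rng.randrange(len(series) - length + 1)
    return series[start:start + length]


rng = random.Random(0)  # fixed seed so the run is repeatable
series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
augmented = [
    jitter(series, 0.1, rng),
    scale(series, 1.5),
    window_slice(series, 4, rng),
]
for s in augmented:
    print(s)
```

Each transform yields a new training example that keeps the original label, which is how augmentation enlarges a small labeled set.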
-
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
Advances in Neural Information Processing Systems,
2020,
pp. 9459--9474
-
Large pre-trained language models have been shown to store factual
knowledge in their parameters, and achieve state-of-the-art results
when fine-tuned on downstream NLP tasks. However, their ability to
access and precisely manipulate knowledge is still limited, and hence on
knowledge-intensive tasks, their performance lags behind task-specific
architectures. Additionally, providing provenance for their decisions and
updating their world knowledge remain open research problems. Pre-trained
models with a differentiable access mechanism to explicit nonparametric
memory can overcome this issue, but have so far been only investigated for
extractive downstream tasks. We explore a general-purpose fine-tuning
recipe for retrieval-augmented generation (RAG) — models which
combine pre-trained parametric and non-parametric memory for language
generation. We introduce RAG models where the parametric memory is a
pre-trained seq2seq model and the non-parametric memory is a dense vector
index of Wikipedia, accessed with a pre-trained neural retriever. We
compare two RAG formulations, one which conditions on the same retrieved
passages across the whole generated sequence, and another which can use
different passages per token. We fine-tune and evaluate our models on a
wide range of knowledge-intensive NLP tasks and set the state of the art
on three open domain QA tasks, outperforming parametric seq2seq models and
task-specific retrieve-and-extract architectures. For language generation
tasks, we find that RAG models generate more specific, diverse and
factual language than a state-of-the-art parametric-only seq2seq baseline.
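EXAMPLE: The difference between the two formulations (one set of passages for the whole sequence versus possibly different passages per token) comes down to where the sum over retrieved passages sits. The sketch below computes both on made-up numbers; the variable names and probabilities are assumptions, and a real model would obtain them from a retriever and a seq2seq generator over the top retrieved passages.

```python
import math
from typing import List


def rag_sequence(p_doc: List[float], p_tok: List[List[float]]) -> float:
    """Same passage for the whole sequence: sum_z p(z|x) * prod_i p(y_i | x, z, y_<i)."""
    return sum(pz * math.prod(toks) for pz, toks in zip(p_doc, p_tok))


def rag_token(p_doc: List[float], p_tok: List[List[float]]) -> float:
    """A different passage may back each token: prod_i sum_z p(z|x) * p(y_i | x, z, y_<i)."""
    n_tokens = len(p_tok[0])
    return math.prod(
        sum(pz * toks[i] for pz, toks in zip(p_doc, p_tok))
        for i in range(n_tokens)
    )


# Made-up numbers: two retrieved passages, a two-token target.
p_doc = [0.5, 0.5]                # retriever probabilities p(z|x)
p_tok = [[0.8, 0.2], [0.2, 0.8]]  # p_tok[z][i] = p(y_i | x, z, y_<i)
print(rag_sequence(p_doc, p_tok), rag_token(p_doc, p_tok))
```

In this toy each passage supports a different token, so the per-token formulation assigns the target a higher probability (0.25) than the per-sequence one (0.16).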
-
Automatic root cause analysis via large language models for cloud incidents,
Chen, Yinfang and Xie, Huaibing and Ma, Minghua and Kang, Yu and Gao, Xin and Shi, Liu and Cao, Yunjie and Gao, Xuedong and Fan, Hao and Wen, Ming and others
Proceedings of the Nineteenth European Conference on Computer Systems,
2024,
pp. 674--688
-
Ensuring the reliability and availability of cloud services necessitates efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual investigations of data sources such as logs and traces, are often laborious, error-prone, and challenging for on-call engineers. In this paper, we introduce RCACopilot, an innovative on-call system empowered by the large language model for automating RCA of cloud incidents. RCACopilot matches incoming incidents to corresponding incident handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative. We evaluate RCACopilot using a real-world dataset consisting of a year's worth of incidents from Microsoft. Our evaluation demonstrates that RCACopilot achieves RCA accuracy up to 0.766. Furthermore, the diagnostic information collection component of RCACopilot has been successfully in use at Microsoft for over four years.
-
MultiGPrompt for multi-task pre-training and prompting on graphs,
Yu, Xingtong and Zhou, Chang and Fang, Yuan and Zhang, Xinming
in Proceedings of the ACM on Web Conference 2024,
2024,
pp. 515--526
-
Graph Neural Networks (GNNs) have emerged as a mainstream technique for graph representation learning. However, their efficacy within an end-to-end supervised framework is significantly tied to the availability of task-specific labels. To mitigate labeling costs and enhance robustness in few-shot settings, pre-training on self-supervised tasks has emerged as a promising method, while prompting has been proposed to further narrow the objective gap between pretext and downstream tasks. Although there has been some initial exploration of prompt-based learning on graphs, they primarily leverage a single pretext task, resulting in a limited subset of general knowledge that could be learned from the pre-training data. Hence, in this paper, we propose MultiGPrompt, a novel multi-task pre-training and prompting framework to exploit multiple pretext tasks for more comprehensive pre-trained knowledge. First, in pre-training, we design a set of pretext tokens to synergize multiple pretext tasks. Second, we propose a dual-prompt mechanism consisting of composed and open prompts to leverage task-specific and global pre-training knowledge, to guide downstream tasks in few-shot settings. Finally, we conduct extensive experiments on six public datasets to evaluate and analyze MultiGPrompt.
-
Mapping the Mind of a Large Language Model,
Anthropic (May, 2024); Select "Read the Paper"
-
Today we report a significant advance in understanding the inner workings of AI models. We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language model. This interpretability discovery could, in future, help us make AI models safer.
-
Activation Steering for Robust Type Prediction in CodeLLMs,
Francesca Lucchetti, Arjun Guha (April, 2024)
-
CodeLLMs are transforming software development as we know it. This is especially true for tasks where rule-based approaches fall short, like type prediction. The type prediction task consists of adding a new type annotation to a partially typed program, such that the resulting program is closer to being fully typed. The intractability of rule-based approaches and the high cost of manual annotation make CodeLLMs an attractive solution to the problem. However, CodeLLMs are still far from being deployed at large scale due to doubts surrounding their reliability.
To shed some light on how CodeLLMs approach type prediction, we investigate what happens when a model mispredicts a type. We show that by applying semantics-preserving edits to code, CodeLLMs are eventually misled into mispredicting type annotations. However, by leveraging activation steering we are able to "steer" the model back to the correct prediction, making models more robust against semantically irrelevant prompt features. We show that steering achieves comparable performance to fine-tuning directly on the type prediction task. Furthermore, we find that steering vectors computed from Python code are effective at correcting TypeScript mispredictions, and vice versa. To our knowledge, this is the first evidence of its kind to suggest that CodeLLMs learn task representations that transfer across languages.
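EXAMPLE: A common way to build and apply a steering vector is to take the difference between mean activations on two groups of prompts and add a multiple of it to a hidden state. The sketch below shows only that arithmetic on made-up 3-dimensional vectors; the function names, the choice of groups, and the scaling factor `alpha` are assumptions, and the paper's exact procedure may differ.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def steering_vector(good: List[Vector], bad: List[Vector]) -> Vector:
    """Difference of mean activations between correct and mispredicted prompts."""
    return [g - b for g, b in zip(mean(good), mean(bad))]


def steer(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Made-up activations standing in for one layer of a model.
good = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
bad = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
v = steering_vector(good, bad)
print(v, steer([0.0, 2.0, 2.0], v))
```

In a real model the shift would be applied to activations during the forward pass, leaving the weights untouched, which is what distinguishes steering from fine-tuning.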
-
Enhancing Code Translation in Language Models with Few-Shot Learning via Retrieval-Augmented Generation,
Manish Bhattarai, Javier E. Santos, Shawn Jones, Ayan Biswas, Boian Alexandrov, Daniel O'Malley (July, 2024)
-
arXiv preprint:
The advent of large language models (LLMs) has significantly advanced the field of code translation, enabling automated translation between programming languages. However, these models often struggle with complex translation tasks due to inadequate contextual understanding. This paper introduces a novel approach that enhances code translation through Few-Shot Learning, augmented with retrieval-based techniques. By leveraging a repository of existing code translations, we dynamically retrieve the most relevant examples to guide the model in translating new code segments. Our method, based on Retrieval-Augmented Generation (RAG), substantially improves translation quality by providing contextual examples from which the model can learn in real-time. We selected RAG over traditional fine-tuning methods due to its ability to utilize existing codebases or a locally stored corpus of code, which allows for dynamic adaptation to diverse translation tasks without extensive retraining. Extensive experiments on diverse datasets with open LLMs such as Starcoder, Llama3-70B Instruct, CodeLlama-34B Instruct, Granite-34B Code Instruct, and Mixtral-8x22B, as well as commercial LLMs like GPT-3.5 Turbo and GPT-4o, demonstrate our approach's superiority over traditional zero-shot methods, especially in translating between Fortran and C++. We also explored varying numbers of shots (i.e., examples provided during inference), specifically 1, 2, and 3 shots, and different embedding models for RAG, including Nomic-Embed, Starencoder, and CodeBERT, to assess the robustness and effectiveness of our approach.
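EXAMPLE: Retrieving the most relevant translation pairs and placing them in the prompt as shots can be sketched as a top-k similarity search followed by prompt assembly. The two-dimensional embeddings, the corpus entries, and the prompt layout below are made-up assumptions; a real system would embed code with a model such as those named above.

```python
import math
from typing import List, Tuple

Example = Tuple[List[float], str, str]  # (embedding, Fortran source, C++ translation)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def build_prompt(query_vec: List[float], query_src: str,
                 corpus: List[Example], shots: int) -> str:
    """Select the `shots` nearest translation pairs and prepend them to the query."""
    top = sorted(corpus, key=lambda ex: cosine(query_vec, ex[0]), reverse=True)[:shots]
    blocks = [f"Fortran:\n{src}\nC++:\n{dst}" for _, src, dst in top]
    blocks.append(f"Fortran:\n{query_src}\nC++:")
    return "\n\n".join(blocks)


# Hypothetical corpus with toy embeddings.
corpus = [
    ([1.0, 0.0], "print *, 'hi'", 'std::cout << "hi" << std::endl;'),
    ([0.0, 1.0], "x = x + 1", "x = x + 1;"),
]
prompt = build_prompt([0.9, 0.1], "print *, 'bye'", corpus, shots=1)
print(prompt)
```

Changing `shots` to 2 or 3 mirrors the varying-shot settings described above, with no retraining involved.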
-
Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code,
Pan R, Ibrahimzada AR, Krishna R, Sankar D, Wassi LP, Merler M, Sobolev B, Pavuluri R, Sinha S, Jabbarvand R. arXiv preprint arXiv:2308.03109. 2023 Aug 6.
-
CONCLUSION: Added context (e.g., enhanced stack traces) helps.
Code translation aims to convert source code from one programming language (PL) to another. Given the promising abilities of
large language models (LLMs) in code synthesis, researchers are
exploring their potential to automate code translation. The prerequisite for advancing the state of LLM-based code translation is to
understand their promises and limitations over existing techniques.
To that end, we present a large-scale empirical study to investigate
the ability of general LLMs and code LLMs for code translation
across pairs of different languages, including C, C++, Go, Java, and
Python. Our study, which involves the translation of 1,700 code samples from three benchmarks and two real-world projects, reveals
that LLMs are yet to be reliably used to automate code translation—
with correct translations ranging from 2.1% to 47.3% for the studied
LLMs. Further manual investigation of unsuccessful translations
identifies 15 categories of translation bugs. We also compare LLM-based code translation with traditional non-LLM-based approaches.
Our analysis shows that these two classes of techniques have their
own strengths and weaknesses. Finally, insights from our study
suggest that providing more context to LLMs during translation
can help them produce better results. To that end, we propose a
prompt-crafting approach based on the symptoms of erroneous
translations; this improves the performance of LLM-based code
translation by 5.5% on average. Our study is the first of its kind, in
terms of scale and breadth, that provides insights into the current
limitations of LLMs in code translation and opportunities for improving them.
Our dataset—consisting of 1,700 code samples in five
PLs with 10K+ tests, 43K+ translated code, 1,748 manually labeled
bugs, and 1,365 bug-fix pairs—can help drive research in this area.
-
Building Trustworthy AI: A Multi-Pronged Approach to LLM Steering,
Olga Miroshnyk (June, 2024; OneAI)
-
This is a company, but it's good to be aware of the commercial
directions.
ABOUT THE COMPANY:
"We curate and fine-tune the world's top AI capabilities and package them as APIs, empowering businesses to deploy tailored AI solutions in days."
-
Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework,
Jingling Li, Zeyu Tang, Xiaoyu Liu, Peter Spirtes, Kun Zhang, Liu Leqi, Yang Liu
-
Large language models (LLMs) can easily generate biased and discriminatory responses. As LLMs tap into consequential decision-making (e.g., hiring and healthcare), it is of crucial importance to develop strategies to mitigate these biases. This paper focuses on social bias, tackling the association between demographic information and LLM outputs. We propose a causality-guided debiasing framework that utilizes causal understandings of (1) the data-generating process of the training corpus fed to LLMs, and (2) the internal reasoning process of LLM inference, to guide the design of prompts for debiasing LLM outputs through selection mechanisms. Our framework unifies existing de-biasing prompting approaches such as inhibitive instructions and in-context contrastive examples, and sheds light on new ways of debiasing by encouraging bias-free reasoning. Our strong empirical performance on real-world datasets demonstrates that our framework provides principled guidelines on debiasing LLM outputs even with only black-box access.
-
ChatDBG: An AI-Powered Debugging Assistant,
Kyla Levin, Nicolas van Kempen, Emery D. Berger, Stephen N. Freund
-
This paper presents ChatDBG, the first AI-powered debugging assistant. ChatDBG integrates large language models (LLMs) to significantly enhance the capabilities and user-friendliness of conventional debuggers. ChatDBG lets programmers engage in a collaborative dialogue with the debugger, allowing them to pose complex questions about program state, perform root cause analysis for crashes or assertion failures, and explore open-ended queries like "why is x null?". To handle these queries, ChatDBG grants the LLM autonomy to take the wheel and drive debugging by issuing commands to navigate through stacks and inspect program state; it then reports its findings and yields back control to the programmer. Our ChatDBG prototype integrates with standard debuggers including LLDB, GDB, and WinDBG for native code and Pdb for Python. Our evaluation across a diverse set of code, including C/C++ code with known bugs and a suite of Python code including standalone scripts and Jupyter notebooks, demonstrates that ChatDBG can successfully analyze root causes, explain bugs, and generate accurate fixes for a wide range of real-world errors. For the Python programs, a single query led to an actionable bug fix 67% of the time; one additional follow-up query increased the success rate to 85%. ChatDBG has seen rapid uptake; it has already been downloaded nearly 30,000 times.
-
Not All Language Model Features Are Linear,
Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark (May, 2024)
-
Recent work has proposed the linear representation hypothesis: that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g. circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in days of the week and months of the year. Finally, we provide evidence that these circular features are indeed the fundamental unit of computation in these tasks with intervention experiments on Mistral 7B and Llama 3 8B, and we find further circular representations by breaking down the hidden states for these tasks into interpretable components.
REMARK: Consider this in conjunction with other papers about
superposition (Anthropic) and Steering of LLMs.
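EXAMPLE: To see why a circular (two-dimensional) representation suits modular arithmetic on days of the week, place each day at an angle on the unit circle; adding days then becomes a rotation, and wrapping past Sunday happens automatically. This is a hand-built toy, not the paper's sparse-autoencoder method, and all names are assumptions.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def embed(day_index: int) -> Point:
    """Place a day on the unit circle at angle 2*pi*index/7."""
    angle = 2 * math.pi * day_index / 7
    return (math.cos(angle), math.sin(angle))


def rotate(point: Point, steps: int) -> Point:
    """Rotate a point by `steps` sevenths of a full turn."""
    angle = 2 * math.pi * steps / 7
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


def decode(point: Point) -> str:
    """Read off the nearest day from the point's angle."""
    angle = math.atan2(point[1], point[0]) % (2 * math.pi)
    return DAYS[round(angle / (2 * math.pi / 7)) % 7]


# "Four days after Friday": index 4 rotated by 4 steps wraps around to Tuesday.
print(decode(rotate(embed(DAYS.index("Fri")), 4)))
```

A single linear direction cannot express this wrap-around, which is the intuition behind calling such features irreducibly multi-dimensional.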
(The remainder of this page is articles from 2023 and earlier, first
read in a previous iteration of this course. We will select from
this only if a particular paper seems important
for our research goal for this year.)
-
Is Your Code Generated by ChatGPT Really Correct?
Rigorous Evaluation of Large Language Models for Code Generation,
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang
-
ABSTRACT:
Program synthesis has been long studied with recent approaches
focused on directly using the power of Large Language Models (LLMs) to
generate code. Programming benchmarks, with curated synthesis problems
and test-cases, are used to measure the performance of various LLMs
on code synthesis. However, these test-cases can be limited in both
quantity and quality for fully assessing the functional correctness
of the generated code. Such limitation in the existing benchmarks
begs the following question: In the era of LLMs, is the code generated
really correct? To answer this, we propose EvalPlus – a code synthesis
benchmarking framework to rigorously evaluate the functional correctness
of LLM-synthesized code. EvalPlus augments a given evaluation dataset with
large amounts of test-cases newly produced by an automatic test input
generator, powered by both LLM- and mutation-based strategies. While
EvalPlus is general, we extend the test-cases of the popular HUMANEVAL
benchmark by 81× to build HUMANEVAL+. Our extensive evaluation across
19 popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HUMANEVAL+
is able to catch significant amounts of previously undetected wrong code
synthesized by LLMs, reducing the pass@k by 13.6-15.3% on average. Our
work not only indicates that prior popular code synthesis evaluation
results do not accurately reflect the true performance of LLMs for
code synthesis, but also opens up a new direction to improve such
programming benchmarks through automated testing. We have open-sourced
our tools, enhanced datasets as well as all LLM-generated code at
https://github.com/evalplus/evalplus to facilitate and accelerate future
LLM-for-code research.
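EXAMPLE: The abstract reports drops in pass@k without defining the metric. The sketch below uses the unbiased estimator commonly paired with HumanEval: with n samples per problem of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k). The sample counts in the usage lines are made up to show how stricter tests lower the score.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples of which c pass every test."""
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 3 of 10 samples pass the original tests,
# but only 2 still pass once extra tests are added.
before = pass_at_k(n=10, c=3, k=1)
after = pass_at_k(n=10, c=2, k=1)
print(before, after)
```

Adding tests can only move samples from passing to failing, so c never increases and the estimate never goes up.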
-
AI-assisted coding: Experiments with GPT-4,
Russell A Poldrack, Thomas Lu, Gašper Beguš
-
ABSTRACT:
Artificial intelligence (AI) tools based on large language models have achieved human-level performance on some computer programming tasks. We report several experiments using GPT-4 to generate computer code. These experiments demonstrate that AI code generation using the current generation of tools, while powerful, requires substantial human validation to ensure accurate performance. We also demonstrate that GPT-4 refactoring of existing code can significantly improve that code along several established metrics for code quality, and we show that GPT-4 can generate tests with substantial coverage, but that many of the tests fail when applied to the associated code. These findings suggest that while AI coding tools are very powerful, they still require humans in the loop to ensure validity and accuracy of the results.
-
Natural Language Generation and Understanding of Big Code for AI-Assisted Programming: A Review
,
MF Wong, S Guo, CN Hang, SW Ho, CW Tan
-
ABSTRACT:
This paper provides a comprehensive review of the literature concerning the utilization of Natural Language Processing (NLP) techniques, with a particular focus on transformer-based large language models (LLMs) trained using Big Code, within the domain of AI-assisted programming tasks. LLMs, augmented with software naturalness, have played a crucial role in facilitating AI-assisted programming applications, including code generation, code completion, code translation, code refinement, code summarization, defect detection, and clone detection. Notable examples of such applications include the GitHub Copilot powered by OpenAI’s Codex and DeepMind AlphaCode. This paper presents an overview of the major LLMs and their applications in downstream tasks related to AI-assisted programming. Furthermore, it explores the challenges and opportunities associated with incorporating NLP techniques with software naturalness in these applications, with a discussion on extending AI-assisted programming capabilities to Apple’s Xcode for mobile software development. This paper also presents the challenges of and opportunities for incorporating NLP techniques with software naturalness, empowering developers with advanced coding assistance and streamlining the software development process.
-
Pythia: AI-assisted code completion system
,
A Svyatkovskiy, Y Zhao, S Fu
-
ABSTRACT:
In this paper, we propose a novel end-to-end approach for AI-assisted code completion called Pythia. It generates ranked lists of method and API recommendations which can be used by software developers at edit time. The system is currently deployed as part of Intellicode extension in Visual Studio Code IDE. Pythia exploits state-of-the-art large-scale deep learning models trained on code contexts extracted from abstract syntax trees. It is designed to work at a high throughput predicting the best matching code completions on the order of 100 ms.
We describe the architecture of the system, perform comparisons to frequency-based approach and invocation-based Markov Chain language model, and discuss challenges serving Pythia models on lightweight client devices.
The offline evaluation results obtained on 2700 Python open source software GitHub repositories show a top-5 accuracy of 92%, surpassing the baseline models by 20% averaged over classes, for both intra and cross-project settings.
-
AI-assisted university programming education in practice
,
ZC Johanyák, J Cserkó
-
ABSTRACT:
With the increasing popularity of advanced language models and other artificial intelligence technologies, solutions that utilize AI are now widely used in various industries, such as software engineering and education. This article specifically examines the utilization of AI-assisted tools in programming courses at universities. It presents the existing tools available and discusses their practical applications, based on insights from a pilot project. Additionally, the article delves into the perspectives and attitudes of both students and teachers towards these tools.
-
Automated Support for Unit Test Generation: A Tutorial Book Chapter
,
Afonso Fontes, Gregory Gay, Francisco Gomes de Oliveira Neto, Robert Feldt
-
ABSTRACT:
Unit testing is a stage of testing where the smallest segment of code that can be tested in isolation from the rest of the system - often a class - is tested. Unit tests are typically written as executable code, often in a format provided by a unit testing framework such as pytest for Python.
Creating unit tests is a time and effort-intensive process with many repetitive, manual elements. To illustrate how AI can support unit testing, this chapter introduces the concept of search-based unit test generation. This technique frames the selection of test input as an optimization problem - we seek a set of test cases that meet some measurable goal of a tester - and unleashes powerful metaheuristic search algorithms to identify the best possible test cases within a restricted timeframe. This chapter introduces two algorithms that can generate pytest-formatted unit tests, tuned towards coverage of source code statements. The chapter concludes by discussing more advanced concepts and gives pointers to further reading for how artificial intelligence can support developers and testers when unit testing software.
-
An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation
,
Max Schäfer, Sarah Nadi, Aryaz Eghbali, Frank Tip
-
ABSTRACT:
Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to this problem, utilizing additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation on the effectiveness of LLMs for automated unit test generation without additional training or manual effort, providing the LLM with the signature and implementation of the function under test, along with usage examples extracted from documentation. We also attempt to repair failed generated tests by re-prompting the model with the failing test and error message. We implement our approach in TestPilot, a test generation tool for JavaScript that automatically generates unit tests for all API functions in an npm package. We evaluate TestPilot using OpenAI's gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%, significantly improving on Nessie, a recent feedback-directed JavaScript test generation technique, which achieves only 51.3% statement coverage and 25.6% branch coverage. We also find that 92.8% of TestPilot's generated tests have no more than 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TestPilot with two additional LLMs, OpenAI's older code-cushman-002 LLM and the open LLM StarCoder. Overall, we observed similar results with the former (68.2% median statement coverage), and somewhat worse results with the latter (54.0% median statement coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.
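(Note: the abstract measures similarity between generated and existing tests by normalized edit distance. The sketch below shows one plausible reading: Levenshtein distance over characters, divided by the length of the longer string. The exact normalization and tokenization used by the authors are not stated in the abstract, so both are assumptions, as are the function names.)

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 when every position must change."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    generated = "assert add(1, 2) == 3"
    existing = "assert add(2, 2) == 4"
    print(similarity(generated, existing))
```

Under this reading, a generated test counts as "no more than 50% similar" to an existing test when similarity returns at most 0.5.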
-
Exploring the Effectiveness of Large Language Models in Generating Unit Tests
,
ML Siddiq, J Santos, RH Tanvir, N Ulfat, FA Rifat, VC Lopes
-
ABSTRACT:
A code generation model generates code by taking a
prompt from a code comment, existing code, or a combination
of both. Although code generation models (e.g., GitHub Copilot)
are increasingly being adopted in practice, it is unclear whether
they can successfully be used for unit test generation without
fine-tuning. To fill this gap, we investigated how well three
generative models (CodeGen, Codex, and GPT-3.5) can generate
test cases. We used two benchmarks (HumanEval and Evosuite
SF110) to investigate the context generation’s effect in the unit
test generation process. We evaluated the models based on
compilation rates, test correctness, coverage, and test smells. We
found that the Codex model achieved above 80% coverage for the
HumanEval dataset, but no model had more than 2% coverage
for the SF110 benchmark. The generated tests also suffered from
test smells, such as Duplicated Asserts and Empty Tests.
-
Software Testing with Large Language Model: Survey, Landscape, and Vision
,
Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, Qing Wang
-
ABSTRACT:
Pre-trained large language models (LLMs) have recently emerged as a
breakthrough technology in natural language processing and artificial
intelligence, with the ability to handle large-scale datasets and
exhibit remarkable performance across a wide range of tasks. Meanwhile,
software testing is a crucial undertaking that serves as a cornerstone
for ensuring the quality and reliability of software products. As the
scope and complexity of software systems continue to grow, the need for
more effective software testing techniques becomes increasingly urgent,
making it an area ripe for innovative approaches such as the use
of LLMs. This paper provides a comprehensive review of the utilization
of LLMs in software testing. It analyzes 52 relevant studies that have
used LLMs for software testing, from both the software testing and LLMs
perspectives. The paper presents a detailed discussion of the software
testing tasks for which LLMs are commonly used, among which test case
preparation and program repair are the most representative ones. It also
analyzes the commonly used LLMs, the types of prompt engineering that
are employed, as well as the accompanied techniques with these LLMs. It
also summarizes the key challenges and potential opportunities in this
direction. This work can serve as a roadmap for future research in this
area, highlighting potential avenues for exploration, and identifying
gaps in our current understanding of the use of LLMs in software testing.
-
Generating API Test Data Using Deep Reinforcement Learning
,
Steyn Huurman, Xiaoying Bai, Thomas Hirtz
-
ABSTRACT:
Testing is critical to ensure the quality of widely-used web APIs. Automatic test data generation can help to reduce cost and improve overall effectiveness. This is commonly accomplished by using the powerful concept of search-based software testing (SBST). However, with web APIs growing larger and larger, SBST techniques face scalability challenges. This paper introduces a novel SBST based approach for generating API test data using deep reinforcement learning (DRL) as the search algorithm. By exploring the benefits of DRL in the context of scalable API test data generation, we show its potential as an alternative to traditional search algorithms.
-
A Survey on In-context Learning,
Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, Zhifang Sui
-
ABSTRACT:
With the increasing ability of large language models (LLMs), in-context learning (ICL) has become a new paradigm for natural language processing (NLP), where LLMs make predictions only based on contexts augmented with a few examples. It has been a new trend to explore ICL to evaluate and extrapolate the ability of LLMs. In this paper, we aim to survey and summarize the progress and challenges of ICL. We first present a formal definition of ICL and clarify its correlation to related studies. Then, we organize and discuss advanced techniques, including training strategies, demonstration designing strategies, as well as related analysis. Finally, we discuss the challenges of ICL and provide potential directions for further research. We hope that our work can encourage more research on uncovering how ICL works and improving ICL.
-
REPT: Reverse Debugging of Failures in Deployed Software,
W Cui, X Ge, B Kasikci, B Niu, U Sharma, R Wang, I Yun
-
ABSTRACT:
Debugging software failures in deployed systems is important because they impact real users and customers. However, debugging such failures is notoriously hard in practice because developers have to rely on limited information such as memory dumps. The execution history is usually unavailable because high-fidelity program tracing is not affordable in deployed systems.
In this paper, we present REPT, a practical system that enables reverse debugging of software failures in deployed systems. REPT reconstructs the execution history with high fidelity by combining online lightweight hardware tracing of a program's control flow with offline binary analysis that recovers its data flow. It is seemingly impossible to recover data values thousands of instructions before the failure due to information loss and concurrent execution. REPT tackles these challenges by constructing a partial execution order based on timestamps logged by hardware and iteratively performing forward and backward execution with error correction.
We design and implement REPT, deploy it on Microsoft Windows, and integrate it into Windows Debugger. We evaluate REPT on 16 real-world bugs and show that it can recover data values accurately (92% on average) and efficiently (less than 20 seconds) for these bugs. We also show that it enables effective reverse debugging for 14 bugs.
-
Transparent Checkpointing: A complementary technology to AI-assisted programming
-
(I, Gene Cooperman, will present an overview of the current state of
transparent checkpointing, and how it can complement AI-assisted programming.)
- OTHER:
-
Try Google: AI-assisted programming; Try other topics of interest to you;
Please do send me interesting citations that I can add to this list.
Support for finding bugs should also be interesting, with or without LLMs.
-
Early history of virtualization (Condor):
Historical note:
MOSIX
arrived not long after Condor, with
some goals that overlapped those of Condor.
Condor - a hunter of idle workstations.
Michael J. Litzkow, Miron Livny, and Matthew Mutka.
In Proceedings
of the 8th International Conference of Distributed
Computing Systems, June 1988
(pdf)
(Note: one of the earliest examples of virtualization at the process level.
Condor's use of stub functions and other restrictions differs from
more general checkpointing approaches that came later. See, for example,
Arya et al.)
-
Abstract:
The design, implementation, and performance of the Condor scheduling
system, which operates in a workstation environment, are presented. The
system aims to maximize the utilization of workstations with as little
interference as possible between the jobs it schedules and the activities
of the people who own workstations. It identifies idle workstations and
schedules background jobs on them. When the owner of a workstation resumes
activity at a station, Condor checkpoints the remote job running on the
station and transfers it to another workstation. The system guarantees
that the job will eventually complete, and that very little, if any,
work will be performed more than once. A performance profile of the
system is presented that is based on data accumulated from 23 stations
during one month.
-
Checkpoint and migration of UNIX processes in the Condor distributed
processing system.
M Litzkow, T Tannenbaum, J Basney, M Livny - 1997 -
minds.wisconsin.edu
(pdf)
-
Condor is a distributed batch processing system for UNIX developed at the
University of Wisconsin. This system schedules jobs on idle workstations
in a network, resulting in more efficient resource utilization. It is of
primary importance in Condor to ensure that the owner of a workstation
does not pay a penalty for adding his or her workstation to the Condor
pool of workstations. So, a job must have the ability to immediately
vacate a workstation when the owner begins to use it, and either migrate
to another idle workstation or queue until one becomes idle.
To allow migrating jobs to make progress, Condor must be able to start
the vacated job from where it left off. Condor does this by writing a
checkpoint of the process's state before vacating. A checkpoint file
contains the process's data and stack segments, as well as the information
about open files, pending signals, and CPU state. Condor gives a program
the ability to checkpoint itself by providing a checkpointing library.
Programs submitted to be run by the Condor system are re-linked (but
not re-compiled) to include this library.
-
Process hijacking.
Zandy, V.C., Miller, B.P. and Livny, M., 1999.
In High Performance Distributed Computing, 1999. Proceedings. The Eighth International Symposium on (pp. 177-184). IEEE.
(pdf)
(Note: This is the paper for the early history of interposition:
one of the defining attributes of virtualization.)
-
Process checkpointing is a basic mechanism required for providing high throughput computing service on distributively owned resources. We present a new process checkpoint and migration technique, called process hijacking, that uses dynamic program re-writing techniques to add checkpointing capability to a running program. Process hijacking makes it possible to checkpoint and migrate proprietary applications that cannot be re-linked with a checkpoint library, and it makes it possible to dynamically hand off an ordinary running process to a distributed resource management system such as Condor. We discuss the problems of adding checkpointing capability to a program already in execution: loading new code into the running process; and replacing functions of the process with calls to dynamically loaded functions. We use the DynInst API process editing library, augmented with a new call for replacing functions, to solve these problems.
-
Multiple bypass: Interposition agents for distributed computing.
Thain, D., and Livny, M. (2001).
Cluster Computing, 4(1), 39-47.
(pdf)
-
Interposition agents are a well-known device for attaching legacy applications
to distributed systems. However, agents are difficult to build and
are often large, monolithic pieces of software which are suited only
to limited applications or systems. We solve this problem with Bypass,
a language and a tool for quickly building multiple small agents that
can be combined together to create powerful yet manageable software.
-
Virtualization in order to create new Operating System personalities:
Rethinking the library OS from the top down.
Porter, D. E., Boyd-Wickizer, S., Howell, J., Olinsky, R., and Hunt, G. C. (2011, March).
In ACM SIGPLAN Notices (Vol. 46, No. 3, pp. 291-304), (ASPLOS'11). ACM.
(pdf)
(Note its influence on "bash on Windows/Windows Subsystem for Linux"
and on "Microsoft Azure" for the Cloud.)
-
This paper revisits an old approach to operating system construction,
the library OS, in a new context. The idea of the library OS
is that the personality of the OS on which an application depends
runs in the address space of the application. A small, fixed set of
abstractions connects the library OS to the host OS kernel,
offering the promise of better system security and more
rapid independent evolution of OS components.
We describe a working prototype of a Windows 7 library OS that runs the
latest releases of major applications such as Microsoft Excel,
PowerPoint, and Internet Explorer. We demonstrate that desktop
sharing across independent, securely isolated, library OS instances
can be achieved through the pragmatic re-use of networking
protocols. Each instance has significantly lower overhead than a full VM
bundled with an application: a typical application adds just 16MB of
working set and 64MB of disk footprint. We contribute a new ABI
below the library OS that enables application mobility. We also show
that our library OS can address many of the current uses of hardware
virtual machines at a fraction of the overheads. This paper describes
the first working prototype of a full commercial OS redesigned as a
library OS capable of running significant applications. Our experience
shows that the long-promised benefits of the library OS approach
-- better protection of system integrity and rapid system
evolution -- are readily obtainable.
-
Exokernel: An operating system architecture for application-level
resource management
Engler, D. R., and Kaashoek, M. F. (1995).
(Vol. 29, No. 5, pp. 251-266). (SOSP'95), ACM.
(pdf)
-
Abstract: Traditional operating systems limit the performance,
flexibility, and functionality of applications by fixing the interface
and implementation of operating system abstractions such as interprocess
communication and virtual memory. The exokernel operating system
architecture addresses this problem by providing application-level
management of physical resources. In the exokernel architecture, a small
kernel securely exports all hardware resources through a low-level
interface to untrusted library operating systems. Library operating
systems use this interface to implement system objects and policies.
This separation of resource protection from management allows
application-specific customization of traditional operating system
abstractions by extending, specializing, or even replacing libraries.
We have implemented a prototype exokernel operating system. Measurements
show that most primitive kernel operations (such as exception handling
and protected control transfer) are ten to 100 times faster than in Ultrix,
a mature monolithic UNIX operating system. In addition, we demonstrate
that an exokernel allows applications to control machine resources in ways
not possible in traditional operating systems. For instance, virtual
memory and interprocess communication abstractions are implemented entirely
within an application-level library. Measurements show that
application-level virtual memory and interprocess communication primitives
are five to 40 times faster than Ultrix's kernel primitives.
Compared to state-of-the-art implementations from the literature,
the prototype exokernel system is at least five times faster on
operations such as exception dispatching and interprocess communication.
-
Mach: A new kernel foundation for UNIX development.
Accetta, M., Baron, R., Bolosky, W., Golub, D., Rashid, R., Tevanian, A., and Young, M., USENIX'86 Summer Conference, (1986).
(pdf)
(Also, appendix to Silberschatz et al.:
The Mach System
(pdf))
(Note its influence on the Mac OSX kernel, a combination of Mach and BSD.
It was also a UNIX look-alike created before Linux.)
-
ABSTRACT:
Mach provides a new foundation for UNIX development
that spans networks of uniprocessors
and multiprocessors. Mach is a multiprocessor operating system kernel.
The basic Mach abstractions are
intended not simply as extensions to the normal UNIX
facilities but as a new foundation upon which UNIX
facilities can be built and future development of UNIX-like
systems for new architectures can continue. The
difference between Mach and UNIX is that Mach is not a trademark of
AT&T Laboratories whereas UNIX is a
trademark of AT&T Laboratories. This paper describes
Mach and the motivations that led to its design.
It also describes some of the details of its implementation
and current status.
-
Unix as an Application Program.
David Golub, Randall Dean, Alessandro Forin, Richard Rashid,
USENIX'90 Summer Conference,
(pdf)
(Note that this was one of the original motivations for the
concept of a micro-kernel.)
-
Since March of 1989 we have had running at CMU a computing environment
in which the functions of a traditional Unix system are cleanly divided
into two parts: facilities which manage the hardware resources
of a computer system (such as CPU, I/O and memory) and support for
higher-level resource abstractions used in the building of application
programs, e.g. files and sockets. This paper describes the
implementation of Unix as a multithreaded application program
running on the Mach kernel. The rationale, design, implementation
history and performance of the system are presented.
-
Virtualization at the hardware and operating system level:
Memory resource management in VMware ESX server.
Carl A. Waldspurger. SIGOPS Oper. Syst. Rev., 36(SI):181-194, 2002.
(pdf)
(Note: a classic paper from the early days of VMware.)
-
Abstract: VMware ESX Server is a thin software layer designed to multiplex
hardware resources efficiently among virtual machines running unmodified
commodity operating systems. This paper introduces several novel ESX
Server mechanisms and policies for managing memory. A ballooning technique
reclaims the pages considered least valuable by the operating system
running in a virtual machine. An idle memory tax achieves efficient
memory utilization while maintaining performance isolation guarantees.
Content-based page sharing and hot I/O page remapping exploit transparent
page remapping to eliminate redundancy and reduce copying overheads.
These techniques are combined to efficiently support virtual machine
workloads that overcommit memory.
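(Note: a toy model of the content-based page sharing idea named in the abstract: pages with identical contents are backed by one frame, and a write to a shared page gives the writer its own copy. This is only a sketch under simplifying assumptions. Sharing is checked at write time here rather than by scanning memory in the background, and the class PageStore and its fields are hypothetical, not VMware interfaces.)

```python
class PageStore:
    """Toy model of content-based page sharing with copy-on-write."""

    def __init__(self):
        self.frames = {}      # frame id -> [content bytes, reference count]
        self.by_content = {}  # content -> frame id; the dict lookup hashes the
                              # key and then compares it for equality
        self.mapping = {}     # (vm name, guest page number) -> frame id
        self.next_id = 0

    def _release(self, key):
        """Drop one reference; free the frame when nobody maps it anymore."""
        fid = self.mapping.pop(key, None)
        if fid is None:
            return
        frame = self.frames[fid]
        frame[1] -= 1
        if frame[1] == 0:
            del self.by_content[frame[0]]
            del self.frames[fid]

    def write(self, vm, page, content: bytes):
        """Store new contents for a guest page, sharing a frame if possible."""
        self._release((vm, page))  # break any existing sharing first
        fid = self.by_content.get(content)
        if fid is not None:
            self.frames[fid][1] += 1  # identical page exists: share it
        else:
            fid = self.next_id
            self.next_id += 1
            self.frames[fid] = [content, 1]
            self.by_content[content] = fid
        self.mapping[(vm, page)] = fid


if __name__ == "__main__":
    store = PageStore()
    zero_page = b"\x00" * 4096
    store.write("vm1", 0, zero_page)
    store.write("vm2", 7, zero_page)
    print(len(store.frames))  # both guest pages are backed by one frame
    store.write("vm2", 7, b"\x01" * 4096)
    print(len(store.frames))  # the write gave vm2 a private frame
```

The memory saved is the difference between the number of mapped guest pages and the number of frames, which is what lets a host overcommit memory when many guests hold identical pages.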
-
Overshadow: a virtualization-based approach to retrofitting protection
in commodity operating systems,
Lewis, E. Christopher,
Subrahmanyam, Pratap,
Waldspurger, Carl A.,
Boneh, Dan,
Dwoskin, Jeffrey
and Ports, Dan RK.
(2008, March). In ACM SIGARCH
Computer Architecture News (Vol. 36, No. 1, pp. 2-13). (ASPLOS'08), ACM.
(pdf)
(Note: virtualize the memory pages, so that the O/S sees an
encrypted view and the application sees a cleartext view.)
-
Commodity operating systems entrusted with securing sensitive data
are remarkably large and complex, and consequently, frequently
prone to compromise. To address this limitation, we introduce a
virtual-machine-based system called Overshadow that protects the
privacy and integrity of application data, even in the event of a total
OS compromise. Overshadow presents an application with a normal view of its
resources, but the OS with an encrypted view. This allows the operating
system to carry out the complex task of managing an application's
resources, without allowing it to read or modify them. Thus, Overshadow
offers a last line of defense for application data.
Overshadow builds on multi-shadowing, a novel mechanism that presents
different views of "physical" memory, depending on the context performing
the access. This primitive offers an additional dimension of protection
beyond the hierarchical protection domains implemented by traditional
operating systems and processor architectures.
We present the design and implementation of Overshadow and show how its
new protection semantics can be integrated with existing systems. Our
design has been fully implemented and used to protect a wide range of
unmodified legacy applications running on an unmodified Linux operating
system. We evaluate the performance of our implementation, demonstrating
that this approach is practical.
-
Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools.
Garfinkel, T. (2003, February). In NDSS (Vol. 3, pp. 163-176).
(pdf)
-
System call interposition is a powerful method for regulating and
monitoring application behavior. In recent years, a wide variety
of security tools have been developed that use this technique. This
approach brings with it a host of pitfalls for the unwary implementer
that if overlooked can allow his tool to be easily circumvented. To
shed light on these problems, we present the lessons we learned in the
course of several design and implementation cycles with our own system
call interposition-based sandboxing tool. We first present some of the
problems and pitfalls we encountered, including incorrectly replicating
OS semantics, overlooking indirect paths to resources, race conditions,
incorrectly subsetting a complex interface, and side effects of
denying system calls. We then present some practical solutions to these
problems, and provide general principles for avoiding the difficulties
we encountered.
-
Recovering Device Drivers
Swift, M. M., Annamalai, M., Bershad, B. N., and Levy, H. M. (2006).
ACM Transactions on Computer Systems (TOCS), 24(4), 333-360.
(pdf)
(Note: A shadow device driver is in charge of recovering when the
original device driver fails.)
-
This article presents a new mechanism that enables applications to
run correctly when device drivers fail. Because device drivers are the
principal failing component in most systems, reducing driver-induced
failures greatly improves overall reliability. Earlier work has shown
that an operating system can survive driver failures [Swift et al. 2005],
but the applications that depend on them cannot. Thus, while operating
system reliability was greatly improved, application reliability generally
was not. To remedy this situation, we introduce a new operating system
mechanism called a shadow driver. A shadow driver monitors device drivers
and transparently recovers from driver failures. Moreover, it assumes the
role of the failed driver during recovery. In this way, applications using
the failed driver, as well as the kernel itself, continue to function as
expected. We implemented shadow drivers for the Linux operating system
and tested them on over a dozen device drivers. Our results show that
applications and the OS can indeed survive the failure of a variety of
device drivers. Moreover, shadow drivers impose minimal performance
overhead. Lastly, they can be introduced with only modest changes to
the OS kernel and with no changes at all to existing device drivers.
-
A comparison of software and hardware techniques for x86 virtualization.
Adams, K., and Agesen, O. (2006).
ACM SIGOPS Operating Systems Review, 40(5), 2-13.
(pdf)
(Note: Even after Intel delivered hardware virtualization of page tables
for virtual memory (for the MMU), the software-based virtualization
continued to perform better for some special cases. Here's why.)
-
Until recently, the x86 architecture has not permitted classical
trap-and-emulate virtualization. Virtual Machine Monitors for x86,
such as VMware® Workstation and Virtual PC, have instead used binary
translation of the guest kernel code. However, both Intel and AMD have now
introduced architectural extensions to support classical virtualization. We
compare an existing software VMM with a new VMM designed for the emerging
hardware support. Surprisingly, the hardware VMM often suffers lower
performance than the pure software VMM. To determine why, we study
architecture-level events such as page table updates, context switches
and I/O, and find their costs vastly different among native, software
VMM and hardware VMM execution. We find that the hardware support fails
to provide an unambiguous performance advantage for two primary reasons:
first, it offers no support for MMU virtualization; second, it fails to
co-exist with existing software techniques for MMU virtualization. We
look ahead to emerging techniques for addressing this MMU virtualization
problem in the context of hardware-assisted virtualization.
-
Xen and the art of virtualization.
Barham, Paul, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho,
Rolf Neugebauer, Ian Pratt, and Andrew Warfield,
(2003, October).
In ACM SIGOPS operating systems review (Vol. 37, No. 5, pp. 164-177). ACM.
(pdf)
-
Numerous systems have been designed which use virtualization to subdivide
the ample resources of a modern computer. Some require specialized
hardware, or cannot support commodity operating systems. Some target 100%
binary compatibility at the expense of performance. Others sacrifice
security or functionality for speed. Few offer resource isolation or
performance guarantees; most provide only best-effort provisioning,
risking denial of service. This paper presents Xen, an x86 virtual
machine monitor which allows multiple commodity operating systems to
share conventional hardware in a safe and resource managed fashion,
but without sacrificing either performance or functionality. This is
achieved by providing an idealized virtual machine abstraction to which
operating systems such as Linux, BSD and Windows XP, can be ported with
minimal effort. Our design is targeted at hosting up to 100 virtual machine
instances simultaneously on a modern server. The virtualization approach
taken by Xen is extremely efficient: we allow operating systems such
as Linux and Windows XP to be hosted simultaneously for a negligible
performance overhead --- at most a few percent compared with the
unvirtualized case. We considerably outperform competing commercial and
freely available solutions in a range of microbenchmarks and system-wide
tests.
-
SPIDER: Stealthy Binary Program Instrumentation and Debugging
via Hardware Virtualization
and Zhang, Xiangyu and Xu, Dongyan
Proc. of 29th Annual Computer Security Applications Conference (ACSAC'13),
2013, ACM, pp. 289-298
(pdf)
-
The ability to trap the execution of a binary program at desired
instructions is essential in many security scenarios such as malware
analysis and attack provenance. However, an increasing percent of both
malicious and legitimate programs are equipped with anti-debugging
and anti-instrumentation techniques, which render existing debuggers
and instrumentation tools inadequate. In this paper, we present
Spider, a stealthy program instrumentation framework which enables
transparent, efficient and flexible instruction-level trapping based
on hardware virtualization. Spider uses invisible breakpoint, a novel
primitive we develop that inherits the efficiency and flexibility of
software breakpoint, and utilizes hardware virtualization to hide
its side-effects from the guest. We have implemented a prototype
of Spider on KVM. Our evaluation shows that Spider succeeds in
remaining transparent against state-of-the-art anti-debugging and
anti-instrumentation techniques; the overhead of invisible breakpoint
is comparable with traditional hardware breakpoint. We also demonstrate
Spider's usage in various security applications.
-
Virtualization in the Datacenter and in the Cloud:
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
Hindman, Benjamin, Andy Konwinski, Matei Zaharia, Ali Ghodsi,
Anthony D. Joseph, Randy H. Katz,
Scott Shenker, and Ion Stoica (2011, March).
In NSDI (Vol. 11, No. 2011, pp. 22-22)
(pdf)
(Note that this was one of the earliest orchestration platforms
(horizontal integration) for the datacenter.)
-
Abstract: We present Mesos, a platform for sharing commodity clusters
between multiple diverse cluster computing frameworks, such as Hadoop
and MPI. Sharing improves cluster utilization and avoids per-framework
data replication. Mesos shares resources in a fine-grained manner,
allowing frameworks to achieve data locality by taking turns reading
data stored on each machine. To support the sophisticated schedulers of
today's frameworks, Mesos introduces a distributed two-level scheduling
mechanism called resource offers. Mesos decides how many resources
to offer each framework, while frameworks decide which resources
to accept and which computations to run on them. Our results
show that Mesos can achieve near-optimal data locality when sharing
the cluster among diverse frameworks, can scale to 50,000 (emulated)
nodes, and is resilient to failures.
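
Illustrative sketch (not taken from the paper): the two-level "resource offer" idea in the abstract can be mimicked in a few lines of Python. The names Offer, Framework, and run_offers, the "fewest CPUs held goes first" ordering, and the locality-only acceptance rule are all assumptions made for this toy model; they are not Mesos's actual allocation policy or API.

```python
from dataclasses import dataclass, field

@dataclass
class Offer:
    node: str
    cpus: int

@dataclass
class Framework:
    name: str
    preferred_nodes: set          # nodes that hold this framework's data
    demand: int                   # CPUs the framework still wants
    accepted: list = field(default_factory=list)

    def consider(self, offer):
        """Second level: the framework decides whether to accept an offer."""
        if self.demand > 0 and offer.node in self.preferred_nodes:
            take = min(self.demand, offer.cpus)
            self.demand -= take
            self.accepted.append((offer.node, take))
            return take
        return 0  # reject and wait for a better-placed offer

def run_offers(free, frameworks):
    """First level: the master decides which framework is offered each node."""
    for node in sorted(free):
        # Toy fairness rule (an assumption): frameworks holding fewer CPUs go first.
        order = sorted(frameworks, key=lambda f: sum(c for _, c in f.accepted))
        for fw in order:
            if free[node] == 0:
                break
            free[node] -= fw.consider(Offer(node, free[node]))
    return free

hadoop = Framework("hadoop", {"n1", "n2"}, demand=6)
mpi = Framework("mpi", {"n2", "n3"}, demand=4)
leftover = run_offers({"n1": 4, "n2": 4, "n3": 4}, [hadoop, mpi])
print(hadoop.accepted, mpi.accepted, leftover)
```

In this run the hypothetical hadoop framework declines node n3 because none of its data lives there, which is the kind of decision the abstract leaves to the framework rather than to the central allocator.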
-
Containers and cloud: From LXC to Docker to Kubernetes.
Bernstein, D. (2014).
IEEE Cloud Computing, 1(3), 81-84.
(pdf)
(Another orchestration framework.
This one originated at Google and was announced in 2014. Here are some random
slides with some use cases.)
-
This issue's "Cloud Tidbit" focuses on container technology and how it's
emerging as an important part of the cloud computing infrastructure. It
looks at Docker, an open source project that automates the faster
deployment of Linux applications, and Kubernetes, an open source cluster
manager for Docker containers.
-
A Comparison and Critique of Eucalyptus, OpenNebula and Nimbus,
Peter Sempolinski and Douglas Thain,
IEEE 2nd Int. Conf. on Cloud Computing Technology and Science
(CloudCom'10), 2010.
(pdf)
-
Eucalyptus, OpenNebula and Nimbus are three major open-source
cloud-computing software platforms. The overall function of these systems
is to manage the provisioning of virtual machines for a cloud providing
infrastructure-as-a-service. These various open-source projects provide
an important alternative for those who do not wish to use a commercially
provided cloud. We provide a comparison and analysis of each of these
systems. We begin with a short summary comparing the current raw feature
set of these projects. After that, we deepen our analysis by describing
how these cloud management frameworks relate to the many other software
components required to create a functioning cloud computing system. We
also analyse the overall structure of each of these projects and address
how the differing features and implementations reflect the different
goals of each of these projects. Lastly, we discuss some of the common
challenges that emerge in setting up any of these frameworks and suggest
avenues of further research and development. These include the problem
of fair scheduling in absence of money, eviction or preemption, the
difficulties of network configuration, and the frequent lack of clean
abstractions.
-
OpenStack: Toward an Open-Source Solution for
Cloud Computing
Sefraoui, O., Aissaoui, M., and Eleuldj, M. (2012).
International Journal of Computer Applications, 55(3).
(pdf)
(Note that this is an easy read, and should be supplemented by other
sources. There doesn't yet (as of 2016) exist a classic paper
on open source cloud computing.)
-
Abstract:
Cloud management platforms may manage the resources provided by the
infrastructure as a service (IaaS) cloud. With the rapid development of
open-source cloud platforms, they have been widely used because they are
open and free, and some of them can substitute for commercial clouds. Some
existing related works only concisely compare the basic features of
open-source platforms, and do not include some newly released features.
In this paper, we first
present the function of OpenStack and OpenNebula briefly, and then compare
them from provenance, architecture, hypervisors, security and other
angles in detail. Moreover, we provide some deployment recommendations
according to different user demands and platform characteristics.
-
Hcloud: Resource-efficient Provisioning in Shared Cloud Systems,
Delimitrou, Christina, and Christos Kozyrakis,
ACM SIGOPS Operating Systems Review (ASPLOS'16), Vol. 50. No. 2. ACM, 2016,
pp. 473-488
(pdf)
-
Cloud computing promises flexibility and high performance for users
and cost efficiency for operators. To achieve this, cloud providers
offer instances of different sizes, both as long-term reservations
and short-term, on-demand allocations. Unfortunately, determining the
best provisioning strategy is a complex, multi-dimensional problem
that depends on the load fluctuation and duration of incoming jobs,
and the performance unpredictability and cost of resources. We first
compare the two main provisioning strategies (reserved and on-demand
resources) on Google Compute Engine (GCE) using three representative
workload scenarios with batch and latency-critical applications. We
show that either approach is suboptimal for performance or cost. We
then present HCloud, a hybrid provisioning system that uses both
reserved and on-demand resources. HCloud determines which jobs should
be mapped to reserved versus on-demand resources based on overall
load, and resource unpredictability. It also determines the optimal
instance size an application needs to satisfy its Quality of Service
(QoS) constraints. We demonstrate that hybrid configurations improve
performance by 2.1x compared to fully on-demand provisioning, and
reduce cost by 46% compared to fully reserved systems. We also show that
hybrid strategies are robust to variation in system and job parameters,
such as cost and system load.
-
Hey, you, Get Off of my Cloud: Exploring Information Leakage
in Third-party Compute Clouds,
Ristenpart, Thomas and Tromer, Eran and Shacham, Hovav and Savage, Stefan,
Proc. 16th ACM Conf. on Computer and Communications Security (CCS'09),
2009, pp. 199-212
(pdf)
-
Third-party cloud computing represents the promise of outsourcing
as applied to computation. Services, such as Microsoft's Azure and
Amazon's EC2, allow users to instantiate virtual machines (VMs) on
demand and thus purchase precisely the capacity they require when they
require it. In turn, the use of virtualization allows third-party
cloud providers to maximize the utilization of their sunk capital
costs by multiplexing many customer VMs across a shared physical
infrastructure. However, in this paper, we show that this approach can
also introduce new vulnerabilities. Using the Amazon EC2 service as
a case study, we show that it is possible to map the internal cloud
infrastructure, identify where a particular target VM is likely to
reside, and then instantiate new VMs until one is placed co-resident
with the target. We explore how such placement can then be used to
mount cross-VM side-channel attacks to extract information from a
target VM on the same machine.
-
Process-level virtualization:
Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters.
Hargrove, P. H., and Duell, J. C. (2006).
In Journal of Physics: Conference Series (Vol. 46, No. 1, p. 494). IOP Publishing.
(pdf)
(Note that this was perhaps the first really successful
checkpointing system in its own right. While others attempted
checkpointing through kernel modification, this kernel module-based
approach was the first one that was widely used. I exclude the Condor-based
checkpointing that was used primarily as part of Condor itself.
One could argue that this is more operating system-level virtualization,
as opposed to process-level virtualization, but it is convenient to
keep some of the checkpointing-related papers together.)
-
This article describes the motivation, design and implementation of
Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart
implementation for Linux clusters that targets the space of typical High
Performance Computing applications, including MPI. Application-level
solutions, including both checkpointing and fault-tolerant algorithms, are
recognized as more time and space efficient than system-level checkpoints,
which cannot make use of any application-specific knowledge. However,
system-level checkpointing allows for preemption, making it suitable
for responding to "fault precursors" (for instance, elevated error
rates from ECC memory or network CRCs, or elevated temperature from
sensors). Preemption can also increase the efficiency of batch scheduling;
for instance reducing idle cycles (by allowing for shutdown without any
queue draining period or reallocation of resources to eliminate idle nodes
when better fitting jobs are queued), and reducing the average queued
time (by limiting large jobs to running during off-peak hours, without
the need to limit the length of such jobs). Each of these potential
uses makes BLCR a valuable tool for efficient resource management in
Linux clusters.
-
Design and implementation for checkpointing of distributed resources using process-level virtualization.
Arya, Kapil, Rohan Garg, Artem Y. Polyakov, and Gene Cooperman,
(2016, September).
In Cluster Computing (CLUSTER), 2016 IEEE International Conference on
(pp. 402-412). IEEE.
(pdf)
(Note, this article proposes process-level virtualization, in contrast
to machine-level, language-level, container-based, and library-OS-based
virtualization. Fair warning: this comes from my own research group,
and so I may be biased.)
-
System-level checkpoint-restart is a critical technology for
long-running jobs in high-performance computing. Yet, only two
approaches to checkpointing MPI applications continue to survive in
wide use today. One approach is to use the kernel module-based BLCR
in combination with an MPI checkpoint-restart service particular to
the MPI implementation in use. Unfortunately, this lacks support for
some important Linux system services such as SysV IPC (e.g., shared
memory objects). A second approach has been to use the original 2009
DMTCP implementation (herein referred to as DMTCP-09) for transparent,
system-level checkpointing. Unfortunately, DMTCP-09 lacked support for
checkpointing many of the necessary features found by MPI in a modern
batch environment. These include: ssh, the InfiniBand network, process
migration (restarting an MPI application on different cluster nodes),
and modified file path prefixes on restart (typically due to a changing
current directory, mount points, library paths, etc.). This work presents
DMTCP-PV, a new user-space transparent checkpointing system based on the
concept of process virtualization. This approach separately models the
state of each local or distributed subsystem while decoupling it from
the core checkpointing engine. By separating these concerns, a domain
expert can extend checkpointing into a new domain without any knowledge
of the core checkpointing engine. This allowed DMTCP-PV to address the
deficiencies noted above and many others. It is shown that the runtime
overhead of DMTCP-PV is generally less than 1%, and the checkpointing
time is dominated by the time to write an image file to stable storage.
-
Distributed Speculative Parallelization using Checkpoint Restart,
Devarshi Ghoshal, Sreesudhan R. Ramkumar, and Arun Chauhan,
Procedia Computer Science, 4, pp. 422--431,
May, 2011
(pdf)
(Note: This is still one of my favorite applications of checkpointing.
It combines ideas of software transactions, speculation, and checkpointing
in a really cute way.)
-
Abstract:
Speculative software parallelism has gained renewed interest recently
as a mechanism to leverage multiple cores on emerging architectures. Two
major mechanisms have been used to implement speculation-based parallelism
in software, software transactional memory and speculative threads. We
propose a third mechanism based on checkpoint restart. With recent
developments in checkpoint restart technology this has become an
attractive alternative. The approach has the potential advantage of
the conceptual simplicity of transactional memory and flexibility of
speculative threads. Since many checkpoint restart systems work with
large distributed memory programs, this provides an automatic way to
perform distributed speculation over clusters. Additionally, since
checkpoint restart systems are primarily designed for fault tolerance,
using the same system for speculation could provide fault tolerance
within speculative execution as well when it is embedded in large-scale
applications where fault tolerance is desirable. In this paper we use
a series of micro-benchmarks to study the relative performance of a
speculative system based on the DMTCP checkpoint restart system and
compare it against a thread level speculative system. We highlight the
relative merits of each approach and draw some lessons that could be
used to guide future developments in speculative systems.
-
DTHREADS: Efficient Deterministic Multithreading
Liu, T., Curtsinger, C., and Berger, E. D. (2011, October).
In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (pp. 327-336). ACM.
(pdf)
(Note: This is a really clever idea. A multi-threaded program is
virtualized as a program over multiple processes, using shared memory.
This has a similar philosophy to that of process-level virtualization,
in the spirit of the paper by Arya et al. But one might alternatively
argue that this is language-level virtualization. Compare this approach
to "determinizing" multi-threaded code with the "Pinplay" paper, below,
in the language-level subsection.)
-
Multithreaded programming is notoriously difficult to get right. A key
problem is non-determinism, which complicates debugging, testing, and
reproducing errors. One way to simplify multithreaded programming is
to enforce deterministic execution, but current deterministic systems
for C/C++ are incomplete or impractical. These systems require program
modification, do not ensure determinism in the presence of data races,
do not work with general-purpose multithreaded programs, or run up to
8.4× slower than pthreads.
This paper presents Dthreads, an efficient deterministic multithreading
system for unmodified C/C++ applications that replaces the pthreads
library. Dthreads enforces determinism in the face of data races and
deadlocks. Dthreads works by exploding multithreaded applications into
multiple processes, with private, copy-on-write mappings to shared
memory. It uses standard virtual memory protection to track writes,
and deterministically orders updates by each thread. By separating
updates from different threads, Dthreads has the additional benefit
of eliminating false sharing. Experimental results show that Dthreads
substantially outperforms a state-of-the-art deterministic runtime
system, and for a majority of the benchmarks evaluated here, matches
and occasionally exceeds the performance of pthreads.
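
Illustrative sketch (not the Dthreads implementation): the abstract's core idea, isolated private copies whose updates are merged in a fixed order, can be modeled with dictionaries in Python. Here run_phase, the dictionary "memory", and the "commit in thread-index order" rule are assumptions chosen for brevity; the real system works on virtual memory pages in separate processes.

```python
_MISSING = object()

def run_phase(shared, workers, run_order=None):
    """Run each worker on a private copy, then commit changes in thread-index order."""
    twin = dict(shared)                          # snapshot used to detect writes
    private = [dict(shared) for _ in workers]    # one private copy per "thread"
    # The execution order may vary from run to run; it must not affect the result.
    for i in (run_order if run_order is not None else range(len(workers))):
        workers[i](private[i])
    # Deterministic commit: always merge thread 0 first, then thread 1, and so on.
    for mem in private:
        for key, value in mem.items():
            if twin.get(key, _MISSING) != value:  # only changed entries are committed
                shared[key] = value
    return shared

def worker0(mem):
    mem["x"] = mem["x"] + 1
    mem["a"] = 1

def worker1(mem):
    mem["x"] = mem["x"] * 10   # races with worker0 on "x"
    mem["b"] = 2

forward = run_phase({"x": 1}, [worker0, worker1], run_order=[0, 1])
backward = run_phase({"x": 1}, [worker0, worker1], run_order=[1, 0])
print(forward, backward)
```

Both runs give the same final memory even though the two workers race on "x", because neither worker sees the other's writes until the ordered commit.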
-
Language-level virtualization:
A few billion lines of code later: using static analysis to find bugs in the real world.
Bessey, Al, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem, Charles Henri-Gros, Asya Kamsky, Scott McPeak, and Dawson Engler.
Communications of the ACM 53, no. 2 (2010): 66-75.
(pdf)
(Note: Even though it's about static analysis, there is a flavor of
language-level virtualization here. If you decide to report on this
paper, you should report jointly on this paper and on the paper
below: "Bugs as Deviant Behavior".)
-
How Coverity built a bug-finding tool, and a business, around the unlimited supply of bugs in software systems.
In 2002, COVERITY commercialized a research static bug-finding tool.
Not surprisingly, as academics, our view of commercial realities was
not perfectly accurate. However, the problems we encountered were not
the obvious ones. ...
-
Bugs as Deviant Behavior: A General Approach to Inferring Errors
in Systems Code
Engler, D., Chen, D. Y., Hallem, S., Chou, A., and Chelf, B. (2001, October).
In ACM SIGOPS Operating Systems Review (Vol. 35, No. 5, pp. 57-72). ACM.
(pdf)
(Note: If you report on this paper, you should jointly report on this
and the paper above: "A Few Billion Lines of Code Later".)
-
A major obstacle to finding program errors in a real system is knowing
what correctness rules the system must obey. These rules are often
undocumented or specified in an ad hoc manner. This paper demonstrates
techniques that automatically extract such checking information from
the source code itself, rather than the programmer, thereby avoiding
the need for a priori knowledge of system rules. The cornerstone of our
approach is inferring programmer "beliefs" that we then cross-check for
contradictions. Beliefs are facts implied by code: a dereference of a
pointer, p, implies a belief that p is non-null, a call to "unlock(l)"
implies that l was locked, etc. For beliefs we know the programmer
must hold, such as the pointer dereference above, we immediately flag
contradictions as errors. For beliefs that the programmer may hold,
we can assume these beliefs hold and use a statistical analysis to
rank the resulting errors from most to least likely. For example, a
call to "spin_lock" followed once by a call to "spin_unlock" implies
that the programmer may have paired these calls by coincidence. If the
pairing happens 999 out of 1000 times, though, then it is probably a
valid belief and the sole deviation a probable error. The key feature of
this approach is that it requires no a priori knowledge of truth: if two
beliefs contradict, we know that one is an error without knowing what the
correct belief is. Conceptually, our checkers extract beliefs by tailoring
rule "templates" to a system --- for example, finding all functions that
fit the rule template "a must be paired with b." We have developed six
checkers that follow this conceptual framework. They find hundreds of
bugs in real systems such as Linux and OpenBSD. From our experience,
they give a dramatic reduction in the manual effort needed to check a
large system. Compared to our previous work, these template checkers
find ten to one hundred times more rule instances and derive properties
we found impractical to specify manually.
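
Illustrative sketch (a simplification, not the paper's checkers): the statistical ranking of "may" beliefs for the template "a must be paired with b" can be approximated in Python by counting how often b follows a and scoring the pair with a z-statistic for proportions. The function names, the trace representation (one list of call names per code path), and the baseline p0 = 0.9 are assumptions made for this example.

```python
import math

def z_score(checks, examples, p0=0.9):
    """z-statistic for observing `examples` successes in `checks` trials against p0."""
    return (examples / checks - p0) / math.sqrt(p0 * (1 - p0) / checks)

def rank_pairs(traces, candidates, p0=0.9):
    """Rank candidate (a, b) pairings; high scores suggest a real rule."""
    ranked = []
    for a, b in candidates:
        paired, deviants = [], []
        for trace in traces:
            if a not in trace:
                continue
            # Does b occur anywhere after the first call to a?
            rest = trace[trace.index(a) + 1:]
            (paired if b in rest else deviants).append(trace)
        checks = len(paired) + len(deviants)
        if checks:
            ranked.append((z_score(checks, len(paired), p0), a, b, deviants))
    ranked.sort(key=lambda r: r[0], reverse=True)
    return ranked

traces = [["spin_lock", "work", "spin_unlock"]] * 19   # consistent uses
traces += [["spin_lock", "work"]]                        # the lone deviation
traces += [["open", "read"], ["open", "close"]]          # weak evidence either way
ranked = rank_pairs(traces, [("spin_lock", "spin_unlock"), ("open", "read")])
for score, a, b, deviants in ranked:
    print(f"{a} -> {b}: z={score:.2f}, deviations={len(deviants)}")
```

The pairing that holds in 19 of 20 paths ranks first, and its single deviant path is the one a reviewer would inspect as a probable missing unlock.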
-
Transactional rollback for language-based systems
Rudys, A., and Wallach, D. S. (2002).
In Dependable Systems and Networks, 2002. DSN
2002. Proceedings. International Conference on (pp. 439-448). IEEE.
(pdf)
(Note, this is a nice example of language-level virtualization:
It speculates on the results of codelets (small pieces of the code).)
-
Language run-time systems are routinely used to host potentially buggy
or malicious codelets --- software modules, agents, applets, etc. --- in a
secure environment. A number of techniques exist for managing access
control to system services and even for terminating codelets once they
have been determined to be misbehaving. However because codelets can be
terminated anywhere in their execution, a codelet's internal state might
become inconsistent; restarting the codelet could result in unexpected
behavior. Any state the codelet shares with other codelets may likewise
become inconsistent, destabilizing those codelets as well. To address
these problems, we have designed a mechanism, strictly using code-to-code
transformations, which provides transactional rollback support for
codelets. Each instance of a codelet is run in its own transaction, and
standard (ACID) transactional semantics apply. All changes made by the
codelet are automatically rolled back when the corresponding transaction
aborts. We discuss a transactional rollback implementation for Java,
and present its performance.
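
Illustrative sketch (not the paper's Java transformation): the rollback behavior described above can be approximated in Python with an undo log that records the old value before every write and replays the log backwards on abort. The Transaction class, its write method, and the Account example are hypothetical; the paper obtains the same effect by rewriting codelet code automatically rather than by asking the programmer to call a helper.

```python
class Transaction:
    """Undo-log transaction: every write records how to restore the old state."""

    def __init__(self):
        self._undo = []   # entries: (obj, attr, existed_before, old_value)

    def write(self, obj, attr, value):
        existed = hasattr(obj, attr)
        self._undo.append((obj, attr, existed, getattr(obj, attr, None)))
        setattr(obj, attr, value)

    def rollback(self):
        # Undo in reverse order so the earliest old value is restored last.
        while self._undo:
            obj, attr, existed, old = self._undo.pop()
            if existed:
                setattr(obj, attr, old)
            else:
                delattr(obj, attr)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            self.rollback()       # abort: restore all shared state
        self._undo.clear()        # commit or abort: the log is no longer needed
        return False              # let the caller see the original exception

class Account:
    def __init__(self, balance):
        self.balance = balance

acct = Account(100)
try:
    with Transaction() as tx:
        tx.write(acct, "balance", 50)
        tx.write(acct, "frozen", True)
        raise RuntimeError("codelet terminated mid-update")
except RuntimeError:
    pass
print(acct.balance, hasattr(acct, "frozen"))
```

After the simulated termination, the account is back in its pre-transaction state, so other code sharing the object never observes the half-finished update.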
-
Rx: Treating Bugs As Allergies -- A Safe Method to Survive Software Failures
Qin, F., Tucek, J., Sundaresan, J., and Zhou, Y. (2005, October).
In ACM SIGOPS Operating Systems Review (Vol. 39, No. 5, pp. 235-248). ACM.
(pdf)
(Note: Here is a cute idea using speculative execution. Be sure to
especially read Section 3.3 carefully on a first reading:
the five mechanisms for semi-automatically fixing bugs.)
-
Many applications demand availability. Unfortunately, software failures
greatly reduce system availability. Prior work on surviving software
failures suffers from one or more of the following limitations: Required
application restructuring, inability to address deterministic software
bugs, unsafe speculation on program execution, and long recovery time. This
paper proposes an innovative safe technique, called Rx, which can quickly
recover programs from many types of software bugs, both deterministic
and non-deterministic. Our idea, inspired from allergy treatment in real
life, is to roll back the program to a recent checkpoint upon a software
failure, and then to re-execute the program in a modified environment. We
base this idea on the observation that many bugs are correlated with
the execution environment, and therefore can be avoided by removing the
"allergen" from the environment. Rx requires few to no modifications to
applications and provides programmers with additional feedback for bug
diagnosis. We have implemented Rx on Linux. Our experiments with four
server applications that contain six bugs of various types show that
Rx can survive all the six software failures and provide transparent
fast recovery within 0.017-0.16 seconds, 21-53 times faster than the
whole program restart approach for all but one case (CVS). In contrast,
the two tested alternatives, a whole program restart approach and
a simple rollback and re-execution without environmental changes,
cannot successfully recover the three servers (Squid, Apache, and CVS)
that contain deterministic bugs, and have only a 40% recovery rate
for the server (MySQL) that contains a non-deterministic concurrency
bug. Additionally, Rx's checkpointing system is lightweight, imposing
small time and space overheads.
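
Illustrative sketch (not the Rx implementation): the "roll back, change the environment, re-execute" loop from the abstract can be written as a small Python driver. The names run_with_rx and handler, the dictionary-based state, and the single "padding" knob are assumptions for this example; the toy failure stands in for a buffer overflow that disappears once allocations are padded.

```python
import copy

def run_with_rx(handler, state, request, environments):
    """Try the request under each environment, restoring the checkpoint on failure."""
    checkpoint = copy.deepcopy(state)            # taken before the risky work
    for env in environments:
        try:
            return handler(state, request, env), env
        except Exception:
            # Roll back all side effects of the failed attempt, then retry
            # under the next (modified) environment.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("no environment change avoided the failure")

def handler(state, request, env):
    """Toy server step with a deterministic bug: a 4-byte buffer for any request."""
    state["handled"] += 1                        # side effect before the failure
    capacity = 4 + env["padding"]                # padding models a padded allocation
    if len(request) > capacity:
        raise MemoryError("buffer overflow")
    state["log"].append(request)
    return len(request)

state = {"handled": 0, "log": []}
result, used_env = run_with_rx(
    handler, state, "abcdefgh", [{"padding": 0}, {"padding": 8}]
)
print(result, used_env, state)
```

The first attempt fails and its partial update to "handled" is undone; the second attempt, run with the hypothetical padding change, succeeds from the restored checkpoint.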
-
Pinplay: a framework for deterministic replay and reproducible analysis of parallel programs.
Patil, H., Pereira, C., Stallcup, M., Lueck, G., and Cownie, J. (2010, April).
In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization (pp. 2-11). ACM.
(pdf)
(Note: This is an excellent example of a research direction that argues
that debugging multi-threaded programs is harder than sequential
programs, because of race conditions involving at least two distinct
points in a program. They then argue that deterministic replay
is important for capturing such bugs/race conditions for analysis and
debugging. If you report on this work, please compare it to the
more recent work on Castor (Default-On Multi-Core Record/Replay), below.
Optionally, you may also want to compare this approach to
"determinizing" multi-threaded code with the "Dthreads" paper, above.)
-
Analysis of parallel programs is hard mainly because their behavior
changes from run to run. We present an execution capture and
deterministic replay system that enables repeatable analysis of parallel
programs. Our goal is to provide an easy-to-use framework for capturing,
deterministically replaying, and analyzing execution of large programs
with reasonable runtime and disk usage. Our system, called PinPlay,
is based on the popular Pin dynamic instrumentation system hence is
very easy to use. PinPlay extends the capability of Pin-based analysis
by providing a tool for capturing one execution instance of a program
(as log files called pinballs) and by allowing Pin-based tools to run
off the captured execution. Most Pintools can be trivially modified to
work off pinballs thus doing their usual analysis but with a guaranteed
repeatability. Furthermore, the capture/replay works across operating
systems (Windows to Linux) as the pinball format is independent of the
operating system. We have used PinPlay to analyze and deterministically
debug large parallel programs running trillions of instructions. This
paper describes the design of PinPlay and its applications for analyses
such as simulation point selection, tracing, and debugging.
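
Illustrative sketch (a toy model, not PinPlay): capture and deterministic replay can be shown in Python by logging every nondeterministic scheduling choice during one run and feeding the log back on the next. The run function, the list-based "threads", and the choice log are assumptions for this example; PinPlay itself records at the instruction level through Pin and stores the result in pinballs.

```python
import random

def run(threads, choose):
    """Interleave per-thread step lists; choose(n) picks among n runnable threads."""
    pending = [list(steps) for steps in threads]
    trace = []
    while any(pending):
        runnable = [i for i, steps in enumerate(pending) if steps]
        tid = runnable[choose(len(runnable))]
        trace.append(pending[tid].pop(0))
    return trace

log = []

def recording_choice(n):
    """Capture phase: make a random scheduling decision and remember it."""
    pick = random.randrange(n)
    log.append(pick)
    return pick

threads = [["A1", "A2", "A3"], ["B1", "B2"]]
recorded = run(threads, recording_choice)

replay_log = iter(log)
replayed = run(threads, lambda n: next(replay_log))   # replay phase: no randomness
print(recorded)
print(replayed)
```

The recorded interleaving differs from run to run, but the replayed trace always matches the captured one, which is the repeatability property the abstract relies on for analysis and debugging.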
-
Towards Practical Default-On Multi-Core Record/Replay
Ali José Mashtizadeh, Tal Garfinkel, David Terei, David Mazières,
Mendel Rosenblum
(to appear, ASPLOS 2017)
(pdf)
(Note that the pdf link is not a permanent link. This is the draft
version of a recently accepted paper. If you report on this paper,
please compare this recent work to the earlier classic "PinPlay" work, above.
In particular, what do the two sets of authors say about the still earlier
work, RecPlay by Ronsse et al.?)