Suggested Team Projects for CS 5600

NOTE:
The Wiki page for the team projects has now been set up. Please go to the Wiki page for further information.

The preferred team size is three students. Exceptions can be made with proper justification. Please consider me as an informal fourth member of each team.

Each team will give a 5-minute oral progress summary in class every week. There will also be a full oral presentation, and full written documentation of the results, at the end of the semester.

Many of these projects are highly ambitious, and it is not necessarily expected that each project will be completed within the semester. Instead, the oral and written presentations should concentrate on documenting what was achieved, what was not achieved, what new information was learned in failing to achieve the desired goals, and what new directions would be taken in the future in order to continue the progress. This philosophy makes the project closer to the real world (as opposed to an academic toy project). This style of work is typical of some industrial production code (e.g., agile software development), of industrial R&D, and of general research.

For the projects based around Mesos, see:

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center (also at USENIX NSDI 2011)
Notes on Mesos and Docker (from Douglas Thain, U. of Notre Dame)
Apache Mesos Documentation

1. DMTCP attach: The hw3 homework showed some approaches to doing a DMTCP attach. To date, no one has made a serious effort at creating an attach feature for DMTCP.
2. DMTCP handling of static executables: The hw3 homework also showed some approaches to handling static executables. Handling static executables will most likely require the use of trampolines (code that modifies the assembly entry points to library functions). To date, no one has made a serious effort at supporting static executables using DMTCP.
3. General API for DMTCP Coordinator: Right now, there is only one kind of DMTCP coordinator. (Actually, there are two kinds, since dmtcp_launch --no-coordinator causes DMTCP to create a built-in coordinator.) Can we extend the concept of DMTCP plugins to create a general API that will allow an end user to write a library against the API to define a new kind of coordinator? Some options are: a tree of coordinators; or a second standby coordinator that takes over if the first coordinator dies; etc.
4. General API for DMTCP Plugin for Writing a Checkpoint: This is similar to the previous project, but here we want a plugin API that allows the creation of: checkpoint images on a remote computer; or maybe an encrypted version of a checkpoint image; or maybe replicas of the checkpoint image for fault tolerance.
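To make the replica option concrete, here is a minimal sketch in C of what a "write one chunk of checkpoint data" hook might do if the API handed it a set of already-open file descriptors (local files, sockets to remote hosts, and so on). The names write_all and write_chunk_to_replicas are hypothetical and are not part of DMTCP; designing the real interface is the point of the project.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes and on EINTR. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical hook (not a DMTCP function): send one chunk of checkpoint
 * data to every replica.  Returns how many replicas received the whole
 * chunk, so the caller can decide whether enough copies exist.
 */
int write_chunk_to_replicas(const int *fds, size_t nfds,
                            const void *buf, size_t len)
{
    assert(fds != NULL && buf != NULL);
    int ok = 0;
    for (size_t i = 0; i < nfds; i++)
        if (write_all(fds[i], buf, len) == 0)
            ok++;
    return ok;
}
```

Returning a count rather than a plain success flag is one possible design choice: it lets the caller tolerate the loss of some replicas while still detecting a checkpoint that was not safely stored anywhere.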
5. Versioned symbols for ELF: Wrapper functions are a natural concept in computer science. The Linux/POSIX library functions dlopen and dlsym directly support wrappers. (See man dlsym and search on "wrapper".)
GNU libc 2.1 (glibc 2.1) introduced symbol versioning. A newer library (.so file) can define a new version of a symbol (e.g., a function) intended to fix bugs and/or add new features, while at the same time defining an older version of the symbol for backward compatibility. The library function dlsym chooses the older version of the symbol, while executables that dynamically link to a library will usually receive the newer version of the symbol (actually, the version that is informally considered the "default version").
The goal of this project is to learn more about ELF, and to use that information to write a new function that will choose the newer "default" version of a symbol.
You will find more information on these issues and a partial implementation in DMTCP, in the DMTCP file doc/dlsym_default.txt. I will provide additional information, if a team chooses this project.
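As a starting point for experiments, the sketch below contrasts an unversioned lookup with a version-pinned one. It uses dlsym and the GNU extension dlvsym. The helper name lookup_symbol is hypothetical, and this is not the approach described in doc/dlsym_default.txt: it only shows how a caller can name a version explicitly once it knows which one it wants.

```c
#define _GNU_SOURCE          /* needed for dlvsym, RTLD_NEXT, RTLD_DEFAULT */
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/*
 * Hypothetical helper: look up `name` starting from `handle`
 * (for example RTLD_NEXT inside a wrapper function).
 * With version == NULL this is plain dlsym(), which gives the caller no
 * say in which version is returned.  Otherwise dlvsym() is asked for
 * exactly that version and returns NULL if the library does not define it.
 */
void *lookup_symbol(void *handle, const char *name, const char *version)
{
    assert(name != NULL);
    if (version == NULL)
        return dlsym(handle, name);
    return dlvsym(handle, name, version);
}
```

A wrapper might call lookup_symbol(RTLD_NEXT, "pthread_cond_wait", "GLIBC_2.3.2"). That version string is an assumption that holds on typical x86-64 glibc builds; version names differ by architecture, so check with objdump -T or readelf before relying on one. On glibc older than 2.34, link with -ldl.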
6. Checkpointing valgrind (valgrind attach): Valgrind is a widely used tool that excels at finding memory leaks. Its usage is simple: valgrind a.out args. Because running under valgrind is slower than native execution (e.g., 10 times slower or worse), many users have hoped for a "valgrind attach" feature. This is probably impossible, since valgrind runs the executable in software that emulates the underlying machine instructions.
So, a next-best option is to run valgrind under DMTCP (or other checkpointing tool) until the interesting point. Then, one checkpoints. Finally, one can restart many times, and direct the executable to choose different execution paths (e.g., different application options) on each restart.
While a VM snapshot could checkpoint valgrind, that is a heavyweight option. The goal of this project is to use a standard checkpointing package (or your own custom one) to checkpoint valgrind.
7. Checkpointing screen and/or tmux: GNU screen and tmux are commonly used for detaching a terminal session from the terminal, and for other manipulations. Both rely on the concept of a pty (pseudo-terminal). At one time, DMTCP supported GNU screen, but that was before the era of DMTCP plugins.
The goal of this project is to produce a DMTCP plugin for supporting checkpointing of either GNU screen or of tmux.
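For readers who have not worked with ptys, the following sketch shows how a program such as screen or tmux might obtain a new pseudo-terminal pair using the standard POSIX calls posix_openpt, grantpt, unlockpt, and ptsname. The function name open_pty_master is hypothetical. A checkpointing plugin would presumably need to record such pairs and recreate them at restart, when the slave device path may well be different.

```c
#define _XOPEN_SOURCE 600    /* for posix_openpt, grantpt, unlockpt, ptsname */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Open a new pseudo-terminal pair.  Returns the master fd, or -1 on
 * error.  On success the slave device path (e.g. /dev/pts/N on Linux)
 * is copied into slave_path.
 */
int open_pty_master(char *slave_path, size_t path_len)
{
    assert(slave_path != NULL && path_len > 0);

    int master = posix_openpt(O_RDWR | O_NOCTTY);
    if (master < 0)
        return -1;

    /* Both calls must succeed before the slave side can be opened. */
    if (grantpt(master) != 0 || unlockpt(master) != 0) {
        close(master);
        return -1;
    }

    const char *name = ptsname(master);
    if (name == NULL || strlen(name) >= path_len) {
        close(master);
        return -1;
    }
    strcpy(slave_path, name);
    return master;
}
```

The terminal multiplexer keeps the master side, and the shell it launches opens the slave path as its controlling terminal.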
8. Checkpointing Hadoop (Big Data): Hadoop was the first full-featured open source version of the MapReduce software from Google. Its architecture typically assumes back-end disk nodes with large files, and a front-end compute node on which resides the Hadoop executable, and a Hadoop scheduler for the back end.
Checkpointing would be very useful, in order to put aside a currently running job, when a newer, high-priority job arrives. Since the files on the back end are large, the intention is to copy the back end files to a temporary region as part of the checkpoint, and then to copy them back as part of the restart.
We have access to some software from INRIA that will manage the back-end files. The goal of this project is to write the front-end, including a DMTCP plugin, that will take special actions at checkpoint and restart to save the front-end Hadoop application and later restore it. We will apply this only to the simpler Hadoop, version 1.
9. Checkpointing of Docker: Docker is sometimes called a lightweight virtual machine, although it does not include a separate "guest" Linux kernel. It uses the underlying Linux kernel. Nevertheless, it has gained popularity in many domains where virtual machines are also used.
Virtual machines have snapshots. The goal of this project is to checkpoint Docker using DMTCP. (An alternate checkpointing package that currently works on Docker is CRIU.)
While Docker is normally compiled as a statically linked executable with gc (the standard Go compiler), there is also a dynamically linked executable for Docker built with gccgo (the GCC front end for Go). (See The Go Blog for more information.) In principle, this should make it easy for DMTCP to checkpoint Docker. However, DMTCP must be extended to support Linux cgroups and pid namespaces.
There is already a partial implementation of checkpointing of Docker within the DMTCP team. This will be made available to a team that tackles this project.
A Docker container typically runs just a single process. If time permits, the effort should be extended to support the Supervisor package, which is used with Docker to run multiple processes. Alternatively, the team may prefer a different extension: the use of plugins to integrate with the Docker daemon on checkpoint and restart.
10. Security: Multi-architecture Checkpoint-Restart: In defending against malware, it is useful to present a dynamically shifting "attack surface" against attackers. One such technique is multi-architecture checkpoint-restart. An example of such work (as execution migration) is: Execution Migration in a Heterogeneous-ISA Chip Multiprocessor.
The goal of this project is to checkpoint under one CPU instruction set (e.g., Intel), and to restart under a different CPU instruction set (e.g., ARM).
We will assume that we fully control the target application. For example, we can compile it under both CPU architectures. We can also compile it with research compilers such as LLVM. LLVM is the foundation for the well-known clang compiler. LLVM allows you to easily modify the compiler to emit additional code, such as "landmarks" in the prolog and epilog of a function, where it is acceptable to checkpoint. Thus, one can checkpoint at one of these landmarks, and replace the text segment with the text segment of the other CPU architecture, and then restart at the corresponding landmark in the alternative text segment. With a little luck, we can persuade LLVM to emit an almost identical data segment under the two CPU architectures. The remaining task is then to translate the call frames of the stack from one CPU architecture to another.
If a team takes on this project, we will provide additional lectures on how to modify the LLVM compiler.
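One way to picture the landmark idea is as a table that pairs each landmark address in one build with the matching address in the other build. The sketch below is purely illustrative: struct landmark and translate_landmark are hypothetical names, and how the table would actually be produced (for example, by an LLVM pass) is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One landmark: the same source-level point in each of the two builds. */
struct landmark {
    uintptr_t addr_src;   /* code address in the checkpointed build    */
    uintptr_t addr_dst;   /* matching address in the build to restart  */
};

/*
 * Map a landmark address from the source build to the destination build.
 * Returns 0 if pc is not a landmark, i.e. not a safe place to switch.
 */
uintptr_t translate_landmark(const struct landmark *table, size_t n,
                             uintptr_t pc)
{
    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (table[i].addr_src == pc)
            return table[i].addr_dst;
    return 0;
}
```

The same lookup could in principle be applied to each saved return address while translating call frames, although the frame layout itself also differs between architectures and needs its own translation.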
11. Mesos: Fault-tolerant Resource Scheduling: Many companies that operate at web scale spread their production systems across data centers in different geographical locations. This project will implement a feature in Apache Mesos that allows the slaves in each data center to connect to a local Mesos master, and that enables the Mesos masters of the different data centers to handle automated failover among themselves.
If a team takes on this project, we will provide additional lectures on this aspect of Mesos.
12. Mesos: Load Balancing: Apache Mesos operates in a master-slave hierarchy. If a leading master fails, one of the standby masters will take over. However, if a master is failing due to overload or network congestion, failover to a single standby master is not an appropriate solution. This project should create multiple active masters to share the workload.
If a team takes on this project, we will provide additional lectures on this aspect of Mesos.

I am still considering additional projects. Students are welcome to propose additional projects in areas of their interest, or modifications to the current projects.