Web Resources for CS7680 (Special Topics in Systems)

Instructor: Gene Cooperman
Spring, 2017

CS 7680 (Spring, 2017): Virtualization in Computer Systems

Virtualization in its most general sense has become a key technology in the evolution of the Cloud, the Datacenter, the HPC Cluster, and the emerging area of the Internet of Things. Virtualization also strongly affects the following: stateless servers and server migration; schedulers and load balancing; fault tolerance and checkpoint-restart; Linux containers (e.g., cgroups, Docker, Mesos) and virtual machines; distributed debugging and orchestration frameworks; mobile apps; etc.

In terms of a syllabus, my office hours will be during the hour after class on Tuesday and Friday. I am also generous about offering other times to meet, and I encourage you to come to my office (no prior appointment) and interrupt me and ask when we can meet. Ideally, I will arrange that within half an hour.

As befits a PhD topics course, if a student actively participates in all areas (readings, presentation, class discussion, project, written project report), then the grade will be an A.

NEWS:

Please continue working on your technical paper. This link gives the full instructions for writing your course technical paper. Please have ready a 2-page introduction, and a first version of the body based on figures. The schedule/deadlines are in the instructions (linked to, above) for writing your course technical paper.
If you expect to be delayed in these deadlines, please indicate that in red near the top of your document (using the \todo macro), so that I know when to edit it, as your co-author.

NEWS:

There are now two pages with background information for this course:

Computer architecture background page with information on VLSI and Supercomputing.
Elements of Parsing an ELF Header (incomplete still; to be completed) NEW!

Here are the weekly lecture notes taken by the students.

Here are the background paper readings (many of which will be presented in class), and here is the schedule for presentations.

See below for:

Topics
Course Structure
Short Essay on Course Philosophy
Paper Readings
- Presentations of Papers
The Importance of Writing (and oral presentation)
Course Projects
Example: DMTCP-style checkpointing: interposition on system calls

Topics:

A. Three Platforms for Clusters: Datacenter, Cloud, and HPC (High Performance Computing)

What are their different characteristics? For example, note the differing preferences for virtualization:

Datacenter: often uses Linux containers (e.g., Docker, CoreOS)
Cloud: often uses virtual machines, especially for IaaS (Infrastructure as a Service)
HPC: "bare metal" is preferred; practitioners are reluctant to give up even 3% overhead. In a $100 million cluster, 3% costs 3 million dollars.

B. The Datacenter

The datacenter is both one of the oldest cluster platforms (machine rooms computing the payroll, updating the employee or customer database), and one of the newest cluster platforms (server farm for customer transactions, data mining, etc.). The modern datacenter uses orchestration (e.g., Mesos, Kubernetes, ...) for flexible, dynamic assignment of computer nodes; and it uses containers (e.g., Docker, CoreOS) for packaging (the application should not break during a system upgrade), and reasonable isolation (one container should not affect the performance or security of another container).

C. Virtualization on the three platforms

All three platforms benefit from virtualization. What are the modes and motivations for virtualization? Some examples are virtual machines (e.g., for IaaS) and process virtualization (interposition on system calls: e.g., library OS, DMTCP plugins, stub functions in Condor).

D. Convergence on the three platforms: to converge or not to converge

Example systems to study include the MOC (Massachusetts Open Cloud), and OpenStack.

E. O/S and programming language extensions for virtualization

Containers use three non-classical system services: namespaces (e.g., pid namespaces); cgroups or control groups; and union filesystems (a small read-write filesystem layer on top of a large, base read-only filesystem). Other examples of virtualization strategies include "Windows Subsystem for Linux", exokernels and microkernels, and shadow device drivers. On the level of programming languages, there are language virtual machines (e.g., JVM), and newer systems languages with interesting implications for virtualization: Go (Docker containers and servers, static instead of dynamic linking -- no libc.so); Rust (Web browsers and web engines, safety and speed for multi-threaded programs); and Scala (Spark, the successor to MapReduce for Big Data -- how does one virtualize big data?).
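As a rough illustration of the union filesystem idea described above (a small read-write layer on top of a large read-only base), here is a minimal Python sketch. It is a toy model of the lookup rules only, not how any real union filesystem is implemented; the class name UnionFS, its methods, and the example paths are all hypothetical.

```python
class UnionFS:
    """Toy model: a writable upper layer stacked on a read-only base layer."""

    _WHITEOUT = object()  # marker meaning "deleted in the upper layer"

    def __init__(self, base):
        self._base = dict(base)  # read-only layer; this sketch never modifies it
        self._upper = {}         # read-write layer; all changes land here

    def read(self, path):
        # The upper layer is consulted first, then the base layer.
        if path in self._upper:
            data = self._upper[path]
            if data is self._WHITEOUT:
                raise FileNotFoundError(path)
            return data
        if path in self._base:
            return self._base[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Writes never touch the base; they shadow it in the upper layer.
        self._upper[path] = data

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if the path is not visible
        self._upper[path] = self._WHITEOUT

    def listdir(self):
        # Visible paths: everything in either layer, minus whiteouts.
        visible = set(self._base) | set(self._upper)
        return sorted(p for p in visible
                      if self._upper.get(p) is not self._WHITEOUT)


# Usage: the shared image stays untouched while the container sees its own view.
image = {"/etc/hostname": "image-default", "/bin/sh": "<shell binary>"}
fs = UnionFS(image)
fs.write("/etc/hostname", "container-1")  # shadows the base file
fs.delete("/bin/sh")                      # whiteout hides the base file
print(fs.listdir())
print(image["/etc/hostname"])
```

Because every change lands in the upper layer, many containers can share one base image, and only the small upper layer differs from one container to the next.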

F. Performance

Performance considerations often interact badly with virtualization. Some examples of newer performance optimizations for computers are: InfiniBand (and newer RDMA-based network fabrics, such as Intel Omni-Path, with a roadmap toward integration on the CPU chip); SSDs; Intel/Micron Optane memory; NVIDIA GPUs; Intel Xeon Phi; etc.

G. Optional Topics

Exascale for High Performance Computing: how to virtualize a million-core computation lasting for hours?
Newer storage systems: their characteristics for large O/S images, frequent checkpoints, etc.
Linker/loader and ELF: everyone uses it and almost no one knows the details of how it works; interesting possibilities for interposition and virtualization are -Wl,--wrap=foo and LD_PRELOAD
CPU hardware support for virtual machines
"virtualization equals security": one can only enforce what one can interpose on.
Other??
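For the linker/loader and ELF topic in the list above, a first step toward seeing how ELF works is to decode the fixed-size header at the start of every ELF file. The following Python sketch parses a little-endian ELF64 header with the standard struct module. To stay self-contained it runs on a synthetic 64-byte header built by the hypothetical helper make_demo_header; the field values there (entry point 0x1040, 13 program headers) are made up for illustration.

```python
import struct

# ELF64 header layout, little-endian: e_ident[16], e_type, e_machine,
# e_version, e_entry, e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize,
# e_phnum, e_shentsize, e_shnum, e_shstrndx (64 bytes in total).
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")

ET_NAMES = {1: "ET_REL", 2: "ET_EXEC", 3: "ET_DYN", 4: "ET_CORE"}


def parse_elf64_header(data):
    """Decode the leading ELF64 header of `data` into a small dictionary."""
    if len(data) < ELF64_HEADER.size:
        raise ValueError("too short for an ELF64 header")
    if data[:4] != b"\x7fELF":
        raise ValueError("missing ELF magic number")
    # e_ident[4] is the class (2 = 64-bit); e_ident[5] is the byte order
    # (1 = little-endian). This sketch handles only that combination.
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch handles only little-endian ELF64")
    (_ident, e_type, e_machine, _version, e_entry, e_phoff, e_shoff,
     _flags, _ehsize, _phentsize, e_phnum, _shentsize, e_shnum,
     e_shstrndx) = ELF64_HEADER.unpack_from(data)
    return {
        "type": ET_NAMES.get(e_type, hex(e_type)),
        "machine": e_machine,   # 62 is EM_X86_64
        "entry": e_entry,       # virtual address where execution starts
        "phoff": e_phoff, "phnum": e_phnum,  # program header table
        "shoff": e_shoff, "shnum": e_shnum,  # section header table
        "shstrndx": e_shstrndx,
    }


def make_demo_header():
    """Build a synthetic header (hypothetical values) for the demo."""
    ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
    return ELF64_HEADER.pack(ident, 3, 62, 1, 0x1040, 64, 0, 0,
                             64, 56, 13, 64, 0, 0)


print(parse_elf64_header(make_demo_header()))
```

On a 64-bit little-endian Linux system, the same function can be tried on the first 64 bytes of a real binary. Symbol-level interposition additionally requires walking the section headers and symbol tables that this header points to.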

Course Structure:

At this point, I am still flexible on the course structure. But I want it to include an emphasis on these three elements:

Paper readings, including oral presentations in class
A Medium-sized, but open-ended, exploratory Software Project (The project can have software targets; or it can be based on paper-and-pencil design with reference to existing software building blocks. The practical experience of designing systems is important for gaining insights, and for critically analyzing academic research papers.)
Frequent Updates to your own technical writing, which may cover a survey of some aspect of the paper readings, or a report on discoveries from your exploration through an open-ended software project, or both. Written technical communication is a key skill. There are rules for good technical writing, and I intend to teach those rules (going beyond mere evaluation with feedback like "good", "fair", and "needs improvement").

Short Essay on Course Philosophy:

My own view on this subject is that virtualization is totally changing the way we use computers. We see apps moved from the desktop to a Java Virtual Machine on a smartphone, or a Linux container (e.g., LXC, Docker) in the Cloud, or a Virtual Machine in the Cloud, or the Microsoft idea of a Library OS (variously tied to the idea of a "Drawbridge picoprocess" or a "Universal Windows Platform" (UWP) or a Windows Subsystem for Linux (WSL; i.e., "Ubuntu on Windows")).

Even the idea of a container can be decomposed into a namespace (for pid's, network addresses, etc.), cgroups (control groups for limiting resource usage), and a union filesystem (with a base, pre-packaged read-only system; and a read-write user-controlled filesystem on top of it). The ideas of "namespace" and "union filesystem" come close to my own research group's work on "process virtualization".

The past work of my own research group has led from checkpointing (see DMTCP --- a widely-used checkpointing project now in its second decade) to questions of what is the ultimate goal for modern checkpointing. The old problem of checkpointing is a solved problem. An application was assumed to be self-contained, and there are several good, robust solutions out there. But how does an application interact with external processes and constructs that are outside its natural home in the Cloud, HPC cluster, or especially the Datacenter? Our current answer to this question is "process virtualization", and I am interested in how process virtualization can interact with VM-based virtualization, container-based virtualization, and language-based virtualization (e.g., JVM).

Personally, I believe that all three platforms (the Cloud, HPC cluster, and the Data Center) are fast evolving and converging into a new future concept that none of us might fully recognize today. Evidence of this is interactive use of Slurm in a batch-oriented HPC cluster, "bare-metal" Clouds and other completely new entities, and Mesos-based Data Centers that can run a Cloud and an HPC cluster inside Mesos while allowing for dynamic tradeoffs of resources. Much of this vision will be radically transformed by new memory technologies, including SSDs and the 3D XPoint (aka Optane) of Intel and Micron. How do we integrate all this information and prepare for this future convergence of today's paradigms?

So, just as the three blind men try to describe an elephant, we are all trying to describe the future of computing. I believe that Data Centers such as those based on Mesos (the newest creature on the block) may hold particular insights. But let's develop the course content together as a community project among interested students.

The Importance of Writing:

The motivation for paper readings is obvious. The motivation for medium-sized exploratory software is to "keep us honest" (see projects, below). It's easy to read papers and spin up castles in the air. But how do we know that those castles in the air are practical? What are the real challenges when we try to build our castle in the air?

And finally, the motivation for an emphasis on technical writing with frequent updates is that technical writing is a critical skill that is typically given low priority due to the pressure of other concerns. But especially in today's environment of highly competitive conferences, and the need to exploit the web to propagate your technical visions, it is critical to write well. There are rules to learn good technical writing, just as there are rules to learn good principles of writing software. It will be my responsibility to serve as the "software compiler", and give you frequent feedback on issues with your writing. Probably, I will use sharelatex to easily share a single source file with the commonly used C.S. standard: LaTeX. But I am open to other suggestions. My goal is to be able to comment on and modify your writing at any time of day or night.

If it is possible, I also hope to create a web of technical writing from the writing products of the whole class. I've never done that before, and I don't know how to do that yet. But let's see how it turns out.

One last comment on writing: Good technical writing has nothing to do with whether your native language is English or something else. The content of good technical writing will shine through a translation into any language at all. It is the skill of writing good technical content that I am looking for.

Course projects:

I would like to see each student choose a course project on the general theme of virtualization. Ideally, if the student is already working on a thesis or other long-term project, then some aspect of that involving virtualization can be used as the course project. The project can even be a thesis proposal or a chapter in a thesis.

I am not concerned with the particular functionality of the project. The project may be based on actual software, or on a software paper design.

The project will be used as a vehicle for learning better technical communication skills. We will use Overleaf or Google Docs as a way to share a document between the professor and the student. In this way, I can asynchronously point out places where full communication could be improved. From time to time, I will also ask you to spontaneously communicate orally about the ideas of your projects. The goal is fluency in communication --- not polished presentation. Polished presentation will be emphasized closer to the end of the course.

Further, the emphasis of this software (or paper design) project will not be software engineering. I don't care if any of your software works at the end, or if any of your paper design has been implemented at the end. Instead, I want you to poke at large, complex software from the outside, in order to get insights into it. There are two kinds of "proof" in this world: mathematical proof and scientific proof. Mathematical proof is about formal proofs. It relates to formal verification and to the semantics of programming languages. It works best for programming "in the small". Scientific proof is evidence-based. It works best for programming "in the large". What are some scientific experiments that you can perform on this large software in order to gain evidence for its expected behavior? Virtualization is about interposing on large, complex software. How does one poke at it, without having to spend huge amounts of time reverse-engineering the internals of that software?

If you don't already have a project, or if you are interested in doing something different, here are some possible ideas. This list may grow over time. It is particularly DMTCP-centric because my group's research emphasizes that project. Students are welcome to substitute their favorite complex software (Hadoop?, Spark?, MPI?, a distributed system?) for DMTCP, and ask similar questions.

Microsoft supports Azure for their Cloud. Azure supports a large subset of the Linux API. (See Windows Subsystem for Linux and the Library OS paper mentioned there, for some insights into how Microsoft has used virtualization to support Linux on top of Windows, which is also being done in Microsoft Azure.) The goal of this project will be to get a free student account for Microsoft Azure, and then discover to what extent a package using many low-level Linux features, such as DMTCP, can or cannot be ported to Microsoft Azure. If it is difficult to port, what are the difficulties, and what subset of DMTCP might be able to be ported? Is there a general way to describe universal requirements for support of DMTCP on any possible operating system (even a real-time embedded O/S)?
DMTCP uses the dlsym library function to interpose on functions by creating wrapper functions. This is one of the key requirements in the development of DMTCP. It is also an excellent way to do virtualization through interposition. However, Docker and the Go language use statically linked binaries. For statically linked binaries, dlsym has no effect. (After all, "dl" stands for "dynamically linked".) What are the alternatives, in order to support statically linked binaries? If you choose this, I will provide you with a tiny package that parses ELF libraries and then directly does interposition on ELF symbols. How easy is it to port DMTCP to use this new package instead of dlsym? In general, to what extent does static linking imply that we lose the ability to interpose (and hence virtualize), and to what extent can we get around this with ELF symbols, ELF relocation, trampolines, etc? (See these slides for an overview of DMTCP.)
In supercomputing, the InfiniBand network fabric has been the standard for over a decade. It is based on the concept of RDMA. There are now newer network fabrics that the world is moving toward. For example, Intel Omni-Path stands a good chance of becoming the next standard. There are also attempts to unify these standards (e.g., OpenFabric). Every time that a new network is chosen, most of the HPC software stack must be ported toward that new stack. Further, DMTCP must then create a new plugin for the new network. What is the potential for creating an abstraction for an RDMA network sufficient for interposition (and hence for virtualization)? Note that OpenFabric tries to provide a general network API or intermediate layer from which one can then make calls to a lower-level layer. The question here is subtly different. Suppose we don't want a higher-level abstraction that may or may not cover the next brand new network fabric. A high-level abstraction attempts to achieve backwards compatibility with all earlier standards. But suppose that instead we want to achieve forwards compatibility! Suppose we want a single parametrized model in advance that covers any possible future network fabric. How general can we make this parametrized model so that it is future-proof?
Docker is a well-known container system. Another that has a little less mindshare, but is equally interesting, is CoreOS. To date, the largest use of both is for non-persistent applications. If a server dies in the middle of a transaction, then either some front-end will redirect that transaction, or the client (user) should re-submit the transaction. One is beginning to see an interest now in persistent container applications. A container can require more than a gigabyte for its filesystem. Even worse are the running daemons (typically launched by systemd). We can checkpoint just the application inside the container (easier) or we can checkpoint the entire container including systemd (more difficult). Checkpointing a network of QEMU Virtual Machines over a Linux kernel with KVM (with slides here) turns out to be surprisingly easy, since a guest VM appears as a process to the host. Checkpointing a container is more difficult due to issues of systemd, etc. Here's the CRIU document on how they approach this problem. With DMTCP, one would extend it to support statically linked executables, and then use the DMTCP plugin model to support networks and other services of system daemons. Compare and contrast the two approaches (and any others). What are the pros and cons? Could we extend this to creating an operating system with a "fast restart" mode (restart the daemons from a checkpoint image), or a fast live migration?
I also intend to add some projects more closely oriented toward the datacenter and possibly Docker, CoreOS, or another container-like system. I'm thinking about Mesos and Kubernetes. Ask me if you're interested.
I also intend to add some projects more closely oriented toward the Cloud, with the Massachusetts Open Cloud as a major resource. Ask me if you're interested.
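The wrapper-function idea behind the dlsym project above can be sketched by analogy in Python. In C, a wrapper library loaded with LD_PRELOAD would typically define its own getpid and reach the original definition through dlsym; here, the same shape is imitated by saving the original os.getpid and temporarily installing a wrapper in its place. This is only an analogy, not DMTCP code: the names PidVirtualizer and interpose_getpid, the virtual pid table, and the starting value 40000 are hypothetical choices made for illustration.

```python
import os
from contextlib import contextmanager


class PidVirtualizer:
    """Translation table between stable virtual pids and real pids."""

    def __init__(self, first_virtual_pid=40000):
        self._next = first_virtual_pid
        self._real_to_virt = {}
        self._virt_to_real = {}

    def to_virtual(self, real_pid):
        # Assign a fresh virtual pid the first time a real pid is seen.
        if real_pid not in self._real_to_virt:
            virt = self._next
            self._next += 1
            self._real_to_virt[real_pid] = virt
            self._virt_to_real[virt] = real_pid
        return self._real_to_virt[real_pid]

    def to_real(self, virt_pid):
        return self._virt_to_real[virt_pid]

    def rebind(self, virt_pid, new_real_pid):
        """After a restart, point an existing virtual pid at a new real pid."""
        old_real = self._virt_to_real[virt_pid]
        del self._real_to_virt[old_real]
        self._virt_to_real[virt_pid] = new_real_pid
        self._real_to_virt[new_real_pid] = virt_pid


@contextmanager
def interpose_getpid(table):
    """Temporarily replace os.getpid with a wrapper that virtualizes its result."""
    real_getpid = os.getpid  # keep the original, as a C wrapper would via dlsym

    def getpid_wrapper():
        return table.to_virtual(real_getpid())

    os.getpid = getpid_wrapper
    try:
        yield real_getpid
    finally:
        os.getpid = real_getpid  # always restore the original definition


# Usage: code inside the block sees only the virtual pid.
table = PidVirtualizer()
with interpose_getpid(table) as real_getpid:
    print("virtual pid:", os.getpid(), "real pid:", real_getpid())
```

The application keeps seeing the same virtual pid even if the real pid changes (modeled here by rebind). A translation layer of this kind needs an interposition point, which is the thing a statically linked binary does not offer through dlsym.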

DMTCP-style checkpointing:

DMTCP-style checkpointing (interposition on system calls)