Web Resources for CS7680
(Special Topics in Systems)
CS 7680 (Spring, 2017): Virtualization in Computer Systems
Virtualization in its most general sense has become a key technology
in the evolution of the Cloud, the Datacenter, the HPC Cluster, and
the emerging area of the Internet of Things. Virtualization also strongly
affects the following: stateless servers and server migration;
schedulers and load balancing; fault tolerance and checkpoint-restart;
Linux containers (e.g., cgroups, Docker, Mesos) and virtual machines;
distributed debugging and orchestration frameworks; mobile apps; etc.
In terms of a syllabus, my office hours will be during the
hour after class on Tuesday and Friday. I am also generous about
offering other times to meet, and I encourage you to come to my
office (no prior appointment) and interrupt me and ask when we can
meet. Ideally, I will arrange that within half an hour if possible.
As befitting a PhD topics course, if a student actively participates in all
areas (readings, presentation, class discussion, project,
written project report), then the grade will be an A.
Please continue working on your technical paper. You will find
the full instructions for writing your course technical paper at this link.
Please have ready a 2-page introduction, and a first version
of the body based on figures. The schedule/deadlines
are in the instructions (linked to, above) for writing
your course technical paper.
If you expect to be delayed in these deadlines, please indicate
that in red near the top of your document (using the \todo
macro), so that I know when to edit it, as your co-author.
NEWS:
There are now two pages with background information for this
course:
- Computer architecture background page
with information on VLSI and Supercomputing.
- Elements of Parsing an ELF Header
(incomplete still; to be completed)
NEW!
Here are the weekly lecture notes taken by the students.
Here are the background paper readings (many of which will be presented in class),
and here is the schedule for presentations.
See below for:
- A. Three Platforms for Clusters: Datacenter, Cloud, and
HPC (High Performance Computing)
- What are their different characteristics? For example, note
the differing preferences for virtualization:
- Datacenter: often uses Linux containers (e.g., Docker, CoreOS)
- Cloud: often uses virtual machines, especially for IaaS
(Infrastructure as a Service)
- HPC: "bare metal" is preferred; practitioners are reluctant
to give up even 3% overhead. In a $100 million cluster,
3% costs 3 million dollars.
- B. The Datacenter
- The datacenter is both one of the oldest cluster platforms
(machine rooms computing
the payroll, updating the employee or customer database),
and one of the newest cluster platforms (server farm for
customer transactions, data mining, etc.). The modern
datacenter uses orchestration (e.g., Mesos, Kubernetes, ...)
for flexible, dynamic assignment of computer nodes;
and it uses containers (e.g., Docker, CoreOS)
for packaging (the application should not break during a
system upgrade), and reasonable isolation (one container
should not affect the performance or security of another
container).
- C. Virtualization on the three platforms
- All three platforms benefit from virtualization. What are the
modes and motivations for virtualization?
Some examples are virtual machines (e.g., for IaaS) and process
virtualization (interposition on system calls: e.g., library OS,
DMTCP plugins, stub functions in Condor).
- D. Convergence on the three platforms: to converge or not to converge
- Example systems to study include the MOC (Massachusetts Open Cloud),
and OpenStack.
- E. O/S and programming language extensions for virtualization
- Containers use three non-classical system services: namespaces
(e.g., pid namespaces); cgroups or control groups; and
union filesystems (a small read-write filesystem layer on top
of a large, base read-only filesystem). Other examples
of virtualization strategies include "Windows Subsystem for
Linux", exokernels and microkernels, and
shadow device drivers. On the level of programming
languages, there are language virtual machines (e.g., JVM),
and newer systems languages with interesting implications
for virtualization:
Go (Docker containers and
servers, static instead of dynamic linking -- no libc.so);
Rust (Web browsers and web engines, safety and speed
for multi-threaded programs);
Scala (Spark, the successor to MapReduce
for Big Data -- how does one virtualize big data?).
- F. Performance
- Performance considerations often interact badly with virtualization.
Some examples of newer performance optimizations for computers are:
InfiniBand (and newer RDMA-based network fabrics,
such as Intel Omni-Path, with a roadmap toward integration
on the CPU chip); SSDs; Intel/Micron Optane
memory; NVIDIA GPUs; Intel Xeon Phi; etc.
- G. Optional Topics
- Exascale for High Performance Computing: how to
virtualize a million-core computation lasting
for hours?
- Newer storage systems: their characteristics
for large O/S images, frequent checkpoints, etc.
- Linker/loader and ELF: everyone uses it and almost no
one knows the details of how it works; interesting
possibilities for interposition and virtualization
are -Wl,--wrap=foo and LD_PRELOAD
- CPU hardware support for virtual machines
- "virtualization equals security": one can only enforce
what one can interpose on.
- Other??
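For the linker/loader and ELF topic in the list above, a first step in poking at a binary from the outside is simply reading its ELF header. The following is a minimal sketch in C, assuming a 64-bit Linux system with the glibc <elf.h> header and a file whose byte order matches the host. The names elf_summary and read_elf_summary are hypothetical, chosen only for this illustration; they do not come from any course package.

```c
#include <elf.h>     /* Elf64_Ehdr, ELFMAG, EI_CLASS, ... (Linux/glibc) */
#include <stdio.h>
#include <string.h>

/* Hypothetical summary of a few header fields of interest. */
struct elf_summary {
    int type;                /* e_type: ET_EXEC, ET_DYN, ET_REL, ... */
    unsigned long entry;     /* e_entry: virtual address of the entry point */
    unsigned section_count;  /* e_shnum: number of section headers */
};

/* Fill *out from the ELF header of the file at path.
 * Returns 0 on success; -1 if the file cannot be read, is not ELF,
 * or is not a 64-bit ELF file (the only class this sketch handles). */
int read_elf_summary(const char *path, struct elf_summary *out)
{
    Elf64_Ehdr header;
    FILE *file = fopen(path, "rb");
    if (file == NULL)
        return -1;

    size_t items = fread(&header, sizeof header, 1, file);
    fclose(file);
    if (items != 1)
        return -1;

    /* The first four bytes must be the magic "\177ELF". */
    if (memcmp(header.e_ident, ELFMAG, SELFMAG) != 0)
        return -1;
    if (header.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;

    out->type = header.e_type;
    out->entry = (unsigned long) header.e_entry;
    out->section_count = header.e_shnum;
    return 0;
}
```

Calling read_elf_summary("/proc/self/exe", &s) inspects the running program itself. A position-independent executable or a shared library typically reports ET_DYN, while a traditional fixed-address executable reports ET_EXEC.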
At this point, I am still flexible on the course structure. But I want
it to include an emphasis on these three elements:
- Paper readings, including oral presentations in class
- A medium-sized, but open-ended, exploratory Software Project
(The project can have software targets; or it can be
based on paper-and-pencil design with reference to
existing software building blocks. The practical experience
of designing systems is important for gaining insights, and
for critically analyzing academic research papers.)
- Frequent Updates to your own technical writing, which
may cover a survey of some aspect of the paper readings,
or a report on discoveries from your exploration through
an open-ended software project, or both. Written technical
communication is a key skill. There are rules for good technical
writing, and I intend to teach those rules
(going beyond mere evaluation with feedback like
"good", "fair", and "needs improvement").
My own view on this subject is that virtualization is totally
changing the way we use computers. We see apps moved from the
desktop to a Java Virtual Machine on a smartphone, or a Linux
container (e.g., LXC, Docker) in the Cloud, or a Virtual Machine in the Cloud,
or the Microsoft idea of a Library OS (variously tied to
the idea of a "Drawbridge picoprocess" or a "Universal Windows
Platform" (UWP) or a Windows Subsystem for Linux (WSL; i.e.,
"Ubuntu on Windows")).
Even the idea of a container
can be decomposed into namespaces (for pid's, network addresses, etc.),
cgroups (control groups for limiting resource usage), and
a union filesystem (with a base, pre-packaged read-only system;
and a read-write user-controlled filesystem on top of it).
The ideas of "namespace" and "union filesystem" come close to
my own research group's work on
"process virtualization".
The past work of my own research group has led from checkpointing
(see
DMTCP --- a widely-used checkpointing project now in its
second decade) to questions of what is the ultimate goal
for modern checkpointing. The old problem of checkpointing is
a solved problem. An application was assumed to be
self-contained, and there are several good, robust
solutions out there. But how does an application
interact with external processes and constructs that
are outside its natural home in the Cloud, HPC cluster, or especially
the Datacenter? Our current answer to this question is
"process virtualization", and I am interested in how
process virtualization can interact with VM-based virtualization,
container-based virtualization, and language-based virtualization
(e.g., JVM).
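Of the container building blocks mentioned above, the pid namespace is perhaps the easiest to poke at directly. The following is a minimal sketch in C, assuming a Linux kernel that permits unprivileged user namespaces (many systems and sandboxes disable them, in which case the function reports failure). The function name pid_seen_in_new_pid_namespace is hypothetical. It uses the Linux unshare() call to create a new user namespace and pid namespace, and then asks the first process inside what pid it sees for itself.

```c
#define _GNU_SOURCE
#include <sched.h>      /* unshare, CLONE_NEWUSER, CLONE_NEWPID */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the pid that the first process in a fresh pid namespace sees
 * for itself (expected: 1), or -1 if the namespaces cannot be created
 * (for example, when unprivileged user namespaces are disabled). */
int pid_seen_in_new_pid_namespace(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t helper = fork();
    if (helper < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (helper == 0) {
        /* Helper process: unshare here, so the caller is left untouched. */
        close(fds[0]);
        if (unshare(CLONE_NEWUSER | CLONE_NEWPID) != 0)
            _exit(1);
        /* A new pid namespace applies to children only, so fork again. */
        pid_t inner = fork();
        if (inner < 0)
            _exit(1);
        if (inner == 0) {
            int self = (int) getpid();   /* pid as seen inside the namespace */
            if (write(fds[1], &self, sizeof self) != (ssize_t) sizeof self)
                _exit(1);
            _exit(0);
        }
        waitpid(inner, NULL, 0);
        _exit(0);
    }

    /* Caller: read the report, if any, and reap the helper. */
    close(fds[1]);
    int seen = -1;
    ssize_t got = read(fds[0], &seen, sizeof seen);
    close(fds[0]);
    waitpid(helper, NULL, 0);
    return got == (ssize_t) sizeof seen ? seen : -1;
}
```

From the host's point of view, the inner process still has an ordinary pid. This double view (one pid inside, another outside) is the kind of translation that a virtualization layer must maintain.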
Personally, I believe that all three platforms (the Cloud, HPC cluster, and
the Data Center) are fast evolving and converging into a new
future concept that none of us might fully recognize today.
Evidence of this is interactive use of Slurm in a batch-oriented
HPC cluster, "bare-metal" Clouds and other completely new entities,
and Mesos-based Data Centers that can run a Cloud and an HPC cluster
inside Mesos while allowing for dynamic tradeoffs of resources.
Much of this vision will be radically transformed by new
memory technologies, including SSDs and the 3D XPoint (aka Optane)
of Intel and Micron. How do we integrate all this information and
prepare for this future convergence of today's paradigms?
So, just as the three blind men try to describe an elephant, we are
all trying to describe the future of computing. I believe that
Data Centers such as those based on Mesos (the newest creature on the block)
may hold particular insights. But let's develop the course content
together as a community project among interested students.
The motivation for paper readings is obvious. The motivation for
medium-sized exploratory software is to "keep us honest" (see projects, below).
It's
easy to read papers and spin up castles in the air. But how do we
know that those castles in the air are practical? What are the
real challenges when we try to build our castle in the air?
And finally, the motivation for an emphasis on technical writing with
frequent updates is that technical writing is a critical skill that
is typically given low priority due to the pressure of other concerns.
But especially in today's environment of highly competitive conferences,
and the need to exploit the web to propagate your technical visions, it is
critical to write well. There are rules to learn good technical writing,
just as there are rules to learn good principles of writing software.
It will be my responsibility to serve as the "software compiler", and
give you frequent feedback on issues with your writing. Probably,
I will use ShareLaTeX to easily share a single source file written in the
commonly used C.S. standard: LaTeX. But I am open to other suggestions.
My goal is to be able to comment on and modify your writing at any
time of day or night.
If it is possible, I also hope to create a web of technical writing from
the writing products of the whole class. I've never done that before,
and I don't know how to do that yet. But let's see how it turns out.
One last comment on writing: Good technical writing has nothing to do
with whether your native language is English or something else.
The content of good technical writing will shine through a translation
into any language at all. It is the skill of writing good technical
content that I am looking for.
I would like to see each student choose a course project on the
general theme of virtualization. Ideally, if the student is already
working on a thesis or other long-term project, then some aspect
of that involving virtualization can be used as the course project.
The project can even be a thesis proposal or a chapter in a thesis.
I am not concerned with the particular functionality of the project.
The project may be based on actual software, or on a software paper design.
The project will be used as a vehicle for learning better
technical communication skills. We will use Overleaf or
Google Docs as a way to share a document between the professor and
the student. In this way, I can asynchronously point out places
where full communication could be improved.
From time to time, I will also ask you to spontaneously communicate
orally about the ideas of your projects. The goal is fluency in
communication --- not polished presentation. Polished presentation
will be emphasized closer to the end of the course.
Further, the emphasis of this software (or paper design) project will
not be software engineering. I don't care if any of your software
works at the end, or if any of your paper design has been implemented at
the end. Instead, I want you to poke at large, complex software from
the outside, in order to get insights into it. There are two kinds of
"proof" in this world: mathematical proof and scientific
proof. Mathematical proof is about formal proofs. It relates to
formal verification and to semantics of programming language. It works
best for programming "in the small". Scientific
proof is evidence-based. It works best for programming "in the large".
What are some scientific experiments that you
can perform on this large software in order to gain evidence for its
expected behavior? Virtualization is about interposing on large,
complex software. How does one poke at it, without having to spend
huge amounts of time reverse-engineering the internals of that software?
If you don't already have a project, or if you are interested in doing
something different, here are some possible ideas. This list may grow
over time. It is particularly DMTCP-centric because my group's research
emphasizes that project. Students are welcome to substitute their
favorite complex software (Hadoop?, Spark?, MPI?, a distributed system?)
for DMTCP, and ask similar questions.
- Microsoft supports Azure for their Cloud. Azure supports
a large subset of the Linux API. (See
Windows Subsystem for Linux and the
Library OS paper mentioned
there, for some insights into how Microsoft has used virtualization
to support Linux on top of Windows, which is also being done
in Microsoft Azure.) The goal of this project will be
to get a free student account for Microsoft Azure, and then
discover to what extent a package using many low-level
Linux features, such as
DMTCP, can or cannot
be ported to Microsoft Azure. If it is difficult to port,
what are the difficulties, and what subset of DMTCP might
be able to be ported? Is there a general way to describe
universal requirements for support of DMTCP on any possible
operating system (even a real-time embedded O/S)?
- DMTCP uses the dlsym library function to interpose on
functions and create wrapper functions. This is one of the key
requirements in the development of DMTCP. It is also an
excellent way to do virtualization through interposition.
However, Docker
and the Go language use statically linked binaries. For
statically linked binaries, dlsym has no effect.
(After all, "dl" stands for "dynamically linked".) What are
the alternatives, in order to support statically linked binaries?
If you choose this, I will provide you with a tiny package
that parses ELF libraries and then directly does interposition
on ELF symbols. How easy is it to port DMTCP to use this
new package instead of dlsym? In general, to what
extent does static linking imply that we lose the ability
to interpose (and hence virtualize), and to what extent
can we get around this with ELF symbols, ELF relocation,
trampolines, etc? (See these slides
for an overview of DMTCP.)
- In supercomputing, the InfiniBand network fabric has been
the standard for over a decade. It is based on the concept
of RDMA. There are now newer network fabrics that the world
is moving toward. For example, Intel Omni-Path stands a good chance of
becoming the next standard. There are also attempts to unify
these standards (e.g., OpenFabric). Every time that a new
network is chosen, most of the HPC software stack must be ported
toward that new stack. Further, DMTCP must then create a new
plugin for the new network. What is the potential for creating
an abstraction for an RDMA network sufficient for interposition
(and hence for virtualization)? Note that OpenFabric tries
to provide a general network API or intermediate layer from which
one can then make calls to a lower-level layer. The question here
is subtly different. Suppose we don't want a higher-level abstraction
that may or may not cover the next brand new network fabric.
A high-level abstraction attempts to achieve backwards compatibility
with all earlier standards. But suppose that instead we want
to achieve forwards compatibility!
Suppose we want a single parametrized model in advance that
covers any possible future network fabric. How general can we
make this parametrized model so that it is future-proof?
- Docker
is a well-known container system. Another that has a little
less mindshare, but is equally interesting, is
CoreOS. To date,
the largest use of both is for non-persistent applications.
If a server dies in the middle of a transaction, then either
some front-end will redirect that transaction, or the client (user)
should re-submit the transaction. One is beginning to see
an interest now in persistent container applications.
A container can require more than a gigabyte for its filesystem.
Even worse are the running daemons (typically launched by
systemd). We can checkpoint just the application inside
the container (easier) or we can checkpoint the entire container
including systemd (more difficult). Checkpointing a
network of
QEMU Virtual Machines over a Linux kernel with KVM (with
slides here) turns out
to be surprisingly easy, since a guest VM appears as a process
to the host. Checkpointing a container is more difficult due
to issues of systemd, etc. Here's the
CRIU document on how
they approach this problem. With DMTCP, one would extend it
to support statically linked executables, and then use the
DMTCP plugin model to support networks and other services
of system daemons. Compare and contrast the two approaches
(and any others). What are the pros and cons? Could we extend
this to creating an operating system with a "fast restart"
mode (restart the daemons from a checkpoint image), or a
fast live migration?
- I also intend to add some projects more closely oriented toward
the datacenter and possibly Docker, CoreOS, or another
Container-like system. I'm thinking about Mesos
and Kubernetes. Ask me if you're interested.
- I also intend to add some projects more closely oriented toward
the Cloud, with the Massachusetts Open Cloud as a major resource.
Ask me if you're interested.
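For the dlsym project in the list above, the basic wrapper-function technique can be sketched as follows. This is a minimal illustration in C, assuming Linux with glibc; it is not DMTCP's actual code. The constant VIRTUAL_PID and the variable last_real_pid are hypothetical names invented for this example. The wrapper has the same name as the libc function, looks up the next definition of getpid with dlsym(RTLD_NEXT, ...), and hands the caller a virtual value in place of the real one.

```c
#define _GNU_SOURCE
#include <dlfcn.h>        /* dlsym, RTLD_NEXT */
#include <sys/syscall.h>  /* SYS_getpid */
#include <sys/types.h>
#include <unistd.h>

#define VIRTUAL_PID 40000     /* hypothetical fixed virtual pid */

pid_t last_real_pid = -1;     /* real pid most recently observed */

/* Wrapper with the same name and signature as the libc function.
 * Callers see VIRTUAL_PID; the real pid is recorded on the side. */
pid_t getpid(void)
{
    static pid_t (*real_getpid)(void);

    /* Look up the next definition of getpid after this object. */
    if (real_getpid == NULL)
        real_getpid = (pid_t (*)(void)) dlsym(RTLD_NEXT, "getpid");

    /* If no later definition is found (e.g., in a static binary),
     * fall back to the raw system call. */
    last_real_pid = real_getpid != NULL ? real_getpid()
                                        : (pid_t) syscall(SYS_getpid);
    return VIRTUAL_PID;
}
```

Built as a shared library (for example, gcc -shared -fPIC -o libvpid.so vpid.c -ldl, where the file names are placeholders) and loaded with LD_PRELOAD=./libvpid.so, the wrapper is found before libc's getpid in a dynamically linked program. A statically linked program never consults the dynamic linker at run time, which is exactly the difficulty that the project above asks about.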
DMTCP-style checkpointing (interposition
on system calls)