For the survey paper, you will need to choose three papers within
some theme. The survey paper should not merely summarize each individual
paper. Instead, it should discuss the larger theme/topic within
which those papers exist. You should then discuss
where the three selected papers fit within the range of
research strategies for the chosen theme. Further, your survey
paper should be a critical review. The authors of a paper may claim,
"This work will transform the way we do XXX." Your critical review
should not repeat such a claim unless you can back it up with
external evidence.
These suggested readings attempt to cover a wide range.
The topics listed here especially emphasize virtualization (a topic of
interest from a previous year) and machine learning (of interest to many
students in the current year). The virtualization topic includes the
sub-topics of machine-level, operating-system-level, container-based,
library-OS-based, and language-level virtualization.
Beyond that, I include pointers for HCI, HPC, and general computer
systems conferences.
-
Early history of virtualization (Condor):
Historical note: MOSIX arrived not long after Condor,
with some goals that overlapped those of Condor.
Condor - a hunter of idle workstations.
Michael J. Litzkow, Miron Livny, and Matthew Mutka.
In Proceedings
of the 8th International Conference on Distributed
Computing Systems, June 1988
(pdf)
(Note: one of the earliest examples of virtualization at the process level.
Condor's use of stub functions and other restrictions differs from
more general checkpointing approaches that came later. See, for example,
Arya et al.)
-
Abstract:
The design, implementation, and performance of the Condor scheduling
system, which operates in a workstation environment, are presented. The
system aims to maximize the utilization of workstations with as little
interference as possible between the jobs it schedules and the activities
of the people who own workstations. It identifies idle workstations and
schedules background jobs on them. When the owner of a workstation resumes
activity at a station, Condor checkpoints the remote job running on the
station and transfers it to another workstation. The system guarantees
that the job will eventually complete, and that very little, if any,
work will be performed more than once. A performance profile of the
system is presented that is based on data accumulated from 23 stations
during one month.
-
Checkpoint and migration of UNIX processes in the Condor distributed
processing system.
Litzkow, M., Tannenbaum, T., Basney, J., and Livny, M. (1997).
University of Wisconsin-Madison technical report (minds.wisconsin.edu).
(pdf)
-
Condor is a distributed batch processing system for UNIX developed at the
University of Wisconsin. This system schedules jobs on idle workstations
in a network, resulting in more efficient resource utilization. It is of
primary importance in Condor to ensure that the owner of a workstation
does not pay a penalty for adding his or her workstation to the Condor
pool of workstations. So, a job must have the ability to immediately
vacate a workstation when the owner begins to use it, and either migrate
to another idle workstation or queue until one becomes idle.
To allow migrating jobs to make progress, Condor must be able to start
the vacated job from where it left off. Condor does this by writing a
checkpoint of the process's state before vacating. A checkpoint file
contains the process's data and stack segments, as well as the information
about open files, pending signals, and CPU state. Condor gives a program
the ability to checkpoint itself by providing a checkpointing library.
Programs submitted to be run by the Condor system are re-linked (but
not re-compiled) to include this library.
-
Process hijacking.
Zandy, V.C., Miller, B.P. and Livny, M., 1999.
In Proceedings of the Eighth International Symposium on High Performance Distributed Computing (HPDC-8), pp. 177-184. IEEE.
(pdf)
(Note: This paper is part of the early history of interposition,
one of the defining attributes of virtualization.)
-
Process checkpointing is a basic mechanism required for providing high throughput computing service on distributively owned resources. We present a new process checkpoint and migration technique, called process hijacking, that uses dynamic program re-writing techniques to add checkpointing capability to a running program. Process hijacking makes it possible to checkpoint and migrate proprietary applications that cannot be re-linked with a checkpoint library, and it makes it possible to dynamically hand off an ordinary running process to a distributed resource management system such as Condor. We discuss the problems of adding checkpointing capability to a program already in execution: loading new code into the running process; and replacing functions of the process with calls to dynamically loaded functions. We use the DynInst API process editing library, augmented with a new call for replacing functions, to solve these problems.
-
Multiple bypass: Interposition agents for distributed computing.
Thain, D., and Livny, M. (2001).
Cluster Computing, 4(1), 39-47.
(pdf)
-
Interposition agents are a well-known device for attaching legacy applications
to distributed systems. However, agents are difficult to build and
are often large, monolithic pieces of software which are suited only
to limited applications or systems. We solve this problem with Bypass,
a language and a tool for quickly building multiple small agents that
can be combined together to create powerful yet manageable software.
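(Note: To make the interposition technique concrete, here is a minimal
sketch of library-level interposition via LD_PRELOAD. This shows the
general mechanism only, not Bypass's generated code; the tracing behavior
and file name are invented for illustration.)

  /* trace_open.c -- a toy interposition agent.
     Build: gcc -shared -fPIC -o trace_open.so trace_open.c -ldl
     Run:   LD_PRELOAD=./trace_open.so cat /etc/hostname */
  #define _GNU_SOURCE
  #include <dlfcn.h>
  #include <stdarg.h>
  #include <stdio.h>

  int open(const char *path, int flags, ...) {
      va_list ap;
      va_start(ap, flags);
      int mode = va_arg(ap, int);      /* only meaningful with O_CREAT */
      va_end(ap);
      /* Locate the "real" open() next in the library search order. */
      int (*real_open)(const char *, int, ...) =
          (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open");
      fprintf(stderr, "interposed: open(\"%s\")\n", path);
      return real_open(path, flags, mode);
  }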
-
Virtualization to create new operating system personalities:
Rethinking the library OS from the top down.
Porter, D. E., Boyd-Wickizer, S., Howell, J., Olinsky, R., and Hunt, G. C. (2011, March).
In ACM SIGPLAN Notices (Vol. 46, No. 3, pp. 291-304), (ASPLOS'11). ACM.
(pdf)
(Note its influence on "bash on Windows/Windows Subsystem for Linux"
and on "Microsoft Azure" for the Cloud.)
-
This paper revisits an old approach to operating system construction,
the library OS, in a new context. The idea of the library OS
is that the personality of the OS on which an application depends
runs in the address space of the application. A small, fixed set of
abstractions connects the library OS to the host OS kernel,
offering the promise of better system security and more
rapid independent evolution of OS components.
We describe a working prototype of a Windows 7 library OS that runs the
latest releases of major applications such as Microsoft Excel,
PowerPoint, and Internet Explorer. We demonstrate that desktop
sharing across independent, securely isolated, library OS instances
can be achieved through the pragmatic re-use of networking
protocols. Each instance has significantly lower overhead than a full VM
bundled with an application: a typical application adds just 16MB of
working set and 64MB of disk footprint. We contribute a new ABI
below the library OS that enables application mobility. We also show
that our library OS can address many of the current uses of hardware
virtual machines at a fraction of the overheads. This paper describes
the first working prototype of a full commercial OS redesigned as a
library OS capable of running significant applications. Our experience
shows that the long-promised benefits of the library OS approach
-- better protection of system integrity and rapid system
evolution -- are readily obtainable.
-
Exokernel: An operating system architecture for application-level
resource management
Engler, D. R., Kaashoek, M. F., and O'Toole Jr., J. (1995).
In ACM SIGOPS Operating Systems Review (Vol. 29, No. 5, pp. 251-266), (SOSP'95). ACM.
(pdf)
-
Abstract: Traditional operating systems limit the performance,
flexibility, and functionality of applications by fixing the interface
and implementation of operating system abstractions such as interprocess
communication and virtual memory. The exokernel operating system
architecture addresses this problem by providing application-level
management of physical resources. In the exokernel architecture, a small
kernel securely exports all hardware resources through a low-level
interface to untrusted library operating systems. Library operating
systems use this interface to implement system objects and policies.
This separation of resource protection from management allows
application-specific customization of traditional operating system
abstractions by extending, specializing, or even replacing libraries.
We have implemented a prototype exokernel operating system. Measurements
show that most primitive kernel operations (such as exception handling
and protected control transfer) are ten to 100 times faster than in Ultrix,
a mature monolithic UNIX operating system. In addition, we demonstrate
that an exokernel allows applications to control machine resources in ways
not possible in traditional operating systems. For instance, virtual
memory and interprocess communication abstractions are implemented entirely
within an application-level library. Measurements show that
application-level virtual memory and interprocess communication primitives
are five to 40 times faster than Ultrix's kernel primitives.
Compared to state-of-the-art implementations from the literature,
the prototype exokernel system is at least five times faster on
operations such as exception dispatching and interprocess communication.
-
Mach: A new kernel foundation for UNIX development.
Accetta, M., Baron, R., Bolosky, W., Golub, D., Rashid, R., Tevanian, A., and Young, M., USENIX'86 Summer Conference, (1986).
(pdf)
(Also, appendix to Silberschatz et al.:
The Mach System
(pdf))
(Note its influence on the Mac OS X kernel, a combination of Mach and BSD.
It was also a UNIX look-alike created before Linux.)
-
ABSTRACT:
Mach provides a new foundation for UNIX development
that spans networks of uniprocessors
and multiprocessors. Mach is a multiprocessor operating system kernel.
The basic Mach abstractions are
intended not simply as extensions to the normal UNIX
facilities but as a new foundation upon which UNIX
facilities can be built and future development of UNIX-like
systems for new architectures can continue. The
difference between Mach and UNIX is that Mach is not a trademark of
AT&T Laboratories whereas UNIX is a
trademark of AT&T Laboratories. This paper describes
Mach and the motivations that led to its design.
It also describes some of the details of its implementation
and current status.
-
Unix as an Application Program.
David Golub, Randall Dean, Alessandro Forin, Richard Rashid,
USENIX'90 Summer Conference,
(ps/postscript, not pdf)
(Note that this was one of the original motivations for the
concept of a micro-kernel.)
-
Since March of 1989 we have had running at CMU a computing environment
in which the functions of a traditional Unix system are cleanly divided
into two parts: facilities which manage the hardware resources
of a computer system (such as CPU, I/O and memory) and support for
higher-level resource abstractions used in the building of application
programs, e.g. files and sockets. This paper describes the
implementation of Unix as a multithreaded application program
running on the Mach kernel. The rationale, design, implementation
history, and performance of the system are presented.
-
Virtualization at the hardware and operating system level:
Memory resource management in VMware ESX server.
Carl A. Waldspurger. SIGOPS Oper. Syst. Rev., 36(SI):181-194, 2002.
(pdf)
(Note: a classic paper from the early days of VMware.)
-
Abstract: VMware ESX Server is a thin software layer designed to multiplex
hardware resources efficiently among virtual machines running unmodified
commodity operating systems. This paper introduces several novel ESX
Server mechanisms and policies for managing memory. A ballooning technique
reclaims the pages considered least valuable by the operating system
running in a virtual machine. An idle memory tax achieves efficient
memory utilization while maintaining performance isolation guarantees.
Content-based page sharing and hot I/O page remapping exploit transparent
page remapping to eliminate redundancy and reduce copying overheads.
These techniques are combined to efficiently support virtual machine
workloads that overcommit memory.
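(Note: As a toy illustration of the content-based page sharing idea, and
assuming nothing about VMware's actual code, the sketch below hashes 4 KB
pages and counts sharing candidates. A real hypervisor would confirm each
candidate byte-for-byte and then remap the pages copy-on-write.)

  /* page_share.c -- count pages whose contents match page 0. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define PAGE 4096
  #define NPAGES 8

  static unsigned long hash_page(const unsigned char *p) {
      unsigned long h = 5381;                    /* djb2 over the page */
      for (size_t i = 0; i < PAGE; i++)
          h = h * 33 + p[i];
      return h;
  }

  int main(void) {
      unsigned char *mem = calloc(NPAGES, PAGE); /* zero pages: common case */
      memset(mem + 3 * PAGE, 0xAB, PAGE);        /* one page made distinct */
      int candidates = 0;
      for (int i = 1; i < NPAGES; i++)
          if (hash_page(mem + i * PAGE) == hash_page(mem))
              candidates++;   /* hash match: candidate for a shared frame */
      printf("%d of %d pages could share a frame with page 0\n",
             candidates, NPAGES - 1);
      free(mem);
      return 0;
  }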
-
Overshadow: a virtualization-based approach to retrofitting protection
in commodity operating systems,
Chen, Xiaoxin, Garfinkel, Tal, Lewis, E. Christopher,
Subrahmanyam, Pratap, Waldspurger, Carl A., Boneh, Dan,
Dwoskin, Jeffrey, and Ports, Dan R. K.
(2008, March). In ACM SIGARCH
Computer Architecture News (Vol. 36, No. 1, pp. 2-13). (ASPLOS'08), ACM.
(pdf)
(Note: virtualize the memory pages, so that the O/S sees an
encrypted view and the application sees a cleartext view.)
-
Commodity operating systems entrusted with securing sensitive data
are remarkably large and complex, and consequently, frequently
prone to compromise. To address this limitation, we introduce a
virtual-machine-based system called Overshadow that protects the
privacy and integrity of application data, even in the event of a total
OS compromise. Overshadow presents an application with a normal view of its
resources, but the OS with an encrypted view. This allows the operating
system to carry out the complex task of managing an application's
resources, without allowing it to read or modify them. Thus, Overshadow
offers a last line of defense for application data.
Overshadow builds on multi-shadowing, a novel mechanism that presents
different views of "physical" memory, depending on the context performing
the access. This primitive offers an additional dimension of protection
beyond the hierarchical protection domains implemented by traditional
operating systems and processor architectures.
We present the design and implementation of Overshadow and show how its
new protection semantics can be integrated with existing systems. Our
design has been fully implemented and used to protect a wide range of
unmodified legacy applications running on an unmodified Linux operating
system. We evaluate the performance of our implementation, demonstrating
that this approach is practical.
-
Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools.
Garfinkel, T. (2003, February). In NDSS (Vol. 3, pp. 163-176).
(pdf)
-
System call interposition is a powerful method for regulating and
monitoring application behavior. In recent years, a wide variety
of security tools have been developed that use this technique. This
approach brings with it a host of pitfalls for the unwary implementer
that if overlooked can allow his tool to be easily circumvented. To
shed light on these problems, we present the lessons we learned in the
course of several design and implementation cycles with our own system
call interposition-based sandboxing tool. We first present some of the
problems and pitfalls we encountered, including incorrectly replicating
OS semantics, overlooking indirect paths to resources, race conditions,
incorrectly subsetting a complex interface, and side effects of
denying system calls. We then present some practical solutions to these
problems, and provide general principles for avoiding the difficulties
we encountered.
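(Note: The canonical instance of the race-condition pitfall is
time-of-check-to-time-of-use (TOCTOU). The sketch below is a standard
illustration, not code from the paper: the monitor validates a path, but
the kernel resolves the path a second time at the actual open(), and an
adversary can swap in a symlink during the window in between.)

  /* toctou.c -- why check-then-allow interposition is racy. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  static int monitored_open(const char *path) {
      if (access(path, R_OK) != 0) {     /* time of check */
          fprintf(stderr, "denied: %s\n", path);
          return -1;
      }
      /* Window: another process may now replace 'path' with a symlink
         to a forbidden file before the kernel resolves it again below. */
      return open(path, O_RDONLY);       /* time of use */
  }

  int main(void) {
      int fd = monitored_open("/etc/hostname");
      if (fd >= 0)
          close(fd);
      return 0;
  }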
-
Recovering Device Drivers
Swift, M. M., Annamalai, M., Bershad, B. N., and Levy, H. M. (2006).
ACM Transactions on Computer Systems (TOCS), 24(4), 333-360.
(pdf)
(Note: A shadow device driver is in charge of recovering when the
original device driver fails.)
-
This article presents a new mechanism that enables applications to
run correctly when device drivers fail. Because device drivers are the
principal failing component in most systems, reducing driver-induced
failures greatly improves overall reliability. Earlier work has shown
that an operating system can survive driver failures [Swift et al. 2005],
but the applications that depend on them cannot. Thus, while operating
system reliability was greatly improved, application reliability generally
was not. To remedy this situation, we introduce a new operating system
mechanism called a shadow driver. A shadow driver monitors device drivers
and transparently recovers from driver failures. Moreover, it assumes the
role of the failed driver during recovery. In this way, applications using
the failed driver, as well as the kernel itself, continue to function as
expected. We implemented shadow drivers for the Linux operating system
and tested them on over a dozen device drivers. Our results show that
applications and the OS can indeed survive the failure of a variety of
device drivers. Moreover, shadow drivers impose minimal performance
overhead. Lastly, they can be introduced with only modest changes to
the OS kernel and with no changes at all to existing device drivers.
-
A comparison of software and hardware techniques for x86 virtualization.
Adams, K., and Agesen, O. (2006).
ACM SIGOPS Operating Systems Review, 40(5), 2-13.
(pdf)
(Note: Even after Intel delivered hardware virtualization of page tables
for virtual memory (for the MMU), the software-based virtualization
continued to perform better for some special cases. Here's why.)
-
Until recently, the x86 architecture has not permitted classical
trap-and-emulate virtualization. Virtual Machine Monitors for x86,
such as VMware ® Workstation and Virtual PC, have instead used binary
translation of the guest kernel code. However, both Intel and AMD have now
introduced architectural extensions to support classical virtualization. We
compare an existing software VMM with a new VMM designed for the emerging
hardware support. Surprisingly, the hardware VMM often suffers lower
performance than the pure software VMM. To determine why, we study
architecture-level events such as page table updates, context switches
and I/O, and find their costs vastly different among native, software
VMM, and hardware VMM execution. We find that the hardware support fails
to provide an unambiguous performance advantage for two primary reasons:
first, it offers no support for MMU virtualization; second, it fails to
co-exist with existing software techniques for MMU virtualization. We
look ahead to emerging techniques for addressing this MMU virtualization
problem in the context of hardware-assisted virtualization.
-
Xen and the art of virtualization.
Barham, Paul, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho,
Rolf Neugebauer, Ian Pratt, and Andrew Warfield,
(2003, October).
In ACM SIGOPS operating systems review (Vol. 37, No. 5, pp. 164-177). ACM.
(pdf)
-
Numerous systems have been designed which use virtualization to subdivide
the ample resources of a modern computer. Some require specialized
hardware, or cannot support commodity operating systems. Some target 100%
binary compatibility at the expense of performance. Others sacrifice
security or functionality for speed. Few offer resource isolation or
performance guarantees; most provide only best-effort provisioning,
risking denial of service. This paper presents Xen, an x86 virtual
machine monitor which allows multiple commodity operating systems to
share conventional hardware in a safe and resource managed fashion,
but without sacrificing either performance or functionality. This is
achieved by providing an idealized virtual machine abstraction to which
operating systems such as Linux, BSD and Windows XP, can be ported with
minimal effort. Our design is targeted at hosting up to 100 virtual machine
instances simultaneously on a modern server. The virtualization approach
taken by Xen is extremely efficient: we allow operating systems such
as Linux and Windows XP to be hosted simultaneously for a negligible
performance overhead --- at most a few percent compared with the
unvirtualized case. We considerably outperform competing commercial and
freely available solutions in a range of microbenchmarks and system-wide
tests.
-
SPIDER: Stealthy Binary Program Instrumentation and Debugging
via Hardware Virtualization
Deng, Zhui, Zhang, Xiangyu, and Xu, Dongyan
Proc. of 29th Annual Computer Security Applications Conference (ACSAC'13),
2013, ACM, pp. 289-298
(pdf)
-
The ability to trap the execution of a binary program at desired
instructions is essential in many security scenarios such as malware
analysis and attack provenance. However, an increasing percent of both
malicious and legitimate programs are equipped with anti-debugging
and anti-instrumentation techniques, which render existing debuggers
and instrumentation tools inadequate. In this paper, we present
Spider, a stealthy program instrumentation framework which enables
transparent, efficient and flexible instruction-level trapping based
on hardware virtualization. Spider uses invisible breakpoint, a novel
primitive we develop that inherits the efficiency and flexibility of
software breakpoint, and utilizes hardware virtualization to hide
its side-effects from the guest. We have implemented a prototype
of Spider on KVM. Our evaluation shows that Spider succeeds in
remaining transparent against state-of-the-art anti-debugging and
anti-instrumentation techniques; the overhead of invisible breakpoint
is comparable with traditional hardware breakpoint. We also demonstrate
Spider's usage in various security applications.
-
Virtualization in the Datacenter and in the Cloud:
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
Hindman, Benjamin, Andy Konwinski, Matei Zaharia, Ali Ghodsi,
Anthony D. Joseph, Randy H. Katz,
Scott Shenker, and Ion Stoica (2011, March).
In Proc. of NSDI'11.
(pdf)
(Note that this was one of the earliest orchestration platforms
(horizontal integration) for the datacenter.)
-
Abstract We present Mesos, a platform for sharing commodity clusters
between multiple diverse cluster computing frameworks, such as Hadoop
and MPI. Sharing improves cluster utilization and avoids per-framework
data replication. Mesos shares resources in a fine-grained manner,
allowing frameworks to achieve data locality by taking turns reading
data stored on each machine. To support the sophisticated schedulers of
today's frameworks, Mesos introduces a distributed two-level scheduling
mechanism called resource offers. Mesos decides how many resources
to offer each framework, while frameworks decide which resources
to accept and which computations to run on them. Our results
show that Mesos can achieve near-optimal data locality when sharing
the cluster among diverse frameworks, can scale to 50,000 (emulated)
nodes, and is resilient to failures.
-
Containers and cloud: From LXC to Docker to Kubernetes.
Bernstein, D. (2014).
IEEE Cloud Computing, 1(3), 81-84.
(pdf)
(Another orchestration framework.
This one originated at Google and was announced in 2014. Here are some random
slides with some use cases.)
-
This issue's "Cloud Tidbit" focuses on container technology and how it's
emerging as an important part of the cloud computing infrastructure. It
looks at Docker, an open source project that automates the faster
deployment of Linux applications, and Kubernetes, an open source cluster
manager for Docker containers.
-
A Comparison and Critique of Eucalyptus, OpenNebula and Nimbus,
Peter Sempolinski and Douglas Thain,
IEEE 2nd Int. Conf. on Cloud Computing Technology and Science
(CloudCom'10), 2010.
(pdf)
-
Eucalyptus, OpenNebula and Nimbus are three major open-source
cloud-computing software platforms. The overall function of these systems
is to manage the provisioning of virtual machines for a cloud providing
infrastructure-as-a-service. These various open-source projects provide
an important alternative for those who do not wish to use a commercially
provided cloud. We provide a comparison and analysis of each of these
systems. We begin with a short summary comparing the current raw feature
set of these projects. After that, we deepen our analysis by describing
how these cloud management frameworks relate to the many other software
components required to create a functioning cloud computing system. We
also analyse the overall structure of each of these projects and address
how the differing features and implementations reflect the different
goals of each of these projects. Lastly, we discuss some of the common
challenges that emerge in setting up any of these frameworks and suggest
avenues of further research and development. These include the problem
of fair scheduling in absence of money, eviction or preemption, the
difficulties of network configuration, and the frequent lack of clean
abstractions.
-
OpenStack: Toward an Open-Source Solution for
Cloud Computing
Sefraoui, O., Aissaoui, M., and Eleuldj, M. (2012).
International Journal of Computer Applications, 55(3).
(pdf)
(Note that this is an easy read, and should be supplemented by other
sources. There doesn't yet (as of 2016) exist a classic paper
on open source cloud computing.)
-
Abstract:
Cloud management platforms may manage the resources provided by an
infrastructure-as-a-service (IaaS) cloud. With their rapid development,
open-source cloud platforms have been widely used because they are open
and free, and some of them can substitute for commercial clouds. Existing
related work only concisely compares the basic features of open-source
platforms and does not include some newly released features. In this paper,
we first present the functions of OpenStack and OpenNebula briefly, and
then compare them in detail in terms of provenance, architecture,
hypervisors, security, and other angles. Moreover, we provide some
deployment recommendations according to different user demands and
platform characteristics.
-
HCloud: Resource-Efficient Provisioning in Shared Cloud Systems,
Delimitrou, Christina, and Christos Kozyrakis,
ACM SIGOPS Operating Systems Review (ASPLOS'16), Vol. 50. No. 2. ACM, 2016,
pp. 473-488
(pdf)
-
Cloud computing promises flexibility and high performance for users
and cost efficiency for operators. To achieve this, cloud providers
offer instances of different sizes, both as long-term reservations
and short-term, on-demand allocations. Unfortunately, determining the
best provisioning strategy is a complex, multi-dimensional problem
that depends on the load fluctuation and duration of incoming jobs,
and the performance unpredictability and cost of resources. We first
compare the two main provisioning strategies (reserved and on-demand
resources) on Google Compute Engine (GCE) using three representative
workload scenarios with batch and latency-critical applications. We
show that either approach is suboptimal for performance or cost. We
then present HCloud, a hybrid provisioning system that uses both
reserved and on-demand resources. HCloud determines which jobs should
be mapped to reserved versus on-demand resources based on overall
load, and resource unpredictability. It also determines the optimal
instance size an application needs to satisfy its Quality of Service
(QoS) constraints. We demonstrate that hybrid configurations improve
performance by 2.1x compared to fully on-demand provisioning, and
reduce cost by 46% compared to fully reserved systems. We also show that
hybrid strategies are robust to variation in system and job parameters,
such as cost and system load.
-
Hey, You, Get Off of My Cloud: Exploring Information Leakage
in Third-Party Compute Clouds,
Ristenpart, Thomas and Tromer, Eran and Shacham, Hovav and Savage, Stefan,
Proc. 16th ACM Conf. on Computer and Communications Security (CCS'09),
2009, pp. 199-212
(pdf)
-
Third-party cloud computing represents the promise of outsourcing
as applied to computation. Services, such as Microsoft's Azure and
Amazon's EC2, allow users to instantiate virtual machines (VMs) on
demand and thus purchase precisely the capacity they require when they
require it. In turn, the use of virtualization allows third-party
cloud providers to maximize the utilization of their sunk capital
costs by multiplexing many customer VMs across a shared physical
infrastructure. However, in this paper, we show that this approach can
also introduce new vulnerabilities. Using the Amazon EC2 service as
a case study, we show that it is possible to map the internal cloud
infrastructure, identify where a particular target VM is likely to
reside, and then instantiate new VMs until one is placed co-resident
with the target. We explore how such placement can then be used to
mount cross-VM side-channel attacks to extract information from a
target VM on the same machine.
-
Process-level virtualization:
Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters.
Hargrove, P. H., and Duell, J. C. (2006).
In Journal of Physics: Conference Series (Vol. 46, No. 1, p. 494). IOP Publishing.
(pdf)
(Note that this was perhaps the first really successful
checkpointing system in its own right. While others attempted
checkpointing through kernel modification, this kernel module-based
approach was the first one that was widely used. I exclude the Condor-based
checkpointing that was used primarily as part of Condor itself.
One could argue that this is more operating system-level virtualization,
as opposed to process-level virtualization, but it is convenient to
keep some of the checkpointing-related papers together.)
-
This article describes the motivation, design and implementation of
Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart
implementation for Linux clusters that targets the space of typical High
Performance Computing applications, including MPI. Application-level
solutions, including both checkpointing and fault-tolerant algorithms, are
recognized as more time and space efficient than system-level checkpoints,
which cannot make use of any application-specific knowledge. However,
system-level checkpointing allows for preemption, making it suitable
for responding to "fault precursors" (for instance, elevated error
rates from ECC memory or network CRCs, or elevated temperature from
sensors). Preemption can also increase the efficiency of batch scheduling;
for instance reducing idle cycles (by allowing for shutdown without any
queue draining period or reallocation of resources to eliminate idle nodes
when better fitting jobs are queued), and reducing the average queued
time (by limiting large jobs to running during off-peak hours, without
the need to limit the length of such jobs). Each of these potential
uses makes BLCR a valuable tool for efficient resource management in
Linux clusters.
-
Design and implementation for checkpointing of distributed resources using process-level virtualization.
Arya, Kapil, Rohan Garg, Artem Y. Polyakov, and Gene Cooperman,
(2016, September).
In Proc. of the 2016 IEEE International Conference on Cluster Computing
(CLUSTER'16), pp. 402-412. IEEE.
(pdf)
(Note: this article proposes process-level virtualization, in contrast
to machine-level, language-level, container-based, and library-OS-based
virtualization. Fair warning: this comes from my own research group,
and so I may be biased.)
-
System-level checkpoint-restart is a critical technology for
long-running jobs in high-performance computing. Yet, only two
approaches to checkpointing MPI applications continue to survive in
wide use today. One approach is to use the kernel module-based BLCR
in combination with an MPI checkpoint-restart service particular to
the MPI implementation in use. Unfortunately, this lacks support for
some important Linux system services such as SysV IPC (e.g., shared
memory objects). A second approach has been to use the original 2009
DMTCP implementation (herein referred to as DMTCP-09) for transparent,
system-level checkpointing. Unfortunately, DMTCP-09 lacked support for
checkpointing many of the necessary features found by MPI in a modern
batch environment. These include: ssh, the InfiniBand network, process
migration (restarting an MPI application on different cluster nodes),
and modified file path prefixes on restart (typically due to a changing
current directory, mount points, library paths, etc.). This work presents
DMTCP-PV, a new user-space transparent checkpointing system based on the
concept of process virtualization. This approach separately models the
state of each local or distributed subsystem while decoupling it from
the core checkpointing engine. By separating these concerns, a domain
expert can extend checkpointing into a new domain without any knowledge
of the core checkpointing engine. This allowed DMTCP-PV to address the
deficiencies noted above and many others. It is shown that the runtime
overhead of DMTCP-PV is generally less than 1%, and the checkpointing
time is dominated by the time to write an image file to stable storage.
-
Distributed Speculative Parallelization using Checkpoint Restart,
Devarshi Ghoshal, Sreesudhan R. Ramkumar, and Arun Chauhan,
Procedia Computer Science, 4, pp. 422--431,
May, 2011
(pdf)
(Note: This is still one of my favorite applications of checkpointing.
It combines ideas of software transactions, speculation, and checkpointing
in a really cute way.)
-
Abstract:
Speculative software parallelism has gained renewed interest recently
as a mechanism to leverage multiple cores on emerging architectures. Two
major mechanisms have been used to implement speculation-based parallelism
in software, software transactional memory and speculative threads. We
propose a third mechanism based on checkpoint restart. With recent
developments in checkpoint restart technology this has become an
attractive alternative. The approach has the potential advantage of
the conceptual simplicity of transactional memory and flexibility of
speculative threads. Since many checkpoint restart systems work with
large distributed memory programs, this provides an automatic way to
perform distributed speculation over clusters. Additionally, since
checkpoint restart systems are primarily designed for fault tolerance,
using the same system for speculation could provide fault tolerance
within speculative execution as well when it is embedded in large-scale
applications where fault tolerance is desirable. In this paper we use
a series of micro-benchmarks to study the relative performance of a
speculative system based on the DMTCP checkpoint restart system and
compare it against a thread level speculative system. We highlight the
relative merits of each approach and draw some lessons that could be
used to guide future developments in speculative systems.
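(Note: The core of the approach is to treat a checkpoint as the rollback
point of a transaction. The toy sketch below substitutes fork() for a real
checkpointing system, purely for illustration; the paper itself uses DMTCP,
which also supports distributed-memory programs. The child executes the
speculative work, and discarding the child is the rollback.)

  /* speculate.c -- rollback-via-snapshot in miniature. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
      pid_t pid = fork();           /* "checkpoint": copy-on-write snapshot */
      if (pid == 0) {
          int guess_was_right = 0;  /* pretend the speculation failed */
          exit(guess_was_right ? 0 : 1);
      }
      int status;
      waitpid(pid, &status, 0);
      if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
          printf("speculation committed\n");
      else
          printf("speculation aborted; falling back to a non-speculative run\n");
      return 0;
  }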
-
DTHREADS: Efficient Deterministic Multithreading
Liu, T., Curtsinger, C., and Berger, E. D. (2011, October).
In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (pp. 327-336). ACM.
(pdf)
(Note: This is a really clever idea. A multi-threaded program is
virtualized as a program over multiple processes, using shared memory.
This has a similar philosophy to that of process-level virtualization,
in the spirit of the paper by Arya et al. But one might alternatively
argue that this is language-level virtualization. Compare this approach
to "determinizing" multi-threaded code with the "Pinplay" paper, below,
in the language-level subsection.)
-
Multithreaded programming is notoriously difficult to get right. A key
problem is non-determinism, which complicates debugging, testing, and
reproducing errors. One way to simplify multithreaded programming is
to enforce deterministic execution, but current deterministic systems
for C/C++ are incomplete or impractical. These systems require program
modification, do not ensure determinism in the presence of data races,
do not work with general-purpose multithreaded programs, or run up to
8.4× slower than pthreads.
This paper presents Dthreads, an efficient deterministic multithreading
system for unmodified C/C++ applications that replaces the pthreads
library. Dthreads enforces determinism in the face of data races and
deadlocks. Dthreads works by exploding multithreaded applications into
multiple processes, with private, copy-on-write mappings to shared
memory. It uses standard virtual memory protection to track writes,
and deterministically orders updates by each thread. By separating
updates from different threads, Dthreads has the additional benefit
of eliminating false sharing. Experimental results show that Dthreads
substantially outperforms a state-of-the-art deterministic runtime
system, and for a majority of the benchmarks evaluated here, matches
and occasionally exceeds the performance of pthreads.
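(Note: A minimal sketch of the substrate that "threads as processes"
builds on: a shared mapping plus fork. Dthreads itself goes further,
giving each process a private copy-on-write view and committing writes
deterministically at synchronization points; none of that machinery is
shown here.)

  /* procs_as_threads.c -- two processes sharing one memory word. */
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
      int *counter = mmap(NULL, sizeof *counter, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      *counter = 0;
      if (fork() == 0) {     /* the child plays the role of a second thread */
          *counter += 1;
          _exit(0);
      }
      wait(NULL);            /* deterministic ordering: parent waits first */
      *counter += 1;
      printf("counter = %d\n", *counter);    /* always prints 2 */
      munmap(counter, sizeof *counter);
      return 0;
  }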
-
Language-level virtualization:
A few billion lines of code later: using static analysis to find bugs in the real world.
Bessey, Al, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem, Charles Henri-Gros, Asya Kamsky, Scott McPeak, and Dawson Engler.
Communications of the ACM 53, no. 2 (2010): 66-75.
(pdf)
(Note: Even though it's about static analysis, there is a flavor of
language-level virtualization here. If you decide to report on this
paper, you should report jointly on this paper and on the paper
below: "Bugs as Deviant Behavior".)
-
How Coverity built a bug-finding tool, and a business, around the unlimited supply of bugs in software systems.
In 2002, Coverity commercialized a research static bug-finding tool.
Not surprisingly, as academics, our view of commercial realities was
not perfectly accurate. However, the problems we encountered were not
the obvious ones. ...
-
Bugs as Deviant Behavior: A General Approach to Inferring Errors
in Systems Code
Engler, D., Chen, D. Y., Hallem, S., Chou, A., and Chelf, B. (2001, October).
In ACM SIGOPS Operating Systems Review (Vol. 35, No. 5, pp. 57-72). ACM.
(pdf)
(Note: If you report on this paper, you should jointly report on this
and the paper above: "A Few Billion Lines of Code Later".)
-
A major obstacle to finding program errors in a real system is knowing
what correctness rules the system must obey. These rules are often
undocumented or specified in an ad hoc manner. This paper demonstrates
techniques that automatically extract such checking information from
the source code itself, rather than the programmer, thereby avoiding
the need for a priori knowledge of system rules. The cornerstone of our
approach is inferring programmer "beliefs" that we then cross-check for
contradictions. Beliefs are facts implied by code: a dereference of a
pointer, p, implies a belief that p is non-null, a call to "unlock(l)"
implies that l was locked, etc. For beliefs we know the programmer
must hold, such as the pointer dereference above, we immediately flag
contradictions as errors. For beliefs that the programmer may hold,
we can assume these beliefs hold and use a statistical analysis to
rank the resulting errors from most to least likely. For example, a
call to "spin_lock" followed once by a call to "spin_unlock" implies
that the programmer may have paired these calls by coincidence. If the
pairing happens 999 out of 1000 times, though, then it is probably a
valid belief and the sole deviation a probable error. The key feature of
this approach is that it requires no a priori knowledge of truth: if two
beliefs contradict, we know that one is an error without knowing what the
correct belief is. Conceptually, our checkers extract beliefs by tailoring
rule "templates" to a system --- for example, finding all functions that
fit the rule template "a must be paired with b." We have developed six
checkers that follow this conceptual framework. They find hundreds of
bugs in real systems such as Linux and OpenBSD. From our experience,
they give a dramatic reduction in the manual effort needed to check a
large system. Compared to our previous work, these template checkers
find ten to one hundred times more rule instances and derive properties
we found impractical to specify manually.
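(Note: The statistical ranking is easy to make concrete. The sketch below
scores a "may" belief with a one-sided z-test against a coincidence
probability p0; take the exact formula as an assumption for illustration
rather than as the paper's precise statistic.)

  /* rank_belief.c -- score "a must be paired with b" candidates.
     Build: gcc -o rank_belief rank_belief.c -lm */
  #include <math.h>
  #include <stdio.h>

  /* Rule observed to hold at 'hits' of 'n' sites; p0 is the probability
     that the pairing would hold at a site purely by coincidence. */
  static double rank_belief(int hits, int n, double p0) {
      double phat = (double)hits / n;
      return (phat - p0) / sqrt(p0 * (1.0 - p0) / n);
  }

  int main(void) {
      /* High z: likely a real rule; the lone deviation is a probable bug. */
      printf("999/1000 pairings: z = %5.1f\n", rank_belief(999, 1000, 0.9));
      /* Low z: plausibly coincidence; rank this candidate near the bottom. */
      printf("  2/3   pairings: z = %5.1f\n", rank_belief(2, 3, 0.9));
      return 0;
  }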
-
Transactional rollback for language-based systems
Rudys, A., and Wallach, D. S. (2002).
In Proceedings of the International Conference on Dependable Systems
and Networks (DSN 2002), pp. 439-448. IEEE.
(pdf)
(Note: this is a nice example of language-level virtualization:
it speculates on the results of codelets (small pieces of code).)
-
Language run-time systems are routinely used to host potentially buggy
or malicious codelets (software modules, agents, applets, etc.) in a
secure environment. A number of techniques exist for managing access
control to system services and even for terminating codelets once they
have been determined to be misbehaving. However, because codelets can be
terminated anywhere in their execution, a codelet's internal state might
become inconsistent; restarting the codelet could result in unexpected
behavior. Any state the codelet shares with other codelets may likewise
become inconsistent, destabilizing those codelets as well. To address
these problems, we have designed a mechanism, strictly using code-to-code
transformations, which provides transactional rollback support for
codelets. Each instance of a codelet is run in its own transaction, and
standard (ACID) transactional semantics apply. All changes made by the
codelet are automatically rolled back when the corresponding transaction
aborts. We discuss a transactional rollback implementation for Java,
and present its performance.
-
Rx: Treating Bugs As Allergies -- A Safe Method to Survive Software Failures
Qin, F., Tucek, J., Sundaresan, J., and Zhou, Y. (2005, October).
In ACM SIGOPS Operating Systems Review (Vol. 39, No. 5, pp. 235-248). ACM.
(pdf)
(Note: Here is a cute idea using speculative execution. On a first
reading, be sure to read Section 3.3 especially carefully:
the five mechanisms for semi-automatically fixing bugs.)
-
Many applications demand availability. Unfortunately, software failures
greatly reduce system availability. Prior work on surviving software
failures suffers from one or more of the following limitations: Required
application restructuring, inability to address deterministic software
bugs, unsafe speculation on program execution, and long recovery time. This
paper proposes an innovative safe technique, called Rx, which can quickly
recover programs from many types of software bugs, both deterministic
and non-deterministic. Our idea, inspired from allergy treatment in real
life, is to rollback the program to a recent checkpoint upon a software
failure, and then to re-execute the program in a modified environment. We
base this idea on the observation that many bugs are correlated with
the execution environment, and therefore can be avoided by removing the
"allergen" from the environment. Rx requires few to no modifications to
applications and provides programmers with additional feedback for bug
diagnosis. We have implemented Rx on Linux. Our experiments with four
server applications that contain six bugs of various types show that
Rx can survive all the six software failures and provide transparent
fast recovery within 0.017-0.16 seconds, 21-53 times faster than the
whole program restart approach for all but one case (CVS). In contrast,
the two tested alternatives, a whole program restart approach and
a simple rollback and re-execution without environmental changes,
cannot successfully recover the three servers (Squid, Apache, and CVS)
that contain deterministic bugs, and have only a 40% recovery rate
for the server (MySQL) that contains a non-deterministic concurrency
bug. Additionally, Rx's checkpointing system is lightweight, imposing
small time and space overheads.
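(Note: One of the Section 3.3 mechanisms is padding memory allocations
during re-execution, so that a small overrun no longer corrupts a
neighboring object. The toy wrapper below illustrates only that single
mechanism; the function name and padding size are invented, and Rx applies
such changes only after a failure, under checkpoint and rollback.)

  /* pad_alloc.c -- surviving a small overrun via allocation padding. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define PAD 64    /* slack added to each object during re-execution */

  static void *padded_malloc(size_t n) {
      return malloc(n + PAD);
  }

  int main(void) {
      char *buf = padded_malloc(8);
      /* strcpy writes 9 bytes (including the NUL): an overrun of a plain
         malloc(8) block that could corrupt the heap now lands in the slack. */
      strcpy(buf, "12345678");
      printf("%s\n", buf);
      free(buf);
      return 0;
  }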
-
Pinplay: a framework for deterministic replay and reproducible analysis of parallel programs.
Patil, H., Pereira, C., Stallcup, M., Lueck, G., and Cownie, J. (2010, April).
In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization (pp. 2-11). ACM.
(pdf)
(Note: This is an excellent example of a research direction that argues
that debugging multi-threaded programs is harder than debugging sequential
programs, because of race conditions involving at least two distinct
points in a program. They then argue that deterministic replay
is important for capturing such bugs/race conditions for analysis and
debugging. If you report on this work, please compare it to the
more recent work on Castor (Default-On Multi-Core Record/Replay), below.
Optionally, you may also want to compare this approach to
"determinizing" multi-threaded code with the "Dthreads" paper, above.)
-
Analysis of parallel programs is hard mainly because their behavior
changes from run to run. We present an execution capture and
deterministic replay system that enables repeatable analysis of parallel
programs. Our goal is to provide an easy-to-use framework for capturing,
deterministically replaying, and analyzing execution of large programs
with reasonable runtime and disk usage. Our system, called PinPlay,
is based on the popular Pin dynamic instrumentation system, and hence is
very easy to use. PinPlay extends the capability of Pin-based analysis
by providing a tool for capturing one execution instance of a program
(as log files called pinballs) and by allowing Pin-based tools to run
off the captured execution. Most Pintools can be trivially modified to
work off pinballs thus doing their usual analysis but with a guaranteed
repeatability. Furthermore, the capture/replay works across operating
systems (Windows to Linux) as the pinball format is independent of the
operating system. We have used PinPlay to analyze and deterministically
debug large parallel programs running trillions of instructions. This
paper describes the design of PinPlay and its applications for analyses
such as simulation point selection, tracing, and debugging.
-
Towards Practical Default-On Multi-Core Record/Replay
Ali José Mashtizadeh, Tal Garfinkel, David Terei, David Mazières,
Mendel Rosenblum
ASPLOS 2017.
(pdf)
-
We present Castor, a record/replay system for multi-core applications that provides consistently low and predictable overheads. With Castor, developers can leave record and replay on by default, making it practical to record and reproduce production bugs, or employ fault tolerance to recover from hardware failures.
Castor is inspired by several observations: First, an efficient mechanism for logging non-deterministic events is critical for recording demanding workloads with low overhead. Through careful use of hardware we were able to increase log throughput by 10x or more, e.g., we could record a server handling 10x more requests per second for the same record overhead. Second, most applications can be recorded without modifying source code by using the compiler to instrument language level sources of non-determinism, in conjunction with more familiar techniques like shared library interposition. Third, while Castor cannot deterministically replay all data races, this limitation is generally unimportant in practice, contrary to what prior work has assumed.
Castor currently supports applications written in C, C++, and Go on FreeBSD. We have evaluated Castor on parallel and server workloads, including a commercial implementation of memcached in Go, which runs Castor in production.
-
Machine Learning and Parallel Computing:
(The conference, Machine Learning and Systems, talks about
_systems_. Not all of those are _computer systems_
papers. I've tried to point out some of the interesting papers in
_computer systems_.)
-
Towards Federated Learning at Scale: System Design
Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, Jason Roselander, 2019.
In Proc. of Machine Learning and Systems 1.
(pdf)
-
Federated Learning is a distributed machine learning approach which enables model training on a large corpus of decentralized data. We have built a scalable production system for Federated Learning in the domain of mobile devices, based on TensorFlow. In this paper, we describe the resulting high-level design, sketch some of the challenges and their solutions, and touch upon the open problems and future directions.
-
TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, Roy Campbell, 2019.
In Proc. of Machine Learning and Systems 1.
(pdf)
-
State-of-the-art deep learning systems rely on iterative distributed training to tackle the increasing complexity of models and input data. In this work, we identify an opportunity for accelerating distributed DNN training in systems that rely on graph representation for computation, such as TensorFlow and PyTorch, through communication scheduling. We develop a system, TicTac, that reduces the iteration time by identifying and enforcing parameter transfers in the order in which the parameters are consumed by the underlying computational model, thereby guaranteeing near-optimal overlap of communication and computation. Our system is implemented over TensorFlow and enforces the optimal ordering by prioritization of parameter transfers at the Parameter Server in data parallel training. TicTac requires no changes to the model or developer inputs and improves the throughput by up to 37.7 % in inference and 19.2 % in training, while also reducing straggler effect by up to 2.3x. Our code is publicly available.
-
Bandana: Using Non-Volatile Memory for Storing Deep Learning Models
Assaf Eisenman, Maxim Naumov, Darryl Gardner, Misha Smelyanskiy, Sergey Pupyrev, Kim Hazelwood, Asaf Cidon, Sachin Katti, 2019.
In Proc. of Machine Learning and Systems 1.
(pdf)
-
Typical large-scale recommender systems use deep learning models that are stored on a large amount of DRAM. These models often rely on embeddings, which consume most of the required memory. We present Bandana, a storage system that reduces the DRAM footprint of embeddings, by using Non-volatile Memory (NVM) as the primary storage medium, with a small amount of DRAM as cache. The main challenge in storing embeddings on NVM is its limited read bandwidth compared to DRAM. Bandana uses two primary techniques to address this limitation: first, it stores embedding vectors that are likely to be read together in the same physical location, using hypergraph partitioning, and second, it decides the number of embedding vectors to cache in DRAM by simulating dozens of small caches. These techniques allow Bandana to increase the effective read bandwidth of NVM by 2-3× and thereby significantly reduce the total cost of ownership.
-
Fine-Grained GPU Sharing Primitives for Deep Learning Applications
Peifeng Yu, Mosharaf Chowdhury, 2020.
In Proc. of Machine Learning and Systems 2.
(pdf)
-
Unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time-sharing and preemption are expensive. Worse, when a deep learning (DL) application cannot completely use a GPU’s resources, the GPU cannot be efficiently shared between multiple applications, leading to GPU underutilization. We present Salus to enable two GPU sharing primitives: fast job switching and memory sharing, to achieve fine-grained GPU sharing among multiple DL applications. Salus is an efficient, consolidated execution service that exposes the GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues. We show that these primitives can then be used to implement flexible sharing policies for various use cases. Our integration of Salus with TensorFlow and evaluation on popular DL jobs shows that Salus can improve the average completion time of DL training jobs by 3.19×, GPU utilization for hyper-parameter tuning by 2.38×, and GPU utilization of DL inference applications by 42× over not sharing the GPU and 7× over NVIDIA MPS with small overhead.
-
CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM
Twinkle Jain, Gene Cooperman, 2020.
In Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC'20)
(pdf) or (alternate pdf)
(Note: Normally, I would not include one of my own papers, for greater
objectivity. But this paper is relevant to the preceding paper
on fine-grained GPU-sharing primitives.)
-
The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues to grow. While fault tolerance is a critical issue for supercomputing, there does not currently exist an efficient, scalable solution for CUDA applications on NVIDIA GPUs. CRAC (Checkpoint-Restart Architecture for CUDA) is a new checkpoint-restart solution for fault tolerance that supports the full range of CUDA applications. CRAC combines: low runtime overhead (approximately 1% or less); fast checkpoint-restart; support for scalable CUDA streams (for efficient usage of all of the thousands of GPU cores); and support for the full features of Unified Virtual Memory (eliminating the programmer’s burden of migrating memory between device and host). CRAC achieves its flexible architecture by segregating application code (checkpointed) and its external GPU communication via non-reentrant CUDA libraries (not checkpointed) within a single process’s memory. This eliminates the high overhead of inter-process communication in earlier approaches, and has fewer limitations.
-
BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
Shang Wang, Yifan Bai, Gennady Pekhimenko, 2020.
In Proc. of Machine Learning and Systems 2.
(pdf)
(Note: This is a surprising result, since the general wisdom is that
back-propagation is inherently sequential across the layers of a model.
The authors instead reformulate back-propagation as a scan operation;
see the Blelloch paper, below.)
-
In an era when the performance of a single compute device plateaus, software must be designed to scale on a massively parallel system for better runtime performance. However, in the context of training deep learning models, the commonly used back-propagation (BP) algorithm imposes a strong sequential dependency in the process of gradient computation. Under model parallelism, BP has a theoretical step complexity of Theta(n) which hinders its scalability in a parallel computing environment, where n represents the number of compute devices into which a model is partitioned.
Scan is a primitive operation that performs an in-order aggregation on a sequence of values and returns the partial result at each step. Parallel algorithms (e.g., Blelloch scan) have been developed to scale the scan operation on massively parallel systems. In this work, in order to improve the scalability of BP, we reformulate BP into a scan operation which is then scaled by our modified version of the Blelloch scan algorithm with a theoretical step complexity of Theta(log n). We evaluate our approach on a vanilla Recurrent Neural Network training with synthetic datasets, and demonstrate up to 2.75x speedup in terms of the overall training time and 8.8x speedup on the backward pass alone.
-
Scans as Primitive Parallel Operations
G.E. Blelloch, 1989.
IEEE Trans. on Computers.
(pdf)
(Note: This is actually a paper from Theory or Algorithms. But it
can be included if you choose the previous paper, on BPPSA. In this case,
you will be reading this paper critically to decide what part of
the theoretical algorithm is not as scalable as stated when one
considers issues of computer systems.)
-
A study of the effects of adding two scan primitives as unit-time primitives to PRAM (parallel random access machine) models is presented. It is shown that the primitives improve the asymptotic running time of many algorithms by an O(log n) factor, greatly simplifying the description of many algorithms, and are significantly easier to implement than memory references. It is argued that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. The author describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design: a radix-sort algorithm, a quicksort algorithm, a minimum-spanning-tree algorithm, a line-drawing algorithm, and a merging algorithm. These all run on an EREW (exclusive read, exclusive write) PRAM with the addition of two scan primitives and are either simpler or more efficient than their pure PRAM counterparts. The scan primitives have been implemented in microcode on the Connection Machine system, and are available in PARIS (the parallel instruction set of the machine).
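(Note: The scan primitive is simple enough to state in full. Below is a
sequential C rendering of Blelloch's work-efficient exclusive scan, an
up-sweep followed by a down-sweep; on a PRAM each inner loop's iterations
run in parallel, which is where the O(log n) step complexity comes from.
The sketch assumes n is a power of two.)

  /* blelloch_scan.c -- exclusive prefix sum, up-sweep/down-sweep form. */
  #include <stdio.h>

  static void blelloch_scan(int *a, int n) {   /* n: a power of two */
      for (int d = 1; d < n; d *= 2)           /* up-sweep (parallel reduce) */
          for (int i = 2 * d - 1; i < n; i += 2 * d)
              a[i] += a[i - d];
      a[n - 1] = 0;                            /* clear the root */
      for (int d = n / 2; d >= 1; d /= 2)      /* down-sweep */
          for (int i = 2 * d - 1; i < n; i += 2 * d) {
              int t = a[i - d];
              a[i - d] = a[i];                 /* pass partial sum left */
              a[i] += t;                       /* combine on the right */
          }
  }

  int main(void) {
      int a[8] = {3, 1, 7, 0, 4, 1, 6, 3};
      blelloch_scan(a, 8);
      for (int i = 0; i < 8; i++)
          printf("%d ", a[i]);                 /* 0 3 4 11 11 15 16 22 */
      printf("\n");
      return 0;
  }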
-
SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems
Beidi Chen, Tharun Medini, James Farwell, Sameh Gobriel, Charlie Tai, Anshumali Shrivastava, 2020.
In Machine Learning and Systems 2, Proc. of.
(pdf)
(Note: This is a surprising result, since the general wisdom is that
deep learning frameworks were the killer apps for GPUs. But the
authors argue that there are other systems techniques on CPUs
that make them competitive with GPUs.)
-
Deep Learning (DL) algorithms are the central focus of modern machine learning systems. As data volumes keep growing, it has become customary to train large neural networks with hundreds of millions of parameters to maintain enough capacity to memorize these volumes and obtain state-of-the-art accuracy. To get around the costly computations associated with large models and data, the community is increasingly investing in specialized hardware for model training. However, specialized hardware is expensive and hard to generalize to a multitude of tasks. The progress on the algorithmic front has failed to demonstrate a direct advantage over powerful hardware such as NVIDIA-V100 GPUs. This paper provides an exception. We propose SLIDE (Sub-LInear Deep learning Engine) that uniquely blends smart randomized algorithms, with multi-core parallelism and workload optimization. Using just a CPU, SLIDE drastically reduces the computations during both training and inference outperforming an optimized implementation of Tensorflow (TF) on the best available GPU. Our evaluations on industry-scale recommendation datasets, with large fully connected architectures, show that training with SLIDE on a 44 core CPU is more than 3.5 times (1 hour vs. 3.5 hours) faster than the same network trained using TF on Tesla V100 at any given accuracy level. On the same CPU hardware, SLIDE is over 10x faster than TF. We provide codes and scripts for reproducibility.
-
Exploring the limits of Concurrency in ML Training on Google TPUs
S Kumar, Y Wang, C Young, 2021.
In Machine Learning and Systems 3, Proc. of.
(pdf)
(Note: This is from an early version of the proceedings.)
-
Recent results in language understanding using neural networks have required training hardware of unprecedented scale, with thousands of chips cooperating on a single training run. This paper presents techniques to scale ML models on the Google TPU Multipod, a mesh with 4096 TPU-v3 chips. We discuss model parallelism to overcome scaling limitations from the fixed batch size in data parallelism, communication/collective optimizations, distributed evaluation of training metrics, and host input processing scaling optimizations. These techniques are demonstrated in both the TensorFlow and JAX programming frameworks. We also present performance results from the recent Google submission to the MLPerf-v0.7 benchmark contest, achieving record training times from 16 to 28 seconds in four MLPerf models on the Google TPU-v3 Multipod machine.
-
Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More
Shabnam Daghaghi, Nicholas Meisburger, Mengnan Zhao, Anshumali Shrivastava, 2021.
In Machine Learning and Systems 3, Proc. of.
(pdf)
(Note: This is from an early version of the proceedings.)
-
Deep learning implementations on CPUs (Central Processing Units) are gaining more traction. Enhanced AI capabilities on commodity x86 architectures are commercially appealing due to the reuse of existing hardware and the ease of virtualization. A notable work in this direction is the SLIDE system. SLIDE is a C++ implementation of sparse, hash-table-based back-propagation, which was shown to be significantly faster than GPUs in training neural models with hundreds of millions of parameters. In this paper, we argue that SLIDE's current implementation is sub-optimal and does not exploit several opportunities available in modern CPUs. In particular, we show how SLIDE's computations allow for a unique possibility of vectorization via AVX-512 (Advanced Vector Extensions). Furthermore, we highlight opportunities for different kinds of memory optimization and quantizations. Combining all of them, we obtain up to 7x speedup in the computations on the same hardware. Our experiments are focused on large (hundreds of millions of parameters) recommendation and NLP models. Our work highlights several novel perspectives and opportunities for implementing randomized algorithms for deep learning on modern CPUs.
-
Parallel Computing Experiences with CUDA
-
The CUDA programming model provides a straightforward means of
describing inherently parallel computations, and NVIDIA's Tesla GPU
architecture delivers high computational throughput on massively parallel
problems. This article surveys experiences gained in applying CUDA to
a diverse set of problems and the parallel speedups over sequential
codes running on traditional CPU architectures attained by executing
key computations on the GPU.
-
Model Checking and Parallelism:
-
Parallel state space construction for model-checking
-
The verification of concurrent finite-state systems by model checking often requires generating (a large part of) the state space of the system under analysis. Because of the state explosion problem, this may be a resource-consuming operation, both in terms of memory and CPU time. In this paper, we aim at improving the performance of state space construction by using parallelization techniques. We present parallel algorithms for constructing state spaces (or Labeled Transition Systems) on a network or a cluster of workstations. Each node in the network builds a part of the state space, all parts being merged to form the whole state space upon termination of the parallel computation. These algorithms have been implemented within the CADP verification tool set and evaluated on various concurrent applications specified in LOTOS. The results obtained show close to ideal speedups and a good load balancing between network nodes.
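(Note: The partitioning idea is easy to see in miniature. In the toy
Python sketch below, with an invented transition relation, a static
hash function assigns each state an owning worker; successors are
forwarded to their owner's queue, so each state is stored and expanded
exactly once. This illustrates the general technique, not CADP itself.)
    from collections import deque

    N_WORKERS = 4

    def successors(state):
        # Invented toy transition relation: +1 or *2, bounded at 100.
        return [s for s in (state + 1, state * 2) if s <= 100]

    def owner(state):
        return hash(state) % N_WORKERS

    queues = [deque() for _ in range(N_WORKERS)]
    visited = [set() for _ in range(N_WORKERS)]   # each worker's partition
    queues[owner(0)].append(0)                    # initial state

    # Round-robin simulation of the workers; a real implementation runs
    # these loops concurrently, with the queues as inter-node messages.
    while any(queues):
        for w in range(N_WORKERS):
            while queues[w]:
                s = queues[w].popleft()
                if s in visited[w]:
                    continue
                visited[w].add(s)
                for t in successors(s):
                    queues[owner(t)].append(t)

    print("states explored:", sum(len(v) for v in visited))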
-
Shared Hash Tables in Parallel Model Checking
-
In light of the recent shift towards shared-memory systems in parallel explicit model checking, we explore the relative advantages and disadvantages of shared versus private hash tables. Since the use of shared state storage allows for techniques unavailable in distributed memory, these are evaluated, both theoretically and practically, in a prototype implementation. Experimental data is presented to assess the practical utility of these techniques compared to the static partitioning of the state space that is more traditional in distributed-memory algorithms.
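(Note: The shared-table alternative is also easy to sketch. In the toy
Python fragment below, with invented names, all threads share one
visited set, striped across several locks so that different buckets can
be updated concurrently; any thread may then expand any successor it
generates, with no cross-thread forwarding of states.)
    import threading

    N_STRIPES = 64
    stripes = [set() for _ in range(N_STRIPES)]
    locks = [threading.Lock() for _ in range(N_STRIPES)]

    def try_insert(state):
        """Return True iff `state` is new; safe to call from many threads."""
        k = hash(state) % N_STRIPES
        with locks[k]:
            if state in stripes[k]:
                return False
            stripes[k].add(state)
            return True

    print(try_insert("s0"))   # True: first time this state is seen
    print(try_insert("s0"))   # False: duplicate detected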
-
Parallel Model Checking Using Abstraction
-
Many model checking techniques are based on enumerative graph search, a procedure that is known to be prohibitively time and memory consuming. Modern multi-core processors rely on parallelism instead of raw clock speed to provide increased performance, so it is necessary to leverage this parallelism to achieve better performance in model checking. In this work, we compare hash-distributed search, a well-known parallel search technique for model checking, with an algorithm from the automated planning and heuristic search community called Parallel Structured Duplicate Detection (PSDD). We show that PSDD has two major advantages over hash-distributed search for multi-core model checking. First, PSDD is able to perform full partial-order reduction where hash-distributed search must be conservative and consequently miss reduction opportunities in many cases, causing it to search a much larger space. Second, PSDD performs duplicate detection on states immediately, avoiding the need to store duplicate states for inter-thread communication. We have implemented and compared both techniques in the Spin model checker; our results show that PSDD uses significantly less memory than hash-distributed search, can be faster and give better parallel speedup than both hash-distributed search and Spin’s built-in parallel depth-first search. Finally, we show how PSDD can use external memory, such as disk storage, to greatly reduce its internal memory requirements.
-
Parallel and distributed bounded model checking of multi-threaded programs
-
We introduce a structure-aware parallel technique for context-bounded analysis of concurrent programs. The key intuition consists in decomposing the set of concurrent traces into symbolic subsets that are separately explored by multiple instances of the same decision procedure running in parallel. The decision procedures work on different partitions of the search space without cooperating, whence distribution follows effortlessly. Our experiments on a selection of complex multi-threaded programs show significant analysis speedups and scalability, and greater performance gains than with general-purpose parallel solvers.
-
Human-Computer Interaction:
-
Some places to look might be the intersection of HCI with home health monitoring, with the use of robots, etc. If you have trouble finding suitable papers, please talk to me directly. - Prof. Cooperman
-
Databases:
Adaptive Partitioning and Indexing for In Situ Query Processing
-
The constant flux of data and queries alike has been pushing the
boundaries of data analysis systems. The increasing size of raw data
files has made data loading an expensive operation that delays the
data-to-insight time. To alleviate the loading cost, in situ query
processing systems operate directly over raw data and offer instant access
to data. At the same time, analytical workloads have an increasing number
of queries. Typically, each query focuses on a constantly shifting, yet
small, range. As a result, minimizing the workload latency requires
the benefits of indexing in in situ query processing. In this paper,
we present an online partitioning and indexing scheme, along with a
partitioning and indexing tuner tailored for in situ querying engines.
The proposed system design improves query execution time by taking into
account user query patterns, to (i) partition raw data files logically and
(ii) build lightweight partition-specific indexes for each partition. We
build an in situ query engine called Slalom to showcase the impact of our
design. Slalom employs adaptive partitioning and builds non-obtrusive
indexes in different partitions on-the-fly based on lightweight query
access pattern monitoring. As a result of its lightweight nature, Slalom
achieves efficient query processing over raw data with minimal memory
consumption. Our experimentation with both microbenchmarks and real-life
workloads shows that Slalom outperforms state-of-the-art in situ engines
and achieves query response times comparable to a fully indexed DBMS,
while offering lower cumulative query execution times for query workloads
of increasing size and unpredictable access patterns.
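(Note: A toy Python sketch of the two ingredients the abstract
describes: logical partitions over raw data, a min/max zone map per
partition as the cheapest index, and a tuner that builds a heavier
per-partition index only once a partition proves hot. The names and the
hit threshold are invented for illustration; this is not Slalom's
actual design.)
    import bisect

    class AdaptivePartition:
        def __init__(self, rows):
            self.rows = rows                         # raw, unsorted data
            self.lo, self.hi = min(rows), max(rows)  # zone map
            self.hits = 0
            self.sorted_rows = None                  # heavier index, built lazily

        def select(self, lo, hi):
            if hi < self.lo or lo > self.hi:
                return []                            # pruned by the zone map
            self.hits += 1
            if self.sorted_rows is None and self.hits > 3:   # invented threshold
                self.sorted_rows = sorted(self.rows)         # index the hot partition
            if self.sorted_rows is not None:
                i = bisect.bisect_left(self.sorted_rows, lo)
                j = bisect.bisect_right(self.sorted_rows, hi)
                return self.sorted_rows[i:j]
            return [r for r in self.rows if lo <= r <= hi]   # raw-data scan

    # A "raw file" split into logical partitions; repeated queries on one
    # range leave only the partition that serves it carrying an index.
    data = list(range(1000, 0, -1))
    parts = [AdaptivePartition(data[i:i + 100]) for i in range(0, 1000, 100)]
    for _ in range(5):
        rows = [r for p in parts for r in p.select(120, 180)]
    print(len(rows), sum(p.sorted_rows is not None for p in parts))  # 61 1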
-
Adaptive Indexing for Relational Keys
-
Adaptive indexing schemes such as database cracking and adaptive
merging have been investigated to date only in the context of
range queries. These are typical for non-key columns in relational
databases. For complete self-managing indexing, adaptive indexing must
also apply to key columns. The present paper proposes a design and offers
a first performance evaluation in the context of keys. Adaptive merging
for keys also enables further improvements in B-tree indexes. First,
partitions can be matched to levels in the memory hierarchy such as
a CPU cache and an in-memory buffer pool. Second, adaptive merging in
merged B-trees enables automatic master-detail clustering.
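(Note: The database-cracking baseline that adaptive merging is compared
against is compact enough to sketch. In the toy Python version below,
each range query physically partitions the column around its bounds as
a side effect, so the column converges toward sorted order and later
queries scan ever smaller pieces. Illustrative only; the paper's
key-column and B-tree machinery is not modeled.)
    import bisect

    class CrackedColumn:
        def __init__(self, values):
            self.col = list(values)
            self.pivots = []   # sorted pivot values
            self.pos = []      # pos[i]: first index of values >= pivots[i]

        def _crack(self, v):
            # Find the piece that v falls into and partition it around v.
            i = bisect.bisect_left(self.pivots, v)
            if i < len(self.pivots) and self.pivots[i] == v:
                return
            start = self.pos[i - 1] if i > 0 else 0
            end = self.pos[i] if i < len(self.pivots) else len(self.col)
            piece = self.col[start:end]
            lt = [x for x in piece if x < v]
            ge = [x for x in piece if x >= v]
            self.col[start:end] = lt + ge
            self.pivots.insert(i, v)
            self.pos.insert(i, start + len(lt))

        def range_query(self, lo, hi):   # values in [lo, hi)
            self._crack(lo)
            self._crack(hi)
            i = self.pivots.index(lo)
            j = self.pivots.index(hi)
            return self.col[self.pos[i]:self.pos[j]]

    c = CrackedColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3])
    print(sorted(c.range_query(4, 13)))   # [4, 7, 9, 12]
    print(c.col)                          # now physically cracked at 4 and 13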
-
Concurrency Control for Adaptive Indexing
-
Adaptive indexing initializes and optimizes indexes incrementally, as a
side effect of query processing. The goal is to achieve the benefits of
indexes while hiding or minimizing the costs of index creation. However,
index-optimizing side effects seem to turn read-only queries into update
transactions that might, for example, create lock contention.
This paper studies concurrency control in the context of adaptive
indexing. We show that the design and implementation of adaptive indexing
rigorously separates index structures from index contents; this relaxes
the constraints and requirements during adaptive indexing compared to
those of traditional index updates. Our design adapts to the fact that
an adaptive index is refined continuously, and exploits any concurrency
opportunities in a dynamic way.
A detailed experimental analysis demonstrates that (a) adaptive indexing
maintains its adaptive properties even when running concurrent queries,
(b) adaptive indexing can exploit the opportunity for parallelism due
to concurrent queries, (c) the number of concurrency conflicts and
any concurrency administration overheads follow an adaptive behavior,
decreasing as the workload evolves and adapting to the workload needs.
-
Vector Clocks for Distributed Systems:
For those who like algorithms and mathematics, you'll find that flavor
in this topic of vector clocks.
An efficient implementation of vector clocks
(or also found here)
-
The system of vector clocks is an essential tool for designing distributed algorithms and reasoning about them. We present an efficient implementation of vector clocks that reduces the size of the timestamp-related information to be transferred in a message. The implementation assumes FIFO message delivery and is resilient to changes in the topology of the distributed system.
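(Note: A small Python sketch of the idea, with invented class names: a
standard vector clock plus the differential technique the abstract
describes, in which a process sends only the entries that changed since
its last message to the same destination; FIFO delivery is what makes
the received deltas safe to merge.)
    class Process:
        def __init__(self, pid, n):
            self.pid, self.clock = pid, [0] * n
            self.last_sent = {}        # dest -> copy of clock at last send

        def local_event(self):
            self.clock[self.pid] += 1

        def send(self, dest):
            self.clock[self.pid] += 1
            prev = self.last_sent.get(dest, [0] * len(self.clock))
            delta = {i: v for i, v in enumerate(self.clock) if v != prev[i]}
            self.last_sent[dest] = list(self.clock)
            return delta               # only changed entries go on the wire

        def receive(self, delta):
            self.clock[self.pid] += 1
            for i, v in delta.items():
                self.clock[i] = max(self.clock[i], v)

    # Example: p0 sends to p1 twice; the second message carries one entry.
    p0, p1 = Process(0, 3), Process(1, 3)
    p1.receive(p0.send(1))
    p0.local_event()
    m2 = p0.send(1)
    p1.receive(m2)
    print(m2)         # {0: 3}: only p0's own entry changed since message 1
    print(p1.clock)   # [3, 2, 0]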
-
Interval Tree Clocks: A Logical Clock for Dynamic Systems
-
Causality tracking mechanisms, such as vector clocks and version vectors, rely on mappings from globally unique identifiers to integer counters. In a system with a well-known set of entities, these ids can be preconfigured and given distinct positions in a vector or distinct names in a mapping. Id management is more problematic in dynamic systems with a large and highly variable number of entities, and is worsened when network partitions occur. Present solutions for causality tracking are not appropriate for these increasingly common scenarios. In this paper we introduce Interval Tree Clocks, a novel causality tracking mechanism that can be used in scenarios with a dynamic number of entities, allowing a completely decentralized creation of processes/replicas without the need for global identifiers or global coordination. The mechanism has a variable-size representation that adapts automatically to the number of existing entities, growing or shrinking appropriately. The representation is so compact that the mechanism can even be considered for scenarios with a fixed number of entities, which makes it a general substitute for vector clocks and version vectors.
-
Plausible clocks: constant size logical clocks for distributed systems
-
In a distributed system with N sites, the precise detection of causal relationships between events can only be done with vector clocks of size N. This gives rise to scalability and efficiency problems for logical clocks that can be used to order events accurately. In this paper we propose a class of logical clocks, called plausible clocks, that can be implemented with a number of components not affected by the size of the system and yet provide good ordering accuracy. We develop rules to combine plausible clocks to produce more accurate clocks. Several examples of plausible clocks and their combination are presented. Using a simulation model, we evaluate the performance of these clocks. We also present examples of applications where constant-size clocks can be used.
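(Note: One simple constant-size clock in the spirit of this paper,
sketched in Python with invented names; it is illustrative and not
necessarily one of the paper's exact constructions. N processes share
R < N vector entries, so the timestamp size is constant. Causally
ordered events are always reported as ordered, while some concurrent
events may be falsely ordered as well, which is what makes the clock
merely "plausible".)
    R = 4                               # constant timestamp size

    class PlausibleClock:
        def __init__(self, pid):
            self.slot = pid % R         # many processes share each entry
            self.v = [0] * R

        def event(self):                # local event or message send
            self.v[self.slot] += 1
            return list(self.v)

        def on_receive(self, ts):
            self.v = [max(a, b) for a, b in zip(self.v, ts)]
            self.v[self.slot] += 1
            return list(self.v)

    def maybe_before(t1, t2):
        # Never False when the first event causally precedes the second;
        # may be (wrongly) True for some concurrent pairs.
        return t1 != t2 and all(a <= b for a, b in zip(t1, t2))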
-
High Performance Computing:
-
A lot of papers in HPC are implicitly also papers on Computer Systems.
Some good conferences to look at are:
SC (Supercomputing),
HPDC (High Performance Distributed Computing), and
IPDPS (IEEE Int. Parallel and Distributed Processing Symposium; see the Systems track).