CS U685 / CS G185
(Research in High Performance Computing)
Instructor: Gene Cooperman
Spring, 2009
We will meet in 164 WVH today for the last talks. At 4:30,
we may need to move next door to 166 WVH.
We will meet for our last meeting on Wednesday, 2:50.
We are meeting in 415 Shilman Hall from 2:50 - 4:25 on Monday and
Wednesday.
Some main projects to be done in teams
are now listed at the
end of this page. Please start thinking about them, and ask questions
if you aren't sure what you would prefer. We continue with the five
minute oral presentations on Mon., Feb. 2 for
the mini-projects. One- or two-page
writeups of the mini-project will be due Mon., Feb. 2.
Most class members should now have access to the
course wiki,
to be used for the projects. It requires a CCIS account
to login to the Wiki (same username/password as CCIS account). If you
do not yet have a CCIS account, let me know when you have it, and I'll
link your account to the Wiki.
Please also check the systems debugging tips
from time to time, to see if any of the tips are helpful for you.
These tips will be extended from time to time.
As of Spring, 2009, High Performance Computing is a new course. It is
intended to provide a gentle introduction to research for
undergraduates and Master's students. At its core, research is
messier than the highly structured courses that one more typically
sees, but it can be very exciting to see things that no one has
ever seen before. For this reason, the course requires highly
motivated students who will operate semi-autonomously, while reporting
back to the class at regular intervals. The course will have
a small to moderate enrollment with the opportunity for more personal
attention.
For the prerequisites, it is assumed that students will be comfortable
programming in C (including pointers). Further, it is assumed that
students will be comfortable learning and using new system calls that they
have never seen before. The remaining background knowledge (including
systems concepts) will be introduced/reviewed in the course.
If you want to privately test yourself on the
prerequisites, then read man mmap, and try writing a short
C program that uses the mmap system call. Also, write a short
amount of testing code to verify that the system call produced the
result that you expected.
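As a rough idea of the scale of program intended, here is a minimal sketch, not a model answer. It assumes Linux and an anonymous private mapping; the function names map_anonymous and check_mapping are made up for this illustration. The testing code checks two things that man mmap promises: a fresh anonymous mapping reads as zeros, and bytes written to it can be read back.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS under strict -std modes */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: map `len` bytes of anonymous, private memory.
 * Returns NULL on failure (mmap itself returns MAP_FAILED, not NULL). */
static void *map_anonymous(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Testing code: returns 1 if the mapping behaves as expected, else 0. */
static int check_mapping(size_t len)
{
    unsigned char *p = map_anonymous(len);
    if (p == NULL)
        return 0;

    int ok = 1;
    for (size_t i = 0; i < len; i++)   /* fresh anonymous pages read as 0 */
        if (p[i] != 0)
            ok = 0;

    memset(p, 0xAB, len);              /* write, then read back both ends */
    if (p[0] != 0xAB || p[len - 1] != 0xAB)
        ok = 0;

    if (munmap(p, len) != 0)           /* release the mapping */
        ok = 0;
    return ok;
}
```

Calling check_mapping(4096) from main and printing the result is enough for a first test. A file-backed mapping (passing a file descriptor instead of -1) is a natural next step.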
For 2009, the course will be project-based, and will leverage the
research of the
High Performance Computing Laboratory. It will
emphasize two research vehicles:
- Roomy:
Roomy is a new mini-language created by
Daniel Kunkle.
It allows one to use the parallel disks of a computer cluster
to treat disk as an extension of RAM. This approach becomes
more credible when one realizes that the 50 local disks
of a typical cluster have about the same bandwidth as a
single RAM subsystem.
The research question then becomes to what extent interesting
storage-intensive algorithms can use the latency avoidance
features of Roomy. A positive result would support one of the
slogans of our lab: Disk is the New RAM.
- DMTCP:
DMTCP (Distributed
MultiThreaded CheckPointing) is an open source
package freely available from Sourceforge and developed by
a team originating in the High Performance Computing Laboratory.
It transparently
checkpoints the state of a process or computation to disk.
It does so in user space (no modification to the Linux kernel).
dmtcp_checkpoint a.out # run a.out under checkpoint control
dmtcp_command -c # checkpoint the current process
dmtcp_restart ckpt_a.out-*.dmtcp # restart process from disk
DMTCP transparently follows the creation of new threads,
the forking of child processes, and the spawning of remote
processes via ssh. It currently does not checkpoint certain
processes involving X-Windows, the ptrace system call (e.g. gdb),
or suspended processes (^Z). The research question is how well
DMTCP can checkpoint common processes (without modifying the kernel),
and how well it can be extended to novel applications (checkpointing
GUIs using X-Windows, creation of a reversible debugger by
checkpointing gdb, etc.). For example, an interesting novelty
would be the ability to checkpoint some open windows of your
current session, and carry them home with you on your USB key.
Instructor Information:
Office: 336 WVH
(and also look in my High Performance Computing Lab, 370 WVH)
Office Hours:
After class: 4:20 - 5:30, Monday and Wednesday; and by appointment
Text:
There is no textbook. Internal documents and pointers to resources
on the Web will be provided. Please also note the two reference books
on systems programming listed at the end of this web page.
Grades:
Grades will be determined by the sophistication of the project, along with
the quality of the reports to the class (both oral and written reports).
Both individual and joint projects are possible. Students will be
encouraged to first do a (warm-up) mini-project, followed by a full
project that need not be on the same topic.
Research consists of exploration into the unknown.
Since all research is speculative, research results consist both of
positive and negative results. In geographical terms, the discovery
of a new mountain range (a new barrier) is just as interesting
as the discovery of a new river (a new exploration route).
GDB and other UNIX resources:
Some help files for UNIX and its compilers,
editors, etc. are also available.
In particular, the use of gdb (the GNU debugger) is especially encouraged
as an important productivity tool.
Syllabus:
(Note that the overlap of certain weeks is intentional.)
WEEKS 1 and 2: Introduction to research topics; students
choose mini-project
WEEKS 3 and 4: Continuing lectures on research topics; students
complete mini-projects
WEEK 4: Students present results of mini-project
(oral and written)
WEEKS 4 and 5: Students choose course project.
WEEKS 4 through 8: Lectures guided by needs of students for
projects.
WEEK 6: Interim project reports by students (oral and written).
WEEK 9: Further interim project reports (oral and written).
WEEKS 10 through 12: Students lead discussions of lessons from
research: results of research topics to date; potential for
new research directions; interaction with other research results
in the literature
WEEK 12: Final project presentations (oral and written)
Resources
Here are two books that are useful for systems programming concepts.
Choose a chapter of interest, rather than reading it from front to
back. The Rochkind book is a good first book,
with example source code showing useful programs. The book by Robert Love
provides more technical details, and may be more complete.
-
Northeastern Library Reserve:
Advanced UNIX Programming, Second Edition,
by Marc J. Rochkind, Addison Wesley, 2004 (Library copy goes
on reserve Jan. 15)
- Online: Linux System Programming by Love, 2007 (free online version
accessible from Northeastern computer network via
Safari Books Online)
- If using from another ISP outside of Northeastern U.,
then try tunneling using your CCIS account:
ssh -L1234:0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu:80 denali.ccs.neu.edu
or:
ssh -L1234:safari.oreilly.com:80 denali.ccs.neu.edu
Then point one's browser at the URL http://localhost:1234/
We will add more description to these suggested mini-projects. But rather
than hide the course direction, we will use this space to
allow you to see more of the course roadmap.
Warning: there is a bug in Ubuntu 8.10 (Intrepid)/gcc-4.3;
An Ubuntu patch causes gcc to try to inline a DMTCP function, and then
fail to do so with the error statement: unimplemented functionality.
There are two workarounds. Either downgrade to gcc-4.2 and configure with:
env CC=gcc-4.2 CXX=g++-4.2 ./configure
or, when 'make' fails, look at the failing 'g++' compile statement and execute
it by hand with a lower optimization level (-O2, -O1, -O0??). After
that .o object file is built, execute 'make' again.
- Roomy (Implement one of the following search methods in Roomy;
First, look over the example code
for Towers of Hanoi)
- lossy hash tables : Mary Ellen investigated Murphi. It remains
to be seen whether the unvisited queue (frontier) can be handled
well within Roomy. (Mary Ellen will talk to Dan.)
- bidirectional search : Perhaad took Knapsack problem as example,
and wrote code so it could generalize to branch-and-bound
- iterative deepening
- a variant of A* (parallel A*, etc.)
- another heuristic search (heuristic BFS)
- frontier search
- landmarks
- branch and bound (applications such as integer programming, game tree search)
- dynamic programming
- DMTCP (Familiarize yourself now with the code.
Then attempt to diagnose one of our confirmed bugs.
Use tools such as gdb. Read the QUICK-INSTALL file for more
tips about DMTCP and its debugging tools. Use this as an
excuse to start understanding the DMTCP code. It does not matter
whether you actually fix the bug.)
- gcl is broken. It fails to checkpoint consistently.
You may need to first install gcl (GNU Common Lisp):
(Note, debug gcl binary, not shell script.)
Nick found a crash on checkpoint in class dmtcp::KernelDeviceToConnection,
DbgSpamFds() or std_map<std::string:ConnectionIdentifier>;
the issue is in Stage 2, when confirming whether a file descriptor
is okay and printing it.
- Bash is reported broken when the DMTCP gzip option is enabled.
It works when gzip is disabled.
- Screen doesn't work for checkpointing. Why?
Alex finds that simple screen (no commands in subshell) crashes
on checkpoint, because JASSERT_UTIL_LIBS is unset between
end of dmtcp_checkpoint and beginning of DmtcpWorker constructor
when we exec to screen process.
Project Software
Roomy Resources
If you have questions about Roomy, please send e-mail to
Dan Kunkle and me. The username of Dan Kunkle is his
last name (all lower case) and: @ccs.neu.edu
Please look in the
Roomy directory.
DMTCP
If you have questions about DMTCP, please send e-mail to
Kapil Arya and me. The username of Kapil Arya is his
first name (all lower case) and: @ccs.neu.edu
DMTCP is available
through the sourceforge web page.
The easiest way to start is (in Linux) to type:
svn co https://dmtcp.svn.sourceforge.net/svnroot/dmtcp dmtcp
cd dmtcp
./configure
[ OR: ./configure --enable-debug ]
make
make check [OPTIONAL]
Then read the QUICK-START file in the top-level dmtcp directory.
From there start browsing the source code.
If the suggestions are unclear, use "man" to find out more
about the commands.
-
pkill -9 a.out
(where you replace a.out by the name of your binary)
- The use of gdb is essential. Note the
introduction to UNIX tools.
-
gdb a.out PID
OR
gdb a.out
(gdb) attach PID
(where the attach command is given within gdb).
-
gdb a.out `pgrep a.out | tail -1`
-
strace -o outputFile a.out
(trace system calls;
decide in advance if it should trace all child processes or not)
-
ltrace -o outputFile a.out
(not as useful as
strace, but sometimes interesting: trace library
calls instead of system calls).
-
xtrace
(not available on all systems, but has the
ability to trace function calls as well as system calls)
-
ps auxw | grep a.out
-
pstree -pu $USER
or pstree -lu $USER
(tree of processes and child
process; names in curly braces are additional threads);
Note idioms like: pstree -p | grep -C2 a.out
- When your program runs too slowly, it might not be CPU-bound. Check
man iostat
man vmstat
for
disk/file I/O (Blk_read/s / Blk_wrtn/s),
and paging to disk (bi/bo/id), respectively. A local disk (not
on the network; SANs are different) can sequentially
read or write (not both at once) roughly at a rate
from 50 MB/s to 100 MB/s.
If you are mostly accessing files and you don't
see that bandwidth, then your program is not efficient. If you are
paging to disk and you do see a bandwidth
anywhere near that rate, then you are using too much RAM.
- Search for SUBSTRING in all dmtcp/src/* files:
find dmtcp/src | xargs grep SUBSTRING
-
less /proc/PID/maps
-
ls -l /proc/PID/fd
-
lsof
(list open file descriptors)
-
nm a.out
(or nm library.so)
Note the form nm -o for printing out filenames. This
can be useful with brute force strategies:
nm -o /usr/lib/lib* | grep MY_SYMBOL
(Also see discussion of readelf and objdump below.)
-
man ld.so
-
env LD_DEBUG=help a.out
(and try other
options to LD_DEBUG)
- Replace PID in following:
pushd /proc/PID;
ls -l exe;
echo -n "cmdline: "; cat -v cmdline;
echo ""; cat -v environ; echo "";
popd
- Add a system call, sleep(10), or else a delay loop in
DMTCP/MTCP code to force it to pause at some line while
you attach to the running process.
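A common variation on the sleep(10) idea is a loop that waits on a flag, so that you decide from inside gdb when to continue. The sketch below is only an illustration: keep_waiting and wait_for_debugger are hypothetical names, not part of DMTCP or MTCP, and you would paste the call at the line where you want the process to pause.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical flag, cleared from gdb. `volatile` forces the loop to
 * reload it from memory on every iteration. */
static volatile int keep_waiting = 1;

/* Pause until a debugger clears keep_waiting, or until max_seconds
 * have passed. Returns the number of one-second sleeps performed. */
static int wait_for_debugger(int max_seconds)
{
    int waited = 0;
    fprintf(stderr, "pid %d is waiting for a debugger\n", (int)getpid());
    while (keep_waiting && waited < max_seconds) {
        sleep(1);
        waited++;
    }
    return waited;
}
```

After attaching with gdb a.out PID, type set var keep_waiting = 0 and then finish (or continue) to let the process go on. The timeout keeps a forgotten call from hanging the process forever.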
- DMTCP: Configure with
./configure --enable-debug
and look at the /tmp/jassert.PID files.
- MTCP: Look at mtcp/Makefile and modify CFLAGS in it to
use
-DDEBUG
-
ldd a.out
(for some binary, a.out)
-
strings a.out
(for some binary, a.out)
-
top
-
watch -d COMMAND
OR: watch -d "pstree -l | grep -A1 `basename $SHELL`"
(repeatedly execute COMMAND)
- Using gdb with C++ : For a C++ function with namespace, class,
and signature (e.g.: dmtcp::myClass::foo(int, bool) ),
try listing it first:
l 'dmtcp::myC<TAB>
It will autocomplete. Extend it, and type the final quote mark (').
Once you're sure you can list it, you can do things like set
a breakpoint:
b 'dmtcp::myC<TAB> (and complete it with quote mark
as before).
- Using gdb with errno: In glibc, the global variable errno (see
man errno) is a macro that is redefined to:
*(int *)__errno_location()
If you want to p errno within gdb, you will have
to modify this into p *(int *)__errno_location()
On 64-bit Linux, glibc seems to do something even more complicated,
requiring a more complicated solution.
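To convince yourself of the errno tip from C rather than from gdb, here is a small sketch. It assumes glibc, where errno.h declares __errno_location(); the two function names are made up for this example. The first function checks that errno and *__errno_location() are the same int, and the second forces a failing system call and reads the error through __errno_location().

```c
#include <errno.h>
#include <unistd.h>

/* Returns 1 if errno and *__errno_location() name the same int
 * (glibc-specific; __errno_location is not a POSIX interface). */
static int errno_is_location(void)
{
    return &errno == __errno_location();
}

/* close(-1) always fails with EBADF; read the error code the way
 * gdb would have to, through __errno_location(). */
static int errno_after_bad_close(void)
{
    errno = 0;
    close(-1);
    return *__errno_location();
}
```

The value returned by errno_after_bad_close() is the same number that p *(int *)__errno_location() prints in gdb right after the failing call.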
- If you look at gdb and some call frames on the stack have
no information (only a hex address and "?"), then find out where
the call frames come from. Look at the hexadecimal address.
Then do:
cat /proc/PID/maps (for PID the
pid of the process being debugged). Find which library or
other memory segment the unknown hexadecimal address came from.
Knowing which library was called is useful, but you may be able
to find out more.
If it comes from libc.so (or some other well-known library),
then see the next two tips for how to get the library
to show you its internal debugging information.
- (Continued) If you need a libc.so (or other well-known library)
with debugging symbols, then:
- Install the
package libc6-dbg. (The package name might differ for you.
Also, this assumes you have root privilege on your Linux.)
This will install a special libc.so in the directory
/usr/lib/debug .
Please note that the CCIS Ubuntu Linux machines already have
a debugging version of libc installed, currently as
/usr/lib/debug/libc-2.7.so .
- Next, do:
env LD_LIBRARY_PATH=/usr/lib/debug dmtcp_checkpoint a.out
(Presumably, after you checkpoint, the restarted a.out process
will be using the pre-checkpoint libraries and hence the
debugging versions. So, probably you don't need to
use env LD_LIBRARY_PATH=/usr/lib/debug for the restart
command. But if you're unsure, it doesn't hurt.)
-
The a.out process above should now be using a debugging version
of libc.so and perhaps other libraries.
You can verify this by looking at /proc/PID/maps
for your process. Now, in gdb, you will see the symbol information
in the call frame and a source code file and line number.
-
To read the corresponding source code, you can either download it from
the main source code location:
http://www.gnu.org/software/libc/libc.html#Availability
(try to choose the same libc version, and note that
the line numbers may be different in your Linux distro),
or download the source package for your particular Linux distro.
- (Continued) If gdb still shows some call frames with "?", and
you have the full pathname of the library on disk, then you
can often fix it as follows. (Once you understand this procedure,
you may want to try the bin/gdb-add-symbol-file shell
script found in DMTCP.)
- In /proc/PID/maps look up the full pathname
of the library you need to load. The address of the call frame
with missing information should be in the address range of that
library.
- In gdb, read help add-symbol-file
- In gdb, type
add-symbol-file FILE ADDR
where FILE is the full pathname you identified in the
/proc/PID/maps file. The ADDR
will be the hexadecimal sum of:
- beginning of text segment address (text segment normally
has r-x permission) in /proc/PID/maps; and
- hexadecimal address for Addr heading corresponding
to .text when you look it up under Headers:
with either of the following commands:
readelf -S FILE
objdump -h FILE
- In the last step, the maps file provided the beginning address of
the whole segment, but the binary library on disk contains many
sections for a segment, and the .text section need not be the
first section in the file. So, we must add the offset of the
.text section, found by analyzing the binary library on disk.
- In gdb, a convenient way to add hexadecimal numbers is:
p/x addr1 + addr2
where addr1 and addr2 are the two addresses we discussed. If those
addresses are in hexadecimal, make sure to include 0x
at the beginning of each hexadecimal number.
- Now do where in gdb, and you should see full call
frame information.
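The /proc/PID/maps lookup used in the last few tips can also be done by a program. The following is a sketch under the assumption that each line has the usual Linux layout (start-end perms offset dev inode pathname); find_mapping is a hypothetical helper written for this page. Given a hex address from a call frame, it reports which mapping contains it and where that mapping begins, which is the first of the two numbers summed for add-symbol-file.

```c
#include <stdio.h>
#include <string.h>

/* Search the maps file at maps_path for the mapping containing addr.
 * On success, store the mapping's start address in *start_out and its
 * pathname (possibly empty) in name, and return 1. Return 0 otherwise. */
static int find_mapping(const char *maps_path, unsigned long addr,
                        unsigned long *start_out,
                        char *name, size_t name_len)
{
    if (name_len == 0)
        return 0;
    FILE *f = fopen(maps_path, "r");
    if (f == NULL)
        return 0;

    char line[4096];
    int found = 0;
    while (!found && fgets(line, sizeof line, f) != NULL) {
        unsigned long start, end;
        int name_pos = -1;   /* set by %n to where the pathname begins */
        if (sscanf(line, "%lx-%lx %*s %*s %*s %*s %n",
                   &start, &end, &name_pos) < 2 || name_pos < 0)
            continue;
        if (addr < start || addr >= end)
            continue;
        snprintf(name, name_len, "%s", line + name_pos);
        name[strcspn(name, "\n")] = '\0';   /* drop trailing newline */
        *start_out = start;
        found = 1;
    }
    fclose(f);
    return found;
}
```

For example, find_mapping("/proc/self/maps", address, &start, name, sizeof name) inspects the calling process. Replace self by a PID to inspect the process being debugged.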
- The two commands readelf and objdump are useful
for inspecting the contents of binary files. These are related
to the other commands, nm and strings, but
these commands have many more options, including the ability
to disassemble into assembly code, the ability to display
section headers, etc. Scan the man pages
quickly to see if something might be useful for you.
- In gdb, you are sometimes forced to descend to assembly
language (hopefully, only after coming as close as possible
to the offending code using the techniques above). To do so, do:
(gdb) x/10i $pc
(gdb) stepi
As always, read the gdb help for further information.
x/10i $pc says to examine the ten instructions
starting at the program counter. If you get tired of constantly
typing x/10i $pc, type in gdb just once:
(gdb) display/10i $pc
and then continue to type stepi (or simply carriage
return to repeat the last stepi command).
- For an assembly level listing as you do stepi in gdb,
try
objdump -S a.out > a.out.listing
where a.out should be replaced by your binary.
For a more verbose form, try one of:
gcc -c -g -Wa,-alh,-L file.c > file.s
gcc -c -g -Wa,-ahls=file.s file.c
Variations of this can also produce assembly code that can be
directly assembled by gcc or by as.
For example, if you want to modify and re-compile the source
code for libc.so,
this is normally quite painful. A nice trick is to disassemble
libc.so into assembly, and then cut or copy out the particular assembly
routines that you want to assemble into a modified library.
-
UNIX system calls, by Open Group;
(enter system call in search box);
These are the clearest, most precise man pages for system calls
you will ever find.
- Valgrind (Memory and leak detection
utility); This is easy-to-use and surprisingly powerful.
- If you want to see the stack just before a segfault, try the glibc
call backtrace (man backtrace). It leaves any
C++ names mangled, but they are mostly readable (and utilities exist
for demangling the names). Read the notes of man backtrace
(for example, compile with gcc -rdynamic to get
symbol names). Also, note man addr2line.
Look at the example file, backtrace.c
for this course.
Also, for any call frames with no symbol
name, look up the hex address in /proc/<PID>/maps.
Use addr2line to translate hex addresses into
line numbers in source code. (If it's a .so dynamic library,
give it the offset: the hex address minus the beginning
library address as shown by /proc/<PID>/maps.)
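For orientation, here is a minimal sketch of the backtrace idea. It is independent of the course's backtrace.c, and the names dump_backtrace, segv_handler and install_segv_backtrace are invented for this illustration. It uses backtrace_symbols_fd, which writes straight to a file descriptor instead of allocating strings, so it is a reasonable choice inside a signal handler.

```c
#define _DEFAULT_SOURCE          /* for sigaction under strict -std modes */
#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define MAX_FRAMES 64

/* Write the current call stack to fd. Returns the number of frames. */
static int dump_backtrace(int fd)
{
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);
    backtrace_symbols_fd(frames, n, fd);
    return n;
}

/* On SIGSEGV: print the stack to stderr, then exit without returning
 * to the faulting instruction. */
static void segv_handler(int sig)
{
    static const char msg[] = "caught SIGSEGV; backtrace:\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)ignored;
    dump_backtrace(STDERR_FILENO);
    _exit(128 + sig);
}

/* Install the handler. Returns 0 on success, -1 on failure. */
static int install_segv_backtrace(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Call install_segv_backtrace() early in main and compile with gcc -g -rdynamic. Frames printed only as hex addresses can then be passed to addr2line as described above.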
- An interesting Linux command: addr2line
- In comparing two versions of software, consider programs such as:
kompare, kdiff3, meld, gvimdiff (or text-based vimdiff).
- To see which virtual memory pages are currently mapped to physical memory,
see /proc/PID/pagemap .
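As an illustration of reading the pagemap file, the sketch below asks whether the page holding a given address of the current process is present in RAM. It relies on the layout described in the Linux kernel's pagemap documentation: one 64-bit entry per virtual page, with bit 63 set when the page is present. The function name page_is_resident is hypothetical, and on some systems the file may not be readable without extra privilege, in which case the function reports an error.

```c
#define _DEFAULT_SOURCE          /* for pread under strict -std modes */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Returns 1 if the page containing addr is present in physical memory,
 * 0 if it is not, and -1 if pagemap could not be read. */
static int page_is_resident(const void *addr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;

    /* Entry number = virtual page number; each entry is 8 bytes. */
    uint64_t entry = 0;
    off_t offset = (off_t)((uintptr_t)addr / (uintptr_t)page_size
                           * sizeof entry);
    ssize_t n = pread(fd, &entry, sizeof entry, offset);
    close(fd);
    if (n != (ssize_t)sizeof entry)
        return -1;

    return (int)((entry >> 63) & 1u);   /* bit 63: page present */
}
```

A page of a large array that has been allocated but never touched is a good candidate for seeing a result of 0.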
The main projects are listed below. We will also set up a course Wiki,
where you will describe the status of your projects.
The Wiki will also have a space for general issues/comments
in supporting Roomy and DMTCP.
Roomy
- Roomy Applications
- Explicit State Space Verification (e.g. Murφ/Murphi) ---
used for protocol and program verification
- BDD (Binary Decision Diagrams) --- widely used for
hardware and protocol verification
- SAT solver --- used for program verification
- Integer programming --- application of branch-and-bound;
extremely widely used; will require further reading on
best implementations of integer programming
- Roomy Under the Hood
- Dynamic data compression (saving space on disk)
- Zero-mapped pages (uninitialized pages need not be stored,
just as in operating systems)
- In-RAM version of Roomy
- Load balancing
- Fault-tolerant Roomy (DMTCP: Roomy files with serial writes are handled
automatically by DMTCP. For files with random updates, one
needs to save the file, or else save a previous version of the
file with a log of updates)
DMTCP
- Checkpoint screen
- Integrate checkpointing into iPython (part of SciPy (Scientific Python)).
We know the developers, and they can help us.
- Checkpoint the job control/suspend feature of your favorite shell (^Z)
- Checkpoint gdb (requiring understanding of 'man ptrace')
(This project is ongoing; talk to me first)
-
Example of ptrace usage
-
Tracing tricks with ptrace
- Have MTCP use a standard ELF linker script. (This project is
ongoing; talk to me first)
- Fix reports of fragility of checkpointing bash/etc.
(The following were reported not working at one time.)
- make check-dmtcpaware1 on 32-bit Ubuntu
- Checkpoint matlab (mostly works now, but some details don't)
- Add virtualization of pid's so that waitpid() works.
This also fixes a case for TightVNC.
- With --enable-debug _only_, the simple script:
#!/bin/bash<newline>ls<newline>
will non-deterministically fail about 20% of the time.
We may be unstable with *all* bash scripts.
The script bash-bug seems
to fail a lot of the time.
- If it fails due to lack of disk space or lack of disk quota,
this reason ("insufficient disk space or quota") should be
reported to the user.
- Modify test/autotest.py so that it will accept commands
from standard input to pass to a.out (process being checkpointed)
before checkpoint or before restart. For example, bash
may successfully restart and then fail if given "ls" command,
since that exercises code that creates a child process.
- Checkpointing of file locks should also be handled
- There are other things to virtualize besides pid's
(e.g. uid's (see "setuid" and "seteuid"), etc.) However,
this does not seem urgent for current use cases.
- Hijack/Attach to already running process and checkpoint
(There is a question here about how to follow socket connections,
if that process is already talking to other processes.)
- Thread race condition detector: A traditional race condition
eventually causes a crash. But since it's a race condition,
it doesn't always crash at that location.
Experiment with different checkpoints,
until you find a checkpoint location for which the process
always crashes upon restart. Then modify MTCP to only allow
a subset of the threads to resume, and keep the other threads
suspended. By trial and error, discover which two threads
have a race condition.
- Memory leak detector: If a memory leak occurs later in the
program, valgrind runs too slowly to easily find it.
So, use a malloc debugger or your own malloc/free interceptor.
This defines regions created through malloc. Late in the program,
it will be easy to find a region of memory that is a memory
leak (that no one ever touches again). Using checkpoint/restart
tricks, find the last time that anyone touched that memory
segment. Report that line of code using a standard tool to
convert between a line of assembly language and the source code line.
To guarantee that no one ever uses that memory again, remove
read and write permission from that region of memory and add
a segfault handler to trap any accesses. Then automate the
many checkpoint-restart cycles to automatically find where the memory
segment was last touched.
- Portable Linux Apps: DMTCP checkpoint images include any
libraries that have been loaded. If the environment variable
LD_BIND_NOW is set (set to anything), then the loader will preload
every library that it will need. This should enable one to copy
a checkpoint image from Debian Linux to OpenSuse Linux to RedHat Linux
to Ubuntu Linux to (etc.). Does this work? If not, what's needed
to make it work?
- Incremental Checkpoint: DMTCP may want to keep multiple
checkpoint images, so that it can return to any of several
execution points in the past. This would normally require a lot
of disk space. How does one efficiently store a diff between
checkpoints? (This may be a somewhat easier project, for those
who are looking for that. With other projects, one will often
find that at the end of the semester, one has to report that some
parts are still not working, and why. This project offers the
opportunity of finishing most of the project, if there are no
surprises.)
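As one possible starting point for the incremental checkpoint project, the sketch below compares two memory images page by page and records which pages changed; only those pages would need to be stored in a diff. It says nothing about DMTCP's actual checkpoint image format. The 4096-byte page size, the flat in-memory images, and the name diff_pages are all assumptions made for this illustration.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096   /* assumed page size for this sketch */

/* Compare two images of equal length, one page at a time. The indices
 * of differing pages are stored in changed[] (up to max_changed of them).
 * Returns the total number of differing pages, which may exceed
 * max_changed. */
static size_t diff_pages(const unsigned char *old_img,
                         const unsigned char *new_img, size_t len,
                         size_t *changed, size_t max_changed)
{
    size_t count = 0;
    for (size_t off = 0; off < len; off += PAGE_BYTES) {
        /* The last page may be shorter than PAGE_BYTES. */
        size_t n = (len - off < PAGE_BYTES) ? len - off : PAGE_BYTES;
        if (memcmp(old_img + off, new_img + off, n) != 0) {
            if (count < max_changed)
                changed[count] = off / PAGE_BYTES;
            count++;
        }
    }
    return count;
}
```

A real design would also have to handle regions that grow, shrink, or move between checkpoints, which is where the research questions begin.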