CS U685 / CS G185 (Research in High Performance Computing)

Instructor: Gene Cooperman
Spring, 2009

We will meet in 164 WVH today for the last talks. At 4:30, we may need to move next door to 166 WVH. We will meet for our last meeting on Wednesday, 2:50.

We are meeting in 415 Shillman Hall from 2:50 - 4:25 on Monday and Wednesday.

Some main projects to be done in teams are now listed at the end of this page. Please start thinking about them, and ask questions if you aren't sure what you would prefer. We continue with the five minute oral presentations on Mon., Feb. 2 for the mini-projects. One- or two-page writeups of the mini-project will be due Mon., Feb. 2.

Most class members should now have access to the course wiki, to be used for the projects. It requires a CCIS account to log in to the Wiki (same username/password as the CCIS account). If you do not yet have a CCIS account, let me know when you have it, and I'll link your account to the Wiki.

Please also check the systems debugging tips from time to time, to see if any of the tips are helpful for you. These tips will be extended from time to time.

As of Spring, 2009, High Performance Computing is a new course. It is intended to provide a gentle introduction to research for undergraduates and Master's students. At its core, research is messier than the highly structured courses that one more typically sees, but it can be very exciting to see things that no one has ever seen before. For this reason, the course requires highly motivated students who will operate semi-autonomously, while reporting back to the class at regular intervals. The course will have a small to moderate enrollment with the opportunity for more personal attention.

For the prerequisites, it is assumed that students will be comfortable programming in C (including pointers). Further, it is assumed that students will be comfortable learning and using new system calls that they have never seen before. The remaining background knowledge (including systems concepts) will be introduced/reviewed in the course. If you want to privately test yourself on the prerequisites, then read man mmap, and try writing a short C program that uses the mmap system call. Also, write a short amount of testing code to verify that the system call produced the result that you expected.

For 2009, the course will be project-based, and will leverage the research of the High Performance Computing Laboratory. It will emphasize two research vehicles:

Roomy:
Roomy is a new mini-language created by Daniel Kunkle. It allows one to use the parallel disks of a computer cluster to treat disk as an extension of RAM. This approach becomes more credible when one realizes that the 50 local disks of a typical cluster have about the same bandwidth as a single RAM subsystem. The research question then becomes to what extent interesting storage-intensive algorithms can use the latency avoidance features of Roomy. A positive result would support one of the slogans of our lab: Disk is the New RAM.
DMTCP:
DMTCP (Distributed MultiThreaded CheckPointing) is an open source package freely available from Sourceforge and developed by a team originating in the High Performance Computing Laboratory. It transparently checkpoints the state of a process or computation to disk. It does so in user space (no modification to the Linux kernel).
  dmtcp_checkpoint a.out            # run a.out under checkpoint control
  dmtcp_command -c                  # checkpoint the current process
  dmtcp_restart ckpt_a.out-*.dmtcp  # restart process from disk
DMTCP transparently follows the creation of new threads, the forking of child processes, and the spawning of remote processes via ssh. It currently does not checkpoint certain processes involving X-Windows, the ptrace system call (e.g. gdb), or suspended processes (^Z). The research question is how well DMTCP can checkpoint common processes (without modifying the kernel), and how well it can be extended to novel applications (checkpointing GUIs using X-Windows, creation of a reversible debugger by checkpointing gdb, etc.). For example, an interesting novelty would be the ability to checkpoint some open windows of your current session, and carry them home with you on your USB key.

Instructor Information: Office: 336 WVH (and also look in my High Performance Computing Lab, 370 WVH)

Office Hours: After class: 4:20 - 5:30, Monday and Wednesday; and by appointment

Text: There is no textbook. Internal documents and pointers to resources on the Web will be provided. Please also note the two reference books on systems programming listed at the end of this web page.

Grades: Grades will be determined by the sophistication of the project, along with the quality of the reports to the class (both oral and written reports). Both individual and joint projects are possible. Students will be encouraged to first do a (warm-up) mini-project, followed by a full project that need not be on the same topic.

Research consists of exploration into the unknown. Since all research is speculative, research results consist both of positive and negative results. In geographical terms, the discovery of a new mountain range (a new barrier) is just as interesting as the discovery of a new river (a new exploration route).

GDB and other UNIX resources: Some help files for UNIX and its compilers, editors, etc. are also available. In particular, the use of gdb (the GNU debugger) is especially encouraged as an important productivity tool.

Syllabus: (Note that the overlap of certain weeks is intentional.)
WEEKS 1 and 2: Introduction to research topics; students choose mini-project
WEEKS 3 and 4: Continuing lectures on research topics; students complete mini-projects
WEEK 4: Students present results of mini-project (oral and written)
WEEKS 4 and 5: Students choose course project.
WEEKS 4 through 8: Lectures guided by needs of students for projects.
WEEK 6: Interim project reports by students (oral and written).
WEEK 9: Further interim project reports (oral and written).
WEEKS 10 through 12: Students lead discussions of lessons from research: results of research topics to date; potential for new research directions; interaction with other research results in the literature
WEEK 12: Final project presentations (oral and written)

Resources

Here are two books that are useful for systems programming concepts. Choose a chapter of interest, rather than reading it from front to back. The Rochkind book is a good first book, with example source code showing useful programs. The book by Robert Love provides more technical details, and may be more complete.

Northeastern Library Reserve: Advanced UNIX Programming, Second Edition, by Marc J. Rochkind, Addison Wesley, 2004 (Library copy goes on reserve Jan. 15)
Online: Linux System Programming by Love, 2007 (free online version accessible from Northeastern computer network via Safari Books Online)
- If using from another ISP outside of Northeastern U., then try tunneling using your CCIS account:
  ssh -L1234:0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu:80 denali.ccs.neu.edu
  or:
  ssh -L1234:safari.oreilly.com:80 denali.ccs.neu.edu
  Then point one's browser at the URL http://localhost:1234/

Mini-Projects

We will add more description to these suggested mini-projects. But rather than hide the course direction, we will use this space to allow you to see more of the course roadmap.
Warning: there is a bug in Ubuntu 8.10 (Intrepid)/gcc-4.3. An Ubuntu patch causes gcc to try to inline a DMTCP function, and then fail to do so with the error statement: unimplemented functionality. There are two workarounds. Either downgrade to gcc-4.2 and configure with:
  env CC=gcc-4.2 CXX=g++-4.2 ./configure
or, when 'make' fails, look at the failing 'g++' compile statement and execute it by hand with a lower optimization level (try -O2, -O1, or -O0). After that .o object file is built, execute 'make' again.

Roomy (Implement one of the following search methods in Roomy; First, look over the example code for Towers of Hanoi)
1. lossy hash tables : Mary Ellen investigated Murphi. It remains to be seen whether the unvisited queue (frontier) can be handled well within Roomy; Mary Ellen will talk to Dan.
2. bidirectional search : Perhaad took Knapsack problem as example, and wrote code so it could generalize to branch-and-bound
3. iterative deepening
4. a variant of A* (parallel A*, etc.)
5. another heuristic search (heuristic BFS)
6. frontier search
7. landmarks
8. branch and bound (applications such as integer programming, game tree search)
9. dynamic programming
DMTCP (Familiarize yourself now with the code. Then attempt to diagnose one of our confirmed bugs. Use tools such as gdb. Read the QUICK-INSTALL file for more tips about DMTCP and its debugging tools. Use this as an excuse to start understanding the DMTCP code. It does not matter whether you actually fix the bug.)
1. gcl is broken. It fails to checkpoint consistently. You may need to first install gcl (GNU Common Lisp). (Note: debug the gcl binary, not the shell script.) Nick found a crash on checkpoint in class dmtcp::KernelDeviceToConnection, DbgSpamFds() or std_map<std::string:ConnectionIdentifier>; the issue is with Stage 2, in confirming whether a file descriptor is okay and printing it.
2. Bash is reported broken when the DMTCP gzip option is enabled. It works when gzip is disabled.
3. Screen doesn't work for checkpointing. Why? Alex finds that simple screen (no commands in subshell) crashes on checkpoint, because JASSERT_UTIL_LIBS is unset between end of dmtcp_checkpoint and beginning of DmtcpWorker constructor when we exec to screen process.

Project Software

Roomy Resources

If you have questions about Roomy, please send e-mail to Dan Kunkle and me. The username of Dan Kunkle is his last name (all lower case) and: @ccs.neu.edu
Please look in the Roomy directory.

DMTCP

If you have questions about DMTCP, please send e-mail to Kapil Arya and me. The username of Kapil Arya is his first name (all lower case) and: @ccs.neu.edu

DMTCP is available through the sourceforge web page. The easiest way to start is (in Linux) to type:

  svn co https://dmtcp.svn.sourceforge.net/svnroot/dmtcp dmtcp 
  cd dmtcp
  ./configure
  [ OR:   ./configure --enable-debug ]
  make
  make check  [OPTIONAL]

Then read the QUICK-START file in the top-level dmtcp directory. From there start browsing the source code.

Debugging Tricks

If the suggestions are unclear, use "man" to find out more about the commands.

pkill -9 a.out, where you replace a.out by the name of your binary.
The use of gdb is essential. Note the introduction to UNIX tools.
gdb a.out PID
OR: gdb a.out, and then (gdb) attach PID (where the attach command is given within gdb).
gdb a.out `pgrep a.out | tail -1`
strace -o outputFile a.out (trace system calls; decide in advance if it should trace all child processes or not)
ltrace -o outputFile a.out (not as useful as strace, but sometimes interesting: trace library calls instead of system calls).
xtrace (not available on all systems, but has the ability to trace function calls as well as system calls)
ps auxw | grep a.out
pstree -pu $USER or pstree -lu $USER (tree of processes and child process; names in curly braces are additional threads); Note idioms like: pstree -p | grep -C2 a.out
When your program runs too slowly, it might not be CPU-bound. Check man iostat and man vmstat for disk/file I/O (Blk_read/s / Blk_wrtn/s), and paging to disk (bi/bo/id), respectively. A local disk (not on the network; SANs are different) can sequentially read or write (not both at once) at a rate of roughly 50 MB/s to 100 MB/s. If you are mostly accessing files and you don't see that bandwidth, then your program is not efficient. If you are paging to disk and you do see a bandwidth anywhere near that figure, then you are using too much RAM.
Search for SUBSTRING in all dmtcp/src/* files: find dmtcp/src | xargs grep SUBSTRING
less /proc/PID/maps
ls -l /proc/PID/fd
lsof (list open file descriptors)
nm a.out (or nm library.so) Note the form nm -o for printing out filenames. This can be useful with brute force strategies:
nm -o /usr/lib/lib* | grep MY_SYMBOL
(Also see discussion of readelf and objdump below.)
man ld.so
env LD_DEBUG=help a.out (and try other options to LD_DEBUG)
Replace PID in following:
pushd /proc/PID; ls -l exe; echo -n "cmdline: "; cat -v cmdline; echo ""; cat -v environ; echo ""; popd
Add a system call, sleep(10), or else a delay loop in DMTCP/MTCP code to force it to pause at some line while you attach to the running process.
DMTCP: ./configure --enable-debug, and look at /tmp/jassert.PID files.
MTCP: Look at mtcp/Makefile and modify CFLAGS in it to use -DDEBUG
ldd a.out (for some binary, a.out)
strings a.out (for some binary, a.out)
top
watch -d COMMAND OR: watch -d "pstree -l | grep -A1 `basename $SHELL`" (repeatedly execute COMMAND)
Using gdb with C++ : For a C++ function with namespace, class, and signature (e.g.: dmtcp::myClass::foo(int, bool) ), try listing it first:
l 'dmtcp::myC<TAB>
It will autocomplete. Extend it, and type the final quote mark ('). Once you're sure you can list it, you can do things like set a breakpoint:
b 'dmtcp::myC<TAB> (and complete it with quote mark as before).
Using gdb with errno: In glibc, the global variable errno (see man errno) is a macro that is redefined to:
*(int *)__errno_location()
If you want to p errno within gdb, you will have to modify this into p *(int *)__errno_location(). On 64-bit Linux, glibc seems to do something even more complicated, requiring a more complicated solution.
If you look at gdb and some call frames on the stack have no information (only a hex address and "?"), then find out where the call frames come from. Look at the hexadecimal address. Then do:
cat /proc/PID/maps (for PID the pid of the process being debugged). Find which library or other memory segment the unknown hexadecimal address came from. Knowing which library was called is useful, but you may be able to find out more. If it comes from libc.so (or some other well-known library), then see the next two tips for how to get the library to show you its internal debugging information.
(Continued) If you need a libc.so (or other well-known library) with debugging symbols, then:
1. Install the package libc6-dbg. (The package name might differ for you. Also, this assumes you have root privilege on your Linux.) This will install a special libc.so in the directory /usr/lib/debug . Please note that the CCIS Ubuntu Linux machines already have a debugging version of libc installed, currently as /usr/lib/debug/libc-2.7.so .
2. Next, do:
  env LD_LIBRARY_PATH=/usr/lib/debug dmtcp_checkpoint a.out (Presumably, after you checkpoint, the restarted a.out process will be using the pre-checkpoint libraries and hence the debugging versions. So, probably you don't need to use env LD_LIBRARY_PATH=/usr/lib/debug for the restart command. But if you're unsure, it doesn't hurt.)
3. The a.out process above should now be using a debugging version of libc.so and perhaps other libraries. You can verify this by looking at /proc/PID/maps for your process. Now, in gdb, you will see the symbol information in the call frame and a source code file and line number.
4. To read the corresponding source code, you can either download it from the main source code location: http://www.gnu.org/software/libc/libc.html#Availability (try to choose the same libc version, and note that the line numbers may be different in your Linux distro), or download the source package for your particular Linux distro.
(Continued) If gdb still shows some call frames with "?", and you have the full pathname of the library on disk, then you can often fix it as follows. (Once you understand this procedure, you may want to try the bin/gdb-add-symbol-file shell script found in DMTCP.)
1. In /proc/PID/maps look up the full pathname of the library you need to load. The address of the call frame with missing information should be in the address range of that library.
2. In gdb, read help add-symbol-file
3. In gdb, type add-symbol-file FILE ADDR where FILE is the full pathname you identified in the /proc/PID/maps file. The ADDR will be the hexadecimal sum of:
  1. beginning of text segment address (text segment normally has r-x permission) in /proc/PID/maps; and
  2. hexadecimal address for Addr heading corresponding to .text when you look it up under Headers: with either of the following commands: readelf -S FILE or objdump -h FILE
4. In the last step, the maps file provided the beginning address of the whole segment, but the binary library on disk contains many sections for a segment, and the .text section need not be the first section in the file. So, we must add the offset of the .text section, found by analyzing the binary library on disk.
5. In gdb, a convenient way to add hexadecimal numbers is:
  p/x addr1 + addr2 where addr1 and addr2 are the two addresses we discussed. If those addresses are in hexadecimal, make sure to include 0x at the beginning of each hexadecimal number.
6. Now do where in gdb, and you should see full call frame information.
The two commands readelf and objdump are useful for inspecting the contents of binary files. These are related to the other commands, nm and strings, but these commands have many more options, including the ability to disassemble into assembly code, the ability to display section headers, etc. Scan the man pages quickly to see if something might be useful for you.
In gdb, you are sometimes forced to descend to assembly language (hopefully, only after coming as close as possible to the offending code using the techniques above). To do so, do:
  (gdb) x/10i $pc
  (gdb) stepi
As always, read the gdb help for further information. x/10i $pc says to examine the next ten instructions, starting at the program counter. If you get tired of constantly typing x/10i $pc, type in gdb just once:
  (gdb) display/10i $pc
and then continue to type stepi (or simply carriage return to repeat the last stepi command).
For an assembly level listing as you do stepi in gdb, try
  objdump -S a.out > a.out.listing
where a.out should be replaced by your binary. For a more verbose form, try one of:
  gcc -c -g -Wa,-alh,-L file.c > file.s
  gcc -c -g -Wa,-ahls=file.s file.c
Variations of this can also produce assembly code that can be directly assembled by gcc or by as. For example, if you want to modify and re-compile the source code for libc.so, this is normally quite painful. A nice trick is to disassemble libc.so into assembly, and then cut or copy out the particular assembly routines that you want to assemble into a modified library.
UNIX system calls, by Open Group (enter the system call in the search box): these are the clearest, most precise man pages for system calls you will ever find.
Valgrind (Memory and leak detection utility); This is easy-to-use and surprisingly powerful.
If you want to see the stack just before a segfault, try the glibc call backtrace (man backtrace). It mangles any C++ names, but they are mostly readable (and utilities exist for demangling the names). Read the notes of man backtrace (for example, compile with gcc -rdynamic to get symbol names). Also, note man addr2line.
Look at the example file, backtrace.c for this course.
Also, for any call frames with no symbol name, look up the hex address in /proc/<PID>/maps. Use addr2line to translate hex addresses into line numbers in source code. (If it's a .so dynamic library, give it the offset: the hex address minus the beginning library address as shown by /proc/<PID>/maps.)
An interesting Linux command: addr2line
In comparing two versions of software, consider programs such as: kompare, kdiff3, meld, gvimdiff (or text-based vimdiff).
To see which virtual memory pages are currently mapped to physical memory, see /proc/PID/pagemap .

Main Projects
The main projects are listed below. We will also set up a course Wiki, where you will describe the status of your projects. The Wiki will also have a space for general issues/comments in supporting Roomy and DMTCP.
Roomy

Roomy Applications

Explicit State Space Verification (e.g. Murφ/Murphi) --- used for protocol and program verification
BDD (Binary Decision Diagrams) --- widely used for hardware and protocol verification
SAT solver --- used for program verification
Integer programming --- application of branch-and-bound; extremely widely used; will require further reading on best implementations of integer programming

Roomy Under the Hood

Dynamic data compression (saving space on disk)
Zero-mapped pages (uninitialized pages need not be stored, just as in operating systems)
In-RAM version of Roomy
Load balancing
Fault-tolerant Roomy (DMTCP: Roomy files with serial writes are handled automatically by DMTCP. For files with random updates, one needs to save the file, or else save a previous version of the file with a log of updates)

DMTCP

Checkpoint screen
Integrate checkpointing into iPython (part of SciPy (Scientific Python)). We know the developers, and they can help us.
Checkpoint the job control/suspend feature of your favorite shell (^Z)
Checkpoint gdb (requiring understanding of 'man ptrace') (This project is ongoing; talk to me first)

Example of ptrace usage
Tracing tricks with ptrace

Have MTCP use a standard ELF linker script. (This project is ongoing; talk to me first)
Fix reports of fragility of checkpointing bash/etc. (The following were reported not working at one time.)

make check-dmtcpaware1 on 32-bit Ubuntu
Checkpoint matlab (mostly works now, but some details don't)
Add virtualization of pid's so that waitpid() works. This also fixes a case for TightVNC.
With --enable-debug _only_, the simple script: #!/bin/bash<newline>ls<newline>
will non-deterministically fail about 20% of the time. We may be unstable with *all* bash scripts. The script bash-bug seems to fail a lot of the time.
If it fails due to lack of disk space or lack of disk quota, this reason ("insufficient disk space or quota") should be reported to the user.
Modify test/autotest.py so that it will accept commands from standard input to pass to a.out (process being checkpointed) before checkpoint or before restart. For example, bash may successfully restart and then fail if given "ls" command, since that exercises code that creates a child process.
Checkpointing of file locks should also be handled
There are other things to virtualize besides pid's (e.g. uid's (see "setuid" and "seteuid"), etc.) However, this does not seem urgent for current use cases.

Hijack/Attach to already running process and checkpoint (There is a question here about how to follow socket connections, if that process is already talking to other processes.)
Thread race condition detector: A traditional race condition eventually causes a crash. But since it's a race condition, it doesn't always crash at that location. Experiment with different checkpoints, until you find a checkpoint location for which the process always crashes upon restart. Then modify MTCP to only allow a subset of the threads to resume, and keep the other threads suspended. By trial and error, discover which two threads have a race condition.
Memory leak detector: If a memory leak occurs later in the program, valgrind runs too slowly to easily find it. So, use a malloc debugger or your own malloc/free interceptor. This defines regions created through malloc. Late in the program, it will be easy to find a region of memory that is a memory leak (that no one ever touches again). Using checkpoint/restart tricks, find the last time that anyone touched that memory segment. Report that line of code using a standard tool to convert between a line of assembly language and the source code line. To guarantee that no one ever uses that memory again, remove read and write permission from that region of memory and add a segfault handler to trap any accesses. Then automate the many checkpoint-restart cycles to find automatically where the memory segment was last touched.
Portable Linux Apps: DMTCP checkpoint images include any libraries that have been loaded. If the environment variable LD_BIND_NOW is set (set to anything), then the loader will preload every library that it will need. This should enable one to copy a checkpoint image from Debian Linux to OpenSuse Linux to RedHat Linux to Ubuntu Linux to (etc.). Does this work? If not, what's needed to make it work?
Incremental Checkpoint: DMTCP may want to keep multiple checkpoint images, so that it can return to any of several execution points in the past. This would normally require a lot of disk space. How does one efficiently store a diff between checkpoints? (This may be a somewhat easier project, for those who are looking for that. With other projects, one will often find that at the end of the semester, one has to report that some parts are still not working, and why. This project offers the opportunity of finishing most of the project, if there are no surprises.)