Debugging Programs Generated by Buggy Compilers

8.14

Debugging Programs Generated by Buggy Compilers🔗

Federico Cassano

A test failed and produced an unexpected output. What do you do now? Sometimes staring at the phases of your compiler won’t help you in finding the bug. And even worse, sometimes staring at the generated assembly code won’t help you either. When this happens, the best course of action is to run your program and see what happens. This is exactly what dynamic analysis is about; it helps you understand what your code does by running it. In this blog post, I will explain how to debug a compiler by running the generated code. I will use a buggy version of the Adder compiler and Diamondback compiler as an example.

1 Setting up the environment🔗

1.1 Dynamic debugging🔗

To debug the Adder Compiler, we need to set up a few things. Firstly, we need a debugger, and we will use The GNU Project Debugger (GDB). On linux, GDB is usually installed by default. However, vanilla GDB does not have the best debugging experience, as you often have to type long commands to display the state of the program. To improve it, we will use GDB Enhanced Features (GEF). This is a plugin for GDB that adds many useful features, such as a context panel that shows the values of the registers, a stack panel that shows the state of the stack, a disassembly panel that shows the disassembly of the current function, and a trace panel that shows the call history of the program. To install it, use the install script provided in the repository. WARNING: On the Khoury machines, GEF won’t work because of a bug in the Python version installed on the machines. To fix it, you can either locally update the Python version or install peda instead of GEF; it’s pretty similar. There are also many other distributions of GDB that you can use, I highly recommend researching them and finding the one that suits you best.

1.2 Static debugging🔗

Another way to debug the Adder Compiler is to use static analysis. This will allow you to debug each phase of your compiler separately. The compiler is written in OCaml, and we can use Dune to build it. Dune is a build system for OCaml that is similar to Make. It is very easy to use and it is the standard build system for OCaml projects. To install it, follow the instructions on the website. Newer versions of the Snake Compiler allow to print a trace of the compiler’s execution by using the -t flag on the main binary. However, the Adder Compiler does not have this feature, therefore we will use dune utop to trace the compilation process. dune utop is a command that starts an OCaml REPL with the project loaded. It is very useful for debugging and experimenting with the code, as it will allow you to see the output of every function call and to modify the code on the fly. In this blog, we will focus on dynamic debugging, but I highly recommend using dune utop or the pipeline trace to debug the compiler before trying to use GDB.

2 Debugging the Adder Compiler🔗

Do Now!
Precompiled binary for this example

Our motivating example for this blog post will be the following program:

(let ((x 1))
  (let ((y 2))
    (add1 y)))

The expected output of this program is 3. However, the Adder Compiler generates the following assembly code, which leads to an output of 2:

section .text
global our_code_starts_here
our_code_starts_here:
        mov [RSP + 8*-1], RAX
        mov RAX, 1
        mov [RSP + 8*-2], RAX
        mov RAX, 2
        mov RAX, [RSP + 8*-2]
        add RAX, 1
        ret

Most people will immediately notice exactly what is wrong with this code, but let’s ignore that for now. Let’s see how we can debug this program using GDB.

2.1 Running the program🔗

Let’s start by running the program. We can do this by writing our program in a file inside the ./input directory, in this case ./input/buggy.adder. We can then run make output/buggy.run to compile the program to a binary. We can then run the binary in GDB by typing gdb output/buggy.run. If you did everything correctly, you should see the following output:

$ gdb output/buggy.run
<wall of text omitted...>
Reading symbols from output/buggy.run...
gef➤

Great! We are now inside the GDB debugger. Let’s run the program by typing run or r.

gef➤  run
Starting program: /home/federico/adder/output/buggy.run
[Inferior 1 (process 1) exited normally]
2

Alright, so the program exited normally and printed 2 as noted above. Let’s see what happened. We can use the disassemble or disas command to see the assembly code of a function. But first, we want to set the assembly code to be displayed in Intel syntax, as it is the one we are used to. We can do this by typing set disassembly-flavor intel.

gef➤  set disassembly-flavor intel
gef➤  disas our_code_starts_here
Dump of assembler code for function our_code_starts_here:
   0x00005555555551c0 <+0>: mov    QWORD PTR [rsp-0x8],rax
   0x00005555555551c5 <+5>: mov    eax,0x1
   0x00005555555551ca <+10>:  mov    QWORD PTR [rsp-0x10],rax
   0x00005555555551cf <+15>:  mov    eax,0x2
   0x00005555555551d4 <+20>:  mov    rax,QWORD PTR [rsp-0x10]
   0x00005555555551d9 <+25>:  add    rax,0x1
   0x00005555555551dd <+29>:  ret

Cool, so far nothing new. We can now break at a specific line of the assembly code to stop the program at that point. We can do this by typing break *0x00005555555551c0, to stop at the first instruction of the function — or, since that address is the same as our label, break *our_code_starts_here. We can then run the program again by typing run or r. This time, the program will stop at the first instruction of the function.

gef➤  break *our_code_starts_here
Breakpoint 1 at 0x5555555551c0
gef➤  run

2.2 Viewing the context🔗

Now, you will see A LOT on your screen. This is because GEF is displaying the context of the program. The context is the state of the program at the current point of execution. You can bring up the context panel by typing context or ctx. The context panel is divided into 4 sections: the registers, the stack, the disassembly, and the trace (also threads, but we don’t care about that in our case). Each value on the screen will be colored in a different way, depending on its type; GEF will display a legend at the top of the context panel to explain what each color means. For example, if the the register was modified by the last instruction, it will be colored in red.

Let’s start to make sense of this. First, let’s look at the registers.

───────────────────────────────────────────────── registers ────
$rax   : 0x0
$rbx   : 0x007fffffffe198  →  0x007fffffffe51d  →  "omitted"
$rcx   : 0x00555555557dc8  →  0x005555555550f0  →   endbr64
$rdx   : 0x007fffffffe1a8  →  0x007fffffffe563  →  "SHELL=/usr/bin/zsh"
$rsp   : 0x007fffffffe058  →  0x00555555555186  →  <main+54> mov rsi, rax
$rbp   : 0x007fffffffe080  →  0x0000000000000001
$rsi   : 0x007fffffffe198  →  0x007fffffffe51d  →  "omitted"
$rdi   : 0x1
$rip   : 0x005555555551c0  →  <our_code_starts_here+0>
$r8    : 0x0
$r9    : 0x007ffff7fced70  →   endbr64
$r10   : 0x007fffffffddb0  →  0x0000000000800000
$r11   : 0x202
$r12   : 0x0
$r13   : 0x007fffffffe1a8  →  0x007fffffffe563  →  "SHELL=/usr/bin/zsh"
$r14   : 0x00555555557dc8  →  0x005555555550f0  →   endbr64
$r15   : 0x007ffff7ffd000  →  0x007ffff7ffe2c0  →  0x00555555554000

Each register will display its name and its value. If the value is a pointer, GEF will follow the pointer and display the value it points to, recursively. For example, the $rsp register points to 0x007fffffffe058, which points to 0x00555555555186, which points to <main+54> mov rsi, rax. This is the return address of the main function in the C runtime, where the program will return after the our_code_starts_here function returns.

Let’s now look at the stack.

───────────────────────────────────────────────── stack ────
0x007fffffffe058│+0x0000: 0x00555555555186  →  <main+54> ← $rsp
0x007fffffffe060│+0x0008: 0x0000000000000000
0x007fffffffe068│+0x0010: 0x007fffffffe198  →  0x007fffffffe51d  →  "..."
0x007fffffffe070│+0x0018: 0x0000000000000001
0x007fffffffe078│+0x0020: 0x07503e7a1bd6eb00
0x007fffffffe080│+0x0028: 0x0000000000000001   ← $rbp
0x007fffffffe088│+0x0030: 0x007ffff7ddb790  →   mov edi, eax
0x007fffffffe090│+0x0038: 0x007fffffffe188  →  0x00000000000038 ("8"?)

On this panel, we can see the a window of the stack, with addresses increasing from top to bottom. More notably GEF will display where the $rsp and $rbp registers are pointing to. Additionally, GEF will display the offsets of the stack from the $rsp register. This will be useful to visualize the stack when we will start to modify it.

Finally, let’s look at the disassembly.

───────────────────────────────────────────────── code:x86:64 ────
   0x5555555551b6 <main+102>       ret
   0x5555555551b7 <main+103>       call   0x555555555030 <__stack_chk_fail@plt>
   0x5555555551bc                  nop    DWORD PTR [rax+0x0]
 → 0x5555555551c0 <our_code_starts_here+0> mov    QWORD PTR [rsp-0x8], rax
   0x5555555551c5 <our_code_starts_here+5> mov    eax, 0x1
   0x5555555551ca <our_code_starts_here+10> mov    QWORD PTR [rsp-0x10], rax
   0x5555555551cf <our_code_starts_here+15> mov    eax, 0x2
   0x5555555551d4 <our_code_starts_here+20> mov    rax, QWORD PTR [rsp-0x10]
   0x5555555551d9 <our_code_starts_here+25> add    rax, 0x1

This panel displays the disassembly of the code. We can see that the program is currently executing the first instruction of the our_code_starts_here function, which is mov QWORD PTR [rsp-0x8], rax. This instruction is moving the value of the $rax register into the memory address pointed by $rsp-0x8. The next instruction is mov eax, 0x1, which is moving the value 0x1 into the $rax register.

2.3 Stepping through the program🔗

We can step through every instruction of the program by typing stepi or si. Every time we step, GEF will update the context panel to reflect the new state of the program.

gef➤  si

rax contains the value 0x0, so now we would expect the value of the memory address pointed by $rsp-0x8 to be 0x0. Let’s check.

gef➤  x/gx $rsp-0x8
0x7fffffffe050: 0x0000000000000000

x is the examine command, which allows us to examine the memory. gx is a specifier, which tells GEF to display the value of the memory address as a hexadecimal (the x) and to display it as a 64-bit value (the g). $rsp-0x8 is the address of the memory we want to examine.

Let’s continue stepping through the program.

gef➤  si

Now, rax contains the value 0x1, we can see this by looking at the context panel.

2.4 Modifying the program🔗

We can modify the value of the registers by typing set $register=value. For example, let’s set the value of $rax to 0x2.

gef➤  set $rax=0x2

Now, if we step through the program, to the next instruction (mov QWORD PTR [rsp-0x10], rax), we can see that the value of the memory address pointed by $rsp-0x10 is now 0x2, which is what we wanted to see given our input program.

gef➤  si
gef➤  x/gx $rsp-0x10
0x7fffffffe048: 0x0000000000000002

Let’s now break at the end to see what our_code_starts_here returns.

gef➤  disas our_code_starts_here
Dump of assembler code for function our_code_starts_here:
   0x00005555555551c0 <+0>: mov    QWORD PTR [rsp-0x8],rax
   0x00005555555551c5 <+5>: mov    eax,0x1
   0x00005555555551ca <+10>:  mov    QWORD PTR [rsp-0x10],rax
   0x00005555555551cf <+15>:  mov    eax,0x2
   0x00005555555551d4 <+20>:  mov    rax,QWORD PTR [rsp-0x10]
   0x00005555555551d9 <+25>:  add    rax,0x1
   0x00005555555551dd <+29>:  ret
End of assembler dump.
gef➤  b *our_code_starts_here+29
Breakpoint 1 at 0x5555555551dd

We can resume the program by typing continue or c. We will then hit the breakpoint we just set.

gef➤  c
Continuing.

We can see that the value of $rax is 0x3, which is the value we wanted to see.

gef➤  p $rax
$1 = 3

2.5 Diagnosis🔗

So why was our program not working? Well, we were not setting the value of $rax to 0x2 before the mov QWORD PTR [rsp-0x10], rax instruction. Instead, we were setting the value of $rax to 0x2 after the mov QWORD PTR [rsp-0x10], rax instruction. This means that the value of the memory address pointed by $rsp-0x10 was 0x1, because of the compilation of the previous binding, x.

3 Debugging Call Stack Issues🔗

Do Now!
Precompiled binary for this example

Now that we are familiar with basic debugging, let’s look at a more complex example. We will take a look at a compiler with support for basic function definitions and calls. We will use the following program as input to our compiler.

def add(x, y):
  let res = x + y in
  res

let n1 = add(1, 2) in
let n2 = add(4, 5) in
add(n1, n2)

When the program is compiled and run, it should print 12. However, when we run the program, we get the following output.

$ ./output/buggy_stack.run
Segmentation fault (core dumped)

A segmentation fault is a type of error that occurs when a program tries to access memory that it does not have access to. This is usually caused by a bug in the program, which typically involves a value that is wrongfully interpreted as a memory address.

Let’s look at the assembly produced by our compiler.

section .text
global our_code_starts_here
add:
  push RBP ; save old base pointer
  mov RBP, RSP ; set new base pointer
  sub RSP, 8 ; allocate space for local vars
  mov RAX, [RBP+16]
  add RAX, [RBP+24]
  mov [RBP-8], RAX
  mov RAX, [RBP-8]
  add RSP, 8 ; pop local vars
  mov RSP, RBP ; restore stack pointer
  pop RBP ; restore base pointer
  ret
our_code_starts_here:
  push RBP ; save old base pointer
  mov RBP, RSP ; set new base pointer
  sub RSP, 16 ; allocate space for local vars
  mov R11, 4
  push R11 ; push arg number 1
  mov R11, 2
  push R11 ; push arg number 2
  call add
  mov [RBP-8], RAX
  mov R11, 10
  push R11 ; push arg number 1
  mov R11, 8
  push R11 ; push arg number 2
  call add
  mov [RBP-16], RAX
  mov R11, [RBP-16]
  push R11 ; push arg number 1
  mov R11, [RBP-8]
  push R11 ; push arg number 2
  call add
  add RSP, 16 ; pop local vars
  pop RBP
  ret

Unless you have worked with stack frames before, the issue is not immediately obvious. Let’s debug the program in GDB to see when the segmentation fault occurs.

$ gdb output/buggy_stack.run

To debug these kinds of call stack issues, it’s useful to add breakpoints before and after every function call. This way, we can see what the call stack looks like in between calls.

gef➤  b *&our_code_starts_here+24 # break at the first call to add
Breakpoint 1 at 0x5555555556c9
gef➤  b *&our_code_starts_here+49 # break at the second call to add
Breakpoint 2 at 0x5555555556e2
gef➤  b *&our_code_starts_here+70 # break at the third call to add
Breakpoint 3 at 0x5555555556f7
gef➤  b *&our_code_starts_here+80 # break on return instruction
Breakpoint 4 at 0x555555555701

We can run our program with r and cycle through the breakpoints with n.

gef➤  r
gef➤  n
gef➤  n
gef➤  n
gef➤  n

Having reached the last breakpoint without the program crashing, we can see in the stack panel that the value of $rsp is misaligned, which is causing the segmentation fault.

───────────────────────────────────────────────── stack ────
0x007fffffffdfe8│+0x0000: 0x0000000000000a ("\n"?)   ← $rsp
0x007fffffffdff0│+0x0008: 0x0000000000000002
0x007fffffffdff8│+0x0010: 0x0000000000000004
0x007fffffffe000│+0x0018: 0x0000000000000012
0x007fffffffe008│+0x0020: 0x0000000000000006
0x007fffffffe010│+0x0028: 0x007fffffffe040  →  0x0000000000000001
0x007fffffffe018│+0x0030: 0x00555555555652  →  <main+50> mov QWORD PTR [rsp], rax
0x007fffffffe020│+0x0038: 0x0000000000000000

When the ret address is executed, the program tries to jump to the address pointed by $rsp, which is 0x0000000000000a, which is not a valid address. Instead, $rsp should have pointed to the 0x00555555555652 address, which is the address of the mov QWORD PTR [rsp], rax instruction, making the control flow of the program jump to the frame in the stack of the main function.

Let’s rewind the program to the first breakpoint and look at the stack before the first call

gef➤  r
───────────────────────────────────────────────── stack ────
0x007fffffffdff0│+0x0000: 0x0000000000000002   ← $rsp
0x007fffffffdff8│+0x0008: 0x0000000000000004
0x007fffffffe000│+0x0010: 0x0000000000000000
0x007fffffffe008│+0x0018: 0x0000000000000000
0x007fffffffe010│+0x0020: 0x007fffffffe040  →  0x0000000000000001  ← $rbp
0x007fffffffe018│+0x0028: 0x00555555555652  →  <main+50> mov QWORD PTR [rsp], rax
0x007fffffffe020│+0x0030: 0x0000000000000000

We can see that both 2 (1 in snakeval) and 4 (2 in snakeval) are pushed to the stack, nothing wrong here. Let’s continue to the next breakpoint.

gef➤  n
───────────────────────────────────────────────── stack ────
0x007fffffffdfe0│+0x0000: 0x0000000000000008   ← $rsp
0x007fffffffdfe8│+0x0008: 0x0000000000000a ("\n"?)
0x007fffffffdff0│+0x0010: 0x0000000000000002
0x007fffffffdff8│+0x0018: 0x0000000000000004
0x007fffffffe000│+0x0020: 0x0000000000000000
0x007fffffffe008│+0x0028: 0x0000000000000006
0x007fffffffe010│+0x0030: 0x007fffffffe040  →  0x0000000000000001  ← $rbp
0x007fffffffe018│+0x0038: 0x00555555555652  →  <main+50> mov QWORD PTR [rsp], rax

Now, this is interesting. We can see that 2 and 4 are still on the stack, but 8 (4 in snakeval) and 10 (5 in snakeval) are also on the stack. What should have happened is that 2 and 4 should have been popped off the stack after the first call to add, and 8 and 10 should have taken their place. Additionally, we see that the $rsp register is advanced by 16 more bytes than it should be.

Let’s go back to the assembly code of our_code_starts_here and see what’s going on.

gef➤  disas our_code_starts_here
Dump of assembler code for function our_code_starts_here:
   0x00005555555556b1 <+0>: push   rbp
   0x00005555555556b2 <+1>: mov    rbp,rsp
   0x00005555555556b5 <+4>: sub    rsp,0x10
   0x00005555555556b9 <+8>: mov    r11d,0x4
   0x00005555555556bf <+14>:  push   r11
   0x00005555555556c1 <+16>:  mov    r11d,0x2
   0x00005555555556c7 <+22>:  push   r11
   0x00005555555556c9 <+24>:  call   0x555555555690 <add>
   0x00005555555556ce <+29>:  mov    QWORD PTR [rbp-0x8],rax
   0x00005555555556d2 <+33>:  mov    r11d,0xa
   0x00005555555556d8 <+39>:  push   r11
   0x00005555555556da <+41>:  mov    r11d,0x8
   0x00005555555556e0 <+47>:  push   r11
=> 0x00005555555556e2 <+49>:  call   0x555555555690 <add>
   0x00005555555556e7 <+54>:  mov    QWORD PTR [rbp-0x10],rax
   0x00005555555556eb <+58>:  mov    r11,QWORD PTR [rbp-0x10]
   0x00005555555556ef <+62>:  push   r11
   0x00005555555556f1 <+64>:  mov    r11,QWORD PTR [rbp-0x8]
   0x00005555555556f5 <+68>:  push   r11
   0x00005555555556f7 <+70>:  call   0x555555555690 <add>
   0x00005555555556fc <+75>:  add    rsp,0x10
   0x0000555555555700 <+79>:  pop    rbp
   0x0000555555555701 <+80>:  ret
End of assembler dump.

We are pushing 2 and 4 to the stack, then calling add, then pushing 8 and 10 to the stack, then calling add again. However, we are never popping the arguments off the stack after the function returns. This is why the stack is messed up.

3.1 Diagnosis🔗

Whenever our_code_starts_here calls the add function, it pushes the arguments to the stack. However, it does not pop the arguments off the stack after the function returns. The easiest way to fix this is to add a add RSP, 16 instruction after every call add, which will add 16 to the value of $rsp, which will effectively pop the arguments off the stack.