fork - to execute other programs
map/foldl?
- map is parallel - each function application is independent of the others
- foldl is not, but it can often be split into multiple smaller child foldls and a parent foldl
- to parallelize: fork different processes (and mmap if needed)

POSIX Thread API = pthreads
- implemented in the pthread library (man 7 pthreads or man 3 pthread)
- a simple interface for creating and managing threads
Threads are created using pthread_create
Arguments:
- a pointer to a pthread_t handle for the new thread
- thread attributes (NULL to use the default attributes)
- the function the new thread should run
- the argument to pass to that function

Let's do a hello thread
How do we wait? - pthread_join (takes the thread handle and a pointer for the return value)
Other useful functions:
- pthread_self()
- pthread_cancel()
- pthread_exit()

Rewrite sum using threads
Confirm that we still have a data race
Try timing?
Next time: how to avoid data races and what new problems does that cause?
The data race on sum, step by step:

    Thread A loads sum from memory
    Thread B loads sum from memory
    Thread B increments sum
    Thread B stores sum in memory
    Thread A increments sum
    Thread A stores sum in memory (discarding the work done by Thread B)

Which memory is shared?
- memory mapped explicitly (mmap with MAP_SHARED)
- global variables in the data segment (if initialized)
- local variables declared with the static specifier (the static keyword)

We need mutual exclusion, i.e., threads are mutually excluded from running a piece of code that needs the shared variable (a critical section)
Idea: some sort of “lock”
We (try to) acquire a lock before we enter a critical piece of code accessing/modifying shared memory
When we acquire the lock, we do our modification
After we are done, we release the lock
Roughly (and incorrectly):

    while (is_locked(var_lock)) {
        sleep(1);
    }
    lock(var_lock);
    do_important_stuff(var);
    unlock(var_lock);

If we do it using just plain shared variables, we run into the same problem: a data race (why? - two threads can both see is_locked return false and then both take the lock)
We need support from the OS to ensure locking and unlocking are performed atomically
A few ways to achieve this
pthread mutex API:
- pthread_mutex_init() - initialize with attributes; returns 0 on success
- pthread_mutex_lock() - returns 0 on success
- pthread_mutex_unlock() - returns 0 on success; a blocked lock() call in another thread will then return, acquiring the mutex
- pthread_mutex_trylock()
A more general primitive: the semaphore
- a semaphore is just an integer with some operations associated with it
- a lock (mutex) is just a special case of a semaphore
Idea: if the semaphore is 0, we have to wait, if the
semaphore > 0, we’re good to go
- sem_init - initialize a semaphore
- sem_wait - waits for the semaphore to become != 0, then decrements it by 1, atomically
- sem_post - increments the semaphore by one, atomically
If we want the semaphore to be shared between processes, we need to allocate it in shared memory
Example - using a semaphore as a lock (max value 1)

    sem = 1

    proc A             proc B             sem
    sem_wait(sem);                         0
    do_work();         sem_wait(sem);      0   // sem_wait blocks in B
    do_more_work();    .                   0
    .                  .                   0
    sem_post(sem);     .                   1   // sem_wait returns in B
    .                  do_work();          0
    sem_wait(sem);     do_more_work();     0   // sem_wait blocks in A
    .                  .                   0
    .                  sem_post(sem);      1   // sem_wait returns in A
    do_work();         .                   0

See sum_semaphores.c
Problem? It’s slower than the sequential example!!
Try
$ time ./sem-sum
and observe how much time is spent in the kernel. Compare this to the other versions
The kernel is doing a lot of extra work managing our semaphore
Semaphores with higher values are used, for example, for counting resources