Correctness of Concurrent Programs

The correctness problem for concurrent programs is a level up from that of sequential programs.

Correctness Criteria

If a program is supposed to terminate, there are two kinds of correctness criteria:

Partial Correctness
If the preconditions hold and the program terminates, then the postconditions will hold.
Total Correctness
If the preconditions hold, then the program will terminate and the postconditions will hold.

For concurrent programs that are supposed to terminate, we prefix these definitions with "For all possible interleaved execution sequences."

Exercise: What does this imply about developing by running the program, looking at the output, fixing something, running it again, etc.?

For programs that are not supposed to terminate, we have to write correctness criteria in terms of properties that must always hold (safety properties) and those that must eventually hold (liveness properties). Both are important: a program that does nothing is safe: it never screws up.

Safety Properties

Here's an example of how some interleavings can produce correct results and others produce incorrect results. First consider two threads:

Thread 1 (transfer $50 from account A to account B):
  A: load value of acctA into reg R0
  B: subtract 50 from R0
  C: store value in R0 into acctA
  D: load value of acctB into reg R0
  E: add 50 to R0
  F: store R0 into acctB

Thread 2 (deposit $100 to account B):
  x: load value of acctB into reg R2
  y: add 100 to R2
  z: store R2 into acctB

Suppose account A initially has $500 and B has $700. We expect, after the transfer and the deposit, for A to have $450 and B to have $850. Now consider the interleaving [A B x C D E y F z]:

Instruction   Account A   Account B   R0    R2
(initial)     500         700         ?     ?
A             500         700         500   ?
B             500         700         450   ?
x             500         700         450   700
C             450         700         450   700
D             450         700         700   700
E             450         700         750   700
y             450         700         750   800
F             450         750         750   800
z             450         800         750   800

Yikes. $50 completely disappeared. Why? There was contention for a shared resource. Thread-2 wiped out Thread-1's transfer.
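The lost update can be replayed deterministically by modeling each machine instruction as its own Python statement. This is a sketch, not real machine code: the "registers" are just local variables, and the variable names are illustrative.

```python
# Deterministic replay of the interleaving [A B x C D E y F z].
acctA, acctB = 500, 700

R0 = acctA        # A: load value of acctA into R0
R0 -= 50          # B: subtract 50 from R0
R2 = acctB        # x: load value of acctB into R2
acctA = R0        # C: store R0 into acctA
R0 = acctB        # D: load value of acctB into R0
R0 += 50          # E: add 50 to R0
R2 += 100         # y: add 100 to R2
acctB = R0        # F: store R0 into acctB
acctB = R2        # z: store R2 into acctB -- overwrites the transfer

print(acctA, acctB)  # 450 800, not the expected 450 850
```

Step z stores a value of acctB that was read back at step x, before the transfer's credit landed, which is exactly how the $50 vanishes.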

Race Conditions

A race condition (a.k.a. race hazard, or just race) is a property of multithreaded code in which the result depends on the relative timing of events in the threads. If all of the possible results are okay, the race condition is benign.

Races are almost always caused by threads stepping on each other, and are fixed by making sure threads have mutually exclusive access to shared data. We need to be able to ensure a (safety) property that "multiple threads will never update the same account at the same time."

Critical Regions

In terms of the interleaved instruction sequence model, we avoid races by chunking subsequences into a single atomic instruction.

Example: In the above example with accounts, updating an account needs to be an atomic operation. Thread 1 should really have two instructions: ABC and DEF. Thread 2 should have only one: xyz. Then the possible interleavings are:
  1. [ABC DEF xyz]
  2. [ABC xyz DEF]
  3. [xyz ABC DEF]
Example: Consider two threads where thread 1 updates the position of a ship
    loop
        x += dx;
        y += dy;
        ...
    end loop
and another thread draws the ship
    loop
        clear();
        drawBackground();
        drawShipAt(x, y);
        swapBuffers();
        ...
    end loop
If Thread-2's drawShipAt gets executed between Thread-1's adjusting x and adjusting y, the ship gets drawn at a bad position. We have to do both updates in a critical region.
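The fix is to make the two coordinate updates one critical region, and to read both coordinates inside the same region. A minimal Python sketch of that idea (the function names are made up for illustration; the original pseudocode names no such functions):

```python
import threading

# Position updates and reads share one lock, so a reader can never see
# x already updated while y is still stale.
lock = threading.Lock()
x, y = 0, 0

def move(dx, dy):
    global x, y
    with lock:            # both coordinates change in one critical region
        x += dx
        y += dy

def ship_position():
    with lock:            # read both coordinates atomically
        return (x, y)

for _ in range(10):
    move(1, 1)
print(ship_position())  # (10, 10); readers only ever see consistent pairs
```

A drawing thread would call ship_position() and pass the returned pair to drawShipAt, rather than reading x and y separately.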

We often see this kind of code:

    .
    .
    .
    doNonCriticalStuff
    ENTRY_PROTOCOL
    doCriticalRegion
    EXIT_PROTOCOL
    doNonCriticalStuff
    .
    .
    .

where ENTRY_PROTOCOL is code that blocks the thread until it has exclusive access to the critical region, and EXIT_PROTOCOL is code that gives up that access so a waiting thread can enter.
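With a mutex, the entry protocol is acquiring the lock and the exit protocol is releasing it. A sketch of the skeleton above, with a shared log standing in for the real work (the worker and log names are invented for the example):

```python
import threading

# lock.acquire() plays ENTRY_PROTOCOL; lock.release() plays EXIT_PROTOCOL.
lock = threading.Lock()
log = []

def worker(name):
    log.append((name, "non-critical"))   # doNonCriticalStuff
    lock.acquire()                       # ENTRY_PROTOCOL
    try:
        log.append((name, "critical"))   # doCriticalRegion
    finally:
        lock.release()                   # EXIT_PROTOCOL
    log.append((name, "non-critical"))   # doNonCriticalStuff

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()

print(len(log))  # 9: three threads, three phases each
```

The try/finally is deliberate: the exit protocol must run even if the critical region fails, a point revisited under Reliability below.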

Issues

Critical regions can prevent race conditions, but introduce a bunch of potential problems of their own:

Deadlock

Two threads are deadlocked if each is blocked waiting for an event that only the other blocked thread can generate.

Exercise: Extend this definition to n threads.
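The classic two-thread deadlock arises when thread 1 holds lockA and waits for lockB while thread 2 holds lockB and waits for lockA. A standard remedy, sketched below with illustrative names, is a global lock ordering: every thread acquires the locks in the same order, so no cycle of waiting can form.

```python
import threading

lockA, lockB = threading.Lock(), threading.Lock()
results = []

def worker(name):
    with lockA:          # always acquired first, in every thread
        with lockB:      # always acquired second -- no cycle, no deadlock
            results.append(name)

t1 = threading.Thread(target=worker, args=(1,))
t2 = threading.Thread(target=worker, args=(2,))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(results))  # [1, 2]: both threads completed
```

If worker for thread 2 instead took lockB before lockA, a schedule exists in which each thread holds one lock and blocks forever on the other.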

Livelock

Livelock occurs when no progress is made because each thread tries to enter its critical section, then times out, then tries again, then times out, and so on, forever.

Starvation

A thread is starved when it is prevented from entering its critical section because the other threads conspire (or appear to be conspiring) to lock it out.

Failure in the Absence of Contention

A (poorly designed) system is said to fail in the absence of contention when a thread can't enter its critical section unless another thread is waiting to enter its critical section.

Reliability

What if a thread is interrupted, is suspended, or crashes inside its critical section? In the middle of the critical section, the system may be in an inconsistent state (remember the example of updating a ship's coordinates?). Not only that, the thread is holding a lock and if it dies no other thread waiting on that lock can proceed! Critical sections have to be treated as transactions and must always be allowed to finish.

Developers must ensure critical regions are very short and always terminate.
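At minimum, the lock itself must be released even when the critical section dies. In Python the with statement guarantees this, as the sketch below shows; note this only rescues the lock, not the possibly inconsistent shared state, which still needs transaction-style handling.

```python
import threading

lock = threading.Lock()

def crashy():
    with lock:     # entry protocol
        raise RuntimeError("thread dies inside its critical section")
    # exit protocol still runs: `with` releases the lock during unwinding

try:
    crashy()
except RuntimeError:
    pass

print(lock.locked())  # False: the lock was released despite the crash
```

Without the with (or an equivalent try/finally), the lock would remain held forever and every other thread waiting on it would block permanently.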

Liveness Properties

Some properties we may like to prove to show a system is progressing (not starving):

Weak Fairness
If a thread continually makes a request it will eventually be granted.
Strong Fairness
If a thread makes a request infinitely often it will eventually be granted.
Linear Waiting
If a thread makes a request, it will be granted before any other thread is granted a request more than once.
FIFO
If a thread makes a request it will be granted before any other thread making a later request.
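A ticket lock is one classic mechanism that delivers FIFO granting: each requester draws a sequence number, and the lock serves numbers in order. Below is a sketch (the class and method names are invented for illustration) built on a condition variable; drawing the ticket is separated from waiting so the demo can fix the request order deterministically.

```python
import threading

class TicketLock:
    """Grants the lock strictly in ticket (request) order."""
    def __init__(self):
        self._next_ticket = 0
        self._now_serving = 0
        self._cond = threading.Condition()

    def draw_ticket(self):
        with self._cond:
            t = self._next_ticket
            self._next_ticket += 1
            return t

    def wait_turn(self, ticket):
        with self._cond:
            while self._now_serving != ticket:
                self._cond.wait()

    def release(self):
        with self._cond:
            self._now_serving += 1
            self._cond.notify_all()

lock = TicketLock()
order = []
lock.wait_turn(lock.draw_ticket())   # main draws ticket 0 and holds the lock

tickets = {name: lock.draw_ticket() for name in "abc"}  # requests in a, b, c order

def worker(name):
    lock.wait_turn(tickets[name])
    order.append(name)
    lock.release()

# Start the threads in a scrambled order; service order is still by ticket.
threads = [threading.Thread(target=worker, args=(n,)) for n in "bca"]
for t in threads: t.start()
lock.release()                       # main leaves; now_serving advances to 1
for t in threads: t.join()
print(order)  # ['a', 'b', 'c']: granted in request order, not start order
```

FIFO is the strongest guarantee in the list above, but it also forces the most serialization, which connects to the throughput question in the exercise below.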

Illustration of a weakly fair but not strongly fair system:

fairness.png

Exercise: True or false: The "fairer" the system, the more throughput may degrade. Explain.