Weak Memory Models are a Strong Reminder for Programmers to use Synchronization Primitives

There are many types of memory models[4], we will, however, concern ourselves with only two basic types, that we will call: strongly ordered and weakly ordered. The strongly-ordered model is natural for programmers, here the order that a program writes data to memory is the order in which the data is observed being written into memory, that is, other programs sharing the data will "see" the same ordering. This consistent view is sometimes called the property of observabilty[5]; for example, if processor writes a new X then writes a new Y, all other processors that subsequently execute a read Y then a read X, will access either the new Y and new X, the old Y and the new X, or old X and the old Y: but no processor will access the new Y and the old X. This assumption of strong ordering was, at one time, reasonable. Current computer manufactures, however, recommend that programmers not rely on memory ordering. This is because newer memory management systems attempt to reorder memory access for optimization purposes. Systems that are allowed to reorder memory request are called weakly-ordered memory systems (models). To examine how a reordering might be used to improve performance, consider the following assembler code [2].

Load reg1, A                 // register1 = contents of memory A
Load reg2, B                 // register2 = contents of memory B
ADD reg3, reg1, reg2     // register3 = register1 + register2
Store reg3, C                 // contents of memory C = contents of register3

If we assume that location B is currently in cache and location A is not cached, then loading A will take longer than B. Instead of waiting for A, the CPU can fetch B from its cache, hiding B’s latency: thus the CPU can perform the addition as soon as A is available. By relaxing the strong (sequential) memory model of execution (i.e., A must load first, followed by B), greater performance is possible----but reordering may not be transparent to software. Consider the code fragment below, it is part of the code that can be used to implements a spinlock semaphore [2].

 

Load reg1, S                 // register1 = contents of memory S
ADD reg2, reg1, 1         // increment S
Store Reg2, S                 // S = contents of register2
CLR reg3                         // reg3 = 0
Store reg3, lock                 // lock = contents of register3

In order for the above code fragment to work, memory location S must be updated prior to location lock; otherwise, any while-loop waiting on lock may proceed with a possibly invalid value of S ;e.g., while(lock != 0). Thus any algorithm that is based on a strong ordering assumption may fail in a system that is allowed to reorder memory request, Dijkstra’s and Peterson’s synchronization algorithms fail [1,2,5], for example. It is important to note that, individual processors each have a consistent view of their own memory: thus, processor A will see all its accesses consistently; yet, another processor may see processors A’s memory accesses differently (cf. Paragraph 1).

 

E.G., Hardware and Memory Consistence

 

Intel’s Pentium [6] and TI’s SuperSparc [2] processors each contain a FIFO queue called a store buffer. The fundamental idea behind this is that the CPU need not waste time writing back each datum to memory, instead its writes are queued---this acts like a cache. When the CPU needs to fetch a datum it first checks its store buffer, then memory (or external cache). The problem with this scheme is that new values of data are not immediately visible to the rest of the system, and this model is called processor consistency [4]. In a multi-processor system this creates a problem, that is, two or more processors can each have different values for the same variable. (Again algorithms that rely on strong-ordering fail.) Thus, the results of simultaneous loads and stores to identical locations in multi processing systems are non-deterministic. To synchronize such systems, processors typically provide instructions that force the store buffer to flush, i.e., dump its contents to memory.

 

Another example of where reorder can occur is in a system that employs a multi-level switching network between processors and memory [5]. When say, two writes are sent to the switching network, they travel to their respective memory locations, but because the paths may differ, the first write may arrive at its location last. While the processor that issued the writes will not observe any miss-ordering, other processors attempting to access these same locations may.

 

E.G., Compilers and Consistency [1]

 

In multi-threaded programs (e.g., shared memory model) two or more threads of control can access the same variable. This, of course can create race conditions, but concern here is the problem that a code optimizer can cause. Albeit contrived, examine the program fragment below. Assume that a thread through foo() loops while A==1, while a thread through boo() waits on a signal to alter A. Consider that in a single-threaded programming model A’s scope is global, and if foo() is called, there is no possibility of boo(), or any other piece of code altering A: now let’s take advantage of this.

foo() 
{
A=1;
while (A==1) { 
//do stuff ...
}
}
boo()
{
while ( wait_for_signal) ; //spin wheels
A=0;

 

In the optimization phase of compilation, an optimizer may remove the test A==1 from the while-loop of foo(), since it is neither altered nor reference again [7], and replace it with a goto (i.e., jmp instruction), removing the need for a test-and-branch. Clearly, the problem is that the optimizer has not accounted for A being altered by either another thread, interrupt, or process. In the C programming language, to prevent the compiler from removing or relocating a variable the type-qualifier "volatile" can be inserted (e.g., volatile int A) [3].

References
[1] Kleiman, Shah, and Smaalders, "Programming with Threads," Prentice-Hall.
[2] Schimmel, "Unix Systems for Modern Architecture," Addison-Wesley.
[3] Kernighan, Ritchie, "The C Programming Language," Prentice-Hall.
[4] Gharachorloo, "Memory Consistency Models for Shared-Memory Multiprocessors," DEC report, Dec 9, 1995.
[5] Stone, "High-Performance Computer Architecture," Addison-Wesley.
[6] Intel Corp., "Pentium Processor User’s Manual," Vol.1.
[7] Aho, Sethi, and Ullman, "Compiler Principles, Techniques, and Tools," Addison-Wesley.