A Volatile Situation
As a hardware engineer, most of the code that I write is either in assembly or low-level C (apart from the shell scripts to run simulation jobs or Perl/Python to parse log files). I am therefore very familiar with the volatile type qualifier in the C language. Referring to my well-worn copy of the Kernighan and Ritchie “The C Programming Language” book, the section on volatile states:
> The purpose of `volatile` is to force an implementation to suppress optimization that could otherwise occur. For example, for a machine with memory-mapped input/output, a pointer to a device register might be declared as a pointer to `volatile`, in order to prevent the compiler from removing apparently redundant references through the pointer.
There is also one interesting tidbit about it in the previous paragraph:
> There are no implementation-independent semantics for `volatile` objects.
Why am I talking about volatile? I ran into a situation where not using volatile in multi-threaded code exposed an interesting compiler behavior.
### The situation
Recently, in my spare time, I wrote some code to demonstrate memory consistency models (this will likely be a future blog post). The code has a producer thread and a consumer thread, and the goal is to look at how ordering between different load and store operations works. I could have written this in assembly, but I wanted to try it across different architectures, so writing it in C made more sense. The entire code can be found here.
The table below shows the functions executed by the producer and consumer threads side by side:
*(Side-by-side listing of the producer and consumer thread functions; see the linked source.)*
As you can see, in every iteration i, the producer thread writes a value into dat_buf[i] and then sets flag to i. Meanwhile, the consumer thread polls flag, waiting for it to equal i, i.e. waiting for the producer to set the flag. It then checks whether dat_buf[i] matches the value we expect the producer to have written. If the processor guarantees strict write ordering, num_errors will always be 0. Note that the barrier at the start of each iteration ensures that the producer and consumer threads run each iteration in lockstep.
### It’s volatile!
When I first wrote this code, I did not declare the flag pointer as volatile. I compiled it without any compiler optimization flags on my laptop with an AMD Zen2 CPU:
```shell
gcc -Wall -march=native store_order.c -o store_order
```
I ran the executable and it worked without any issues. I then recompiled it with -O2 optimizations added, and running the new optimized binary resulted in a hang within the first couple of iterations. I couldn’t immediately understand why optimizations caused this difference. Only when I compiled the code to assembly and diffed the read_data function between the two did I realize the problem. The assembly snippet below corresponds to the C line:
```c
while (*flag != i);
```
*(Side-by-side assembly of this loop without optimization and with -O2 optimization.)*
You’ll notice that the loading of flag into eax happens within the loop in the unoptimized case, but in the optimized case that load is hoisted outside the loop. So, if the load of flag doesn’t return the expected value when it is read, then the .L7 compare loop on the right will never exit (infinite loop). This is when I realized that the flag pointer also needs to be declared as volatile! The volatile qualifier applies when the pointer references anything that can be changed outside the current thread of execution: that could be done by hardware (in the case of hardware registers), but also by other threads in a shared-memory system.
I then added the volatile keyword to the flag pointer declaration:
```c
volatile uint32_t *flag;
```
I recompiled the code with -O2 and then the program worked correctly – it did not hang. A quick look at the new assembly reveals what changed:
```asm
mov rdx, QWORD PTR flag[rip]
.L7:
mov eax, DWORD PTR [rdx]   ;; read of flag
cmp eax, ebx
jne .L7
```
The read of flag now happens within the .L7 compare loop! This simple example shows the kind of optimization that volatile suppresses and that would otherwise occur without it.
So, if your multi-threaded program’s execution becomes “volatile” with optimizations, look to see if you need to add the volatile qualifier to pointers in your code.
### Epilogue
I recently read a blog post titled “C and C++ Prioritize Performance over Correctness” – associated HN discussion here. This got me thinking about the code that gcc produced with -O2 but without the volatile qualifier:
```asm
mov eax, DWORD PTR [rax]   ;; loads flag into eax
.L7:
cmp eax, ebx               ;; ebx holds the value of 'i'
jne .L7
```
What is the point of this code? Neither eax nor ebx can ever change inside this .L7 loop. Without volatile, gcc assumes that flag is not going to change, and therefore reads it only once, outside the loop. But why didn’t gcc then turn the while loop into an if statement and check flag once before moving on? Why insert an infinite loop?
Thinking this might be a problem with gcc, I compiled the code with LLVM/clang using the same command-line options. Without the volatile keyword, clang produced the following assembly:
```asm
mov rax, qword ptr [rip + flag]
mov eax, dword ptr [rax]   ;; loads flag into eax
cmp r12, rax               ;; r12 holds the value of 'i'
jne .LBB2_2
.LBB2_2:
jmp .LBB2_2
```
This is even more absurd: the cmp and the following jne look like the if statement I was suggesting earlier. But then the jmp at .LBB2_2 is an explicit infinite loop! Why, clang? Why?
This brings back memories of my childhood, when I would write infinite loops in BASIC and leave them running to annoy others:
```basic
10 PRINT " "
20 GOTO 10
```
*Note: all assembly in this post is shown in Intel syntax.*