# Sequential Consistency in C11

October 24, 2022

#### Duc Than Nguyen

University of Illinois at Chicago

Introduction

Background

Common OT are invalid

Overhauling SC atomics in C11

Repairing SC in C11

Introduction

Background

Common OT are invalid

Overhauling SC atomics in C11

Repairing SC in C11

## 

#### Introduction

- \* Programmer's goal-fully understand what they've written
- \* Compiler's responsibility-make code as optimized as possible
- Simple compiler optimizations can generate unexpected behaviors in concurrent setting
- ✓ Memory model—core of concurrent semantics of shared memory → alleviate tension between goals of programmer & compiler
- C11 memory model defines semantics in C progrogramming
- Batty et al. [2011] <sup>1</sup> formalized concurrency model in C++ standard
- 🥮 Several issues discovered with semantics of SC C11 accesses
- Model has evolved with fixes and revisions (e.g., Vafeiadis et al. [2015], Batty et al. [2016], Lahav et al. [2017])

<sup>&</sup>lt;sup>1</sup>Mathematizing C++ Concurrency. In POPL 2011.

 Introduction
 Background
 Common OT are invalid
 Overhauling SC atomics in Cii
 Repairing SC in Cii
 Selected critique

 000
 000000
 000000
 0000
 0000000000
 00

Papers

| Authors          | Paper                                                                                               | Conference |
|------------------|-----------------------------------------------------------------------------------------------------|------------|
| Vafeiadis et al. | Common compiler optimisations<br>are invalid in the C11 memory model<br>and what we can do about it | POPL 2015  |
| Batty et al.     | Overhauling SC atomics<br>in C11 and OpenCL                                                         | POPL 2016  |
| Lahav et al.     | Repairing sequential consistency<br>in C/C++11                                                      | PLDI 2017  |

Introduction

Background

Common OT are invalid

Overhauling SC atomics in C11

Repairing SC in C11

00000

... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. (Lamport [1979]<sup>2</sup>)

x := 0; y := 0;

$$x := 1$$
  
 $a := {}^{*}y; //load 0 \qquad y := 1$   
 $b := {}^{*}x; //load 0$ 

- \* Thread-local variables a = b = 0?
- **\*** 3 possible outcomes: (1) a = b = 0, (2) a = 0, b = 1, (3) a = 1, b = 0
- \* a = b = 0 cannot happen (no interleaving yields that result)
- \* Modern systems allow its behavior

<sup>2</sup>How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Trans. Computers 28, 9 (1979)

#### C11 memory model

#### $\mathtt{na}\sqsubset\mathtt{rlx}\sqsubset\{\mathtt{acq},\mathtt{rel}\}\sqsubset\mathtt{sc}$

The full C/C++11 is more general

- sc- Sequentially consistent accesses
- rel Release, and acq Acquire accesses
- rlx- Relaxed accesses
  - na- Non-atomic accesses (a.k.a. normal data accesses)

Declarative/Axiomatic memory model & Semantics

$$g: W^{na}(x, 0) \xrightarrow{h: W^{na}(y, 0)} mo$$

$$k: W^{sc}(x, 1) \xrightarrow{sw} l: R^{acq}(x, 1) \xrightarrow{rf} n: R^{acq}(y, 1) \xrightarrow{rf} y: W^{sc}(y, 1)$$

$$hb \xrightarrow{w} n: R^{sc}(y, 0) \xrightarrow{rb} o: R^{sc}(x, 0)$$

$$\begin{array}{c} x :=_{na} O; \quad y :=_{na} O; \\ x :=_{sc} I; \quad \left\| \begin{array}{c} a :=^{*acq} x \ //1 \\ c :=^{*sc} y; \ //0 \end{array} \right\| \begin{array}{c} b :=^{*acq} y; \ //1 \\ d :=^{*acq} x; \ //0 \end{array} \right\| \begin{array}{c} y :=_{sc} I; \\ \end{array}$$

events - nodes & relations - edges

000000

W(x, 0) event -x set to 0, and R(x, 0) event -x returns 0

- sb- Sequenced-before : order of intra-thread memory accesses
- rf- Reads-from: connect two write and read accesses
- mo- Modification order: total order on all writes to same location

Declarative/Axiomatic memory model & Semantics

000000

$$g: W^{na}(x, 0) \xrightarrow{h: W^{na}(y, 0)} mo$$

$$k: W^{sc}(x, 1) \xrightarrow{sw} l: R^{acq}(x, 1) \xrightarrow{rf} n: R^{acq}(y, 1) \xrightarrow{rf} y: W^{sc}(y, 1)$$

$$hb \xrightarrow{g: W^{na}(x, 0)} \xrightarrow{rf} y: W^{sc}(y, 0)$$

$$x :=_{sc} 1; \ \left\| \begin{array}{c} a :=^{*acq} x \ //1 \\ c :=^{*sc} y; \ //0 \end{array} \right\| \begin{array}{c} b :=^{*acq} y; \ //1 \\ d :=^{*acq} x; \ //0 \end{array} \right\| y :=_{sc} 1;$$

X := ... O. V := ... O.

#### Additional relational types

- $\mathsf{sw}-\mathsf{Synchronized}\mathsf{-with}\mathsf{:}$  an acq (sc) read reads from a rel (sc) write
- hb– Happens-before:  $hb = (sb \cup sw)^+$  (transitive closure)
- $\label{eq:rb-Reads-before: rb = rf^{-1}; mo \ [E] read access reads from write that is mo- before, exclude update event from itself$

#### Semantics

Rules of total order of sequentially consistent operations

- 1. Total order on SC operations: any two SC operations must be ordered w.r.t. each other
- 2. Be consistent with hb and mo restricted to SC atomics
- 3. SC reads (SCReads axiom) must either
  - $\circ~$  read from most recent SC write before them in SC order, or
  - read from non-SC write that does not happen-before most recent SC write to that location

Introduction

Background

#### Common OT are invalid

Overhauling SC atomics in C11

Repairing SC in C11

#### Optimization transformations (OT)

- \* Compiler converts program to architecture, preserving its semantics
- \* Compiler must be aware of memory consistency models of programming language & architecture
- \* C11 compilers (GCC, LLVM) rarely perform C code compilation in a single step
- Front-end translate C11 to intermediate representations (IR), compiler executes several OT on IR then generates target code
- \* Source & target code maintains same consistency model
- \* OT: reordering independent memory & removing redundant memory accesses

\* 
$$x :=_{rel} 1; y :=_{na} 1; \rightsquigarrow y :=_{na} 1; x :=_{rel} 1;$$

#### Monotonicity

#### "Adding synchronisation should not introduce new behaviors"

- \* Adding a memory fence
- \* Strengthening the access mode of an operation
- \* Reducing parallelism  $C_1 || C_2 \rightsquigarrow C_1; C_2$
- \* Roach motel reorderings

Strengthening access mode of an operation

000000

(Recall) SCReads axiom — SC reads must either (1) read from most recent SC write before them in SC order, or (2) read from non-SC write that does **not** happen-before most recent SC write to that location

|                        |                        |                                          | $  s_1 := {}^{* r l x} x; //1$            |
|------------------------|------------------------|------------------------------------------|-------------------------------------------|
|                        |                        |                                          | $s_2 := {}^{*rlx} x; //2$                 |
| $x :=_{\text{rlx}} 1;$ | $x :=_{\text{rlx}} 3;$ | $y :=_{sc} 3;$                           | $s_3 := {}^{*rlx} x; //3$                 |
| $x :=_{sc} 2;$         | $y :=_{sc} 2;$         | $y :=_{sc} 3;$<br>$r := {}^{*sc} x; //1$ | $t_1 := {}^{* \operatorname{rlx}} y; //1$ |
| $y :=_{sc} 1;$         |                        |                                          | $t_2 := {}^{* r l x} y; //2$              |
|                        |                        |                                          | $t_3 := {}^{*rlx}y; //3$                  |

? Allowed  $-r = s_1 = t_1 = 1 \land s_2 = t_2 = 2 \land s_3 = t_3 = 3$ 

If it holds,  $x :=_{sc} 2$  – immediate SC prior write w.r.t.  $r :=^{sc} x$ 

Reading 1 of r := \*sc x from  $x :=_{rlx} 1$  is **not** valid ( $x :=_{rlx} 1$  happens before  $x :=_{sc} 2$  via sb) 

# Strengthening access mode of an operation

(Recall) SCReads axiom — SC reads must either (1) read from most recent SC write before them in SC order, or (2) read from non-SC write that does **not** happen-before most recent SC write to that location

- ? Strengthening  $x :=_{rlx} 3$  into  $x :=_{sc} 3$
- Establish SC order from  $y :=_{sc} 1$  to  $x :=_{sc} 3$
- $x :=_{sc} 3$  is immediately prior in SC order of  $r :=^{sc} x$
- Reading 1 of  $r := x^{sc} x$  is valid (since  $x :=_{rlx} 1$  does not hb  $x :=_{sc} 3$ )
- ✗ New behavior introduced

Correcting the SCReads Axiom

00000

$$\forall a, b. rf(b) = a \land isSC(b) \Rightarrow imm(scr, a, b) \lor \neg isSC(a)$$

$$\land \nexists x. hb(a, x) \land imm(scr, x, b)$$
(1)

**1** isSC(a) = mode(a) = sc  
**2** imm(R, a, b) = R(a, b) 
$$\land \nexists c. R(a, c) \land R(c, b)$$
  
**3** scr(a, b) = sc(a, b)  $\land$  iswrite<sub>loc(b)</sub>(a)  
**4** iswrite<sub>l</sub>(a) =  $\exists v. (\exists X, v'. lab(a) \in \{W^X(l, v), U^X(l, v', v)\})$ 

- ? Problem: disallows a non-SC write that hb another SC-write that is immediately prior in SC order to same location
- ✓ Fix: change imm(scr, x, b) into scr(x, b) constrains all SC prior same-location writes instead of only immediate prior write on same location
- ✓ No hb between write rf(b) and any same location write sc-prior to read → r := \*sc x reading from  $x :=_{rlx} 1$  not valid

Introduction

Background

Common OT are invalid

Overhauling SC atomics in C11

Repairing SC in C11

# Derived sets and relations

They are formalized as C11 axioms by referencing the C11 standard

| 0 | $irr(S; r_1)$                                                                                                              | where $r_1 = hb$                                                              |  |
|---|----------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|--|
| 2 | $irr(S; r_2)$                                                                                                              | where $r_2 = ([F^{sc}]; sb)^?; mo; (sb; [F^{sc}])^?$                          |  |
| 8 | $irr(S; r_3)$                                                                                                              | where $r_3 = rf^{-1}; [E^{sc}]; mo$                                           |  |
| 4 | irr(( $S \setminus (mo; S)$ ); $r_4$ ) where $r_4 = rf^{-1}$ ; hbl; [ $E^W$ ],<br>and hbl $-hb$ to events on same location |                                                                               |  |
| 6 | $irr(S; r_5)$                                                                                                              | where $r_5 = ([F^{sc}]; sb); rb$                                              |  |
| 6 | $irr(S; r_6)$                                                                                                              | where $r_6 = \mathbf{rb}$ ; ( $\mathbf{sb}$ ; [ $\mathbf{F}^{\mathbf{sc}}$ ]) |  |
| 7 | $irr(S; r_7)$                                                                                                              | where $r_7 = ([F^{sc}]; sb); rb; (sb; [F^{sc}])$                              |  |

- S- Sequential consistency order
- Axioms 3 4 refer to as SCReads axiom
- Axioms **6 6 7** govern SC fences
  - \* All axioms express as irr(S; r) except ④

# Construction partial order on SC operations

 $\bigoplus$  irr((S \ (mo; S));  $r_4$ ), where  $r_4 = rf^{-1}$ ; hbl; [E<sup>W</sup>], and hbl is hb to events on same location

 $\mathscr{O}$  Replace S \ (mo; S) with S, obtain new **4** (rr(S;  $r_4$ )

- ●✓ coincides with revised model offered by Vafeiadis et al. [2015] in "Correcting the SCReads Axiom"
- ●✓ disallows an SC read to see any write that happens before any SC write in S

 $\operatorname{acyclic}([\mathsf{E}^{sc}]; (r_1 \cup r_2 \cup r_3 \cup r_4 \cup r_5 \cup r_6 \cup r_7); [\mathsf{E}^{sc}]) \qquad (\mathsf{S}_{\mathsf{partial}})$ 

■ Proved  $S_{partial} = \exists S. WfS \land 0 \land 0$ 

- ❀ S<sub>partial</sub> does not require S anymore
- ✤ This axiom is faster to simulate!
- ✔ Existing compilation schemes (x86 and Power) remain valid

#### Stronger & simpler SC axiom

- ? Question: Strengthen SC semantics without requiring changes to compilation schemes of C11 target architectures (x86 and Power)?
  - ● and contain rb at begin or finish at a fence
- G✓ contains rb since  $r_4 = rf^{-1}$ ; hbl; [E<sup>W</sup>] where hbl is hb to events on same location →  $r_4 = rf^{-1}$ ; mo and rb =  $rf^{-1}$ ; mo
  - contains  $r_3 = rf^{-1}; [E^{sc}]; mo$
  - ℅ Remove  $[E^{sc}]$ , we have  $r_3 = rb$  and new **③**  $\checkmark$  irr(S; rb)

Introduction

Background

Common OT are invalid

Overhauling SC atomics in C11

Repairing SC in C11

- $\bullet \quad [E^{sc}]; hb; [E^{sc}] \subseteq S$
- $2 \quad [E^{sc}]; mo; [E^{sc}] \subseteq S$
- ④ ⑤ ⑥ ⑦ indicate that S is required to comply with a few more conditions about SC fences.

Same location: Power & ARM ensure compilation preserves condition
& Some location
forces between accesses of *different locations*, requiring insertion of fence instructions

Introduction Background Common OT are invalid Overhauling SC atomics in C11 **Repairing SC in C11** Selected critique

#### Compilation to Power is Broken



$$\begin{array}{c|c} x :=_{sc} I; \\ & a :=^{*acq} x \ //1 \\ & c :=^{*sc} y; \ //0 \\ & d :=^{*acq} x; \ //0 \\ \end{array} \right| \begin{array}{c} y :=_{sc} I; \\ & d :=^{*acq} x; \\ & //0 \\ \end{array}$$

- \* Independent Reads Independent Writes (IRIW)
- Constraints by [Batty et al. 2016] too strong to preserve compiler correctness from C11 to Power and ARMv7

INTRODUCTION BACKGROUND COMMON OT ARE INVALID OVERHAULING SC ATOMICS IN CII REPAIRING SC IN CII SELECTED CRITIQUE

#### Compilation to Power is Broken



- **※** S(*p*, *k*) via sc; S(*k*, *m*) via hb (where hb = (sw ∪ sb)<sup>+</sup>); then sc(*p*, *m*) via transitivity.
- **\*** S(m, p) via rb (where rb = rf<sup>-1</sup>; mo)  $\rightarrow$  S contains cycles
- 🗱 banned by C11
- ✓ permitted by compilation into Power

# Fixing the model

- $\bullet \ [E^{sc}]; hb; [E^{sc}] \subseteq S- \text{ same or different location}$
- \* same location: hardware maintain the order
- ☆ different location: a sync fence between SC accesses
   → start and ends hb with sb, i.e., sb; hb; sb
- $\mathscr{F}$  Fix: replace hb with (sb  $\cup$  sb; hb; sb  $\cup$  hb|<sub>=loc</sub>)

New model 🛛 🗸

 $\operatorname{acyclic}([E^{sc}]; (sb \cup sb; hb; sb \cup hb|_{=loc} \cup mo \cup rb); [E^{sc}]) \quad (\textcircled{1})$ 

● ✓ forbids elimination of SC write immediately followed by SC write to same location, and SC read immediately followed by SC read from same location

#### Enabling Elimination of SC Accesses



- ●✓ forbids elimination of SC write immediately followed by SC write to same location, and SC read immediately followed by SC read from same location
  - \* Eliminate  $x :=_{sc} 1$ ; (event  $m : W^{sc}(x, 1)$ )
  - \* Create cycle  $n \to k \to l \to o \to p \to n$
  - ✗ Violate condition ❶✓ must be acyclic

# Fixing the model

 $\label{eq:acyclic} acyclic([E^{sc}]; (sb \cup sb; hb; sb \cup hb|_{=loc} \cup mo \cup rb); [E^{sc}]) \quad ( \blacksquare \checkmark )$ 

 Fix: weaken condition by replacing sb; hb; sb with sb|<sub>≠loc</sub>; hb; sb|<sub>≠loc</sub> where sb|<sub>≠loc</sub> – sb edges to different location

New condition (named SC-before) requires acyclicity of [E<sup>sc</sup>]; scb; [E<sup>sc</sup>]

 $\mathsf{scb} = \mathsf{sb} \cup \mathsf{sb}|_{\neq \mathsf{loc}}; \mathsf{hb}; \mathsf{sb}|_{\neq \mathsf{loc}} \cup \mathsf{hb}|_{=\mathsf{loc}} \cup \mathsf{mo} \cup \mathsf{rb}$ 

# SC Fences are Too Weak

? Question: Shall adding SC fences between every pair of shared memory restore interleaving behavior?

- 🗱 Original C11 & Batty et al. [2016] do **not** hold
- ✔ hold for Power & ARM
- Replace hb ∪ mo ∪ rb with scb where scb = sb ∪ sb|<sub>≠loc</sub>; hb; sb|<sub>≠loc</sub> ∪ hb|<sub>=loc</sub> ∪ mo ∪ rb
- ℜ Require acyclicity of psc<sub>1</sub>

 $\mathsf{psc}_1 = ([\mathsf{E}^{\mathsf{sc}}] \cup [\mathsf{F}^{\mathsf{sc}}]; \mathsf{sb}^?); \mathsf{scb}; ([\mathsf{E}^{\mathsf{sc}}] \cup \mathsf{sb}^?; [\mathsf{F}^{\mathsf{sc}}])$ 

#### SC Fences are Too Weak



- $\ref{eq: path f_1 to f_2 via sb \to rb \to sb }$
- Path  $f_2$  to  $f_1$  via sb  $\rightarrow$  rb  $\rightarrow$  rf  $\rightarrow$  sb Applying psc<sub>1</sub> = ([E<sup>sc</sup>]  $\cup$  [F<sup>sc</sup>]; sb<sup>?</sup>); scb; ([E<sup>sc</sup>]  $\cup$  sb<sup>?</sup>; [F<sup>sc</sup>]) is acyclic Path  $f_2$  to  $f_1$  contributes neither Batty et al. [2016] nor psc<sub>1</sub>

### Fixing model

- $\mathcal{F}$  Extended-coherence-order relation eco =  $(rf \cup mo \cup rb)^+$
- Disallow the weak behavior, it is impossible for either psc1 or [F<sup>sc</sup>]; sb; eco; sb; [F<sup>sc</sup>] containing cycles

 $\operatorname{acyclic}(\operatorname{psc}_1 \cup [\operatorname{F}^{\operatorname{sc}}]; \operatorname{sb}; \operatorname{eco}; \operatorname{sb}; [\operatorname{F}^{\operatorname{sc}}])$ 

Introduction

Background

Common OT are invalid

Overhauling SC atomics in C11

Repairing SC in C11

- ◆ (Vafeiadis et al. 2015) disallows reordering of non-atomic load and store operations, when offered solution to out-of-thin-air reads that restricts (hb ∪ rf) cycle (source or destination of rf edge is non-atomic)
- (Batty et al. 2016) strong to preserve compiler correctness from C11 to Power and ARMv7
- (Lahav et al. 2017) RC11 directly map high-level primitive to sequence of machine instructions without code optimizations, then challenging to apply methods to verify optimization passes

Thank you for listening!

backup slides

Strengthening access mode of an operation

