











## Frère Jacques

[Figure: Frère Jacques score with note lengths marked (½ note, ¼ note, 1/16 note)]

















A note is **beat-aligned** if it starts at a whole multiple of its own duration from the start of the bar.

Syncopated = not beat-aligned



# An object x is N-byte-aligned if its memory address is kN for some integer k

where  $N = 2^n$ 

# An object x is N-byte-aligned if (uintptr\_t)&x % (1 << n) == 0 where $N=2^n$
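The definition above can be sketched as a small helper. This is only an illustration: `is_aligned` is a hypothetical name, and the mapping from pointers to `std::uintptr_t` is implementation-defined (it is the plain address on mainstream platforms).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `p` is aligned to `alignment` bytes.
// `alignment` must be a power of two (N = 2^n).
inline bool is_aligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const auto address = reinterpret_cast<std::uintptr_t>(p);
    // For a power of two N, `address % N == 0` equals `(address & (N - 1)) == 0`.
    return (address & (alignment - 1)) == 0;
}
```

For a buffer declared as `alignas(16) char buf[32];`, `is_aligned(buf, 16)` holds while `is_aligned(buf + 1, 16)` does not.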

#### About Me: Tomer Vromen – תומר פרומן

Working @ Dell Technologies

C++, Python

PowerFlex Ultra

We're hiring

Haifa/Glil Yam/Be'er Sheva

→ Tomer.Vromen@dell.com







Object types have *alignment requirements* which place restrictions on the addresses at which an object of that type may be allocated.

[basic.align]

An *alignment* is an **implementation-defined** integer value representing the number of bytes between successive addresses at which a given object can be allocated. [...]

Attempting to create an object in storage that does not meet the alignment requirements of the object's type is **undefined behavior**.

[basic.align]

An **alignof** expression yields the alignment requirement of its operand type.

[expr.alignof]

In a declaration, an **alignas(...)** attribute can be used to **increase** the default alignment requirement.

[dcl.align], paraphrased
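A minimal sketch of `alignof` and `alignas` working together. The type names `Plain` and `Wide` are made up for this example, and support for a 32-byte alignment is assumed (extended alignments are implementation-defined, though mainstream compilers accept this one).

```cpp
#include <cassert>
#include <cstddef>

// Default alignment: derived from the members.
struct Plain
{
    float v[4];
};

// alignas raises the alignment requirement to 32 bytes.
struct alignas(32) Wide
{
    float v[4];
};

static_assert(alignof(Wide) == 32, "alignas increased the requirement");
static_assert(alignof(Wide) >= alignof(Plain), "alignas never weakens alignment here");
// The size is padded up so that arrays of Wide keep every element aligned.
static_assert(sizeof(Wide) % alignof(Wide) == 0, "size is a multiple of alignment");
```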

#### Demo

https://godbolt.org/z/cM6exnMvo

#### Keeping Things Aligned

- Compiler ensures that all created objects are aligned according to C++ rules
- ABI = Application Binary Interface
  - Each platform has a different ABI
- ABI defines proper alignment
  - Constraints & invariants
- The x86\_64 Stack Frame: "The end of the input argument area shall be aligned on a 16 byte boundary" (x86\_64 ABI)

#### Keeping Things Aligned

- Global variables:
  - Compiler puts them in aligned position
- Stack-allocated (local) objects
  - **ABI promises** that stack is 16-byte aligned when control is transferred to the function entry point.
  - Higher alignment is achieved by bitwise ANDing the stack pointer register with a mask.
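As a rough illustration of the stack case, the sketch below asks for a 64-byte aligned local and checks where it landed. The function name is hypothetical, and it assumes the compiler supports 64-byte alignment for automatic variables (GCC, Clang and MSVC do).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical check: does an over-aligned local really start on a
// 64-byte boundary? The compiler has to realign the stack to make it so.
inline bool local_is_64_byte_aligned()
{
    alignas(64) char buffer[64] = {};
    const auto address = reinterpret_cast<std::uintptr_t>(&buffer[0]);
    return address % 64 == 0;
}
```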

#### Keeping Things Aligned: Heap-Allocated

```
MyClass *p = new MyClass{"hello", 42};
```

1. Call operator new(sizeof(MyClass))
2. Call c'tor with arguments
   - The address (this) is the value returned by operator new

Storage returned by operator new(std::size\_t) is guaranteed to be aligned to at least \_\_STDCPP\_DEFAULT\_NEW\_ALIGNMENT\_\_

For larger alignment requirements, operator new(std::size\_t, std::align\_val\_t) is called. (since C++17)
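The over-aligned case can be sketched as follows. `CacheLineSized` and `heap_object_is_aligned` are names invented for this example, and the code assumes a C++17 compiler where 64 exceeds `__STDCPP_DEFAULT_NEW_ALIGNMENT__` (typically 16).

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical over-aligned type.
struct alignas(64) CacheLineSized
{
    int value = 0;
};

inline bool heap_object_is_aligned()
{
    static_assert(alignof(CacheLineSized) == 64, "over-aligned type");
    // Since C++17 the new-expression inside make_unique selects the
    // std::align_val_t overload when the type is over-aligned.
    auto p = std::make_unique<CacheLineSized>();
    const auto address = reinterpret_cast<std::uintptr_t>(p.get());
    return address % alignof(CacheLineSized) == 0;
}
```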



Attempting to create an object in storage that does not meet the alignment requirements of the object's type is **undefined behavior**.

[basic.align]



### Alignment In Practice

| CPU                             | Allowed? | Performance |
|---------------------------------|----------|-------------|
| Recent x86, x86_64 (Intel, AMD) | Yes      | Good        |
| ARMv8+                          | Yes      | Good        |
| POWER9+ (IBM)                   | Yes      | Good        |
| x86, x86_64, Ivy Bridge & older | Yes      | Depends     |
| POWER8                          | No       |             |
| SPARC                           |          |             |
| MIPS                            |          |             |
| ARM M-series                    |          |             |
| RISC-V                          |          |             |

Breaks atomicity!

int prctl(PR\_SET\_UNALIGN, unsigned long flag);

Pass PR\_UNALIGN\_NOPRINT to silently fix up unaligned user accesses, or PR\_UNALIGN\_SIGBUS to generate SIGBUS on unaligned user access.

Modern architectures don't mind unaligned memory access!

Still relevant for older/embedded architectures

## Alignment In Practice \*

#### **Fundamental types:**

Natural alignment

ABI for x86\_64 --->

| Type           | C                                              | sizeof | Alignment (bytes) |
|----------------|------------------------------------------------|--------|-------------------|
| Integral       | \_Bool <sup>†</sup>                            | 1      | 1                 |
|                | char, signed char                              | 1      | 1                 |
|                | unsigned char                                  | 1      | 1                 |
|                | short, signed short                            | 2      | 2                 |
|                | unsigned short                                 | 2      | 2                 |
|                | int, signed int, enum <sup>†††</sup>           | 4      | 4                 |
|                | unsigned int                                   | 4      | 4                 |
|                | long, signed long, long long, signed long long | 8      | 8                 |
|                | unsigned long                                  | 8      | 8                 |
|                | unsigned long long                             | 8      | 8                 |
|                | \_\_int128 <sup>††</sup>                       | 16     | 16                |
|                | signed \_\_int128 <sup>††</sup>                | 16     | 16                |
|                | unsigned \_\_int128 <sup>††</sup>              | 16     | 16                |
| Pointer        | any-type \*, any-type (\*)()                   | 8      | 8                 |
| Floating-point | float                                          | 4      | 4                 |
|                | double                                         | 8      | 8                 |
|                | long double                                    | 16     | 16                |
|                | \_\_float128 <sup>††</sup>                     | 16     | 16                |

# Alignment In Practice ★

#### **Fundamental types:**

alignof(T) == sizeof(T)

Natural alignment
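Natural alignment can be probed at compile time. The helpers below are hypothetical names for this sketch; which types satisfy `alignof(T) == sizeof(T)` depends on the ABI, so only the properties that hold everywhere are asserted.

```cpp
#include <cassert>
#include <cstddef>

// True when T is "naturally aligned": its alignment equals its size.
template <typename T>
constexpr bool is_naturally_aligned()
{
    return alignof(T) == sizeof(T);
}

// Holds for every complete object type: the size is a multiple of the
// alignment, otherwise array elements could not all be aligned.
template <typename T>
constexpr bool size_is_multiple_of_alignment()
{
    return sizeof(T) % alignof(T) == 0;
}

static_assert(is_naturally_aligned<char>(), "sizeof(char) and alignof(char) are both 1");
static_assert(size_is_multiple_of_alignment<long double>(), "always true");
```

On the x86\_64 ABI shown in the table, `is_naturally_aligned<int>()` and `is_naturally_aligned<double>()` would also be expected to hold.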

# Alignment In Practice ☆

#### **Fundamental types:**

Compound types (struct, class, union):

The alignment is the largest alignment among the non-static data members

The whole is greater than the sum of its parts

```
struct S
{
    char a;
    int b;
    short c;
    double d;
    char e;
};
```

<sup>★</sup> ABI-defined

The whole is greater than the sum of its parts

```
struct S
{
    char a;    // offset 0, followed by 3 bytes of padding
    int b;     // offset 4
    short c;   // offset 8, followed by 6 bytes of padding
    double d;  // offset 16
    char e;    // offset 24, followed by 7 bytes of tail padding
};
```

sizeof(S) == 25 ???

sizeof(S) == 32: the size is rounded up to a multiple of alignof(S) == 8, so that every element of an array of S stays aligned.

The whole is greater than the sum of its parts

```
struct S
{
    char a;
    char e;
    short c;
    int b;
    double d;
};
```

sizeof(S) == 16
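The effect of member order can be checked with `sizeof` and `alignof`. This sketch renames the two layouts to `Declared` and `Reordered` (names chosen here so both can coexist in one file). The exact sizes are ABI-defined; the slides report 32 and 16 on x86\_64.

```cpp
#include <cassert>
#include <cstddef>

// Original member order from the slides.
struct Declared
{
    char a;
    int b;
    short c;
    double d;
    char e;
};

// Same members, reordered so that small members share one slot.
struct Reordered
{
    char a;
    char e;
    short c;
    int b;
    double d;
};

// Sum of the member sizes, ignoring all padding.
constexpr std::size_t kSumOfParts =
    sizeof(char) + sizeof(int) + sizeof(short) + sizeof(double) + sizeof(char);

static_assert(sizeof(Declared) >= kSumOfParts, "padding only ever adds bytes");
static_assert(sizeof(Reordered) <= sizeof(Declared), "reordering does not cost space here");
static_assert(sizeof(Declared) % alignof(Declared) == 0, "tail padding rounds the size up");
```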

The whole is greater than the sum of its parts

```
#pragma pack(push, 1)
struct S
{
    char a;
    int b;
    short c;
    double d;
    char e;
};
#pragma pack(pop)
```

sizeof(S) == 16


```
s.b = 42;
```

arm32 disassembly:

```
movs r3, #0
orr  r3, r3, #42
strb r3, [r7, #1]
movs r3, #0
strb r3, [r7, #2]
movs r3, #0
strb r3, [r7, #3]
movs r3, #0
strb r3, [r7, #4]
```
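The disassembly shows the compiler falling back to byte-wise stores for the packed member. When unaligned data has to be handled by hand, without `#pragma pack`, a commonly used portable approach is `std::memcpy`, which lets the compiler pick whatever access sequence the target allows. The helper names below are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Read a 32-bit value from a possibly unaligned location.
inline std::uint32_t load_u32_unaligned(const unsigned char* p)
{
    std::uint32_t value;
    std::memcpy(&value, p, sizeof value);
    return value;
}

// Write a 32-bit value to a possibly unaligned location.
inline void store_u32_unaligned(unsigned char* p, std::uint32_t value)
{
    std::memcpy(p, &value, sizeof value);
}
```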



<sup>☆</sup> ABI + compiler extension

#### SIMD

Single Instruction Multiple Data

Intel's documentation --->

#### MOVAPD—Move Aligned Packed Double Precision Floating-Point Values

| Opcode/<br>Instruction | Op/En | 64/32 bit<br>Mode<br>Support | CPUID Feature<br>Flag | Description |
|------------------------|-------|------------------------------|-----------------------|-------------|
| 66 0F 28 /r<br>MOVAPD xmm1, xmm2/m128 | A | V/V | SSE2 | Move aligned packed double precision floating-point values from xmm2/mem to xmm1. |
| 66 0F 29 /r<br>MOVAPD xmm2/m128, xmm1 | B | V/V | SSE2 | Move aligned packed double precision floating-point values from xmm1 to xmm2/mem. |
| [...] | | | | |
| EVEX.128.66.0F.W1 28 /r<br>VMOVAPD xmm1 {k1}{z}, xmm2/m128 | C | V/V | (AVX512VL AND<br>AVX512F) OR<br>AVX10.1 | Move aligned packed double precision floating-point values from xmm2/m128 to xmm1 using writemask k1. |
| EVEX.256.66.0F.W1 28 /r<br>VMOVAPD ymm1 {k1}{z}, ymm2/m256 | C | V/V | (AVX512VL AND<br>AVX512F) OR<br>AVX10.1 | Move aligned packed double precision floating-point values from ymm2/m256 to ymm1 using writemask k1. |
| EVEX.512.66.0F.W1 28 /r<br>VMOVAPD zmm1 {k1}{z}, zmm2/m512 | C | V/V | AVX512F<br>OR AVX10.1 | Move aligned packed double precision floating-point values from zmm2/m512 to zmm1 using writemask k1. |
| EVEX.128.66.0F.W1 29 /r<br>VMOVAPD xmm2/m128 {k1}{z}, xmm1 | D | V/V | (AVX512VL AND<br>AVX512F) OR<br>AVX10.1 | Move aligned packed double precision floating-point values from xmm1 to xmm2/m128 using writemask k1. |
| EVEX.256.66.0F.W1 29 /r<br>VMOVAPD ymm2/m256 {k1}{z}, ymm1 | D | V/V | (AVX512VL AND<br>AVX512F) OR<br>AVX10.1 | Move aligned packed double precision floating-point values from ymm1 to ymm2/m256 using writemask k1. |
| EVEX.512.66.0F.W1 29 /r<br>VMOVAPD zmm2/m512 {k1}{z}, zmm1 | D | V/V | AVX512F<br>OR AVX10.1 | Move aligned packed double precision floating-point values from zmm1 to zmm2/m512 using writemask k1. |

"When the source or destination operand is a memory operand, the operand <u>must be aligned</u>"

 $[\ldots]$ 

"To move double precision floating-point values to and from unaligned memory locations, use the (V)MOV<u>U</u>PD instruction."

#### SIMD

The unaligned version must be slower... right?

Floating point XMM and YMM instructions (Source: Agner Fog)

| Instruction            | Operands | μops<br>fused<br>domain | μops<br>unfused<br>domain | μops each port | Latency | Reciprocal<br>throughput | Comments      |
|------------------------|----------|-------------------------|---------------------------|----------------|---------|--------------------------|---------------|
| MOVAPS/D               | x,x      | 1                       | 1                         | p015           | 0-1     | 0.25                     | may eliminate |
| VMOVAPS/D              | y,y      | 1                       | 1                         | p015           | 0-1     | 0.25                     | may eliminate |
| MOVAPS/D<br>MOVUPS/D   | x,m128   | 1                       | 1                         | p23            | 2       | 0.5                      |               |
| VMOVAPS/D<br>VMOVUPS/D | y,m256   | 1                       | 1                         | p23            | 3       | 0.5                      | AVX           |
| MOVAPS/D<br>MOVUPS/D   | m128,x   | 1                       | 2                         | p237 p4        | 3       |                          |               |

# Alignment is Still Relevant!

(Even on Modern Platforms)



#### Cache Lines

[Figure: data laid out across 64-byte cache lines]
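Whether an access touches one cache line or two follows directly from its address and size. The sketch below assumes 64-byte lines, as on the slides; `kCacheLineSize` and `straddles_cache_line` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineSize = 64; // assumption: 64-byte cache lines

// True if the byte range [address, address + size) spans two or more
// cache lines, i.e. its first and last byte fall into different lines.
constexpr bool straddles_cache_line(std::uintptr_t address, std::size_t size)
{
    return size != 0 &&
           (address / kCacheLineSize) != ((address + size - 1) / kCacheLineSize);
}

static_assert(!straddles_cache_line(60, 4), "bytes 60..63 stay in line 0");
static_assert(straddles_cache_line(62, 4), "bytes 62..65 cross into line 1");
```

A naturally aligned 4-byte `int` can never straddle a line, because its address is a multiple of 4 and 64 is a multiple of 4. A packed one can.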

# Cache Lines & Locking




#### Benchmark

```
struct StructAligned
{
    int a = 42;
    char b = '\0';
};
```

```
#pragma pack(push, 1)
struct StructUnaligned
{
    int a = 42;
    char b = '\0';
};
#pragma pack(pop)
```

#### Benchmark

```
struct AtomicAligned
{
    atomic<int> a = 42;
    char b = '\0';
};

#pragma pack(push, 1)
struct AtomicUnaligned
{
    atomic<int> a = 42;
    char b = '\0';
};
#pragma pack(pop)
```

#### Benchmark

```
template <typename T>
static void Runner(State& state)
{
    constexpr size_t N = 100;
    T s[N];
    for (auto _ : state) {
        for (size_t i = 0; i < N; ++i) {
            int t = ++s[i].a;
            DoNotOptimize(t);
        }
    }
}

BENCHMARK(Runner<StructAligned>);
BENCHMARK(Runner<StructUnaligned>);
BENCHMARK(Runner<AtomicAligned>);
BENCHMARK(Runner<AtomicUnaligned>);
```

| Benchmark                 | Time       | CPU        |
|---------------------------|------------|------------|
| `Runner<StructAligned>`   | 39.8 ns    | 39.7 ns    |
| `Runner<StructUnaligned>` | 70.8 ns    | 70.6 ns    |
| `Runner<AtomicAligned>`   | 669 ns     |            |
| `Runner<AtomicUnaligned>` | 3443049 ns | 2424070 ns |

- Cache line split: **78%** slower
- Atomic write: **9.5x** slower
- Atomic write with cache line split: 5000x slower!

**Split lock**: locks the whole memory bus!

# Cache Lines & Locking & Multithread



# Benchmark: False Sharing

```
struct AtomicAligned4
{
   atomic<int> a = 42;
};

sizeof(AtomicAligned4) == 4

struct alignas(64) AtomicAligned64
{
   atomic<int> a = 42;
};

sizeof(AtomicAligned64) == 64
```

### Benchmark: False Sharing

| Benchmark                 | Time          | CPU       |
|---------------------------|---------------|-----------|
| `Runner<AtomicAligned4>`  | 1208372885 ns | 260510 ns |
| `Runner<AtomicAligned64>` | 802320730 ns  | 221603 ns |

Avoiding false sharing: 33.6% faster
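The padding trick behind this result can be sketched without running threads: give each counter its own cache line and check that neighbouring array elements no longer share one. The 64-byte line size is an assumption, and `PaddedCounter`, `kLineSize` and `on_same_cache_line` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineSize = 64; // assumption: 64-byte cache lines

// Each counter is aligned to, and therefore padded up to, a full line.
struct alignas(kLineSize) PaddedCounter
{
    std::atomic<int> value{0};
};

static_assert(sizeof(PaddedCounter) == kLineSize, "one counter per cache line");

// True if both addresses fall into the same cache line.
inline bool on_same_cache_line(const void* a, const void* b)
{
    return reinterpret_cast<std::uintptr_t>(a) / kLineSize ==
           reinterpret_cast<std::uintptr_t>(b) / kLineSize;
}
```

With `PaddedCounter counters[2];`, two threads can each update their own element without invalidating the other's cache line, whereas adjacent plain `std::atomic<int>` elements usually sit on the same line.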

# Benchmark: False Sharing, No Locks

```
struct Aligned4
{
    int a = 42;
};
sizeof(Aligned4) == 4

struct alignas(64) Aligned64
{
    int a = 42;
};
sizeof(Aligned64) == 64
```

#### Benchmark: False Sharing, No Locks

| Benchmark          | Time      | CPU      |
|--------------------|-----------|----------|
| `Runner<Aligned4>` | 726761 ns | 76867 ns |

Avoiding false sharing: 12.5% faster





## Alignment – Yes or No?

- C++ alignment rules are simplistic, and maybe outdated
  - Undefined behavior → Implementation-defined?
- Only really needed for embedded
- Modern CPUs don't mind unaligned data too much
- C++ will pad structs to enforce alignment
  - Good if you need it, but wasteful otherwise
  - Reorder members to reduce padding
  - Use #pragma pack to decrease alignment, carefully
- Cache alignment does matter for performance!
- Multi-threaded: use alignas(64) to avoid false sharing

