
C++ Learning Journey

Core notes distilled from talks by Scott Meyers and Rainer Grimm, with additional answers from Perplexity and Gemini 3 Pro.

Performance

As C++ programmers, we care about performance. If we didn't care about performance, we would be using Python. In the words of Scott Meyers during a C++ conference talk: "Over in the next room".

As a foundation, we must understand what happens at each step in the life cycle of a C++ program.

  • 1 the compiler
    • the compiler unrolls loops, removes unused code, places variables in registers, etc.
  • 2 the processor
    • the processor runs operations out of order and makes branch predictions; branch mispredictions are very costly, so try to write branchless code, or code with fewer branches
  • 3 the cache
    • in a traditional von Neumann CPU architecture, the cache prefetches the next instructions or data

Each part works hard to optimize its own respective stage.

Caches

Because CPU speeds have increased much faster than memory-access speeds, much of a program's delay comes from memory reads and writes. To allay these delays, modern multi-core processors are designed with multiple caches, designated L1 (private, small, fastest), L2 (private, larger, fast) and L3 (shared, largest, slower).

Cacheline

A cache line is the smallest fixed-size block of data (typically 64 bytes on modern x86/ARM CPUs) that a CPU cache transfers between main memory and each core's private local caches.

Loading data at least 64 bytes at a time exploits spatial locality, meaning data used together is most likely located sequentially in memory. Loading an entire line from DRAM takes roughly 100-200 cycles, whereas reading from L1 takes about 3-5 cycles.

Cache coherence

However, a problem with caches is that each processor views memory through its own cache, so without additional precautions two processors could end up seeing two different values for the same address. This is the cache coherence problem. It exists because there is a global state in main memory alongside private local states (each core's L1 and L2).

In multi-core systems, cache lines are the unit of coherency under protocols like MESI. When one core updates data in a cache line, the MESI protocol invalidates copies of the same data in other cores, forcing them to refetch it on the next read.

To resolve this cache coherence problem, a common protocol is for the writing processor to acquire bus access (this sounds like a lock to me) in order to broadcast the invalidated address. All processors continuously snoop on the bus; if an address appears that is in a processor's cache, that processor invalidates its local copy of the data.

Memory model

C++11 introduced a memory model that specifies behaviour for multi-threaded programs. These rules guarantee correctness, robustness and portability for multi-threaded programs. In essence, a memory model specifies what happens when two threads access the same memory location.

If two or more threads access the same memory location and at least one thread modifies it, then there will be a data race unless

- the accesses are atomic operations, or
- the read and write are synchronized so that one happens-before the other (they cannot overlap in time)

Memory model contracts

The C++11 memory model offers the programmer varying levels of ordering contracts.

What the memory model clarifies

  • atomic operations- which operations execute without interruption.
  • partial ordering of operations- sequences of operations that must not be reordered.
  • visible effects of operations- when updates on shared variables are visible to other threads.

These are- in order of increasing optimisation possibilities and the level of expertise required:

  • strong: single threading, one control flow only
  • medium: multi-threading with tasks, threads, and condition variables
  • weak: atomics; sequential consistency, acquire-release semantic, relaxed semantic

Roughly speaking, the stronger the contract, the fewer reordering choices the compiler and hardware have available for optimisation. Conversely, the weaker the contract, the more optimisation choices are available.

Tasks, threads and condition variables are understandable. Let's just focus on atomics.

Atomics

The two main ideas of atomics are order constraints and synchronization.

With atomics, we reach the domain of experts. When we use atomics, we often speak of lock-free programming. Sequential consistency is the strongest, most intuitive level of the weak memory model. Acquire-release semantics is next. The relaxed semantic is the weakest of the weak memory model.

std::atomic_flag C++11

atomic_flag is the simplest atomic boolean flag, used as a foundational building block for higher-level lock abstractions. It is the only atomic type guaranteed to be lock-free. All other atomics can potentially use mutexes internally depending on the processor architecture- although this is unlikely. Other atomics may also fall back to a lock if the data is wider than the largest atomic instruction the processor supports (e.g. wider than 16 bytes on x86-64), not the cache line.

A pre-C++20 atomic_flag only has two methods: test_and_set and clear.

test_and_set    sets the atomic_flag and return the old value
clear           clears the atomic_flag

Clear sets the flag to false.

test_and_set (TAS) sets the flag to true and returns the old value. If it returns false, the current thread was the first to set it. If it returns true, some other thread set it to true first. In summary, if TAS returns false, we hold the lock and can proceed until we clear it.

To check whether an atomic uses a mutex internally, call its is_lock_free method at run time or, starting in C++17, inspect the static constexpr member atomic<T>::is_always_lock_free at compile time.

obj.is_lock_free()
std::atomic<T>::is_always_lock_free

Spinlock

A first use case of atomic_flag is with a spinlock.

A mutex, when blocked, enters into a wait state by switching into kernel mode; the kernel then wakes the thread when the lock becomes available, so the thread does not poll at all.

Unlike a mutex, a spinlock, when blocked, does not context-switch into kernel mode. A spinlock stays in user mode and instead spends its CPU cycles repeatedly retrying the lock, causing high CPU usage but avoiding a costly switch to kernel mode.

Example of a spinlock using atomic_flag

static std::atomic_flag s_flag = ATOMIC_FLAG_INIT;
static int s_cnt = 0;

class Spinlock {
    bool _locked = false;
public:
    ~Spinlock() {
        if (_locked) {
            s_flag.clear(std::memory_order_release);
        }
    }

    // All Spinlock instances that want to acquire the flag will test_and_set; 
    // test_and_set returns the previous value of the flag meaning 
    // test_and_set returns false only if it acquires the lock.
    //
    // If the flag was already set (true), lock() keeps retrying.
    // lock() breaks out of the while loop only once test_and_set
    // observes a cleared (false) flag.

    void lock() {
        while( s_flag.test_and_set(std::memory_order_acquire) );
        _locked = true;
    }
};

// Even if the code after the lock throws an exception, 
// the spinlock instance will always unlock by RAII.

void test_spinlock(int i) {
    Spinlock spinlock;
    spinlock.lock();
    s_cnt += i;
}


TEST(Threads, Spinlock_TestUnlockByRaii) {
    int n_threads = 100;
    std::vector<std::thread> threads;
    threads.reserve(n_threads);

    for (int i=0; i<n_threads; i++) {
        threads.emplace_back(test_spinlock, i);
    }
    for (auto& t : threads) {
        t.join();
    }
    // sum of 0..99.
    ASSERT_TRUE(s_cnt == 4950);
}

TTS

The previous test_and_set (TAS) Spinlock was a naive, inefficient implementation. The major problem with TAS is that it is effectively a write. A write forces all other cores to invalidate their copy of the cache line holding the atomic_flag, even though the flag's value does not necessarily change (it is usually already true while another thread holds the lock). This forces all other cores to reread the cache line- even when there is no value change- creating a lot of bus contention. Because every spinning thread repeatedly performs a read-modify-write on the same cache line, this use of test_and_set creates a lot of cache-line 'noise', also called cache-line bouncing.

A better Spinlock implementation with atomic_flag is the TTS protocol- test and test-and-set. Because a read is far cheaper and less disruptive to other cores than a write, TTS spins on plain tests (reads) first, and only when a test returns false (the flag was just cleared by another thread!) does it attempt the expensive test_and_set (write).


std::atomic_flag C++20

A C++20 atomic_flag has additional methods for synchronizing between threads- that is, one thread can tell another thread that something is ready.

wait            blocks the calling thread until notified on the atomic_flag
notify_one      notify one thread waiting on the atomic_flag
notify_all      notify all threads waiting on the atomic_flag

wait has the signature wait(bool old, std::memory_order order = std::memory_order_seq_cst); it blocks while the atomic_flag's value equals old. (The flag's value is false after initialization and after a clear; a test_and_set sets it to true.) wait is guaranteed to unblock only when the value differs from old.

A thread can wake up spuriously- meaning wake up for no reason, due to kernel implementation details. To prevent an atomic_flag wait from returning erroneously, the C++20 std::atomic_flag::wait(bool old_val) has a built-in loop that checks: is the atomic's value still equal to old_val? If so, the thread goes back to sleep. This explains why the wait function has an old_val parameter.

Example of a atomic_flag wait and notify use case in a workflow:

std::atomic_flag flag_a = ATOMIC_FLAG_INIT;
std::atomic_flag flag_b = ATOMIC_FLAG_INIT;

// test_and_set and clear's default parameter is: memory_order_seq_cst
void appendA(std::string& str) {
    str += 'A';
    flag_a.test_and_set(std::memory_order_release); // release: publish str += 'A' before the flag turns true
    flag_a.notify_one();   // notify one waiting thread
}

void appendB(std::string& str) {
    flag_a.wait(false);  // wait until flag_a is set

    str += 'B';
    flag_a.clear(std::memory_order_release);        // clear flag_a before setting flag_b
    flag_b.test_and_set(std::memory_order_release); // release: publish str += 'B' before the flag turns true
    flag_b.notify_one();   // notify one waiting thread
}

void appendC(std::string& str) {
    flag_b.wait(false);  // wait until flag_b is set

    str += 'C';
    flag_b.clear(std::memory_order_release);  // clear flag_b for next use
}

TEST(Threads, AtomicFlag_WaitAndNotify) {
    for (int i=0; i<10; i++) {
        std::string result;
        std::thread t1(appendA, std::ref(result));
        std::thread t2(appendB, std::ref(result));
        std::thread t3(appendC, std::ref(result));

        t1.join();
        t2.join();
        t3.join();

        ASSERT_TRUE(result == "ABC");
    }
}

Note this form of synchronization, a wait and notify, is much more efficient than a spinlock.

A wait and notify in C++20 uses a hybrid synchronization model. wait() first spins briefly in user mode and returns immediately if the flag has already changed. Only after that short spin does the thread enter kernel mode and sleep. A sleeping thread consumes zero CPU until the kernel wakes it on a notify.

Before C++20's wait and notify, developers used spinlocks, which consume excessive CPU and prevent the CPU from doing other work.

ABA Problem

The ABA problem: between two checks of a value by one thread, another thread could change the value from A to B and then back to A. The first thread never observes the intermediate state, so it either keeps waiting or proceeds on the stale assumption that nothing changed.

atomic_flag vs atomic<bool>

In 99% of all cases, std::atomic_flag and std::atomic<bool> compile to the same assembly code. However, atomic_flag has one big difference- the C++ standard guarantees atomic_flag is always lock-free, regardless of how ancient or unusual the processor is. An atomic_flag is backed by raw hardware instructions, never a software lock.

atomic<bool> is usually lock-free, but this is not guaranteed. On some obscure processors, the C++ compiler may secretly implement atomic<bool> with a mutex.

However, on modern x64 (Intel/AMD) and ARM64 (Apple/Android) processors, atomic<bool> is almost always lock-free, which can be verified with is_lock_free. So prefer std::atomic in almost all cases, and reserve atomic_flag for maximum portability and for implementing a spinlock- a use case test_and_set fits perfectly.

std::memory_order

A load with std::memory_order_acquire guarantees that all writes made by another thread before its release store to the same atomic variable are visible in this thread after the load.

A store with std::memory_order_release guarantees that all writes made by this thread before the store are visible to any other thread that acquire-loads the same atomic variable.


2025-11