3. Atomics
The two main ideas of atomics are ordering constraints and synchronization.
With atomics, we reach the domain of experts. When we use atomics, we often speak of lock-free programming. Sequential consistency is the strongest, most intuitive memory ordering. Acquire-release semantics is next. Relaxed semantics is the weakest ordering in the C++ memory model.
std::atomic_flag C++11
atomic_flag is the simplest atomic boolean flag, used as a foundational building block for higher-level lock abstractions. It is the only atomic type the C++ standard guarantees to be lock-free. All other atomics can potentially use a mutex internally, depending on the processor architecture, although this is unlikely. Other atomics may also fall back to a lock when the type is too large for the platform's native atomic instructions.
A pre-C++20 atomic_flag has only two methods: test_and_set and clear.
test_and_set sets the flag to true and returns the old value
clear sets the flag to false
If test_and_set (TAS) returns false, the current thread was the first to set the flag. If it returns true, some other thread set it first. In summary, if TAS returns false, we hold the lock and can proceed until we clear it.
To check that an atomic does not use a mutex, call its is_lock_free method or,
starting in C++17, inspect the static constexpr member is_always_lock_free:
obj.is_lock_free()
std::atomic<Type>::is_always_lock_free
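A minimal sketch of both checks (std::atomic<int> here is just an illustration):

#include <atomic>
#include <iostream>

int main() {
    std::atomic<int> counter{0};
    // Run-time check on a specific object.
    std::cout << std::boolalpha << counter.is_lock_free() << '\n';
    // Compile-time guarantee for the whole type (C++17).
    static_assert(std::atomic<int>::is_always_lock_free);
}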
Spinlock
A first use case of atomic_flag is with a spinlock.
A mutex, when blocked, enters a wait state by switching into kernel mode, where it sleeps until the OS wakes it to retry the lock.
Unlike a mutex, a spinlock, when blocked, does not context switch into kernel mode. A spinlock stays in user mode and instead spends CPU cycles repeatedly retrying the lock, causing high CPU usage but avoiding a costly switch to kernel mode.
Example of a spinlock using atomic_flag
#include <atomic>
#include <thread>
#include <vector>
#include <gtest/gtest.h>

static std::atomic_flag s_flag = ATOMIC_FLAG_INIT;
static int s_cnt = 0;

class Spinlock {
    bool _locked = false;
public:
    ~Spinlock() {
        if (_locked) {
            s_flag.clear(std::memory_order_release);
        }
    }
    // All Spinlock instances that want to acquire the flag call test_and_set;
    // test_and_set returns the previous value of the flag, meaning
    // test_and_set returns false only if it acquires the lock.
    //
    // If the flag was already set (true), lock() keeps trying.
    // lock() breaks out of the while loop only when the flag was
    // clear (false).
    void lock() {
        while (s_flag.test_and_set(std::memory_order_acquire));
        _locked = true;
    }
};

// Even if the code after the lock throws an exception,
// the Spinlock instance will always unlock via RAII.
void test_spinlock(int i) {
    Spinlock spinlock;
    spinlock.lock();
    s_cnt += i;
}

TEST(Threads, Spinlock_TestUnlockByRaii) {
    int n_threads = 100;
    std::vector<std::thread> threads;
    threads.reserve(n_threads);
    for (int i = 0; i < n_threads; i++) {
        threads.emplace_back(test_spinlock, i);
    }
    for (auto& t : threads) {
        t.join();
    }
    // Sum of 0..99.
    ASSERT_TRUE(s_cnt == 4950);
}
TTAS
The previous test_and_set (TAS) Spinlock was a naive, inefficient implementation. The major problem with TAS is that it is effectively a write. A write forces all other threads to invalidate their copy of the cache line holding the atomic_flag, even though the value of the flag does not necessarily change (it is usually already true because another thread holds the lock). This forces all other threads to reread the cache line even when nothing changed, creating a lot of bus contention. Using test_and_set this way in a Spinlock creates a lot of cache-line 'noise', also called cache-line bouncing, because every spinning thread repeatedly performs a read-modify-write on the same cache line.
A better Spinlock implementation with atomic_flag is the TTAS protocol: test and test-and-set. Because a read is much faster, cheaper, and less disruptive to other threads, TTAS performs repeated tests (reads) first, and only when a test returns false (the flag was just cleared by another thread!) does it attempt the test_and_set (write).
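A minimal TTAS sketch (assuming C++20, since atomic_flag::test() was only added in C++20; the class name is illustrative):

#include <atomic>

class TtasSpinlock {
    std::atomic_flag _flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        for (;;) {
            // Test phase: spin on plain reads; these hit the local cache
            // and generate no invalidation traffic while the lock is held.
            while (_flag.test(std::memory_order_relaxed));
            // Test-and-set phase: attempt the write only after a read saw
            // the flag clear. Another thread may still win the race, in
            // which case we go back to spinning on reads.
            if (!_flag.test_and_set(std::memory_order_acquire)) {
                return;
            }
        }
    }
    void unlock() {
        _flag.clear(std::memory_order_release);
    }
};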
std::atomic_flag C++20
A C++20 atomic_flag has additional methods for synchronizing between threads, that is, one thread can tell another thread that something is ready.
wait blocks the calling thread until notified on the atomic_flag
notify_one notifies one thread waiting on the atomic_flag
notify_all notifies all threads waiting on the atomic_flag
wait has the signature (bool old, std::memory_order order = std::memory_order_seq_cst) and blocks until the atomic_flag's value (false by default and also after a clear) is different from old. A test_and_set on the atomic_flag sets its value to true. wait is guaranteed to unblock only if the value has changed.
A thread can wake up spuriously, meaning it wakes up for no reason due to kernel implementation details. To prevent an atomic_flag wait from returning erroneously, the C++20 std::atomic_flag::wait(bool old) has a built-in loop that checks: is the flag's value still equal to old? If so, the thread goes back to sleep. This explains why the wait function has an old parameter.
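A conceptual model of that loop (not the actual standard-library implementation, which blocks on platform primitives such as futexes):

#include <atomic>

// Sketch of what atomic_flag::wait(old) does conceptually.
void wait_model(std::atomic_flag& flag, bool old) {
    while (flag.test(std::memory_order_seq_cst) == old) {
        // Conceptually: sleep here until notified. A spurious wakeup
        // simply re-runs the check above and goes back to sleep.
    }
}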
Example of an atomic_flag wait and notify use case in a workflow:
#include <atomic>
#include <string>
#include <thread>
#include <gtest/gtest.h>

std::atomic_flag flag_a = ATOMIC_FLAG_INIT;
std::atomic_flag flag_b = ATOMIC_FLAG_INIT;

// test_and_set and clear's default memory-order parameter is memory_order_seq_cst.
void appendA(std::string& str) {
    str += 'A';
    flag_a.test_and_set(std::memory_order_release); // publish the write to str, then set flag_a
    flag_a.notify_one();                            // notify one waiting thread
}

void appendB(std::string& str) {
    flag_a.wait(false);                             // block until flag_a is set
    str += 'B';
    flag_a.clear(std::memory_order_release);        // clear flag_a before setting flag_b
    flag_b.test_and_set(std::memory_order_release); // publish the write to str, then set flag_b
    flag_b.notify_one();                            // notify one waiting thread
}

void appendC(std::string& str) {
    flag_b.wait(false);                             // block until flag_b is set
    str += 'C';
    flag_b.clear(std::memory_order_release);        // clear flag_b for the next iteration
}

TEST(Threads, AtomicFlag_WaitAndNotify) {
    for (int i = 0; i < 10; i++) {
        std::string result;
        std::thread t1(appendA, std::ref(result));
        std::thread t2(appendB, std::ref(result));
        std::thread t3(appendC, std::ref(result));
        t1.join();
        t2.join();
        t3.join();
        ASSERT_TRUE(result == "ABC");
    }
}
Note that this form of synchronization, wait and notify, is much more efficient than a spinlock.
A C++20 wait and notify uses a hybrid synchronization model. wait() first stays in user mode and returns quickly if the flag is already set. The thread enters kernel mode only if necessary, after a short spin period. While asleep in the kernel, the thread consumes zero CPU.
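A hypothetical sketch of that hybrid strategy (real implementations vary by platform and standard library; the spin count 64 is arbitrary):

#include <atomic>

// Spin briefly in user mode, then fall back to a kernel-assisted sleep.
void hybrid_wait(std::atomic_flag& flag) {
    for (int i = 0; i < 64; ++i) {              // short user-mode spin
        if (flag.test(std::memory_order_acquire)) {
            return;                             // flag set while spinning; no syscall
        }
    }
    flag.wait(false);                           // sleep in the kernel until notified
}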
Before C++20's wait and notify, developers used spinlocks, which consume excessive CPU and keep the CPU from doing other work.
ABA Problem
The ABA problem: in between the times a thread checks a flag for a changed value, another thread could have changed the value from A to B and then back to A. The waiting thread never sees the changed state and continues to wait/block.
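An illustrative timeline with atomic_flag::wait (a hypothetical interleaving, not a runnable race):

// flag starts false (A). One thread waits; another sets and clears it.
//
//   waiter:  flag.wait(false);     // checks: still false, goes to sleep
//   other:   flag.test_and_set();  // A -> B (false -> true)
//   other:   flag.clear();         // B -> A (true -> false)
//   other:   flag.notify_one();    // waiter wakes, re-checks the value,
//                                  // sees false == old, sleeps again:
//                                  // the A -> B -> A transition was missed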
atomic_flag vs atomic<bool>
In 99% of all cases, std::atomic_flag and std::atomic<bool> compile to the same assembly code. However, atomic_flag has one big difference: the C++ standard guarantees atomic_flag is always lock-free, regardless of how ancient or unusual the processor is. atomic_flag is backed by raw hardware instructions, never a software lock.
atomic<bool> is usually lock-free, but not guaranteed. On some obscure processors, the C++ compiler can secretly implement atomic<bool> with a mutex lock.
However, on modern x64 (Intel/AMD) and ARM64 (Apple/Android) processors, atomic<bool> is almost always lock-free, which can be verified with is_lock_free or is_always_lock_free.
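A quick compile-time check for a given target (this assertion holds on mainstream platforms, but it is worth verifying per target):

#include <atomic>

static_assert(std::atomic<bool>::is_always_lock_free,
              "atomic<bool> falls back to a lock on this target");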
So in practice, use atomic<bool>; reach for atomic_flag only when the lock-free guarantee must hold on every platform.
std::memory_order
A load with std::memory_order_acquire means all writes in other threads that release the same atomic variable are visible in this thread.
A store with std::memory_order_release means all writes in this thread are visible to other threads that acquire the same atomic variable.
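A minimal release/acquire message-passing sketch (the names producer/consumer and the value 42 are just illustrations):

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                    // plain, non-atomic write
    ready.store(true, std::memory_order_release); // publish: all prior writes become visible
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)); // pairs with the release store
    assert(data == 42);                             // guaranteed to see the write to data
}

int main() {
    std::thread t1(producer);
    std::thread t2(consumer);
    t1.join();
    t2.join();
}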
2025-11