Understanding Concurrency From First Principles
You’re making breakfast. The coffee is brewing, toast is in the toaster, and eggs are frying on the stove. You didn’t wait for the coffee to finish before starting the toast—that would be absurd. Instead, you started all three tasks and now you’re managing them together, checking the eggs, listening for the toaster, watching the coffee pot. You’re doing multiple things at once.
This is concurrency. And your computer can do it too.
By the end of this tutorial, you will understand how programs can do many things simultaneously, why this matters, and how to write concurrent code that works correctly. Along the way, you’ll discover both the tremendous power of concurrency and its hidden dangers—and you’ll learn to navigate both.
Why Your Programs Only Use Part of Your Computer
Take a moment to appreciate what’s inside your machine. If you bought a computer in the last decade, it almost certainly has multiple processor cores—perhaps four, perhaps eight, perhaps more. Each core can execute instructions independently. Your computer literally has multiple brains.
Now here’s the uncomfortable truth: most programs use only one core. The others sit idle. It’s as if you hired a team of expert workers but only let one person do anything while the rest stand around watching.
Consider downloading a large file. In a single-threaded program, your application freezes. The user interface becomes unresponsive because your one thread of execution is busy waiting for network data. The user clicks buttons, but nothing happens. They wonder if the program crashed.
With concurrency, you can do better. One thread handles the download while another keeps the interface alive. The user can continue working, perhaps cancel the download, or start another—all while data streams in the background.
The benefits multiply for computationally intensive work. Image processing, video encoding, scientific simulations—these tasks can be split into independent pieces. Process them on all your cores simultaneously and your program finishes in a fraction of the time.
But concurrency isn’t free. It introduces complexity that catches even experienced programmers off guard. Multiple threads accessing the same data can corrupt it. Threads waiting on each other can freeze forever. These problems have names—race conditions and deadlocks—and they’re the dragons we’ll learn to slay.
First, though, we need to understand what a thread actually is.
Threads: Parallel Lives Within Your Program
When you run a program, the operating system creates a process for it. This process gets its own memory space, its own resources, and at least one thread of execution.
Think of a thread as a bookmark in a book of instructions. It marks where you are in the code. The processor reads the instruction at that bookmark, executes it, and moves the bookmark forward. One thread means one bookmark—your program can only be at one place in the code at any moment.
But you can create additional threads. Each thread is its own bookmark, tracking its own position in the code. Now your program can be at multiple places simultaneously. Each thread has its own call stack—its own record of which functions are running—but all threads in a process share the same memory.
This sharing is both the power and the peril of threads.
Let’s see this concretely. Here’s how to find out how many processor cores your computer has:
#include <iostream>
#include <thread>
int main()
{
unsigned int cores = std::thread::hardware_concurrency();
std::cout << "This computer has " << cores << " cores.\n";
return 0;
}
That number represents how many threads can truly run in parallel on your machine (if the information isn't available, hardware_concurrency() may return 0). Create more threads than cores, and they'll take turns on the available cores. That's still useful for some workloads, but you won't get additional parallel speedup.
Your First Thread
Let’s create our first thread and watch it run alongside the main thread.
#include <iostream>
#include <thread>
void greet()
{
std::cout << "Hello from the new thread!\n";
}
int main()
{
std::thread worker(greet);
worker.join();
std::cout << "Back in the main thread.\n";
return 0;
}
The std::thread constructor takes a function and immediately starts a new thread running that function. Two bookmarks now move through your code simultaneously.
The join() call is crucial. It makes the main thread wait until the worker thread finishes. Without it, the worker object would be destroyed at the end of main() while still representing a running thread, and (as we'll see shortly) destroying a joinable thread terminates the program. Always ensure your threads finish before your program ends.
Now let’s see two threads actually racing.
#include <iostream>
#include <thread>
void count(const char* name)
{
for (int i = 1; i <= 5; ++i)
std::cout << name << ": " << i << "\n";
}
int main()
{
std::thread alice(count, "Alice");
std::thread bob(count, "Bob");
alice.join();
bob.join();
return 0;
}
Run this and you might see:
Alice: 1
Bob: 1
Alice: 2
Bob: 2
Alice: 3
...
Or perhaps:
AliceBob: : 1
1
Alice: 2
...
The interleaving varies each time. Both threads race to print, and their outputs can jumble together. This unpredictability is your first glimpse of concurrent programming’s fundamental challenge: when threads share resources (here, the output stream), chaos can ensue.
Many Ways to Start a Thread
You’ve seen functions passed to std::thread. But threads accept any callable object: lambda expressions, function objects, and member functions.
Lambda expressions are often the clearest choice, especially for short tasks:
#include <iostream>
#include <thread>
int main()
{
int x = 42;
std::thread t([x]() {
std::cout << "The value is: " << x << "\n";
});
t.join();
return 0;
}
The lambda captures x by value—it copies x into the lambda’s own storage. This is important: std::thread copies all arguments by default, even when your function declares a reference parameter.
To actually pass by reference, you must be explicit:
#include <iostream>
#include <thread>
void increment(int& value)
{
++value;
}
int main()
{
int counter = 0;
std::thread t(increment, std::ref(counter));
t.join();
std::cout << "Counter is now: " << counter << "\n";
return 0;
}
Without std::ref(), the thread would modify a copy, leaving counter unchanged at zero. With it, the thread modifies the original variable.
For objects with custom behavior, you can use function objects (functors):
#include <iostream>
#include <thread>
class Counter
{
int limit_;
public:
Counter(int limit) : limit_(limit) {}
void operator()() const
{
for (int i = 0; i < limit_; ++i)
std::cout << i << " ";
std::cout << "\n";
}
};
int main()
{
std::thread t(Counter(5));
t.join();
return 0;
}
You can even run member functions on objects:
#include <iostream>
#include <thread>
#include <string>
class Greeter
{
public:
void greet(const std::string& name)
{
std::cout << "Hello, " << name << "!\n";
}
};
int main()
{
Greeter g;
std::thread t(&Greeter::greet, &g, "World");
t.join();
return 0;
}
The syntax &Greeter::greet names the member function; &g provides the instance to call it on.
Thread Lifecycle: Join, Detach, or Crash
Every std::thread object must be either joined or detached before it’s destroyed. If you let a thread object go out of scope without doing one of these, the program calls std::terminate() and aborts.
We’ve used join() extensively. It blocks the calling thread until the target thread completes:
std::thread worker(do_work);
// ... other operations ...
worker.join(); // wait here until do_work finishes
Sometimes you want a thread to run independently, continuing even after the std::thread object is destroyed. That’s what detach() does:
std::thread logger(background_logging);
logger.detach(); // thread continues running independently
// logger object is now "empty" and will safely destruct
A detached thread becomes a daemon thread. It runs until it finishes or the program exits. You lose all ability to wait for it or check its status. Use detachment sparingly—usually for truly fire-and-forget background work like logging.
Before joining or detaching, you can check if a thread is joinable:
std::thread t(some_function);
if (t.joinable())
{
t.join();
}
A thread is joinable if it represents an actual thread of execution. After joining or detaching, or after default construction, a std::thread is not joinable.
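A small sketch of those states (the do-nothing task function is only there to give the thread something to run):
#include <cassert>
#include <thread>
void task() {} // does nothing; just something to execute
int main()
{
    std::thread t;            // default-constructed: represents no thread
    assert(!t.joinable());
    t = std::thread(task);    // now it owns a real thread of execution
    assert(t.joinable());
    t.join();                 // after joining...
    assert(!t.joinable());    // ...it no longer represents a thread
    return 0;
}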
Inside a Thread: Useful Operations
Within a running thread, you can access several useful operations through std::this_thread:
#include <iostream>
#include <thread>
#include <chrono>
void worker()
{
// Get this thread's unique identifier
auto id = std::this_thread::get_id();
std::cout << "Thread ID: " << id << "\n";
// Pause for a specific duration
std::this_thread::sleep_for(std::chrono::milliseconds(500));
// Yield to other threads (hint to scheduler)
std::this_thread::yield();
std::cout << "Worker done.\n";
}
The sleep_for() function pauses the thread for at least the specified duration. The yield() function hints to the operating system that this thread is willing to give up its time slice—useful for busy-wait loops to avoid consuming all CPU time.
Thread-Local Storage: Private Data for Each Thread
Sometimes each thread needs its own copy of a variable—not shared with other threads, but persistent across function calls within that thread.
#include <iostream>
#include <thread>
thread_local int counter = 0;
void work(const char* name)
{
++counter;
std::cout << name << " counter: " << counter << "\n";
++counter;
std::cout << name << " counter: " << counter << "\n";
}
int main()
{
std::thread t1(work, "T1");
std::thread t2(work, "T2");
t1.join();
t2.join();
return 0;
}
Each thread sees its own counter. T1 prints 1, then 2. T2 independently prints 1, then 2. No synchronization needed because the data isn’t shared.
Thread-local storage is perfect for per-thread caches, random number generators, or error state. Use it when you need static-like persistence but don’t want sharing between threads.
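For example, a per-thread random number generator avoids any locking around a single shared engine. This is a minimal sketch; the choice of std::mt19937 and the roll_die helper are just for illustration:
#include <iostream>
#include <random>
#include <thread>
// Each thread gets its own engine, seeded once per thread,
// so no synchronization is needed around it.
thread_local std::mt19937 rng{std::random_device{}()};
int roll_die()
{
    std::uniform_int_distribution<int> dist(1, 6);
    return dist(rng); // uses this thread's private engine
}
int main()
{
    std::thread t1([]{ std::cout << "T1 rolled " << roll_die() << "\n"; });
    std::thread t2([]{ std::cout << "T2 rolled " << roll_die() << "\n"; });
    t1.join();
    t2.join();
    return 0;
}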
The Hidden Danger: Race Conditions
Now we confront the central challenge of concurrent programming.
When multiple threads only read shared data, everything works fine. But the moment at least one thread writes while others read or write the same data, you enter dangerous territory. This is called a data race, and it produces undefined behavior—crashes, corruption, or silent errors that might not appear until your code runs in production.
Consider this innocent-looking code:
#include <iostream>
#include <thread>
int counter = 0;
void increment_many_times()
{
for (int i = 0; i < 100000; ++i)
++counter;
}
int main()
{
std::thread t1(increment_many_times);
std::thread t2(increment_many_times);
t1.join();
t2.join();
std::cout << "Counter: " << counter << "\n";
return 0;
}
Two threads, each incrementing 100,000 times. You’d expect 200,000. But run this repeatedly and you’ll see different results—180,000, 195,327, maybe occasionally 200,000. Something is wrong.
The ++counter operation looks atomic—indivisible—but it isn’t. Under the hood, it consists of three steps:
1. Read the current value of counter into a register
2. Add one to that register
3. Write the result back to counter
Between any of these steps, the other thread might execute its own steps. Imagine both threads read counter when it’s 5. Both add one, getting 6. Both write 6 back. Two increments, but the counter only went up by one.
This is called a lost update. It’s one of several classic race conditions. The more threads, the more opportunity for races. The faster your processor, the more instructions execute between context switches, potentially hiding the bug—until one critical day when timing is different.
Here’s an even subtler race:
if (x == 5) // Check the value
{
y = x * 2; // Act on it
// But what if another thread changed x between the check and the act?
}
The "check-then-act" pattern is a race condition waiting to happen. By the time you act, the condition you checked might no longer be true.
Mutual Exclusion: The Mutex
The solution to data races is mutual exclusion: ensuring that only one thread accesses shared data at a time.
A mutex (mutual exclusion object) is a lockable resource. Before accessing shared data, a thread locks the mutex. If another thread already holds the lock, the requesting thread blocks—it waits, doing nothing—until the lock is released. This serializes access to the protected data.
#include <iostream>
#include <thread>
#include <mutex>
int counter = 0;
std::mutex counter_mutex;
void increment_many_times()
{
for (int i = 0; i < 100000; ++i)
{
counter_mutex.lock();
++counter;
counter_mutex.unlock();
}
}
int main()
{
std::thread t1(increment_many_times);
std::thread t2(increment_many_times);
t1.join();
t2.join();
std::cout << "Counter: " << counter << "\n";
return 0;
}
Now the output is always 200,000. The mutex ensures that between lock() and unlock(), only one thread executes. The increment is now effectively atomic—indivisible from the perspective of any other thread.
But there’s a problem with calling lock() and unlock() directly. If code between them throws an exception, unlock() never executes. The mutex stays locked forever, and any thread waiting for it blocks eternally. This is a form of deadlock.
Lock Guards: Safety Through RAII
C++ has a powerful idiom called RAII (Resource Acquisition Is Initialization). The idea: acquire resources in a constructor, release them in the destructor. Since destructors run even when exceptions are thrown, cleanup is guaranteed.
Lock guards apply RAII to mutexes:
#include <iostream>
#include <thread>
#include <mutex>
int counter = 0;
std::mutex counter_mutex;
void increment_many_times()
{
for (int i = 0; i < 100000; ++i)
{
std::lock_guard<std::mutex> guard(counter_mutex);
++counter;
// guard's destructor automatically unlocks when scope ends
}
}
The std::lock_guard locks the mutex in its constructor and unlocks in its destructor. Even if an exception is thrown, the destructor runs and the mutex is released. This is the correct way to use mutexes.
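The same RAII locking also repairs the check-then-act race from earlier: hold the lock across both the check and the action, so no other thread can change the value in between. A minimal sketch, assuming every other access to x and y locks the same mutex:
#include <mutex>
std::mutex m;
int x = 0;
int y = 0;
void check_then_act()
{
    std::lock_guard<std::mutex> guard(m); // held for both steps
    if (x == 5)        // check the value...
    {
        y = x * 2;     // ...and act on it, with no gap another thread can exploit
    }
}   // guard's destructor unlocks m here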
Since C++17, std::scoped_lock is the preferred choice. It works like lock_guard but can lock multiple mutexes simultaneously (we’ll see why this matters when we discuss deadlocks):
std::scoped_lock guard(counter_mutex); // C++17 and later
For more control, use std::unique_lock. It can be unlocked before destruction, moved to another scope, or created without immediately locking:
std::unique_lock<std::mutex> guard(some_mutex, std::defer_lock);
// mutex not yet locked
guard.lock(); // lock when ready
// ... do protected work ...
guard.unlock(); // unlock early if needed
// ... do unprotected work ...
// destructor unlocks only if the lock is still held
std::unique_lock is more flexible but slightly more expensive than std::lock_guard. Use the simplest tool that does the job.
The Deadlock Dragon
Mutexes solve data races but introduce a new danger: deadlock.
Imagine two threads and two mutexes. Thread A locks mutex 1, then tries to lock mutex 2. Thread B locks mutex 2, then tries to lock mutex 1. Each thread holds one mutex and waits for the other. Neither can proceed. The program freezes forever.
std::mutex mutex1, mutex2;
void thread_a()
{
std::lock_guard<std::mutex> lock1(mutex1);
// ... some work ...
std::lock_guard<std::mutex> lock2(mutex2); // blocks, waiting for B
}
void thread_b()
{
std::lock_guard<std::mutex> lock2(mutex2);
// ... some work ...
std::lock_guard<std::mutex> lock1(mutex1); // blocks, waiting for A
}
If both threads run and each acquires its first mutex before the other acquires the second, deadlock occurs. The program hangs silently, giving no indication of what went wrong.
The simplest prevention: always lock mutexes in the same order everywhere in your program. If every thread locks mutex1 before mutex2, no cycle can form.
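Applied to the example above, the fix is a one-line change: thread_b acquires the mutexes in the same order as thread_a (sketched here as thread_b_fixed):
void thread_b_fixed()
{
    std::lock_guard<std::mutex> lock1(mutex1); // same order as thread_a
    // ... some work ...
    std::lock_guard<std::mutex> lock2(mutex2); // no circular wait can form
}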
When you need to lock multiple mutexes and can’t guarantee consistent order, use std::scoped_lock:
void safe_function()
{
std::scoped_lock lock(mutex1, mutex2); // locks both without deadlock risk
// ... do work with both protected resources ...
}
std::scoped_lock uses a deadlock-avoidance algorithm internally, acquiring both mutexes without risk of circular waiting. This is why it’s preferred over std::lock_guard in modern C++.
Atomics: Lock-Free Simplicity
For simple operations on simple data, mutexes might be overkill. Atomic types provide lock-free thread safety for individual values.
An atomic operation completes entirely before any other thread can observe its effects. There’s no intermediate state visible to other threads.
#include <iostream>
#include <thread>
#include <atomic>
std::atomic<int> counter{0};
void increment_many_times()
{
for (int i = 0; i < 100000; ++i)
++counter; // atomic increment
}
int main()
{
std::thread t1(increment_many_times);
std::thread t2(increment_many_times);
t1.join();
t2.join();
std::cout << "Counter: " << counter << "\n";
return 0;
}
No mutex, no lock guard, yet the result is always 200,000. The std::atomic<int> ensures that increments are truly indivisible.
Atomics work best for single-variable operations: counters, flags, simple state machines. They’re faster than mutexes when contention is low because they use special hardware instructions instead of blocking. But they can’t protect complex operations involving multiple variables—for those, you need mutexes.
Common atomic types include std::atomic<bool>, std::atomic<int>, std::atomic<long>, and pointer types. The standard library provides convenient aliases like std::atomic_int and std::atomic_bool.
Use atomic types only when you need them. For uncontested single-threaded access, regular types are faster. Choose the right tool for each situation.
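One place where an atomic shines is a stop flag that one thread sets and another polls. The sketch below (the names stop_requested and worker are made up for illustration) also shows yield() keeping the polling loop from hogging a core:
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
std::atomic<bool> stop_requested{false};
void worker()
{
    while (!stop_requested)        // atomic read, safe without a mutex
    {
        // ... do a small unit of work ...
        std::this_thread::yield(); // give other threads a chance to run
    }
    std::cout << "Worker stopping.\n";
}
int main()
{
    std::thread t(worker);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    stop_requested = true;         // atomic write, observed by the worker
    t.join();
    return 0;
}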
Readers and Writers: Shared Locks
Consider a configuration object that’s read constantly but updated rarely. A regular mutex serializes all access—but why block readers from each other? Multiple threads can safely read simultaneously; only writes require exclusive access.
Shared mutexes support this pattern:
#include <iostream>
#include <thread>
#include <shared_mutex>
#include <vector>
std::shared_mutex data_mutex;
std::vector<int> data;
void reader(int id)
{
std::shared_lock<std::shared_mutex> lock(data_mutex); // shared access
std::cout << "Reader " << id << " sees " << data.size() << " elements\n";
}
void writer(int value)
{
std::unique_lock<std::shared_mutex> lock(data_mutex); // exclusive access
data.push_back(value);
std::cout << "Writer added " << value << "\n";
}
std::shared_lock acquires a shared lock—multiple threads can hold shared locks simultaneously. std::unique_lock on a shared mutex acquires an exclusive lock—no other locks (shared or exclusive) can be held at the same time.
Think of it like a library reading room. Multiple people can read the books simultaneously (shared locks). But when someone needs to reorganize the shelves (exclusive lock), everyone must leave until they’re done.
The rules:
- While any reader holds a shared lock, writers must wait
- While a writer holds an exclusive lock, everyone waits
- Multiple readers can work simultaneously
This pattern maximizes throughput for read-heavy workloads where writes are infrequent.
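A small driver for the functions above might launch one writer alongside several readers; whether the readers observe the data before or after the write is up to the scheduler (this main() is only a sketch, with arbitrary thread counts and values):
int main()
{
    std::thread w(writer, 42);
    std::thread r1(reader, 1);
    std::thread r2(reader, 2);
    std::thread r3(reader, 3);
    w.join();
    r1.join();
    r2.join();
    r3.join();
    return 0;
}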
Varieties of Mutex
Beyond the basic std::mutex and std::shared_mutex, the standard library offers specialized variants:
std::timed_mutex allows you to give up waiting after a timeout:
std::timed_mutex mtx;
if (mtx.try_lock_for(std::chrono::milliseconds(100)))
{
// acquired the lock within 100ms
mtx.unlock();
}
else
{
// couldn't acquire in time, do something else
}
std::recursive_mutex allows the same thread to lock it multiple times. You must unlock the same number of times before another thread can acquire it. This is useful when protected functions call other protected functions:
std::recursive_mutex rmtx;
void inner();   // forward declaration so outer() can call it
void outer()
{
std::lock_guard<std::recursive_mutex> lock(rmtx);
// ... work ...
inner(); // also locks rmtx, but that's okay
}
void inner()
{
std::lock_guard<std::recursive_mutex> lock(rmtx);
// ... more work ...
}
With a regular mutex, inner() would try to lock a mutex its own thread already holds, which is undefined behavior and in practice usually a deadlock. With a recursive mutex, the same thread can lock it again.
Coordinating Threads: Condition Variables
Sometimes a thread must wait for a specific condition before proceeding. You could loop repeatedly checking:
while (!ready)
{
std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
This works but wastes CPU cycles and introduces latency. Condition variables provide efficient waiting.
A condition variable allows one thread to signal others that something has changed. Waiting threads sleep until notified, consuming no CPU:
#include <iostream>
#include <thread>
#include <mutex>
#include <condition_variable>
std::mutex mtx;
std::condition_variable cv;
bool ready = false;
void worker()
{
std::unique_lock<std::mutex> lock(mtx);
cv.wait(lock, []{ return ready; }); // sleep until ready is true
std::cout << "Worker proceeding!\n";
}
void controller()
{
std::this_thread::sleep_for(std::chrono::seconds(1));
{
std::lock_guard<std::mutex> lock(mtx);
ready = true;
}
cv.notify_one(); // wake one waiting thread
}
int main()
{
std::thread t(worker);
controller();
t.join();
return 0;
}
The worker thread calls cv.wait(), which atomically releases the mutex and suspends the thread. When controller() calls notify_one(), the worker wakes up, reacquires the mutex, checks the condition, and proceeds.
The lambda []{ return ready; } is the predicate. The wait() function checks this predicate after each wakeup and only returns when it evaluates to true. This guards against spurious wakeups—rare events where a thread wakes without being notified. Always use a predicate with condition variables.
Use notify_one() to wake a single waiting thread. Use notify_all() when multiple threads might be waiting and all should check their conditions:
cv.notify_all(); // wake all waiting threads
The condition variable pattern appears constantly in concurrent programming: producer-consumer queues, thread pools, event systems, and more.
Getting Results Back: Futures and Promises
We’ve focused on threads as parallel workers. But how do you get results from them?
Passing references and having threads modify them works, but it’s clunky. C++ offers a cleaner abstraction: futures and promises.
Think of ordering food at a restaurant. You place an order and the kitchen promises to prepare it (the promise). You receive an order number (the future). You can do other things while waiting. When your number is called, you present it to collect your food (call get() on the future). If you show up before the food is ready, you wait.
#include <iostream>
#include <thread>
#include <future>
void compute(std::promise<int> result_promise)
{
int answer = 6 * 7; // imagine this takes time
result_promise.set_value(answer);
}
int main()
{
std::promise<int> promise;
std::future<int> future = promise.get_future();
std::thread t(compute, std::move(promise));
std::cout << "Waiting for result...\n";
int result = future.get(); // blocks until value is set
std::cout << "The answer is: " << result << "\n";
t.join();
return 0;
}
A std::promise is a write-once container: one thread calls set_value(). A std::future is the corresponding read-once container: another thread calls get() to retrieve that value, blocking if necessary until it’s available.
Important details:
- A future’s get() can only be called once
- If you need multiple consumers, use std::shared_future
- If the promise is destroyed before setting a value, get() throws an exception
For multiple consumers:
std::promise<int> promise;
std::shared_future<int> future = promise.get_future().share();
// Now multiple threads can wait on this future
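A sketch of that in action, with two consumer threads each holding its own copy of the shared_future (the thread bodies and the value 42 are arbitrary):
#include <future>
#include <iostream>
#include <thread>
int main()
{
    std::promise<int> promise;
    std::shared_future<int> shared = promise.get_future().share();
    // Each lambda captures its own copy of the shared_future,
    // so both threads can call get() safely.
    std::thread t1([shared]{ std::cout << "Consumer 1 got " << shared.get() << "\n"; });
    std::thread t2([shared]{ std::cout << "Consumer 2 got " << shared.get() << "\n"; });
    promise.set_value(42);
    t1.join();
    t2.join();
    return 0;
}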
The Easy Path: std::async
Creating threads manually, managing promises, joining at the end—it’s mechanical. std::async automates it:
#include <iostream>
#include <future>
int compute()
{
return 6 * 7;
}
int main()
{
std::future<int> future = std::async(compute);
std::cout << "Computing...\n";
int result = future.get();
std::cout << "Result: " << result << "\n";
return 0;
}
std::async launches the function (potentially in a new thread), returning a future. No explicit thread creation, no promise management, no join call.
By default, the system decides whether to run the function in a new thread or defer execution until you call get(). You can control this with launch policies:
// Force a new thread
auto future1 = std::async(std::launch::async, compute);
// Defer until get() is called (runs in the calling thread)
auto future2 = std::async(std::launch::deferred, compute);
// Let the system decide (default)
auto future3 = std::async(std::launch::async | std::launch::deferred, compute);
Use std::launch::async when you want guaranteed parallelism. Use std::launch::deferred for lazy evaluation.
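To see the deferred behavior directly, you can check the future's status and compare thread IDs. This is a small sketch for demonstration, not something you would normally need in real code:
#include <chrono>
#include <future>
#include <iostream>
#include <thread>
int main()
{
    auto deferred = std::async(std::launch::deferred, []{
        return std::this_thread::get_id(); // runs on whichever thread calls get()
    });
    // A deferred task has not started yet; wait_for reports that immediately.
    if (deferred.wait_for(std::chrono::seconds(0)) == std::future_status::deferred)
        std::cout << "Nothing has run yet.\n";
    bool same_thread = (deferred.get() == std::this_thread::get_id());
    std::cout << std::boolalpha << "Ran on the calling thread: " << same_thread << "\n";
    return 0;
}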
std::async works with any callable:
// Function pointer
auto f1 = std::async(std::launch::async, some_function, arg1, arg2);
// Lambda
auto f2 = std::async(std::launch::async, [](int x){ return x * 2; }, 21);
// Member function
MyClass obj;
auto f3 = std::async(std::launch::async, &MyClass::method, &obj, arg);
For quick parallel tasks where you need the result, std::async is often the cleanest choice.
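As a closing example, here is a sketch of splitting one computation into two async tasks and combining their results. The data, the two-way split, and the sum_range helper are arbitrary choices for illustration:
#include <cstddef>
#include <future>
#include <iostream>
#include <numeric>
#include <vector>
long long sum_range(const std::vector<int>& data, std::size_t begin, std::size_t end)
{
    return std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
}
int main()
{
    std::vector<int> data(1'000'000, 1);
    std::size_t mid = data.size() / 2;
    // Each half is summed in its own task; std::cref avoids copying the vector.
    auto first  = std::async(std::launch::async, sum_range, std::cref(data), std::size_t{0}, mid);
    auto second = std::async(std::launch::async, sum_range, std::cref(data), mid, data.size());
    long long total = first.get() + second.get(); // collect both partial sums
    std::cout << "Total: " << total << "\n";
    return 0;
}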
When to Use Threads vs Tasks
You now have two approaches:
Use threads directly when:
- You need fine-grained control over thread lifetime
- You’re building long-running background workers
- You need to manage complex synchronization with mutexes and condition variables
- You’re implementing a thread pool or custom scheduler
Use async/futures when:
- You have discrete tasks that return results
- You want simple parallel execution
- You don’t need manual thread management
- You’re running short, independent computations
In practice, start with std::async for its simplicity. Graduate to manual threads when you need more control.
Practical Pattern: Producer-Consumer Queue
Let’s combine concepts into a useful pattern. A producer-consumer queue connects threads that produce work with threads that consume it:
#include <iostream>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <queue>
template<typename T>
class ThreadSafeQueue
{
std::queue<T> queue_;
std::mutex mutex_;
std::condition_variable cv_;
public:
void push(T value)
{
{
std::lock_guard<std::mutex> lock(mutex_);
queue_.push(std::move(value));
}
cv_.notify_one();
}
T pop()
{
std::unique_lock<std::mutex> lock(mutex_);
cv_.wait(lock, [this]{ return !queue_.empty(); });
T value = std::move(queue_.front());
queue_.pop();
return value;
}
};
ThreadSafeQueue<int> work_queue;
void producer()
{
for (int i = 0; i < 5; ++i)
{
work_queue.push(i);
std::cout << "Produced: " << i << "\n";
}
}
void consumer()
{
for (int i = 0; i < 5; ++i)
{
int item = work_queue.pop();
std::cout << "Consumed: " << item << "\n";
}
}
int main()
{
std::thread prod(producer);
std::thread cons(consumer);
prod.join();
cons.join();
return 0;
}
The producer pushes items and notifies the condition variable. The consumer waits on the condition variable until items are available, then processes them. This decouples production from consumption and handles varying rates gracefully.
What You’ve Learned
You began knowing nothing about concurrency. Now you understand:
- Threads are independent flows of execution within a process, each with its own call stack but sharing memory
- Race conditions occur when threads access shared data without synchronization
- Mutexes provide mutual exclusion, ensuring only one thread accesses protected data at a time
- Lock guards automatically manage mutex lifetime, preventing forgotten unlocks and handling exceptions
- Deadlocks occur when threads wait circularly for each other’s resources
- Atomics offer lock-free thread safety for simple operations on single values
- Shared locks allow multiple readers but only one writer
- Condition variables let threads wait efficiently for specific conditions
- Futures and promises communicate results between threads
- std::async simplifies launching parallel work
You’ve seen the dangers—race conditions can corrupt data silently, deadlocks can freeze your program forever—and you’ve learned the tools to avoid them.
Parting Wisdom
Concurrency is challenging. Bugs hide until the worst possible moment. Testing is hard because timing varies between runs, between machines, between debug and release builds. A race condition might appear once in a million runs, then strike during a critical demonstration.
Start simple. Use std::async when you can. Prefer immutable data that never changes after creation—no synchronization needed if nothing ever writes. When you must share mutable state, protect it carefully with the appropriate mutex. Minimize the time locks are held. Avoid nested locks when possible; when you can’t avoid them, use std::scoped_lock to prevent deadlocks.
And test. Test with many threads. Test on different machines. Test under load. Use thread sanitizers and static analysis tools that can detect race conditions.
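With GCC or Clang on Linux, for example, a threaded program is typically built with -pthread, and ThreadSanitizer can be enabled with -fsanitize=thread. The exact flags depend on your toolchain and platform, and counter.cpp here stands in for your own source file:
g++ -std=c++17 -pthread -fsanitize=thread -g counter.cpp -o counter
./counter   # data races are reported at runtime with stack traces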
The parallel path awaits. Walk it carefully.