Demystifying Python Concurrency: Benchmarking Threads, Processes, and Async (and the GIL's Role)

Alright, buckle up buttercups, because we're diving headfirst into the murky depths of Python concurrency. You know, that thing you think you understand after reading a few LinkedIn articles by "Thought Leaders™" who've never actually deployed anything beyond a Flask "Hello, World!" app. We're going to cut through the noise, the buzzwords, and the influencer fluff and get down to brass tacks. We're talking real, actionable knowledge you can use to actually make your code faster. And yes, we'll be talking about that thing: the GIL.

The GIL: The Bane of Your CPU-Bound Existence (and Why It's Not Always Evil)

The Global Interpreter Lock (GIL). Dun dun duuuun! It's the Voldemort of Python concurrency. The thing everyone whispers about in hushed tones. But what is it? Simply put, the GIL is a mutex that allows only one thread to hold control of the Python interpreter at any given time. This means that even if you have a multi-core CPU, only one thread can execute Python bytecode at a time.
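
You can even poke at the GIL's scheduling from the stdlib. A tiny sketch (getswitchinterval and setswitchinterval are real sys functions; the 5 ms default is CPython's):

import sys

# CPython asks the running thread to give up the GIL every "switch interval".
print(sys.getswitchinterval())  # 0.005 seconds by default

# You can shrink it for snappier switching, but that only changes how often
# threads trade the lock, not the fact that one thread runs bytecode at a time.
sys.setswitchinterval(0.001)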

Why does it exist? Historical reasons, mostly. It keeps CPython's memory management simple: reference counting isn't thread-safe on its own, and one big lock is cheaper than taking a fine-grained lock on every object. It also makes C extensions easier to write. Removing it is a Herculean task that's been attempted (and largely failed) numerous times.

Impact: This is where the fun begins. The GIL primarily affects CPU-bound tasks. If your code spends most of its time crunching numbers, doing heavy computations, or generally making your CPU sweat, the GIL will be a bottleneck. You won't get true parallelism from threads.

However, the GIL has a much smaller impact on I/O-bound tasks. If your code spends most of its time waiting on network requests, disk reads/writes, or other external operations, the GIL is less of a concern: CPython releases it around blocking calls, so a waiting thread doesn't hold the others up.
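
Here's a minimal sketch of that in action, with time.sleep standing in for a blocking I/O call (CPython releases the GIL around it, just as it does around socket reads):

import threading
import time

def wait_for_io():
    time.sleep(1)  # blocking call: CPython releases the GIL while sleeping

start = time.perf_counter()
threads = [threading.Thread(target=wait_for_io) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{time.perf_counter() - start:.2f}s")  # ~1s, not ~4s: the waits overlapped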

Concurrency Models: A Culinary Analogy

Let's think about concurrency like running a restaurant.

  • Threading: Imagine a single kitchen (the Python interpreter) with multiple chefs (threads). Only one chef can use the stove (execute Python bytecode) at a time because, well, there's only one stove. They can share ingredients and tools, but only one chef can actively cook. This is great for I/O-bound tasks where chefs spend time waiting for ingredients to arrive (network requests), but not so great for CPU-bound tasks where everyone wants to use the stove at the same time.

  • Multiprocessing: Now imagine multiple kitchens (separate Python processes), each with its own stove and chefs. Each kitchen operates independently. This is true parallelism. Great for CPU-bound tasks because each process has its own GIL (or rather, doesn't care about the GIL of other processes). Communication between kitchens (processes) is more complex (using things like queues or pipes), but it's worth it for the performance gains.

  • Asyncio: This is like a single chef (the event loop) who's incredibly efficient. They juggle multiple tasks, switching between them whenever one task is waiting for something (e.g., an ingredient to arrive). They don't actually do things in parallel, but they can handle many tasks concurrently by using non-blocking I/O. Think of it as hyper-efficient task switching; the short sketch after this list shows the idea.
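
To make the single-chef picture concrete, here's a minimal sketch (chef and dish names are just the analogy; asyncio.sleep stands in for waiting on an ingredient):

import asyncio

async def chef(dish, minutes):
    print(f"start {dish}")
    await asyncio.sleep(minutes)  # waiting on an ingredient: the chef switches dishes here
    print(f"finish {dish}")

async def kitchen():
    # One chef (the event loop) juggles three dishes concurrently.
    await asyncio.gather(chef("soup", 1), chef("pasta", 1), chef("salad", 1))

asyncio.run(kitchen())  # done in about 1 second, not 3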

Benchmarking: Let's Get Our Hands Dirty

Alright, enough theory. Let's write some code and see this stuff in action. We'll benchmark a CPU-bound task (calculating prime numbers) and an I/O-bound task (making HTTP requests) using threads, processes, and asyncio.

Benchmarking Setup:

  • Python 3.7+ (asyncio.run needs at least that; duh)
  • timeit module for measuring execution time
  • requests library for synchronous HTTP requests and aiohttp for the async ones (install with pip install requests aiohttp)
  • A machine with multiple CPU cores (because why not?)

CPU-Bound Task: Prime Number Calculation

Here's the code:

import timeit
import threading
import multiprocessing
import asyncio

def is_prime(n):
    """Naive prime number checker."""
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

def cpu_bound_task(n):
    """Finds prime numbers up to n."""
    primes = [x for x in range(2, n) if is_prime(x)]
    return len(primes)

N = 15000 # Adjust this value to change the workload

# Threading
def thread_cpu_bound(num_threads):
    threads = []
    for i in range(num_threads):
        thread = threading.Thread(target=cpu_bound_task, args=(N,))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()

# Multiprocessing
def process_cpu_bound(num_processes):
    processes = []
    for i in range(num_processes):
        process = multiprocessing.Process(target=cpu_bound_task, args=(N,))
        processes.append(process)
        process.start()
    for process in processes:
        process.join()

# Asyncio (not suitable for CPU-bound tasks, but let's see the performance)
async def async_cpu_bound_task(n):
    # No await inside, so the event loop can't switch away: these "concurrent"
    # tasks actually run one after another.
    return cpu_bound_task(n)

async def asyncio_cpu_bound(num_tasks):
    tasks = [async_cpu_bound_task(N) for _ in range(num_tasks)]
    await asyncio.gather(*tasks)


def benchmark_cpu_bound():
    num_threads = 4
    num_processes = 4
    num_async_tasks = 4

    print(f"CPU-Bound Task (N={N}):")

    # Single-threaded baseline: the same total work, run back to back.
    sequential_time = timeit.timeit(lambda: [cpu_bound_task(N) for _ in range(num_threads)], number=3)
    print(f"  Sequential ({num_threads} runs): {sequential_time:.4f} seconds")

    threading_time = timeit.timeit(lambda: thread_cpu_bound(num_threads), number=3)
    print(f"  Threading ({num_threads} threads): {threading_time:.4f} seconds")

    multiprocessing_time = timeit.timeit(lambda: process_cpu_bound(num_processes), number=3)
    print(f"  Multiprocessing ({num_processes} processes): {multiprocessing_time:.4f} seconds")

    asyncio_time = timeit.timeit(lambda: asyncio.run(asyncio_cpu_bound(num_async_tasks)), number=3)
    print(f"  Asyncio ({num_async_tasks} tasks): {asyncio_time:.4f} seconds")


if __name__ == "__main__":
    benchmark_cpu_bound()

I/O-Bound Task: Making HTTP Requests

Now, let's benchmark an I/O-bound task. We'll make multiple HTTP requests to a dummy endpoint.

import timeit
import threading
import multiprocessing
import asyncio
import aiohttp
import requests

URL = "https://httpbin.org/delay/1"  # A dummy endpoint that delays the response by 1 second

def make_request():
    """One blocking HTTP request; the thread and process versions share this."""
    requests.get(URL)

# Threading
def thread_io_bound(num_threads):
    threads = []
    for _ in range(num_threads):
        thread = threading.Thread(target=make_request)
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()


# Multiprocessing
def process_io_bound(num_processes):
    # make_request lives at module level on purpose: with the "spawn" start
    # method (the default on Windows and macOS), process targets get pickled,
    # and nested functions can't be.
    processes = []
    for _ in range(num_processes):
        process = multiprocessing.Process(target=make_request)
        processes.append(process)
        process.start()
    for process in processes:
        process.join()


# Asyncio
async def async_io_bound_task():
    async with aiohttp.ClientSession() as session:
        async with session.get(URL) as response:
            await response.read()

async def asyncio_io_bound(num_tasks):
    tasks = [async_io_bound_task() for _ in range(num_tasks)]
    await asyncio.gather(*tasks)


def benchmark_io_bound():
    num_threads = 4
    num_processes = 4
    num_async_tasks = 4

    print(f"\nI/O-Bound Task (URL: {URL}):")

    # Single-threaded baseline: the same requests, one after another.
    sequential_time = timeit.timeit(lambda: [make_request() for _ in range(num_threads)], number=3)
    print(f"  Sequential ({num_threads} requests): {sequential_time:.4f} seconds")

    threading_time = timeit.timeit(lambda: thread_io_bound(num_threads), number=3)
    print(f"  Threading ({num_threads} threads): {threading_time:.4f} seconds")

    multiprocessing_time = timeit.timeit(lambda: process_io_bound(num_processes), number=3)
    print(f"  Multiprocessing ({num_processes} processes): {multiprocessing_time:.4f} seconds")

    asyncio_time = timeit.timeit(lambda: asyncio.run(asyncio_io_bound(num_async_tasks)), number=3)
    print(f"  Asyncio ({num_async_tasks} tasks): {asyncio_time:.4f} seconds")


if __name__ == "__main__":
    benchmark_io_bound()

Analysis: The Numbers Don't Lie (Mostly)

Run the code and observe the results.

CPU-Bound Task:

  • You'll likely see that multiprocessing significantly outperforms threading, because separate processes sidestep the GIL entirely. threading will probably land close to (or slightly behind) the sequential baseline, thanks to thread-management and context-switching overhead. asyncio will be in the same ballpark as the sequential run: with no await points in the coroutine, the tasks simply execute one after another on the event loop.

I/O-Bound Task:

  • threading and asyncio will likely perform similarly, and both far better than the sequential baseline, because they overlap many waits at once. multiprocessing also beats sequential, but it gains nothing from bypassing the GIL here (threads already release it while waiting), and the overhead of spawning processes can make it the slowest of the three concurrent approaches for short-lived I/O operations.

Important Caveats:

  • These results can vary depending on your hardware, operating system, and Python version.
  • The specific task being benchmarked can also affect the results.
  • Micro-benchmarks can be misleading. Always benchmark your actual application code.

Decision Guide: Choosing the Right Tool for the Job

  • CPU-Bound Tasks: multiprocessing is your friend. Embrace the process.
  • I/O-Bound Tasks: asyncio is often the best choice for modern Python. threading can also be a good option, especially if you're working with legacy code or libraries that don't support asyncio.
  • Mixed Workloads: Consider using a combination of techniques. For example, you might use multiprocessing to handle CPU-intensive tasks and asyncio to handle I/O-bound tasks; a sketch of that pattern follows this list.
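
For the mixed-workload case, here's one minimal pattern using only the stdlib (crunch is a made-up stand-in for your CPU-heavy function; swap in your real work):

import asyncio
from concurrent.futures import ProcessPoolExecutor

def crunch(n):
    """Stand-in CPU-bound task."""
    return sum(i * i for i in range(n))

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # Ship CPU-bound work to worker processes; the event loop stays
        # free to service I/O-bound coroutines in the meantime.
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, crunch, 2_000_000) for _ in range(4))
        )
    print(results)

if __name__ == "__main__":
    asyncio.run(main())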

Real-World Applications and Common Pitfalls

  • Web Servers: asyncio is commonly used in modern web frameworks like FastAPI and Starlette to handle a large number of concurrent requests.
  • Data Processing: multiprocessing can be used to parallelize data processing tasks, such as image processing or scientific simulations.
  • Concurrency Pitfalls:
    • Race Conditions: Be careful when sharing mutable data between threads or processes. Use locks or other synchronization mechanisms to prevent race conditions (see the sketch after this list).
    • Deadlocks: Avoid circular dependencies between locks.
    • Process Communication Overhead: Be mindful of the overhead of communicating data between processes.
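
Here's a minimal sketch of the race-condition pitfall and its fix (the shared counter is contrived on purpose):

import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # counter += 1 is a read-modify-write; without the lock, a thread
        # switch between the read and the write silently loses updates.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 every time with the lock; without it, possibly less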

Conclusion: Go Forth and Concur (Responsibly)

Concurrency is a powerful tool, but it's also a complex beast. Understanding the GIL, the different concurrency models, and their trade-offs is essential for writing efficient and scalable Python code. So, ditch the LinkedIn fluff, roll up your sleeves, and start experimenting. And remember, always benchmark your code to see what actually works best for your specific use case. Now go forth and concur... responsibly! Or don't. I'm not your supervisor.
