Demystifying Python Concurrency: Benchmarking Threads (GIL Enabled vs. Disabled), Processes, and Async

buddha gautam

Alright, folks. Let's talk about Python concurrency. And by "talk," I mean "I'm going to rant about it while showing you some actual data." We're diving deep into the GIL, threads (with and without that pesky GIL), processes, and async, and we're going to benchmark the living hell out of them. Prepare for graphs. Prepare for truth. Prepare for me to question my life choices.

The GIL: Python's Arch-Nemesis (or, Why Your CPU Isn't Doing What You Think It's Doing)

The Global Interpreter Lock (GIL). You've heard the whispers. The boogeyman of Python performance. In short, it's a mutex that allows only one thread to hold control of the Python interpreter at any given time. This means that even if you have a shiny new multi-core CPU, your Python code might be running on only one core at a time.

Why does it exist? Historical reasons. It protects CPython's reference-counting memory management and makes C extensions much easier to write safely. Is it ideal? Absolutely not. But it's what we've got (until now... more on that later).
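You can see the GIL in action with a trivial timing experiment (a hypothetical sketch, not part of the benchmark suite): two threads doing pure-Python CPU work take roughly as long as doing the same work sequentially, because only one of them can hold the interpreter at a time.

import threading
import time

def busy_work(n=10_000_000):
    # Pure-Python CPU work; holds the GIL almost the entire time.
    total = 0
    for i in range(n):
        total += i
    return total

# Sequential: two chunks of work, one after the other.
start = time.perf_counter()
busy_work()
busy_work()
print(f"Sequential: {time.perf_counter() - start:.2f}s")

# Threaded: two threads, but (with the GIL) roughly the same wall time.
start = time.perf_counter()
threads = [threading.Thread(target=busy_work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Threaded:   {time.perf_counter() - start:.2f}s")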

The Contenders: Threads, Processes, and Async (Oh My!)

We're going to pit these concurrency methods against each other in two scenarios: CPU-bound tasks and I/O-bound tasks.

  • Threads (GIL Enabled): The standard Python threading library. Good for I/O, terrible for CPU.

  • Threads (GIL Disabled - Experimental): Python 3.13 ships an experimental free-threaded build (PEP 703) that can run with the GIL disabled. Let's see if it lives up to the hype! (See the quick check after this list for how to tell which mode you're actually in.)

  • Processes (Multiprocessing): Spawns separate Python processes, each with its own interpreter and memory space. Bypasses the GIL, great for CPU-bound tasks, but has overhead.

  • Asyncio: Single-threaded, concurrent execution using coroutines. Excellent for I/O-bound tasks, requires careful coding to avoid blocking the event loop.
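Before benchmarking anything, it's worth confirming which mode your interpreter is really running in. A minimal check follows; note that sys._is_gil_enabled() only exists on 3.13+, and that on free-threaded builds the GIL can still be forced back on at startup with -X gil=1 or the PYTHON_GIL=1 environment variable.

import sys
import sysconfig

# Was this interpreter built with free-threading support (PEP 703)?
print("Free-threaded build:", bool(sysconfig.get_config_var("Py_GIL_DISABLED")))

# Is the GIL actually enabled right now? Pre-3.13 interpreters lack the
# helper entirely, and there the answer is always True.
print("GIL enabled:", getattr(sys, "_is_gil_enabled", lambda: True)())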

Benchmarking Setup: The Nitty-Gritty (aka, Show Me the Code!)

import argparse
import asyncio
import sys

from runners import (
    run_threading_cpu,
    run_multiprocessing_cpu,
    run_threading_io,
    run_asyncio_io,
)

def check_gil_enabled():
    # sys._is_gil_enabled() was added in Python 3.13; on older versions
    # the GIL is always on, so fall back to True.
    try:
        return sys._is_gil_enabled()
    except AttributeError:
        return True

def main():
    parser = argparse.ArgumentParser(description="Benchmark CPU/IO tasks with concurrency.")
    parser.add_argument("--task", choices=["cpu", "io"], required=True, help="Type of benchmark to run.")
    parser.add_argument("--method", choices=["threading", "multiprocessing", "asyncio"], required=True, help="Concurrency method.")
    parser.add_argument("--workers", type=int, default=4, help="Number of workers/threads/processes/tasks.")
    parser.add_argument("--n", type=int, default=100000, help="Upper limit for CPU-bound task (prime counting).")
    parser.add_argument("--duration", type=float, default=0.01, help="Sleep duration for IO-bound task in seconds.")
    parser.add_argument("--json", action="store_true", help="Output results in JSON format")

    args = parser.parse_args()

    gil_status = check_gil_enabled()
    results = {
        "gil_enabled": gil_status,
        "task": args.task,
        "method": args.method,
        "workers": args.workers,
        "duration": None,
        "total_primes": None
    }

    if args.task == "cpu":
        if args.method == "threading":
            duration, total = run_threading_cpu(args.n, args.workers)
            results["duration"] = duration
            results["total_primes"] = total
        elif args.method == "multiprocessing":
            duration, total = run_multiprocessing_cpu(args.n, args.workers)
            results["duration"] = duration
            results["total_primes"] = total
        else:
            print("Asyncio is not suitable for CPU-bound tasks.")
            sys.exit(1)

    elif args.task == "io":
        if args.method == "threading":
            duration = run_threading_io(args.workers, args.duration)
            results["duration"] = duration
        elif args.method == "asyncio":
            duration = asyncio.run(run_asyncio_io(args.workers, args.duration))
            results["duration"] = duration
        else:
            print("Multiprocessing is generally not used for IO-bound tasks.")
            sys.exit(1)

    if args.json:
        import json
        print(json.dumps(results))
    else:
        # Traditional output format
        print(f"GIL Enabled: {gil_status}")
        if args.task == "cpu":
            print(f"{args.method.capitalize()} CPU-bound took {results['duration']:.3f}s, total primes: {results['total_primes']}")
        else:
            print(f"{args.method.capitalize()} IO-bound took {results['duration']:.3f}s")

if __name__ == "__main__":
    main()

This script lets you choose between benchmarking CPU-bound tasks (like counting prime numbers for no good reason) and IO-bound tasks (like pretending to sleep on the job using time.sleep). Because nothing says productivity like simulating waiting.
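One gap worth filling: the runners module the script imports isn't shown anywhere. Here's a minimal sketch of what those four functions might look like. The names match the imports above, but the chunking strategy and the deliberately naive prime test are my assumptions:

import asyncio
import threading
import time
from multiprocessing import Pool

def is_prime(n: int) -> bool:
    # Deliberately naive trial division: the point is to burn CPU.
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

def count_primes(start: int, end: int) -> int:
    return sum(1 for n in range(start, end) if is_prime(n))

def _chunks(n: int, workers: int):
    # Split [0, n) into `workers` roughly equal half-open ranges.
    step = n // workers
    return [(i * step, n if i == workers - 1 else (i + 1) * step)
            for i in range(workers)]

def run_threading_cpu(n: int, workers: int):
    results = [0] * workers

    def worker(idx: int, lo: int, hi: int):
        results[idx] = count_primes(lo, hi)

    start = time.perf_counter()
    threads = [threading.Thread(target=worker, args=(i, lo, hi))
               for i, (lo, hi) in enumerate(_chunks(n, workers))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, sum(results)

def run_multiprocessing_cpu(n: int, workers: int):
    start = time.perf_counter()
    with Pool(workers) as pool:
        results = pool.starmap(count_primes, _chunks(n, workers))
    return time.perf_counter() - start, sum(results)

def run_threading_io(workers: int, duration: float):
    start = time.perf_counter()
    threads = [threading.Thread(target=time.sleep, args=(duration,))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

async def run_asyncio_io(workers: int, duration: float):
    start = time.perf_counter()
    await asyncio.gather(*(asyncio.sleep(duration) for _ in range(workers)))
    return time.perf_counter() - start

With both files in place, a typical invocation looks like python3.13 main.py --task cpu --method threading --workers 8 --json.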

Performance Benchmark (Prime Counter Edition)

This one brute-forces its way through checking for prime numbers — not because the world needs more primes, but because it's a classic way to hog the CPU. The goal isn’t mathematical insight; it’s to simulate CPU-heavy computation across threads or processes and observe how Python handles it under the GIL (or without it). Expect this to burn cycles, max out cores, and make your laptop sound like it’s trying to take off.

import matplotlib.pyplot as plt
import subprocess
import json
import pandas as pd
from typing import List, Dict
import time

def run_experiment(task_type: str, method: str, workers_list: List[int], python_version: str = "3.12", num_runs: int = 5) -> List[Dict]:
    """
    Run the benchmark experiment multiple times and collect results
    Args:
        task_type: "cpu" or "io"
        method: "threading", "multiprocessing", or "asyncio"
        workers_list: List of worker counts to test
        python_version: "3.12" (with GIL) or "3.13" (assumed here to be a free-threaded, no-GIL build)
        num_runs: Number of times to repeat each experiment
    """
    results = []

    for workers in workers_list:
        for run in range(num_runs):
            if task_type == "cpu":
                cmd = [f"python{python_version}", "main.py", "--task", "cpu", "--method", method, 
                      "--workers", str(workers), "--n", "10000", "--json"]
            else:
                cmd = [f"python{python_version}", "main.py", "--task", "io", "--method", method,
                      "--workers", str(workers), "--duration", "0.1", "--json"]

            # Print the command being run
            print(f"\nRunning: {' '.join(cmd)}")

            output = subprocess.run(cmd, capture_output=True, text=True)

            try:
                result = json.loads(output.stdout)
                result['run'] = run  # Add run number to results
                result['python_version'] = python_version
                result['gil_enabled'] = python_version == "3.12"  # True for 3.12, False for 3.13
                results.append(result)
            except json.JSONDecodeError:
                print(f"Error parsing output: {output.stdout}")
                print(f"Stderr: {output.stderr}")

            # Small delay between runs
            time.sleep(0.1)

    return results

def create_visualizations(results: List[Dict]):
    """
    Create visualizations comparing different concurrency methods
    """
    df = pd.DataFrame(results)

    # Set style
    plt.style.use('default')

    # Create figure with subplots
    fig = plt.figure(figsize=(15, 10))

    # 1. Performance by Workers for each task type and Python version
    ax1 = plt.subplot(221)
    for task in df['task'].unique():
        task_data = df[df['task'] == task]
        for method in task_data['method'].unique():
            for version in task_data['python_version'].unique():
                data = task_data[(task_data['method'] == method) & 
                                (task_data['python_version'] == version)]
                means = data.groupby('workers')['duration'].mean()
                std = data.groupby('workers')['duration'].std()
                label = f"{task}-{method}-Python{version}"
                ax1.errorbar(means.index, means.values, yerr=std.values, 
                           label=label, marker='o')

    ax1.set_title('Performance by Number of Workers')
    ax1.set_xlabel('Number of Workers')
    ax1.set_ylabel('Duration (seconds)')
    ax1.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    ax1.grid(True)

    # 2. Method Comparison (Box Plot)
    ax2 = plt.subplot(222)
    df.boxplot(column='duration', by=['method', 'python_version'], ax=ax2)
    ax2.set_title('Duration Distribution by Method and Python Version')
    ax2.set_ylabel('Duration (seconds)')
    plt.xticks(rotation=45)

    # 3. Speedup Analysis
    ax3 = plt.subplot(223)
    for task in df['task'].unique():
        task_data = df[df['task'] == task]
        for method in task_data['method'].unique():
            for version in task_data['python_version'].unique():
                data = task_data[(task_data['method'] == method) & 
                                (task_data['python_version'] == version)]
                baseline = data[data['workers'] == min(data['workers'])]['duration'].mean()
                speedup = data.groupby('workers')['duration'].mean().apply(lambda x: baseline / x)
                label = f"{task}-{method}-Python{version}"
                ax3.plot(speedup.index, speedup.values, label=label, marker='o')

    ax3.set_title('Speedup Analysis')
    ax3.set_xlabel('Number of Workers')
    ax3.set_ylabel('Speedup Factor')
    ax3.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    ax3.grid(True)

    # 4. GIL Impact Analysis (Python Version Comparison)
    ax4 = plt.subplot(224)
    version_comparison = df.groupby(['method', 'python_version'])['duration'].mean().unstack()
    version_comparison.plot(kind='bar', ax=ax4)
    ax4.set_title('Average Duration by Python Version (GIL vs No GIL)')
    ax4.set_xlabel('Method')
    ax4.set_ylabel('Duration (seconds)')
    plt.xticks(rotation=45)

    plt.tight_layout()
    plt.savefig('performance_analysis.png', dpi=300, bbox_inches='tight')
    plt.close()

    # Print summary statistics
    print("\nSummary Statistics:")
    for task in df['task'].unique():
        print(f"\n{task.upper()} Tasks:")
        task_data = df[df['task'] == task]
        for method in task_data['method'].unique():
            for version in task_data['python_version'].unique():
                data = task_data[(task_data['method'] == method) & 
                                (task_data['python_version'] == version)]
                gil_status = "with GIL" if version == "3.12" else "no GIL"
                print(f"\n{method.capitalize()} (Python {version} {gil_status}):")
                print(f"Average duration: {data['duration'].mean():.3f} seconds")
                print(f"Standard deviation: {data['duration'].std():.3f} seconds")
                if task == 'cpu' and 'total_primes' in data:
                    print(f"Average primes found: {data['total_primes'].mean():.0f}")

if __name__ == "__main__":
    # Define worker configurations to test
    workers_list = [1, 2, 4, 8, 16]

    # Run experiments with Python 3.12 (with GIL)
    print("Running experiments with Python 3.12 (with GIL)...")
    results_3_12 = []
    for task_type in ["cpu", "io"]:
        for method in ["threading", "multiprocessing"] if task_type == "cpu" else ["threading", "asyncio"]:
            results_3_12.extend(run_experiment(task_type, method, workers_list, "3.12"))

    # Run experiments with Python 3.13 (no GIL)
    print("\nRunning experiments with Python 3.13 (no GIL)...")
    results_3_13 = []
    for task_type in ["cpu", "io"]:
        for method in ["threading", "multiprocessing"] if task_type == "cpu" else ["threading", "asyncio"]:
            results_3_13.extend(run_experiment(task_type, method, workers_list, "3.13"))

    # Combine all results
    all_results = results_3_12 + results_3_13

    # Create visualizations
    create_visualizations(all_results)
    print("\nVisualizations saved as 'performance_analysis.png'")

Fig: performance benchmark across Python 3.12 (GIL enabled) vs Python 3.13 (GIL disabled)

Performance Results: The Good, The Bad, and The Asynchronous

Here’s how our contenders fared in the performance tests.

  • For CPU-Bound Tasks (the prime numbers):

    • Multiprocessing was the undisputed king. It showed near-linear speedup as we added more processes, proving that for heavy computation, bypassing the GIL by using separate processes is the most effective strategy.

    • Threading (with GIL) was, as expected, terrible. The execution time remained flat regardless of how many threads we threw at it. The GIL ensured only one thread could compute at a time, rendering the extra cores useless.

    • Threading (no GIL) was the exciting one. It showed a significant performance boost over its GIL-enabled counterpart, scaling nicely with more threads. This is the poster child for the no-GIL promise: true parallelism for CPU-bound work within a single process.

  • For I/O-Bound Tasks (the waiting game):

    • Asyncio was the star performer. It handled waiting on multiple tasks with minimal overhead, making it the fastest and most efficient choice for I/O-heavy applications.

    • Threading (with or without GIL) also performed very well. Since the GIL is released during I/O operations anyway, both versions scaled effectively. This confirms that for I/O tasks, standard threading remains a simple and very viable option.

So far, so good. The no-GIL mode seems to deliver on its promise for CPU-bound work. But then we ran the corruption benchmark, and things got weird.

Corruption Benchmark (a.k.a. “Race Condition Rodeo”)

This test isn’t about speed — it’s about chaos. Here, multiple threads (or tasks, or processes, depending on your weapon of choice) increment a shared counter. It should be simple: increment a number N times. But in the absence of proper synchronization — and especially when the GIL is disabled — all bets are off. We’re trying to surface the hidden dragons: race conditions, memory corruption, and that subtle existential dread when your final count is… not quite right.

The goal is to demonstrate why the GIL was necessary — not because Python devs love locks, but because shared memory without guardrails is basically inviting entropy into your program.
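The data_corruption.py script the harness calls isn't shown above, so here's a minimal sketch of what it might look like. The print strings are chosen to match exactly what the harness parses; the thread count and increments per thread are my assumptions:

import threading
import time

NUM_THREADS = 8
INCREMENTS_PER_THREAD = 1_000_000

counter = 0

def increment():
    global counter
    for _ in range(INCREMENTS_PER_THREAD):
        # Classic read-modify-write race: not atomic without a lock.
        counter += 1

def main():
    start = time.perf_counter()
    threads = [threading.Thread(target=increment) for _ in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start

    # Output format matches what the benchmark harness parses.
    print(f"Actual counter value: {counter}")
    print(f"Expected counter value: {NUM_THREADS * INCREMENTS_PER_THREAD}")
    print(f"Function main took {elapsed:.3f} seconds")

if __name__ == "__main__":
    main()

And here's the harness that drives it across both interpreters and parses that output: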

import matplotlib.pyplot as plt
import subprocess
import time

def run_experiment(python_version, num_runs=5):
    """
    Run the data corruption experiment multiple times and collect results
    """
    results = []
    for _ in range(num_runs):
        cmd = [f"python{python_version}", "data_corruption.py"]

        output = subprocess.run(cmd, capture_output=True, text=True)
        lines = output.stdout.split('\n')

        # Parse the results
        actual_value = None
        expected_value = None
        execution_time = None

        for line in lines:
            if "Actual counter value:" in line:
                actual_value = int(line.split()[-1])
            elif "Expected counter value:" in line:
                expected_value = int(line.split()[-1])
            elif "Function main took" in line:
                execution_time = float(line.split()[-2])

        results.append({
            "python_version": python_version,
            "actual_value": actual_value,
            "expected_value": expected_value,
            "execution_time": execution_time,
            "lost_increments": expected_value - actual_value if expected_value and actual_value else None
        })

        # Small delay between runs
        time.sleep(0.5)

    return results

def create_visualizations(results_3_12, results_3_13):
    """
    Create various visualizations comparing Python 3.12 and 3.13 results
    """
    fig = plt.figure(figsize=(15, 10))

    # 1. Counter Values Comparison
    ax1 = plt.subplot(221)
    runs_3_12 = range(len(results_3_12))
    runs_3_13 = range(len(results_3_13))

    expected = results_3_12[0]['expected_value']
    ax1.plot(runs_3_12, [r['actual_value'] for r in results_3_12], 'g-', label='Python 3.12 (with GIL)', marker='o')
    ax1.plot(runs_3_13, [r['actual_value'] for r in results_3_13], 'r-', label='Python 3.13 (no GIL)', marker='o')
    ax1.axhline(y=expected, color='b', linestyle='--', label='Expected Value')

    ax1.set_title('Counter Values Across Runs')
    ax1.set_xlabel('Run Number')
    ax1.set_ylabel('Counter Value')
    ax1.legend()
    ax1.grid(True)

    # 2. Lost Increments Over Time
    ax2 = plt.subplot(222)
    lost_3_13 = [r['lost_increments'] for r in results_3_13]
    ax2.plot(range(len(lost_3_13)), lost_3_13, 'r-', marker='o')
    ax2.set_title('Lost Increments Over Time (Python 3.13 no GIL)')
    ax2.set_xlabel('Run Number')
    ax2.set_ylabel('Number of Lost Increments')
    ax2.grid(True)

    # 3. Execution Time Comparison
    ax3 = plt.subplot(223)
    times_3_12 = [r['execution_time'] for r in results_3_12]
    times_3_13 = [r['execution_time'] for r in results_3_13]

    ax3.bar(['Python 3.12\n(with GIL)', 'Python 3.13\n(no GIL)'],
            [sum(times_3_12)/len(times_3_12), sum(times_3_13)/len(times_3_13)],
            color=['green', 'red'])
    ax3.set_title('Average Execution Time')
    ax3.set_ylabel('Time (seconds)')
    ax3.grid(True)

    # 4. Consistency Percentage
    ax4 = plt.subplot(224)
    consistency_3_12 = sum(1 for r in results_3_12 if r['actual_value'] == r['expected_value']) / len(results_3_12) * 100
    consistency_3_13 = sum(1 for r in results_3_13 if r['actual_value'] == r['expected_value']) / len(results_3_13) * 100

    ax4.bar(['Python 3.12\n(with GIL)', 'Python 3.13\n(no GIL)'], 
            [consistency_3_12, consistency_3_13],
            color=['green', 'red'])
    ax4.set_title('Consistency Percentage')
    ax4.set_ylabel('Percentage of Consistent Results')
    ax4.set_ylim(0, 100)
    ax4.grid(True)

    plt.tight_layout()
    plt.savefig('corruption_analysis.png')
    plt.close()

    # Print summary statistics
    print("\nSummary Statistics:")
    print("\nPython 3.12 (with GIL):")
    print(f"Average execution time: {sum(times_3_12)/len(times_3_12):.3f} seconds")
    print(f"Consistency rate: {consistency_3_12:.1f}%")

    print("\nPython 3.13 (no GIL):")
    print(f"Average execution time: {sum(times_3_13)/len(times_3_13):.3f} seconds")
    print(f"Consistency rate: {consistency_3_13:.1f}%")
    print(f"Average lost increments: {sum(lost_3_13)/len(lost_3_13):.0f}")

if __name__ == "__main__":
    # Run experiments
    print("Running experiments with Python 3.12...")
    results_3_12 = run_experiment("3.12")

    print("Running experiments with Python 3.13...")
    results_3_13 = run_experiment("3.13")

    # Create visualizations
    create_visualizations(results_3_12, results_3_13)
    print("\nVisualizations saved as 'corruption_analysis.png'")

Fig: benchmarking memory corruption

Corruption Results: A Slow-Motion Car Crash

This is where my assumptions were shattered. I expected the no-GIL threads to be fast but wrong. I was not prepared for them to be slow and wrong.

Let's dissect this beautiful disaster:

  1. Consistency: As predicted, Python 3.12 (with the GIL) was 100% correct every time. Python 3.13 (no GIL), without any manual locks, produced wildly incorrect results, losing tens of thousands of increments to race conditions. No surprises there.

  2. Execution Time: This was the shocker. The no-GIL version was significantly slower than the GIL version. It not only failed to produce the right answer, it took its sweet time doing it.

Why in the World Was It Slower?

This counter-intuitive result gets to the heart of concurrency. Removing the GIL isn't free. It's replaced by finer-grained machinery (per-object locks and atomic reference counting) to keep Python's internals safe. And our specific test, multiple threads hammering the exact same variable, is the absolute worst-case scenario for this.

It created extreme memory contention.

Think of it this way: The GIL is like a strict librarian who only lets one person at a time into a room to get a book. It's slow, but it's orderly. The no-GIL mode fires the librarian and lets everyone rush in at once. But since they all want the same book, they just trample each other in a chaotic pile-up at the bookshelf. The CPU's memory system has to work overtime to manage this riot, and the end result is that everything takes longer than the orderly queue did.

Our test created a digital riot. The overhead of managing the chaos was greater than the benefit of parallelism.
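For completeness, here's what taming that riot might look like. These are hedged sketches rather than the article's actual code: a threading.Lock restores correctness at the cost of serializing the hot loop, while sharding the counter per thread removes the shared writes entirely, which is exactly the "low contention" shape where no-GIL threading can shine.

import threading

NUM_THREADS = 8
INCREMENTS_PER_THREAD = 100_000

# Fix 1: a lock makes the read-modify-write atomic, but serializes
# the threads again. Drop-in replacement for the racy loop above.
counter = 0
lock = threading.Lock()

def increment_locked():
    global counter
    for _ in range(INCREMENTS_PER_THREAD):
        with lock:
            counter += 1

# Fix 2: shard the work. Each thread owns its own slot; there are no
# shared writes until the single-threaded sum at the end.
partials = [0] * NUM_THREADS

def increment_sharded(idx):
    local = 0                      # thread-local accumulator
    for _ in range(INCREMENTS_PER_THREAD):
        local += 1
    partials[idx] = local          # one write per thread

threads = [threading.Thread(target=increment_sharded, args=(i,))
           for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(partials))  # NUM_THREADS * INCREMENTS_PER_THREAD, every time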

The Grand, Unfulfilling Conclusion: So What Did We Learn?

After all this, the key takeaway is that there is no magic bullet. Anyone who tells you "always use async" or "the no-GIL mode fixes everything" is, frankly, full of it.

The reality requires you to think. Here's your cheat sheet:

  1. For CPU-Bound Work: Multiprocessing is still your safest bet for raw, scalable power. Use the no-GIL mode only when your threads can work on different chunks of data with low contention. Benchmark it first, or you might end up in a slow-motion riot like I did.

  2. For I/O-Bound Work: Asyncio is the modern, highly efficient champion. But standard threading (with the GIL) remains a perfectly simple and effective choice for less complex I/O tasks.

The no-GIL mode is not a "go faster" button. It's a powerful, expert-level tool that unlocks true parallelism, but it demands that you understand and manage the perils of concurrent memory access. For most, the GIL isn't a prison; it's a guardrail that keeps you from driving off a cliff. Sometimes, that's exactly what you need.

