The performance cost of dynamic lists


Introduction
Ever wondered why you're always told that NumPy performs better than Python lists? It comes down to dynamic, reference-based arrays versus static, contiguous arrays.
In this article, we'll explore how Python's implementation of dynamic lists affects cache performance, and why NumPy arrays can be dramatically faster for numerical operations.
Glossary
L1 Cache (D1): Smallest, fastest cache directly connected to the CPU
Last-Level Cache (LL): Larger but slower cache before accessing main memory
Cache Lines: Data is loaded in fixed-size blocks (typically 64 bytes), not individual values
Spatial Locality: The principle that after accessing one memory location, programs are likely to access nearby locations soon
Python Lists: Under the Hood
Python's lists are implemented as dynamic arrays of references. This means that when you create a list like [1, 2, 3], what you're actually getting is an array of pointers, each pointing to a separate Python object stored somewhere else in memory:
List in memory:
[ptr] → PyObject(1)
[ptr] → PyObject(2)
[ptr] → PyObject(3)
Each PyObject carries type information, a reference count, and the actual value, requiring 16+ bytes per integer on 64-bit systems. When a list grows beyond its capacity, Python allocates a new, larger array (CPython over-allocates by roughly an eighth of the requested size, plus a small constant, for a growth factor of about 1.125x), copies the pointers over, and frees the old array.
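To get a feel for that overhead, here is a rough sketch in C++ (the comparison language used below); FakePyIntObject is a made-up stand-in, not CPython's actual object layout:
#include <cstdio>
#include <cstdint>

// Illustrative only: not CPython's real struct definition.
struct FakePyIntObject {
    int64_t refcount;  // reference count
    void*   type;      // pointer to the type object
    int64_t value;     // the actual integer payload
};

int main() {
    // object on the heap + an 8-byte pointer slot in the list itself,
    // versus 8 bytes for a raw int64 in a contiguous array
    std::printf("per element, list of objects: %zu bytes\n",
                sizeof(FakePyIntObject) + sizeof(void*));
    std::printf("per element, raw array:       %zu bytes\n",
                sizeof(int64_t));
}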
NumPy Arrays: The Contiguous Approach
In contrast, NumPy arrays store values directly in a contiguous memory block:
NumPy array in memory:
[1][2][3][4][5]...
This approach sacrifices Python's flexibility (mixing types) for dramatic performance improvements with homogeneous data.
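A quick way to see this layout is to print element addresses; here's a minimal sketch using a plain C++ array as a stand-in for the NumPy buffer:
#include <cstdio>

int main() {
    long long a[5] = {1, 2, 3, 4, 5};
    for (int i = 0; i < 5; i++) {
        // consecutive addresses, exactly sizeof(long long) = 8 bytes apart
        std::printf("a[%d] at %p\n", i, (void*)&a[i]);
    }
}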
Two Approaches Compared
Let's implement Python-like and NumPy-like structures in C++ to demonstrate the difference.
Python Lists (Dynamic Arrays)
// Python-like list implementation (simplified).
// To mirror Python's reference semantics, the list stores pointers to
// individually heap-allocated values rather than the values themselves.
#include <stdexcept>

template <typename T>
class PythonList {
private:
    T** array;     // array of pointers, like CPython's ob_item
    int capacity;
    int size;

    void resize() {
        // Simplified growth policy: double the capacity
        // (CPython actually over-allocates by roughly 1.125x).
        capacity *= 2;
        T** newArray = new T*[capacity];
        for (int i = 0; i < size; i++) {
            newArray[i] = array[i];  // copy the pointers, not the values
        }
        delete[] array;
        array = newArray;
    }

public:
    PythonList() : capacity(1), size(0) {
        array = new T*[capacity];
    }

    ~PythonList() {
        for (int i = 0; i < size; i++) {
            delete array[i];  // free each heap-allocated element
        }
        delete[] array;
    }

    void append(T value) {
        if (size == capacity) {
            resize();
        }
        array[size++] = new T(value);  // each value lives somewhere on the heap
    }

    T get(int index) const {
        if (index < 0 || index >= size) {
            throw std::out_of_range("Index out of range");
        }
        return *array[index];  // extra pointer dereference on every access
    }

    int getSize() const {
        return size;
    }

    int getCapacity() const {
        return capacity;
    }
};
The dynamic list resizes once its initially allocated capacity is full; in this simplified implementation, it doubles its capacity each time. In short: capacity reached → allocate a new array of double the size → copy the elements over → repeat.
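Here's a quick way to watch that policy in action, assuming the PythonList above is in the same file:
#include <cstdio>

int main() {
    PythonList<long long> list;
    for (int i = 0; i < 20; i++) {
        list.append(i);
        std::printf("size=%2d capacity=%2d\n", list.getSize(), list.getCapacity());
    }
    // capacity climbs 1 -> 2 -> 4 -> 8 -> 16 -> 32 as appends cross each power of two
}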
Extras!
When a Python list runs out of capacity, it needs to resize. This resizing operation requires:
1. Allocating a new, larger array (2x the previous size in our simplified implementation)
2. Copying all existing elements to the new array
3. Deallocating the old array
This resize operation is expensive: O(n), where n is the number of elements. However, it occurs infrequently. Starting from capacity 1, n append operations trigger only about log₂(n) resizes, and the elements copied across all of them form a geometric series: 1 + 2 + 4 + ... + n < 2n. Adding the n appends themselves, the total cost of n operations is O(n).
Divided across the n operations, the amortized cost per append becomes: O(n)/n = O(1)
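A small simulation makes the bound concrete; it just mimics the doubling policy with counters rather than using the class above:
#include <cstdio>

int main() {
    const int n = 1'000'000;
    long long copies = 0;  // total elements copied across all resizes
    int capacity = 1;
    int size = 0;
    for (int i = 0; i < n; i++) {
        if (size == capacity) {
            copies += size;  // a resize copies every existing element
            capacity *= 2;
        }
        size++;
    }
    // copies stays below 2n, so the average cost per append is O(1)
    std::printf("appends: %d, total copies: %lld (bound 2n = %d)\n",
                n, copies, 2 * n);
}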
NumPy (Similar to Primitive Arrays)
This approach trades flexibility for performance: neither the size nor the element type can change after creation.
// NumPy-like array implementation (simplified):
// values are stored directly in one contiguous block.
#include <stdexcept>

template <typename T>
class NumpyArray {
private:
    T* array;  // single contiguous allocation
    int size;

public:
    NumpyArray(int n) : size(n) {
        array = new T[size];
    }

    ~NumpyArray() {
        delete[] array;
    }

    void set(int index, T value) {
        if (index < 0 || index >= size) {
            throw std::out_of_range("Index out of range");
        }
        array[index] = value;
    }

    T get(int index) const {
        if (index < 0 || index >= size) {
            throw std::out_of_range("Index out of range");
        }
        return array[index];  // direct access, no indirection
    }

    int getSize() const {
        return size;
    }
};
Both serve similar purposes but have fundamentally different memory layouts.
Analyzing Cache Performance
Let's examine what happens when we iterate through both structures to calculate a sum:
// PythonList iteration
long long sum = 0;
for (int i = 0; i < pyList.getSize(); i++) {
    sum += pyList.get(i);  // each access follows a pointer and may cause a cache miss
}

// NumpyArray iteration
long long sum = 0;
for (int i = 0; i < npArray.getSize(); i++) {
    sum += npArray.get(i);  // sequential access has better locality
}
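Here's a minimal benchmark sketch tying the pieces together, assuming both classes above are in scope; the size and timing code are illustrative, not the original measurement setup:
// Build and profile, e.g.: g++ -O2 bench.cpp && valgrind --tool=cachegrind ./a.out
#include <chrono>
#include <cstdio>

int main() {
    const int n = 10'000'000;  // illustrative; shrink if memory is tight

    PythonList<long long> pyList;
    for (int i = 0; i < n; i++) pyList.append(i);

    NumpyArray<long long> npArray(n);
    for (int i = 0; i < n; i++) npArray.set(i, i);

    auto t0 = std::chrono::steady_clock::now();
    long long listSum = 0;
    for (int i = 0; i < pyList.getSize(); i++) listSum += pyList.get(i);
    auto t1 = std::chrono::steady_clock::now();

    long long arraySum = 0;
    for (int i = 0; i < npArray.getSize(); i++) arraySum += npArray.get(i);
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto a, auto b) {
        return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::printf("list  sum=%lld in %lld ms\n", listSum, ms(t0, t1));
    std::printf("array sum=%lld in %lld ms\n", arraySum, ms(t1, t2));
}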
Cache Performance Results
Using Valgrind's cachegrind tool, we measured execution time and cache misses for both implementations across a range of dataset sizes. Two trends stood out:
NumPy's execution time falls increasingly below the list's as the dataset size grows.
D1 and LLd cache misses are substantially higher for Python's dynamic-list implementation.
The Spatial Locality Advantage
NumPy outperforms Python lists because:
Contiguous Storage: When a cache line is loaded for array[i], it also contains array[i+1], array[i+2], etc., which will be used immediately afterward.
Fewer Memory Indirections: Python lists require a pointer dereference for each access, potentially bringing in a cache line that contains only one useful value.
Better Prefetching: Modern CPUs can predict sequential access patterns and prefetch data, but struggle with the scattered memory access of pointer chasing.
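To put a rough number on the contiguous-storage point above: with 64-byte cache lines and 8-byte integers, one miss brings in eight useful values:
#include <cstdio>

int main() {
    const int cacheLineBytes = 64;            // typical line size (see glossary)
    const int elemBytes = sizeof(long long);  // 8 bytes per value
    // A sequential scan of a contiguous array pays roughly one compulsory
    // miss per eight accesses; pointer chasing can miss on every access.
    std::printf("values per cache line: %d\n", cacheLineBytes / elemBytes);
}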
Cache Misses Explained
When iterating through a Python list, each element access follows this pattern:
Load pointer from list array (potential cache miss #1)
Follow pointer to actual object (potential cache miss #2)
Extract value from object (potential cache miss #3)
Move to next element (which may be in a completely different memory area)
For NumPy arrays, the access pattern is:
Calculate position (base address + i * element_size)
Access value directly (adjacent to previous value)
Move to next element (likely in the same cache line)
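Sketched as two free functions (illustrative, not taken from the measured benchmark), the difference between the two patterns is easy to see:
// Pointer-chasing sum: two dependent loads per element.
long long sumViaPointers(long long* const* ptrs, int n) {
    long long s = 0;
    for (int i = 0; i < n; i++) {
        s += *ptrs[i];  // load the pointer, then follow it elsewhere in memory
    }
    return s;
}

// Contiguous sum: one load at base + i * sizeof(long long).
long long sumContiguous(const long long* a, int n) {
    long long s = 0;
    for (int i = 0; i < n; i++) {
        s += a[i];      // neighbors usually arrive on the same cache line
    }
    return s;
}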
With Python lists generating significantly more cache misses, the performance impact is enormous.
Conclusion
This post is not really about how lists are internally implemented in Python; rather, it is an attempt to spark your curiosity to understand how things work under the hood and to appreciate the trade-offs involved. For ease of use, and for the flexibility of holding elements of different types in a single list, Python trades away performance. Hope you found this insightful!