Dynamic Array Memory Access Patterns: From Python Indirection to Direct C++ Speed


Part 3 of the Brian2 Performance Optimization Series
In our previous episode, we conquered the SpikeQueue
bottleneck by eliminating Python overhead with PyCapsules.
Today, we're tackling an even bigger challenge: Dynamic Arrays — the data structures that store everything from neuron voltages to synaptic weights in Brian2 simulations.
The Problem: When Arrays Grow, Performance Dies
Picture this: you're running a brain simulation with millions of synapses that can form and dissolve during learning. Every time a new connection forms, Brian2 needs to:
Resize arrays to accommodate new data
Access array elements during computation
Monitor changing values for analysis
Sounds simple, right? Well, in the old Brian2 system, every one of these steps went through layer upon layer of Python overhead: wrapper objects, method calls, and full NumPy reallocations.
For arrays that resize frequently and get accessed millions of times per second, this overhead was killing performance.
Meet the Culprit: Python's Dynamic Arrays
Brian2's original dynamic arrays were implemented in pure Python, wrapped with Cython for supposed "speed." Here's what the old system looked like:
import numpy as np

class DynamicArray:
    """The old way - pure Python with all its overhead"""

    def __init__(self, shape, dtype=float):
        self.data = np.zeros(shape, dtype=dtype)
        self.shape = shape

    def resize(self, new_shape):
        # This creates a NEW NumPy array every time!
        old_data = self.data
        self.data = np.zeros(new_shape, dtype=old_data.dtype)
        # Copy old data (expensive!)
        copy_shape = tuple(min(old, new) for old, new in zip(old_data.shape, new_shape))
        slices = tuple(slice(0, s) for s in copy_shape)
        self.data[slices] = old_data[slices]
        self.shape = new_shape

    def __getitem__(self, key):
        return self.data[key]  # Python method call overhead!
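To see the cost in action, here's a quick sanity check using the simplified class above: every resize hands you a brand-new buffer and copies all of the old data across.

arr = DynamicArray((1000,))
old_buffer = arr.data

arr.resize((2000,))                              # allocates a fresh array, copies 1000 elements
print(np.shares_memory(old_buffer, arr.data))    # False - nothing is shared with the old buffer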
The Abstraction Hell
Let's trace what happened when Brian2 needed to access a single array element in the old system: an attribute lookup on the Python wrapper object, a call into its __getitem__ method, NumPy's indexing machinery with its bounds checking, and only then the actual memory read.
So here's what we did to solve it.
Our solution was simple: eliminate Python entirely for dynamic array operations.
The New Architecture
Instead of Python objects wrapping NumPy arrays, we built native C++ dynamic arrays that provide zero-copy NumPy views:
#include <vector>      // std::vector
#include <algorithm>   // std::max
#include <cstddef>     // size_t

template <class T>
class DynamicArray1D {
private:
    std::vector<T> m_data;    // Raw C++ storage
    size_t m_size;            // Logical size (what user sees)
    double m_growth_factor;   // Smart over-allocation factor (e.g. 1.5)

public:
    explicit DynamicArray1D(double growth_factor = 1.5)
        : m_size(0), m_growth_factor(growth_factor) {}

    // Direct C++ access - no Python overhead!
    T* get_data_ptr() { return m_data.data(); }

    void resize(size_t new_size) {
        if (new_size > m_data.size()) {
            // Smart growth: allocate more than needed
            size_t grown = static_cast<size_t>(m_data.size() * m_growth_factor) + 1;
            size_t new_capacity = std::max(new_size, grown);
            m_data.resize(new_capacity);
        }
        m_size = new_size;
    }

    T& operator[](size_t idx) { return m_data[idx]; }  // Pure C++ speed!
};
Side Note: What's a Zero-Copy NumPy View? 💡
Ever wonder how NumPy can slice and reshape massive datasets in a flash? One of its secret weapons is the zero-copy view.
In a nutshell, a zero-copy view is a new NumPy array object that looks at the exact same block of memory as the original array. Think of your original data as a large warehouse full of items. A zero-copy view is like giving someone a blueprint and a set of instructions (like "only look at the second aisle" or "treat these boxes as pairs") without actually moving or duplicating any items in the warehouse.
Since you're not actually copying any data—hence, "zero-copy"—these operations are incredibly fast and memory-efficient.
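A tiny NumPy sketch makes the warehouse analogy concrete: the view and the original share one buffer, so a write through one is immediately visible through the other.

import numpy as np

warehouse = np.arange(12)                      # the original "warehouse" of items
aisle_two = warehouse[1::2]                    # a blueprint: "only look at every second item"
aisle_two[0] = 99                              # modify through the view...
print(warehouse[1])                            # 99 - the original changed too
print(np.shares_memory(warehouse, aisle_two))  # True - zero bytes were copied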
The Magic: Zero-Copy NumPy Integration
The most elegant part of our solution is how we bridge C++ and Python without copying data:
@property
def data(self):
    """
    Create a zero-copy NumPy view of our C++ data.
    This is not a copy - it's a direct window into C++ memory!
    """
    cdef size_t size = self.get_size()
    cdef void* data_ptr = self.get_data_ptr()   # Direct C++ pointer
    cdef cnp.npy_intp shape[1]
    shape[0] = size
    # Create NumPy array that points to our C++ memory
    return cnp.PyArray_SimpleNewFromData(1, shape, self.numpy_type, data_ptr)
The Memory Layout
At the heart of high-performance computing is the principle of doing less. Why copy data when you can share it? This is the core idea behind creating zero-copy NumPy views directly from C++ data structures.
Traditionally, getting data from a custom C++ object into Python might involve multiple slow, memory-hungry steps. But by being clever about our memory layout, we can create a direct highway between them.
Both the C++ array and the NumPy view can point to the exact same block of memory.
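If you want a pure-Python feel for this, np.frombuffer does something similar: it wraps memory it does not own, much like the Cython property above wraps the C++ buffer (this is only an analogy, not the actual Brian2 code).

import numpy as np

raw = bytearray(5 * 8)                           # stand-in for a buffer owned elsewhere
window = np.frombuffer(raw, dtype=np.float64)    # no copy - just a window onto those bytes
window[:] = [1.0, 2.0, 3.0, 4.0, 5.0]
print(raw[:8])                                   # the underlying bytes changed too
print(window.flags.owndata)                      # False - NumPy is only borrowing the memory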
2D Arrays: Row-Major Magic with Strides
For 2D arrays, things get more interesting. To achieve this zero-copy miracle, we need to handle the memory layout carefully. The secret lies in combining row-major storage with strides.
Storing a 2D Grid in a 1D Line
First, we must store our 2D grid in a single, continuous block of memory. Instead of a vector of vectors (std::vector<std::vector<T>>), which can scatter data all over memory, we use a single flat std::vector. This is called row-major order: you lay out the first row, then the second right after it, and so on.
Our C++ DynamicArray2D class is designed precisely for this:
template <class T>
class DynamicArray2D {
private:
    std::vector<T> m_buffer;               // Flat storage (row-major)
    size_t m_rows, m_cols;                 // Logical dimensions
    size_t m_buffer_rows, m_buffer_cols;   // Physical capacity

    inline size_t index(size_t i, size_t j) const {
        // Calculate position in the flat buffer
        return i * m_buffer_cols + j;
    }

public:
    T& operator()(size_t i, size_t j) {
        return m_buffer[index(i, j)];
    }
};
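NumPy uses row-major order by default too, so you can check the same index arithmetic from Python (here the logical and physical row widths coincide; the C++ class uses the physical buffer width when it over-allocates):

import numpy as np

rows, cols = 3, 4
grid = np.arange(rows * cols).reshape(rows, cols)   # row-major by default
flat = grid.ravel()                                  # the same buffer, viewed as one flat line

i, j = 1, 2
print(grid[i, j] == flat[i * cols + j])              # True - index(i, j) = i*cols + j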
This guarantees our data is in one contiguous chunk … but what are strides?
Strides: The GPS for Navigating a DynamicArray
Now, how does NumPy navigate this 1D block as a 2D grid without copying it? It uses strides.
A stride tells NumPy how many bytes to jump in memory to get from one element to the next along a specific axis. For a 2D array, we need two strides:
Column Stride: How many bytes to jump to get to the next column in the same row.
Row Stride: How many bytes to jump to get to the same column in the next row.
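For a plain contiguous array, both numbers fall straight out of the shape and the item size; a quick check:

import numpy as np

a = np.zeros((3, 4), dtype=np.float64)   # 8-byte elements, 4 per row
print(a.strides)                         # (32, 8): next row = 32 bytes, next column = 8 bytes
# Element (i, j) therefore lives at byte offset i*32 + j*8 from the start of the buffer.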
@property
def data(self):
    cdef size_t rows = self.get_rows()
    cdef size_t cols = self.get_cols()
    cdef size_t stride = self.get_stride()   # Physical row width
    cdef size_t itemsize = self.dtype.itemsize

    # NumPy stride map: how many bytes to jump for each dimension
    cdef cnp.npy_intp strides[2]
    strides[1] = itemsize            # Next column: 1 element
    strides[0] = stride * itemsize   # Next row: full physical row

    # Create the zero-copy view
    return PyArray_NewFromDescr(...)
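Here's a pure-NumPy analogue of what that property builds, using as_strided in place of the C API (a sketch of the idea, not the real implementation), with a physical row width larger than the logical one:

import numpy as np
from numpy.lib.stride_tricks import as_strided

buffer_rows, buffer_cols = 4, 6            # physical capacity of the flat buffer
rows, cols = 3, 4                          # logical shape the user sees
flat = np.zeros(buffer_rows * buffer_cols, dtype=np.float64)

itemsize = flat.itemsize                   # 8 bytes per element
view = as_strided(flat, shape=(rows, cols),
                  strides=(buffer_cols * itemsize, itemsize))

view[2, 3] = 7.0                           # write through the 2D view...
print(flat[2 * buffer_cols + 3])           # 7.0 - same memory, no copy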
Template Changes: Direct C++ Access
Our newly generated templates bypass all the Python indirection we had before:
Before: Python-Mediated Access
# OLD WAY - Multiple layers of overhead
{% block maincode %}
    cdef size_t _current_len = {{dynamic_t}}.shape[0]   # Python call!
    {{dynamic_t}}.resize(_current_len + 1)               # Python call!
    cdef numpy.ndarray arr = {{dynamic_t}}.data          # Python call!
    arr[_current_len] = new_value                        # Bounds checking!
{% endblock %}
After: Direct C++ Speed
# NEW WAY - Pure C++ performance
{% block maincode %}
    {% set t_array = get_array_name(variables['t'], access_data=False) %}
    cdef size_t _current_len = {{t_array}}_ptr.size()     # Direct C++!
    {{t_array}}_ptr.resize(_current_len + 1)              # Direct C++!
    cdef double* _data = {{t_array}}_ptr.get_data_ptr()   # Direct C++!
    _data[_current_len] = new_value                       # Raw memory access!
{% endblock %}
Final Touch: Smarter, Not Harder, with Smart Growth
One last optimization: std::vector is good, but we can be smarter. When a std::vector resizes, it might not allocate enough extra memory, leading to frequent, costly reallocations. We implemented a custom growth factor to solve this.
void resize(size_t new_size) {
    if (new_size > m_data.capacity()) {
        // Over-allocate to prevent frequent resizing!
        size_t grown = static_cast<size_t>(m_data.capacity() * m_growth_factor) + 1;
        size_t new_capacity = std::max(new_size, grown);
        m_data.reserve(new_capacity);
    }
    m_data.resize(new_size);   // capacity is already there, so this never reallocates
    m_size = new_size;
}
By over-allocating with a factor like 1.5, we strike a perfect balance between memory use and the speed of not having to reallocate constantly. It's a strategy used in production-grade libraries for a reason.
Fun fact: Facebook's high-performance folly::fbvector also defaults to a growth factor of 1.5 for this very reason! A nice read here: https://stackoverflow.com/questions/1100311/what-is-the-ideal-growth-rate-for-a-dynamically-allocated-array/20481237#20481237
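To get a feel for why geometric growth wins, here's a tiny back-of-the-envelope simulation (plain Python, just counting how often the buffer has to be reallocated while appending one element at a time):

def count_reallocations(n_appends, growth_factor):
    capacity, size, reallocations = 0, 0, 0
    for _ in range(n_appends):
        size += 1
        if size > capacity:
            capacity = max(size, int(capacity * growth_factor) + 1)
            reallocations += 1          # a fresh allocation plus a copy of every element
    return reallocations

print(count_reallocations(1_000_000, 1.0))   # grow by one slot each time: 1,000,000 reallocations
print(count_reallocations(1_000_000, 1.5))   # geometric growth: only a few dozen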
By moving our core data structures to C++ and using zero-copy views, we eliminated layers of overhead and unlocked the true performance potential of the underlying hardware.
Stay tuned for our next episode, where we'll start to explore cppyy and how we plan on making Brian2’s runtime mode truly run dynamically …