Dynamic Array Memory Access Patterns: From Python Indirection to Direct C++ Speed


Part 3 of the Brian2 Performance Optimization Series
In our previous episode, we conquered the SpikeQueue
bottleneck by eliminating Python overhead with PyCapsules.
Today, we're tackling an even bigger challenge: Dynamic Arrays — the data structures that store everything from neuron voltages to synaptic weights in Brian2 simulations.
The Problem: When Arrays Grow, Performance Dies
Picture this: you're running a brain simulation with millions of synapses that can form and dissolve during learning. Every time a new connection forms, Brian2 needs to:
Resize arrays to accommodate new data
Access array elements during computation
Monitor changing values for analysis
Sounds simple, right? Well, in the old Brian2 system, every one of these steps went through layer upon layer of Python overhead: wrapper objects, method calls, and full NumPy reallocations.
For arrays that resize frequently and get accessed millions of times per second, this overhead was killing performance.
Meet the Culprit: Python's Dynamic Arrays
Brian2's original dynamic arrays were implemented in pure Python, wrapped with Cython for supposed "speed." Here's what the old system looked like:
import numpy as np

class DynamicArray:
    """The old way - pure Python with all its overhead"""

    def __init__(self, shape, dtype=float):
        self.data = np.zeros(shape, dtype=dtype)
        self.shape = shape

    def resize(self, new_shape):
        # This creates a NEW NumPy array every time!
        old_data = self.data
        self.data = np.zeros(new_shape, dtype=old_data.dtype)
        # Copy old data (expensive!)
        copy_shape = tuple(min(old, new) for old, new in zip(old_data.shape, new_shape))
        slices = tuple(slice(0, s) for s in copy_shape)
        self.data[slices] = old_data[slices]
        self.shape = new_shape

    def __getitem__(self, key):
        return self.data[key]  # Python method call overhead!
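To see the cost in action, here's a quick sanity check using the simplified class above: every resize hands you a brand-new buffer and copies all of the old data across.

arr = DynamicArray((1000,))
old_buffer = arr.data

arr.resize((2000,))                              # allocates a fresh array, copies 1000 elements
print(np.shares_memory(old_buffer, arr.data))    # False - nothing is shared with the old buffer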
The Abstraction Hell
Let's trace what happened when Brian2 needed to access a single array element in the old system: an attribute lookup on the Python wrapper object, a call into its __getitem__ method, NumPy's indexing machinery with its bounds checking, and only then the actual memory read.
So here's what we did to solve it.
Our solution was simple: eliminate Python entirely for dynamic array operations.
The New Architecture
Instead of Python objects wrapping NumPy arrays, we built native C++ dynamic arrays that provide zero-copy NumPy views:
#include <vector>      // std::vector
#include <algorithm>   // std::max
#include <cstddef>     // size_t

template <class T>
class DynamicArray1D {
private:
    std::vector<T> m_data;    // Raw C++ storage
    size_t m_size;            // Logical size (what user sees)
    double m_growth_factor;   // Smart over-allocation factor (e.g. 1.5)

public:
    explicit DynamicArray1D(double growth_factor = 1.5)
        : m_size(0), m_growth_factor(growth_factor) {}

    // Direct C++ access - no Python overhead!
    T* get_data_ptr() { return m_data.data(); }

    void resize(size_t new_size) {
        if (new_size > m_data.size()) {
            // Smart growth: allocate more than needed
            size_t grown = static_cast<size_t>(m_data.size() * m_growth_factor) + 1;
            size_t new_capacity = std::max(new_size, grown);
            m_data.resize(new_capacity);
        }
        m_size = new_size;
    }

    T& operator[](size_t idx) { return m_data[idx]; }  // Pure C++ speed!
};
Side Note: What's a Zero-Copy NumPy View? 💡
Ever wonder how NumPy can slice and reshape massive datasets in a flash? One of its secret weapons is the zero-copy view.
In a nutshell, a zero-copy view is a new NumPy array object that looks at the exact same block of memory as the original array. Think of your original data as a large warehouse full of items. A zero-copy view is like giving someone a blueprint and a set of instructions (like "only look at the second aisle" or "treat these boxes as pairs") without actually moving or duplicating any items in the warehouse.
Since you're not actually copying any data—hence, "zero-copy"—these operations are incredibly fast and memory-efficient.
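A tiny NumPy sketch makes the warehouse analogy concrete: the view and the original share one buffer, so a write through one is immediately visible through the other.

import numpy as np

warehouse = np.arange(12)                      # the original "warehouse" of items
aisle_two = warehouse[1::2]                    # a blueprint: "only look at every second item"
aisle_two[0] = 99                              # modify through the view...
print(warehouse[1])                            # 99 - the original changed too
print(np.shares_memory(warehouse, aisle_two))  # True - zero bytes were copied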
The Magic: Zero-Copy NumPy Integration
The most elegant part of our solution is how we bridge C++ and Python without copying data:
@property
def data(self):
    """
    Create a zero-copy NumPy view of our C++ data.
    This is not a copy - it's a direct window into C++ memory!
    """
    cdef size_t size = self.get_size()
    cdef void* data_ptr = self.get_data_ptr()   # Direct C++ pointer
    cdef cnp.npy_intp shape[1]
    shape[0] = size
    # Create NumPy array that points to our C++ memory
    return cnp.PyArray_SimpleNewFromData(1, shape, self.numpy_type, data_ptr)
The Memory Layout
At the heart of high-performance computing is the principle of doing less. Why copy data when you can share it? This is the core idea behind creating zero-copy NumPy views directly from C++ data structures.
Traditionally, getting data from a custom C++ object into Python might involve multiple slow, memory-hungry steps. But by being clever about our memory layout, we can create a direct highway between them.
Both the C++ array and the NumPy view can point to the exact same block of memory.
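If you want a pure-Python feel for this, np.frombuffer does something similar: it wraps memory it does not own, much like the Cython property above wraps the C++ buffer (this is only an analogy, not the actual Brian2 code).

import numpy as np

raw = bytearray(5 * 8)                           # stand-in for a buffer owned elsewhere
window = np.frombuffer(raw, dtype=np.float64)    # no copy - just a window onto those bytes
window[:] = [1.0, 2.0, 3.0, 4.0, 5.0]
print(raw[:8])                                   # the underlying bytes changed too
print(window.flags.owndata)                      # False - NumPy is only borrowing the memory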
2D Arrays: Row-Major Magic with Strides
For 2D arrays, things get more interesting. To achieve this zero-copy miracle, we need to handle the memory layout carefully. The secret lies in combining row-major storage with strides.
Storing a 2D Grid in a 1D Line
First, we must store our 2D grid in a single, continuous block of memory. Instead of a vector of vectors (std::vector<std::vector<T>>), which can scatter data all over memory, we use a single flat std::vector. This is called row-major order: you lay out the first row, then the second right after it, and so on.
Our C++ DynamicArray2D class is designed precisely for this:
template <class T>
class DynamicArray2D {
private:
    std::vector<T> m_buffer;               // Flat storage (row-major)
    size_t m_rows, m_cols;                 // Logical dimensions
    size_t m_buffer_rows, m_buffer_cols;   // Physical capacity

    inline size_t index(size_t i, size_t j) const {
        // Calculate position in the flat buffer
        return i * m_buffer_cols + j;
    }

public:
    T& operator()(size_t i, size_t j) {
        return m_buffer[index(i, j)];
    }
};
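NumPy uses row-major order by default too, so you can check the same index arithmetic from Python (here the logical and physical row widths coincide; the C++ class uses the physical buffer width when it over-allocates):

import numpy as np

rows, cols = 3, 4
grid = np.arange(rows * cols).reshape(rows, cols)   # row-major by default
flat = grid.ravel()                                  # the same buffer, viewed as one flat line

i, j = 1, 2
print(grid[i, j] == flat[i * cols + j])              # True - index(i, j) = i*cols + j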
This guarantees our data is in one contiguous chunk … but what are strides?
Strides: The GPS for Navigating a DynamicArray
Now, how does NumPy navigate this 1D block as a 2D grid without copying it? It uses strides.
A stride tells NumPy how many bytes to jump in memory to get from one element to the next along a specific axis. For a 2D array, we need two strides:
Column Stride: How many bytes to jump to get to the next column in the same row.
Row Stride: How many bytes to jump to get to the same column in the next row.
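For a plain contiguous array, both numbers fall straight out of the shape and the item size; a quick check:

import numpy as np

a = np.zeros((3, 4), dtype=np.float64)   # 8-byte elements, 4 per row
print(a.strides)                         # (32, 8): next row = 32 bytes, next column = 8 bytes
# Element (i, j) therefore lives at byte offset i*32 + j*8 from the start of the buffer.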
@property
def data(self):
    cdef size_t rows = self.get_rows()
    cdef size_t cols = self.get_cols()
    cdef size_t stride = self.get_stride()   # Physical row width
    cdef size_t itemsize = self.dtype.itemsize

    # NumPy stride map: how many bytes to jump for each dimension
    cdef cnp.npy_intp strides[2]
    strides[1] = itemsize            # Next column: 1 element
    strides[0] = stride * itemsize   # Next row: full physical row

    # Create the zero-copy view
    return PyArray_NewFromDescr(...)
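Here's a pure-NumPy analogue of what that property builds, using as_strided in place of the C API (a sketch of the idea, not the real implementation), with a physical row width larger than the logical one:

import numpy as np
from numpy.lib.stride_tricks import as_strided

buffer_rows, buffer_cols = 4, 6            # physical capacity of the flat buffer
rows, cols = 3, 4                          # logical shape the user sees
flat = np.zeros(buffer_rows * buffer_cols, dtype=np.float64)

itemsize = flat.itemsize                   # 8 bytes per element
view = as_strided(flat, shape=(rows, cols),
                  strides=(buffer_cols * itemsize, itemsize))

view[2, 3] = 7.0                           # write through the 2D view...
print(flat[2 * buffer_cols + 3])           # 7.0 - same memory, no copy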
Template Changes: Direct C++ Access
Our newly generated templates bypass all the Python indirection we had before:
Before: Python-Mediated Access
# OLD WAY - Multiple layers of overhead
{% block maincode %}
    cdef size_t _current_len = {{dynamic_t}}.shape[0]   # Python call!
    {{dynamic_t}}.resize(_current_len + 1)               # Python call!
    cdef numpy.ndarray arr = {{dynamic_t}}.data          # Python call!
    arr[_current_len] = new_value                        # Bounds checking!
{% endblock %}
After: Direct C++ Speed
# NEW WAY - Pure C++ performance
{% block maincode %}
    {% set t_array = get_array_name(variables['t'], access_data=False) %}
    cdef size_t _current_len = {{t_array}}_ptr.size()     # Direct C++!
    {{t_array}}_ptr.resize(_current_len + 1)              # Direct C++!
    cdef double* _data = {{t_array}}_ptr.get_data_ptr()   # Direct C++!
    _data[_current_len] = new_value                       # Raw memory access!
{% endblock %}
Final Touch: Smarter, Not Harder, with Smart Growth
One last optimization: std::vector is good, but we can be smarter. When a std::vector resizes, it might not allocate enough extra memory, leading to frequent, costly reallocations. We implemented a custom growth factor to solve this.
void resize(size_t new_size) {
    if (new_size > m_data.capacity()) {
        // Over-allocate to prevent frequent resizing!
        size_t grown = static_cast<size_t>(m_data.capacity() * m_growth_factor) + 1;
        size_t new_capacity = std::max(new_size, grown);
        m_data.reserve(new_capacity);
    }
    m_data.resize(new_size);   // capacity is already there, so this never reallocates
    m_size = new_size;
}
By over-allocating with a factor like 1.5, we strike a perfect balance between memory use and the speed of not having to reallocate constantly. It's a strategy used in production-grade libraries for a reason.
Fun fact: Facebook's high-performance folly::fbvector also defaults to a growth factor of 1.5 for this very reason! A nice read here: https://stackoverflow.com/questions/1100311/what-is-the-ideal-growth-rate-for-a-dynamically-allocated-array/20481237#20481237
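To get a feel for why geometric growth wins, here's a tiny back-of-the-envelope simulation (plain Python, just counting how often the buffer has to be reallocated while appending one element at a time):

def count_reallocations(n_appends, growth_factor):
    capacity, size, reallocations = 0, 0, 0
    for _ in range(n_appends):
        size += 1
        if size > capacity:
            capacity = max(size, int(capacity * growth_factor) + 1)
            reallocations += 1          # a fresh allocation plus a copy of every element
    return reallocations

print(count_reallocations(1_000_000, 1.0))   # grow by one slot each time: 1,000,000 reallocations
print(count_reallocations(1_000_000, 1.5))   # geometric growth: only a few dozen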
By moving our core data structures to C++ and using zero-copy views, we eliminated layers of overhead and unlocked the true performance potential of the underlying hardware.
Stay tuned for our next episode, where we'll start to explore cppyy and how we plan on making Brian2’s runtime mode truly run dynamically …