Brian2: How We Made Spike Processing Faster by Eliminating Python Overhead


Recap: Cython, Code Generation, and the SpikeQueue Bottleneck
Last time, we dug into how code generation works in Brian2, and we ended by looking at the SpikeQueue: a great example of where Cython introduces significant overhead and complexity.
The Problem: When Fast C++ Gets Slow Through Python
Picture this: you’ve built a race car with a Ferrari-grade engine (C++), but to start it, the driver has to get out, walk to a control room, fill out some paperwork, get it approved, walk back, and only then press the gas. Not the most efficient race strategy.
(Ok, that’s a clumsy analogy, but you get the gist: this is essentially what was happening in Brian2’s spike processing pipeline.)
Brian2 is a powerful neural simulator that lets researchers model millions of neurons using simple Python scripts. But here's the catch: Python is notoriously slow for the heavy computational loops that neural simulations demand. The Brian2 developers solved this by writing performance-critical parts in C++ and using Cython to bridge between Python and C++. Clever, right?
Well, not quite. While the engine (C++ code) was lightning fast, the communication system had become a performance bottleneck that could take 2-3 minutes just to compile code for each simulation.
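Before we dig in, here's roughly what such a script looks like (a minimal, illustrative model I made up for this post, not Brian2's own example). Note the 2 ms synaptic delay: managing exactly that kind of delayed delivery is the job of the SpikeQueue we're about to meet.
from brian2 import NeuronGroup, Synapses, run, ms

# 100 leaky neurons; v drifts toward 1 and fires once it crosses 0.8
G = NeuronGroup(100, 'dv/dt = (1 - v) / (10*ms) : 1',
                threshold='v > 0.8', reset='v = 0')

# Sparse connections whose spikes arrive 2 ms late --
# the SpikeQueue is what holds them in the meantime
S = Synapses(G, G, on_pre='v += 0.1', delay=2*ms)
S.connect(p=0.1)

run(100*ms)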
Meet the SpikeQueue: The Neural Communication Infrastructure
Before diving into the optimization, let's understand what a SpikeQueue actually does. Think of it as a sophisticated postal service for the brain (the human one).
What is a SpikeQueue?
In neural networks, when a neuron "fires" (generates a spike), it needs to send signals to other neurons it's connected to. But here's the thing: these signals don't arrive instantly. Just like mail takes time to deliver, neural signals have delays.
The SpikeQueue is Brian2's system for managing these delayed deliveries:
Neuron A fires at time T=5ms
↓
SpikeQueue receives spike
↓
Spike scheduled to arrive at Neuron B at T=7ms (2ms delay)
↓
At T=7ms, SpikeQueue delivers spike to Neuron B
Visualizing the SpikeQueue Structure
Think of the SpikeQueue data structure as a circular conveyor belt with time slots:
Time → [Now] [T+1] [T+2] [T+3] [T+4] ...
┌─────┬─────┬─────┬─────┬─────┐
│ [] │ [2] │[1,3]│ [4] │ [] │ ← Each slot contains synapse IDs
└─────┴─────┴─────┴─────┴─────┘
↑
Current position
How it works (a toy code sketch follows this list):
Push: When neurons fire, their spikes get placed in future time slots based on their delays
Peek: Check what spikes are ready to process RIGHT NOW
Advance: Move to the next time step and clear the current slot
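To make those three operations concrete, here's a minimal pure-Python sketch of the idea (a toy of my own, not Brian2's actual implementation; the real queue is the C++ CSpikeQueue we'll meet below):
class ToySpikeQueue:
    """A toy circular conveyor belt of time slots (not Brian2's real code)."""

    def __init__(self, n_slots):
        self.slots = [[] for _ in range(n_slots)]  # one list of synapse IDs per timestep
        self.current = 0

    def push(self, synapse_ids, delay_steps):
        # Schedule each synapse ID `delay_steps` timesteps into the future
        for syn_id, d in zip(synapse_ids, delay_steps):
            self.slots[(self.current + d) % len(self.slots)].append(syn_id)

    def peek(self):
        # Synapse IDs ready to process RIGHT NOW
        return self.slots[self.current]

    def advance(self):
        # Clear the current slot and rotate the belt one timestep forward
        self.slots[self.current] = []
        self.current = (self.current + 1) % len(self.slots)


q = ToySpikeQueue(n_slots=5)
q.push([2], [1])        # synapse 2, one step from now
q.push([1, 3], [2, 2])  # synapses 1 and 3, two steps from now
q.advance()
print(q.peek())  # [2]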
The Performance Nightmare: A Journey Through Abstraction Layers
Let's trace what happened when a spike needed to be processed in the old system:
The Inefficient Journey (Before Optimization)
┌─────────────────────────────────────────────────────────────┐
│ The Long Journey Home │
│ │
│ Generated → Python → Cython → Python → Cython → C++ │
│ Template Method Call Method Call │
│ ↓ ↓ ↓ ↓ ↓ │
│ ~20ns ~50ns ~100ns ~150ns ~160ns │
│ │
│ Total time per spike: ~600ns (with all the overhead!) │
└─────────────────────────────────────────────────────────────┘
Note: these timings are illustrative, not realistic benchmarks.
Step 1: Generated Template Code (Bad)
# This is what Brian2 was generating:
{% block maincode %}
owner.push_spikes() # 😱 Python method call from generated code!
{% endblock %}
Step 2: Python Method (More Overhead)
def push_spikes(self):
    events = self.eventspace[: self.eventspace[len(self.eventspace) - 1]]
    if len(events):
        self.queue.push(events)  # 😱 Another Python method call!
Step 3: Cython Wrapper (Even More Overhead)
def push(self, np.ndarray[int32_t, ndim=1, mode='c'] spikes):
    # 😱 Python array conversion and bounds checking
    self.thisptr.push(<int32_t*>spikes.data, spikes.shape[0])
Step 4: Finally, C++ (The Actual Work)
void CSpikeQueue::push(int32_t *spikes, int nspikes) {
    // 😮💨 This is where the real work happens - but we took forever to get here!
}
The Breakthrough: Direct C++ Access with PyCapsules
Our solution was elegant: eliminate all the middle layers and go directly to C++.
The Magic of PyCapsules
PyCapsules are Python's way of safely passing raw C++ pointers between different parts of a program. Think of them as "sealed envelopes" containing the address of C++ objects.
def get_capsule(self):
    """
    Returns a sealed envelope containing the address of our C++ SpikeQueue.
    No Python overhead, no conversions - just a direct pointer!
    """
    return PyCapsule_New(<void*>self.thisptr, "CSpikeQueue", NULL)
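If you haven't used capsules before, here's a tiny self-contained Cython sketch of the full round trip (illustrative names like my_int, not Brian2 code): seal a raw pointer on one side, unseal it on the other.
# capsule_demo.pyx -- compile with Cython
from cpython.pycapsule cimport PyCapsule_New, PyCapsule_GetPointer
from libc.stdlib cimport malloc, free

cdef int* value = <int*>malloc(sizeof(int))
value[0] = 42

# Producer: seal the raw pointer in a named capsule
capsule = PyCapsule_New(<void*>value, "my_int", NULL)

# Consumer: unseal it -- the name must match, or Python raises an error
cdef int* same = <int*>PyCapsule_GetPointer(capsule, "my_int")
print(same[0])  # 42 -- the very same memory, no copy

free(value)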
The New Optimized Journey
┌─────────────────────────────────────────────────────────────┐
│ The Express Route │
│ │
│ Generated → Extract → Direct C++ → Profit! │
│ Template Capsule Method Call │
│ ↓ ↓ ↓ │
│ ~20ns ~10ns ~2ns │
│ │
│ Total time per spike: ~50ns (12x faster!) │
└─────────────────────────────────────────────────────────────┘
Note: these timings are illustrative, not realistic benchmarks.
Optimized Template Code
Here's what our new generated templates look like:
For Pushing Spikes:
{% block maincode %}
# Extract the C++ object directly from the capsule
cdef object capsule = queuecapsule
cdef CSpikeQueue* cpp_queue = <CSpikeQueue*>PyCapsule_GetPointer(capsule, "CSpikeQueue")

# Get spike count directly from the buffer
cdef int spike_count = {{eventspace}}[_num{{eventspace}}-1]

if spike_count > 0:
    # Direct C++ call - no Python overhead!
    cpp_queue.push({{eventspace}}, spike_count)
{% endblock %}
For Processing Spikes:
{% block maincode %}
# Direct C++ access
cdef CSpikeQueue* cpp_queue = <CSpikeQueue*>PyCapsule_GetPointer(capsule, "CSpikeQueue")

# Get spikes that are ready NOW
cdef vector[int32_t]* spike_vector = cpp_queue.peek()
cdef size_t num_spikes = dereference(spike_vector).size()

if num_spikes == 0:
    cpp_queue.advance()  # Nothing to do, move to next timestep
    return

# Direct memory access to spike data
cdef int32_t* spike_data = &dereference(spike_vector)[0]

# Process each spike with pure C++ speed
for i in range(num_spikes):
    synapse_id = spike_data[i]  # No bounds checking, no Python objects!
    # ... process the synapse ...

cpp_queue.advance()  # Move to next timestep
{% endblock %}
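One detail worth noting: the template can only cast and call like this because the C++ class has been declared to Cython somewhere. Here's a sketch of what that declaration looks like (the header name and exact signatures are my inference from the calls above, not Brian2's actual file):
from libc.stdint cimport int32_t
from libcpp.vector cimport vector

cdef extern from "cspikequeue.h":
    cdef cppclass CSpikeQueue:
        void push(int32_t* spikes, int nspikes)
        vector[int32_t]* peek()
        void advance()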
Real-World Performance: The Numbers Don't Lie
Let's see the optimization in action with a concrete example:
Scenario: 3 Synapses Ready to Fire
Synapse IDs: [42, 17, 99]
Memory location: 0x7fff8c002000
Data layout: Three 32-bit integers in C++ memory
Old Approach Execution Timeline
Time     What's Happening
-----    ----------------
0ns      Call Python method owner.push_spikes()
50ns     Python method dispatch overhead
100ns    Create NumPy array wrapper around C++ data
150ns    Python bounds checking for array access
200ns    Finally access spike_data[0] = 42
250ns    More bounds checking...
300ns    Access spike_data[1] = 17
350ns    Even more bounds checking...
400ns    Access spike_data[2] = 99
450ns    Python cleanup and return
Total: ~500ns for 3 spikes
New Approach Execution Timeline
Time     What's Happening
-----    ----------------
0ns      Extract C++ pointer from capsule
10ns     Direct C++ method call cpp_queue.peek()
15ns     Get pointer to vector data
20ns     Access spike_data[0] = 42 (direct memory access)
25ns     Access spike_data[1] = 17 (direct memory access)
30ns     Access spike_data[2] = 99 (direct memory access)
35ns     Call cpp_queue.advance()
Total: ~40ns for 3 spikes
The Beautiful Thing: Same Memory, Zero Copies
The most elegant part of this optimization is that both approaches access the exact same memory location. The only difference is the path we take to get there:
Memory Layout (Identical in Both Cases):
┌─────────────────────────────────────┐
│ Address: 0x7fff8c002000 │
│ Content: [42, 17, 99, ?, ?, ...] │ ← Same data!
│ Type: int32_t array │
└─────────────────────────────────────┘
Old Path: Template → Python → NumPy → Bounds Check → Memory
New Path: Template → Capsule → Direct Pointer → Memory
No data copying, no conversions, no allocations. Just a much shorter route to the same destination.
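You can demo the zero-copy idea in plain NumPy, by the way (nothing Brian2-specific here): wrapping existing memory creates a view, not a copy.
import numpy as np

buf = bytearray(12)                        # pretend this is the C++ vector's storage
view = np.frombuffer(buf, dtype=np.int32)  # zero-copy wrapper around the same bytes
# frombuffer over a mutable bytearray yields a writable array, so:
view[:] = [42, 17, 99]
print(bytes(buf).hex())  # the underlying bytes changed -- same memory, no copy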
Implementation Details: The Three Critical Methods
Our optimization focused on the three core SpikeQueue operations:
1. push() - Adding Spikes to Future Time Slots
// When a neuron fires, schedule its synapses for future delivery
cpp_queue->push(spike_array, spike_count);
2. peek() - Getting Spikes Ready Now
// What synapses should fire right now?
vector<int32_t>* ready_spikes = cpp_queue->peek();
3. advance() - Moving to Next Timestep
// Done processing this timestep, move forward
cpp_queue->advance();
Each of these now bypasses all Python overhead and runs at native C++ speeds.
The Bigger Picture: Why This Matters
This optimization solves a fundamental problem in high-performance Python applications:
“the cost of abstraction layers”
While Cython is excellent for bridging Python and C++, it can become a bottleneck when:
You have very frequent function calls (millions per second)
The actual work is simple (just moving pointers around)
You're calling the same C++ methods repeatedly
Our solution demonstrates that sometimes the best optimization is elimination - removing layers rather than making them faster :)
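If you want to feel that per-call cost yourself, here's a rough microbenchmark (unrelated to Brian2; absolute numbers vary wildly by machine):
import timeit

def noop():
    pass

# Even a do-nothing Python call costs tens of nanoseconds of dispatch overhead --
# negligible once, crippling at millions of calls per second.
per_call = timeit.timeit(noop, number=1_000_000) / 1_000_000
print(f'~{per_call * 1e9:.0f} ns per Python function call')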
What's Next: The Dynamic Arrays Challenge
SpikeQueue was just the beginning. Our next target is Dynamic Arrays - another critical Brian2 component that suffers from similar Python-mediated access patterns.
Read more in this series:
Series Starter: My journey in exploring and understanding Brian2 Codebase ...
Part 1: Understanding Brian2's Code Generation Architecture
Part 2: Fixing SpikeQueue Implementation (You are here)
Part 3: Dynamic Arrays Memory Access Patterns (Coming next!)