Escaping the AOT Trap: Why Brian2 Is Exploring a Cppyy-Powered Runtime


I thought I understood the compilation problem in Brian2 until I realized that every "obvious" solution - manual C extensions, direct g++ calls, even fancy build systems - fell into the same fundamental trap: trying to solve a dynamic runtime problem with static ahead-of-time thinking. Here's why that approach is doomed, and how cppyy breaks free of the paradigm entirely.
The Fundamental Misunderstanding
When I first identified Brian2's compilation bottleneck, my instinct was to look for alternatives to Cython. "Surely," I thought, "we can just write C++ directly and skip the Cython overhead." This led me down a path of investigating manual C extensions, direct compiler invocation, and various build system optimizations.
Every single one of these approaches missed the fundamental nature of the problem.
The issue isn't that Cython is slow. The issue isn't that we're using the wrong compiler flags. The issue isn't even that we're writing too much generated code.
The issue is that we're doing Ahead-of-Time compilation at runtime.
Understanding the Compilation Paradigms
Let me clarify the two fundamental approaches to compilation:
Ahead-of-Time (AOT) Compilation: Write source code, compile it into machine code, then execute the machine code later. This is the traditional model used by languages like C, C++, and Rust.
Just-in-Time (JIT) Compilation: Generate or receive source code during program execution, compile it immediately in memory, and execute it without intermediate steps. This is used by languages like Java (JVM), C# (.NET), and modern JavaScript engines.
Brian2's problem is unique: it needs to generate and compile C++ code dynamically based on user-defined neuron equations. This is fundamentally a JIT problem, but every traditional solution approaches it with AOT thinking.
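To make the two paradigms concrete, here's a minimal sketch of what the JIT model looks like in practice with cppyy (which we'll dig into properly later). The kernel below is illustrative, not actual Brian2-generated code:

```python
import cppyy

# Hand Cling (cppyy's embedded C++ JIT) a source string: it compiles
# in memory - no files written, no external compiler process launched.
cppyy.cppdef("""
double integrate_step(double v, double I, double dt, double tau) {
    return v + (-v + I) * dt / tau;
}
""")

# The freshly compiled function is immediately callable from Python.
print(cppyy.gbl.integrate_step(0.0, 1.0, 0.1, 10.0))  # -> 0.01
```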
The Manual C Extension Mirage: Why Our "Obvious" Solution Was Still Wrong
When we first identified the Cython bottleneck in Brian2, our team's initial reaction was logical: "We already have working C++ templates for the standalone device. Why not just adapt those to generate manual C extensions instead of going through Cython?" This seemed like the perfect solution - until we realized we were still trapped in the same AOT paradigm.
The Seemingly Brilliant Plan
Brian2 already has a sophisticated C++ standalone device that generates clean, efficient C++ code. Our templates produce code like this:
```cpp
// From Brian2's existing C++ standalone templates
void _run_neurongroup_stateupdater() {
    using namespace brian;
    const int N = 10000;
    double* const v = _array_neurongroup_v;
    double* const I = _array_neurongroup_I;
    for (int _neuron_idx = 0; _neuron_idx < N; _neuron_idx++) {
        const double _v = v[_neuron_idx];
        const double _I = I[_neuron_idx];
        // Integration step
        v[_neuron_idx] = _v + (-_v + _I) * dt / tau;
    }
}
```
This code is clean, fast, and exactly what we want. The problem was the compilation pipeline: generate this code, write it to files, call g++, link everything together.
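In Python terms, that pipeline looks roughly like this (a simplified, hypothetical sketch - the real machinery in Brian2 adds caching and error handling on top; `build_and_load` and the compiler flags shown here are illustrative):

```python
import ctypes
import subprocess
import tempfile
from pathlib import Path

def build_and_load(generated_source: str, name: str) -> ctypes.CDLL:
    """Sketch of AOT compilation at runtime: write, compile, wait, load."""
    workdir = Path(tempfile.mkdtemp())
    src = workdir / f"{name}.cpp"
    src.write_text(generated_source)       # 1. write generated code to disk
    lib = workdir / f"lib{name}.so"
    subprocess.run(                        # 2. launch an external compiler process...
        ["g++", "-O3", "-shared", "-fPIC", str(src), "-o", str(lib)],
        check=True,
    )                                      # 3. ...and block until it finishes
    return ctypes.CDLL(str(lib))           # 4. load the compiled result

# Every distinct code object pays this full cost. (The kernel needs
# extern "C" linkage for ctypes to find the symbol by name.)
# lib = build_and_load(kernel_source, "neurongroup_stateupdater")
# lib._run_neurongroup_stateupdater()
```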
Our "brilliant" insight was: why not adapt these same templates to generate Python C extension code instead of standalone C++? We already had the hard part figured out - generating efficient computational kernels. We just needed to wrap them in Python C API boilerplate.
The Manual C Extension Strategy
Our plan was straightforward: modify our existing Jinja2 templates to generate C extension modules instead of Cython code. Instead of this Cython template:
```python
# Current Brian2 Cython template (simplified)
{% macro main_code() %}
def run_neurongroup_stateupdater(
        np.ndarray[double, ndim=1, mode='c'] _v,
        np.ndarray[double, ndim=1, mode='c'] _I,
        const int N,
        const double dt):
    cdef int _neuron_idx
    cdef double v, I
    for _neuron_idx in range(N):
        v = _v[_neuron_idx]
        I = _I[_neuron_idx]
        {{code_lines|autoindent}}
        _v[_neuron_idx] = v
{% endmacro %}
```
We would generate this C extension template:
```cpp
// Our planned manual C extension template
{% macro main_code() %}
#include <Python.h>
#include <numpy/arrayobject.h>

static PyObject* run_neurongroup_stateupdater(PyObject* self, PyObject* args) {
    PyArrayObject *v_array, *I_array;
    int N;
    double dt;
    // Parse arguments
    if (!PyArg_ParseTuple(args, "OOid", &v_array, &I_array, &N, &dt)) {
        return NULL;
    }
    // Extract data pointers
    double *v = (double*)PyArray_DATA(v_array);
    double *I = (double*)PyArray_DATA(I_array);
    // Main computation - reuse our existing C++ kernel!
    for (int _neuron_idx = 0; _neuron_idx < N; _neuron_idx++) {
        double _v = v[_neuron_idx];  // non-const: the injected kernel updates _v
        const double _I = I[_neuron_idx];
        {{code_lines|autoindent}}    // Same computational kernel
        v[_neuron_idx] = _v;
    }
    Py_RETURN_NONE;
}

static PyMethodDef methods[] = {
    {"run_neurongroup_stateupdater", run_neurongroup_stateupdater, METH_VARARGS, "Update neurons"},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT, "{{module_name}}", NULL, -1, methods
};

PyMODINIT_FUNC PyInit_{{module_name}}(void) {
    import_array();
    return PyModule_Create(&module);
}
{% endmacro %}
```
This felt like the perfect solution. We could:
- Reuse our existing C++ computational kernels without modification
- Eliminate Cython entirely, along with its compilation overhead
- Generate cleaner code without Cython's template expansion bloat
- Maintain our template-based architecture with minimal changes
Why This Felt So Right Initially
The appeal was obvious. Look at the difference in generated code complexity:
Current Cython Output (Simplified)
```python
# What Cython generates for a simple neuron update:
def run_neurongroup_stateupdater_codeobject(
        np.ndarray[double, ndim=1, mode='c'] _array_neurongroup_v,
        np.ndarray[double, ndim=1, mode='c'] _array_neurongroup_I,
        const int _num_neurons,
        const double dt,
        const double tau):
    cdef int _neuron_idx
    cdef int _vectorisation_idx
    cdef double v, I
    cdef double _lio_1, _lio_2, _lio_3, _lio_4
    # 500+ lines of boilerplate and intermediate variables...
    for _neuron_idx in range(_num_neurons):
        v = _array_neurongroup_v[_neuron_idx]
        I = _array_neurongroup_I[_neuron_idx]
        # Lots of intermediate calculations for a simple equation
        _lio_1 = (-v)
        _lio_2 = (_lio_1 + I)
        _lio_3 = (_lio_2 / tau)
        _lio_4 = (_lio_3 * dt)
        v += _lio_4
        _array_neurongroup_v[_neuron_idx] = v
    # Plus 1000+ more lines of module initialization boilerplate...
```
Our Planned Manual C Extension
```cpp
// Clean, minimal C extension using our existing computational kernel:
static PyObject* run_neurongroup_stateupdater(PyObject* self, PyObject* args) {
    PyArrayObject *v_array, *I_array;
    int N;
    double dt, tau;
    if (!PyArg_ParseTuple(args, "OOidd", &v_array, &I_array, &N, &dt, &tau)) {
        return NULL;
    }
    double *v = (double*)PyArray_DATA(v_array);
    double *I = (double*)PyArray_DATA(I_array);
    // Direct, efficient computation - no intermediate variables
    for (int i = 0; i < N; i++) {
        v[i] += (-v[i] + I[i]) * dt / tau;
    }
    Py_RETURN_NONE;
}
// Minimal module boilerplate...
```
The manual approach was clearly more direct and efficient. No Cython overhead, no template expansion bloat, just clean C code wrapped in minimal Python API calls.
The Crushing Realization
As we started planning the implementation details, the brutal truth became clear: we would eliminate Cython but not the fundamental bottleneck.
Through careful analysis, we could see the compilation times would indeed be better:
- Cython approach: 25-45 seconds per code object (as a representative example)
- Manual C extension: an estimated 15-25 seconds per code object
But we would still be:
- Writing large files to disk - each generated C extension would still be thousands of lines once the Python API boilerplate is included
- Launching external compiler processes - every `subprocess.run(['gcc', ...])` call would carry significant overhead (see the sketch below)
- Waiting for compilation - gcc would still need to parse our generated code and run full optimization passes
- Managing temporary files - file I/O, cleanup, and cache management would remain complex
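A quick way to see where that overhead lives: time the pipeline on an essentially empty kernel, so that everything measured is fixed cost rather than useful work. (An illustrative sketch reusing the hypothetical `build_and_load` helper from earlier.)

```python
import time

# An (almost) empty kernel: any time spent compiling and loading it
# is pure pipeline overhead that every generated code object would pay.
TRIVIAL_KERNEL = 'extern "C" void noop() {}'

start = time.perf_counter()
build_and_load(TRIVIAL_KERNEL, "noop_kernel")
print(f"Fixed AOT pipeline cost: {time.perf_counter() - start:.2f} s")
```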
Before writing a single line of implementation code, we realized we were optimizing the wrong thing. The problem wasn't Cython's code generation - the problem was doing AOT compilation at runtime, period.
Why AOT Thinking Trapped Us
The manual C extension approach failed because we were still thinking in AOT terms:
1. Generate source code (C instead of Cython, but still large files)
2. Write to disk (still required for gcc to process)
3. Launch external compiler (gcc instead of cython+gcc, but still external)
4. Wait for compilation (faster, but still 15-25 seconds)
5. Load result (same complexity)
We had optimized step 3 but left the fundamental pipeline intact. The bottleneck wasn't Cython specifically - it was the entire file-based, external-compiler, AOT workflow.
This exercise showed us that the problem wasn't solvable within the AOT paradigm, no matter how much we optimized it. We needed to break free from AOT thinking entirely.
That's what led us to cppyy - and that's where the real revolution began.
Next, we'll dive into how cppyy's JIT compilation approach works and see why it's revolutionary … :)