Understanding cppyy: A Truly Automatic Python-C++ Binding

Mrigesh Thakur
15 min read

As discussed in the previous blog post, we concluded that our entire approach to performance was trapped in traditional Ahead-of-Time (AOT) thinking. It became clear that no matter how much we optimized our existing pipeline, we were fighting the fundamental nature of the problem.

So, let's take a step back and focus on the core issues we're trying to solve with our current runtime, which is powered by Cython. While functional, this approach has created significant friction:

  1. The Compilation Speed Problem: Brian2's power comes from its ability to compile user-defined equations dynamically. Cython's file-based compilation is a major bottleneck here, slowing down the start of every simulation.

  2. The Two-Headed Dragon of Code Generation: We currently have to maintain two separate code generation targets: one for the Cython runtime and another for the C++ standalone mode. This adds immense complexity and doubles our maintenance workload.

  3. The "One-Size-Fits-All" Problem: Cython is a general-purpose tool. For our very specific needs, it generates thousands of lines of C++ boilerplate that we don't need, creating bloat. To combat the slow compilation, we introduced a caching mechanism, but this created its own set of headaches with managing cache size and invalidation.

We need a solution that isn't just a better AOT compiler, but a complete shift in philosophy. This is what leads us to cppyy.

In this post, I'll take you on my journey as I explore this new technology. We'll document what cppyy is, how its Just-in-Time (JIT) compilation works, and how we plan to experiment with it as a potential future for Brian2's runtime. Let's dive in.


But First, A Shocking Revelation: What is Python?

I'll be honest: until I started this cppyy deep dive, I thought Python was just... Python. You download it, you run `python script.py`, and magic happens. But wow, was I in for a shock!

Python is just a language specification, a set of rules. The "magic" we all use every day is actually an implementation of those rules. The most popular one, the one you almost certainly have, is CPython, which is written in C.

Understanding how CPython works is the key to understanding why tools like cppyy are so revolutionary. When you run a script with CPython, it's a two-step process:

  1. Compilation to Bytecode: First, CPython reads your .py file and parses it into a structure called an Abstract Syntax Tree (AST). This tree represents the logical flow of your code. For example, x = 10 + 20 becomes a tree where = is the main operation. This AST is then compiled into Python bytecode, a simpler, intermediate language that is platform-independent. This is the stage where .pyc files are created in your __pycache__ directory (yup, now you know, like me, what that __pycache__ thing is 😉) to speed things up on subsequent runs.

  2. Interpretation by the PVM: Now that it has bytecode, the Python Virtual Machine (PVM) takes over. The PVM is the heart of CPython, a runtime engine that works in a simple but incredibly fast loop:

    • Read one bytecode instruction (e.g., BINARY_ADD, STORE_FAST).

    • Execute it.

    • Move to the next instruction.

This "interpreter loop" churns through the bytecode until the script is done.
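We can watch this happen with Python's built-in `dis` module, which prints the bytecode the PVM will loop over (exact opcode names vary between CPython versions):

```python
import dis

def add_numbers():
    x = 10 + 20  # CPython folds this to the constant 30 at compile time
    return x

# Render the function's compiled bytecode as text
bytecode_text = dis.Bytecode(add_numbers).dis()
print(bytecode_text)
# The listing shows instructions such as LOAD_CONST and STORE_FAST:
# exactly the units the PVM executes one at a time
```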

Your `.py` file  ───>  ┌──────────────────┐
                       │ CPython Compiler │
                       └──────────────────┘
                              │
                              ▼
                       ┌──────────┐
                       │ Bytecode │ (saved in `.pyc`)
                       └──────────┘
                              │
                              ▼
       ┌───────────────────────────────────────────┐
       │        Python Virtual Machine (PVM)       │
       │ (Reads and executes bytecode instruction  │
       │  by instruction inside a giant loop)      │
       └───────────────────────────────────────────┘
                              │
                              ▼
                       ┌──────────┐
                       │  Result  │
                       └──────────┘

Why This Matters: CPython, PyPy, and the cppyy Bridge

CPython isn't the only game in town. There's also:

  • PyPy: A Python interpreter written in RPython (a restricted subset of Python) that uses a sophisticated Just-in-Time (JIT) compiler to dramatically speed up code.

  • Jython: Python running on the Java Virtual Machine (JVM).

So why do we need to understand all this? Because, as the cppyy docs explain, the performance of Python-C++ bindings depends heavily on the underlying Python implementation. Now let's see what cppyy is.

Core Architecture of cppyy

cppyy is an automatic Python-C++ bindings generator built on Cling, an interactive C++ interpreter based on LLVM/Clang. Hmm, wait: what are Cling and Clang (they sound rhymy), what is LLVM, and what are bindings? I was in the same boat, so here's a guide to what each part is. Let's go back to basics.

The Language Barrier

Imagine you're a Python programmer, but you have a friend who has written an amazing, super-fast library in C++. You want to use that library, but there's a problem: Python and C++ are like people speaking different languages.

# You want to do this in Python:
result = my_cpp_library.fast_calculation(data)

// But the library exists in C++:
double fast_calculation(std::vector<double>& data) {
    // Super optimized C++ code here
    return result;
}

The Problem: How do you call C++ functions from Python? How do you pass Python data to C++ functions? How do you get results back?

What Are "Bindings"?

Bindings are like translators that sit between Python and C++. They handle the conversation:

  1. Converting Python data to C++ data

     python_list = [1.0, 2.0, 3.0]
     # Binding converts this to:
     # std::vector<double> cpp_vector = {1.0, 2.0, 3.0}

  2. Calling the C++ function

     double result = fast_calculation(cpp_vector);

  3. Converting C++ results back to Python

     # Binding converts result back to Python float
     python_result = 42.5

Think of bindings as a universal translator that knows both languages perfectly. And yes, as we have discussed a lot in this series, Cython is also a language that helps bind the worlds of C++ and Python.
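To make the translator idea concrete, here is a hand-rolled version of those three steps using Python's built-in ctypes against the C math library. This is a minimal sketch, not cppyy; it assumes a Unix-like system where libm (or the Python process itself) exposes sqrt:

```python
import ctypes
import ctypes.util

# Load the C math library; fall back to the running process's own
# symbols, which include libm on most Unix-like systems
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)

# Declare the C signature by hand: double sqrt(double)
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

# Python float -> C double on the way in,
# C double -> Python float on the way out
result = libm.sqrt(16.0)
print(result)  # 4.0
```

cppyy's promise is to do this declaration work automatically by reading the C++ headers, instead of us spelling out every signature by hand.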

Understanding Modern Compiler Technology (LLVM, Clang, and the Revolution)

Now I need to explain the modern tools that make cppyy possible. Think of this as the difference between old-fashioned factories and modern automated production lines.

What is LLVM? (The Universal Assembly Line)

LLVM is like a universal "assembly line" for turning any programming language into machine code. Here's the key insight:

Traditional Approach (before LLVM):

C++ Code → C++ Compiler → x86 Machine Code
Python → Python Interpreter → (stays in Python)
Java → Java Compiler → Java Bytecode → JVM → Machine Code

Each language had its own completely separate path to machine code.

LLVM Approach:

C++ Code    → Clang Frontend → LLVM IR ↘
Python Code → LLVM Frontend → LLVM IR → LLVM Backend → Machine Code
Java Code   → LLVM Frontend → LLVM IR ↗

LLVM created a universal intermediate representation (LLVM IR) that any language can target, and then LLVM handles the final step to machine code.

What is LLVM IR?

LLVM IR (Intermediate Representation) is like a universal assembly language that's much more advanced than traditional assembly. Here's an example:

C++ code:

int add(int a, int b) {
    return a + b;
}

LLVM IR:

define i32 @add(i32 %a, i32 %b) {
entry:
  %add = add nsw i32 %a, %b
  ret i32 %add
}

Machine Code (x86):

add:
    addl    %esi, %edi
    movl    %edi, %eax
    retq

The beauty is that LLVM IR is platform-independent but still very close to machine code.

What is Clang? (The C++ Translator)

Clang is the part of LLVM that specifically understands C++. Think of it as an expert translator who can read C++ and convert it to LLVM IR.

// Clang reads this C++:
class MyClass {
    int value;
public:
    MyClass(int v) : value(v) {}
    int getValue() const { return value; }
};

// And produces LLVM IR that represents all the C++ concepts:
// - Class layout
// - Constructor logic  
// - Member function calls
// - etc.

Clang is incredibly sophisticated - it understands:

  • Templates and template instantiation

  • Inheritance and virtual functions

  • Operator overloading

  • Modern C++ features (C++11, C++14, C++17, C++20)

  • Complex type systems

Just-In-Time (JIT) Compilation: The Game Changer

Here's where it gets really interesting. Traditionally, compilation worked like this:

Ahead-of-Time (AOT) Compilation (traditional):

Write Code → Compile → Run (later)
     ↓           ↓       ↓
  Slow      Very Slow   Fast

Just-In-Time (JIT) Compilation (modern):

Write Code → Run (compilation happens during execution)
     ↓           ↓
   Fast     Fast (after brief initial delay)

JIT compilation means the compiler runs while your program is running and can make optimizations based on the actual data and usage patterns it sees.

Why JIT can be faster than AOT:

  1. Runtime optimization: The JIT can see how your code actually behaves and optimize for those patterns

  2. No file I/O: Everything happens in memory

  3. Incremental compilation: Only compile what you actually use

  4. Adaptive optimization: If usage patterns change, recompile with better optimizations
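Point 1 is easiest to see with a toy Python analogy: generate source specialized to a value observed at runtime, compile it in memory, and run it, with no files involved (a sketch of the idea, not of how LLVM's JIT works internally):

```python
# Suppose at runtime we observe that tau is always 10.0.
# An AOT compiler can't know that; a JIT can bake it in.
tau_observed = 10.0

# Build source with the observed constant folded in
source = f"def decay(v, dt):\n    return v + dt * (-v / {tau_observed})\n"

# Compile the string straight to executable code in memory (no file I/O)
namespace = {}
exec(compile(source, "<jit>", "exec"), namespace)
decay = namespace["decay"]

print(decay(1.0, 0.1))  # 0.99
```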

What is Cling?

Cling is like giving C++ the superpower of being interactive like Python. Traditionally, C++ worked like this:

// Traditional C++: Write entire program, compile, run
#include <iostream>
int main() {
    int x = 5;
    int y = 10;
    std::cout << x + y << std::endl;
    return 0;
}
// Then: g++ program.cpp -o program && ./program

Cling lets you do this:

// Interactive C++ with Cling:
[cling] int x = 5;
[cling] int y = 10;  
[cling] x + y
(int) 15
[cling] #include <vector>
[cling] std::vector<int> v = {1, 2, 3, 4, 5};
[cling] v.size()
(unsigned long) 5

You can type C++ code line by line and see results immediately, just like Python!

How Cling Works (The Technical Magic)

Cling is built on top of Clang and LLVM, and here's the brilliant part:

  1. Parse C++ incrementally: When you type a line of C++, Cling uses Clang to parse it immediately

  2. Generate LLVM IR: Clang converts your C++ to LLVM IR

  3. JIT compile: LLVM immediately compiles the IR to machine code in memory

  4. Execute: The machine code runs right away

  5. Remember state: Cling keeps track of all variables, functions, classes you've defined

// When you type this in Cling:
int calculate(int x) { return x * x + 2 * x + 1; }

// Cling immediately:
// 1. Parses the function with Clang
// 2. Generates LLVM IR for the function
// 3. JIT compiles to machine code
// 4. Stores the function pointer in memory
// 5. Ready to call instantly!

// Later when you type:
calculate(5)
// Cling directly calls the compiled machine code - super fast!

Why This is Revolutionary

Before Cling, if you wanted to run C++ code, you had to:

  1. Write complete source files

  2. Invoke the compiler (slow)

  3. Link everything together (slow)

  4. Run the resulting executable

With Cling, you can:

  1. Type C++ code

  2. It runs immediately at full native speed

This is like the difference between having to publish a book every time you want to say something (old C++) versus having a conversation (Cling).

Enter cppyy: Putting It All Together

Now we get to the star of the show! cppyy combines Python with Cling to create something magical.

The cppyy Architecture

cppyy is essentially Cling embedded inside Python. Here's how it works:

So cppyy operates on a fundamentally different principle than traditional binding systems. Instead of generating static bindings at compile time, it creates a live bridge between Python and C++ using just-in-time compilation. But how does it do that?

What happens internally in cppyy:

The Three-Layer System

Layer 1: Python Interface Layer This is what users interact with. It provides Python functions like cppdef(), include(), and the gbl namespace for accessing C++ code.

Layer 2: CPyCppyy Bridge Layer This C extension module handles the translation between Python and C++ at runtime. It manages memory, converts types, and creates proxy objects.

Layer 3: Cling Interpreter Engine The core engine that parses C++ code, compiles it to machine code, and executes it, based on the Clang/LLVM technology we discussed above.

How Code Loading Works

When you write this:

cppyy.cppdef("""
    class Calculator {
    public:
        int add(int a, int b) { return a + b; }
    };
""")

Here's the internal process:

Step 1: Code Parsing

The C++ code string goes directly to Cling, which uses the same parser as the Clang compiler. Cling builds an Abstract Syntax Tree (AST) that represents the structure of your C++ code.

Step 2: Symbol Registration

Cling doesn't immediately compile everything. Instead, it registers that a class called "Calculator" exists with a method called "add". This information is stored in symbol tables.

Step 3: Lazy Compilation

No machine code is generated yet. Compilation happens only when you actually try to use the code.

Dynamic Class Creation

When you access cppyy.gbl.Calculator, something interesting happens:

Calculator = cppyy.gbl.Calculator  # This triggers class creation

The class doesn't exist until this moment. Here's what cppyy does:

Runtime Class Factory

cppyy queries Cling for information about the Calculator class. It discovers the class has a constructor, an add method, and other metadata. Using this information, it dynamically creates a Python class that acts as a proxy.
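This "runtime class factory" has a direct Python analogue: classes are themselves objects you can manufacture at runtime with type(). The sketch below fakes the metadata Cling would report; the lambda is a stand-in for compiled machine code:

```python
# Pretend this is the metadata cppyy got back from Cling
discovered_name = "Calculator"
discovered_methods = {
    "add": lambda self, a, b: a + b,  # stand-in for a compiled C++ method
}

# Manufacture a Python proxy class at runtime from that metadata
Calculator = type(discovered_name, (object,), discovered_methods)

calc = Calculator()
print(calc.add(2, 3))       # 5
print(Calculator.__name__)  # Calculator
```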

Lazy Method Binding

The methods aren't bound immediately either. When you first call calc.add(), cppyy:

  1. Asks Cling to compile the add method to machine code

  2. Creates a Python callable that can invoke this machine code

  3. Caches this callable for future use

This means the first call to any method is slightly slower, but subsequent calls are at native C++ speed.

The Two-Phase Allocation System in cppyy

Why Two Phases Matter

Traditional Python extensions allocate everything at once. cppyy's two-phase system solves several critical problems:

  1. Lazy Initialization: C++ objects aren't created until absolutely necessary

  2. Reference Semantics: Multiple Python objects can reference the same C++ object

  3. Smart Pointer Integration: Proxies can wrap smart pointers transparently

  4. Error Handling: Construction can fail without leaving invalid Python objects
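Python itself creates every object in two phases, __new__ then __init__, and these are exactly the hooks CPyCppyy overrides. A minimal sketch of the pattern, with a dict standing in for the C++ allocation:

```python
class TwoPhase:
    def __new__(cls, *args):
        # Phase 1: allocate the Python shell; no C++ object yet
        self = super().__new__(cls)
        self.cpp_object = None  # mirrors fObject = NULL
        return self

    def __init__(self, value):
        # Phase 2: only now is the underlying resource "constructed"
        self.cpp_object = {"value": value}  # stand-in for the C++ side

# __new__ alone leaves a valid but empty shell...
shell = TwoPhase.__new__(TwoPhase)
print(shell.cpp_object)  # None

# ...and normal construction runs both phases
obj = TwoPhase(42)
print(obj.cpp_object)  # {'value': 42}
```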

Phase 1: Python Proxy Creation

calc = Calculator(42)

What Actually Happens in __new__


// CPyCppyy internal C code (simplified)
typedef struct {
    PyObject_HEAD           // Standard Python object header (24 bytes)
    void*     fObject;      // Pointer to C++ object (8 bytes) - NULL initially
    uint32_t  fFlags;       // Ownership and state flags (4 bytes)
    void*     fSmartPtr;    // Smart pointer storage (8 bytes) - optional
    PyObject* fType;        // Type information cache (8 bytes)
} CPPInstance;

static PyObject* CPPInstance_new(PyTypeObject* type, PyObject* args, PyObject* kwds) {
    CPPInstance* self = (CPPInstance*)type->tp_alloc(type, 0);
    if (self) {
        self->fObject = NULL;           // Critical: C++ object doesn't exist yet
        self->fFlags = 0;               // No ownership, not initialized
        self->fSmartPtr = NULL;         // No smart pointer yet
        self->fType = NULL;             // Type info loaded lazily
    }
    return (PyObject*)self;
}

Memory State After Phase 1

Python Heap Memory:
┌─────────────────────────────────┐
│ CPPInstance (Python proxy)      │
│ ├─ PyObject_HEAD: 24 bytes      │
│ ├─ fObject: NULL                │  ← No C++ object yet
│ ├─ fFlags: 0                    │  ← Not initialized
│ ├─ fSmartPtr: NULL              │  ← No smart pointer
│ └─ fType: NULL                  │  ← Type info not loaded
└─────────────────────────────────┘

C++ Heap Memory:
(empty - nothing allocated yet)

The Proxy Object Layout

// Detailed flag system
#define CPPYY_IS_OWNER      0x0001   // Python owns the C++ object
#define CPPYY_IS_SMARTPTR   0x0002   // Wraps a smart pointer
#define CPPYY_IS_REFERENCE  0x0004   // References existing object
#define CPPYY_IS_TEMP       0x0008   // Temporary object
#define CPPYY_IS_CONST      0x0010   // Const object
#define CPPYY_IS_INITIALIZED 0x0020  // __init__ was called
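The same bookkeeping can be modelled with Python's enum.IntFlag, which makes combining and testing these bits readable (flag values copied from the defines above; the class name is mine):

```python
from enum import IntFlag

class InstanceFlags(IntFlag):
    IS_OWNER       = 0x0001
    IS_SMARTPTR    = 0x0002
    IS_REFERENCE   = 0x0004
    IS_TEMP        = 0x0008
    IS_CONST       = 0x0010
    IS_INITIALIZED = 0x0020

# After Phase 2 the proxy carries the ownership and initialized bits
flags = InstanceFlags.IS_OWNER | InstanceFlags.IS_INITIALIZED
print(InstanceFlags.IS_OWNER in flags)     # True
print(InstanceFlags.IS_SMARTPTR in flags)  # False
print(int(flags))                          # 33
```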

Phase 2: C++ Object Creation

The __init__ Method Execution

# This triggers Phase 2
calc.__init__(42)

Step-by-Step C++ Object Creation

// CPyCppyy internal implementation
static int CPPInstance_init(CPPInstance* self, PyObject* args, PyObject* kwds) {
    // Step 1: JIT compile the constructor if needed
    const char* class_name = "Calculator";
    MethodProxy* constructor = get_constructor(class_name, args);

    if (!constructor->is_compiled) {
        // Ask Cling to compile: Calculator::Calculator(int)
        cling_compile_constructor(class_name, get_arg_types(args));
        constructor->func_ptr = cling_get_symbol("Calculator_ctor_int");
        constructor->is_compiled = true;
    }

    // Step 2: Allocate C++ memory
    size_t object_size = cling_sizeof("Calculator");  // Query Cling for size
    void* cpp_memory = malloc(object_size);           // Raw memory allocation

    // Step 3: Call placement new with compiled constructor
    typedef void (*ConstructorFunc)(void*, int);
    ConstructorFunc ctor = (ConstructorFunc)constructor->func_ptr;
    ctor(cpp_memory, PyLong_AsLong(PyTuple_GET_ITEM(args, 0)));  // Call C++ ctor

    // Step 4: Update proxy state
    self->fObject = cpp_memory;                    // Store C++ object pointer
    self->fFlags |= CPPYY_IS_OWNER;               // Python owns this object
    self->fFlags |= CPPYY_IS_INITIALIZED;         // Mark as initialized

    return 0;  // Success
}

Memory State After Phase 2

Python Heap Memory:
┌─────────────────────────────────┐
│ CPPInstance (Python proxy)      │
│ ├─ PyObject_HEAD: 24 bytes      │
│ ├─ fObject: 0x7fff12345678 ─────┼─┐  Points to C++ object
│ ├─ fFlags: OWNER|INITIALIZED    │ │
│ ├─ fSmartPtr: NULL              │ │
│ └─ fType: Calculator*           │ │
└─────────────────────────────────┘ │
                                    │
C++ Heap Memory:                    │
┌─────────────────────────────────┐ │
│ Calculator object               │◄┘
│ ├─ vtable pointer: 8 bytes      │     Virtual function table
│ ├─ member variables...          │     Actual C++ object data
│ └─ (total size from sizeof())   │
└─────────────────────────────────┘

Why cppyy is Perfect for Brian2

Now let's see how this solves Brian2's specific problems.

Current Brian2 Workflow Problems

Here's what happens now when we run a Brian2 simulation:

# When we write this:
G = NeuronGroup(1000, 'dv/dt = -v/tau : volt')
run(100*ms)

# Brian2 currently does this (simplified):
# 1. Generate Cython code for neuron equations
template = """
def neuron_update(double[:] v, double[:] I, double dt, double tau):
    cdef int i
    for i in range(v.shape[0]):
        v[i] += dt * (-v[i]/tau + I[i])  # Integrate differential equation
"""

# 2. Write to disk (SLOW!)
with open('/tmp/brian_cache/neuron_12345.pyx', 'w') as f:
    f.write(template)

# 3. Call Cython compiler (VERY SLOW!)
os.system('cython neuron_12345.pyx')  # Generates .cpp file

# 4. Call C++ compiler (VERY SLOW!) 
os.system('g++ -O3 -shared neuron_12345.cpp -o neuron_12345.so')

# 5. Load compiled module (SLOW!)
import neuron_12345

# 6. Finally run simulation (FAST!)
neuron_12345.neuron_update(voltage_array, current_array, dt, tau)

Proposed cppyy Workflow

# When we will write this:
G = NeuronGroup(1000, 'dv/dt = -v/tau : volt')
run(100*ms)

# With cppyy, Brian2 would do this:
# 1. Generate C++ code (same templates)
cpp_code = """
class NeuronUpdater {
public:
    void update(double* v, double* I, int n, double dt, double tau) {
        for(int i = 0; i < n; i++) {
            v[i] += dt * (-v[i]/tau + I[i]);  // Same equation, pure C++
        }
    }
};
"""

# 2. JIT compile instantly (FAST!)
cppyy.cppdef(cpp_code)  # Happens in memory, no files!

# 3. Use immediately (FAST!)
updater = cppyy.gbl.NeuronUpdater()
updater.update(voltage_array, current_array, n_neurons, dt, tau)

Solving the Cache Problem

Current problem: Brian2 creates huge cache directories because every neuron model variant needs its own compiled file.

cppyy solution: No files at all! Everything happens in memory. Templates are instantiated on-demand and kept in memory only as long as needed.
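A toy model of that in-memory story, with Python's compile standing in for cppyy.cppdef: the cache is just a dict keyed by the generated source, and it disappears with the process, so there is nothing on disk to invalidate:

```python
_kernel_cache = {}  # generated source -> module namespace, in memory only

def get_kernel(source, name):
    """Compile `source` on first request, then serve it from the cache."""
    if source not in _kernel_cache:
        namespace = {}
        exec(compile(source, "<in-memory>", "exec"), namespace)
        _kernel_cache[source] = namespace
    return _kernel_cache[source][name]

src = "def scale(v, factor):\n    return [x * factor for x in v]\n"
scale = get_kernel(src, "scale")        # compiled now, in memory
scale_again = get_kernel(src, "scale")  # cache hit, no recompilation

print(scale([1.0, 2.0], 3.0))  # [3.0, 6.0]
print(scale is scale_again)    # True
```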

And the bonus: we can reuse the C++ standalone templates we already have, so things look really promising for Brian2's JIT runtime with cppyy.


Have you tried cppyy? What's your experience with JIT compilation? Share your thoughts and experiments in the comments below!

Next up: I'm planning a follow-up post showing a complete Brian2-to-cppyy conversion example. Stay tuned! 🎯
