Understanding cppyy: A Truly Automatic Python-C++ Binding


So as discussed in our previous blog, we concluded that our entire approach to performance was trapped by traditional Ahead-of-Time (AOT) thinking. It became clear that no matter how much we optimized our existing pipeline, we were fighting the fundamental nature of the problem.
So, let's take a step back and focus on the core issues we're trying to solve with our current runtime, which is powered by Cython. While functional, this approach has created significant friction:
The Compilation Speed Problem: Brian2's power comes from its ability to compile user-defined equations dynamically. Cython's file-based compilation is a major bottleneck here, slowing down the start of every simulation.
The Two-Headed Dragon of Code Generation: We currently have to maintain two separate code generation targets: one for the Cython runtime and another for the C++ standalone mode. This adds immense complexity and doubles our maintenance workload.
The "One-Size-Fits-All" Problem: Cython is a general-purpose tool. For our very specific needs, it generates thousands of lines of C++ boilerplate that we don't need, creating bloat. To combat the slow compilation, we introduced a caching mechanism, but this created its own set of headaches with managing cache size and invalidation.
We need a solution that isn't just a better AOT compiler, but a complete shift in philosophy. This is what leads us to cppyy.
In this post, I'll take you on my journey as I explore this new technology. We'll document what cppyy is, how its Just-in-Time (JIT) compilation works, and how we plan to experiment with it as a potential future for Brian2's runtime. Let's dive in.
But First, A Shocking Revelation: What is Python?
I'll be honest: until I started this cppyy deep dive, I thought Python was just... Python. You download it, you run python script.py, and magic happens. But wow, was I in for a shock!
Python is just a language specificationβa set of rules. The "magic" we all use every day is actually an implementation of those rules. The most popular one, the one you almost certainly have, is CPython, which is written in the C language.
Understanding how CPython works is the key to understanding why tools like cppyy are so revolutionary. When you run a script with CPython, it's a two-step process:
1. Compilation to Bytecode: First, CPython reads your .py file and parses it into a structure called an Abstract Syntax Tree (AST). This tree represents the logical flow of your code. For example, x = 10 + 20 becomes a tree where = is the main operation. This AST is then compiled into Python bytecode, a simpler, intermediate language that is platform-independent. This is the stage where .pyc files are created in your __pycache__ directory (yup, now you know, like me, what that pycache thing is) to speed things up on subsequent runs.

2. Interpretation by the PVM: Now that it has bytecode, the Python Virtual Machine (PVM) takes over. The PVM is the heart of CPython, a runtime engine that works in a simple but incredibly fast loop:

- Read one bytecode instruction (e.g., BINARY_ADD, STORE_FAST).
- Execute it.
- Move to the next instruction.

This "interpreter loop" churns through the bytecode until the script is done. (We'll poke at both stages with a small example right after the diagram.)
Your `.py` file ───> ┌────────────────────┐
                     │  CPython Compiler  │
                     └────────────────────┘
                               │
                               ▼
                        ┌──────────────┐
                        │   Bytecode   │  (saved in `.pyc`)
                        └──────────────┘
                               │
                               ▼
              ┌─────────────────────────────────────────────┐
              │        Python Virtual Machine (PVM)         │
              │  (Reads and executes bytecode instruction   │
              │   by instruction inside a giant loop)       │
              └─────────────────────────────────────────────┘
                               │
                               ▼
                        ┌──────────────┐
                        │    Result    │
                        └──────────────┘
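You can poke at both stages yourself from the standard library: ast shows the tree and dis shows the bytecode. A small sketch (the output comments are abbreviated; on Python 3.11+ you'll see BINARY_OP instead of BINARY_ADD):

import ast
import dis

# Stage 1: source -> AST
print(ast.dump(ast.parse("x = 10 + 20")))
# Assign(targets=[Name(id='x', ...)], value=BinOp(left=Constant(10), op=Add(), right=Constant(20)))

# Stage 2: AST -> bytecode (using variables so the addition isn't constant-folded away)
def add(a, b):
    x = a + b
    return x

dis.dis(add)
# ... LOAD_FAST a, LOAD_FAST b, BINARY_ADD, STORE_FAST x, LOAD_FAST x, RETURN_VALUE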
Why This Matters: CPython, PyPy, and the cppyy Bridge
CPython isn't the only game in town. There's also:
PyPy: A Python interpreter written in RPython (a restricted subset of Python) that uses a sophisticated Just-in-Time (JIT) compiler to dramatically speed up code.
Jython: Python running on the Java Virtual Machine (JVM).
So why do we need to understand all this? Because, as the cppyy docs explain, the performance of Python-C++ bindings depends heavily on the underlying Python implementation. Now let's see what cppyy is.
Core Architecture of cppyy
cppyy is an automatic Python-C++ bindings generator built on Cling, an interactive C++ interpreter based on LLVM/Clang... Hmm, wait: what are Cling and Clang (sounds rhymy), what is LLVM, and what are bindings? I was in the same boat, so here's a guide to what each part is. Let's go back to basics...
The Language Barrier
Imagine you're a Python programmer, but you have a friend who has written an amazing, super-fast library in C++. You want to use that library, but there's a problem: Python and C++ are like people speaking different languages.
# You want to do this in Python:
result = my_cpp_library.fast_calculation(data)
// But the library exists in C++:
double fast_calculation(std::vector<double>& data) {
    double result = 0.0;
    // ... super optimized C++ code here ...
    return result;
}
The Problem: How do you call C++ functions from Python? How do you pass Python data to C++ functions? How do you get results back?
What Are "Bindings"?
Bindings are like translators that sit between Python and C++. They handle the conversation:
1. Converting Python data to C++ data:

python_list = [1.0, 2.0, 3.0]
# The binding converts this to:
# std::vector<double> cpp_vector = {1.0, 2.0, 3.0};

2. Calling the C++ function:

double result = fast_calculation(cpp_vector);

3. Converting the C++ result back to Python:

# The binding converts the C++ double back to a Python float
python_result = 42.5
Think of bindings as a universal translator that knows both languages perfectly. And yes, as we have been talking about a lot in this series, Cython is also a language that helps bind the worlds of C++ and Python.
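Here's a minimal sketch of those three steps happening automatically, using cppyy (which we'll dissect properly below); the fast_calculation body is made up for illustration:

import cppyy

# Hypothetical stand-in for the friend's C++ library (illustrative only)
cppyy.cppdef("""
double fast_calculation(const std::vector<double>& data) {
    double total = 0.0;
    for (double x : data) total += x;   // the "super optimized" code stands in here
    return total;
}
""")

python_list = [1.0, 2.0, 3.0]                      # plain Python data
result = cppyy.gbl.fast_calculation(python_list)   # converted to std::vector<double>, C++ runs
print(result)                                      # 6.0 -- the C++ double comes back as a Python float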
Understanding Modern Compiler Technology (LLVM, Clang, and the Revolution)
Now I need to explain the modern tools that make cppyy possible. Think of this as the difference between old-fashioned factories and modern automated production lines.
What is LLVM? (The Universal Assembly Line)
LLVM is like a universal "assembly line" for turning any programming language into machine code. Here's the key insight:
Traditional Approach (before LLVM):
C++ Code  → C++ Compiler       → x86 Machine Code
Python    → Python Interpreter → (stays in Python)
Java      → Java Compiler      → Java Bytecode → JVM → Machine Code
Each language had its own completely separate path to machine code.
LLVM Approach:
C++ Code    → Clang Frontend → LLVM IR ──┐
Python Code → LLVM Frontend → LLVM IR ───┼→ LLVM Backend → Machine Code
Java Code   → LLVM Frontend → LLVM IR ──┘
LLVM created a universal intermediate representation (LLVM IR) that any language can target, and then LLVM handles the final step to machine code.
What is LLVM IR?
LLVM IR (Intermediate Representation) is like a universal assembly language that's much more advanced than traditional assembly. Here's an example:
C++ code:
int add(int a, int b) {
return a + b;
}
LLVM IR:
define i32 @add(i32 %a, i32 %b) {
entry:
%add = add nsw i32 %a, %b
ret i32 %add
}
Machine Code (x86):
add:
addl %esi, %edi
movl %edi, %eax
retq
The beauty is that LLVM IR is platform-independent but still very close to machine code.
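If you have clang installed, you can reproduce both stages of that pipeline yourself with its standard flags:

# C/C++ source -> LLVM IR (textual form)
clang -S -emit-llvm add.c -o add.ll

# C/C++ source -> native assembly
clang -S add.c -o add.s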
What is Clang? (The C++ Translator)
Clang is the part of the LLVM project that specifically understands C++. Think of it as an expert translator who can read C++ and convert it to LLVM IR.
// Clang reads this C++:
class MyClass {
int value;
public:
MyClass(int v) : value(v) {}
int getValue() const { return value; }
};
// And produces LLVM IR that represents all the C++ concepts:
// - Class layout
// - Constructor logic
// - Member function calls
// - etc.
Clang is incredibly sophisticated - it understands:
Templates and template instantiation
Inheritance and virtual functions
Operator overloading
Modern C++ features (C++11, C++14, C++17, C++20)
Complex type systems
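A taste of why that sophistication matters: because Clang (via Cling) understands full C++, cppyy can map features like operator overloading straight into Python. A minimal sketch, with a made-up Vec2 type:

import cppyy

cppyy.cppdef("""
struct Vec2 {
    double x, y;
    Vec2(double x_, double y_) : x(x_), y(y_) {}
    Vec2 operator+(const Vec2& o) const { return Vec2(x + o.x, y + o.y); }
};
""")

a = cppyy.gbl.Vec2(1.0, 2.0)
b = cppyy.gbl.Vec2(3.0, 4.0)
c = a + b            # C++ operator+ shows up as Python's +
print(c.x, c.y)      # 4.0 6.0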
Just-In-Time (JIT) Compilation: The Game Changer
Here's where it gets really interesting. Traditionally, compilation worked like this:
Ahead-of-Time (AOT) Compilation (traditional):
Write Code → Compile → Run (later)
    ↓           ↓          ↓
   Slow     Very Slow     Fast
Just-In-Time (JIT) Compilation (modern):
Write Code → Run (compilation happens during execution)
    ↓          ↓
   Fast    Fast (after a brief initial delay)
JIT compilation means the compiler runs while your program is running and can make optimizations based on the actual data and usage patterns it sees.
Why JIT can be faster than AOT:
Runtime optimization: The JIT can see how your code actually behaves and optimize for those patterns
No file I/O: Everything happens in memory
Incremental compilation: Only compile what you actually use
Adaptive optimization: If usage patterns change, recompile with better optimizations
What is Cling?
Cling is like giving C++ the superpower of being interactive like Python. Traditionally, C++ worked like this:
// Traditional C++: Write entire program, compile, run
#include <iostream>
int main() {
int x = 5;
int y = 10;
std::cout << x + y << std::endl;
return 0;
}
// Then: g++ program.cpp -o program && ./program
Cling lets you do this:
// Interactive C++ with Cling:
[cling] int x = 5;
[cling] int y = 10;
[cling] x + y
(int) 15
[cling] #include <vector>
[cling] std::vector<int> v = {1, 2, 3, 4, 5};
[cling] v.size()
(unsigned long) 5
You can type C++ code line by line and see results immediately, just like Python!
How Cling Works (The Technical Magic)
Cling is built on top of Clang and LLVM, and here's the brilliant part:
Parse C++ incrementally: When you type a line of C++, Cling uses Clang to parse it immediately
Generate LLVM IR: Clang converts your C++ to LLVM IR
JIT compile: LLVM immediately compiles the IR to machine code in memory
Execute: The machine code runs right away
Remember state: Cling keeps track of all variables, functions, classes you've defined
// When you type this in Cling:
int calculate(int x) { return x * x + 2 * x + 1; }
// Cling immediately:
// 1. Parses the function with Clang
// 2. Generates LLVM IR for the function
// 3. JIT compiles to machine code
// 4. Stores the function pointer in memory
// 5. Ready to call instantly!
// Later when you type:
calculate(5)
// Cling directly calls the compiled machine code - super fast!
Why This is Revolutionary
Before Cling, if you wanted to run C++ code, you had to:
Write complete source files
Invoke the compiler (slow)
Link everything together (slow)
Run the resulting executable
With Cling, you can:
Type C++ code
It runs immediately at full native speed
This is like the difference between having to publish a book every time you want to say something (old C++) versus having a conversation (Cling).
Enter cppyy: Putting It All Together
Now we get to the star of the show! cppyy combines Python with Cling to create something magical.
The cppyy Architecture
cppyy is essentially Cling embedded inside Python. Here's how it works:
So cppyy operates on a fundamentally different principle than traditional binding systems. Instead of generating static bindings at compile time, it creates a live bridge between Python and C++ using just-in-time compilation. But how does it do that?
What happens internally in cppyy:
The Three-Layer System
Layer 1: Python Interface Layer. This is what users interact with. It provides Python entry points like cppdef(), include(), and the gbl namespace for accessing C++ code.
Layer 2: CPyCppyy Bridge Layer. This C extension module handles the translation between Python and C++ at runtime. It manages memory, converts types, and creates proxy objects.
Layer 3: Cling Interpreter Engine. The core engine that parses C++ code, compiles it to machine code, and executes it. Based on the Clang/LLVM technology we discussed above.
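Here's a minimal sketch that touches all three layers in a few lines (the square function is just an example):

import cppyy

# Layer 1: the Python-facing API takes a string of C++
cppyy.cppdef("double square(double x) { return x * x; }")

# Layer 2: CPyCppyy builds a Python proxy when we first look the name up
sq = cppyy.gbl.square

# Layer 3: Cling JIT-compiles square() on first call; after that it's native code
print(sq(3.0))   # 9.0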
How Code Loading Works
When you write this:
cppyy.cppdef("""
class Calculator {
public:
int add(int a, int b) { return a + b; }
};
""")
Here's the internal process:
Step 1: Code Parsing
The C++ code string goes directly to Cling, which uses the same parser as the Clang compiler. Cling builds an Abstract Syntax Tree (AST) that represents the structure of your C++ code.
Step 2: Symbol Registration
Cling doesn't immediately compile everything. Instead, it registers that a class called "Calculator" exists with a method called "add". This information is stored in symbol tables.
Step 3: Lazy Compilation
No machine code is generated yet. Compilation happens only when you actually try to use the code.
Dynamic Class Creation
When you access cppyy.gbl.Calculator, something interesting happens:
Calculator = cppyy.gbl.Calculator # This triggers class creation
The class doesn't exist until this moment. Here's what cppyy does:
Runtime Class Factory
cppyy queries Cling for information about the Calculator class. It discovers the class has a constructor, an add method, and other metadata. Using this information, it dynamically creates a Python class that acts as a proxy.
Lazy Method Binding
The methods aren't bound immediately either. When you first call calc.add(), cppyy:
1. Asks Cling to compile the add method to machine code
2. Creates a Python callable that can invoke this machine code
3. Caches this callable for future use
This means the first call to any method is slightly slower, but subsequent calls are at native C++ speed.
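You can see this laziness from Python. A minimal sketch (exact timings will vary by machine; the Heavy class is made up):

import cppyy
import time

cppyy.cppdef("""
struct Heavy {
    double work(double x) { return x * 2.0; }
};
""")                          # parsed and registered, but work() is not compiled yet

h = cppyy.gbl.Heavy()         # proxy class created on first access

t0 = time.perf_counter()
h.work(1.0)                   # first call: Cling JIT-compiles Heavy::work
first_call = time.perf_counter() - t0

t0 = time.perf_counter()
h.work(1.0)                   # later calls go straight to the cached machine code
later_call = time.perf_counter() - t0

print(f"first: {first_call:.6f}s, later: {later_call:.6f}s")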
The Two-Phase Allocation System in cppyy
Why Two Phases Matter
Traditional Python extensions allocate everything at once. cppyy's two-phase system solves several critical problems:
Lazy Initialization: C++ objects aren't created until absolutely necessary
Reference Semantics: Multiple Python objects can reference the same C++ object
Smart Pointer Integration: Proxies can wrap smart pointers transparently
Error Handling: Construction can fail without leaving invalid Python objects
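The reference-semantics point above is easy to demonstrate with cppyy's bind_object and addressof helpers; a minimal sketch with a made-up Box type:

import cppyy

cppyy.cppdef("struct Box { int v; };")

b1 = cppyy.gbl.Box()                                        # Python proxy owning a C++ Box
b2 = cppyy.bind_object(cppyy.addressof(b1), cppyy.gbl.Box)  # second proxy, same C++ object

b2.v = 7
print(b1.v)   # 7 -- both proxies reference the one underlying C++ object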
Phase 1: Python Proxy Creation
calc = Calculator(42)
What Actually Happens in __new__
// CPyCppyy internal C code (simplified)
typedef struct {
    PyObject_HEAD        // Standard Python object header (16 bytes)
    void*     fObject;   // Pointer to C++ object (8 bytes) - NULL initially
    uint32_t  fFlags;    // Ownership and state flags (4 bytes)
    void*     fSmartPtr; // Smart pointer storage (8 bytes) - optional
    PyObject* fType;     // Type information cache (8 bytes)
} CPPInstance;
static PyObject* CPPInstance_new(PyTypeObject* type, PyObject* args, PyObject* kwds) {
    CPPInstance* self = (CPPInstance*)type->tp_alloc(type, 0);
    if (self) {
        self->fObject   = NULL;  // Critical: C++ object doesn't exist yet
        self->fFlags    = 0;     // No ownership, not initialized
        self->fSmartPtr = NULL;  // No smart pointer yet
        self->fType     = NULL;  // Type info loaded lazily
    }
    return (PyObject*)self;
}
Memory State After Phase 1
Python Heap Memory:
┌───────────────────────────────────┐
│ CPPInstance (Python proxy)        │
│ ├─ PyObject_HEAD: 16 bytes        │
│ ├─ fObject:   NULL   ← No C++ object yet
│ ├─ fFlags:    0      ← Not initialized
│ ├─ fSmartPtr: NULL   ← No smart pointer
│ └─ fType:     NULL   ← Type info not loaded
└───────────────────────────────────┘

C++ Heap Memory:
(empty - nothing allocated yet)
The Proxy Object Layout
// Detailed flag system
#define CPPYY_IS_OWNER 0x0001 // Python owns the C++ object
#define CPPYY_IS_SMARTPTR 0x0002 // Wraps a smart pointer
#define CPPYY_IS_REFERENCE 0x0004 // References existing object
#define CPPYY_IS_TEMP 0x0008 // Temporary object
#define CPPYY_IS_CONST 0x0010 // Const object
#define CPPYY_IS_INITIALIZED 0x0020 // __init__ was called
Phase 2: C++ Object Creation
The __init__ Method Execution
# This triggers Phase 2 (conceptually, it's the second half of Calculator(42)):
calc.__init__(42)
Step-by-Step C++ Object Creation
// CPyCppyy internal implementation (simplified)
static int CPPInstance_init(CPPInstance* self, PyObject* args, PyObject* kwds) {
    // Step 1: JIT compile the constructor if needed
    const char* class_name = "Calculator";
    MethodProxy* constructor = get_constructor(class_name, args);
    if (!constructor->is_compiled) {
        // Ask Cling to compile: Calculator::Calculator(int)
        cling_compile_constructor(class_name, get_arg_types(args));
        constructor->func_ptr = cling_get_symbol("Calculator_ctor_int");
        constructor->is_compiled = true;
    }

    // Step 2: Allocate C++ memory
    size_t object_size = cling_sizeof("Calculator");  // Query Cling for the size
    void* cpp_memory = malloc(object_size);           // Raw memory allocation

    // Step 3: Call placement new with the compiled constructor
    typedef void (*ConstructorFunc)(void*, int);
    ConstructorFunc ctor = (ConstructorFunc)constructor->func_ptr;
    ctor(cpp_memory, PyLong_AsLong(PyTuple_GET_ITEM(args, 0)));  // Call the C++ ctor

    // Step 4: Update proxy state
    self->fObject = cpp_memory;             // Store the C++ object pointer
    self->fFlags |= CPPYY_IS_OWNER;         // Python owns this object
    self->fFlags |= CPPYY_IS_INITIALIZED;   // Mark as initialized

    return 0;  // Success
}
Memory State After Phase 2
Python Heap Memory:
┌───────────────────────────────────┐
│ CPPInstance (Python proxy)        │
│ ├─ PyObject_HEAD: 16 bytes        │
│ ├─ fObject: 0x7fff12345678 ───────┼──┐  Points to C++ object
│ ├─ fFlags:  OWNER|INITIALIZED     │  │
│ ├─ fSmartPtr: NULL                │  │
│ └─ fType:   Calculator*           │  │
└───────────────────────────────────┘  │
                                       │
C++ Heap Memory:                       │
┌───────────────────────────────────┐  │
│ Calculator object  ←──────────────┼──┘
│ ├─ vtable pointer: 8 bytes        │  Virtual function table
│ ├─ member variables...            │  Actual C++ object data
│ └─ (total size from sizeof())     │
└───────────────────────────────────┘
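You can watch the two phases from Python, because the proxies follow the standard __new__/__init__ protocol. A minimal sketch with a made-up Counter class:

import cppyy

cppyy.cppdef("""
struct Counter {
    int n;
    Counter(int start) : n(start) {}
};
""")

Counter = cppyy.gbl.Counter

c = Counter.__new__(Counter)  # Phase 1: Python proxy exists, no C++ object behind it yet
c.__init__(5)                 # Phase 2: Counter(5) is JIT-compiled and constructed in C++
print(c.n)                    # 5 -- normally both phases happen inside Counter(5)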
Why cppyy is Perfect for Brian2
Now let's see how this solves Brian2's specific problems.
Current Brian2 Workflow Problems
Here's what happens now when we run a Brian2 simulation:
# When we write this:
G = NeuronGroup(1000, 'dv/dt = -v/tau : volt')
run(100*ms)
# Brian2 currently does this (simplified):
# 1. Generate Cython code for neuron equations
template = """
def neuron_update(double[:] v, double[:] I, double dt, double tau):
cdef int i
for i in range(v.shape[0]):
v[i] += dt * (-v[i]/tau + I[i]) # Integrate differential equation
"""
# 2. Write to disk (SLOW!)
with open('/tmp/brian_cache/neuron_12345.pyx', 'w') as f:
f.write(template)
# 3. Call Cython compiler (VERY SLOW!)
os.system('cython neuron_12345.pyx') # Generates .cpp file
# 4. Call C++ compiler (VERY SLOW!)
os.system('g++ -O3 -shared neuron_12345.cpp -o neuron_12345.so')
# 5. Load compiled module (SLOW!)
import neuron_12345
# 6. Finally run simulation (FAST!)
neuron_12345.neuron_update(voltage_array, current_array, dt, tau)
Proposed cppyy Workflow
# When we will write this:
G = NeuronGroup(1000, 'dv/dt = -v/tau : volt')
run(100*ms)
# With cppyy, Brian2 would do this:
# 1. Generate C++ code (same templates)
cpp_code = """
class NeuronUpdater {
public:
void update(double* v, double* I, int n, double dt, double tau) {
for(int i = 0; i < n; i++) {
v[i] += dt * (-v[i]/tau + I[i]); // Same equation, pure C++
}
}
};
"""
# 2. JIT compile instantly (FAST!)
cppyy.cppdef(cpp_code) # Happens in memory, no files!
# 3. Use immediately (FAST!)
updater = cppyy.gbl.NeuronUpdater()
updater.update(voltage_array, current_array, n_neurons, dt, tau)
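To make that concrete, here's a runnable toy version (the class mirrors the sketch above; it is not actual Brian2 code). NumPy arrays satisfy the buffer protocol, so cppyy passes them to double* parameters without copying:

import cppyy
import numpy as np

cppyy.cppdef("""
class NeuronUpdater {
public:
    void update(double* v, double* I, int n, double dt, double tau) {
        for (int i = 0; i < n; i++)
            v[i] += dt * (-v[i] / tau + I[i]);
    }
};
""")

v = np.zeros(1000)   # membrane voltages
I = np.ones(1000)    # input currents
updater = cppyy.gbl.NeuronUpdater()
updater.update(v, I, len(v), 0.1, 10.0)   # one integration step, at C++ speed
print(v[:3])         # the voltages were updated in place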
Solving the Cache Problem
Current problem: Brian2 creates huge cache directories because every neuron model variant needs its own compiled file.
cppyy solution: No files at all! Everything happens in memory. Templates are instantiated on-demand and kept in memory only as long as needed.
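Templates illustrate this nicely. In a sketch like the following, each instantiation is JIT-compiled in memory on first use, with nothing ever written to disk:

import cppyy

cppyy.cppdef("""
template <typename T>
T triple(T value) { return value * 3; }
""")

print(cppyy.gbl.triple[int](14))        # triple<int> instantiated on demand -> 42
print(cppyy.gbl.triple['double'](0.5))  # triple<double> instantiated on demand -> 1.5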
And the bonus: we can reuse the C++ standalone templates we already have. Things look really promising for Brian2's JIT runtime with cppyy.
Have you tried cppyy? What's your experience with JIT compilation? Share your thoughts and experiments in the comments below!
Next up: I'm planning a follow-up post showing a complete Brian2-to-cppyy conversion example. Stay tuned!