Text Classification with MetaCall: My Deep Dive into Polyglot ML Performance

Faizan Shaik

A few months ago, while exploring open source for GSoC, I discovered MetaCall. This fantastic library lets you execute code across different programming languages, so you can build truly "polyglot" applications. Imagine running Python's BeautifulSoup from your Node.js backend, seamlessly!

As an ML learner, I often struggled with huge memory consumption on my CPU-only PC when running models. Knowing Go's efficient concurrency, I wondered: Could I use Go and MetaCall to slash memory and boost speed? This question sparked an experiment, revealing deep lessons about how languages truly use memory.

My goal: Analyze 1000 wine reviews with a DistilBERT-base-uncased-finetuned-sst-2-english model. I tested three approaches:

  1. Pure single-threaded Python.

  2. Python's built-in multiprocessing.

  3. Go + Python with MetaCall.

Let's dive into what I found!

Approach 1: The Surprising Power of Single-Threaded Python

I started with a simple Python script for our baseline.

from transformers import pipeline
import pandas as pd
import time

def classify_texts_batches(texts, batch_size=8):
    classifier = pipeline('text-classification',
        model="distilbert-base-uncased-finetuned-sst-2-english",
        device="cpu")
    results = []
    # Feed the pipeline fixed-size slices so memory stays predictable
    for i in range(0, len(texts), batch_size):
        results.extend(classifier(texts[i:i + batch_size]))
    return results

# Load data and run (the CSV path is illustrative; any wine-review dataset works)
data = pd.read_csv("wine_reviews.csv")
descriptions = data["description"].tolist()[:1000]
start_time = time.time()
results = classify_texts_batches(descriptions)
processing_time = time.time() - start_time

Initial Results

graph LR
    A[Single-Threaded Results] --> B[Processing Time: 38s]
    A --> C[Memory Usage: ~500MB]
    A --> D[CPU Utilization: 4-5 cores at 100%]

This was surprisingly fast and reasonably memory-efficient for the model size! Observing CPU usage, I saw activity across multiple cores, indicating PyTorch's internal parallelization was at play. This "single-threaded" Python wasn't truly single-threaded at the computation level – PyTorch, built on decades of optimized scientific computing libraries like Intel MKL and OpenMP, automatically leverages your CPU cores and handles memory efficiently. It's truly "efficient by default."
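
You can verify this internal parallelism yourself. Here's a quick sketch (assuming a standard PyTorch install) that prints the threading configuration PyTorch picked up:

import torch

# PyTorch parallelizes the matrix math across cores, even in a "single-threaded" script
print("Intra-op threads:", torch.get_num_threads())        # e.g. 8 on an 8-core CPU
print("Inter-op threads:", torch.get_num_interop_threads())
print(torch.__config__.parallel_info())                     # shows the OpenMP / MKL backends in use

This is what explains the 4-5 busy cores despite running a nominally single-threaded script.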

Approach 2: Python Multiprocessing – A Shocking Setback

Next, I tried multiprocessing, expecting a boost.

import multiprocessing  # and others...
from transformers import pipeline

def worker_function(chunk, process_id, result_queue, status_queue):
    # Each process loads its own model -- this is where the memory blows up
    classifier = pipeline('text-classification', model="distilbert-base-uncased-finetuned-sst-2-english", device="cpu")
    result_queue.put((process_id, classifier(chunk)))

if __name__ == "__main__":
    result_queue = multiprocessing.Queue()
    status_queue = multiprocessing.Queue()
    num_workers = 8
    chunks = [descriptions[i::num_workers] for i in range(num_workers)]  # split the 1000 reviews
    processes = []
    for i, chunk in enumerate(chunks):
        p = multiprocessing.Process(target=worker_function, args=(chunk, i, result_queue, status_queue))
        p.start()
        processes.append(p)
    results = [result_queue.get() for _ in processes]
    for p in processes:
        p.join()

The Unfortunate Reality

graph TD
    A[Multiprocessing Results] --> B[Processing Time: 220+ seconds]
    A --> C[Memory Usage: ~4000MB]
    A --> D[Performance: WORSE than single-threaded]

This was a disaster! It was 5.8x slower and consumed 8x more memory (~4GB) than the single-threaded version. Why?

  1. CPU Over-Subscription: My 8-core system tried to run 8 processes, each wanting 4-5 cores (thanks, PyTorch!). That demanded 32-40 cores' worth of compute, leading to massive "thrashing" and context-switching overhead. (A thread-capping fix is sketched below.)

  2. Memory Duplication: Each of the 8 processes loaded its own 500MB model copy. That's 8 × 500 MB = 4 GB!

  3. Cache Pollution: All those processes competed for memory access, constantly invalidating CPU caches and saturating memory bandwidth, further slowing things down.

This was a tough lesson: More processes don't always mean better performance.
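
If I were to retry multiprocessing, the first fix would be capping each worker's thread count so the workers stop fighting over cores. Here's a minimal sketch of that idea, reusing the worker from above with torch.set_num_threads as the only real addition:

import torch
from transformers import pipeline

def worker_function(chunk, process_id, result_queue, status_queue):
    # Limit this worker to one intra-op thread so 8 workers use roughly 8 cores total
    # (setting OMP_NUM_THREADS=1 in the shell before launching helps as well)
    torch.set_num_threads(1)
    classifier = pipeline('text-classification',
        model="distilbert-base-uncased-finetuned-sst-2-english", device="cpu")
    result_queue.put((process_id, classifier(chunk)))

This would remove the thrashing, but each worker still loads its own copy of the model, so the memory problem remains.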

Could Shared Memory Have Helped?

Python does offer multiprocessing.shared_memory to, well, share memory between processes. Theoretically, we could load the model once into shared memory and have all workers access it.

from multiprocessing import shared_memory
import numpy as np

# Theoretical shared-memory sketch: flatten the model weights into one numpy
# array, copy it into a shared block, and let every worker attach to that block
def create_shared_model(model_weights: np.ndarray) -> shared_memory.SharedMemory:
    shm = shared_memory.SharedMemory(create=True, size=model_weights.nbytes)
    shared_buffer = np.ndarray(model_weights.shape, dtype=model_weights.dtype, buffer=shm.buf)
    shared_buffer[:] = model_weights[:]  # populate shared memory with the weights
    return shm

def attach_to_shared_model(name, shape, dtype):
    # Worker side: attach to the existing block and view it as an array (no copy)
    shm = shared_memory.SharedMemory(name=name)
    return np.ndarray(shape, dtype=dtype, buffer=shm.buf), shm

While this would tackle the memory duplication, it introduces significant complexity:

  • Requires deep code restructuring.

  • PyTorch/Transformers don't natively support this pattern.

  • Introduces synchronization overhead for shared access.

  • Crucially, it wouldn't solve the CPU over-subscription problem.

Estimated Performance with Shared Memory:

  • Memory Usage: ~1200MB (shared model + 8 process overhead)

  • Processing Time: ~45-50s (still slower due to coordination overhead)

So while shared memory beats naive multiprocessing, it still lags behind plain single-threaded Python, which again highlights how efficient Python's defaults are.

Approach 3: Go + MetaCall – The Architectural Promise

Finally, I introduced MetaCall to orchestrate calls to a single Python process from Go.

package main

import (
    "fmt"
    "runtime"
    "sync"

    metacall "github.com/metacall/core/source/ports/go_port/source"
)

// BatchProcessor fans batches of reviews out to worker goroutines,
// which all call into the single shared Python process via MetaCall.
type BatchProcessor struct {
    inputChan chan []interface{} // batches of review texts
    wg        sync.WaitGroup
}

// Simplified memory measurement (original flaw explained later!)
func getMemoryUsage() uint64 {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    return m.Alloc / 1024 / 1024 // ONLY Go heap!
}

func (bp *BatchProcessor) worker(id int) {
    defer bp.wg.Done()
    for batch := range bp.inputChan {
        result, err := metacall.Call("process_batch", batch) // Call Python!
        if err != nil {
            fmt.Printf("worker %d: %v\n", id, err)
            continue
        }
        _ = result // ... process results ...
    }
}

func main() {
    metacall.Initialize()
    metacall.LoadFromFile("py", []string{"ml_processor.py"}) // Load Python script
    // ... setup Go goroutines and run ...
    fmt.Printf("Final Memory Usage: %dMB\n", getMemoryUsage())
}

My Python ML processor was simple:

from transformers import pipeline

classifier = pipeline('text-classification',
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device="cpu")

def process_batch(texts: list[str]) -> list[dict]:
    return classifier(texts)
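
Before wiring up the Go side, it's worth sanity-checking the processor from plain Python. A tiny check like this (the sample reviews are made up) confirms that process_batch returns the list of label/score dicts the Go workers expect:

from ml_processor import process_batch

sample = [
    "Bright acidity with crisp notes of green apple and citrus.",
    "Flat, overly tannic, and a disappointing finish.",
]
print(process_batch(sample))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}, {'label': 'NEGATIVE', 'score': 0.99...}]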

The 94MB "Revolution" (or So I Thought)

graph LR
    A[Go + MetaCall Results] --> B[Processing Time: 40s]
    A --> C[Memory Usage: 94MB?!]
    A --> D[Performance: Comparable to single-threaded]

At first, I was ecstatic! The Go program reported only 94MB memory usage. I thought I'd discovered a game-changer! But then, reality hit.

The Truth About Memory: A Measurement Mistake

How could a setup running a roughly 300MB model report only 94MB in total? The answer: my getMemoryUsage function was only measuring Go's heap allocation. It completely ignored the memory consumed by the Python interpreter, the ML model, and the PyTorch runtime!

graph TB
    subgraph "What Actually Happens"
        A[Go Process] --> B[Go Runtime & Data]
        E[Python Subprocess] --> F[Python, PyTorch, DistilBERT Model]
        A --> J[MetaCall Bridge] --> E
    end

    subgraph "Memory Measurement"
        K[My Go Code Measures] --> B
        K --> L[Reports: 94MB ❌]

        M[Should Measure] --> B & F
        M --> N[Total: ~694MB ✅]
    end

The actual memory used was the single Python process plus Go's own usage, which came to roughly 694MB.
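
The honest way to measure is at the operating-system level: sum the resident set size (RSS) of the whole process tree instead of one runtime's heap. Here's a small sketch using psutil (the helper name and PID are just illustrative):

import psutil

def total_rss_mb(pid: int) -> float:
    """Sum RSS for a process and all of its children, in MB."""
    proc = psutil.Process(pid)
    rss = proc.memory_info().rss
    for child in proc.children(recursive=True):
        rss += child.memory_info().rss
    return rss / (1024 * 1024)

print(f"Total memory: {total_rss_mb(12345):.0f}MB")  # 12345 = PID of the Go binary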

The Honest Comparison: What I Really Learned

graph LR
    A[Corrected Memory Usage] --> B[Python Single: 500MB]
    A --> C[Python Multi Naive: 4000MB]
    A --> D[Python Multi Shared: ~1200MB]
    A --> E[Go + MetaCall: 694MB]

| Approach | Memory | Time | Real Assessment |
| --- | --- | --- | --- |
| Python Single-Threaded | 500MB | 38s | ✅ Excellent baseline |
| Python Multi (Naive) | 4000MB | 220s+ | ❌ Resource disaster |
| Python Multi (Shared Memory) | ~1200MB | ~50s | 🤷‍♂️ Better, but still complex and slower |
| Go + MetaCall | 694MB | 40s | ✅ Good vs multiprocessing |

This experience made me appreciate Python immensely. Its underlying scientific libraries are incredibly optimized, making it efficient by default.

MetaCall's True Advantages

Even with corrected numbers, MetaCall provides legitimate benefits, especially compared to naive multiprocessing:

  • Significant Memory Savings: 83% vs. naive multiprocessing (694MB vs 4000MB) and 42% vs. Python multiprocessing with shared memory (694MB vs 1200MB).

  • Architectural Elegance: Clean separation of Go for orchestration and Python for ML.

  • Resource Sharing: A single model instance serves multiple concurrent requests.

  • Efficient Concurrency: Go's lightweight goroutines handle concurrency far better than heavyweight Python processes for this type of workload.

Key Takeaways

  1. Python's Excellence: The "single-threaded" Python approach for ML is incredibly optimized out-of-the-box. Don't underestimate it.

  2. Multiprocessing Pitfalls: Naive multiprocessing for ML can kill performance due to resource contention and memory duplication. Even shared memory comes with trade-offs.

  3. Measurement Matters: Always measure total system impact, not just one component's memory. My "94MB" discovery was a harsh, but valuable, lesson.

  4. Architectural Value: MetaCall offers genuine benefits for specific use cases by enabling elegant architecture and resource efficiency, especially when concurrent requests share a large model.

This project taught me a ton about how operating systems and languages truly manage memory, and the importance of rigorous benchmarking. It reminded me that sometimes, the best optimization is simply understanding and leveraging the powerful, pre-built optimizations already present in mature ecosystems like Python's.

While the GSoC opportunity didn't pan out this year (MetaCall wasn't selected as an organization :( ), this journey was an incredible learning experience. You can check out the full repository with all the details here: https://github.com/fyzanshaik/MetaCall-ML-Go/blob/main/README.md

[Metacall repo](https://github.com/metacall/go-python-ml-example)
