Building a 200‑Server Local LLM Cluster

Alvin Endratno
3 min read

Large‑language‑model (LLM) APIs are incredibly powerful—but they are also expensive. Local LLM inference, on the other hand, is almost free once the hardware is on your desk. This post walks through how we turned an ever‑growing pile of idle Apple silicon laptops into a 200‑node inference cluster that now carries a quarter of our production traffic, all without a data‑center contract.

Spoiler: it started with a dusty meeting room and ended with me rewiring the entire office network at 3 a.m.

Phase One: Breathing Life into Unused MacBooks

Our office shelves were lined with twelve M1 MacBook Pros (32 GB RAM) that nobody had touched in months. Instead of letting them depreciate silently, I proposed re‑purposing them as LLM inference servers. The CEO loved the cost‑saving angle, so we rolled up our sleeves.

The stack

Nothing fancy:

  • Ollama for model serving (picked after trying four frameworks—see below)

  • HAProxy on one MacBook for simple round‑robin load balancing (a minimal config sketch follows this list)

  • Prometheus + Grafana for metrics and dashboards

  • A cheap desktop fan to keep the “data center” (a 6 m² meeting room) cool

  • Installing one MacBook at a time
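
The HAProxy piece really is that simple. A minimal haproxy.cfg for round-robin across Ollama nodes looks roughly like this (hostnames and addresses are hypothetical; 11434 is Ollama's default port, and the real config adds more health checking and tuning):

defaults
  mode http
  timeout connect 5s
  timeout client 300s   # LLM responses can stream for minutes
  timeout server 300s

frontend llm_in
  bind *:8080
  default_backend ollama_nodes

backend ollama_nodes
  balance roundrobin                       # plain round-robin, as described above
  server mac01 192.168.1.101:11434 check   # hypothetical addresses
  server mac02 192.168.1.102:11434 check
  # ...one line per MacBook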

Framework bake‑off

Framework               | Why it didn't make the cut
------------------------|---------------------------------------------------------------------------
LM Studio (MLX backend) | Great MLX support, but froze on long contexts and parallel requests
Raw MLX library         | No OpenAI-style API; required custom parsing; high memory usage
llama.cpp               | Impressive performance, but hard to automate at the time (no Ansible yet)
Ollama                  | Easiest to deploy, solid performance, OpenAI-compatible endpoint. Our pick.
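
One underrated perk of Ollama's OpenAI-compatible endpoint: existing OpenAI client code mostly works unchanged. A smoke test against a single node can be as simple as this (the hostname is hypothetical; the /v1/chat/completions route and port 11434 are Ollama defaults):

curl http://mac01.local:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct", "messages": [{"role": "user", "content": "Say hello"}]}'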

Within a week we had twelve MacBooks serving local 30B models and handling ~5 % of live traffic.

Phase Two: The Mac Studio Detour—Bigger Isn’t Always Better

Success breeds ambition. Convinced that “more memory = more throughput,” management ordered six fully loaded Mac Studios (512 GB RAM, 80‑core GPU), expecting each to replace eight MacBooks.

Reality check: LLM inference speed scaled almost linearly with GPU cores, not RAM, and Amdahl’s Law reminded us that some parts of the pipeline stay serial no matter what. A single Mac Studio was only about 3–4× faster than a MacBook, not 8×.
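
A quick sanity check with Amdahl’s Law matches what we measured. If a fraction p of the per-request pipeline parallelizes across GPU cores and you throw s× more parallel hardware at it (p ≈ 0.85 here is purely an illustrative assumption, not a measured number):

$$ S(s) = \frac{1}{(1-p) + \frac{p}{s}}, \qquad S(8) = \frac{1}{0.15 + 0.85/8} \approx 3.9 $$

In other words, roughly 8× the GPU resources buys you about 4× the throughput, which is about what we saw.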

(Background reading: “Understanding Concurrency Through Amdahl's Law” on DEV Community.)

Lesson learned, but the six Mac Studios still bumped us to ~25 % traffic coverage.

Phase Three: 200 Mac Minis and the Joy of Ansible

Why Mac minis?

A cost analysis showed that two Mac minis (20‑core GPU) delivered more tokens per dollar than a single Mac Studio. We bulk‑ordered two hundred of them.

Automated provisioning

Hand‑installing 200 machines was a non‑starter, so I dove head‑first into Ansible:

(The playbook below is a simplified example; the real one is considerably more complex.)

# excerpt from the provisioning playbook (simplified)
- hosts: mac
  tasks:
    - name: Install Ollama
      community.general.homebrew:   # requires the community.general collection
        name: ollama
        state: present

    - name: Pull the model
      ansible.builtin.command: ollama pull mistral:7b-instruct

    - name: Register node with HAProxy
      ansible.builtin.template:     # rendered on the load-balancer node in the real playbook
        src: haproxy.cfg.j2
        dest: /usr/local/etc/haproxy/haproxy.cfg
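
Rollout happened in waves, using Ansible’s host-pattern slicing (file names here are hypothetical):

ansible-playbook -i inventory.ini provision.yml --limit 'mac[0:49]'   # first 50 nodes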

Bringing the first 50 nodes online felt like magic. After that, it was rinse‑and‑repeat.

The 3 a.m. network meltdown

The real nightmare was networking. We needed a separate VLAN for the “server farm,” but the only documentation for our Yamaha router was a half‑translated PDF, and the prior network engineer had left months earlier. One mis‑tagged port later, the office Wi‑Fi went dark. Twelve hours, three pots of coffee, and a crash course in VLAN tagging later, both networks were humming.

Future Plans

  • Scale to racks – 200 minis fit, but airflow and cabling are becoming a headache. A 42U rack with proper PDUs is next.

Stay tuned for part 2; next time I’ll cover the VLAN saga in detail and share the Grafana dashboards that keep this Frankencluster alive.

Thanks for reading!

Questions, comments, or horror stories of your own? Let me know below; I’d love to compare notes.


Written by Alvin Endratno

Hi there 👋 I'm an Indonesian who loves coffee and dogs!! Also, I code sometimes. Currently studying at Kyoto College of Graduate Studies for Informatics 🏫 and working as a full-stack engineer.