Building a 200‑Server Local LLM Cluster


Large‑language‑model (LLM) APIs are incredibly powerful—but they are also expensive. Local LLM inference, on the other hand, is almost free once the hardware is on your desk. This post walks through how we turned an ever‑growing pile of idle Apple silicon laptops into a 200‑node inference cluster that now carries a quarter of our production traffic, all without a data‑center contract.
Spoiler: it started with a dusty meeting room and ended with me rewiring the entire office network at 3 a.m.
Phase One: Breathing Life into Unused MacBooks
Our office shelves were lined with twelve M1 MacBook Pros (32 GB RAM) that nobody had touched in months. Instead of letting them depreciate silently, I proposed re‑purposing them as LLM inference servers. The CEO loved the cost‑saving angle, so we rolled up our sleeves.
The stack
Nothing fancy:
Ollama for model serving (picked after trying four frameworks—see below)
HAProxy on one MacBook for simple round‑robin load balancing
Prometheus + Grafana for metrics and dashboards
A cheap desktop fan to keep the “data center” (a 6 m² meeting room) cool
Installing one MacBook at a time
Framework bake‑off
| Framework | Why it didn't make the cut |
| --- | --- |
| LM Studio (MLX backend) | Great MLX support, but froze on long contexts and parallel requests |
| Raw MLX library | No OpenAI-style API; required custom parsing; high memory usage |
| llama.cpp | Impressive performance, but hard to automate at the time (no Ansible yet) |
| Ollama | Easiest to deploy, solid performance, OpenAI-compatible endpoint; this is the one we kept |
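Ollama's OpenAI-compatible endpoint is the reason the rest of the stack stays boring: clients talk to the HAProxy front-end the same way they would talk to a hosted API. Here is a minimal sketch of a request; the hostname, port, API key, and model tag are placeholders, not our actual config:

# minimal client sketch; hostname, port, key, and model tag are placeholders
from openai import OpenAI

client = OpenAI(
    base_url="http://llm-lb.internal:8080/v1",  # HAProxy front-end, round-robins across nodes
    api_key="ollama",                           # Ollama ignores the key, but the client needs one
)
resp = client.chat.completions.create(
    model="qwen2.5:32b",  # placeholder tag for whatever 30B-class model the nodes have pulled
    messages=[{"role": "user", "content": "Say hello from the cluster."}],
)
print(resp.choices[0].message.content)

Because every node speaks the same API, HAProxy doesn't need to know anything about models; it just spreads connections round-robin.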
Within a week we had twelve MacBooks serving local 30B models and handling ~5 % of live traffic.
Phase Two: The Mac Studio Detour—Bigger Isn’t Always Better
Success breeds ambition. Convinced that "more memory = more throughput," management ordered six fully loaded Mac Studios (512 GB RAM, 80-core GPU), expecting each one to replace eight MacBooks.
Reality check: inference speed scaled with GPU cores, not RAM; the extra memory mainly lets you load bigger models, and Amdahl's Law reminded us that some parts of the pipeline stay serial no matter what. A single Mac Studio was only about 3×–4× faster than a MacBook, not 8×.
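For the curious, the arithmetic behind that gap is just Amdahl's Law. A quick sketch, where the 85 % parallel fraction is an illustrative guess rather than anything we measured:

# Amdahl's Law: overall speedup is capped by whatever stays serial.
# The 0.85 parallel fraction below is an illustrative guess, not a measured value.
def amdahl_speedup(parallel_fraction: float, parallel_speedup: float) -> float:
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / parallel_speedup)

# Even if the GPU-bound portion runs 8x faster, an 85%-parallel pipeline
# only gets about 3.9x faster end to end.
print(round(amdahl_speedup(0.85, 8.0), 1))  # 3.9

With numbers in that ballpark, 3×–4× per box is exactly what you would expect, no matter how much RAM you bolt on.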
Lesson learned, but the six Mac Studios still bumped us to ~25 % traffic coverage.
Phase Three: 200 Mac Minis and the Joy of Ansible
Why Mac minis?
A cost analysis showed that two Mac minis (20‑core GPU) delivered more tokens per dollar than a single Mac Studio. We bulk‑ordered two hundred of them.
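The comparison behind that decision is simple arithmetic. Here is the shape of it; the throughput and price figures below are hypothetical placeholders, not our real benchmarks or quotes:

# Back-of-envelope: sustained tokens/sec per hardware dollar.
# All throughput and price numbers are hypothetical placeholders, not measurements or quotes.
def throughput_per_dollar(tokens_per_second: float, price_usd: float) -> float:
    return tokens_per_second / price_usd

two_minis  = throughput_per_dollar(2 * 25.0, 2 * 1_400.0)  # two Mac minis, placeholder numbers
one_studio = throughput_per_dollar(90.0, 9_000.0)          # one Mac Studio, placeholder numbers
print(two_minis > one_studio)  # True with these placeholders: the minis win per dollar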
Automated provisioning
Hand‑installing 200 machines was a non‑starter, so I dove head‑first into Ansible:
(Below is an example; the real thing is much more complex.)
# excerpt from playbook
- hosts: mac
  tasks:
    - name: Install Ollama
      homebrew:
        name: ollama
        state: present
    - name: Configure model
      shell: ollama pull mistral:7b-instruct
    - name: Register with HAProxy
      template:
        src: haproxy.cfg.j2
        dest: /usr/local/etc/haproxy/haproxy.cfg
Bringing the first 50 nodes online felt like magic. After that, it was rinse‑and‑repeat.
The 3 a.m. network meltdown
The real nightmare was networking. We needed a separate VLAN for the “server farm,” but the only documentation for our Yamaha router was a half‑translated PDF, and the prior network engineer had left months earlier. One mis‑tagged port later, the office Wi‑Fi went dark. Twelve hours, three pots of coffee, and a crash course in VLAN tagging later, both networks were humming.
Future Plans
Scale to racks – 200 minis fit, but airflow and cabling are becoming a headache. A 42U rack with proper PDUs is next.
Stay tuned for part 2; next time I’ll cover the VLAN saga in detail and share the Grafana dashboards that keep this Frankencluster alive.
Thanks for reading!
Questions, comments, or horror stories of your own? Let me know below; I’d love to compare notes.