Building a 200‑Server Local LLM Cluster

Alvin Endratno
3 min read

Large‑language‑model (LLM) APIs are incredibly powerful—but they are also expensive. Local LLM inference, on the other hand, is almost free once the hardware is on your desk. This post walks through how we turned an ever‑growing pile of idle Apple silicon laptops into a 200‑node inference cluster that now carries a quarter of our production traffic, all without a data‑center contract.

Spoiler: it started with a dusty meeting room and ended with me rewiring the entire office network at 3 a.m.

Phase One: Breathing Life into Unused MacBooks

Our office shelves were lined with twelve M1 MacBook Pros (32 GB RAM) that nobody had touched in months. Instead of letting them depreciate silently, I proposed re‑purposing them as LLM inference servers. The CEO loved the cost‑saving angle, so we rolled up our sleeves.

The stack

Nothing fancy:

  • Ollama for model serving (picked after trying four frameworks—see below)

  • HAProxy on one MacBook for simple round‑robin load balancing (a minimal config sketch follows this list)

  • Prometheus + Grafana for metrics and dashboards

  • A cheap desktop fan to keep the “data center” (a 6 m² meeting room) cool

  • Installing one MacBook at a time
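
The HAProxy piece really is that simple. A minimal haproxy.cfg for round-robin across Ollama nodes looks roughly like this (hostnames and addresses are hypothetical; 11434 is Ollama's default port, and the real config adds more health checking and tuning):

defaults
  mode http
  timeout connect 5s
  timeout client 300s   # LLM responses can stream for minutes
  timeout server 300s

frontend llm_in
  bind *:8080
  default_backend ollama_nodes

backend ollama_nodes
  balance roundrobin                       # plain round-robin, as described above
  server mac01 192.168.1.101:11434 check   # hypothetical addresses
  server mac02 192.168.1.102:11434 check
  # ...one line per MacBook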

Framework bake‑off

Framework               | Why it didn't make the cut
------------------------|---------------------------------------------------------------------------
LM Studio (MLX backend) | Great MLX support, but froze on long contexts and parallel requests
Raw MLX library         | No OpenAI-style API; required custom parsing; high memory usage
llama.cpp               | Impressive performance, but hard to automate at the time (no Ansible yet)
Ollama                  | Easiest to deploy, solid performance, OpenAI-compatible endpoint. Our pick.
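
One underrated perk of Ollama's OpenAI-compatible endpoint: existing OpenAI client code mostly works unchanged. A smoke test against a single node can be as simple as this (the hostname is hypothetical; the /v1/chat/completions route and port 11434 are Ollama defaults):

curl http://mac01.local:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct", "messages": [{"role": "user", "content": "Say hello"}]}'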

Within a week we had twelve MacBooks serving local 30B models and handling ~5 % of live traffic.

Phase Two: The Mac Studio Detour—Bigger Isn’t Always Better

Success breeds ambition. Convinced that “more memory = more throughput,” management ordered six fully loaded Mac Studios (512 GB RAM, 80‑core GPU), expecting each to replace eight MacBooks.

Reality check: LLM inference speed scaled almost linearly with GPU cores, not RAM, and Amdahl’s Law reminded us that some parts of the pipeline stay serial no matter what. A single Mac Studio was only about 3–4× faster than a MacBook, not 8×.
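
A quick sanity check with Amdahl’s Law matches what we measured. If a fraction p of the per-request pipeline parallelizes across GPU cores and you throw s× more parallel hardware at it (p ≈ 0.85 here is purely an illustrative assumption, not a measured number):

$$ S(s) = \frac{1}{(1-p) + \frac{p}{s}}, \qquad S(8) = \frac{1}{0.15 + 0.85/8} \approx 3.9 $$

In other words, roughly 8× the GPU resources buys you about 4× the throughput, which is about what we saw.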

(Background reading: “Understanding Concurrency Through Amdahl's Law” on DEV Community.)

Lesson learned, but the six Mac Studios still bumped us to ~25 % traffic coverage.

Phase Three: 200 Mac Minis and the Joy of Ansible

Why Mac minis?

A cost analysis showed that two Mac minis (20‑core GPU) delivered more tokens per dollar than a single Mac Studio. We bulk‑ordered two hundred of them.

Automated provisioning

Hand‑installing 200 machines was a non‑starter, so I dove head‑first into Ansible:

(The playbook below is a simplified example; the real one is considerably more complex.)

# excerpt from the provisioning playbook (simplified)
- hosts: mac
  tasks:
    - name: Install Ollama
      community.general.homebrew:   # requires the community.general collection
        name: ollama
        state: present

    - name: Pull the model
      ansible.builtin.command: ollama pull mistral:7b-instruct

    - name: Register node with HAProxy
      ansible.builtin.template:     # rendered on the load-balancer node in the real playbook
        src: haproxy.cfg.j2
        dest: /usr/local/etc/haproxy/haproxy.cfg
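
Rollout happened in waves, using Ansible’s host-pattern slicing (file names here are hypothetical):

ansible-playbook -i inventory.ini provision.yml --limit 'mac[0:49]'   # first 50 nodes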

Bringing the first 50 nodes online felt like magic. After that, it was rinse‑and‑repeat.

The 3 a.m. network meltdown

The real nightmare was networking. We needed a separate VLAN for the “server farm,” but the only documentation for our Yamaha router was a half‑translated PDF, and the prior network engineer had left months earlier. One mis‑tagged port later, the office Wi‑Fi went dark. Twelve hours, three pots of coffee, and a crash course in VLAN tagging later, both networks were humming.

Future Plans

  • Scale to racks – 200 minis fit, but airflow and cabling are becoming a headache. A 42U rack with proper PDUs is next.

Stay tuned for part 2; next time I’ll cover the VLAN saga in detail and share the Grafana dashboards that keep this Frankencluster alive.

Thanks for reading!

Questions, comments, or horror stories of your own? Let me know below; I’d love to compare notes.


Written by Alvin Endratno

Hi there 👋 I'm an Indonesian who loves coffee and dogs!! Also, I code sometimes. Currently studying at Kyoto College of Graduate Studies for Informatics 🏫 and working as a full-stack engineer.