Qwen 3 on a Raspberry Pi 5: Small Models, Big Agent Energy

TL;DR – We ran the brand-new Qwen 3 (0.6 B & 1.7 B) on our Distiller CM5 dev kit: ≈9 tokens/s on the 1.7 B Q8 build and ≈21 tokens/s on the 0.6 B Q8 build, both under 1.3 GB of RAM. More interesting: noticeably “agentic” behaviour that auto-chains Wi-Fi → SSH steps with minimal prompting.

Why care?

  • Edge builders live under tight power & memory budgets.

  • Qwen 3’s tiny dense models (0.6 B – 1.7 B) were released April 29 2025 and trained on ~36 T tokens, double Qwen 2.5’s corpus. (GitHub)

  • If they’re fast and reason well, you can embed them in robots, wearables, or battery-powered IoT without off-loading to the cloud.

Test rig

| Item | Spec |
| --- | --- |
| Board | Distiller CM5 (4 × Cortex-A76 @ 2.4 GHz, 8 GB LPDDR4, ≤ 10 W) |
| Runtime | llama.cpp (commit af2f…) + custom-built llama-cpp-python |
| Models | Qwen 3 0.6 B & 1.7 B (Q4_K_M / Q8 / BF16) plus Qwen 2.5 3 B (Q4_K_M) |
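For reference, loading one of these GGUF builds with llama-cpp-python is only a few lines. This is a hedged sketch: the model filename, context size, and prompt below are illustrative, not our exact configuration.

```python
# Sketch: loading a Qwen 3 GGUF build via llama-cpp-python.
# The path and parameters are illustrative, not our exact build flags.
from pathlib import Path

MODEL = Path("qwen3-1.7b-q8_0.gguf")  # any local GGUF build

def load(model_path: Path, n_threads: int = 4):
    from llama_cpp import Llama  # lazy import so the sketch parses anywhere
    return Llama(model_path=str(model_path), n_ctx=4096, n_threads=n_threads)

if MODEL.exists():
    llm = load(MODEL)
    out = llm("Say hi in five words.", max_tokens=16)
    print(out["choices"][0]["text"])
```

Setting `n_threads=4` matches the four Cortex-A76 cores; oversubscribing threads on this board just adds scheduler overhead.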

Raw throughput

| Model | Size (B) | Quant | RAM (GB) | tokens/s (4 threads) | tokens/s (1 thread) |
| --- | --- | --- | --- | --- | --- |
| Qwen 2.5-instruct | 3 | Q4_K_M | 1.3 | 5.21 | 3.23 |
| Qwen 3 | 0.6 | BF16 | 0.76 | 14.34 | 15.00 |
| Qwen 3 | 0.6 | Q8 | 0.53 | 21.48 | 13.54 |
| Qwen 3 | 1.7 | Q4_K_M | 1.90 | 5.42 | 5.76 |
| Qwen 3 | 1.7 | Q8 | 1.20 | 8.69 | 5.14 |
| Qwen 3 | 4 | Q4_K_M | 2.56 | 4.00 | 2.48 |

Source: internal llama-bench runs
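To translate throughput into user-facing latency, divide reply length by tokens/s (this ignores prompt processing, so treat it as a lower bound). A quick stdlib helper, using two Q8 figures from the table above and an illustrative 100-token reply:

```python
# Convert generation throughput (tokens/s) into rough reply latency.
# Throughput numbers come from the benchmark table; reply length is illustrative.
def reply_seconds(tokens_per_s: float, reply_tokens: int) -> float:
    """Wall-clock seconds to generate a reply, ignoring prompt processing."""
    return reply_tokens / tokens_per_s

for name, tps in [("Qwen 3 0.6B Q8", 21.48), ("Qwen 3 1.7B Q8", 8.69)]:
    print(f"{name}: {reply_seconds(tps, 100):.1f} s for a 100-token reply")
```

So a 100-token answer lands in roughly 5 s on the 0.6 B build and roughly 12 s on the 1.7 B build, which is the practical difference users feel.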

Take-away: the tiny Qwen 3 0.6 B builds use < 1 GB of RAM and still hit double-digit tokens/s (up to ~21 t/s with Q8). Need more reasoning depth? The 1.7 B Q8 build is the sweet spot: ~9 t/s in 1.2 GB, perfect for 8 GB edge boards, and even the 4 B variant fits in 2.6 GB on-chip.

Agent-style usability test

We pointed each model at two real device tools:

  1. wifi_mcp_server – scans networks & joins Wi-Fi

  2. weather_mcp_server – fetches weather by city

Each scenario was prompted 10× with slight paraphrases.
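Under the hood, a harness like this is essentially a dispatch loop: the model emits a JSON tool call, we route it to the matching server, and feed the result back. A minimal stdlib sketch; the tool names echo the two servers above, but the handlers here are stand-in stubs, not the real MCP implementations:

```python
import json

# Stub tool registry mirroring the two MCP servers above.
# Real handlers talk to the Wi-Fi stack / a weather API; these are stand-ins.
TOOLS = {
    "scan_networks": lambda args: ["HomeNet", "CoffeeShop"],
    "get_weather": lambda args: {"city": args["city"], "temp_c": 18},
}

def dispatch(model_output: str):
    """Parse a model's JSON tool call and run the matching stub handler."""
    call = json.loads(model_output)
    handler = TOOLS[call["tool"]]
    return handler(call.get("arguments", {}))

print(dispatch('{"tool": "scan_networks", "arguments": {}}'))
print(dispatch('{"tool": "get_weather", "arguments": {"city": "Seattle"}}'))
```

A “success” in the table below means the model picked the right tool with valid arguments and finished the scenario without hallucinating results.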

| Model | Successes / 10 | Notes |
| --- | --- | --- |
| DeepSeek-V3-instruct | 10 | baseline |
| Qwen 2.5 1.5 B Q4_K_M | 1 | frequent tool misuse / hallucinated SSID |
| Qwen 3 1.7 B Q4_K_M | 10 | remembered end goal, auto-planned Wi-Fi → SSH into the device |
| Qwen 3 0.6 B BF16 | 10 | slightly chattier but still reliable |

Observation: Qwen 3 models actively plan multi-step tasks. They skip redundant explanations, assume tool output is visible to the user, and immediately ask for the next parameter or input. Compared to older models they feel like proactive agents; they were almost certainly trained on agent/function-calling workflow data.
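That Wi-Fi → SSH chaining behaves like a loop that holds the end goal and advances one step per tool result. A toy stdlib sketch of the idea; the plan steps and names are illustrative, not the model’s actual internal state:

```python
# Toy planner: hold an ordered end goal and advance one step per turn,
# mimicking how Qwen 3 appeared to chain tool calls. Names are illustrative.
PLAN = ["scan_networks", "join_network", "open_ssh"]

def next_step(done: set) -> "str | None":
    """Return the first unfinished step, or None once the goal is met."""
    for step in PLAN:
        if step not in done:
            return step
    return None

done = set()
while (step := next_step(done)) is not None:
    # In the real harness this is a tool call; here we just record it.
    done.add(step)
print(sorted(done))
```

The interesting part in our runs was that Qwen 3 maintained this kind of goal state across turns without being told to, while Qwen 2.5 lost the thread after the first tool call.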

What this means for builders

  1. Battery-friendly LLMs are here. You can now hit double-digit tokens/s with credit-card-sized compute.

  2. Agentic behaviour at 1–2 B params unlocks voice-first workflows for many use cases: “Scan Wi-Fi → join → open SSH” happens in one round-trip.

  3. Memory headroom (≤ 2 GB) leaves space for vision or audio pipelines on the same board.

Further work

  • We are currently upgrading the Qwen2.5 stack to Qwen3 and will update our GitHub repo once things are stable. If you have a Raspberry Pi 5, feel free to try it out!

  • We are working on replacing faster-whisper with parakeet-tdt. We believe the 1.7B model paired with a better STT system can significantly improve the edge user experience. We'll publish benchmarks of all transcription tasks tested on Raspberry Pi 5 in our next post—stay tuned and follow along if you're interested.

If you want to get a plug-and-play devkit to experiment with edge LLM, check out our shop and YouTube videos.

Appendix: External resources

  • Qwen team’s launch blog – deeper dive into dataset & architecture (Qwen)

  • Baseten “Day-zero Qwen 3” latency report (235 B model) – shows server-scale potential (Baseten)

  • GitHub repo with GGUF weights (0.6 B & 1.7 B) (Hugging Face)


Written by

PamirAI Founders