Qwen 3 on a Raspberry Pi 5: Small Models, Big Agent Energy

TL;DR – We ran the brand-new Qwen 3 (0.6 B & 1.7 B) on our Distiller CM5 dev kit: ≈9 tokens/s on the 1.7 B Q8 build and ≈21 tokens/s on the 0.6 B Q8 build, both in under 1.3 GB of RAM. More interesting: noticeably “agentic” behaviour that auto-chains Wi-Fi → SSH steps with minimal prompting.
Why care?
Edge builders live under tight power & memory budgets.
Qwen 3’s tiny dense models (0.6 B – 1.7 B) were released on April 29, 2025 and trained on ~36 T tokens, double Qwen 2.5’s corpus. (GitHub)
If they’re fast and reason well, you can embed them in robots, wearables, or battery-powered IoT without off-loading to the cloud.
Test rig
| Item | Spec |
| --- | --- |
| Board | Distiller CM5 (4 × Cortex-A76 @ 2.4 GHz, 8 GB LPDDR4, ≤ 10 W) |
| Runtime | llama.cpp (commit af2f…) + custom-built llama-cpp-python |
| Models | Qwen 3 0.6 B & 1.7 B (Q4_K_M / Q8 / BF16) plus Qwen 2.5 3 B (Q4_K_M) |
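If you want to reproduce the rig, the setup is essentially a stock llama.cpp build on the CM5. A sketch, with an illustrative model filename (the exact commit we pinned is elided above):

```shell
# Build llama.cpp natively on the aarch64 board; the release build picks up
# the Cortex-A76 NEON paths automatically.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j4

# Smoke-test a quantized Qwen 3 model on all 4 cores
# (the GGUF filename here is illustrative).
./build/bin/llama-cli -m qwen3-1.7b-q8_0.gguf -t 4 -p "Hello"
```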
Raw throughput
| Model | Size (B) | Quant | RAM (GB) | tokens/s (4 threads) | tokens/s (1 thread) |
| --- | --- | --- | --- | --- | --- |
| Qwen 2.5-instruct | 3 | Q4_K_M | 1.3 | 5.21 | 3.23 |
| Qwen 3 | 0.6 | BF16 | 0.76 | 14.34 | 15.00 |
| Qwen 3 | 0.6 | Q8 | 0.53 | 21.48 | 13.54 |
| Qwen 3 | 1.7 | Q4_K_M | 1.90 | 5.42 | 5.76 |
| Qwen 3 | 1.7 | Q8 | 1.20 | 8.69 | 5.14 |
| Qwen 3 | 4 | Q4_K_M | 2.56 | 4.00 | 2.48 |
Source: internal llama-bench runs.
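The tokens/s figures above are just decoded-token counts divided by wall-clock time. A minimal stdlib sketch of that measurement, where the `generate` callable is a stand-in for the actual llama.cpp decode loop:

```python
import time

def measure_throughput(generate, n_tokens: int) -> float:
    """Time a token-generation callable and return tokens per second.

    `generate` is any function that emits `n_tokens` tokens when called;
    here it stands in for the model's decode loop.
    """
    start = time.perf_counter()
    generate(n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Dummy generator that sleeps ~1 ms per "token", so ~1000 t/s:
dummy = lambda n: time.sleep(0.001 * n)
print(f"{measure_throughput(dummy, 100):.1f} tokens/s")
```

llama-bench reports the same thing with more care (warm-up runs, separate prompt-processing vs. generation phases), which is why we quote its numbers rather than our own wrapper's.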
Take-away: the tiny Qwen 3 0.6 B builds need < 1 GB of RAM and still hit double-digit tokens/s (up to ~21 t/s with Q8). Need more reasoning depth? The 1.7 B Q8 build is the sweet spot: ~9 t/s in about 1.2 GB, and even the 4 B variant fits in 2.6 GB, leaving plenty of headroom on an 8 GB edge board.
Agent-style usability test
We pointed each model at two real device tools:

- wifi_mcp_server – scans networks & joins Wi-Fi
- weather_mcp_server – fetches weather by city

Each scenario was prompted 10× with slight paraphrases.
| Model | Successes / 10 | Notes |
| --- | --- | --- |
| DeepSeek-V3-instruct | 10 | baseline |
| Qwen 2.5 1.5 B Q4_K_M | 1 | frequent tool misuse / hallucinated SSIDs |
| Qwen 3 1.7 B Q4_K_M | 10 | remembered the end goal, auto-planned Wi-Fi → SSH into the device |
| Qwen 3 0.6 B BF16 | 10 | slightly chattier but still reliable |
Observation: Qwen 3 models actively plan multi-step tasks. They skip redundant explanations, assume tool output is visible to the user, and immediately ask for the next parameter or input. Compared with older models they feel like proactive agents; they were almost certainly trained on agent/function-calling workflow data.
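The behaviour above is a standard tool-calling loop. Here is a minimal sketch of how such a Wi-Fi → SSH chain executes; the tool names, JSON format, and canned model turns are illustrative stand-ins, not our actual wifi_mcp_server protocol:

```python
import json

# Hypothetical stand-ins for the MCP tools; real servers do the actual work.
def scan_wifi():
    return {"networks": ["lab-2.4G", "guest"]}

def join_wifi(ssid, password):
    return {"joined": ssid}

def open_ssh(host):
    return {"ssh": f"connected to {host}"}

TOOLS = {"scan_wifi": scan_wifi, "join_wifi": join_wifi, "open_ssh": open_ssh}

def run_agent(model_turns):
    """Execute a list of model-emitted tool calls (JSON strings) in order,
    collecting each tool result as the next observation."""
    observations = []
    for turn in model_turns:
        call = json.loads(turn)
        result = TOOLS[call["tool"]](**call.get("args", {}))
        observations.append(result)
    return observations

# A canned chain like the one Qwen 3 1.7 B plans on its own:
turns = [
    '{"tool": "scan_wifi"}',
    '{"tool": "join_wifi", "args": {"ssid": "lab-2.4G", "password": "hunter2"}}',
    '{"tool": "open_ssh", "args": {"host": "distiller.local"}}',
]
print(run_agent(turns)[-1])  # {'ssh': 'connected to distiller.local'}
```

The difference we measured is entirely in the model's side of this loop: Qwen 2.5 tended to call the wrong tool or invent SSIDs, while Qwen 3 emitted the three calls in order without being walked through them.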
What this means for builders
Battery-friendly LLMs are here. You can now hit double-digit tokens/s on credit-card-sized compute.
Agentic behaviour at 1–2 B params unlocks voice-first workflows for many use cases: “Scan Wi-Fi → join → open SSH” happens in one round trip.
Memory headroom (≤ 2 GB) leaves space for vision or audio pipelines on the same SoC.
Further work
We are currently upgrading the Qwen2.5 stack to Qwen3 and will update our GitHub repo once things are stable. If you have a Raspberry Pi 5, feel free to try it out!
We are working on replacing faster-whisper with parakeet-tdt. We believe the 1.7B model paired with a better STT system can significantly improve the edge user experience. We'll publish benchmarks of all transcription tasks tested on Raspberry Pi 5 in our next post—stay tuned and follow along if you're interested.
If you want a plug-and-play devkit to experiment with edge LLMs, check out our shop and YouTube videos.
Appendix: External resources
- Qwen team’s launch blog – deeper dive into the dataset & architecture (Qwen)
- Baseten “Day-zero Qwen 3” latency report (235 B model) – shows server-scale potential (Baseten)
- GitHub repo with GGUF weights (0.6 B & 1.7 B) (Hugging Face)