Qwen 3 on a Raspberry Pi 5: Small Models, Big Agent Energy

TL;DR – We ran the brand-new Qwen 3 (0.6 B & 1.7 B) on our Distiller CM5 dev kit: ≈9 tokens/s on the 1.7 B Q8 build and ≈21 tokens/s on the 0.6 B Q8 build, both under 1.3 GB of RAM. More interesting: noticeably “agentic” behaviour that auto-chains Wi-Fi → SSH steps with minimal prompting.

Why care?

  • Edge builders live under tight power & memory budgets.

  • Qwen 3’s tiny dense models (0.6 B – 1.7 B) were released April 29 2025 and trained on ~36 T tokens, double Qwen 2.5’s corpus. (GitHub)

  • If they’re fast and reason well, you can embed them in robots, wearables, or battery-powered IoT without off-loading to the cloud.

Test rig

| Item | Spec |
| --- | --- |
| Board | Distiller CM5 (4 × Cortex-A76 @ 2.4 GHz, 8 GB LPDDR4, ≤ 10 W) |
| Runtime | llama.cpp (commit af2f…) + custom-built llama-cpp-python |
| Models | Qwen 3 0.6 B & 1.7 B (Q4_K_M / Q8 / BF16) plus Qwen 2.5 3 B (Q4_K_M) |
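For reference, loading one of these GGUF builds with llama-cpp-python is only a few lines. This is a hedged sketch: the model filename, context size, and prompt below are illustrative, not our exact configuration.

```python
# Sketch: loading a Qwen 3 GGUF build via llama-cpp-python.
# The path and parameters are illustrative, not our exact build flags.
from pathlib import Path

MODEL = Path("qwen3-1.7b-q8_0.gguf")  # any local GGUF build

def load(model_path: Path, n_threads: int = 4):
    from llama_cpp import Llama  # lazy import so the sketch parses anywhere
    return Llama(model_path=str(model_path), n_ctx=4096, n_threads=n_threads)

if MODEL.exists():
    llm = load(MODEL)
    out = llm("Say hi in five words.", max_tokens=16)
    print(out["choices"][0]["text"])
```

Setting `n_threads=4` matches the four Cortex-A76 cores; oversubscribing threads on this board just adds scheduler overhead.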

Raw throughput

| Model | Size (B) | Quant | RAM (GB) | tokens/s (4 threads) | tokens/s (1 thread) |
| --- | --- | --- | --- | --- | --- |
| Qwen 2.5-instruct | 3 | Q4_K_M | 1.3 | 5.21 | 3.23 |
| Qwen 3 | 0.6 | BF16 | 0.76 | 14.34 | 15.00 |
| Qwen 3 | 0.6 | Q8 | 0.53 | 21.48 | 13.54 |
| Qwen 3 | 1.7 | Q4_K_M | 1.90 | 5.42 | 5.76 |
| Qwen 3 | 1.7 | Q8 | 1.20 | 8.69 | 5.14 |
| Qwen 3 | 4 | Q4_K_M | 2.56 | 4.00 | 2.48 |

Source: internal llama-bench runs
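To translate throughput into user-facing latency, divide reply length by tokens/s (this ignores prompt processing, so treat it as a lower bound). A quick stdlib helper, using two Q8 figures from the table above and an illustrative 100-token reply:

```python
# Convert generation throughput (tokens/s) into rough reply latency.
# Throughput numbers come from the benchmark table; reply length is illustrative.
def reply_seconds(tokens_per_s: float, reply_tokens: int) -> float:
    """Wall-clock seconds to generate a reply, ignoring prompt processing."""
    return reply_tokens / tokens_per_s

for name, tps in [("Qwen 3 0.6B Q8", 21.48), ("Qwen 3 1.7B Q8", 8.69)]:
    print(f"{name}: {reply_seconds(tps, 100):.1f} s for a 100-token reply")
```

So a 100-token answer lands in roughly 5 s on the 0.6 B build and roughly 12 s on the 1.7 B build, which is the practical difference users feel.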

Take-away: the tiny Qwen 3 0.6 B builds use < 1 GB of RAM and still hit double-digit tokens/s (up to ~21 t/s with Q8). Need more reasoning depth? The 1.7 B Q8 build is the sweet spot: ~9 t/s in 1.2 GB, perfect for 8 GB edge boards, and even the 4 B variant fits in 2.6 GB on-chip.

Agent-style usability test

We pointed each model at two real device tools:

  1. wifi_mcp_server – scans networks & joins Wi-Fi

  2. weather_mcp_server – fetches weather by city

Each scenario was prompted 10× with slight paraphrases.
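Under the hood, a harness like this is essentially a dispatch loop: the model emits a JSON tool call, we route it to the matching server, and feed the result back. A minimal stdlib sketch; the tool names echo the two servers above, but the handlers here are stand-in stubs, not the real MCP implementations:

```python
import json

# Stub tool registry mirroring the two MCP servers above.
# Real handlers talk to the Wi-Fi stack / a weather API; these are stand-ins.
TOOLS = {
    "scan_networks": lambda args: ["HomeNet", "CoffeeShop"],
    "get_weather": lambda args: {"city": args["city"], "temp_c": 18},
}

def dispatch(model_output: str):
    """Parse a model's JSON tool call and run the matching stub handler."""
    call = json.loads(model_output)
    handler = TOOLS[call["tool"]]
    return handler(call.get("arguments", {}))

print(dispatch('{"tool": "scan_networks", "arguments": {}}'))
print(dispatch('{"tool": "get_weather", "arguments": {"city": "Seattle"}}'))
```

A “success” in the table below means the model picked the right tool with valid arguments and finished the scenario without hallucinating results.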

| Model | Successes / 10 | Notes |
| --- | --- | --- |
| DeepSeek-V3-instruct | 10 | baseline |
| Qwen 2.5 1.5 B Q4_K_M | 1 | frequent tool misuse / hallucinated SSID |
| Qwen 3 1.7 B Q4_K_M | 10 | remembered end goal, auto-planned Wi-Fi → SSH into the device |
| Qwen 3 0.6 B BF16 | 10 | slightly chattier but still reliable |

Observation: Qwen 3 models actively plan multi-step tasks. They skip redundant explanations, assume tool output is visible to the user, and immediately ask for the next parameter or input. Compared to older models they feel like proactive agents; they were almost certainly trained on agent/function-calling workflow data.
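That Wi-Fi → SSH chaining behaves like a loop that holds the end goal and advances one step per tool result. A toy stdlib sketch of the idea; the plan steps and names are illustrative, not the model’s actual internal state:

```python
# Toy planner: hold an ordered end goal and advance one step per turn,
# mimicking how Qwen 3 appeared to chain tool calls. Names are illustrative.
PLAN = ["scan_networks", "join_network", "open_ssh"]

def next_step(done: set) -> "str | None":
    """Return the first unfinished step, or None once the goal is met."""
    for step in PLAN:
        if step not in done:
            return step
    return None

done = set()
while (step := next_step(done)) is not None:
    # In the real harness this is a tool call; here we just record it.
    done.add(step)
print(sorted(done))
```

The interesting part in our runs was that Qwen 3 maintained this kind of goal state across turns without being told to, while Qwen 2.5 lost the thread after the first tool call.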

What this means for builders

  1. Battery-friendly LLMs are here. You can now hit double-digit tokens/s with credit-card-sized compute.

  2. Agentic behaviour at 1–2 B params unlocks voice-first workflows for many use cases: “Scan Wi-Fi → join → open SSH” happens in one round-trip.

  3. Memory headroom (≤ 2 GB) leaves space for vision or audio pipelines on the same board.

Further work

  • We are currently upgrading the Qwen2.5 stack to Qwen3 and will update our GitHub repo once things are stable. If you have a Raspberry Pi 5, feel free to try it out!

  • We are working on replacing faster-whisper with parakeet-tdt. We believe the 1.7B model paired with a better STT system can significantly improve the edge user experience. We'll publish benchmarks of all transcription tasks tested on Raspberry Pi 5 in our next post—stay tuned and follow along if you're interested.

If you want to get a plug-and-play devkit to experiment with edge LLM, check out our shop and YouTube videos.

Appendix: External resources

  • Qwen team’s launch blog – deeper dive into dataset & architecture (Qwen)

  • Baseten “Day-zero Qwen 3” latency report (235 B model) – shows server-scale potential (Baseten)

  • GitHub repo with GGUF weights (0.6 B & 1.7 B) (Hugging Face)


Written by

PamirAI Founders