Docker Model Runner is available on Linux...


Those who have been following me for some time know that I've developed a sort of addiction to SLMs (Small Language Models) and that I enjoy running them on my Raspberry Pi 5 (8GB RAM) with Ollama. By doing this, my goal is threefold:
Demonstrate that we can do Generative AI without a GPU
Demonstrate that SLMs can be useful (coupled with RAG, for example, or working as a "team")
And ultimately, demonstrate that we can also do Generative AI on less powerful machines with what I call "Tiny Language Models" (TLMs).
The release of Docker Model Runner for Linux was excellent news: I was going to be able to verify if my favorite work tool would allow me to satisfy my passion (TLMs on RPi 5). And if it runs on my RPi 5, it will also run on any Linux machine... But also in your CI Pipelines 😉 (I sense a new series of blog posts approaching).
Disclaimer: Docker Model Runner is in beta phase and some points are subject to change in the future.
Prerequisites
You'll need a Linux machine, the RPi 5 with 8 GB RAM and a good SD Card is the perfect candidate, but you can use any Linux machine with at least 8 GB RAM. You can also of course use a VM for your tests.
You'll need to install Docker and Docker Compose. For those who might worry: I run them without problems on Pi Zeros and a Pi 3 A+, so on an RPi 5 it will work without issues.
For info: I use Raspbian on my Pi 5.
You can find installation instructions here:
Install Docker on Ubuntu: https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository
✋ The Docker Compose version must be at least v2.36.2.
Good practice: add your user to the Docker group
sudo usermod -aG docker $USER
newgrp docker
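A quick sanity check at this point never hurts: the following commands should confirm that the Docker Engine and the Compose plugin are properly installed (your version numbers will of course differ from mine):
# Docker Engine version (also confirms your user can talk to the daemon)
docker version
# Docker Compose plugin version (see the note above about v2.36.2)
docker compose version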
Installing Docker Model Runner
The installation is extremely simple, just follow these instructions:
sudo apt-get update
sudo apt-get install docker-model-plugin
docker model list # (this will finish the setup)
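Before pulling anything, you can also double-check that the runner itself is up; the status and version sub-commands (you'll see them again in the help output further down) should be enough:
# check that the Docker Model Runner is running
docker model status
# display the Docker Model Runner version
docker model version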
Testing the installation with a first TLM: Qwen 2.5 0.5B
For your first test, I recommend testing with Qwen 2.5 0.5B. The "0.5B" refers to the number of parameters in the model. Parameters are the weights and biases learned by the neural network during training. The more parameters a model has, the more nuances and complexity it can theoretically capture in the data, but it also requires more computational resources and memory. Qwen 2.5 and its variants are available on Docker Hub here: https://hub.docker.com/r/ai/qwen2.5
The 0.5B model is designed to be lightweight and efficient, ideal for applications with resource constraints or for deployment on less powerful devices, while maintaining reasonable capabilities for many tasks.
To download its small version, use the following command:
docker model pull ai/qwen2.5:0.5B-F16
In qwen2.5:0.5B-F16, the "F16" refers to the numerical precision format used to store the model parameters (here, 16-bit floating point).
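For reference, model registries usually also offer quantized variants with tags like Q8_0, Q5_0 or Q4_K_M: same number of parameters, but stored with fewer bits per weight, which means smaller files and less RAM in exchange for a small loss of quality (we'll use a q5_0 variant later in this post).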
You can then verify that the model is properly installed by running the following command:
docker model list
You should see output similar to this:
MODEL NAME PARAMETERS QUANTIZATION ARCHITECTURE MODEL ID CREATED SIZE
ai/qwen2.5:0.5B-F16 494.03 M F16 qwen2 3e1aad67b4cc 2 months ago 942.43 MiB
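A quick sanity check on that SIZE column, by the way: 494.03 M parameters stored in F16 means roughly 2 bytes per parameter, so about 494 × 2 ≈ 988 MB, which is indeed roughly 942 MiB. The file size of a model is essentially its parameter count multiplied by the precision of each weight.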
And you can now ask Qwen 2.5 your first question.
In interactive mode:
docker model run ai/qwen2.5:0.5B-F16
Or directly in command mode:
docker model run ai/qwen2.5:0.5B-F16 "Who is Spock?"
And you can note that the response time is quite acceptable for such a small machine.
But, can I find even smaller models?
Yes: there's SmolLM2, for example, but it might be a bit too small for my taste, and I haven't yet managed to get what I wanted out of it. Nevertheless, it's even faster than Qwen 2.5 0.5B on small machines, so I'm keeping it for future tests and to understand which use cases it could be useful for (to be continued).
What if you "packaged" your own models? There are numerous models available on Hugging Face in GGUF format that are usable with Docker Model Runner.
GGUF is a file format specifically designed to store and distribute language models (LLMs). It comes from the llama.cpp project (the very inference engine Docker Model Runner uses under the hood, as you'll see in the API paths later) and has become the de facto standard for distributing open-source LLMs optimized for local inference.
Let's see how to proceed to package your own model for Docker Model Runner.
Packaging your own model for Docker Model Runner
If you type the following command:
docker model help
You'll get this output:
Usage: docker model COMMAND
Docker Model Runner
Commands:
inspect Display detailed information on one model
install-runner Install Docker Model Runner
list List the available models that can be run with the Docker Model Runner
logs Fetch the Docker Model Runner logs
package package a model
pull Download a model
push Upload a model
rm Remove models downloaded from Docker Hub
run Run a model with the Docker Model Runner
status Check if the Docker Model Runner is running
tag Tag a model
uninstall-runner Uninstall Docker Model Runner
version Show the Docker Model Runner version
Run 'docker model COMMAND --help' for more information on a command.
And you can see that a package command is available. This command allows you to package a model for Docker Model Runner.
Now, if you type the following command:
docker model package --help
You'll get this output:
Usage: docker model package --gguf <path> [--license <path>...] --push TARGET
package a model
Options:
--gguf string absolute path to gguf file (required)
-l, --license stringArray absolute path to a license file
--push push to registry (required)
So we need a model in GGUF format and a license file. Let's take a tour of Hugging Face to find a model that suits us. ✋ You'll need to create an account to be able to download models.
Let's stick with Qwen2.5-0.5B
I'm particularly fond of "Qwen" models. Later, it will be up to you to do your own explorations on Hugging Face to find the models that suit you best. But for now, let's stick with Qwen2.5-0.5B.
If you search for "Qwen2.5-0.5B gguf", the suggested search result will be: Qwen2.5-0.5B-Instruct-GGUF . Then if you click on the "Files and versions" tab, you'll get a list of model variant files:
I suggest downloading the qwen2.5-0.5b-instruct-q5_0.gguf file as well as the LICENSE file to your machine, then running the following script to package the model and push it to Docker Hub:
#!/bin/bash
HF_MODEL=qwen2.5-0.5b-instruct-q5_0.gguf
LICENSE=LICENSE
HANDLE=k33g
DMR_MODEL=qwen2.5:0.5b-instruct-q5_0
docker model package \
--gguf $(pwd)/${HF_MODEL} \
--license $(pwd)/${LICENSE} \
--push ${HANDLE}/${DMR_MODEL}
docker model pull ${HANDLE}/${DMR_MODEL}
✋ Of course, you'll need to adapt the HANDLE variable according to your Docker Hub username.
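By the way, if you prefer the command line to the Hugging Face web interface for the download step, direct URLs of the following form should work (I'm assuming the Qwen/Qwen2.5-0.5B-Instruct-GGUF repository here, and that it doesn't require authentication; adapt the URLs, or pass a token, if your chosen repository does):
# download the GGUF file (adjust the URL to the repository and variant you actually chose)
curl -L -O https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q5_0.gguf
# download the license file from the same repository
curl -L -O https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/LICENSE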
Once the model is published on Docker Hub, you can use it like any other model. Go back to your Pi and run the following command:
docker model pull k33g/qwen2.5:0.5b-instruct-q5_0
✋ Here too, change k33g to your Docker Hub username (or use my version if you just want to run tests).
And you can now use this new model like other models:
docker model run k33g/qwen2.5:0.5b-instruct-q5_0 "Who is Spock?"
Testing qwen2.5:0.5b-instruct-q5_0 with a Generative AI application
For this, I've prepared a small Docker Compose project that uses this model with Agent Development Kit (ADK), Google's Python agent framework: https://github.com/whales-collective/tiny-agent
To use it, it's very simple:
git clone https://github.com/whales-collective/tiny-agent.git
cd tiny-agent/agents/
docker compose up
And you can then access the Generative AI application via your browser at the following address: http://domain-name-of-your-pi:6060
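The very first startup takes a little while (the application image must be built and the model pulled). If you want to keep an eye on what's happening, these commands should help from another terminal:
# follow the logs of the application service
docker compose logs -f tiny-agent
# fetch the Docker Model Runner logs
docker model logs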
While the Compose project loads and builds, let's examine the compose.yml file:
services:
  tiny-agent:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - 6060:8000
    environment:
      - PORT=8000
      - OPENAI_API_BASE=${DMR_BASE_URL}/engines/llama.cpp/v1
      - OPENAI_API_KEY="tinymodelsaretheway"
      - MODEL_RUNNER_CHAT_MODEL=${MODEL_RUNNER_CHAT_MODEL}
    depends_on:
      - download-chat-model

  download-chat-model:
    provider:
      type: model
      options:
        model: ${MODEL_RUNNER_CHAT_MODEL}
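Note the download-chat-model service: it uses the (recent) model provider type of Docker Compose, which asks Docker Model Runner to pull and serve the model declared in MODEL_RUNNER_CHAT_MODEL, and the depends_on entry ensures this happens before the tiny-agent service starts.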
Google's ADK uses LiteLLM to connect to local LLMs, and LiteLLM talks to them through the OpenAI-compatible API.
✋ What's important to note in this Compose file are the environment variables MODEL_RUNNER_CHAT_MODEL and DMR_BASE_URL, which are defined in the .env file:
DMR_BASE_URL=http://172.17.0.1:12434
MODEL_RUNNER_CHAT_MODEL=k33g/qwen2.5:0.5b-instruct-q5_0
Our application runs in a container and must be able to reach the Docker Model Runner API. For this, we must use the address http://172.17.0.1:12434 (which will get a friendlier "nickname" in the future).
Outside of a container, you can perfectly well use http://localhost:12434 to access the Docker Model Runner API:
curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "k33g/qwen2.5:0.5b-instruct-q5_0",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who is Spock?"
}
]
}'
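And since the endpoint is OpenAI-compatible, listing the models known to the runner the same way should also work (assuming the engine exposes the usual /v1/models route, which is the case as far as I can tell at the time of writing):
# list the available models through the OpenAI-compatible API
curl http://localhost:12434/engines/llama.cpp/v1/models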
Let's test the application
The application should now be accessible at http://domain-name-of-your-pi:6060, and once the model is loaded, you'll see that the response times remain acceptable.
🎉 That's all for today. Of course, you'll never be able to ask my little Qwen the same things you'd ask ChatGPT, but you can see that Docker Model Runner with local models opens up new perspectives. And you can already start coding your first Generative AI applications.
My next blog posts around this theme will be the following:
Setting up automated LLM testing with Docker Model Runner and Testcontainers (how to verify that the selected model actually meets my expectations)
Setting up a local CI with Docker Model Runner and Docker Compose
Using Docker Model Runner with GitLab CI
... I'll probably have other ideas in the meantime.
So stay tuned and don't hesitate to share your feedback and ideas with me.