Docker Model Runner is available on Linux...


Those who have been following me for some time know that I've developed a sort of addiction to SLMs (Small Language Models) and that I enjoy running them on my Raspberry Pi 5 (8GB RAM) with Ollama. By doing this, my goal is threefold:
Demonstrate that we can do Generative AI without a GPU
Demonstrate that SLMs can be useful (coupled with RAG, for example, or working as a "team")
And ultimately, demonstrate that we can also do Generative AI on less powerful machines with what I call "Tiny Language Models" (TLMs).
The release of Docker Model Runner for Linux was excellent news: I was going to be able to verify if my favorite work tool would allow me to satisfy my passion (TLMs on RPi 5). And if it runs on my RPi 5, it will also run on any Linux machine... But also in your CI Pipelines 😉 (I sense a new series of blog posts approaching).
Disclaimer: Docker Model Runner is in beta phase and some points are subject to change in the future.
Prerequisites
You'll need a Linux machine, the RPi 5 with 8 GB RAM and a good SD Card is the perfect candidate, but you can use any Linux machine with at least 8 GB RAM. You can also of course use a VM for your tests.
You'll need to install Docker and Docker Compose. For those who might worry: I run them without problems on Pi Zeros and a Pi 3 A+, so on an RPi 5 it will work without issues.
For info: I use Raspbian on my Pi 5.
You can find installation instructions here:
Install Docker on Ubuntu: https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository
✋ The Docker Compose version must be at least v2.36.2.
Good practice: add your user to the Docker group
sudo usermod -aG docker $USER
newgrp docker
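A quick sanity check at this point never hurts: the following commands should confirm that the Docker Engine and the Compose plugin are properly installed (your version numbers will of course differ from mine):
# Docker Engine version (also confirms your user can talk to the daemon)
docker version
# Docker Compose plugin version (see the note above about v2.36.2)
docker compose version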
Installing Docker Model Runner
The installation is extremely simple, just follow these instructions:
sudo apt-get update
sudo apt-get install docker-model-plugin
docker model list # (this will finish the setup)
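Before pulling anything, you can also double-check that the runner itself is up; the status and version sub-commands (you'll see them again in the help output further down) should be enough:
# check that the Docker Model Runner is running
docker model status
# display the Docker Model Runner version
docker model version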
Testing the installation with a first TLM: Qwen 2.5 0.5B
For your first test, I recommend testing with Qwen 2.5 0.5B. The "0.5B" refers to the number of parameters in the model. Parameters are the weights and biases learned by the neural network during training. The more parameters a model has, the more nuances and complexity it can theoretically capture in the data, but it also requires more computational resources and memory. Qwen 2.5 and its variants are available on Docker Hub here: https://hub.docker.com/r/ai/qwen2.5
The 0.5B model is designed to be lightweight and efficient, ideal for applications with resource constraints or for deployment on less powerful devices, while maintaining reasonable capabilities for many tasks.
To download its small version, use the following command:
docker model pull ai/qwen2.5:0.5B-F16
In qwen2.5:0.5B-F16, the "F16" refers to the numerical precision format used to store the model parameters (here, 16-bit floating point).
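For reference, model registries usually also offer quantized variants with tags like Q8_0, Q5_0 or Q4_K_M: same number of parameters, but stored with fewer bits per weight, which means smaller files and less RAM in exchange for a small loss of quality (we'll use a q5_0 variant later in this post).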
You can then verify that the model is properly installed by running the following command:
docker model list
You should see output similar to this:
MODEL NAME PARAMETERS QUANTIZATION ARCHITECTURE MODEL ID CREATED SIZE
ai/qwen2.5:0.5B-F16 494.03 M F16 qwen2 3e1aad67b4cc 2 months ago 942.43 MiB
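A quick sanity check on that SIZE column, by the way: 494.03 M parameters stored in F16 means roughly 2 bytes per parameter, so about 494 × 2 ≈ 988 MB, which is indeed roughly 942 MiB. The file size of a model is essentially its parameter count multiplied by the precision of each weight.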
And you can now ask Qwen 2.5 your first question.
In interactive mode:
docker model run ai/qwen2.5:0.5B-F16
Or directly in command mode:
docker model run ai/qwen2.5:0.5B-F16 "Who is Spock?"
And you can note that the response time is quite acceptable for such a small machine.
But, can I find even smaller models?
Yes: there's SmolLM2, for example, but it might be a bit too small for my taste, and I haven't yet managed to get what I wanted out of it. Nevertheless, it's even faster than Qwen 2.5 0.5B on small machines, so I'm keeping it for future tests and to understand which use cases it could be useful for (to be continued).
What if you "packaged" your own models? There are numerous models available on Hugging Face in GGUF format that are usable with Docker Model Runner.
GGUF is a file format specifically designed to store and distribute language models (LLMs). It comes from the llama.cpp project (the very inference engine Docker Model Runner uses under the hood, as you'll see in the API paths later) and has become the de facto standard for distributing open-source LLMs optimized for local inference.
Let's see how to proceed to package your own model for Docker Model Runner.
Packaging your own model for Docker Model Runner
If you type the following command:
docker model help
You'll get this output:
Usage: docker model COMMAND
Docker Model Runner
Commands:
inspect Display detailed information on one model
install-runner Install Docker Model Runner
list List the available models that can be run with the Docker Model Runner
logs Fetch the Docker Model Runner logs
package package a model
pull Download a model
push Upload a model
rm Remove models downloaded from Docker Hub
run Run a model with the Docker Model Runner
status Check if the Docker Model Runner is running
tag Tag a model
uninstall-runner Uninstall Docker Model Runner
version Show the Docker Model Runner version
Run 'docker model COMMAND --help' for more information on a command.
And you can see that a package command is available. This command allows you to package a model for Docker Model Runner.
Now, if you type the following command:
docker model package --help
You'll get this output:
Usage: docker model package --gguf <path> [--license <path>...] --push TARGET
package a model
Options:
--gguf string absolute path to gguf file (required)
-l, --license stringArray absolute path to a license file
--push push to registry (required)
So we need a model in GGUF format and a license file. Let's take a tour of Hugging Face to find a model that suits us. ✋ You'll need to create an account to be able to download models.
Let's stick with Qwen2.5-0.5B
I'm particularly fond of "Qwen" models. Later, it will be up to you to do your own explorations on Hugging Face to find the models that suit you best. But for now, let's stick with Qwen2.5-0.5B.
If you search for "Qwen2.5-0.5B gguf", the suggested search result will be: Qwen2.5-0.5B-Instruct-GGUF . Then if you click on the "Files and versions" tab, you'll get a list of model variant files:
I suggest downloading the qwen2.5-0.5b-instruct-q5_0.gguf file as well as the LICENSE file to your machine, then running the following script to package the model and push it to Docker Hub:
#!/bin/bash
HF_MODEL=qwen2.5-0.5b-instruct-q5_0.gguf
LICENSE=LICENSE
HANDLE=k33g
DMR_MODEL=qwen2.5:0.5b-instruct-q5_0
docker model package \
--gguf $(pwd)/${HF_MODEL} \
--license $(pwd)/${LICENSE} \
--push ${HANDLE}/${DMR_MODEL}
docker model pull ${HANDLE}/${DMR_MODEL}
✋ Of course, you'll need to adapt the HANDLE variable according to your Docker Hub username.
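By the way, if you prefer the command line to the Hugging Face web interface for the download step, direct URLs of the following form should work (I'm assuming the Qwen/Qwen2.5-0.5B-Instruct-GGUF repository here, and that it doesn't require authentication; adapt the URLs, or pass a token, if your chosen repository does):
# download the GGUF file (adjust the URL to the repository and variant you actually chose)
curl -L -O https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q5_0.gguf
# download the license file from the same repository
curl -L -O https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/LICENSE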
Once the model is published on Docker Hub, you can use it like any other model. Go back to your Pi and run the following command:
docker model pull k33g/qwen2.5:0.5b-instruct-q5_0
✋ Here too, change k33g to your Docker Hub username (or use my version if you just want to run tests).
And you can now use this new model like other models:
docker model run k33g/qwen2.5:0.5b-instruct-q5_0 "Who is Spock?"
Testing qwen2.5:0.5b-instruct-q5_0 with a Generative AI application
For this, I've prepared a small Docker Compose project that uses this model with Agent Development Kit (ADK), Google's Python agent framework: https://github.com/whales-collective/tiny-agent
To use it, it's very simple:
git clone https://github.com/whales-collective/tiny-agent.git
cd tiny-agent/agents/
docker compose up
And you can then access the Generative AI application via your browser at the following address: http://domain-name-of-your-pi:6060
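The very first startup takes a little while (the application image must be built and the model pulled). If you want to keep an eye on what's happening, these commands should help from another terminal:
# follow the logs of the application service
docker compose logs -f tiny-agent
# fetch the Docker Model Runner logs
docker model logs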
While the Compose project loads and builds, let's examine the compose.yml file:
services:
  tiny-agent:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - 6060:8000
    environment:
      - PORT=8000
      - OPENAI_API_BASE=${DMR_BASE_URL}/engines/llama.cpp/v1
      - OPENAI_API_KEY="tinymodelsaretheway"
      - MODEL_RUNNER_CHAT_MODEL=${MODEL_RUNNER_CHAT_MODEL}
    depends_on:
      - download-chat-model

  download-chat-model:
    provider:
      type: model
      options:
        model: ${MODEL_RUNNER_CHAT_MODEL}
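Note the download-chat-model service: it uses the (recent) model provider type of Docker Compose, which asks Docker Model Runner to pull and serve the model declared in MODEL_RUNNER_CHAT_MODEL, and the depends_on entry ensures this happens before the tiny-agent service starts.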
Google's ADK uses LiteLLM to connect to local LLMs, and LiteLLM talks to them through the OpenAI-compatible API.
✋ What's important to note in this Compose file are the environment variables MODEL_RUNNER_CHAT_MODEL and DMR_BASE_URL, which are defined in the .env file:
DMR_BASE_URL=http://172.17.0.1:12434
MODEL_RUNNER_CHAT_MODEL=k33g/qwen2.5:0.5b-instruct-q5_0
Our application runs in a container and must be able to reach the Docker Model Runner API. For this, we must use the address http://172.17.0.1:12434 (which will get a friendlier "nickname" in the future).
Outside of a container, you can perfectly well use http://localhost:12434 to access the Docker Model Runner API:
curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "k33g/qwen2.5:0.5b-instruct-q5_0",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who is Spock?"
}
]
}'
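And since the endpoint is OpenAI-compatible, listing the models known to the runner the same way should also work (assuming the engine exposes the usual /v1/models route, which is the case as far as I can tell at the time of writing):
# list the available models through the OpenAI-compatible API
curl http://localhost:12434/engines/llama.cpp/v1/models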
Let's test the application
The application should now be accessible at http://domain-name-of-your-pi:6060, and once the model is loaded, you'll see that the response times remain acceptable.
🎉 That's all for today. Of course, you'll never be able to ask my little Qwen the same things you'd ask ChatGPT, but you can see that Docker Model Runner with local models opens up new perspectives. And you can already start coding your first Generative AI applications.
My next blog posts around this theme will be the following:
Setting up automated LLM testing with Docker Model Runner and Testcontainers (how to verify that the selected model actually meets my expectations)
Setting up a local CI with Docker Model Runner and Docker Compose
Using Docker Model Runner with GitLab CI
... I'll probably have other ideas in the meantime.
So stay tuned and don't hesitate to share your feedback and ideas with me.