Deploying Gemma 1B GGUF model with llama.cpp on Ubuntu

Pranjali Sachan

For this guide, we will be using an Ubuntu server with at least 2 GiB of GPU VRAM, 2 vCPU cores, 8 GiB of RAM, and 50 GiB of storage.

Let’s start by updating and installing necessary libraries.

sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential cmake git curl

For GPU acceleration, we will need CUDA (for NVIDIA GPUs).

sudo apt install -y nvidia-cuda-toolkit

What is CUDA?

CUDA (Compute Unified Device Architecture) is a parallel computing platform developed by NVIDIA that allows developers to use NVIDIA GPUs for general-purpose computing (GPGPU).

Why is CUDA important?

  • Massive Parallelism: Uses thousands of GPU cores for simultaneous calculations.

  • High Performance: Faster than CPU-based processing for AI/ML, gaming, and scientific computing.

  • Deep Learning Optimization: Enables efficient execution of frameworks like PyTorch, TensorFlow, and LLaMA.cpp.

How does CUDA work?

CUDA enables GPU acceleration by offloading computations from the CPU to the GPU.

  • Host (CPU): Sends tasks to the GPU.

  • Device (GPU): Executes tasks in parallel.

  • Memory Transfer: Data is moved between CPU (RAM) and GPU (VRAM).
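
Before moving on to the build, you can do a quick sanity check that the toolkit and driver are visible (the output will vary with your CUDA and driver versions):

nvcc --version   # CUDA compiler version shipped with the toolkit
nvidia-smi       # driver version, GPU model, and current VRAM usage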


Next, we will clone and build llama.cpp

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
mkdir build && cd build

# Important! Run only ONE of the following, depending on your hardware:
cmake ..                 # CPU only (no GPU)
cmake .. -DGGML_CUDA=ON  # NVIDIA GPU (CUDA); older llama.cpp releases used -DLLAMA_CUDA=ON instead

cmake --build . --config Release
cd # return to the home directory

Great! You can view the build output inside the llama.cpp/build/bin directory.
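
If you'd like to confirm the build produced the binaries we will use later, a quick listing and help check should do (the exact set of tools depends on the llama.cpp version you cloned):

ls llama.cpp/build/bin                                # should include llama-server among other tools
llama.cpp/build/bin/llama-server --help | head -n 20  # print the first few lines of usage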


Now, we need to download our quantized model from huggingface.co. To do that, we will install huggingface-cli using pip. First, check whether Python and pip are installed:

python3 --version
pip3 --version

# If Not Installed, run
sudo apt install python3 python3-pip -y

Next, create a virtual env, activate it, and install huggingface-cli:

python3 -m venv environment_name # replace environment_name with the name of your choice
source environment_name/bin/activate
pip3 install --upgrade huggingface_hub
# deactivate # use this to deactivate the virtual env when you are finished downloading the model
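
To double-check that the CLI installed correctly inside the virtual environment, you can print its help text:

huggingface-cli --help   # lists available subcommands such as login and download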

At this point, you will need to create an account on huggingface.co and generate an access token that we can use to log in with huggingface-cli on our server:

huggingface-cli login
# It should prompt you to enter your access token!
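
If you prefer a non-interactive login (handy on a headless server), the token can also be passed directly; HF_TOKEN below is just a placeholder for wherever you keep your token:

huggingface-cli login --token "$HF_TOKEN"   # HF_TOKEN is a placeholder environment variable holding your token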

Now, we’re ready to download our model - unsloth/gemma-3-1b-it-GGUF. You can use any similar model that satisfies the minimum hardware requirements. We will be downloading gemma-3-1b-it-Q4_K_M.gguf for balanced performance.

huggingface-cli download unsloth/gemma-3-1b-it-GGUF gemma-3-1b-it-Q4_K_M.gguf --local-dir ./models

What is Q4_K_M in models?

Q4_K_M is a quantization format used in GGUF models to reduce model size while maintaining accuracy and performance. It uses less VRAM, has good accuracy, and works on both GPUs and CPUs.
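
As a rough sanity check (assuming roughly 4-5 bits per weight for Q4_K_M, which is only an approximation), a 1B-parameter model should come in far below its full-precision size:

# ~1,000,000,000 weights x ~4.5 bits / 8 bits-per-byte ≈ 0.56 GB, plus embeddings and metadata
ls -lh models/gemma-3-1b-it-Q4_K_M.gguf   # compare the estimate with the actual file size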


We will use tmux to run the model in the background. Ensure tmux is installed:

tmux -V # check tmux version
sudo apt install tmux -y # run only if tmux command not found

# start a new tmux session
tmux new -s llama-server

Awesome, we’re now ready to run the model using llama.cpp/build/bin/llama-server, which will start a lightweight chat UI on 127.0.0.1:8080 to interact with the model.

llama.cpp/build/bin/llama-server -m models/gemma-3-1b-it-Q4_K_M.gguf --n-gpu-layers 20 --threads 2

What is the --n-gpu-layers 20 flag?

It specifies how many layers of the model should be offloaded to the GPU instead of running on the CPU (see the example after the VRAM table below).

How to Choose the Right Value?

  • Higher value (100+) → More layers on GPU → Faster inference but requires more VRAM.

  • Lower value (10-50) → Fewer layers on GPU → Slower inference but consumes less VRAM.

Typical VRAM Requirements:

Model   Layers on GPU (--n-gpu-layers)   VRAM Needed
3B      20-50                            ~4 GB
7B      40-70                            ~6 GB
13B     80-100                           ~10 GB
33B     100+                             ~24 GB+
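
For a 1B model like ours, most or all layers typically fit even in 2 GiB of VRAM. If you want to experiment later, restart the server with a higher value and watch VRAM usage while it loads (exact numbers depend on your GPU and build):

# llama.cpp offloads at most the model's actual number of layers, so a large value like 99 means "as many as possible"
llama.cpp/build/bin/llama-server -m models/gemma-3-1b-it-Q4_K_M.gguf --n-gpu-layers 99 --threads 2

# In a second terminal or tmux window, monitor VRAM usage
nvidia-smi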

We can detach from the tmux session llama-server, leaving it running in the background, by pressing Ctrl + B followed by D. Below are additional commands to list, re-attach to, or kill a tmux session:

tmux ls # list tmux sessions
tmux attach -t llama-server # re-attach session running llama-server
tmux kill-session -t llama-server # kill session running llama-server

Now, to access the chat UI from outside the server, we can set up a reverse proxy to the chat UI running locally on 127.0.0.1:8080. Let’s set up the reverse proxy using NGINX.

sudo apt install nginx -y   # Install NGINX
nginx -v                    # Check if installed correctly
sudo systemctl start nginx  # Start NGINX service
sudo systemctl status nginx # Check status

Next, we will create an NGINX configuration file and call it llama-server:

sudo nano /etc/nginx/sites-available/llama-server

Add the following configuration to the file, replacing server_name with your server's domain name or public IP address:

server {
    listen 80;
    server_name yourdomain.com; # replace with your domain name or the server's IP address

    location / {
        proxy_pass http://127.0.0.1:8080; # since our chat ui is running locally on 8080
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Next, enable the configuration file and restart NGINX service:

sudo ln -s /etc/nginx/sites-available/llama-server /etc/nginx/sites-enabled/ # enable config
sudo nginx -t # check for syntax errors
sudo systemctl restart nginx # restart NGINX service
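
A quick way to confirm the proxy is working from the server itself is to request the root path through NGINX; a 200 status code suggests requests are reaching llama-server (assuming it is still running in the tmux session):

curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1/   # expect 200 if NGINX and llama-server are both up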

It is recommended to add a firewall rule that limits access to the server to your own IP address (a minimal sketch follows the commands below). I will publish a separate article on locking down access to your server, but for now, we will enable the firewall and allow HTTP traffic:

sudo ufw allow ssh  # allow SSH first so the firewall does not lock you out of the server
sudo ufw allow http # allow http traffic
sudo ufw enable     # enable firewall
sudo ufw status     # check that the rules were added
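
As a minimal sketch of that stricter rule (203.0.113.10 below is a placeholder for your own public IP address), you could allow port 80 only from your IP instead of opening it to everyone:

sudo ufw allow from 203.0.113.10 to any port 80 proto tcp   # placeholder IP; replace with your own public IP
sudo ufw status numbered                                    # verify the rule was added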

Great! Now, you should be able to access the chat UI and interact with the model by navigating to http://ip-of-server/ in your browser!
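
Besides the chat UI, llama-server also exposes an OpenAI-compatible HTTP API, so you can test the deployment from your own machine with curl (replace ip-of-server with your server's actual address):

curl http://ip-of-server/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Introduce yourself in one sentence."}]}'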

[Screenshot: the llama.cpp chat UI running in the browser]

Thank you for reading this guide! I appreciate any comments or suggestions from readers!
