How To Locally Run a Hugging Face Model Using Ollama


Hey guys, this will be a quick guide on locally running a simple model downloaded from Hugging Face using Ollama.
Ollama is a software platform that simplifies running large language models (LLMs) locally on your computer. Its CLI allows you to access and interact with open-source LLMs without the complexities of managing model weights and configurations, offering a private alternative to cloud-based LLM chatbots like OpenAI’s ChatGPT or Anthropic’s Claude.
Setting Up Ollama
The first step is to visit Ollama’s website and download the app for your supported operating system.
Once that's done, run the software from your application manager.
Nothing seems to happen, but Ollama will start a live server running in the background. You should see an Ollama icon on your desktop menu bar.
This confirms the server is up and running. To confirm the Ollama CLI was installed successfully, open your terminal and run ollama. This should output a list of available commands.
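You can also check that the background server itself is reachable. Assuming Ollama is listening on its default local port (11434), a plain HTTP request should get a short status reply:

# Quick sanity check against the local Ollama server
curl http://localhost:11434
# Responds with a short "Ollama is running" message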
Ollama gives you access to a long list of popular models to pull and run out of the box. The model library contains all supported models.
To run a supported model, use this command:
ollama run MODEL_NAME
This downloads the specified model if you don’t have it locally yet, then runs it so you can start prompting. Below, I’m running Meta’s Llama 3.2 model.
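Assuming you want the default Llama 3.2 tag from Ollama’s model library, that looks like this:

# Pulls the model on first run, then opens an interactive prompt
ollama run llama3.2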
Easy as that. Now let's get to the good stuff. Running Hugging Face models with Ollama.
Running Hugging Face Transformer Models
Hugging Face is a community-driven platform and library that provides open-source tools, AI models, and datasets to help developers build, share, and train machine learning models more easily. It's really the GitHub for the AI LLM community.
In this section, you will download a simple text-generation base model from Hugging Face that you can pull into Ollama and run. Here’s the problem: most models on Hugging Face are distributed in the .safetensors format, usually loaded through the Hugging Face Transformers library. Ollama does not support this file type, so the model won’t run inside the application as-is.
So we need to convert the model into the .gguf format used by llama.cpp, the inference engine behind Ollama. GGUF packs the model weights and metadata into a single file and is typically used to store quantized, more efficient versions of transformer models.
Download the zipped package of Kyutai’s 2B-parameter text generation model, making sure to keep all the files contained within the archive, since the config and tokenizer files are needed alongside the weights.
Since this is all open-source stuff, you can find converted versions like this from great community members. Don't worry. I'll still show you how to do it yourself if you can't find a converted model you'd like to use.
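If you’d rather pull the original model files straight from the command line, the huggingface_hub package ships a CLI that can download a whole model repository into a local folder. This is just a sketch; the repository ID shown (kyutai/helium-1-preview-2b) is my assumption for Kyutai’s 2B model, so confirm the exact ID on Hugging Face before running it:

pip install huggingface_hub
# Download every file in the model repo (weights, config, tokenizer) into ./tensormodel
huggingface-cli download kyutai/helium-1-preview-2b --local-dir ./tensormodel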
Converting Safetensor Models
For this stage, you’ll need to install a few Python dependencies and a pre-built version of the llama.cpp tool, which is needed to convert your model file.
Run the following commands to install the dependencies:
pip install transformers safetensors gguf
On Mac and Linux, the Homebrew package manager can be used to install llama.cpp:
brew install llama.cpp
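If Homebrew isn’t an option, llama.cpp can also be cloned and built from source. The steps below follow the project’s standard CMake build, so double-check the llama.cpp README in case they’ve changed:

# Build llama.cpp (including the llama-quantize tool) from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

The convert_hf_to_gguf.py script used later in this guide lives in the root of that repository.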
Here’s what each dependency is for:
safetensors - Required if the model you’re converting is in the .safetensors format.
transformers - Provides tools to load and interpret Hugging Face models; it lets you access the model config, tokenizer, and architecture needed for conversion.
llama.cpp - Provides the conversion script and the quantization tool needed to generate a GGUF model you can run locally.
gguf - The Python module that transforms the model from Hugging Face’s safetensors format into the GGUF format.
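A quick way to confirm the Python packages installed cleanly is a one-line import check (just a sanity sketch; the exact version printed will vary):

# Should print a version string instead of raising ImportError
python -c "import transformers, safetensors, gguf; print(transformers.__version__)"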
Once everything is installed, run the conversion command on the unzipped package folder containing the .safetensors file. In my case, I named the folder tensormodel.
python convert_hf_to_gguf.py ./tensormodel --outfile ./tensormodel/helium-1-2b.gguf
If successful, you should have a converted model named ‘helium-1-2b.gguf’ in the same directory.
If your shell can’t find the convert_hf_to_gguf.py script from the llama.cpp conversion tool, you can locate the directory where llama.cpp was installed by running brew info llama.cpp and call the script using its full path. For example:
python <path/to/llama.cpp>/convert_hf_to_gguf.py ./tensormodel --outfile ./tensormodel/helium-1-2b.gguf
Let's take it one step further and quantize the converted file. Quantization reduces the number of bits used to represent each weight in the model, making it smaller and faster to run.
Run the llama-quantize command on the converted model to quantize it:
<path/to/llama.cpp>/llama-quantize ./tensormodel/helium-1-2b.gguf ./tensormodel/helium-1-2b-q8_0.gguf q8_0
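The q8_0 target stores weights at roughly 8 bits each, so the quantized file should come out noticeably smaller than the freshly converted one. A quick way to compare the two (assuming both files are still in the tensormodel folder):

# Compare the converted and quantized GGUF file sizes
ls -lh ./tensormodel/helium-1-2b.gguf ./tensormodel/helium-1-2b-q8_0.gguf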
Creating a Custom Modelfile
You’ve got the Ollama server running, and you’ve got your quantized model ready. Now you just need to describe the model in a Modelfile and register it with Ollama so you can run inference against it.
Create a Modelfile inside your project root and populate it with the following (notice that no file extension is added):
FROM ./helium-1-2b-q8_0.gguf
# Set the system prompt
SYSTEM """
You are a helpful AI assistant that provides accurate and concise responses.
"""
# Set model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
# Set license
LICENSE apache-2.0
This is your custom model file containing the configuration for how Ollama will build your model. For this guide, the only config you need to care about is the FROM instruction, which points to your quantized GGUF file.
Once that's set, you’re ready to create and run the model in Ollama using the following commands:
ollama create model-name
ollama run model-name
Confirm your model was created by running ollama list in your terminal to see the list of available models.
Here, I named my model helium-model for the purpose of this guide.
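Putting it together with the names used in this guide, and assuming the Modelfile sits in your current directory, the commands look like this (the -f flag tells ollama create which Modelfile to use):

# Build the Ollama model from the Modelfile, then chat with it
ollama create helium-model -f Modelfile
ollama run helium-model
# Confirm it shows up in the list of local models
ollama list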
And that's it. You’ve now learned to run a custom Hugging Face model locally with Ollama. I’ll be sharing more hands-on LLM guides soon, so stay tuned. Happy prompting!