How To Locally Run a Hugging Face Model Using Ollama


Hey guys, this will be a quick guide on locally running a simple model downloaded from Hugging Face using Ollama.
Ollama is a software platform that simplifies running large language models (LLMs) locally on your computer. Its CLI allows you to access and interact with open-source LLMs without the complexities of managing model weights and configurations, offering a private alternative to cloud-based LLM chatbots like OpenAI’s ChatGPT or Anthropic’s Claude.
Setting Up Ollama
The first step is to visit Ollama’s website and download the app for your supported operating system.
Once that's done, run the software from your application manager.
Nothing seems to happen, but Ollama will start a live server running in the background. You should see an Ollama icon on your desktop menu bar.
This confirms the server is up and running. To confirm the Ollama CLI was installed successfully, open your terminal and run ollama. This should output a list of available commands.
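You can also check that the background server itself is reachable. Assuming Ollama is listening on its default local port (11434), a plain HTTP request should get a short status reply:

# Quick sanity check against the local Ollama server
curl http://localhost:11434
# Responds with a short "Ollama is running" message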
Ollama gives you access to a long list of popular models to pull and run out of the box. The model library contains all supported models.
To run a supported model, use this command:
ollama run MODEL_NAME
This downloads the specified model if you don’t have it locally yet, then runs it so you can start prompting. Below, I’m running Meta’s Llama 3.2 model.
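Assuming you want the default Llama 3.2 tag from Ollama’s model library, that looks like this:

# Pulls the model on first run, then opens an interactive prompt
ollama run llama3.2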
Easy as that. Now let's get to the good stuff. Running Hugging Face models with Ollama.
Running Hugging Face Transformer Models
Hugging Face is a community-driven platform and library that provides open-source tools, AI models, and datasets to help developers build, share, and train machine learning models more easily. It's really the GitHub for the AI LLM community.
In this section, you will download a simple text-generation base model from Hugging Face that you can pull into Ollama and run. Here’s the problem: most models on Hugging Face are distributed in the .safetensors format, usually loaded through the Hugging Face Transformers library. Ollama does not support this file type, so the model won’t run inside the application as-is.
So we need to convert the model into the .gguf format used by llama.cpp, the inference engine behind Ollama. GGUF packs the model weights and metadata into a single file and is typically used to store quantized, more efficient versions of transformer models.
Download the zipped package of Kyutai’s 2B-parameter text generation model, making sure to keep all the files contained within the archive, since the config and tokenizer files are needed alongside the weights.
Since this is all open-source stuff, you can find converted versions like this from great community members. Don't worry. I'll still show you how to do it yourself if you can't find a converted model you'd like to use.
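If you’d rather pull the original model files straight from the command line, the huggingface_hub package ships a CLI that can download a whole model repository into a local folder. This is just a sketch; the repository ID shown (kyutai/helium-1-preview-2b) is my assumption for Kyutai’s 2B model, so confirm the exact ID on Hugging Face before running it:

pip install huggingface_hub
# Download every file in the model repo (weights, config, tokenizer) into ./tensormodel
huggingface-cli download kyutai/helium-1-preview-2b --local-dir ./tensormodel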
Converting Safetensor Models
For this stage, you’ll need to install a few Python dependencies and a pre-built version of the llama.cpp tool, which is needed to convert your model file.
Run the following commands to install the dependencies:
pip install transformers safetensors gguf
On Mac and Linux, the Homebrew package manager can be used to install llama.cpp:
brew install llama.cpp
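If Homebrew isn’t an option, llama.cpp can also be cloned and built from source. The steps below follow the project’s standard CMake build, so double-check the llama.cpp README in case they’ve changed:

# Build llama.cpp (including the llama-quantize tool) from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

The convert_hf_to_gguf.py script used later in this guide lives in the root of that repository.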
Here’s what each dependency is for:
safetensors - Required if the model you’re converting is in the .safetensors format.
transformers - Provides tools to load and interpret Hugging Face models; it lets you access the model config, tokenizer, and architecture needed for conversion.
llama.cpp - Provides the conversion script and the quantization tool needed to generate a GGUF model you can run locally.
gguf - The Python module that transforms the model from Hugging Face’s safetensors format into the GGUF format.
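A quick way to confirm the Python packages installed cleanly is a one-line import check (just a sanity sketch; the exact version printed will vary):

# Should print a version string instead of raising ImportError
python -c "import transformers, safetensors, gguf; print(transformers.__version__)"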
Once everything is installed, run the conversion command on the unzipped package folder containing the .safetensors file. In my case, I named the folder tensormodel.
python convert_hf_to_gguf.py ./tensormodel --outfile ./tensormodel/helium-1-2b.gguf
If successful, you should have a converted model named ‘helium-1-2b.gguf’ in the same directory.
If your shell can’t find the convert_hf_to_gguf.py script from the llama.cpp conversion tool, you can locate the directory where llama.cpp was installed by running brew info llama.cpp and call the script using its full path. For example:
python <path/to/llama.cpp>/convert_hf_to_gguf.py ./tensormodel --outfile ./tensormodel/helium-1-2b.gguf
Let's take it one step further and quantize the converted file. Quantization reduces the number of bits used to represent each weight in the model, making it smaller and faster to run.
Run the llama-quantize command on the converted model to quantize it:
<path/to/llama.cpp>/llama-quantize ./tensormodel/helium-1-2b.gguf ./tensormodel/helium-1-2b-q8_0.gguf q8_0
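The q8_0 target stores weights at roughly 8 bits each, so the quantized file should come out noticeably smaller than the freshly converted one. A quick way to compare the two (assuming both files are still in the tensormodel folder):

# Compare the converted and quantized GGUF file sizes
ls -lh ./tensormodel/helium-1-2b.gguf ./tensormodel/helium-1-2b-q8_0.gguf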
Creating a Custom Modelfile
You’ve got the Ollama server running, and you’ve got your quantized model ready. Now you just need to describe the model in a Modelfile and register it with Ollama so you can run inference against it.
Create a Modelfile inside your project root and populate it with the following (notice that no file extension is added):
FROM ./helium-1-2b-q8_0.gguf
# Set the system prompt
SYSTEM """
You are a helpful AI assistant that provides accurate and concise responses.
"""
# Set model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
# Set license
LICENSE apache-2.0
This is your custom model file containing the configuration for how Ollama will build your model. For this guide, the only config you need to care about is the FROM instruction, which points to your quantized GGUF file.
Once that's set, you’re ready to create and run the model in Ollama using the following commands:
ollama create model-name
ollama run model-name
Confirm your model was created by running ollama list in your terminal to see the list of available models.
Here, I named my model helium-model for the purpose of this guide.
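Putting it together with the names used in this guide, and assuming the Modelfile sits in your current directory, the commands look like this (the -f flag tells ollama create which Modelfile to use):

# Build the Ollama model from the Modelfile, then chat with it
ollama create helium-model -f Modelfile
ollama run helium-model
# Confirm it shows up in the list of local models
ollama list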
And that's it. You’ve now learned to run a custom Hugging Face model locally with Ollama. I’ll be sharing more hands-on LLM guides soon, so stay tuned. Happy prompting!