How to Run LLMs Locally

Shivang Agarwal
3 min read

Introduction

As open-source LLMs evolve, the power of generative AI is within everyone's reach. Contrary to the common belief that you need powerful machines or high-end graphics cards to use these models, several techniques now let you run them on an ordinary PC or Mac. Large Language Models (LLMs) can be compressed to run locally on your laptop's CPU or GPU, without the need for an internet connection.

This technique is called quantization: the weights of these large models are stored at a lower bit precision, which in turn reduces their RAM and compute requirements. The compression can sometimes lead to a slight drop in the model's quality, but that trade-off is usually acceptable given that it makes running the model possible in the first place.
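As a rough back-of-the-envelope illustration (approximate figures, not measurements from this post): a 7-billion-parameter model stored in 16-bit floats needs about 7B × 2 bytes ≈ 14 GB of memory, while a 4-bit quantized version needs roughly 7B × 0.5 bytes ≈ 3.5 GB (in practice closer to 4–4.5 GB with overhead), which is why it can fit comfortably in the RAM of an ordinary laptop.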

Depending on your level of technical skill, you can explore the following tools and frameworks to run LLMs:

  1. LM Studio: A GUI application that allows you to download and run LLMs locally on your CPU/GPU.

  2. Llama.cpp: A library written in C/C++ that you can use to run/quantize models locally on your CPU/GPU. Even LM Studio uses Llama.cpp under the hood to run models.

In this blog, I will explain how to use Llama.cpp to run models locally on a MacBook. If you are a beginner, I recommend exploring https://lmstudio.ai/, as it makes it very easy to download and run models in a few clicks.

How to Download a Quantized LLM

You just need to download a single GGUF file (the successor to the older GGML format) from TheBloke on Hugging Face to get started. Search for the model of your choice and pick a quantization level that fits your system's RAM. I would recommend downloading at least a 4-bit quantized version of the model. For this blog, let's download the Mistral 7B Instruct v0.2 file (mistral-7b-instruct-v0.2.Q4_K_M.gguf) from the Hugging Face repository below.

https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/tree/main

You can also download these files using the Hugging Face CLI; check the Hugging Face CLI documentation for details.
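As a quick sketch (assuming the huggingface_hub package is installed and a recent CLI version that supports the download subcommand), fetching the file used in this blog would look something like this:

# install the CLI (shipped with the huggingface_hub package)
pip install huggingface_hub

# download only the Q4_K_M file into the current directory
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir .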

Download Llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

Metal Build

On macOS, Metal is enabled by default, which makes the computation run on the GPU. To disable the Metal build at compile time, use the LLAMA_NO_METAL=1 flag (for make) or the LLAMA_METAL=OFF CMake option.
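For example, the two build variants could look like this (a sketch based on the llama.cpp build instructions; flag names may differ across versions):

# make build without Metal
LLAMA_NO_METAL=1 make

# or the equivalent CMake build
cmake -B build -DLLAMA_METAL=OFF
cmake --build build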

When built with Metal support, you can still explicitly disable GPU inference at runtime with the --n-gpu-layers 0 (or -ngl 0) command-line argument.
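For example, the following should keep inference entirely on the CPU even in a Metal-enabled build:

./main -m <path_to_model> -ngl 0 -p "<s>[INST] Tell me a joke [/INST]"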

Run LLM

Run your prompts directly using the command below:

./main -ngl 35 -m <path_to_model> --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<s>[INST] Tell me a joke [/INST]"
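Here -ngl 35 offloads 35 model layers to the GPU (Metal), -m points to the model file, -c 32768 sets the context window, --temp and --repeat_penalty control sampling, -n -1 places no fixed limit on the number of generated tokens, --color simply colorizes the output, and -p supplies the prompt wrapped in Mistral's [INST] ... [/INST] instruction format.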

You can also start a server on localhost and chat with the LLM:

./server -m <path_to_model> --port 8888 --host 0.0.0.0 --ctx-size 10240 --parallel 4 -ngl 35 -n 512

Replace <path_to_model> with the actual path to the GGUF model you downloaded earlier.
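Once the server is up, you can send a test request to its /completion endpoint, for example with curl (a minimal sketch; the exact request fields may vary across llama.cpp versions):

curl http://localhost:8888/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "<s>[INST] Tell me a joke [/INST]", "n_predict": 128}'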

Conclusion

We can run LLMs locally on our own CPUs and GPUs in many ways, and we can build applications on top of them using tools such as LangChain and LlamaIndex. Llama.cpp also has Python bindings, which make it seamless to instantiate these LLMs in a Python script and use them as needed. This democratizes the use of AI models and reduces the dependency on paid services like ChatGPT and Gemini.

Check the documentation below on how to use local LLMs with LlamaIndex and LangChain:

  1. https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html

  2. https://python.langchain.com/docs/integrations/llms/llamacpp
