How to Run LLMs Locally
Introduction
As open-source LLMs evolve, the power of generative AI is within everyone's reach. Contrary to the common belief that you need powerful machines or high-end graphics cards to use these models, several techniques now let you run them on an ordinary PC or Mac. Large Language Models (LLMs) can be compressed to run locally on your laptop's CPU or GPU, without the need for an internet connection.
This technique is called quantization: the weights of these large models are compressed into lower-precision (fewer-bit) representations, which in turn reduces their RAM and compute requirements. The compression can cause a slight drop in the model's quality, but that is usually an acceptable tradeoff for being able to run the model at all.
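To get a feel for the savings, here is a rough back-of-the-envelope estimate in Python (a sketch only; the exact figures vary by quantization scheme, and real GGUF files also carry metadata and runtime overhead):

# Back-of-the-envelope memory estimate for a 7B-parameter model.
params = 7_000_000_000  # parameter count (Mistral 7B is roughly this size)

def approx_size_gib(bits_per_weight: float) -> float:
    """Weights-only size in GiB; ignores the KV cache and other runtime overhead."""
    return params * bits_per_weight / 8 / 1024**3

for label, bits in [("FP16 ", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{approx_size_gib(bits):.1f} GiB")
# FP16 : ~13.0 GiB, 8-bit: ~6.5 GiB, 4-bit: ~3.3 GiB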
Depending on your level of technical skill, you can explore the following tools and frameworks to run LLMs:
LM Studio: A GUI application that lets you download and run LLMs locally on your CPU/GPU.
Llama.cpp: A library written in C/C++ that you can use to run and quantize models locally on your CPU/GPU. Even LM Studio uses Llama.cpp under the hood to run models.
In this blog, I will explain how to use Llama.cpp to run models locally on a MacBook. If you are a beginner, I recommend exploring https://lmstudio.ai/, as it makes it very easy to download and run models in a few clicks.
How to download a Quantized LLM
To get started, you just need to download a single GGUF file from TheBloke on Hugging Face. Search for the model of your choice and download the GGUF file that fits your system's RAM. I recommend downloading at least a 4-bit quantized version of the model. For this blog, let's download the Mistral 7B Instruct file (mistral-7b-instruct-v0.2.Q4_K_M.gguf) from the Hugging Face URL below.
https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/tree/main
You can also download these files using the Hugging Face CLI (huggingface-cli); see the Hugging Face CLI documentation for details.
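If you prefer to script the download, the huggingface_hub Python package (the same library behind the CLI) can fetch a single file from the repository. A minimal sketch, assuming huggingface_hub is installed via pip:

from huggingface_hub import hf_hub_download

# Downloads the 4-bit quantized Mistral 7B Instruct GGUF into the local
# Hugging Face cache and returns its path on disk.
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
)
print(model_path)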
Download Llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Metal Build
On macOS, Metal is enabled by default. Using Metal makes the computation run on the GPU. To disable the Metal build at compile time, use the LLAMA_NO_METAL=1 flag or the LLAMA_METAL=OFF cmake option. When built with Metal support, you can still explicitly disable GPU inference with the --n-gpu-layers|-ngl 0 command-line argument.
Run LLM
Run your prompts directly using the command below:
./main -ngl 35 -m <path_to_model> --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<s>[INST] Tell me a joke [/INST]"
You can also start a server on localhost and chat with the LLM:
./server -m <path_to_model> --port 8888 --host 0.0.0.0 --ctx-size 10240 --parallel 4 -ngl 35 -n 512
Replace <path_to_model> with the actual path to the GGUF model you downloaded earlier.
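Once the server is running, you can talk to it over HTTP from any language. Here is a minimal Python sketch that posts a prompt to the server's /completion endpoint on port 8888 (as started above); exact field names can vary between llama.cpp versions, so treat this as a starting point:

import requests

# Assumes the llama.cpp server started above is listening on localhost:8888.
payload = {
    "prompt": "<s>[INST] Tell me a joke [/INST]",
    "n_predict": 128,    # maximum number of tokens to generate
    "temperature": 0.7,
}

response = requests.post("http://localhost:8888/completion", json=payload)
response.raise_for_status()

# The generated text is returned in the "content" field of the JSON response.
print(response.json()["content"])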
Conclusion
There are many ways to run LLMs locally on our trusty CPUs and GPUs, and we can build applications on top of them using tools such as LangChain and Llama Index. Llama.cpp also has Python bindings, which make it seamless to instantiate these LLMs in a Python script and use them as needed, as sketched below. This democratizes the use of AI models and reduces the dependency on paid services like ChatGPT and Gemini.
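For example, with the llama-cpp-python bindings (installed via pip install llama-cpp-python), loading and prompting the GGUF file downloaded earlier takes only a few lines. A minimal sketch, assuming the model file sits in the current directory:

from llama_cpp import Llama

# Load the quantized GGUF model. n_gpu_layers=-1 offloads all layers to the GPU
# (Metal on a MacBook); n_ctx sets the context window size.
llm = Llama(
    model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,
)

output = llm("<s>[INST] Tell me a joke [/INST]", max_tokens=128, temperature=0.7)
print(output["choices"][0]["text"])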
Check the documentation below on how to use LLMs locally with LangChain and Llama Index:
Written by
Shivang Agarwal
I am a seasoned developer with a strong background spanning 5 years in crafting Advanced Analytics Data Applications. My expertise has been honed through employment with top management consulting firms, where I played a pivotal role in transforming clients' analytics landscapes through the implementation of AI and ML solutions. Currently, I am delving into the realm of Generative AI & LLMs. This endeavor serves as a personal challenge, allowing me to broaden my skill set and delve into a distinct technology.