Training Mistral 7B on a Local Machine with CUDA (RTX 4090)
This guide will help you train the Mistral 7B language model on your local machine using a CUDA-capable GPU. It was tested on an RTX 4090, but it should also work on GPUs such as the Tesla T4 that have at least 16 GB of VRAM. The guide is designed for continual pre-training, meaning that after training the model will be suited to text completion rather than question answering.
1. System Requirements
GPU: NVIDIA RTX 4090 (or similar) with CUDA support.
OS: Linux (Ubuntu preferred).
CUDA Toolkit: CUDA 12.1 or newer.
Python 3.8+.
Git and a Python package manager like pip.
2. Install Dependencies
First, make sure your GPU drivers and CUDA are properly installed. Note that the PyTorch wheel installed below bundles its own CUDA 12.1 runtime, so the most important requirement is a recent NVIDIA driver; the system toolkit can be installed with:
sudo apt update
sudo apt install nvidia-cuda-toolkit
After installing, verify that the driver detects your GPU:
nvidia-smi
Next, set up Python dependencies. Create a new virtual environment to keep your workspace clean:
python3 -m venv mistral_env
source mistral_env/bin/activate
Then, install the required packages, including torch with CUDA support and unsloth for training:
pip install --upgrade pip
pip install torch==2.3.0+cu121 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/unslothai/unsloth.git
You can also get the latest nightly version of Unsloth if needed:
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
3. Download the Dataset
In this example, we will use the Tiny Stories dataset from Hugging Face (https://huggingface.co/datasets/roneneldan/TinyStories). To speed up training, we only use the first 5,000 rows. You can download it with the datasets library:
from datasets import load_dataset
# Load the Tiny Stories dataset
dataset = load_dataset("roneneldan/TinyStories", split="train[:5000]")
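To see what was loaded, you can print a quick summary; TinyStories stores each story in a single text column:
# Inspect the sampled split: row count, columns, and the start of one story.
print(dataset)
print(dataset[0]["text"][:200])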
4. Model Setup
We will use the Mistral 7B model with 4-bit quantization to optimize GPU memory usage. Here is how to set up the model with Unsloth:
from unsloth import FastLanguageModel
import torch
# Define model parameters
max_seq_length = 2048
load_in_4bit = True # Use 4-bit quantization to reduce memory usage
dtype = torch.float16 if torch.cuda.get_device_capability()[0] >= 7 else None
# Load the pre-trained model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-v0.3",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
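As a rough check on the 4-bit footprint, you can query how much GPU memory is reserved after loading. This is optional and again uses only standard torch.cuda calls:
# Rough view of GPU memory usage after loading the 4-bit model.
gpu = torch.cuda.get_device_properties(0)
reserved_gb = torch.cuda.max_memory_reserved() / 1024**3
total_gb = gpu.total_memory / 1024**3
print(f"{gpu.name}: {reserved_gb:.2f} GB reserved of {total_gb:.2f} GB total")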
5. Add LoRA Adapters
To further optimize training, use LoRA (Low-Rank Adaptation) adapters. LoRA lets you train large language models efficiently by updating only a small fraction (roughly 1 to 10%) of all parameters, significantly reducing computational requirements. We also add embed_tokens and lm_head to the target modules so the model can learn out-of-distribution data, which is important for continual pre-training. This way only specific parts of the model are updated, making training more efficient:
model = FastLanguageModel.get_peft_model(
    model,
    r=128,  # Choose any number > 0! Suggested values are 8, 16, 32, 64, 128
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",
                    "embed_tokens", "lm_head"],  # Add embed_tokens and lm_head for continual pretraining
    lora_alpha=32,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=True,
    loftq_config=None,  # And LoftQ
)
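To see how small the trainable slice actually is, you can count the parameters that require gradients. This is plain PyTorch rather than an Unsloth-specific API; note that including embed_tokens and lm_head makes the fraction larger than attention-only LoRA:
# Count trainable vs. total parameters of the PEFT-wrapped model.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} of {total:,} parameters ({100 * trainable / total:.2f}%)")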
6. Prepare Dataset for Training
Prepare the dataset for training. It is important to append the EOS token (tokenizer.eos_token) to each example so that the model learns when to stop and its generations do not continue indefinitely:
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    return {"text": [example + EOS_TOKEN for example in examples["text"]]}

# Apply formatting
formatted_dataset = dataset.map(formatting_prompts_func, batched=True)
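A quick way to confirm the mapping worked is to check that the stories now end with the EOS token:
# Every formatted example should now end with the EOS token.
print(formatted_dataset[0]["text"].endswith(EOS_TOKEN))  # True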
7. Training the Model
Now, configure the training arguments and start training using UnslothTrainer:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import UnslothTrainer, UnslothTrainingArguments
training_args = UnslothTrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    warmup_ratio=0.1,
    num_train_epochs=1,
    learning_rate=5e-5,
    embedding_learning_rate=5e-6,
    fp16=True,
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.00,
    lr_scheduler_type="cosine",
    seed=3407,
    output_dir="outputs",
)
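With these settings, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps = 2 × 8 = 16 sequences per optimizer step. If you hit out-of-memory errors, lowering per_device_train_batch_size while raising gradient_accumulation_steps keeps this product, and hence the optimization behaviour, roughly unchanged.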
trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset,
    dataset_text_field="text",
    args=training_args,
)
# Start training
trainer.train()
# Save the trained model
model.save_pretrained("outputs/mistral_7b_trained")
tokenizer.save_pretrained("outputs/mistral_7b_trained")
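The call above saves the LoRA adapters and the tokenizer. If you would rather have a standalone checkpoint with the adapter updates merged back into the base weights, recent Unsloth versions expose a save_pretrained_merged helper; treat the exact call below as a sketch to verify against the version you installed:
# Optional: merge the LoRA updates into the base weights and save a
# standalone 16-bit checkpoint (no adapter files needed at load time).
model.save_pretrained_merged(
    "outputs/mistral_7b_trained_merged",
    tokenizer,
    save_method="merged_16bit",
)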
8. Running Inference
After training, you can use the model to generate text. Since we saved LoRA adapters, the simplest way to reload them is again through Unsloth's FastLanguageModel, which resolves the base model and applies the adapters (reusing the max_seq_length, dtype, and load_in_4bit settings from the model setup step). Here's how to load the trained model and stream a completion for a given prompt:
from unsloth import FastLanguageModel
from transformers import TextIteratorStreamer
from threading import Thread
import textwrap

# Load the saved LoRA adapters; Unsloth resolves the base model automatically
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/mistral_7b_trained",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)  # Switch to inference mode
text_streamer = TextIteratorStreamer(tokenizer)
max_print_width = 100
inputs = tokenizer(["Once upon a time, in a galaxy, far far away,"], return_tensors="pt").to("cuda")
generation_kwargs = dict(
    inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
# Stream the generated text
length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        wrapped_text = textwrap.wrap(new_text, width=max_print_width)
        length = len(wrapped_text[-1])
        wrapped_text = "\n".join(wrapped_text)
        print(wrapped_text, end="")
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end="")
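If you do not need token-by-token streaming, a plain generate call on the same inputs is simpler:
# Non-streaming alternative: generate the full completion, then decode it.
output_ids = model.generate(**inputs, max_new_tokens=256, use_cache=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))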
Notes
Ensure your CUDA and GPU drivers are up-to-date to fully utilize the RTX 4090's capabilities.
Training may take several hours depending on the dataset size and batch configuration. Adjust gradient_accumulation_steps and per_device_train_batch_size as needed to fit your GPU memory.
Author Bio
Rafal Jackiewicz is an author of books about programming in C and Java. You can find more information about him and his work on Amazon and at https://www.jackiewicz.org.