Training Mistral 7B on a Local Machine with CUDA (RTX 4090)

This guide will help you train the Mistral 7B language model on your local machine using a CUDA-capable GPU. It was tested on an RTX 4090, but it should also work with GPUs like the Tesla T4 that have at least 16 GB of VRAM. The guide is designed for continual pre-training, meaning that after training the model will be capable of text completion rather than question-answering tasks.

1. System Requirements

  • GPU: NVIDIA RTX 4090 (or similar) with CUDA support.

  • OS: Linux (Ubuntu preferred).

  • CUDA Toolkit: CUDA 12.1 or newer.

  • Python 3.8+.

  • Git and a Python package manager like pip.

2. Install Dependencies

First, make sure your GPU drivers and CUDA are properly installed. On Ubuntu you can install the toolkit from the default repositories (note that the packaged version may be older than CUDA 12.1; NVIDIA's official repository carries the latest release):

sudo apt update
sudo apt install nvidia-cuda-toolkit

After installing, verify that the driver sees the GPU and that the toolkit is on your path:

nvidia-smi
nvcc --version

Next, set up Python dependencies. Create a new virtual environment to keep your workspace clean:

python3 -m venv mistral_env
source mistral_env/bin/activate

Then, install the required packages, including torch with CUDA support and unsloth for training:

pip install --upgrade pip
pip install torch==2.3.0+cu121 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/unslothai/unsloth.git

You can also get the latest nightly version of Unsloth if needed:

pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
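
Before moving on, it is worth confirming that the installed torch build can actually see the GPU. A minimal check, run inside the activated mistral_env, might look like this:

import torch

# Verify the CUDA build and that the GPU is visible
print(torch.__version__)              # should end in +cu121
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 4090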

3. Download the Dataset

In this example, we will use the TinyStories dataset from Hugging Face (https://huggingface.co/datasets/roneneldan/TinyStories). To speed up training, we only sample the first 5,000 rows. You can download it with the datasets library:

from datasets import load_dataset

# Load the Tiny Stories dataset
dataset = load_dataset("roneneldan/TinyStories", split="train[:5000]")
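
If you want to confirm that the sample loaded correctly, a quick inspection along these lines can help (the TinyStories training split exposes a single text column):

# Optional: inspect the sampled split
print(dataset)                    # should report 5000 rows with a "text" feature
print(dataset[0]["text"][:200])   # first 200 characters of the first story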

4. Model Setup

We will use the Mistral 7B model with 4-bit quantization to optimize GPU memory usage. Here is how to set up the model with Unsloth:

from unsloth import FastLanguageModel
import torch

# Define model parameters
max_seq_length = 2048
load_in_4bit = True  # Use 4-bit quantization to reduce memory usage

# Use float16 on GPUs with compute capability >= 7.0 (e.g. RTX 4090, Tesla T4);
# None lets Unsloth auto-detect a suitable dtype on older hardware
dtype = torch.float16 if torch.cuda.get_device_capability()[0] >= 7 else None

# Load the pre-trained model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-v0.3",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit
)
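
With 4-bit quantization, the loaded model should occupy only a few gigabytes of VRAM. As a rough check (exact numbers will vary with your driver and library versions):

# Optional: report GPU memory usage after loading the 4-bit model
gpu_stats = torch.cuda.get_device_properties(0)
reserved_gb = torch.cuda.max_memory_reserved() / 1024**3
total_gb = gpu_stats.total_memory / 1024**3
print(f"{gpu_stats.name}: {reserved_gb:.2f} GB reserved of {total_gb:.2f} GB total")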

5. Add LoRA Adapters

To further optimize training, use LoRA (Low-Rank Adaptation) adapters. LoRA lets you train large language models efficiently by updating only a small fraction (roughly 1 to 10%) of all parameters, which significantly reduces the computational requirements. We also add embed_tokens and lm_head to the target modules so the model can learn out-of-distribution data, which is important for continual pre-training:

model = FastLanguageModel.get_peft_model(
    model,
    r=128,  # LoRA rank; choose any number > 0 (suggested values are 8, 16, 32, 64, 128)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",
                    "embed_tokens", "lm_head"],  # embed_tokens and lm_head added for continual pretraining
    lora_alpha=32,
    lora_dropout=0,   # supports any value, but 0 is optimized
    bias="none",      # supports any value, but "none" is optimized
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=True,  # rank-stabilized LoRA
    loftq_config=None,  # LoftQ is not used here
)
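
To confirm that only a small slice of the model will actually be updated, PEFT models expose a helper that reports the trainable parameter count. Note that including embed_tokens and lm_head makes the trainable fraction noticeably larger than the attention and MLP projections alone would:

# Optional: report how many parameters the LoRA setup will train
model.print_trainable_parameters()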

6. Prepare Dataset for Training

Prepare the dataset for training. It is important to append EOS_TOKEN (tokenizer.eos_token) to every example so that the model learns where a text ends and its generation does not continue indefinitely:

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    return {"text": [example + EOS_TOKEN for example in examples["text"]]}

# Apply formatting
formatted_dataset = dataset.map(formatting_prompts_func, batched=True)
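
To verify the formatting step worked, you can check that each example now ends with the EOS token (for Mistral's tokenizer this is "</s>"):

# Optional: confirm the EOS token was appended
print(formatted_dataset[0]["text"][-80:])
assert formatted_dataset[0]["text"].endswith(EOS_TOKEN)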

7. Training the Model

Now, configure the training arguments and start training using UnslothTrainer:

from unsloth import UnslothTrainer, UnslothTrainingArguments

training_args = UnslothTrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size of 2 * 8 = 16
    warmup_ratio=0.1,
    num_train_epochs=1,
    learning_rate=5e-5,
    embedding_learning_rate=5e-6,    # lower learning rate for embed_tokens and lm_head
    fp16=True,
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.00,
    lr_scheduler_type="cosine",
    seed=3407,
    output_dir="outputs"
)
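
With these settings the effective batch size is 2 × 8 = 16, so one epoch over the 5,000 sampled stories comes to roughly 5000 / 16 ≈ 312 optimizer steps, of which about 31 are warmup (warmup_ratio=0.1). The lower embedding_learning_rate is there so that embed_tokens and lm_head are updated more gently than the LoRA adapters.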

trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=training_args
)

# Start training
trainer.train()

# Save the trained model
model.save_pretrained("outputs/mistral_7b_trained")
tokenizer.save_pretrained("outputs/mistral_7b_trained")
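
Note that save_pretrained on a PEFT model stores only the LoRA adapter weights, not the full 7B model. If you would rather export a standalone model with the adapters merged into the base weights, Unsloth provides a helper for this; treat the exact arguments below as something to verify against your installed Unsloth version:

# Optional: merge the LoRA adapters into the base model and save full fp16 weights
# (requires roughly 14 GB of free disk space)
model.save_pretrained_merged(
    "outputs/mistral_7b_trained_merged",
    tokenizer,
    save_method="merged_16bit",
)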

8. Running Inference

After training, you can use the model to generate text. Because save_pretrained stored LoRA adapters on top of the 4-bit base model, the simplest way to load it back is through Unsloth, which also enables its faster inference path. Here's how to load the trained model and stream text for a given prompt:

from unsloth import FastLanguageModel
from transformers import TextIteratorStreamer
from threading import Thread
import textwrap

# Load the trained model (base model + LoRA adapters) in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/mistral_7b_trained",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference mode

text_streamer = TextIteratorStreamer(tokenizer)
max_print_width = 100

inputs = tokenizer(["Once upon a time, in a galaxy, far far away,"], return_tensors="pt").to("cuda")

generation_kwargs = dict(
    inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True
)

thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

# Stream the generated text
length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        wrapped_text = textwrap.wrap(new_text, width=max_print_width)
        length = len(wrapped_text[-1])
        wrapped_text = "\n".join(wrapped_text)
        print(wrapped_text, end="")
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end="")
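
Once the loop finishes, the streamer has been exhausted, which means generation is complete. If you want to be explicit, you can join the worker thread and print a final newline:

thread.join()  # make sure the generation thread has fully finished
print()        # final newline after the streamed output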

Notes

  • Ensure your CUDA and GPU drivers are up-to-date to fully utilize the RTX 4090's capabilities.

  • Training may take several hours depending on the dataset size and batch configuration. Adjust gradient_accumulation_steps and per_device_train_batch_size as needed to balance speed against GPU memory usage; the snippet below shows one way to check peak memory after a run.
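
As a rough guide for those two settings, you can check how close a run came to the 24 GB limit after training (a simple sketch using PyTorch's memory statistics):

import torch

# Peak GPU memory reserved during the run, in GB
peak_gb = torch.cuda.max_memory_reserved() / 1024**3
print(f"Peak reserved GPU memory: {peak_gb:.2f} GB")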


Author Bio

Rafal Jackiewicz is an author of books about programming in C and Java. You can find more information about him and his work on Amazon.
