I fine-tuned a BERT model to give trading insights

Overview

Hardware I used:

  • i5 (11th gen)

  • 24 GB RAM (you can get away with less)

  • RTX 3050 (Helps but is definitely not necessary)

The use case I trained my model on is providing insights for stock trading: predicting the best action to take based on specific market parameters. Let me walk you through the key components that make this process work.

The model takes in several key inputs:

  • Position: Indicates the current position in the market (Buy/Short/Out) for a specific stock.

  • Current Price: The real-time price of the stock.

  • MAS (Moving Average Short Term): A short-term moving average to capture recent trends.

  • MAL (Moving Average Long Term): A long-term moving average that provides a broader view of the market trend.

These moving averages help the model understand the market's trend and volatility. Based on these parameters, the model classifies the best action to take—whether to buy, sell, or hold—depending on my current position and the observed market conditions.

Disclaimer: This tutorial is intended for educational purposes only and does not constitute financial advice. The model's predictions are just one component of a comprehensive trading strategy, and relying solely on this information without a broader strategy could lead to poor outcomes. Always ensure you have a well-rounded trading plan before risking your time and money. This is just a simple example to illustrate how such a model can be built.

Step One: Creating a Dataset

Contrary to popular belief, creating the dataset is often the most challenging step in the entire process. Gathering relevant data and formatting it correctly can be incredibly time-consuming and tedious, especially if you don't have the data readily available. Fortunately, AI tools can assist significantly, streamlining the data collection and preparation process when you're in a pinch.

The first step in fine-tuning BERT for text classification is to create a suitable dataset. This dataset should be relevant to the problem you are trying to solve and contain labeled examples for training. Make sure your dataset is well-balanced and contains enough samples to help BERT learn effectively. You can either collect data from open sources or create your own.

For my use case, I used ChatGPT to generate a dataset of 1,500 entries, with verified, "real-world market" scenarios in mind. Here is the format I used to request the data:

{
    "text": "Position: out, Price: 194.74, MAS: 181.77, MAL: 168.6",
    "label": "buy",
    "risk": 0.14
}

The resulting file should contain the requested number of entries formatted similarly to the example above. Note that sometimes ChatGPT may format the output as a .jsonl (JSON Lines) file. This is expected and should not be altered. The .jsonl format stores each entry on a new line, eliminating the need for commas between entries, which simplifies processing large datasets.
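A quick note on loading: json.load expects a single JSON array, so if you do end up with a .jsonl file, parse it line by line instead. A minimal sketch (assuming the file is named classification.jsonl):

import json

# .jsonl stores one JSON object per line, so parse each line separately
with open('classification.jsonl', 'r') as f:
    data = [json.loads(line) for line in f if line.strip()]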

Now that we have our dataset, let's move on to the next step.

Step Two: Writing the Code

Below, I'll break down the code into different sections to help you understand each part of the process.

1. Importing Libraries

from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch
import json
from sklearn.model_selection import train_test_split
from datasets import Dataset

The first step involves importing the necessary libraries:

  • transformers: Imports BERT tokenizer and model for sequence classification, as well as classes for training.

  • torch: Helps in utilizing GPU for faster training.

  • json: Allows reading the dataset from a JSON file.

  • train_test_split: From sklearn, used to split the data into training and validation sets.

  • Dataset: Converts our data to the format required for training.

2. Loading Dataset

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load your dataset
with open('classification.json', 'r') as f:
    data = json.load(f)

This section checks if a GPU is available and loads the dataset from a JSON file named classification.json.

3. Preparing the Dataset

# Prepare the dataset: Extract 'text' and 'label' fields for training
label_map = {"buy": 0, "sell": 1, "hold": 2}

def prepare_data(data):
    texts = []
    labels = []

    for entry in data:
        texts.append(entry['text'])
        labels.append(label_map[entry['label']])  # Map string labels to 0, 1, or 2

    return texts, labels

texts, labels = prepare_data(data)

This function extracts the text and label fields from the dataset and maps each string label ("buy", "sell", "hold") to the integer class (0, 1, or 2) that the model's classification head expects.
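Since a well-balanced dataset matters (as noted in Step One), it's worth a quick sanity check on the label distribution before training. A minimal check, with illustrative counts:

from collections import Counter

# Count how many examples fall into each class (0=buy, 1=sell, 2=hold)
print(Counter(labels))  # e.g. Counter({0: 520, 2: 495, 1: 485})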

4. Splitting the Dataset

# Split dataset into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2)

Here, the dataset is split into training (80%) and validation (20%) sets for effective model evaluation.
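If the class counts are noticeably uneven, train_test_split can preserve the class proportions in both splits, and fixing the random seed makes the split reproducible. A variant of the line above (stratify and random_state are my additions, not part of the original code):

train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)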

5. Tokenizing the Dataset

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Tokenize the dataset
def tokenize_function(texts):
    return tokenizer(texts, padding="max_length", truncation=True, return_tensors='pt')

train_encodings = tokenize_function(train_texts)
val_encodings = tokenize_function(val_texts)

We load the BERT tokenizer and use it to tokenize both the training and validation texts. Tokenization converts the text into tokens that the model can understand.
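If you're curious what the model actually sees, you can decode the first training example back into tokens. A quick optional sanity check:

# Look at the first ten token IDs of the first training example
ids = train_encodings['input_ids'][0][:10].tolist()
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
# e.g. ['[CLS]', 'Position', ':', 'out', ',', 'Price', ':', ...]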

6. Converting Encodings to Dataset Format

# Convert encodings to Dataset format
train_dataset = Dataset.from_dict({
    'input_ids': train_encodings['input_ids'], 
    'attention_mask': train_encodings['attention_mask'], 
    'labels': train_labels
})
val_dataset = Dataset.from_dict({
    'input_ids': val_encodings['input_ids'], 
    'attention_mask': val_encodings['attention_mask'], 
    'labels': val_labels
})

The tokenized encodings are converted to a Dataset format required for training, which includes input_ids, attention_mask, and labels.
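Before moving on, an optional check on one row confirms the conversion worked:

# Each row holds the token IDs, the attention mask, and the integer label
print(train_dataset[0].keys())   # dict_keys(['input_ids', 'attention_mask', 'labels'])
print(train_dataset[0]['labels'])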

7. Initializing the Model

# Initialize the model and move it to the appropriate device
model = BertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=3).to(device)

We initialize the BERT model for sequence classification. The model is then moved to the GPU if available.
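As a quick optional check, you can confirm that the new 3-way classification head is in place (bert-base-cased has roughly 108M parameters; the head itself is freshly initialized and learned during fine-tuning):

print(model.num_parameters())   # roughly 108M for bert-base-cased
print(model.config.num_labels)  # 3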

8. Setting Training Arguments

# Set training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,  # Adjust if needed
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=1,
    weight_decay=0.01,
    logging_dir='./logs',
    evaluation_strategy="epoch",
    fp16=torch.cuda.is_available()  # Mixed precision only works on a GPU, so toggle it automatically
)

Here, we configure the parameters for training, such as the number of epochs, batch size, and weight decay. Enabling mixed precision (fp16) can speed up training, but it is only supported on a GPU, so it is toggled based on availability.

9. Setting Up the Trainer

# Setup Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

The Trainer class is used to simplify the training loop by combining the model, training arguments, datasets, and other components.
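By default, the Trainer only reports the loss during evaluation. If you also want accuracy on the validation set, you can pass in a compute_metrics function; this is an optional addition, not part of the original setup:

import numpy as np

def compute_metrics(eval_pred):
    # The Trainer hands us a (logits, labels) tuple at evaluation time
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)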

10. Training the Model

# Train the model
trainer.train()

This command initiates the model training process based on the provided training arguments and datasets.
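After training completes, a final pass over the validation set gives you the evaluation loss (plus accuracy, if you added compute_metrics above):

# Run a final evaluation on the validation set
metrics = trainer.evaluate()
print(metrics)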

11. Saving the Fine-Tuned Model

# Save the model and tokenizer
model.save_pretrained('./saved_model')
tokenizer.save_pretrained('./saved_model')

After training, the model and tokenizer are saved to the specified directory for future use.
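If you want to verify the save, listing the directory should show the model weights and tokenizer files (exact file names vary by transformers version):

import os
print(os.listdir('./saved_model'))
# e.g. ['config.json', 'model.safetensors', 'vocab.txt', 'tokenizer_config.json', ...]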

12. Loading and Predicting with the Trained Model

# Example function to load and predict using the trained model
def predict(input_text):
    model = BertForSequenceClassification.from_pretrained('./saved_model').to(device)
    tokenizer = BertTokenizer.from_pretrained('./saved_model')
    model.eval()  # Inference mode: disables dropout for stable predictions

    # Tokenize the input text
    inputs = tokenizer(input_text, return_tensors="pt", padding="max_length", truncation=True).to(device)
    with torch.no_grad():  # No gradients needed at inference time
        outputs = model(**inputs)

    # Get the predicted class (Buy, Sell, Hold)
    predicted_class = torch.argmax(outputs.logits, dim=-1).item()

    # Convert class label back to a string
    label_map = {0: "Buy", 1: "Sell", 2: "Hold"}
    return label_map[predicted_class]

# Example usage of the trained model
test_text = "Position: out, Price: 172.63, MAS: 173.42, MAL: 178.48"
predicted_response = predict(test_text)

print(predicted_response)  # Should print "Buy", "Sell", or "Hold"

This section provides a function to load the saved model and tokenizer for making predictions. The function takes an input string, tokenizes it, and returns the predicted class (Buy, Sell, or Hold).
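One optional extension: a softmax over the logits also gives you a confidence score for the prediction (note that the dataset's risk field isn't used during training here, so this is the model's own probability, not a risk estimate). A sketch, reusing the trained model and tokenizer from earlier in the script:

import torch.nn.functional as F

def predict_with_confidence(input_text):
    model.eval()  # Make sure the model is in inference mode
    inputs = tokenizer(input_text, return_tensors="pt", padding="max_length", truncation=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax turns raw logits into probabilities over the three classes
    probs = F.softmax(logits, dim=-1).squeeze()
    predicted_class = int(torch.argmax(probs))
    label_map = {0: "Buy", 1: "Sell", 2: "Hold"}
    return label_map[predicted_class], probs[predicted_class].item()

print(predict_with_confidence(test_text))  # e.g. ('Buy', 0.87)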
