How to Evaluate Machine Learning Models With Cross-Validation

What is Cross-validation?

Cross-validation is a method for evaluating how well a machine learning model performs on unseen data. To do this, the data is repeatedly divided into training and testing sets.

The model is then trained on the training set, and its performance is assessed on the testing set.

To estimate the model's performance, we carry out this process several times and average the results.
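
To make this concrete, here is a minimal sketch using scikit-learn's cross_val_score helper, which runs the split-train-evaluate loop for you and returns one score per split. The iris dataset and logistic regression below are placeholder choices for illustration only.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data and model; any dataset and estimator would do
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Run the split-train-evaluate loop on 5 different splits
scores = cross_val_score(model, X, y, cv=5)

# Average the per-split results to estimate performance
print("Scores per split:", scores)
print("Mean score:", scores.mean())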

Why is cross-validation important?

Cross-validation is important because it provides a more thorough and trustworthy evaluation of machine learning models than a single train/test split.

It evaluates model performance on different samples of the data by splitting the data into several subsets and systematically rotating which subsets serve as the training and validation sets.

With this method, the risk of overfitting is reduced, and it's easier to predict how well a model will generalize to new data.

It also allows practitioners to confidently assess a model's capabilities, compare candidate models, and tune hyperparameters (as sketched below), thereby improving the accuracy and dependability of machine learning solutions.
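
As a sketch of how hyperparameter tuning can be driven by cross-validation, the example below uses scikit-learn's GridSearchCV; the iris data, the logistic regression model, and the values in param_grid are assumptions chosen purely for demonstration.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Illustrative data; in practice this would be your own dataset
X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values (chosen only for demonstration)
param_grid = {"C": [0.1, 1.0, 10.0]}

# Every candidate is scored with 5-fold cross-validation
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated score:", search.best_score_)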

Types of cross-validation

There are many different types of cross-validation, but the most common ones are:

  1. K-fold cross-validation

  2. Stratified K-fold cross-validation

  3. Hold-out-based validation

1. K-fold cross-validation

K-fold cross-validation works by dividing the data into k equal portions. The model is then trained on k-1 of those portions and evaluated on the remaining one.

This operation is repeated k times, so that each portion is used as the test set exactly once. The model's performance is then estimated by averaging the k test results.

K-fold cross-validation is a common technique for assessing machine learning models because it is simple to implement and offers a trustworthy estimate of the model's performance.

Check the code sample below:

from sklearn.model_selection import KFold

# X (a pandas DataFrame of features), y (a pandas Series of labels) and
# model (a scikit-learn estimator) are assumed to be defined already

# Reset the indices of X and y so the positional fold indices work with .loc
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

# Define the number of folds (K)
k = 5

# Create a KFold object
kf = KFold(n_splits=k)

# Iterate over the folds
for train_index, val_index in kf.split(X):
    # Split the data into training and validation sets
    X_train, X_val = X.loc[train_index], X.loc[val_index]
    y_train, y_val = y.loc[train_index], y.loc[val_index]

    # Train and evaluate the model on the current fold
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)

    # Print the performance on the current fold
    print("Validation score:", score)

In the code above, the KFold object is created with an n_splits argument of 5. This means that the data will be split into 5 folds. The for loop then iterates over the folds, and for each fold the data is split into training and validation sets. The model is then trained on the training set and evaluated on the validation set, and its score on the validation set is printed.
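
One practical note: KFold does not shuffle the rows by default, so the folds simply follow the original order of the data. If the rows are ordered in some way (for example, sorted by date or by target value), you can create the splitter as KFold(n_splits=k, shuffle=True, random_state=42) to get randomized but reproducible folds.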

2. Stratified K-fold cross-validation

Stratified K-fold cross-validation is a technique that improves the assessment of machine learning models by maintaining the class distribution within each fold.

This technique is especially helpful for imbalanced datasets, since it divides the data into K subsets while maintaining the proportion of each class in every subset.

Stratified K-fold cross-validation delivers more accurate estimates of model performance and helps prevent biased evaluation by guaranteeing a representative distribution of classes in each fold.

Check the code sample below:

from sklearn.model_selection import StratifiedKFold
import numpy as np

# Define the number of folds (K)
K = 5

# X (features), y (labels) and model are assumed to be defined, as before
skf = StratifiedKFold(n_splits=K)

# Convert X and y to numpy arrays if they are not already,
# so the integer fold indices can be used directly
X = np.array(X)
y = np.array(y)

# Iterate over the folds
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model using the current fold
    model.fit(X_train, y_train)

    # Evaluate the model
    score = model.score(X_test, y_test)
    print("Validation score:", score)

3. Hold-out-based validation

Hold-out-based validation is a simple method of evaluating a machine learning model by splitting the data once into a training set and a test set. The model is trained on the training set and then evaluated on the test set.

This approach is easy to implement, but because the evaluation relies on a single split, its estimate can be less reliable than other types of cross-validation, such as k-fold cross-validation.

Check the code sample below:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestRegressor()

# Train the model using the training set
model.fit(X_train, y_train)

# Evaluate the model on the test set
# (for a regressor, .score returns the R² value rather than accuracy)
score = model.score(X_test, y_test)
print("Test score:", score)
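
In the code above, train_test_split holds out 30% of the data as a test set (test_size=0.3), and random_state=42 makes the split reproducible. The model is trained once on the training set and scored once on the test set; for a RandomForestRegressor, model.score returns the R² of the predictions, while for a classifier it would return accuracy.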

Conclusion

Congratulations! You now possess the fundamental knowledge of cross-validation methods, giving you the confidence to improve the accuracy and dependability of your machine-learning models.
