F1 score in ML: Intro and calculation

Isha Tripathi
6 min read

Machine learning is a dynamic field that involves the use of algorithms and statistical models to analyze data and make predictions. It has become a critical tool for solving complex problems in various industries, including finance, healthcare, and technology.

Deep learning, a subset of machine learning, uses artificial neural networks to tackle challenges in image and speech recognition, natural language processing, autonomous driving, etc. The algorithms used in deep learning have shown remarkable accuracy in solving complex problems, in some cases outperforming traditional machine learning techniques.

Evaluating the performance of learning algorithms is a crucial part of understanding their capabilities. Historically, accuracy has been the default metric used to compare models. However, accuracy only measures the fraction of correct predictions across the entire dataset, which is informative mainly when the data is evenly distributed across the different classes.

Overcoming the Inadequacies of Accuracy

The inadequacy of accuracy as a sole evaluation metric arises when the distribution of classes in a dataset is unbalanced. In such cases, a model that simply predicts the majority class all the time may still achieve high accuracy while actually performing poorly. For example, in a binary classification problem where 90% of the instances belong to class A and 10% belong to class B, a model that always predicts class A will have an accuracy of 90% while missing all instances of class B.
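To make this concrete, here is a minimal sketch (with made-up labels, not data from this article) showing how a majority-class predictor scores on accuracy versus F1 in scikit-learn:

from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 90 negatives (class A = 0) and 10 positives (class B = 1)
y_true = [0] * 90 + [1] * 10
# A "model" that always predicts the majority class
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.9 despite missing every positive
# zero_division=0 silences the undefined-precision warning (scikit-learn >= 0.22)
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0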

The F1 score, commonly attributed to van Rijsbergen's 1979 work on information retrieval, was introduced as a way to address the limitations of accuracy in such scenarios.

What is the F1 Score?

The F1 score is a commonly used metric for evaluating the performance of machine learning models, particularly in the field of binary classification. It is a balance between precision and recall, both of which are important factors in determining the effectiveness of a classifier.

The F1 score accounts for both false positives and false negatives (through precision and recall), providing a more complete picture of model performance than accuracy alone. In this way, the F1 score can expose problems such as unbalanced classes, where a model may achieve high accuracy simply by predicting the majority class. By considering both precision and recall, the F1 score provides a more nuanced view of a model’s performance.

How to calculate the F1 score?

A confusion matrix is a useful tool for evaluating the performance of a binary classification model. It has four components:

- True positive (TP): the number of instances where the model correctly identifies the positive class.
- False positive (FP): the number of instances where the model predicts the positive class, but it is actually negative.
- True negative (TN): the number of instances where the model correctly identifies the negative class.
- False negative (FN): the number of instances where the model predicts the negative class, but it is actually positive.

These components allow us to calculate several metrics, including accuracy, precision, recall, and the F1 score.
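If you want to read these four counts directly off a model's predictions, one minimal sketch (with made-up labels, for illustration only) uses scikit-learn's confusion_matrix, which for binary labels returns them in a 2×2 array:

from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() unpacks the 2x2 matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)  # 3 1 3 1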

Now, let’s define precision and recall. Precision is the proportion of true positive predictions out of all positive predictions made by the model: Precision = TP / (TP + FP).

Recall, on the other hand, is the proportion of true positive predictions out of all actual positive instances in the dataset: Recall = TP / (TP + FN). Both precision and recall range from 0 to 1, with a higher value indicating better performance.

The formula for the F1 score is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The F1 score balances the trade-off between precision and recall by taking the harmonic mean of both metrics. High precision means fewer false positives, while high recall means fewer false negatives. The F1 score provides a single measure to evaluate a model’s performance that considers both aspects.
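To see why the harmonic mean matters, here is a small sketch (with assumed precision and recall values, not taken from this article's example) comparing it to a simple average when the two metrics are far apart:

# Assumed values: a model with excellent precision but very poor recall
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)

print(arithmetic_mean)  # 0.55 -- looks acceptable
print(round(f1, 3))     # 0.182 -- the harmonic mean punishes the weak recall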

Calculating the F1 score

Let’s take a binary classification problem with 100 observations, where our model correctly predicts 80 of them. Of those 80 correct predictions, 70 are true positives (TP) and 10 are true negatives (TN). The remaining 20 predictions are incorrect, with 15 being false positives (FP) and 5 being false negatives (FN).

Now, we can calculate precision, recall, and F1 score as follows:

Precision = (True Positives) / (True Positives + False Positives) = 70 / (70 + 15) = 0.82

Recall = (True Positives) / (True Positives + False Negatives) = 70 / (70 + 5) = 0.93

F1 Score = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.82 × 0.93) / (0.82 + 0.93) ≈ 0.87

The F1 score for this model is 0.87, which indicates that the model has a good balance between precision and recall.
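The same numbers can be checked with a few lines of Python, using only the counts from the example above (plain arithmetic, no library calls):

# Counts from the worked example above
tp, fp, fn = 70, 15, 5

precision = tp / (tp + fp)  # 70 / 85 ≈ 0.82
recall = tp / (tp + fn)     # 70 / 75 ≈ 0.93
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # ≈ 0.824, 0.933, 0.875

Note that the exact F1 value from the raw counts is 0.875; the 0.87 above comes from rounding precision and recall to two decimal places before combining them.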

How to compute F1 measures in Python?

In addition to the F1 score, it’s also important to understand the precision and recall of our model. Here is an example of how to calculate precision and recall in Python using the sklearn.metrics library:

from sklearn.metrics import precision_score, recall_score

# True labels and predicted labels (hypothetical example values)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Calculate precision
precision = precision_score(y_true, y_pred)
print(precision)

# Calculate recall
recall = recall_score(y_true, y_pred)
print(recall)

In this example, true labels are y_true and predicted labels are y_pred. The precision_score and recall_score functions are used to calculate the precision and recall, respectively, which are then printed.

The F1 score itself is calculated using the f1_score function from the sklearn.metrics library in Python.

from sklearn.metrics import f1_score
# True labels
y_true = [1, 0, 1, 1, 0, 1]
# Predicted labels
y_pred = [1, 0, 0, 1, 0, 1]
# Calculate F1 score
f1 = f1_score(y_true, y_pred)
print(f1)
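For these toy labels the model produces 3 true positives, 0 false positives, and 1 false negative, so precision is 1.0, recall is 0.75, and the printed F1 score is 2 × (1.0 × 0.75) / (1.0 + 0.75) ≈ 0.857.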

Here is a simple example to demonstrate the use of F1-score in a real-world scenario.

Let’s say we’re working on a problem of identifying fraudulent credit card transactions. The dataset consists of a list of credit card transactions, and we have to predict whether each transaction is genuine or fraudulent. We’ll train a binary classification model on this dataset, and use the F1-score to evaluate the model’s performance.

from sklearn.metrics import f1_score

# True labels (0 = genuine, 1 = fraudulent)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
# Predicted labels
y_pred = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

f1 = f1_score(y_true, y_pred, average='binary')

print("F1-score:", f1)

In the code above, we use f1_score to calculate the F1-score for the model’s predictions. The y_true variable holds the true labels (0 = genuine, 1 = fraudulent), and the y_pred variable holds the predicted labels. The average parameter is set to 'binary' (the default for binary targets), so the score is reported for the positive class, here the fraudulent transactions. Finally, the F1-score is printed, and we can use this metric to evaluate the performance of the model.
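With these labels the model catches 5 of the 6 fraudulent transactions (1 false negative) and flags 1 genuine transaction as fraud (1 false positive), so precision and recall are both 5/6 ≈ 0.83 and the printed F1-score is about 0.83.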

The F1 score provides a balance between precision and recall, which is essential in scenarios where false negatives (fraudulent transactions that are not detected) and false positives (genuine transactions that are marked as fraudulent) have different consequences.

Conclusion

In conclusion, the F1 score plays a crucial role in the evaluation of machine learning models, especially in binary classification problems where balancing precision and recall is essential. By considering both precision and recall, this metric provides a balanced view of a model’s performance, making it a valuable tool for data scientists when selecting and optimizing models. The F1 score is widely used across industries, including healthcare, finance, and marketing, for evaluating models that diagnose diseases, detect fraud, and identify target customers.
