Performance Measures for Classification Models Using Scikit-Learn


Performance metrics are essential tools for assessing the effectiveness and reliability of classification machine learning models. These metrics provide a structured and quantitative approach to evaluate how accurately a model can assign data points to specific, predefined categories. A thorough evaluation of a model's performance typically includes a range of measures, each offering unique insights into different aspects of its predictive capabilities. Key metrics include accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic curve (AUC-ROC).
Confusion Matrix
A confusion matrix is a tool used to evaluate the performance of a classification model. It is a table that summarizes the results of the model's predictions compared to the actual outcomes. The matrix typically has four components:
True Positives (TP): The cases in which the model correctly predicted the positive class.
True Negatives (TN): The cases where the model correctly predicted the negative class.
False Positives (FP): The instances in which the model incorrectly predicted the positive class (also known as Type I error).
False Negatives (FN): The cases where the model failed to predict the positive class but should have (also known as Type II error).
From these four values, various performance metrics can be calculated, such as accuracy, precision, recall, and F1-score, which help in understanding how well the model is performing.
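As a minimal illustration (my own sketch, not part of the original text), scikit-learn's confusion_matrix arranges these four counts with the true class on the rows and the predicted class on the columns, so for a binary problem the layout is [[TN, FP], [FN, TP]]:
from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0, 0]   # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1]   # hypothetical predictions
print(confusion_matrix(y_true, y_pred))
# [[2 1]    row 0: TN, FP
#  [1 2]]   row 1: FN, TP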
Accuracy is often seen as the most straightforward metric. It represents the overall proportion of correct predictions made by the model, combining both true positives (correctly identified positive instances) and true negatives (correctly identified negative instances) relative to the total number of predictions. Although accuracy is a useful starting point, it can be misleading in cases where the dataset is imbalanced — for instance, in scenarios where one class significantly outweighs another. In such cases, a high accuracy rate might mask poor performance in predicting the minority class.
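Expressed in terms of the confusion-matrix counts introduced above:
\(\mathbf {accuracy= \frac {TP+TN}{TP+TN+FP+FN}}\)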
Precision is another critical metric that specifically focuses on the accuracy of the positive predictions made by the model. It is calculated as the number of true positives divided by the sum of true positives and false positives. High precision is particularly crucial in contexts where the consequences of false positives are high, such as in fraud detection or medical testing, where incorrect positive identifications can lead to unnecessary interventions or alarm.
\(\mathbf {precision= \frac {TP}{TP+FP}}\)
Recall, also known as sensitivity, measures the model's ability to identify all relevant cases within a dataset. It quantifies this capability by dividing the number of true positives by the sum of true positives and false negatives. High recall values are especially important in areas where the cost of missing a positive case can have severe ramifications, such as in disease screening or safety-critical applications, where overlooking a positive instance could result in dire outcomes.
\(\mathbf {recall= \frac {TP}{TP+FN}}\)
The F1 score is a composite measure that serves as the harmonic mean of precision and recall. This metric is particularly beneficial in scenarios where both false positives and false negatives carry significant weight, as it offers a single score that balances both metrics. It becomes particularly important in the context of imbalanced classes, where one class may be much smaller than the other, leading to inflated accuracy metrics that do not faithfully represent model performance.
\(\mathbf {F1= 2 \times \frac {Precision \times Recall}{Precision+Recall}}\)
Finally, the area under the receiver operating characteristic curve (AUC-ROC) provides a nuanced perspective on the model's capability to differentiate between classes across various classification thresholds. The ROC curve plots the true positive rate against the false positive rate, outlining the trade-offs involved in model predictions at differing levels of sensitivity and specificity. A high AUC value indicates that the model is effective at distinguishing between classes, giving practitioners a clear indication of performance across a continuum of potential decision thresholds.
By meticulously analyzing these diverse performance metrics, data scientists and machine learning practitioners can uncover the strengths and weaknesses of their models. This multifaceted evaluation empowers them to make informed adjustments and enhancements to their models, ultimately leading to improved performance and more accurate predictions in real-world applications. This rigorous approach not only enhances model robustness but also fosters a deeper understanding of the models' operational characteristics in various contexts.
Let’s analyze these metrics using an example of a classification model.
We will use the MNIST dataset, which can be downloaded through scikit-learn. Let's load the data.
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
mnist = fetch_openml('mnist_784', as_frame=False)
mnist.keys()
dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])
Create the features and target labels and check the shape of the data.
X, y = mnist.data, mnist.target
X.shape
(70000, 784)
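One detail worth checking (an optional sanity check I have added): fetch_openml returns the labels as strings rather than integers, which is why the comparisons later in this post use the string '5'.
print(y.shape)   # (70000,)
print(y[:5])     # e.g. ['5' '0' '4' '1' '9'] – note the labels are strings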
Let's check the first data point.
def plot_digit(image_data):
    image = image_data.reshape(28, 28)
    plt.imshow(image, cmap="binary")
    plt.axis("off")

some_digit = X[0]
plot_digit(some_digit)
plt.show()
It is the digit 5.
We will divide the data into training and test sets: the first 60,000 images for training and the remaining 10,000 for testing.
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
In order to effectively display all performance metrics, we will develop a binary classifier. This classifier will categorize the labels by assigning a value of true when the digit is 5 and a value of false for all other digits. This approach will allow us to analyze the model's ability to correctly identify the presence of the digit 5 compared to other digits.
y_train_5 = (y_train == '5') # True for all 5s, False for all other digits
y_test_5 = (y_test == '5')
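As a quick optional sanity check (my addition), note how imbalanced these binary labels are; this matters for the accuracy discussion below.
# only roughly 9–10% of the training labels are True (the digit 5)
print(y_train_5.sum(), "positives out of", len(y_train_5))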
We will be implementing a stochastic gradient descent (SGD) classifier, which is a powerful and efficient approach for optimizing our machine learning model. This method works by updating the model's parameters incrementally, using randomly selected subsets of the training data known as mini-batches. By doing so, we can navigate the loss function more effectively, allowing us to refine our model's performance while also reducing the computational burden typically associated with processing the entire dataset at once. This iterative process helps us find the optimal weights for our model, ultimately leading to better predictions.
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
After fitting the model, let's test it on the first digit by calling the predict function.
sgd_clf.predict([some_digit])
array([ True])
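Under the hood, the classifier computes a score for each instance via its decision_function and applies a default threshold of 0; we will reuse these scores later for the precision/recall and ROC curves. A small optional sketch (my addition):
# a positive score means the model predicts "5"
some_digit_score = sgd_clf.decision_function([some_digit])
print(some_digit_score)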
In order to assess the accuracy of our model, we will utilize a technique known as cross-validation. Specifically, we will employ the cross_val_score function from the scikit-learn library. This function allows us to evaluate the performance of our model by splitting the dataset into multiple subsets, training the model on some of these subsets, and validating it on the remaining ones. By repeating this process several times, we can obtain a more reliable estimate of the model's accuracy.
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
array([0.95035, 0.96035, 0.9604 ])
It gives an accuracy above 95%, which looks good, but accuracy alone can be deceptive on imbalanced data; to understand the model better we will create a confusion matrix and analyze the other performance metrics.
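To see just how deceptive, here is a small optional baseline (my addition, not part of the original walkthrough) using scikit-learn's DummyClassifier, which always predicts the majority class ("not 5"):
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

dummy_clf = DummyClassifier(strategy="most_frequent")
# expect roughly 90% accuracy, simply because only about 10% of the images are 5s
print(cross_val_score(dummy_clf, X_train, y_train_5, cv=3, scoring="accuracy"))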
We will utilize the cross_val_predict method from the scikit-learn library to generate predicted values based on our model. This method allows us to perform cross-validation and provides a way to obtain predictions for each data point in our dataset by training the model multiple times on different subsets of the data. This approach helps ensure that we get a more accurate estimate of the model's performance.
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
We will utilize the confusion_matrix function provided by the scikit-learn library, which allows us to evaluate the performance of our classification model by comparing the predicted classifications to the actual outcomes. This function generates a matrix that summarizes the correct and incorrect predictions, offering insights into the model's accuracy and error types.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_train_5, y_train_pred)
cm
array([[53892,   687],
       [ 1891,  3530]])
TN = 53892, FP = 687, FN = 1891 and TP = 3530
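Scikit-learn can also render the confusion matrix as a heat map, which is often easier to read. A small optional sketch (my addition):
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(y_train_5, y_train_pred)
plt.show()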
Let's calculate the precision.
from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred)
0.8370879772350012
Recall
recall_score(y_train_5, y_train_pred)
0.6511713705958311
F1 score
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)
0.7325171197343847
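As a quick cross-check (my own addition, using the counts from the confusion matrix above), the same numbers fall out of the formulas directly:
TP, FP, FN = 3530, 687, 1891
precision = TP / (TP + FP)                                  # ≈ 0.837
recall = TP / (TP + FN)                                     # ≈ 0.651
f1 = 2 * precision * recall / (precision + recall)          # ≈ 0.733
print(precision, recall, f1)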
To study the precision/recall trade-off, we first need a score for every instance in the training set. We obtain these scores by calling cross_val_predict again, this time with method set to "decision_function", and then pass them to the precision_recall_curve function from the scikit-learn library. This lets us compute precision and recall at every possible threshold, which is crucial for understanding how well our model distinguishes between the classes and for identifying the threshold that balances precision and recall effectively.
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
threshold = 3000
We will plot precision and recall against the decision threshold, marking the threshold of 3000.
plt.figure(figsize=(8, 4)) # extra code – it's not needed, just formatting
plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
plt.vlines(threshold, 0, 1.0, "k", "dotted", label="threshold")
# extra code – this section just beautifies the figure
idx = (thresholds >= threshold).argmax() # first index ≥ threshold
plt.plot(thresholds[idx], precisions[idx], "bo")
plt.plot(thresholds[idx], recalls[idx], "go")
plt.axis([-50000, 50000, 0, 1])
plt.grid()
plt.xlabel("Threshold")
plt.legend(loc="center right")
plt.show()
The plot shows that roughly 90% precision is achieved at around 50% recall.
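To make this concrete (a small optional sketch, not in the original walkthrough), we can apply the plotted threshold of 3000 to the raw decision scores ourselves and recompute precision and recall; raising the threshold trades recall for precision.
# predictions at a custom threshold instead of the default of 0
y_train_pred_3000 = (y_scores >= 3000)
print(precision_score(y_train_5, y_train_pred_3000))  # should be close to 0.90, per the plot
print(recall_score(y_train_5, y_train_pred_3000))     # should be close to 0.50, per the plot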
Receiver Operating Characteristic (ROC)
The Receiver Operating Characteristic (ROC) curve is an important tool used for evaluating the performance of binary classifiers. It visually represents the trade-off between the True Positive Rate (TPR), also known as sensitivity, and the False Positive Rate (FPR). TPR indicates the proportion of actual positive cases that are correctly identified by the model, while FPR reflects the proportion of actual negative cases that are incorrectly classified as positive. Additionally, the True Negative Rate (TNR), which is also called specificity, measures the model’s ability to correctly identify negative cases. The ROC curve essentially plots TPR against 1 minus specificity, providing a graphical representation of the classifier's performance across various threshold settings. This allows for a comprehensive assessment of the model's strengths and weaknesses in distinguishing between the two classes.
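In terms of the confusion-matrix counts, the two rates plotted on the ROC curve are:
\(\mathbf {TPR= \frac {TP}{TP+FN}}\)
\(\mathbf {FPR= \frac {FP}{FP+TN}}\)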
Before plotting, let's find the lowest threshold that gives at least 90% precision, using the argmax function on the precision/recall thresholds computed above; we will mark the corresponding point on the ROC curve.
idx_for_90_precision = (precisions >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]
threshold_for_90_precision
Now compute the false positive rate and true positive rate at every threshold with the roc_curve function.
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
Plot the ROC curve
idx_for_threshold_at_90 = (thresholds <= threshold_for_90_precision).argmax()
tpr_90, fpr_90 = tpr[idx_for_threshold_at_90], fpr[idx_for_threshold_at_90]
plt.figure(figsize=(6, 5)) # extra code – not needed, just formatting
plt.plot(fpr, tpr, linewidth=2, label="ROC curve")
plt.plot([0, 1], [0, 1], 'k:', label="Random classifier's ROC curve")
plt.plot([fpr_90], [tpr_90], "ko", label="Threshold for 90% precision")
plt.text(0.12, 0.71, "Higher\nthreshold", color="#333333")
plt.xlabel('False Positive Rate (Fall-Out)')
plt.ylabel('True Positive Rate (Recall)')
plt.grid()
plt.axis([0, 1, 0, 1])
plt.legend(loc="lower right", fontsize=13)
plt.show()
The dotted diagonal line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible, toward the top-left corner.
To effectively evaluate the performance of a classification model, we can measure the area under the Receiver Operating Characteristic (ROC) curve. Scikit-learn conveniently provides a function specifically designed for estimating this area. A perfect ROC-AUC score, which indicates flawless model performance, is represented by a value of 1. This score signifies that the model can perfectly distinguish between positive and negative classes.
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)
np.float64(0.9604938554008616)
Code for this blog is available at PerformanceMeasures