Is the Model Making the Right Predictions? - Part 1 of 5 on Evaluation of Machine Learning Models
A student has exams after their training is done. So does a model. There are certain algorithms (and here we don't mean machine learning models) used to evaluate a model, depending on the problem it was trained for. This is an extremely important concept that most courses give the least weight to, and that is why it is one of the earliest topics of this series.
Before jumping right into the algorithms, we first need to discuss one more idea - preparing the data for testing. This is a topic that deserves, and will get, a separate post in excruciating detail. For now, we'll take the example dataset from the previous post of this series and split it into 2 datasets - one used for training and another for testing. This way, we have a dataset for which we know the actual answers and can easily compare the output of the machine learning models we develop against them.
To do so, we'll again make use of the scikit-learn library and call a function named train_test_split that does exactly what we described above.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hold out 30% of the data for testing; the rest is used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train on the training split only, then predict on the unseen test split
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
A test size of 0.3 means that 30% of the input data is reserved for testing. This way, the actual outputs the model should produce are stored in the variable y_test, and the outputs the model actually produced are stored in y_pred.
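As a quick sanity check (a minimal sketch, assuming X and y are the features and labels from the example dataset), you can verify the split proportions:
# Roughly 70% of the rows should land in training and 30% in testing
print(len(X_train), len(X_test))
print(len(y_train), len(y_test))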
Another important thing to understand is the answer to 3 core questions -
Why do we evaluate the model?
What insights do we need to collect from a model in order to make better decisions?
How do we use the metric to optimize the model? (This will be covered in a separate post later in the series)
For all the evaluation algorithms we are going to look at, we will try to answer these 3 questions. Let's begin with the evaluation of classification models.
Accuracy
Perhaps the most straightforward metric. Let's say you took an MCQ test that had 100 questions and you answered 74 correctly. Your accuracy is 74%. This is a calculation all of us have intuitively been doing whenever we see ratios.
In really simple terms, accuracy is defined as how many times you were correct divided by how many attempts you made. In terms of machine learning, you calculate accuracy as the number of times the model's output was correct divided by the number of predictions the model made.
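To make the definition concrete, here is a minimal sketch that computes accuracy by hand (the two lists are made-up predictions, not from our dataset):
y_actual = [1, 0, 1, 1, 0]
y_predicted = [1, 0, 0, 1, 0]

# accuracy = correct predictions / total predictions
correct = sum(1 for a, p in zip(y_actual, y_predicted) if a == p)
print(correct / len(y_actual))  # 0.8 -> 4 out of 5 predictions were correct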
We can calculate the accuracy of our machine learning models using the accuracy_score function from the metrics module of the library.
from sklearn.metrics import accuracy_score
# Compare the actual test labels against the model's predictions
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
Now, if accuracy were the best metric, it would be the only metric to exist, this blog post would be over right here, and I'd probably go to sleep instead of writing this at 11:00 PM on a Saturday. But alas, it isn't.
Let’s take 2 examples here -
Example A:
| Actual | Predicted |
| --- | --- |
| 1 | 1 |
| 0 | 1 |
| 1 | 1 |
| 0 | 0 |
| 1 | 1 |
| 0 | 0 |
| 1 | 1 |
| 0 | 0 |
| 1 | 1 |
| 0 | 0 |
In this case, there are 9 correct predictions and 1 wrong one, which means the model is 90% accurate.
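You can reproduce this number with accuracy_score (a small sketch using the Example A values rather than our actual dataset):
from sklearn.metrics import accuracy_score

actual_a = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
predicted_a = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]
print(accuracy_score(actual_a, predicted_a))  # 0.9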
Example B:
| Actual | Predicted |
| --- | --- |
| 1 | 1 |
| 0 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
In this case as well, the accuracy of the model is 90%. But if you look closely, it has made a wrong prediction every time it should have predicted 0. This is a scenario that occurs in a lot of problems (credit card fraud detection, for instance) where there are very few examples of a certain class during training and testing, due to which the model gets biased towards the majority class if not trained properly and carefully. This means that for any problem with an imbalance between classes, accuracy is a misleading metric to use.
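To see how badly accuracy can mislead on imbalanced data, here is a minimal sketch (made-up numbers, not our dataset) where a "model" that always predicts 1 still scores 95%:
from sklearn.metrics import accuracy_score

# 95 genuine transactions (1) and only 5 fraudulent ones (0)
actual = [1] * 95 + [0] * 5
always_one = [1] * 100  # a useless model that predicts 1 every single time

print(accuracy_score(actual, always_one))  # 0.95, despite missing every fraud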
There is one more challenge with using accuracy as the source of truth when working with a multi-class classification problem: it will not help you identify, at a granular level, whether the model is confusing two of the classes, is biased towards a single class, or is straight up guessing and got lucky.
All these challenges mean that we need to look at slightly more sophisticated evaluation algorithms that do a bit more than just provide a single number as an output.
These challenges lead us to a stepping stone towards the solution - the Confusion Matrix.
Confusion Matrix
Let's clear one thing up at the start itself - the confusion matrix is not an evaluation metric. It is something you use alongside the primary metric to answer the question "is my model getting confused between two classes?" while the primary metric is used to optimize the model.
A confusion matrix can look something like this:
|  | Class A | Class B |
| --- | --- | --- |
| Class A | 45 | 5 |
| Class B | 12 | 38 |
Rows of the confusion matrix are the actual classes while the columns are the predicted classes. To interpret the above matrix: there were 45 instances where the model predicted 'A' and the actual class was indeed 'A', while 5 times it predicted 'B' when the class was actually 'A'. Similarly, 12 times the model predicted 'A' while it was actually 'B', and 38 times the model correctly predicted 'B'.
Similar to how you get accuracy in scikit-learn, there is a confusion_matrix function in the metrics module that you can use to get the matrix.
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
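One thing worth noting: by default, scikit-learn orders the rows and columns by the sorted label values. If you want to pin the order explicitly (the 'A' and 'B' labels here are just the hypothetical classes from the table above), you can pass the labels argument:
# Force the row/column order to match the table: Class A first, then Class B
cm = confusion_matrix(y_test, y_pred, labels=['A', 'B'])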
This matrix addresses one of the two core challenges of accuracy and lets you understand whether the model is confusing two classes. However, it doesn't yet solve the imbalanced dataset issue, for which accuracy falls short.
That’s where we get 3 metrics - Precision, Recall, and F1 Score - all of which will be discussed in the next article of the series.