Model-agnostic confidence estimation with conformal predictors for AutoML

Introduction

At MindsDB, we know how important it is for a user to trust their predictors. When it comes to machine learning algorithms, developing trustworthy solutions is a great challenge. In this blog post, we will take a closer look into how our machine learning models come up with a confidence estimate for their predictions.

Whenever a machine learning model is moved into production, it is expected to perform roughly as well as it did on a held-out validation dataset. However, we would like something stronger than a simple assumption that the performance of the deployed model will continue to hold as new data comes in. Having a notion of confidence when predicting can signal important information to the end-user, who can then decide whether they trust the predictions or not.

The task of measuring the confidence we have in a predictor’s predictions can be formulated in a simple way. We aim for models that output the best possible predictions, but they can never be perfect. Because of this, we want a confidence estimation method such that, if our predictor is x% confident in a result, then in x% of N trials (for a large N) the predictor would actually be right. This agreement between stated confidence and observed accuracy is the defining property of “well-calibrated” models, which is what we are after.

There are various methods for confidence estimation, ranging from the naïve, such as interpreting a normalized output vector to be the actual belief distribution, to more sophisticated and principled frameworks [3] [4].

MindsDB uses inductive conformal predictors for this task [5]. The conformal prediction framework is a simple, generic, model-agnostic approach for estimating confidence in our predictions. In this blog post, we will be taking a closer look at what this technique has to offer and how you can leverage it to truly trust predictions from within your database.

Figure 1: MindsDB helps you decide whether to trust the model’s predictions or not.

The theory behind conformal predictors

As the name suggests, conformity is central to this framework. We are especially interested in measuring how different or strange a problem instance is when compared to the data that the model was trained with. The idea is that when the data point we are predicting for conforms closely to the training data, we can be more certain of the prediction our model emits. On the other hand, if the data point is particularly nonconforming, then we should be more wary of the forecast.

Here’s the kicker: assuming only that our data is modeled by an exchangeable sequence of random variables (roughly speaking, their joint probability distribution is the same irrespective of the sequence order), we can guarantee a chosen error rate for the confidence estimates of any machine learning model trained with this framework. As AutoML is concerned with automatically evaluating different models, conformal predictors are especially well suited for MindsDB.

\(\text{exchangeability:} \quad P(X_1 = x_1, X_2 = x_2) \equiv P(X_2 = x_2, X_1 = x_1)\)

There are many variants of the framework, but we will focus on inductive conformal prediction (ICP), where we train the model and then calibrate the confidence estimation method just once.

Getting down to the specifics, we can formally define an inductive conformal predictor as the combination of a machine learning model and an appropriate nonconformity function. To use them, we first need to split the data as usual: training, validation and testing splits are required.

The validation dataset is used to get predictions from the trained model to compute calibration scores using the nonconformity function, which should estimate how strange a feature–label pair is. What are some examples of a nonconformity function? It turns out that even a random uniform distribution is a valid option, although not very useful. The only constraint is that this function treats training and testing data points equally. In practice, common choices include the inverse probability error function for classification tasks, and absolute error for regression tasks.

\(\text{inverse probability error function:} \quad 1 - \hat{P}(y_i \mid x_i)\)

\(\text{absolute error function:} \quad |y - \hat{y}|\)
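
As an illustrative sketch (plain NumPy, not MindsDB's exact implementation; the array names are made up for the example), the two error functions above can be written in a few lines:

```python
import numpy as np

def inverse_probability_error(class_probabilities, true_labels):
    """Classification nonconformity: 1 - P_hat(true class | x).
    `class_probabilities` is an (n_samples, n_classes) array of predicted
    probabilities and `true_labels` holds each sample's integer class."""
    return 1.0 - class_probabilities[np.arange(len(true_labels)), true_labels]

def absolute_error(y_true, y_pred):
    """Regression nonconformity: |y - y_hat|."""
    return np.abs(y_true - y_pred)

# Calibration scores are simply these functions evaluated on the validation split,
# e.g. scores = absolute_error(y_val, model.predict(X_val)).
```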

These calibration scores define the range of conformity that new data is going to be compared against. Using ICP, any prediction \(\hat{y}_i\) for the data point \(x_i\) will be transformed into a set of predictions. In the case of regression tasks, we now have a range \(\hat{y}_i - u \leq y_i \leq \hat{y}_i + u\), and in the case of classification tasks, the prediction will be a set of labels among which we believe the correct class lies.

The way to arrive at these answers is simpler than one might expect. Because we know the data is exchangeable, the conformity of any new sample can be directly compared to the scores obtained on the validation dataset, whose sorted values split our data distribution into intervals of equal probability mass.
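
As a minimal illustration of that comparison (a from-scratch sketch; the function name is mine, not from any particular library), the standard conformal p-value of a new sample is simply its rank among the calibration scores:

```python
import numpy as np

def conformal_p_value(new_score, calibration_scores):
    """Fraction of calibration scores that are at least as nonconforming as the
    new sample (counting the sample itself), i.e. how 'ordinary' it looks
    relative to the validation data."""
    calibration_scores = np.asarray(calibration_scores)
    return (np.sum(calibration_scores >= new_score) + 1) / (len(calibration_scores) + 1)

# A score below most calibration scores yields a high p-value (the sample conforms),
# while an unusually large score yields a p-value of 1 / (n + 1).
print(conformal_p_value(0.7, [0.5, 1.2, 2.5, 3.5]))  # 0.8
print(conformal_p_value(9.0, [0.5, 1.2, 2.5, 3.5]))  # 0.2
```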

Putting it all together

Let’s consider an example regression task, for which we have four data points in our validation dataset. After computing and sorting their calibration scores, \(z_1 \leq z_2 \leq z_3 \leq z_4\), we can say that 20% of any set of sampled data points are going to have a lower score than \(z_1\). Likewise, \(z_4\) will be at the 80th percentile of this distribution, meaning 20% of the data is going to present higher scores than \(z_4\). This holds for both test and production data because we start with the assumption that they all belong to the same distribution.

Table 1: Example dataset

\(z_1 = 0.5 \qquad z_2 = 1.2 \qquad z_3 = 2.5 \qquad z_4 = 3.5\)

Within this scenario, consider a sample \(s\) from production data arriving at a deployed regression model that uses ICP. We have the corresponding prediction \(p\) and the calibration scores for the validation dataset from the previous paragraph, but, crucially, we lack the real value \(y\) that we want to predict. How confident are we in this prediction?

We can say \(P(|y - p| \leq z_1) = 0.2\), \(P(|y - p| \leq z_2) = 0.4\), \(\ldots\), \(P(|y - p| \leq z_4) = 0.8\). This can be rephrased as \(P(y \in [p - z_1, p + z_1]) = 0.2\), \(P(y \in [p - z_2, p + z_2]) = 0.4\), and so on. Suddenly, we have a well-calibrated probability (a.k.a. the confidence we are after) for our point prediction, which has been turned into a prediction range! As previously mentioned, classification tasks follow a similar idea, but the output is a set of labels instead.
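
To make the bookkeeping concrete, here is a minimal from-scratch sketch in NumPy (not MindsDB's internal code; the point prediction p = 10.0 is made up for illustration) that turns the calibration scores from Table 1 into prediction intervals at different confidence levels:

```python
import numpy as np

# Sorted absolute-error calibration scores from Table 1.
calibration_scores = np.array([0.5, 1.2, 2.5, 3.5])

def prediction_interval(point_prediction, scores, confidence):
    """Return the ICP interval [p - q, p + q], where q is the standard conformal
    quantile of the calibration scores: the ceil((n + 1) * confidence)-th smallest one."""
    scores = np.sort(scores)
    n = len(scores)
    k = int(np.ceil((n + 1) * confidence)) - 1
    k = min(max(k, 0), n - 1)  # confidences above n / (n + 1) are capped by the data we have
    q = scores[k]
    return point_prediction - q, point_prediction + q

p = 10.0  # hypothetical point prediction for the incoming production sample
for conf in (0.2, 0.4, 0.8):
    low, high = prediction_interval(p, calibration_scores, conf)
    print(f"{conf:.0%} confidence -> [{low:.1f}, {high:.1f}]")
```

Running it prints intervals of increasing width as the requested confidence grows, which previews the point made below.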

What we can do with this is very powerful. In essence, the end user can fix an expected error rate to estimate a prediction region that respects this constraint, or vice versa: if the problem domain imposes a maximum set size or region width, we can get the associated error rate for our predictions as a confidence estimate.

At this point, we can establish an important fact that you’ll observe when using MindsDB: the greater the confidence, the wider your prediction region will be. At first glance, this might sound counterintuitive, but hopefully after what we’ve seen, it makes sense that this is the case.

So, where’s the catch?

If you've been following so far, you may have noticed a couple of important points worth considering as potential limitations when using ICP:

1. This method of estimating confidence does not distinguish between different samples at inference time, i.e., the error rate and the predicted region size are going to be the same for any and all predictions. This is not ideal, as we’d like our prediction region to be as tight as possible, based on the perceived difficulty of the task (in the literature, this is known as the “efficiency” of our region). One way of solving the issue is to introduce normalization.

Normalized conformal prediction is a fairly straightforward extension of the concepts we’ve seen. We will take a deeper look at this in an upcoming blog post, but for now you can think of it as training our predictor to not only predict any given data point, but also to estimate the difficulty of correctly doing so, which can be done with a complementary ML model. This estimation would then be used to appropriately scale the predicted region size to be tighter if the data point seems “easier” to predict (a minimal sketch follows this list).

2. Another limitation is that the error rate is guaranteed on average. When handling classification tasks, this can imply a higher error rate for classes with a smaller number of samples in the training and validation data splits. Conditional conformal predictors are designed to effectively combat this issue, guaranteeing the error rate separately for each class.
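
To give a flavor of the first fix, here is a rough sketch of normalized conformal regression (the function names and the difficulty model are hypothetical, not MindsDB's actual approach, which the upcoming post will cover): the nonconformity score is divided by a per-sample difficulty estimate, and the interval of a new sample is rescaled by its own estimated difficulty.

```python
import numpy as np

def normalized_scores(y_true, y_pred, difficulty):
    """Normalized regression nonconformity: |y - y_hat| / sigma_hat, where
    `difficulty` (sigma_hat) comes from a secondary model trained to estimate
    how hard each sample is (e.g. the primary model's expected error)."""
    return np.abs(y_true - y_pred) / np.maximum(difficulty, 1e-6)

def normalized_interval(point_prediction, sample_difficulty, calibration_scores, confidence):
    """Same conformal quantile as before, but the half-width is rescaled by the
    new sample's difficulty estimate: 'easy' samples get tighter intervals."""
    scores = np.sort(calibration_scores)
    n = len(scores)
    k = min(max(int(np.ceil((n + 1) * confidence)) - 1, 0), n - 1)
    half_width = scores[k] * sample_difficulty
    return point_prediction - half_width, point_prediction + half_width
```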

MindsDB implementation

We are deep believers in open source, which is why, rather than reinventing the wheel, we’ve opted for the excellent nonconformist library developed by conformal prediction expert Dr. Henrik Linusson. The package includes a wide array of error functions, as well as implementations for inductive, transductive (the original flavor), conditional, and Mondrian conformal predictors, although we have developed some custom bits of code here and there.

If you are worried about the overhead cost in training and inference runtime, fear not, as ICPs are cheap to calibrate: we only need to save and sort the validation conformity scores, and then it’s a matter of computing the value of the inverse nonconformity measure. Unless you are using an expensive normalizer such as a k-nearest-neighbors model (which, remember, is optional), the additional computing expense is negligible.
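
As a starting point, the snippet below is adapted from the usage pattern shown in the nonconformist README (class names and import paths are taken from that documentation and may differ slightly between versions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

from nonconformist.cp import IcpClassifier
from nonconformist.nc import NcFactory

iris = load_iris()
idx = np.random.permutation(iris.target.size)
idx_train, idx_cal, idx_test = idx[:50], idx[50:100], idx[100:]

model = RandomForestClassifier(n_estimators=100)
nc = NcFactory.create_nc(model)   # default nonconformity function for this model
icp = IcpClassifier(nc)           # inductive conformal classifier

icp.fit(iris.data[idx_train, :], iris.target[idx_train])    # train the underlying model
icp.calibrate(iris.data[idx_cal, :], iris.target[idx_cal])  # compute calibration scores

# Boolean matrix: row i, column j is True if class j is in the prediction
# set for test sample i at a 5% error rate (95% confidence).
prediction_sets = icp.predict(iris.data[idx_test, :], significance=0.05)
print(prediction_sets[:5])
```

The regression counterpart, IcpRegressor, follows the same fit/calibrate/predict pattern and returns lower and upper interval bounds instead of label sets.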

Our invitation to the community is to explore both MindsDB and Nonconformist to include confidence estimation capabilities in your ML pipeline, and let us know what you think! We are sure that conformal prediction is especially relevant for dependable machine learning solutions, and we are very excited for the future of this technique.

Thanks for reading!

— Patricio Cerda-Mardini, ML Engineer @ MindsDB

P.S. If this article was helpful, please give us a GitHub star here.

Resources / Links:

[1] Tutorial: Conformal Prediction and Venn Predictors

[2] Nonconformist Python package: https://github.com/donlnz/nonconformist

[3] Guo, C., Pleiss, G., Sun, Y., & Weinberger, K.Q. (2017). On Calibration of Modern Neural Networks. ArXiv, abs/1706.04599.

[4] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, L., & Hinton, G.E. (2017). Regularizing Neural Networks by Penalizing Confident Output Distributions. ArXiv, abs/1701.06548.

[5] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, New York, 2005.

Author:

Patricio Cerda-Mardini, Machine Learning Research Engineer @mindsdb

Patricio Cerda-Mardini is a Machine Learning Research Engineer at MindsDB. As a master’s student at PUC Chile, he focused on machine learning methods for human–robot interaction and recommendation systems, areas in which he has several academic publications. Prior to joining MindsDB, he also interned at EY Chile as a computer vision researcher.
