Optimizing Machine Learning Workflows
Optimizing Machine Learning workflows begins with splitting your data into training, validation, and testing sets.
However, before we do anything with our data, we need to "clean" it up. The broader term used in Machine Learning is Data Preprocessing, and it is crucial to the success of your model.
Data Splitting
When dealing with our data, we typically divide it using a 70/20/10 or 60/20/20 split. This means allocating one portion of the data for training, another for validation, and the remaining segment for testing.
The training set is what our model learns from. The validation set helps in refining the model and its hyperparameters. Finally, the test set allows us to assess how well the model performs on unseen data.
Now let's explore these concepts further.
Before we proceed though: hold on. What are hyperparameters? The textbook answer is that they are parameters whose values are set before training a learning algorithm begins. Perhaps that explanation wasn't very clear; let's try to shed some light on it.
Hyperparameters are user-defined settings that influence the learning process. It's important not to mistake them for model parameters, which are adjusted as the model learns from data. Hyperparameters are established before the learning process starts and include things like the k in the k-Nearest Neighbors (kNN) algorithm, the train-test split ratio, and the batch size, among others.
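To make that concrete, here is a minimal sketch (assuming Python and scikit-learn, which are not mentioned above but are a common choice) where the k of kNN and the train-test split ratio are fixed by us before any learning happens:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hyperparameters: set by us, before training ever starts
K_NEIGHBORS = 5    # the "k" in k-Nearest Neighbors
TEST_SIZE = 0.2    # the train-test split ratio

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=42)

# fit() works with the data; k itself is never adjusted by the algorithm
model = KNeighborsClassifier(n_neighbors=K_NEIGHBORS)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Changing K_NEIGHBORS or TEST_SIZE means re-running training with a different recipe; nothing inside fit() will touch them.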
So why do we refer to them as "hyper"? The term suggests that these parameters sit at a top level and shape how learning takes place. Let's not delve further into this as it exceeds our scope here; let's proceed with dividing our data.
Remember, we split our data into three parts. Let's take a look at them:
The training set: As the name suggests, this is the data set that you will be training your model on. It is the most substantial portion of the dataset, where the model learns to identify patterns, trends, etc.
The validation set: For once, we are dealing with a Machine Learning term that is easy to understand - this is the data set that you validate your results against. Kind of like a practice test. It's used to fine-tune the model and its hyperparameters, and to check for issues like overfitting (more on that later), where your model might perform well on the training data but poorly on unseen data.
The test set is used to evaluate the final model's performance, offering an unbiased assessment of how it will perform in the real world on unseen data. This helps us understand the model's ability to make accurate predictions on new data.
It is easy to confuse the validation set and the test set. After all, they do look like they serve similar purposes. To put it very simply, the validation set is used to tune the model and its hyperparameters, while the test set is used to provide an unbiased estimate of the final model's performance. You can also call the validation set the Development Set if that helps.
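To make the 60/20/20 split concrete, here is a small sketch (again assuming scikit-learn; the synthetic data is just a stand-in for your own dataset) that carves out the three sets with two calls to train_test_split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)

# First carve off 20% of the data as the test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# ... then split the remaining 80% into train and validation.
# 25% of the remaining 80% equals 20% of the original data, giving 60/20/20.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600, 200, 200
```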
Biased and unbiased estimators are very much statistical concepts, but what we mean by an "unbiased estimate" is that we want the expected value of our statistic to equal the parameter it estimates. For example, if you are estimating the mean of a population, you take multiple samples from that population and calculate the sample mean for each; you expect the average of those sample means to equal the population mean.
Clear as mud? Let's try another way:
An unbiased estimator is like a dart thrower who, on average, hits the bullseye. It doesn't mean every throw hits the center, but over many throws, the average position of all the throws is right on target. Some throws might be a bit to the left, others to the right, but on average they balance out to the true value you're aiming for. So, if you're using an unbiased estimator to guess something about a whole group (like an average), your guesses won't be consistently too high or too low - they'll be just right on average over many tries.
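If you prefer numbers to darts, here is a tiny simulation sketch (NumPy, with a made-up population): each individual sample mean wanders a bit, but the average of many sample means lands right next to the true population mean.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50.0, scale=10.0, size=100_000)

# Each sample mean is one "dart throw" at the true mean
sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]

print("population mean:        ", round(population.mean(), 2))
print("average of sample means:", round(np.mean(sample_means), 2))  # on the bullseye, on average
```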
Moving on...
Next, we need to pick our algorithm and its features - what we mean here is selecting a subset of relevant features (variables, predictors) to use in model construction. Our goal is to improve the model's performance by getting rid of unnecessary, irrelevant, or redundant data.
This is kind of like fine-tuning your data cleanup for the specific needs of your model.
Here are three common algorithms to pick from:
Decision Trees can be considered the Swiss army knife of algorithms. They're versatile and can be used for diverse tasks, from credit scoring to medical diagnosis. The choice of features to include in the decision tree is crucial to its performance.
Linear Regression is your friend when dealing with continuous outcomes. It's like trying to guess how much your house might sell for, or predicting stock market trends, where the relationship between inputs and outcome is roughly linear. Feature selection plays a vital role here too, as irrelevant or redundant features can degrade the model's performance.
Lastly, we have the heavy lifters - Neural Networks. These algorithms are perfect for complex problems with big datasets. They tackle everything from image recognition to natural language processing. Due to their complexity, feature selection* becomes even more critical to prevent the model from becoming overly complex and overfitting the training data.
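In code, switching between these three is mostly a matter of instantiating a different estimator behind the same fit/predict interface. A hedged sketch (scikit-learn again; the example use cases in the comments mirror the ones above and the settings are illustrative, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPClassifier

# Decision Tree: the Swiss army knife (e.g., credit scoring, medical diagnosis)
credit_model = DecisionTreeClassifier(max_depth=5)

# Linear Regression: continuous outcomes (e.g., house prices)
price_model = LinearRegression()

# Neural Network: complex problems with big datasets (e.g., image recognition)
image_model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)

print(credit_model, price_model, image_model, sep="\n")
```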
Hyperparameter Tuning
After all this work, you now need to adjust how your chosen algorithm learns. At this stage, you are trying to improve how well your algorithm performs with your chosen features. This is an important step, as the optimal hyperparameters may depend on which features you have chosen, and hyperparameters can significantly influence the effectiveness of different algorithms.
Techniques like Grid Search and Random Search are used in hyperparameter tuning, so we need to look at them next.
Grid Search means testing all possible combinations of hyperparameters. For example, let's say you have a screw hole and a few boxes of screws. With Grid Search, you try every screw in every box to find the one that fits the hole.
Grid Search is comprehensive, but it can be very time-consuming and may need a lot of computational resources.
Random Search is, well, random. In the example above, instead of trying every single screw, you pick screws at random to see which one works best, or at least well enough. You are hoping to find a good enough setting without having to test all of the combinations. Needless to say, Random Search is faster than Grid Search.
Choosing value ranges for Random Search is a whole different topic altogether, but it basically involves understanding your model, prior knowledge from similar problems, and iteration.
Grid Search aims for perfection, while Random Search lives by the motto "Perfect is the enemy of good". Oh, by the way, you can use both - use Random Search to get close enough, and then fine-tune further with a Grid Search.
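Here is a hedged sketch of both searches with scikit-learn; the random-forest model and the parameter ranges are just illustrative stand-ins for whatever algorithm and hyperparameters you actually chose:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=42)
model = RandomForestClassifier(random_state=42)

# Grid Search: try every screw in every box (3 x 3 = 9 combinations, each cross-validated)
grid = GridSearchCV(
    model,
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10]},
    cv=5)
grid.fit(X, y)

# Random Search: try only 10 random combinations drawn from the ranges below
rand = RandomizedSearchCV(
    model,
    param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(3, 15)},
    n_iter=10, cv=5, random_state=42)
rand.fit(X, y)

print("grid best:  ", grid.best_params_)
print("random best:", rand.best_params_)
```

The "use both" idea above maps to running RandomizedSearchCV first to find a promising region, then a small GridSearchCV around it.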
After all that, it is time to minimize our costs - no, not the monetary kind, but what we call the cost function, which measures how wrong our predictions are. The lower the cost, the better our model's predictions.
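A very common cost function for regression is mean squared error. Here is a tiny sketch with made-up numbers, just to show what "measuring how wrong our predictions are" looks like:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5])   # what actually happened
y_pred = np.array([2.8, 5.4, 2.0])   # what our model predicted

# Mean squared error: the average of the squared prediction errors
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # the lower, the better
```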
To lower our cost function, we need to apply optimization algorithms. There are three main strategies to minimize your cost function. Let's use a programming analogy to explain them:
Imagine you are trying to optimize your program.
Gradient Descent is optimizing each line in the entire program. It is thorough but can be time-consuming.
Stochastic Gradient Descent (SGD) is where you randomly pick a class or function and optimize that only. When you are done, you select another one. This is a faster approach than Gradient Descent because you are making small and targeted changes, but it may be a while before you optimize the program as you may miss some parts entirely or may need to revisit the same classes multiple times.
Last, we have the Mini-Batch Gradient Descent, which is kind of the middle ground. You select a small, manageable group of classes at a time. This way you are not overwhelmed by the whole program and you can be strategic about the classes you choose.
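Here is a minimal sketch of Mini-Batch Gradient Descent on a toy linear-regression problem (NumPy only, synthetic data, MSE as the cost function). Setting batch_size to the size of the whole dataset turns it into plain Gradient Descent, and setting it to 1 turns it into SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1_000)

w = np.zeros(3)            # model parameters, learned from the data
lr, batch_size = 0.1, 32   # hyperparameters, fixed before training

for epoch in range(50):
    idx = rng.permutation(len(X))          # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        pred = X[batch] @ w
        # Gradient of the MSE cost, computed on this mini-batch only
        grad = 2 * X[batch].T @ (pred - y[batch]) / len(batch)
        w -= lr * grad

print(w)  # should land close to true_w = [2.0, -1.0, 0.5]
```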
Remember, we mentioned something called overfitting when we were talking about the validation set? It's time to take a look at it now:
Overfitting is when you develop a software solution that is so specific to one scenario that it does not fit any other scenario; it is very limited in usage and essentially useless outside that one case. In other words, it lacks the flexibility to adapt to new scenarios. It is the same in Machine Learning - an overfitted model performs brilliantly on the specific scenarios it was trained on but very poorly on new, unseen data.
If there is overfitting, there must be underfitting, you say - and you are right! Underfitting is the opposite: the solution you develop is so general that it is useful only in the most simplistic scenarios. In Machine Learning, this translates into a model that fails to recognize the underlying patterns and performs poorly on both the training data and new data.
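One practical way to see both problems is to compare training and validation scores as model complexity changes. A hedged sketch (scikit-learn, synthetic data, decision trees chosen only because their depth is an easy complexity knob):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_informative=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

for depth in (1, 5, None):  # None lets the tree grow until it memorizes the training data
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    print(f"depth={depth}: train={tree.score(X_train, y_train):.2f}, val={tree.score(X_val, y_val):.2f}")

# Typically: the shallow tree scores low on both sets (underfitting), while the
# unrestricted tree scores near 1.0 on training data but drops on validation data (overfitting).
```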
Going Beyond Basics: Not Hot Dog
Let's say you want to build an app that takes a photo of food and tells you if it is a hot dog or not a hot dog.
You want your app to correctly recognize hot dogs (or not hot dogs) across a variety of images, not just the ones you have tested with. So, you collect thousands of photos of hot dogs and not hot dogs.
To test how well your application works, you use something called cross-validation. You split your photos into groups, train your app on some of the groups, and test it on the others. You repeat this process several times, using a different group for testing each time. This gives you a much better picture of how accurately the app identifies hot dogs than a single split would.
Now, for a more systematic approach, you might divide your photos into, say, 5 equal groups. You use 4 groups to train your app and 1 group to test it, and you rotate which group is used for testing. The number of groups in this scenario is called k, so k = 5, and this method is called k-Fold Cross-Validation.
Since nothing is perfect, you probably notice that the distribution of hot dogs and not hot dogs in your photos is imbalanced. Maybe 75% of your photos are not-hot-dog images. You make sure that this imbalance is represented in each of your groups, so each group has 75% not-hot-dog pictures. This is called Stratified k-Fold Cross-Validation.
There is one more method, called Time-Series Cross-Validation. It does not really apply to our app, but let's say you wanted to add a feature to predict future trends in hot dog popularity based on image submissions over time. For this, you'd use time-series cross-validation, training your model on earlier data and testing it on subsequent data. This does not help with the basic functionality of the app; however, it may be useful for analyzing trends.
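All three splitters exist ready-made in scikit-learn. A compact sketch on synthetic, imbalanced data standing in for the hot dog photos (real images would first have to be turned into numeric features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit, cross_val_score

# Roughly 75% "not hot dog" vs 25% "hot dog", like the imbalance described above
X, y = make_classification(n_samples=400, weights=[0.75], random_state=42)
model = LogisticRegression(max_iter=1000)

print(cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42)))
print(cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42)))

# Only meaningful if the rows are ordered in time, e.g., submissions by date
print(cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5)))
```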
Feature Selection*
I wanted to explain feature selection at the end because I wanted to use the Not Hot Dog app we were talking about above.
Feature selection is basically making your app more efficient and accurate by focusing on features that help identify a hot dog, such as shape and color patterns, and by identifying and eliminating redundant features, such as image size or generic colors. By focusing on what really matters for deciding that an image shows a hot dog, our app identifies hot dogs (or not hot dogs) faster and more accurately.
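As a final hedged sketch (scikit-learn, with made-up numeric features standing in for cues like shape and color patterns; real photos would first need to be converted into such features), univariate feature selection keeps only the columns that actually help separate hot dog from not hot dog:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 20 numeric features, but only 4 of them genuinely carry "hot-dog-ness" information
X, y = make_classification(n_samples=500, n_features=20, n_informative=4, random_state=42)

# Keep the 4 features that best discriminate between the two classes
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)                      # (500, 20) -> (500, 4)
print("kept feature indices:", selector.get_support(indices=True))
```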