Machine Learning All-Star Selection Classifier


Introduction
Every year, like clockwork, the media explodes come All-Star Weekend because someone's favorite player didn't get chosen. Harsh media narratives and scant coverage hurt a player's marketability, which in turn hurts their chances of making the All-Star Game. That is why I created a machine learning model to determine, based on the data alone, who should have made the All-Star Game.
Data Collection
To get the data I used Basketball Reference, which has one of the cleanest datasets of per-game stats. A machine learning model can't handle empty data slots, and the dataset had empty values in some instances. For example, if a player didn't attempt any 3-pointers, their 3-point percentage had no value because you can't divide 0 by 0. To fix this, I filled in all the null values with 0s. Next, if a player got traded, I deleted their duplicate stats rows. Finally, for the first version of the model, the only features I dropped were the string features: player name, team, and awards. For the actual model, I augmented every row with an extra column: 1 if the player was an All-Star, 0 if they weren't.
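To make those steps concrete, here is a minimal sketch of the cleaning, assuming the per-game export has been loaded into a pandas DataFrame; the file name and the allStars list are placeholders, not the exact code from my repository.

import pandas as pd

# Hypothetical file name for the Basketball Reference per-game export
df = pd.read_csv('per_game_stats.csv')

# Percentages are blank when a player attempted 0 shots (you can't do 0/0), so fill nulls with 0
df = df.fillna(0)

# Traded players show up in multiple rows; keep only one row per player
df = df.drop_duplicates(subset='Player', keep='first')

# Placeholder list of that season's All-Star selections
allStars = ['Player A', 'Player B']

# Label column: 1 if the player was an All-Star, 0 otherwise
df['All Star'] = df['Player'].isin(allStars).astype(int)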
Models
Model 1: Complex Logistic Regression
For this first model, the first transformation I decided to make was to scale all of the data. Scoring 20 points is much less impressive than dishing 20 assists, but that won't be reflected in the output because all the model sees is 20 = 20. Luckily, scikit-learn has a very useful class for this called StandardScaler. StandardScaler converts each value into a Z-score, i.e., how many standard deviations it is from the average. This means that instead of comparing raw numbers, the model is comparing how impressive those numbers are relative to other players.
from sklearn.preprocessing import StandardScaler

# Separate the features from the All-Star label and drop the string columns
Xtrain = train.drop(columns=['Player', 'Team', 'All Star'])
ytrain = train['All Star']
Xtest = test.drop(columns=['Player', 'Team', 'All Star'])
ytest = test['All Star']

# Fit the scaler on the training data only, then apply the same scaling to the test data
scaler = StandardScaler()
XTrainScaled = scaler.fit_transform(Xtrain)
XTestScaled = scaler.transform(Xtest)
The next change I made to the model was to adjust the class weights. Because the majority of the sample data are non-All-Stars, the model doesn't pay much attention to what makes a player an All-Star. You can change how much attention the model pays to each class by changing the class weights, so I wrote a double loop to see which class weights produce the highest accuracy; here is the graph of the outputs.
Additionally, I made one for the F-score, which is the harmonic mean of precision and recall and better reflects how well the model identifies the rare All-Star class on the test data. (The graph title is supposed to say F-score.)
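For reference, here is a sketch of the kind of double loop I ran, assuming the scaled arrays from above; the weight range and max_iter value are assumptions rather than my exact settings.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Try every combination of small integer class weights and record test accuracy and F-score
results = {}
for w0 in range(1, 11):        # weight for non-All-Stars (class 0)
    for w1 in range(1, 11):    # weight for All-Stars (class 1)
        clf = LogisticRegression(class_weight={0: w0, 1: w1}, max_iter=1000)
        clf.fit(XTrainScaled, ytrain)
        preds = clf.predict(XTestScaled)
        results[(w0, w1)] = (accuracy_score(ytest, preds), f1_score(ytest, preds))

# A dictionary like this can then be graphed for both metrics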
To maximize both F-score and test accuracy, the weights I chose for this model are 2 for non-All-Stars and 3 for All-Stars. This led to the following accuracy for this version of the model.
I really liked these numbers and was about to use this as my final model, but before I did, I wanted to check the model's coefficients to see which variables mattered the most. Here are the coefficients for each variable.
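For anyone following along, the coefficients can be pulled out of the fitted model roughly like this; the weights match the ones chosen above, but the rest is an illustrative sketch rather than my exact code.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Fit with the chosen weights, then pair each coefficient with its feature name
model = LogisticRegression(class_weight={0: 2, 1: 3}, max_iter=1000)
model.fit(XTrainScaled, ytrain)

coefs = pd.Series(model.coef_[0], index=Xtrain.columns).sort_values()
print(coefs)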
Now, there are so many things wrong with this. To list a few: MP is negative; Age counts for more than BLK, AST, and 3P%; PTS is surprisingly low; TRB is positive, but ORB is negative; and so on. I realized the issue with this model, despite its fantastic accuracy, is confounding variables: many of the features carry overlapping information (FG, FGA, 2P, 3P, and PTS all describe scoring, for example), so the regression spreads credit across them and the individual coefficients stop making basketball sense. So my plan for making the model more interpretable was to get rid of most of those variables and only keep the unique ones.
Model 2: Simplified Logistic Regression
In this model I got rid of all the variables that could possibly be confounding, using my knowledge of how basketball works. One variable I initially dropped was minutes played, but that version of the model had a much lower accuracy than when I included it. The issue is that MP is correlated with every variable, because the more you are on the court, the more likely you are to put up box-score stats. So I made an engineered variable, impact, defined as (PTS + AST + STL - TOV) / MP, which kept the model's accuracy high without much confounding. Here are all the variables I dropped. I also scaled the data for the same reasons as in the first model.
# Engineered feature: box-score production per minute (replace 0 MP to avoid dividing by zero)
Xtrain['impact'] = (Xtrain['PTS'] + Xtrain['AST'] + Xtrain['STL'] - Xtrain['TOV']) / Xtrain['MP'].replace(0, 0.1)
Xtest['impact'] = (Xtest['PTS'] + Xtest['AST'] + Xtest['STL'] - Xtest['TOV']) / Xtest['MP'].replace(0, 0.1)

# Drop the overlapping scoring and volume columns, keeping only the unique ones
XtrainSimple = Xtrain.drop(columns=['Age','G','MP','FG','FGA','FG%','3P','3PA','3P%','2P','2PA','2P%','FT','FTA','FT%','ORB','DRB'])
XtestSimple = Xtest.drop(columns=['Age','G','MP','FG','FGA','FG%','3P','3PA','3P%','2P','2PA','2P%','FT','FTA','FT%','ORB','DRB'])

# Re-fit the scaler on the simplified feature set
XTrainSimpleScaled = scaler.fit_transform(XtrainSimple)
XTestSimpleScaled = scaler.transform(XtestSimple)
Next, I ran the same tests to see which class weights were optimal for this model, and I ended up with weight 1 for non-All-Stars and 1 for All-Stars (balanced weights). This was the output of the model.
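As a sketch, training and evaluating the simplified model looks something like this; classification_report is just one convenient way to print the metrics, not necessarily how I formatted the output.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Balanced class weights worked best for the simplified feature set
simpleModel = LogisticRegression(class_weight={0: 1, 1: 1}, max_iter=1000)
simpleModel.fit(XTrainSimpleScaled, ytrain)

print(classification_report(ytest, simpleModel.predict(XTestSimpleScaled)))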
Now, this accuracy is a bit worse than the last model's, but in terms of interpretability this model has a lot more basketball realism. Here are the model coefficients, visualized as a graph.
However, I wanted to try one more method to see if I could get an even more interpretable model.
Model 3: Random Forest
The reason I chose Random Forest as my next model is how directly it attributes each stat's contribution to a player being an All-Star. Here is a full, simple-to-understand explanation of how Random Forest works by Normalized Nerd. A bonus is that I don't have to scale the stats, because trees split on one feature at a time rather than assessing all the features on the same scale.
To optimize the model, I used the same method as in the last two models to find the class weights that maximize accuracy and F-score, which turned out to be 5 for non-All-Stars and 6 for All-Stars. Here is the output of the model.
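Here is a sketch of the forest with those weights, assuming the same simplified (unscaled) feature set; n_estimators and random_state are assumptions, not settings quoted from my code.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# No scaling needed: each tree splits on raw thresholds, one feature at a time
forest = RandomForestClassifier(n_estimators=100, class_weight={0: 5, 1: 6}, random_state=42)
forest.fit(XtrainSimple, ytrain)

preds = forest.predict(XtestSimple)
print(accuracy_score(ytest, preds), f1_score(ytest, preds))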
This model performs slightly worse in every category than Model 2, and I also don't like how big a factor randomness plays in it. But before I rule it out, I want to see which stats drive the most decisions about whether or not a player becomes an All-Star. Here is the output of all of those.
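That per-stat breakdown comes from the forest's feature importances, which can be read out roughly like this.

import pandas as pd

# Share of the forest's split decisions (weighted by impurity reduction) attributed to each stat
importances = pd.Series(forest.feature_importances_, index=XtrainSimple.columns)
print(importances.sort_values(ascending=False))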
Interestingly, I did not expect the two defensive stats (steals and blocks) to account for almost 20% of the forest's decisions, or points to land in the bottom four stats.
Choosing the Final Model
Each model has its upsides, but for the final model I want a combination of basketball realism and high accuracy. Although the Complex Logistic Regression model has the highest accuracy and F-score, it has too many confounding variables and therefore lacks basketball realism. Simplified Logistic Regression and Random Forest are both very interpretable and have decent basketball realism, so I simply chose the more accurate of the two. Of those two models, Simplified Logistic Regression performs slightly better in every category, so that is the final model I settled on.
Results
When I tested this model with data from the 2024-2025 regular season, 12 players had different outcomes than the actual All-Star selections. If you want to find out who those 12 players are, check out my YouTube video. Additionally, if you want all the source code for this project, here is my GitHub repository.
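For anyone curious how that comparison was set up, a rough sketch looks like this, assuming the 2024-25 per-game data was cleaned the same way (including the impact column and the actual All-Star labels) into a DataFrame I'll call df2025.

# Scale the new season with the scaler fit on the training data, then predict
X2025 = scaler.transform(df2025[XtrainSimple.columns])
df2025['Predicted'] = simpleModel.predict(X2025)

# Players where the model disagrees with the actual selections
print(df2025.loc[df2025['Predicted'] != df2025['All Star'], 'Player'])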
Written by

Adeel Syed
Hi! My name is Adeel Syed and I am currently a senior at Westlake High School. I am mainly a self-taught coder, but I have taken some Java classes at Westlake. Feel free to contact me at adeelsyed2004@gmail.com!