Comprehensive Guide to Model Training and Evaluation for Predicting Math Scores
In data science and machine learning projects, model training plays a crucial role in developing accurate predictive models. This article walks you through the entire process—from data import to model evaluation—using Python and popular machine-learning libraries.
Code
Dataset - https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977
Model Training
notebooks/
├── EDA.ipynb
└── ModelTrainig.ipynb “for model Training”
1. Data Import and Exploration
First, import the necessary packages and load the dataset with Pandas:
# Basic Import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Modelling
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import RandomizedSearchCV
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
import warnings
The imports above cover Pandas, NumPy, Matplotlib, Seaborn, the modelling libraries and the warnings module.
Import the CSV Data as Pandas DataFrame
df = pd.read_csv('../data/stud.csv')
Show Top 5 Records
df.head()
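Beyond looking at the first rows, a few quick checks help confirm the data is clean before modelling. The following is a minimal sketch on a tiny made-up stand-in for the dataset; the rows are invented, and the column names reading_score and writing_score are assumptions about the CSV rather than something shown above.

```python
import pandas as pd

# Tiny stand-in for the student dataset; the rows are made up for illustration.
sample = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "lunch": ["standard", "free/reduced", "standard", "standard"],
    "math_score": [72, 47, 90, 76],
    "reading_score": [72, 57, 95, 78],  # assumed column name
    "writing_score": [74, 44, 93, 75],  # assumed column name
})

n_rows, n_cols = sample.shape                  # size of the table
missing_per_column = sample.isna().sum()       # NaN count per column
n_duplicates = int(sample.duplicated().sum())  # fully duplicated rows
numeric_summary = sample.describe()            # count, mean, std, quartiles of numeric columns

print(n_rows, n_cols)
print(missing_per_column)
print(n_duplicates)
print(numeric_summary)
```

The same four calls can be run on df directly once the CSV is loaded.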
2. Data Preprocessing
After loading the data, it's crucial to preprocess it. This involves:
Handling Categorical Data: Using One-Hot Encoding for categorical variables.
Separating Features and Target: Identifying X (features) and y (target variable).
We will predict math_score here. You can try another target instead, such as the total or average score.
X = df.drop(columns=['math_score'])
print("Categories in 'gender' variable:", df['gender'].unique())
print("Categories in 'race_ethnicity' variable:", df['race_ethnicity'].unique())
print("Categories in 'parental level of education' variable:", df['parental_level_of_education'].unique())
print("Categories in 'lunch' variable:", df['lunch'].unique())
print("Categories in 'test preparation course' variable:", df['test_preparation_course'].unique())
Categories in 'gender' variable: ['female' 'male']
Categories in 'race_ethnicity' variable: ['group B' 'group C' 'group A' 'group D' 'group E']
Categories in 'parental level of education' variable: ["bachelor's degree" 'some college' "master's degree" "associate's degree" 'high school' 'some high school']
Categories in 'lunch' variable: ['standard' 'free/reduced']
Categories in 'test preparation course' variable: ['none' 'completed']
gender has 2 categories and race_ethnicity has 5. With so few features, each having only a handful of categories, one-hot encoding is a reasonable choice here. If there were many features with many categories, target-guided ordinal encoding would be an alternative.
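To make the two encoding options concrete, here is a small sketch on a made-up four-row table (the scores are invented for illustration). It shows one-hot encoding with pandas and one common way to do target-guided ordinal encoding, where categories are ranked by their mean target value.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard", "free/reduced"],
    "math_score": [80, 50, 70, 60],
})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(toy["lunch"], prefix="lunch", dtype=int)

# Target-guided ordinal encoding: rank categories by their mean target value,
# then replace each category with its rank.
mean_by_cat = toy.groupby("lunch")["math_score"].mean().sort_values()
rank_map = {cat: rank for rank, cat in enumerate(mean_by_cat.index)}
ordinal = toy["lunch"].map(rank_map)

print(one_hot)
print(rank_map)
print(ordinal.tolist())
```

Because the ranks are derived from the target, such a mapping would normally be learned on the training split only and then applied to the test split.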
y = df['math_score']
y
0      72
1      69
2      90
3      47
4      76
       ..
995    88
996    62
997    59
998    68
999    77
Name: math_score, Length: 1000, dtype: int64
Now we will build a ColumnTransformer that:
Separates the numerical and categorical features
Applies OneHotEncoder to the categorical features
Applies StandardScaler to the numerical features
# Create a ColumnTransformer with two transformers
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

num_features = X.select_dtypes(exclude="object").columns
cat_features = X.select_dtypes(include="object").columns

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder()

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features),
    ]
)

X = preprocessor.fit_transform(X)
X.shape
(1000, 19)
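The 19 columns can be accounted for by hand: one column per category from the unique() output earlier, plus the numeric features that pass through the scaler. The sketch below does that arithmetic. The count of 2 numeric columns is inferred from the shape (19 minus the 17 one-hot columns) and presumably corresponds to the reading and writing scores.

```python
# Category counts taken from the unique() output shown earlier.
categories_per_feature = {
    "gender": 2,
    "race_ethnicity": 5,
    "parental_level_of_education": 6,
    "lunch": 2,
    "test_preparation_course": 2,
}
# Numeric columns left in X after dropping math_score (inferred from the shape).
n_numeric = 2

n_one_hot = sum(categories_per_feature.values())  # one column per category
n_total = n_one_hot + n_numeric

print(n_one_hot, n_total)
```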
3. Train-Test Split
Splitting the dataset into training and testing sets is essential to evaluate model performance accurately:
# separate dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train.shape, X_test.shape
((800, 19), (200, 19))
This leaves 800 records for training and 200 for testing.
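One thing to be aware of: in the steps above the preprocessor is fitted on the full dataset before the split, so the scaler's mean and variance also reflect the test rows. A common alternative is to split the raw data first and wrap the preprocessing and the model in a scikit-learn Pipeline, so everything is fitted on the training rows only. The following is a sketch of that variant on synthetic data; the toy columns and the formula for the target are invented for illustration and are not the article's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (invented for illustration).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "lunch": rng.choice(["standard", "free/reduced"], size=n),
    "reading_score": rng.integers(30, 100, size=n),
})
toy["math_score"] = (
    0.9 * toy["reading_score"]
    + np.where(toy["lunch"] == "standard", 5, 0)
    + rng.normal(0, 3, size=n)
)

X_raw = toy.drop(columns=["math_score"])
y_toy = toy["math_score"]

# Split the raw data first ...
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_toy, test_size=0.2, random_state=42)

# ... then let the pipeline fit the encoder and scaler on the training rows only.
pre = ColumnTransformer([
    ("OneHotEncoder", OneHotEncoder(handle_unknown="ignore"), ["lunch"]),
    ("StandardScaler", StandardScaler(), ["reading_score"]),
])
pipe = Pipeline([("preprocess", pre), ("model", LinearRegression())])
pipe.fit(X_tr, y_tr)

test_r2 = pipe.score(X_te, y_te)  # R2 on the held-out rows
print(X_tr.shape, X_te.shape, round(test_r2, 3))
```

On a dataset this small and simple the difference in scores is likely to be minor, but the pipeline form also makes it easier to apply the same preprocessing to new data later.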
4. Model Selection and Training
Next, define an evaluation helper, then fit a range of regressors on the training set and score each one on both the training and test sets:
def evaluate_model(true, predicted):
    """Return MAE, RMSE and R2 for a set of predictions."""
    mae = mean_absolute_error(true, predicted)
    rmse = np.sqrt(mean_squared_error(true, predicted))
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square

models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest Regressor": RandomForestRegressor(),
    "XGBRegressor": XGBRegressor(),
    "CatBoosting Regressor": CatBoostRegressor(verbose=False),
    "AdaBoost Regressor": AdaBoostRegressor()
}
model_list = []
r2_list = []

for name, model in models.items():
    model.fit(X_train, y_train)  # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate on the train and test sets
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(name)
    model_list.append(name)

    print('Model performance for Training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))

    print('----------------------------------')

    print('Model performance for Test set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))
    r2_list.append(model_test_r2)

    print('='*35)
    print('\n')
Logs:
Linear Regression
Model performance for Training set
- Root Mean Squared Error: 5.3243
- Mean Absolute Error: 4.2671
- R2 Score: 0.8743
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 5.3960
- Mean Absolute Error: 4.2158
- R2 Score: 0.8803
===================================
Lasso
Model performance for Training set
- Root Mean Squared Error: 6.5938
- Mean Absolute Error: 5.2063
- R2 Score: 0.8071
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 6.5197
- Mean Absolute Error: 5.1579
- R2 Score: 0.8253
===================================
Ridge
Model performance for Training set
- Root Mean Squared Error: 5.3233
- Mean Absolute Error: 4.2650
- R2 Score: 0.8743
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 5.3904
- Mean Absolute Error: 4.2111
- R2 Score: 0.8806
===================================
K-Neighbors Regressor
Model performance for Training set
- Root Mean Squared Error: 5.7077
- Mean Absolute Error: 4.5167
- R2 Score: 0.8555
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 7.2530
- Mean Absolute Error: 5.6210
- R2 Score: 0.7838
===================================
Decision Tree
Model performance for Training set
- Root Mean Squared Error: 0.2795
- Mean Absolute Error: 0.0187
- R2 Score: 0.9997
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 7.6371
- Mean Absolute Error: 6.0250
- R2 Score: 0.7603
===================================
Random Forest Regressor
Model performance for Training set
- Root Mean Squared Error: 2.2851
- Mean Absolute Error: 1.8253
- R2 Score: 0.9768
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 6.0959
- Mean Absolute Error: 4.7194
- R2 Score: 0.8473
===================================
XGBRegressor
Model performance for Training set
- Root Mean Squared Error: 0.9087
- Mean Absolute Error: 0.6148
- R2 Score: 0.9963
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 6.5889
- Mean Absolute Error: 5.0844
- R2 Score: 0.8216
===================================
CatBoosting Regressor
Model performance for Training set
- Root Mean Squared Error: 3.0427
- Mean Absolute Error: 2.4054
- R2 Score: 0.9589
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 6.0086
- Mean Absolute Error: 4.6125
- R2 Score: 0.8516
===================================
AdaBoost Regressor
Model performance for Training set
- Root Mean Squared Error: 5.7843
- Mean Absolute Error: 4.7564
- R2 Score: 0.8516
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 6.0447
- Mean Absolute Error: 4.6813
- R2 Score: 0.8498
===================================
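The three metrics in the logs above can be computed by hand, which helps when reading them. The sketch below does so with NumPy and checks the results against scikit-learn. The actual values are the first four math scores from the dataset; the predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([72.0, 69.0, 90.0, 47.0])  # first four math scores
y_hat = np.array([70.0, 72.0, 85.0, 50.0])   # made-up predictions

errors = y_true - y_hat                          # [2, -3, 5, -3]
mae = np.mean(np.abs(errors))                    # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # square root of the mean squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot                         # share of variance explained

print(mae, rmse, r2)

# The hand-computed values agree with scikit-learn's implementations.
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_hat)))
assert np.isclose(r2, r2_score(y_true, y_hat))
```

MAE and RMSE are in the same units as the target (score points), and RMSE penalises large misses more heavily. R2 is unitless and compares the model against always predicting the mean.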
5. Model Evaluation and Comparison
After training and evaluating multiple models, you can compare their performance using metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R2) score:
pd.DataFrame(list(zip(model_list, r2_list)), columns=['Model Name', 'R2_Score']).sort_values(by=["R2_Score"],ascending=False)
   Model Name                R2_Score
2  Ridge                     0.880593
0  Linear Regression         0.880345
7  CatBoosting Regressor     0.851632
8  AdaBoost Regressor        0.849847
5  Random Forest Regressor   0.847291
1  Lasso                     0.825320
6  XGBRegressor              0.821589
3  K-Neighbors Regressor     0.783813
4  Decision Tree             0.760313
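The ranking above comes from a single 80/20 split, so small gaps between models may not be stable. One way to get a steadier comparison is k-fold cross-validation. The following sketch uses scikit-learn's cross_val_score on synthetic regression data (generated with make_regression, not the student dataset) and compares just two of the models.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a linear signal, for illustration only.
X_syn, y_syn = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "Ridge": Ridge(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}

cv_results = {}
for name, model in candidates.items():
    # One R2 score per fold; each fold is held out once.
    scores = cross_val_score(model, X_syn, y_syn, cv=cv, scoring="r2")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R2 = {scores.mean():.4f}, std = {scores.std():.4f}")
```

Reporting the mean together with the standard deviation across folds shows whether a difference between two models is larger than the fold-to-fold noise.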
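Linear Regression
Ridge and Linear Regression are practically tied on the test set (R2 of 0.8806 versus 0.8803). The final model here is a plain Linear Regression. Note that the reported figure is the R2 score expressed as a percentage, not a classification accuracy.
lin_model = LinearRegression(fit_intercept=True)
lin_model = lin_model.fit(X_train, y_train)
y_pred = lin_model.predict(X_test)
score = r2_score(y_test, y_pred) * 100
print("R2 score of the model is %.2f%%" % score)
R2 score of the model is 88.03%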
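RandomizedSearchCV is imported at the top but all models are trained with default settings. As a possible next step, here is a minimal sketch of how it could be used to tune the regularisation strength of Ridge. The data is synthetic and the search range for alpha is an arbitrary choice for illustration, not a recommendation from the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, for illustration only.
X_syn, y_syn = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)

# Candidate alpha values on a log scale (arbitrary illustrative range).
param_distributions = {"alpha": np.logspace(-3, 2, 20)}

search = RandomizedSearchCV(
    Ridge(),
    param_distributions=param_distributions,
    n_iter=10,          # number of sampled settings
    cv=5,               # 5-fold cross-validation per setting
    scoring="r2",
    random_state=42,
)
search.fit(X_syn, y_syn)

best_alpha = search.best_params_["alpha"]
print(best_alpha, search.best_score_)
```

The same pattern applies to the tree-based models, where parameters such as the tree depth or the number of estimators tend to matter more.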
6. Model Visualization and Interpretation
Visualizing model predictions and actual values helps in understanding model performance:
Plot y_pred and y_test
plt.scatter(y_test,y_pred);
plt.xlabel('Actual');
plt.ylabel('Predicted');
sns.regplot(x=y_test,y=y_pred,ci=None,color ='red');
Difference between Actual and Predicted Values
pred_df=pd.DataFrame({'Actual Value':y_test,'Predicted Value':y_pred,'Difference':y_test-y_pred})
pred_df
Actual Value Predicted Value Difference
521 91 76.507812 14.492188
737 53 58.953125 -5.953125
740 80 76.960938 3.039062
660 74 76.757812 -2.757812
411 84 87.539062 -3.539062
... ... ... ...
408 52 43.546875 8.453125
332 62 62.031250 -0.031250
208 74 67.976562 6.023438
613 65 67.132812 -2.132812
78 61 62.492188 -1.492188
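A few summary numbers make the Difference column easier to read than scanning rows. The sketch below rebuilds only the first five rows of the table above and computes the mean residual, the largest miss, and the share of predictions within 5 points. The 5-point threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# First five rows of the table above.
pred_sample = pd.DataFrame(
    {
        "Actual Value": [91, 53, 80, 74, 84],
        "Predicted Value": [76.507812, 58.953125, 76.960938, 76.757812, 87.539062],
    },
    index=[521, 737, 740, 660, 411],
)
pred_sample["Difference"] = pred_sample["Actual Value"] - pred_sample["Predicted Value"]

abs_diff = pred_sample["Difference"].abs()
mean_residual = pred_sample["Difference"].mean()  # near 0 means no systematic over/under-prediction
largest_miss_idx = abs_diff.idxmax()              # row label of the worst prediction
within_5 = (abs_diff <= 5).mean()                 # fraction of predictions within 5 points

print(mean_residual, largest_miss_idx, within_5)
```

Running the same three lines on the full pred_df gives a quick view of bias and of how often the model is off by more than a chosen tolerance.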
Conclusion: Key Takeaways
Data Preparation Matters: Effective preprocessing techniques such as One-Hot Encoding and Standard Scaling are crucial for optimizing model performance and handling categorical data appropriately.
Diverse Model Selection: Experimenting with a variety of models—from linear regression to ensemble methods like Random Forests and boosting algorithms—provides insights into which approach best suits the predictive task at hand.
Evaluation Metrics Guide Decision-Making: Metrics like RMSE, MAE, and R2 scores offer quantitative measures of model accuracy and help in selecting the most reliable model for predicting math scores.
Practical Insights through Visualization: Visualizing predictions versus actual values enhances understanding of model behavior and aids in interpreting results effectively.
Continuous Improvement: Iteratively testing and refining models based on evaluation results is essential for improving predictive accuracy and ensuring robust performance in real-world applications.
By leveraging these principles, data scientists can confidently navigate the complexities of model training for educational data, aiming for more accurate predictions and actionable insights.
Written by Rahul Saini