Guide to Model Training for Predicting Math Scores: Techniques and Eva

In data science and machine learning projects, model training plays a crucial role in developing accurate predictive models. This article walks you through the entire process—from data import to model evaluation—using Python and popular machine-learning libraries.

Code

https://github.com/RahulSainy/mlprojects/blob/dev/notebook/ModelTrainig.ipynb

Dataset - https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977

Model Training

notebooks/
├── EDA.ipynb
└── ModelTrainig.ipynb   (model training)

1. Data Import and Exploration

First, import the packages used throughout the notebook. The dataset itself is loaded with Pandas in the next step:

# Basic imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Modelling
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import RandomizedSearchCV
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
import warnings

Import the CSV Data as a Pandas DataFrame

With Pandas, NumPy, Matplotlib, Seaborn, and the warnings library imported, read the CSV file:

df = pd.read_csv('../data/stud.csv')

Show Top 5 Records

df.head()

https://gist.github.com/RahulSainy/07cf0ba776df30b2682daec7a2aab8db
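Beyond `df.head()`, a few quick checks can confirm that the data loaded as expected. The sketch below is only illustrative: it uses a tiny hand-made DataFrame (an assumption, standing in for `stud.csv`) with column names matching those used in this article, and the row values are made up.

```python
import pandas as pd

# Hypothetical stand-in for stud.csv: same column names, made-up rows.
sample = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "race_ethnicity": ["group B", "group C", "group A", "group D"],
    "parental_level_of_education": ["bachelor's degree", "some college",
                                    "high school", "master's degree"],
    "lunch": ["standard", "free/reduced", "standard", "standard"],
    "test_preparation_course": ["none", "completed", "none", "completed"],
    "math_score": [72, 69, 90, 47],
    "reading_score": [72, 90, 95, 57],
    "writing_score": [74, 88, 93, 44],
})

n_rows, n_cols = sample.shape                   # number of rows and columns
missing_per_column = sample.isna().sum()        # missing values per column
n_duplicates = int(sample.duplicated().sum())   # fully duplicated rows
numeric_summary = sample.describe()             # count, mean, std, quartiles (numeric columns)

print(n_rows, n_cols)
print(missing_per_column)
print(n_duplicates)
print(numeric_summary)
```

On the real dataset the same calls would be run on `df` directly.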

2. Data Preprocessing

After loading the data, it's crucial to preprocess it. This involves:

Handling Categorical Data: Using One-Hot Encoding for categorical variables.
Separating Features and Target: Identifying X (features) and y (target variable).

We will predict math_score here; you could choose another target instead, such as a total or average score.

X = df.drop(columns=['math_score'])

print("Categories in 'gender' variable:     ", end=" ")
print(df['gender'].unique())

print("Categories in 'race_ethnicity' variable:  ", end=" ")
print(df['race_ethnicity'].unique())

print("Categories in 'parental level of education' variable:", end=" ")
print(df['parental_level_of_education'].unique())

print("Categories in 'lunch' variable:     ", end=" ")
print(df['lunch'].unique())

print("Categories in 'test preparation course' variable:     ", end=" ")
print(df['test_preparation_course'].unique())

Categories in 'gender' variable: ['female' 'male']
Categories in 'race_ethnicity' variable: ['group B' 'group C' 'group A' 'group D' 'group E']
Categories in 'parental level of education' variable: ["bachelor's degree" 'some college' "master's degree" "associate's degree" 'high school' 'some high school']
Categories in 'lunch' variable: ['standard' 'free/reduced']
Categories in 'test preparation course' variable: ['none' 'completed']

gender has 2 categories and race_ethnicity has 5. Because there are only a few categorical features, each with only a few categories, one-hot encoding is practical here. With many features or many categories per feature, target-guided ordinal encoding would be an alternative.
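Target-guided ordinal encoding is not used in this article, so the following is only a minimal sketch of the idea on made-up rows: each category is replaced by its rank when categories are ordered by the mean of the target. The DataFrame `toy` and the column `education_encoded` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Made-up rows; the column names follow the dataset, the values are illustrative.
toy = pd.DataFrame({
    "parental_level_of_education": ["high school", "master's degree",
                                    "high school", "some college",
                                    "master's degree", "some college"],
    "math_score": [50, 80, 60, 70, 90, 65],
})

# Mean target per category, sorted from lowest to highest.
means = (
    toy.groupby("parental_level_of_education")["math_score"]
    .mean()
    .sort_values()
)

# Map each category to its rank in that ordering (0 = lowest mean score).
rank_map = {category: rank for rank, category in enumerate(means.index)}
toy["education_encoded"] = toy["parental_level_of_education"].map(rank_map)

print(rank_map)
```

Because the mapping is derived from the target, it would normally be computed on the training rows only and then applied to the test rows.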

y = df['math_score']
y

0      72
1      69
2      90
3      47
4      76
       ..
995    88
996    62
997    59
998    68
999    77
Name: math_score, Length: 1000, dtype: int64

Next, we build a ColumnTransformer that:

Separates the numerical features from the categorical ones
Applies OneHotEncoder to the categorical features
Applies StandardScaler to the numerical features

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# Split the column names by dtype: numeric vs. categorical (object)
num_features = X.select_dtypes(exclude="object").columns
cat_features = X.select_dtypes(include="object").columns

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder()

# Create a ColumnTransformer with two transformers
preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features),
    ]
)

X = preprocessor.fit_transform(X)
X.shape

(1000, 19)

3. Train-Test Split

Splitting the dataset into training and testing sets is essential to evaluate model performance accurately:

# separate dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train.shape, X_test.shape

((800, 19), (200, 19))

The training set now has 800 records and the test set has 200.
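Note that this article fits the preprocessor on the full dataset before splitting, so the scaler's statistics are computed with the test rows included. One common alternative, sketched here on synthetic data rather than the real dataset, is to split the raw features first and wrap the preprocessor and model in a Pipeline, so that fitting only ever sees the training rows. The variable names and the `handle_unknown="ignore"` setting are choices made for this illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; not the real dataset.
rng = np.random.default_rng(42)
n = 100
toy = pd.DataFrame({
    "lunch": rng.choice(["standard", "free/reduced"], size=n),
    "reading_score": rng.integers(40, 100, size=n),
})
toy["math_score"] = toy["reading_score"] + rng.normal(0, 5, size=n)

X_raw = toy.drop(columns=["math_score"])
y_toy = toy["math_score"]

# Split the raw (untransformed) features first...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_toy, test_size=0.2, random_state=42
)

# ...then let the pipeline fit the encoder and scaler on the training rows only.
pipe = Pipeline([
    ("preprocess", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["lunch"]),
        ("scale", StandardScaler(), ["reading_score"]),
    ])),
    ("model", Ridge()),
])
pipe.fit(X_tr, y_tr)

test_r2 = pipe.score(X_te, y_te)  # R2 on the held-out rows
print(X_tr.shape, X_te.shape)
```

With this layout, `pipe.predict` applies the training-set encoding and scaling to any new rows automatically.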

4. Model Selection and Training

Selecting appropriate models and training them on the training set:

def evaluate_model(true, predicted):
    """Return MAE, RMSE, and R2 for a set of predictions."""
    mae = mean_absolute_error(true, predicted)
    rmse = np.sqrt(mean_squared_error(true, predicted))
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square


models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest Regressor": RandomForestRegressor(),
    "XGBRegressor": XGBRegressor(),
    "CatBoosting Regressor": CatBoostRegressor(verbose=False),
    "AdaBoost Regressor": AdaBoostRegressor()
}
model_list = []
r2_list = []

for name, model in models.items():
    model.fit(X_train, y_train)  # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate on the train and test sets
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(name)
    model_list.append(name)

    print('Model performance for Training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))

    print('----------------------------------')

    print('Model performance for Test set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))
    r2_list.append(model_test_r2)  # Keep the test R2 for the comparison table

    print('='*35)
    print('\n')

Logs:

Linear Regression
Model performance for Training set
- Root Mean Squared Error: 5.3243
- Mean Absolute Error: 4.2671
- R2 Score: 0.8743
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 5.3960
- Mean Absolute Error: 4.2158
- R2 Score: 0.8803
===================================


Lasso
Model performance for Training set
- Root Mean Squared Error: 6.5938
- Mean Absolute Error: 5.2063
- R2 Score: 0.8071
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 6.5197
- Mean Absolute Error: 5.1579
- R2 Score: 0.8253
===================================


Ridge
Model performance for Training set
- Root Mean Squared Error: 5.3233
- Mean Absolute Error: 4.2650
- R2 Score: 0.8743
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 5.3904
- Mean Absolute Error: 4.2111
- R2 Score: 0.8806
===================================


K-Neighbors Regressor
Model performance for Training set
- Root Mean Squared Error: 5.7077
- Mean Absolute Error: 4.5167
- R2 Score: 0.8555
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 7.2530
- Mean Absolute Error: 5.6210
- R2 Score: 0.7838
===================================


Decision Tree
Model performance for Training set
- Root Mean Squared Error: 0.2795
- Mean Absolute Error: 0.0187
- R2 Score: 0.9997
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 7.6371
- Mean Absolute Error: 6.0250
- R2 Score: 0.7603
===================================


Random Forest Regressor
Model performance for Training set
- Root Mean Squared Error: 2.2851
- Mean Absolute Error: 1.8253
- R2 Score: 0.9768
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 6.0959
- Mean Absolute Error: 4.7194
- R2 Score: 0.8473
===================================


XGBRegressor
Model performance for Training set
- Root Mean Squared Error: 0.9087
- Mean Absolute Error: 0.6148
- R2 Score: 0.9963
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 6.5889
- Mean Absolute Error: 5.0844
- R2 Score: 0.8216
===================================


CatBoosting Regressor
Model performance for Training set
- Root Mean Squared Error: 3.0427
- Mean Absolute Error: 2.4054
- R2 Score: 0.9589
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 6.0086
- Mean Absolute Error: 4.6125
- R2 Score: 0.8516
===================================


AdaBoost Regressor
Model performance for Training set
- Root Mean Squared Error: 5.7843
- Mean Absolute Error: 4.7564
- R2 Score: 0.8516
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 6.0447
- Mean Absolute Error: 4.6813
- R2 Score: 0.8498
===================================
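To make the three reported metrics concrete, the sketch below computes MAE, RMSE, and R2 by hand with NumPy on four made-up score pairs and compares them with the scikit-learn functions used in `evaluate_model`. The arrays `actual` and `predicted` are illustrative values, not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted scores.
actual = np.array([70.0, 80.0, 90.0, 60.0])
predicted = np.array([72.0, 78.0, 85.0, 65.0])

errors = actual - predicted

mae = np.mean(np.abs(errors))                    # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # root of the mean squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # share of variance explained

# The hand-computed values should agree with scikit-learn.
print(np.isclose(mae, mean_absolute_error(actual, predicted)))
print(np.isclose(rmse, np.sqrt(mean_squared_error(actual, predicted))))
print(np.isclose(r2, r2_score(actual, predicted)))
```

MAE and RMSE are in the same units as the target (score points), while R2 is unitless. RMSE penalizes large misses more heavily than MAE does.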

5. Model Evaluation and Comparison

After training and evaluating multiple models, you can compare their performance using metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R2). Here the models are ranked by their test-set R2 score:

pd.DataFrame(list(zip(model_list, r2_list)), columns=['Model Name', 'R2_Score']).sort_values(by=["R2_Score"],ascending=False)

   Model Name                R2_Score
2  Ridge                     0.880593
0  Linear Regression         0.880345
7  CatBoosting Regressor     0.851632
8  AdaBoost Regressor        0.849847
5  Random Forest Regressor   0.847291
1  Lasso                     0.825320
6  XGBRegressor              0.821589
3  K-Neighbors Regressor     0.783813
4  Decision Tree             0.760313
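All models above were trained with default hyperparameters. The notebook imports `RandomizedSearchCV` but this article does not show it in use, so the following is only a sketch of how it might tune Ridge's `alpha`. It runs on synthetic data standing in for `X_train` and `y_train`; the candidate grid, `n_iter`, and `cv` values are illustrative assumptions, not settings from the original notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for X_train / y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(0, 0.5, size=200)

# Candidate alphas on a log scale (an illustrative grid, not a recommendation).
param_distributions = {"alpha": list(np.logspace(-3, 2, 20))}

search = RandomizedSearchCV(
    Ridge(),
    param_distributions=param_distributions,
    n_iter=10,         # number of sampled settings
    cv=5,              # 5-fold cross-validation
    scoring="r2",
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))
```

Because the score is averaged over cross-validation folds, it is less dependent on a single train-test split than the table above.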

Linear Regression

lin_model = LinearRegression(fit_intercept=True)
lin_model = lin_model.fit(X_train, y_train)
y_pred = lin_model.predict(X_test)
score = r2_score(y_test, y_pred)*100

print(" Accuracy of the model is %.2f" %score)

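Accuracy of the model is 88.03

Note that "accuracy" here is the R2 score expressed as a percentage (0.8803 × 100), not a classification accuracy.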

6. Model Visualization and Interpretation

Visualizing model predictions and actual values helps in understanding model performance:

Plot y_pred and y_test

plt.scatter(y_test,y_pred);
plt.xlabel('Actual');
plt.ylabel('Predicted');

sns.regplot(x=y_test,y=y_pred,ci=None,color ='red');

Difference between Actual and Predicted Values

pred_df=pd.DataFrame({'Actual Value':y_test,'Predicted Value':y_pred,'Difference':y_test-y_pred})
pred_df

     Actual Value  Predicted Value  Difference
521            91        76.507812   14.492188
737            53        58.953125   -5.953125
740            80        76.960938    3.039062
660            74        76.757812   -2.757812
411            84        87.539062   -3.539062
..            ...              ...         ...
408            52        43.546875    8.453125
332            62        62.031250   -0.031250
208            74        67.976562    6.023438
613            65        67.132812   -2.132812
78             61        62.492188   -1.492188
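A table of differences is easier to read once it is summarized. The sketch below rebuilds only the ten rows printed above (not the full 200-row test set) and computes the signed mean error, the mean absolute error, and the row with the largest miss. `pred_sample` is a hypothetical name for this illustration; on the real data the same calls would be applied to `pred_df`.

```python
import pandas as pd

# The ten rows shown in the table above (values copied from the article).
pred_sample = pd.DataFrame(
    {
        "Actual Value": [91, 53, 80, 74, 84, 52, 62, 74, 65, 61],
        "Predicted Value": [76.507812, 58.953125, 76.960938, 76.757812,
                            87.539062, 43.546875, 62.031250, 67.976562,
                            67.132812, 62.492188],
    },
    index=[521, 737, 740, 660, 411, 408, 332, 208, 613, 78],
)
pred_sample["Difference"] = (
    pred_sample["Actual Value"] - pred_sample["Predicted Value"]
)

# Signed mean: positive means the model under-predicts on average.
mean_error = pred_sample["Difference"].mean()
# Typical size of a miss, ignoring direction.
mean_abs_error = pred_sample["Difference"].abs().mean()
# Index label of the row with the largest absolute miss.
worst = pred_sample["Difference"].abs().idxmax()

print(round(mean_error, 4), round(mean_abs_error, 4), worst)
```

In this small sample the largest miss is row 521, where the model under-predicts by about 14.5 points.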

Conclusion: Key Takeaways

Data Preparation Matters: Effective preprocessing techniques such as One-Hot Encoding and Standard Scaling are crucial for optimizing model performance and handling categorical data appropriately.
Diverse Model Selection: Experimenting with a variety of models—from linear regression to ensemble methods like Random Forests and boosting algorithms—provides insights into which approach best suits the predictive task at hand.
Evaluation Metrics Guide Decision-Making: Metrics like RMSE, MAE, and R2 scores offer quantitative measures of model accuracy and help in selecting the most reliable model for predicting math scores.
Practical Insights through Visualization: Visualizing predictions versus actual values enhances understanding of model behavior and aids in interpreting results effectively.
Continuous Improvement: Iteratively testing and refining models based on evaluation results is essential for improving predictive accuracy and ensuring robust performance in real-world applications.

By leveraging these principles, data scientists can confidently navigate the complexities of model training for educational data, aiming for more accurate predictions and actionable insights.
