Implementing a neural network using Keras for NCAA college basketball game data.

Nitin Sharma

In this blog, we embark on an exciting journey as data scientists aiming to predict the outcomes of NCAA college basketball games. Our primary objective is to analyze multiple years' worth of game results, process this data carefully, and use it to train a neural network for accurate predictions.

Our overarching goal is to develop a machine learning model that offers us a competitive advantage in predicting game results. Throughout this project, we will navigate the entire life cycle of a machine learning endeavor, which includes:

  1. Designing a Neural Network: We will create a robust neural network architecture using Keras, leveraging its powerful capabilities for building and training models.

  2. Training, Testing, and Validation: We will initiate a comprehensive training process, followed by rigorous testing and validation phases to ensure our model's accuracy and reliability.

By undertaking these steps, we aim to deliver a machine learning project that not only deepens our understanding of data-driven predictions but also equips us with the tools to make informed predictions in the realm of college basketball.

Data for this project can be downloaded from Games-Calculated.csv.

Columns are:

  1. Date of the game

  2. Home Team

  3. Home Team’s Score

  4. Away Team

  5. Away Team’s Score

  6. Home Team’s Offensive average (points scored) while at home

  7. Home Team’s Defensive average (points given up) while at home

  8. Away Team’s Offensive average while away

  9. Away Team’s Defensive average while away

  10. Score difference from the home team’s perspective

Our primary objective is to thoroughly clean and preprocess the dataset by identifying and rectifying any inconsistencies or errors. This includes standardizing formats, removing duplicates, and addressing missing values to ensure the data is reliable. Additionally, we will identify and eliminate any unnecessary columns that do not contribute to our analytical goals. By refining the dataset in this way, we aim to create a streamlined, normalized version that enhances the learning process, ensuring that the insights derived from our analysis are meaningful and actionable.
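As a rough sketch of what these checks might look like in Pandas (this helper is hypothetical and not part of the original notebook), something along these lines would standardize the date format, drop duplicate rows, and remove rows with missing values:

import pandas as pd

def clean_games(df):
    # Hypothetical cleaning helper, for illustration only.
    cleaned = df.copy()
    # Standardize the date column so every row parses consistently.
    cleaned['Date'] = pd.to_datetime(cleaned['Date'], errors='coerce')
    # Remove exact duplicate rows (e.g. a game recorded twice).
    cleaned = cleaned.drop_duplicates()
    # Drop rows with missing scores or averages rather than guessing values.
    cleaned = cleaned.dropna()
    return cleaned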

For this project, I utilized Google Colab as my development environment. To begin, I mounted my Google Drive to access the data files stored there. This step is crucial as it allows me to work with the dataset directly from my Drive while taking advantage of Colab's computational resources.

Mount the drive to load data

from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Load the CSV games file

game_file = '/content/drive/MyDrive/Colab_Notebooks/live_project/game/Games-Calculated.csv'

Import the libraries we will use in this project

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt

Specify the column names for the dataset and then load the CSV file into a Pandas DataFrame for further analysis.

column_names = ['Date','HomeTeam','HomeScore','AwayTeam','AwayScore',
                'HomeScoreAverage','HomeDefenseAverage','AwayScoreAverage','AwayDefenseAverage',
                'Result']
data = pd.read_csv(game_file,names=column_names)

Let's look at the first two rows

data.head(2)
         Date          HomeTeam  HomeScore       AwayTeam  AwayScore  HomeScoreAverage  HomeDefenseAverage  AwayScoreAverage  AwayDefenseAverage  Result
0  2015-11-13            Hawaii         87  Montana State         76              87.0                76.0              76.0                87.0      11
1  2015-11-13  Eastern Michigan         70        Vermont         50              70.0                50.0              50.0                70.0      20

We will eliminate the columns that are unnecessary for our training process to streamline the dataset and improve the efficiency of our model.

updated_data=data.drop(['Date','HomeTeam','HomeScore','AwayTeam','AwayScore'], axis=1)
updated_data.shape
(20160, 5)

So we have 20,160 records and 5 columns: 4 features plus the target label.

Splitting the train and test data

We need to divide the dataset into two parts using an 80:20 ratio: 80% of the data will be used to train the model, and the remaining 20% will be held back to test its performance. Since the data is already in a Pandas DataFrame, we can use DataFrame.sample to draw a random 80% of the rows for training and then drop those rows to form the test set. This ensures that both the training and testing sets are representative samples of the original dataset.

trainX=updated_data.sample(frac=0.8,random_state=0)
testX=updated_data.drop(trainX.index)
trainX.shape
(16128, 5)
testX.shape
(4032, 5)

Currently, we have divided our dataset into two parts: 80% of the data will be used for training our model, while the remaining 20% will serve as the test dataset. Our next step is to create the target variables for both the training and test sets.

trainY=trainX.pop('Result')
testY=testX.pop('Result')

Normalizing the data

We will apply data normalization techniques to both the training and testing datasets. Specifically, we will implement z-score standardization, which involves converting our data into a standard format. This process will ensure that each feature has a mean of zero and a standard deviation of one, allowing for a more accurate comparison across different scales and distributions. By doing this, we aim to enhance the performance of our machine learning models and improve the overall predictive accuracy.

def z_score_standardization(df):
    df_scaled = df.copy()
    for column in df.columns:
        df_scaled[column] = (df[column] - df[column].mean()) / df[column].std()
    return df_scaled

We call the above function for both test and train data to get scaled data.

scaledTrainX=z_score_standardization(trainX)
scaledTestX=z_score_standardization(testX)
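One caveat worth noting: here the test set is standardized with its own mean and standard deviation. A common refinement, not used in this post, is to reuse the training set's statistics when scaling the test set so both are on exactly the same scale. A minimal sketch of that variant (with new variable names so it does not overwrite the data above):

# Compute the statistics on the training data only, then apply them to both sets.
train_mean = trainX.mean()
train_std = trainX.std()
scaledTrainX_alt = (trainX - train_mean) / train_std
scaledTestX_alt = (testX - train_mean) / train_std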

Building the Model

We will develop a sequential model using Keras, which is a high-level neural networks API. The model will consist of two hidden layers, each containing 32 neurons and utilizing the ReLU (Rectified Linear Unit) activation function to introduce non-linearity. This choice allows the model to learn complex patterns in the data.

Following the two hidden layers, we will include an output layer with a single neuron. This layer outputs the model's prediction: the expected score difference from the home team's perspective, which makes this a regression problem.

To optimize the model’s performance, we will compile it using the RMSprop optimizer, which is effective for training deep learning models. Additionally, we will define a specific loss function suitable for our task and set up metrics to track the model's performance during training. This comprehensive approach will help ensure that our model is well-structured and efficient.

def buildModel():
    # Two hidden layers of 32 ReLU units; a single linear output neuron
    # predicts the score difference from the home team's perspective.
    model = keras.models.Sequential([
        keras.layers.Dense(32, activation='relu', input_shape=[4]),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(1)
        ])
    # MSE loss with the RMSprop optimizer; MAE/MSE are the meaningful metrics here,
    # while 'accuracy' is reported but not informative for a continuous target.
    model.compile(optimizer='rmsprop',
                  loss='mean_squared_error',
                  metrics=['accuracy', 'MeanAbsoluteError', 'MeanSquaredError'])

    return model
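As an optional sanity check (not shown in the original output), Keras can print the layer shapes and parameter counts before training:

buildModel().summary()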

Create the model and call fit for 100 epochs, holding back 20% of the training data for validation

model=buildModel()
history = model.fit(scaledTrainX, trainY, epochs=100, validation_split=0.2)
Epoch 98/100
404/404 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - MeanAbsoluteError: 7.9093 - MeanSquaredError: 103.3424 - accuracy: 0.0156 - loss: 103.3424 - val_MeanAbsoluteError: 7.8313 - val_MeanSquaredError: 102.0163 - val_accuracy: 0.0130 - val_loss: 102.0163
Epoch 99/100
404/404 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - MeanAbsoluteError: 7.9773 - MeanSquaredError: 104.4646 - accuracy: 0.0166 - loss: 104.4646 - val_MeanAbsoluteError: 7.8514 - val_MeanSquaredError: 102.4821 - val_accuracy: 0.0124 - val_loss: 102.4821
Epoch 100/100
404/404 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - MeanAbsoluteError: 7.8406 - MeanSquaredError: 102.1461 - accuracy: 0.0168 - loss: 102.1461 - val_MeanAbsoluteError: 7.8153 - val_MeanSquaredError: 101.8889 - val_accuracy: 0.0133 - val_loss: 101.8889

After 100 epochs the training mean absolute error settles around 7.8 points, meaning the predicted score differences are off by roughly 8 points on average. Note that the 'accuracy' metric stays near zero throughout; it is not meaningful here because the model predicts a continuous score difference rather than a class label.

We are about to proceed with evaluating the model by utilizing our test dataset to assess its performance and accuracy.

test_loss, test_acc, mae, mse = model.evaluate(scaledTestX, testY)
126/126 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - MeanAbsoluteError: 7.3928 - MeanSquaredError: 93.3554 - accuracy: 0.0240 - loss: 93.3554

The test mean absolute error of roughly 7.4 points is in line with the training results. Let's plot the training and validation loss using Matplotlib.

history_dict = history.history
loss_values = history_dict["loss"]
val_loss_values = history_dict["val_loss"]
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, "bo", label="Training loss")
plt.plot(epochs, val_loss_values, "r", label="Validation loss")
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()


Let's also plot the mean absolute error

abs_error = history_dict["MeanAbsoluteError"]
val_abs_error = history_dict["val_MeanAbsoluteError"]
epochs = range(1, len(abs_error) + 1)
plt.plot(epochs, abs_error, "bo", label="Training mean absolute error")
plt.plot(epochs, val_abs_error, "r", label="Validation mean absolute error")
plt.title("Training and validation mean absolute error")
plt.xlabel("Epochs")
plt.ylabel("Mean absolute error")
plt.legend()
plt.show()
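With the trained model, predicting the margin for a new matchup works the same way: build a one-row feature frame, scale it, and call predict. The averages below are made up purely for illustration:

# Hypothetical matchup; these four averages are invented for illustration only.
new_game = pd.DataFrame([{
    'HomeScoreAverage': 78.0,
    'HomeDefenseAverage': 70.0,
    'AwayScoreAverage': 74.0,
    'AwayDefenseAverage': 72.0,
}])

# A single row has no meaningful mean/std of its own, so we scale it with the
# training set's statistics to roughly match the scale the model was trained on.
scaled_new_game = (new_game - trainX.mean()) / trainX.std()

# The model outputs the predicted score difference from the home team's perspective.
predicted_margin = model.predict(scaled_new_game)
print(predicted_margin[0][0])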

The Jupyter notebook for this project can be found at BaskeBallGame Prediction.
