Learning Machine Learning and Artificial Intelligence with Blast

Hello everyone, I hope the journey has been productive thus far. We’ve been learning Machine Learning and Artificial Intelligence in these articles, and I hope you’re much more of an Artificial Intelligence researcher and a Machine Learning engineer now than you were when we started this series.

In the past couple of articles, we’ve been learning how to build simple Machine Learning models using Neural Networks in Keras, from Binary Classification to Multiclass Classification. In this article, we’re going to continue with the trend but this time we’re building a Machine Learning model to solve a Regression problem.

A reminder that our reference book for this series of articles is Deep Learning with Python by Francois Chollet. It contains far more detailed explanations than I can share in these articles, and you’d do well to grab a copy of the book to further enhance your Machine Learning and Artificial Intelligence knowledge.

Alright without further ado, let’s dive into building a Machine Learning model to solve a Regression problem.

Regression is another common type of Machine Learning problem. It involves predicting a continuous value instead of a discrete label: for instance, predicting tomorrow’s temperature given meteorological data, or predicting the time a software project will take to complete given its specification.

Regression differs from Classification, which assigns data to predefined categories. Regression, on the other hand, establishes a relationship between variables.

For this task, we’re attempting to predict the median price of homes in a given Boston suburb in the mid-1970s, given data points about the suburb at the time, such as the crime rate, the local property tax rate, and so on.

The dataset we’re using for this task differs from the ones we used in the previous tasks because it has relatively few data points: only 506 examples, split between 404 training samples and 102 test samples. Each feature in the input data (for example, the crime rate) has a different scale. For instance, some values are proportions that take values between 0 and 1, others take values between 1 and 12, others between 0 and 100, and so on.

Working through this task, we’ll learn how to handle a dataset with relatively few samples, and we’ll add to our data preprocessing skills so that the data we feed into the Neural Network makes it easier for the model to learn useful representations.

Like in the previous two tasks, we’ll start by loading our dataset into Google Colab; this dataset also comes prepackaged with Keras. We’ll do it by running:

from tensorflow.keras.datasets import boston_housing
import numpy as np

(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

# look at the data:
print(train_data.shape) # (404, 13)
print(test_data.shape) # (102, 13)

# We have 404 training samples and 102 test samples, each with 13 numerical features, i.e.
# 404 rows and 13 columns for the training samples, 102 rows and 13 columns for the test samples.
# The columns include the per capita crime rate, average number of rooms per dwelling, accessibility to highways, etc.

print(train_targets) # [ 15.2, 42.3, 50. ... 19.4, 19.4, 29.1 ]
# The targets or labels are the median values of owner-occupied homes, in thousands of dollars
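If you’d like to see for yourself the very different feature scales mentioned earlier, a quick optional check is to print the smallest and largest value in each column:

print(train_data.min(axis=0)) # smallest value in each of the 13 columns
print(train_data.max(axis=0)) # largest value in each column; the ranges differ wildly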

Alright, our dataset is loaded and we’ve looked at it to understand it better. The next step is to preprocess our data before feeding it into the Neural Network.

It would be problematic to feed a Neural Network values that take wildly different ranges, as is the case with the values in our columns. The model might be able to adapt to such heterogeneous data automatically, but it would definitely make learning more difficult.

A widespread best practice for dealing with such data is to do feature-wise normalization: for each feature or column in the input data, we subtract the mean of the feature and divide by the standard deviation, so that the feature is centered around 0 and has a unit standard deviation. This is easily done in NumPy.

mean = train_data.mean(axis=0) # calculate the mean of each column in the training data. axis=0 ensures the mean is computed for each column separately
train_data -= mean # subtract the mean for each column from the corresponding column values. this centers the data around 0
std = train_data.std(axis=0) # calculate the standard deviation of each column
train_data /= std # divide each column by its standard deviation, so that each column has a unit standard deviation
test_data -= mean # apply the same mean subtraction to the test data using the mean calculated from the training data, ensuring the test data is centered the same way as the training data.
test_data /= std # scales the test data by the standard deviation calculated from the training data.

# Note: the quantities used for normalizing the test data are computed using the training data. 
# You should never use any quantity computed on the test data in your workflow, even for something as simple as data normalization.
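As a side note, scikit-learn (which is usually available in Colab) offers an equivalent transformer if you’d rather not do the arithmetic by hand. This is just an alternative sketch; you’d use it instead of the manual NumPy steps above, not in addition to them:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_data = scaler.fit_transform(train_data) # fit on the training data only, then transform it
test_data = scaler.transform(test_data) # reuse the training statistics to transform the test data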

Alright, we’ve preprocessed our data and it’s now ready to be fed into the Neural Network. The next step is our model architecture.

Because we have so few samples, we’ll use a very small model with two intermediate layers, each with 64 units. In general, the less training data you have, the worse overfitting will be, and using a small model is one way to mitigate overfitting.

Our model architecture is thus:

from tensorflow import keras
from tensorflow.keras import layers

def build_model():
    model = keras.Sequential([
        layers.Dense(64, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1)
    ])
    model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
    return model

# Note: we're using a function here instead of building the model directly because we need to instantiate the same model multiple times.

The model ends with a single unit and no activation (it will be a linear layer). This is a typical setup for Scalar Regression (a Regression problem where you’re trying to predict a single continuous value). The model is free to learn to predict values in any range.

Note also that we compile the model with the mse loss function—mean squared error, the square of the difference between the predictions and the targets. This is a widely used loss function for Regression problems.

We’re also monitoring a new metric during training: mean absolute error (MAE). It’s the absolute value of the difference between the predictions and the targets. For instance, an MAE of 0.5 on this problem would mean your predictions are off by $500 on average.
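To make these two quantities concrete, here’s a tiny sketch using the NumPy we imported earlier, with made-up prediction and target values (in thousands of dollars, purely for illustration):

preds = np.array([15.0, 22.5, 30.0]) # hypothetical predictions
targets = np.array([14.0, 24.0, 29.5]) # hypothetical true values

mse = np.mean((preds - targets) ** 2) # mean squared error: average of the squared differences
mae = np.mean(np.abs(preds - targets)) # mean absolute error: average of the absolute differences
print(mse, mae) # roughly 1.17 and 1.0, i.e. off by about $1,000 on average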

Okay, before we begin training the model, we need to set aside some of the data for validation. But because we have so few examples, the validation set would end up being very small (for instance, about 100 examples). As a consequence, the validation scores might change a lot depending on which data points we chose for validation and which we chose for training. This would prevent us from reliably evaluating our model.

The best practice in such situations is to use K-fold cross-validation. It consists of splitting the available data into K partitions (typically K = 4 or 5), instantiating K identical models, and training each one on K - 1 partitions while evaluating on the remaining partition. The validation score for the model used is then the average of the K validation scores obtained.

In simpler terms, K-fold cross-validation involves dividing the training data into K parts (let’s say 4 in this instance), training the model on 3 parts while evaluating on the 4th, then repeating this with a different 3 parts and a different evaluation part until every part has been used for evaluation, and finally taking the average of all the validation scores as the validation score for the model.

In code it looks like:

k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []

for i in range(k):
    print(f"Processing fold #{i}")
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples] # prepares the validation data: data from partition #i
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
        train_data[(i + 1) * num_val_samples:]],
        axis=0) # prepares the training data: data from all other partitions
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
        train_targets[(i + 1) * num_val_samples:]],
        axis=0)

    model = build_model() # Builds the Keras model (already compiled)
    model.fit(partial_train_data, partial_train_targets,
        epochs=num_epochs, batch_size=16, verbose=0) # Trains the model (in silent mode, verbose = 0)
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0) # Evaluates the model on the validation data 
    all_scores.append(val_mae)

# Running the above with num_epochs = 100 yields the following results:
print(all_scores) # [1.8413702249526978, 2.450035572052002, 2.472498893737793, 2.5193967819213867]
print(np.mean(all_scores)) # 2.32082536816597

The different folds show different validation scores, from 1.8 to 2.5. The average (2.3) is a much more reliable metric than any single score; that’s the entire point of K-fold cross-validation. In this case, we’re off by $2,300 on average, which is significant considering the prices range from $10,000 to $50,000.

Let’s try training the model a bit longer: 500 epochs. To keep a record of how well the model does at each epoch, we’ll modify the training loop to save the per-epoch validation score log for each fold.

In code it looks like:

k = 4
num_val_samples = len(train_data) // k
num_epochs = 500
all_mae_histories = []

for i in range(k):
    print(f"Processing fold #{i}")
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples] # prepares the validation data: data from partition #i
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
        train_data[(i + 1) * num_val_samples:]],
        axis=0) # prepares the training data: data from all other partitions
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
        train_targets[(i + 1) * num_val_samples:]],
        axis=0)

    model = build_model() # Builds the Keras model (already compiled)
    history = model.fit(partial_train_data, partial_train_targets, validation_data=(val_data, val_targets),
        epochs=num_epochs, batch_size=16, verbose=0) # Trains the model (in silent mode, verbose = 0)
    mae_history = history.history["val_mae"]
    all_mae_histories.append(mae_history)

We can then compute the average of the per-epoch MAE scores for all folds with:

average_mae_history = [np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
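Since all_mae_histories is just a list of k lists, each of length num_epochs, the same per-epoch average can be computed more compactly (this is simply an equivalent alternative to the list comprehension above):

average_mae_history = np.mean(all_mae_histories, axis=0) # mean across the 4 folds, for each epoch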

Let’s plot this (Matplotlib comes preinstalled in Colab):

import matplotlib.pyplot as plt

plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel("Epochs")
plt.ylabel("Validation MAE")
plt.show()

Running this in Colab, the validation MAE for the first few epochs is dramatically higher than the values that follow. Let’s omit the first 10 data points.

truncated_mae_history = average_mae_history[10:]
plt.plot(range(1, len(truncated_mae_history) + 1), truncated_mae_history)
plt.xlabel("Epochs")
plt.ylabel("Validation MAE")
plt.show()

From this graph we can see that the validation MAE stops improving significantly after 120-140 epochs (this number includes the 10 epochs we omitted). Past that point, we start overfitting.
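If you’d rather read the best epoch off the numbers than eyeball the graph, a quick optional check is the following (the exact value will vary a bit from run to run):

best_epoch = np.argmin(average_mae_history) + 1 # epoch (1-indexed) with the lowest average validation MAE
print(best_epoch)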

Now we can train a final production model on all of the training data, with the best parameters and then look at its performance on the test data.

We’ll do that here:

model = build_model() # freshly compiled model from the function created earlier
model.fit(train_data, train_targets, epochs=130, batch_size=16, verbose=0) # train on the entire data
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)

# final result = mae: 2.1999

We’re still off by about $2,200, but it’s an improvement! You can try varying the number of layers in the model, or the number of units per layer, to see if you can squeeze out a lower test error.
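If you want to experiment with that, one lightweight way is to parameterize the build function. This is just a sketch; build_model_variant, units and num_layers are names I’m introducing for illustration, not part of the code we’ve written so far:

def build_model_variant(units=64, num_layers=2):
    model = keras.Sequential(
        [layers.Dense(units, activation="relu") for _ in range(num_layers)]
        + [layers.Dense(1)]
    )
    model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
    return model

# e.g. build_model_variant(units=32, num_layers=3), then rerun the K-fold loop with it to compare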

For this Scalar Regression problem, the model’s predict() method returns the model’s guess for a sample’s price in thousands of dollars.

We can test it on new data like so:

predictions = model.predict(test_data)
print(predictions[0]) # [7.089668]

# The first house in the test set is predicted to have a price of about $7,000

And that’s it, we’ve trained a Machine Learning model to solve a Regression problem. A couple of key points before I say bye:

  • Regression is done using different loss functions than we used for Classification. Mean Squared Error (MSE) is a loss function commonly used for Regression.

  • Similarly, evaluation metrics to be used for Regression differ from those used for Classification. The concept of accuracy doesn’t apply for Regression. A common Regression metric is mean absolute error (MAE).

  • When columns in the input data have values in different ranges, each column should be scaled independently as a preprocessing step using feature-wise normalization.

  • When there is little data available, using K-fold validation is a great way to reliably evaluate a model.

And that’s it! With this article you’ve completed three mini Machine Learning projects. You can build on this knowledge and keep on improving.

I’ll be going on a hiatus to learn more so I can share more in these articles. It’ll be a while before my next article, but till then keep learning, and when I resume writing these articles let’s be much closer to our goal of becoming Machine Learning engineers or Artificial Intelligence researchers.

Till next time, Bye.
