Learning Machine Learning and Artificial Intelligence with Blast

Hello there! I started writing this series of articles to deepen my knowledge of Machine Learning and Artificial Intelligence, and it’s been a great journey so far. In the previous article, we built a Binary Classification Model with Neural Networks in Keras to classify movie review sentiments as either positive or negative. This article builds on that knowledge; we’ll follow a flow similar to the one we used for that task.

This article answers a question you might have had while working through the Binary Classification task in the previous article: what if you wanted to classify data into more than two categories? The solution is a Multiclass Classification Model, which is what we’ll tackle in this article.

We’ll be working with the Reuters dataset, a set of short newswires and their categories. It’s a simple, widely used toy dataset for text classification. There are 46 different categories and each category has at least 10 examples in the training set. This dataset comes prepackaged with Keras as well.

We’re classifying each short newswire into exactly one of the 46 categories, so this problem is more specifically an instance of Single-Label Multiclass Classification. If each short newswire could belong to several categories at once, we’d be dealing with Multilabel Multiclass Classification instead.

Alright then, time to get started…

Like in the previous article, our first step is to load the dataset into our Google Colab workspace. You can do so by running:

from tensorflow.keras.datasets import reuters

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

# Just like with the IMDB dataset, we restrict the data to the 10,000 most frequently occurring words

print(len(train_data)) # 8982
print(len(test_data)) # 2246

We have 8,982 training examples and 2,246 test examples. Great!

Just as with the IMDB dataset, each example is a list of integers (word indices). You can print out an example and also decode the integers back to words; the decoding code from the previous article works here too (a sketch follows the snippet below).

print(train_labels[10]) # 3

# The label associated with an example is an integer between 0 and 45, i.e. a category index
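
For reference, here’s a minimal sketch of that decoding code, using the same approach as the IMDB example in the previous article (indices are offset by 3 because 0, 1, and 2 are reserved for “padding”, “start of sequence”, and “unknown”):

word_index = reuters.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

decoded_newswire = " ".join([reverse_word_index.get(i - 3, "?") for i in train_data[0]])
print(decoded_newswire)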

Alright, the dataset is loaded; the next step is to preprocess the data for our model. To do this, we need to vectorize the data, so we can feed the model the data in a format it can work with: vectors. The code to vectorize the data can be picked up from the previous article; then run:

x_train = vectorize_sequence(train_data)
x_test = vectorize_sequence(test_data)

# the code for this function, which converts lists of integers into 10,000-dimensional multi-hot vectors, can be found in the previous article (and is sketched below)
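
For reference, here’s a minimal sketch of that function, the same multi-hot encoding we used in the previous article:

import numpy as np

def vectorize_sequence(sequences, dimension=10000):
    # create an all-zeros matrix of shape (number of sequences, 10000)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set the positions of the words present in the sequence to 1
    return results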

For our labels, it’s slightly different. It’s the same idea, but here we’ll be converting each category index into a vector of 46 dimensions whose entries are all 0 except for the position representing the category, which will be 1. This is called one-hot encoding or categorical encoding, as opposed to the multi-hot encoding we used to vectorize the data.

To encode the labels run:

import numpy as np

def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results

y_train = to_one_hot(train_labels)
y_test = to_one_hot(test_labels)

# note that there is also a built-in way in Keras to do this, which works the same way
# from tensorflow.keras.utils import to_categorical
# y_train = to_categorical(train_labels)
# y_test = to_categorical(test_labels)

Now we’ve completely processed our data for training; it’s ready to be fed to the model. We move on to the next step, which is building the model.

This problem is similar to the one in the previous article: in both cases, we’re trying to classify short snippets of text. The difference is that the number of output classes has gone from two in binary classification to 46 in this case of multiclass classification.

This means the dimensionality of the output space is much larger for this problem. We’ll use a stack of Dense layers in our model, as we did for the previous task. In a stack of Dense layers, each layer can only access information present in the output of the previous layer, so if one layer drops some information relevant to the classification problem, that information can never be recovered by later layers; an undersized layer can act as a permanent information bottleneck.

In the previous article, we used 16-dimensional intermediate layers, but that might be too limited for learning to separate 46 different classes. For this reason, we’ll go with larger layers of 64 units.

Our model architecture looks like this:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(46, activation="softmax")
])

Two things you should note about this model architecture:

  • First, we end the model with a Dense layer of 46 units, to represent the 46 categories.

  • Second, the last layer uses a softmax activation; this is so the model outputs a probability distribution over the 46 different categories.

We’ve established our model architecture; the next thing is to select our loss function and optimizer. The best loss function to use in this case is categorical_crossentropy, which measures the distance between the probability distribution the model outputs and the true distribution of the labels. For the optimizer, we’ll go with rmsprop, which is a good default for most use cases.

Next, we’ll compile our model like so:

model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])

Before we train the model, we’ll set apart some validation data, like so:

x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = y_train[:1000]
partial_y_train = y_train[1000:]

Now let’s train the model for 20 epochs by running:

history = model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512, validation_data=(x_val, y_val))

Next, we can display the model’s loss and accuracy curves like we did in the previous article; the same plotting code works here.
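
If you don’t have that code handy, here’s a minimal sketch with matplotlib; it plots the loss curves, and the accuracy curves work the same way with the "accuracy" and "val_accuracy" keys:

import matplotlib.pyplot as plt

loss = history.history["loss"]
val_loss = history.history["val_loss"]
epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, "bo", label="Training loss")
plt.plot(epochs, val_loss, "b", label="Validation loss")
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()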

From the graphs, we can see that the model starts to overfit after about the ninth epoch, so we’ll train the model again from scratch for 9 epochs and then evaluate it on the test set.

model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(46, activation="softmax")
])

model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])

history = model.fit(partial_x_train, partial_y_train, epochs=9, batch_size=512)

results = model.evaluate(x_test, y_test)
print(results) #[0.9503293633460999, 0.7831701040267944]

From the result, we can see the model achieves a test accuracy of about 78%. We can use this model on new samples by calling the predict method, which returns a class probability distribution over all 46 categories.

predictions = model.predict(x_test)

print(np.argmax(predictions[0])) # 4, i.e. category 4 for the first short newswire
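
Because the last layer is a softmax, each prediction is a vector of 46 probabilities that sum to (approximately) 1. You can verify this yourself:

print(predictions[0].shape) # (46,)
print(np.sum(predictions[0])) # ~1.0, up to floating-point error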

And there we have it. We’ve trained a Multiclass Classification Model. Between this article and the previous one, we’ve become better Machine Learning Engineers or Artificial Intelligence Researchers.

To build a Machine Learning model, we now know we need to:

  1. Load the dataset.

  2. Process the dataset so it can be fed into the model.

  3. Choose an architecture for the model.

  4. Choose a loss function and an optimizer.

  5. Compile the model.

  6. Separate some of the data for validation.

  7. Train the model.

  8. Compare the loss and accuracy curves for the training and validation data.

  9. Retrain the model from scratch for the right number of epochs to avoid overfitting.

  10. Evaluate the model on the test data.

  11. Use the model on new data by calling the predict method.

There’s still a lot more that goes into Machine Learning and Artificial Intelligence research, but this is as good a starting point as any. There’s still plenty to learn, and as we do, we’ll be able to build models for more of the problems we encounter.

A couple of key points before I say bye:

  • For the model architecture, because the final outputs are 46-dimensional, we should avoid intermediate layers with fewer than 46 units; a layer that small would act as an information bottleneck.

  • If you’re trying to classify data points among N classes, your model should end with a Dense layer of size N.

  • In a Single-Label Multiclass Classification problem, your model should end with a softmax activation so that it will output a probability distribution over the N output classes.

  • Categorical crossentropy is almost always the loss function suitable for this kind of problem (see the note below on a sparse variant).
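
As an aside on that last point: Keras also lets you skip the one-hot encoding and keep the labels as plain integers, in which case you would use sparse_categorical_crossentropy instead. It’s mathematically the same loss, just with a different label interface. A minimal sketch, reusing the names from earlier:

model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# then train with the integer labels directly, e.g.
# model.fit(partial_x_train, train_labels[1000:], epochs=9, batch_size=512)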

You should try out different architectures to see their effect on the model; one concrete experiment is sketched below.
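
For example, to see the information-bottleneck effect from the key points above, you could deliberately shrink the middle layer well below 46 units and watch the validation accuracy drop. A minimal sketch of that experiment, reusing the variables defined earlier:

bottleneck_model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(4, activation="relu"),  # deliberately far fewer units than the 46 output classes
    layers.Dense(46, activation="softmax")
])

bottleneck_model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
bottleneck_model.fit(partial_x_train, partial_y_train, epochs=9, batch_size=512, validation_data=(x_val, y_val))

# expect a noticeably lower validation accuracy than the 64-unit version

I’ll be ending this article here. In the next article, we’ll work on building a model for a regression problem.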
