Neural Network from Scratch in Python: A Step-by-Step Guide

Neural networks are the cornerstone of modern machine learning, powering applications like image recognition, natural language processing, and even autonomous vehicles. While frameworks like TensorFlow and PyTorch abstract away much of the complexity, building a neural network from scratch gives you invaluable insight into how these models work at a fundamental level.
In this post, we'll build a simple feedforward neural network from scratch using Python. We'll use the MNIST dataset of handwritten digits (0–9) to train our model. By the end of this guide, you’ll not only understand how neural networks function but also how to implement one from the ground up.
What is a Neural Network?
A neural network is a computational model inspired by the human brain, consisting of layers of interconnected nodes (neurons) that learn to recognize patterns in data. Neural networks are powerful tools for classification, regression, and feature extraction, making them ideal for tasks like image recognition, speech processing, and more.
Key Components of a Neural Network
Input Layer: The layer that receives input data, such as an image or a feature vector.
Hidden Layers: These layers perform computations on the data using weights and biases. They enable the network to learn complex patterns.
Output Layer: The layer that produces the model's predictions. For classification tasks, it outputs the probabilities for each class.
Activation Functions: Functions applied to neurons to introduce non-linearity into the network, enabling it to model complex relationships. Common activation functions include ReLU, Sigmoid, and Softmax.
Weights and Biases: Parameters that the network learns during training to minimize error. Weights connect the neurons between layers, and biases shift each neuron's output; a single neuron's computation is sketched right after this list.
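To make the last two components concrete, here is a minimal, illustrative sketch (not part of the network we build below) of what one neuron computes with NumPy:

import numpy as np

inputs = np.array([0.5, -0.2, 0.1])     # example feature vector
weights = np.array([0.4, 0.7, -0.3])    # one weight per input
bias = 0.1                              # shifts the weighted sum

z = np.dot(weights, inputs) + bias      # pre-activation: weighted sum plus bias
a = max(z, 0.0)                         # ReLU activation introduces non-linearity
print(z, a)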
Step 1: Loading and Preprocessing Data
Before training a neural network, it’s crucial to preprocess the data to ensure it is in a suitable format for the network.
Why Preprocessing is Important
Raw data often requires cleaning and transformation. In the case of image data, such as the MNIST dataset, preprocessing typically includes:
Normalization: Scaling pixel values to a smaller range (0 to 1) to help the network converge faster during training.
Shuffling: Randomly rearranging the data to ensure that the model is not biased toward any particular pattern in the data.
Splitting: Dividing the data into a training set (for learning) and a validation set (for evaluating model performance).
Loading and Shuffling the Data
The MNIST dataset is stored in CSV format, where each row contains the digit label in the first column, followed by the 784 pixel values of a flattened 28x28 grayscale image. We'll load and shuffle the data to ensure randomness during training.
import numpy as np
import pandas as pd

def load_and_shuffle_data(filepath):
    df = pd.read_csv(filepath)     # Load CSV into a DataFrame
    data_array = df.values         # Convert DataFrame to a NumPy array
    np.random.shuffle(data_array)  # Shuffle the rows in place
    return data_array
Splitting and Normalizing Data
We will split the dataset into a training set and a validation set. The pixel values are normalized by dividing them by 255 (since pixel values range from 0 to 255) to scale them between 0 and 1.
def split_and_normalize_data(data, dev_size=1000):
    total_samples, total_features = data.shape

    # Validation data
    validation_data = data[:dev_size].T          # Transpose so each column is one sample
    val_labels = validation_data[0]              # Labels are in the first row after transposing
    val_features = validation_data[1:] / 255.0   # Normalize pixel values to [0, 1]

    # Training data
    train_data = data[dev_size:].T               # Remaining samples are used for training
    train_labels = train_data[0]
    train_features = train_data[1:] / 255.0

    return val_features, val_labels, train_features, train_labels
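With both helpers defined, preparing the data takes two calls. The file name below is only a placeholder; point it at wherever your copy of the MNIST CSV lives:

data = load_and_shuffle_data("mnist_train.csv")  # placeholder path; adjust to your copy of the CSV
val_features, val_labels, train_features, train_labels = split_and_normalize_data(data)

print(train_features.shape)  # (784, num_training_samples) after the transpose
print(val_features.shape)    # (784, 1000) with the default dev_size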
Step 2: Initializing Neural Network Parameters
Network Architecture
We'll create a simple feedforward neural network with:
Input Layer: 784 neurons (since MNIST images are 28x28 pixels, resulting in 784 features).
Hidden Layer: 10 neurons (chosen arbitrarily for simplicity).
Output Layer: 10 neurons (one for each digit class: 0–9).
The weights and biases connecting these layers are initialized randomly. These parameters will be learned during the training process.
def initialize_parameters(input_dim=784, hidden_units=10, output_units=10):
    weights1 = np.random.uniform(-0.5, 0.5, (hidden_units, input_dim))     # Input-to-hidden weights
    bias1 = np.random.uniform(-0.5, 0.5, (hidden_units, 1))                # Hidden-layer biases
    weights2 = np.random.uniform(-0.5, 0.5, (output_units, hidden_units))  # Hidden-to-output weights
    bias2 = np.random.uniform(-0.5, 0.5, (output_units, 1))                # Output-layer biases
    return weights1, bias1, weights2, bias2
Step 3: Forward Propagation
Forward propagation is the process of passing input data through the network to compute predictions. It involves matrix multiplication between the data and the weights, followed by activation functions to introduce non-linearity.
Activation Functions
ReLU (Rectified Linear Unit): Used for the hidden layer. ReLU sets all negative values to zero, allowing the model to learn non-linear patterns.
Softmax: Used for the output layer to convert the network’s raw scores into probabilities. The softmax function ensures that the predicted values sum to 1, which makes them interpretable as probabilities.
def relu_activation(z):
    return np.maximum(z, 0)

def softmax_activation(z):
    exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))  # Subtract the column max for numerical stability
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)
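As a quick, purely illustrative sanity check, each column of the softmax output should sum to 1, even when the raw scores are large enough that a naive exponentiation would overflow:

scores = np.array([[2.0, 1000.0],
                   [1.0, 1001.0],
                   [0.1, 1002.0]])   # two columns of raw scores, one with very large values
probs = softmax_activation(scores)
print(probs.sum(axis=0))             # [1. 1.] — each column is a valid probability distribution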
Forward Pass
In forward propagation, we calculate the pre-activations (the weighted sums of inputs) and then apply activation functions.
def forward_pass(weights1, bias1, weights2, bias2, features):
    preactivation1 = np.dot(weights1, features) + bias1      # Input-to-hidden weighted sum
    activation1 = relu_activation(preactivation1)            # Apply ReLU activation
    preactivation2 = np.dot(weights2, activation1) + bias2   # Hidden-to-output weighted sum
    activation2 = softmax_activation(preactivation2)         # Apply softmax activation
    return preactivation1, activation1, preactivation2, activation2
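A small shape check (illustrative only, using random parameters and a random batch) shows that the output has one row per class and one column per sample:

w1, b1, w2, b2 = initialize_parameters()
dummy_batch = np.random.rand(784, 5)          # 5 fake "images", each a column of 784 pixel values
_, _, _, probabilities = forward_pass(w1, b1, w2, b2, dummy_batch)
print(probabilities.shape)                    # (10, 5): 10 class probabilities per sample
print(probabilities.sum(axis=0))              # each column sums to 1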
Step 4: Backward Propagation
Backward propagation is the process of calculating gradients of the loss function with respect to the model parameters (weights and biases). These gradients will be used to update the parameters and minimize the loss.
Loss Function: Cross-Entropy Loss
For classification tasks, we use the cross-entropy loss, which measures how far the predicted class probabilities are from the true (one-hot encoded) labels. A convenient property of pairing cross-entropy with a softmax output is that the gradient of the loss with respect to the output pre-activations simplifies to the predicted probabilities minus the one-hot labels; that is exactly the delta2 term in the code below.
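The gradient code below relies on two small helpers that are not defined elsewhere in this post: one-hot label encoding and the ReLU derivative. Here is one reasonable sketch of them, along with the cross-entropy loss itself, which is handy for monitoring training even though the gradient code never needs to evaluate it explicitly:

def encode_labels(labels, num_classes=10):
    # Turn a vector of digit labels into a one-hot matrix of shape (num_classes, num_samples)
    one_hot = np.zeros((num_classes, labels.size))
    one_hot[labels.astype(int), np.arange(labels.size)] = 1
    return one_hot

def relu_derivative(z):
    # Derivative of ReLU: 1 where the pre-activation was positive, 0 elsewhere
    return (z > 0).astype(float)

def cross_entropy_loss(predictions, labels):
    # Average negative log-probability assigned to the correct class (illustrative helper)
    encoded = encode_labels(labels)
    return -np.mean(np.sum(encoded * np.log(predictions + 1e-9), axis=0))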
def compute_gradients(preactivation1, activation1, preactivation2, activation2, weights1, weights2, features, labels, num_samples):
    encoded_labels = encode_labels(labels)                                  # One-hot encode the labels
    delta2 = activation2 - encoded_labels                                   # Output-layer error term
    grad_weights2 = np.dot(delta2, activation1.T) / num_samples             # Gradient w.r.t. output-layer weights
    grad_bias2 = np.sum(delta2, axis=1, keepdims=True) / num_samples        # Gradient w.r.t. output-layer biases
    delta1 = np.dot(weights2.T, delta2) * relu_derivative(preactivation1)   # Hidden-layer error term
    grad_weights1 = np.dot(delta1, features.T) / num_samples                # Gradient w.r.t. hidden-layer weights
    grad_bias1 = np.sum(delta1, axis=1, keepdims=True) / num_samples        # Gradient w.r.t. hidden-layer biases
    return grad_weights1, grad_bias1, grad_weights2, grad_bias2
Step 5: Training the Neural Network
Training a neural network involves iterating over the data, performing forward propagation, calculating the gradients using backward propagation, and updating the parameters using an optimization algorithm (e.g., gradient descent).
def update_parameters(weights1, bias1, weights2, bias2, grad_weights1, grad_bias1, grad_weights2, grad_bias2, learning_rate):
    weights1 -= learning_rate * grad_weights1   # Update hidden-layer weights
    bias1 -= learning_rate * grad_bias1         # Update hidden-layer biases
    weights2 -= learning_rate * grad_weights2   # Update output-layer weights
    bias2 -= learning_rate * grad_bias2         # Update output-layer biases
    return weights1, bias1, weights2, bias2
Training Loop
In each iteration, we perform forward propagation, compute the gradients, and update the parameters.
def train_neural_network(features, labels, learning_rate=0.10, num_iterations=500):
    num_samples = features.shape[1]
    weights1, bias1, weights2, bias2 = initialize_parameters()  # Initialize parameters
    for iteration in range(num_iterations):
        preactivation1, activation1, preactivation2, activation2 = forward_pass(
            weights1, bias1, weights2, bias2, features)
        grad_weights1, grad_bias1, grad_weights2, grad_bias2 = compute_gradients(
            preactivation1, activation1, preactivation2, activation2,
            weights1, weights2, features, labels, num_samples)
        weights1, bias1, weights2, bias2 = update_parameters(
            weights1, bias1, weights2, bias2,
            grad_weights1, grad_bias1, grad_weights2, grad_bias2, learning_rate)
        if iteration % 10 == 0:
            predictions = predict_labels(activation2)
            accuracy = calculate_accuracy(predictions, labels)
            print(f"Iteration {iteration}: Accuracy = {accuracy:.4f}")
    return weights1, bias1, weights2, bias2
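The loop reports progress through two helpers that are not shown above. A minimal version, consistent with how they are called, might look like this:

def predict_labels(output_activations):
    # Pick the class with the highest probability in each column
    return np.argmax(output_activations, axis=0)

def calculate_accuracy(predictions, labels):
    # Fraction of samples where the predicted digit matches the true label
    return np.mean(predictions == labels)

With these in place, an end-to-end run could look like the following (the CSV path is a placeholder):

data = load_and_shuffle_data("mnist_train.csv")  # placeholder path
val_features, val_labels, train_features, train_labels = split_and_normalize_data(data)

weights1, bias1, weights2, bias2 = train_neural_network(train_features, train_labels)

# Evaluate on the held-out validation set
_, _, _, val_probs = forward_pass(weights1, bias1, weights2, bias2, val_features)
val_accuracy = calculate_accuracy(predict_labels(val_probs), val_labels)
print(f"Validation accuracy: {val_accuracy:.4f}")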
Conclusion
By following these steps, we’ve built a neural network from scratch that can classify handwritten digits from the MNIST dataset. Understanding the inner workings of a neural network is crucial for gaining insights into how machine learning models learn patterns from data.
While building neural networks from scratch is a great way to learn, in practice, you’ll likely use higher-level libraries like TensorFlow or PyTorch for efficiency. However, understanding these fundamentals will help you become a better practitioner of machine learning and give you the ability to troubleshoot and optimize your models at a deeper level.
Happy coding!
Check out the GitHub repo to find out more.