Building a Neural Network from Scratch in Python and NumPy

Harvey Ducay

Ever looked at a neural network formula and felt a disconnect from what it's actually doing? You're not alone. Frameworks like TensorFlow and PyTorch are powerful, but their high-level abstractions can hide the beautiful, intuitive mathematics that make learning possible.

Today, we're going to build a neural network from scratch using only Python and NumPy. But more importantly, at every step we'll stop and connect the code to the core mathematical ideas.

Our goal isn't just to make it work; it's to understand how a collection of matrix multiplications and derivatives can learn to recognize something as complex as a handwritten digit.

Let's translate the math into intuition.


Files and Dependencies

Every learning process starts with information. For our network, that information is the MNIST dataset.

Our project relies on this main library:

  • NumPy: This is our mathematical workhorse. It will perform the vector and matrix operations that form the very language of neural networks.
import numpy as np

The MNIST dataset contains 70,000 images (28x28 pixels) of handwritten digits. We load it from a NumPy-native .npz file.

# Define the path to your dataset file
file_path = '/kaggle/input/mnist-from-scratch/mnist.npz'

data = np.load(file_path)
print(f"Arrays in the file: {data.files}")

# Unpack the data into training and testing sets
x_train, y_train = data['x_train'], data['y_train']
x_test, y_test = data['x_test'], data['y_test']
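
As a quick sanity check, the standard MNIST split gives 60,000 training images and 10,000 test images, each 28x28 pixels (assuming this .npz file follows the standard split):

print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
print(x_test.shape, y_test.shape)    # (10000, 28, 28) (10000,)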

Our network doesn't see images; it sees numbers. To make these numbers easier to work with, we perform normalization.

The Math: We take pixel values from [0, 255] and scale them to [0, 1].
The Intuition: This ensures that all our input features are on a similar scale. During training, the gradients (which tell us how to update our weights) are sensitive to the scale of the input. Normalization prevents gradients from becoming too large and unstable, leading to a smoother, more predictable learning process.

# Normalize pixel values to the [0, 1] range
x_train = x_train / 255.0
x_test = x_test / 255.0

Building the Neural Network

We are building a brain with a specific structure: an input layer, two hidden layers, and an output layer. The "learning" happens in the connections between these layers. These connections are defined by weights and biases.

Initializing Weights and Biases

The Math: We create matrices of random numbers. The shape of each matrix is (neurons_in_layer, neurons_in_previous_layer).

The Intuition:

  • Weights (w): Think of a weight as the strength or importance of a connection. A large weight means the signal from a previous neuron has a strong influence. We initialize them randomly to break symmetry. If all weights started at zero, every neuron in a layer would learn the exact same thing, and our network would be no better than a single neuron. Randomness ensures they each start on a unique path to find different features.

  • Biases (b): Think of a bias as an "activation threshold." It's a value that determines how easy it is for a neuron to "fire" (output a high value). A neuron with a large negative bias will require a very strong input signal to become active.

# --- Weights: The strength of connections ---
# w_i_h1 connects 784 input neurons to 64 hidden layer 1 neurons
w_i_h1 = np.random.uniform(-0.5, 0.5, (64, 784))
# w_h1_h2 connects 64 L1 neurons to 32 L2 neurons
w_h1_h2 = np.random.uniform(-0.5, 0.5, (32, 64))
# w_h2_o connects 32 L2 neurons to 10 output neurons
w_h2_o = np.random.uniform(-0.5, 0.5, (10, 32))

# --- Biases: The neuron's activation threshold ---
b_i_h1 = np.random.uniform(-0.5, 0.5, (64, 1))
b_h1_h2 = np.random.uniform(-0.5, 0.5, (32, 1))
b_h2_o = np.random.uniform(-0.5, 0.5, (10, 1))

On a side note, I’ve set up 64 neurons for the first hidden layer and 32 neurons for the second. These sizes are somewhat arbitrary; other reasonable layer widths will give similar (though not identical) results. I find 64 and 32 the easiest to reason about and use as an example. Both the weights and the biases were initialized from a uniform distribution between -0.5 and 0.5.


Training Loop

The training loop is where the magic happens. It's a cycle of guessing, checking, and correcting that, when repeated thousands of times, allows the network to learn. We will dissect this process into four distinct stages that occur for every single image in our training data.

Here's the full code block for context. We will break it down piece by piece below.
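
One detail the full block glosses over: it relies on a few hyperparameters (epochs, learning_rate, and the small constant epsilon) being defined before the loop runs. The original values aren't shown here, so treat the ones below as an illustrative sketch you can tune:

# Hyperparameters (illustrative values -- tune to taste)
epochs = 3             # number of full passes over the training set
learning_rate = 0.01   # step size for gradient descent
epsilon = 1e-10        # keeps np.log() away from log(0) in the loss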

# --- The Full Training Loop ---
for epoch in range(epochs):
    # ... (code for tracking error and accuracy)
    for image, label in zip(x_train, y_train):
        # STAGE A: Data Preparation
        image = image.reshape(784, 1)
        label_vec = np.zeros((10, 1))
        label_vec[label] = 1

        # STAGE B: Forward Propagation
        h1_pre = w_i_h1 @ image + b_i_h1
        h1 = 1 / (1 + np.exp(-h1_pre))
        h2_pre = w_h1_h2 @ h1 + b_h1_h2
        h2 = 1 / (1 + np.exp(-h2_pre))
        o_pre = w_h2_o @ h2 + b_h2_o
        exps = np.exp(o_pre - np.max(o_pre))
        o = exps / np.sum(exps)
        o = np.clip(o, epsilon, 1 - epsilon)

        # STAGE C: Loss Calculation
        error = -np.sum(label_vec * np.log(o))
        # ... (update total error and accuracy)

        # STAGE D: Backpropagation
        delta_o = o - label_vec
        w_h2_o += -learning_rate * delta_o @ np.transpose(h2)
        b_h2_o += -learning_rate * delta_o

        delta_h2 = np.transpose(w_h2_o) @ delta_o * (h2 * (1 - h2))
        w_h1_h2 += -learning_rate * delta_h2 @ np.transpose(h1)
        b_h1_h2 += -learning_rate * delta_h2

        delta_h1 = np.transpose(w_h1_h2) @ delta_h2 * (h1 * (1 - h1))
        w_i_h1 += -learning_rate * delta_h1 @ np.transpose(image)
        b_i_h1 += -learning_rate * delta_h1

Data Preparation

Before we can feed data to our network, we must format it correctly.

# Reshape the input image from a 28x28 matrix to a 784x1 vector
image = image.reshape(784, 1)

# Create a "one-hot" encoded vector for the label
label_vec = np.zeros((10, 1))
label_vec[label] = 1

1. Flattening the Image:
Our network's input layer has 784 neurons, arranged as a single line. The raw image is a 28x28 grid of pixels. By reshaping it into a (784, 1) column vector, we are "unspooling" the grid into a flat list that can be directly multiplied by our first weight matrix (w_i_h1), which has a shape of (64, 784).

2. Vectorizing the Label (One-Hot Encoding):
A human understands the label 7, but our network's output is a vector of 10 probabilities. To measure the error, we need a "ground truth" target in the same format. One-hot encoding does this. For the label 7, it creates a vector that is zero everywhere except for a 1 at the 7th index, representing "100% confidence that the digit is 7."
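
To make this concrete, here is what the two preparation steps produce, as a quick check you can run yourself (the label value 7 is just an illustrative example):

# Data preparation, checked on a single example
image = x_train[0].reshape(784, 1)
print(image.shape)          # (784, 1)

label = 7                   # illustrative label
label_vec = np.zeros((10, 1))
label_vec[label] = 1
print(label_vec.ravel())    # [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]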


Forward Propagation

Here, data flows forward through the network, from the input pixels to the final probability outputs.

# Hidden Layer 1
h1_pre = w_i_h1 @ image + b_i_h1
h1 = 1 / (1 + np.exp(-h1_pre))  # Sigmoid activation

# Hidden Layer 2
h2_pre = w_h1_h2 @ h1 + b_h1_h2
h2 = 1 / (1 + np.exp(-h2_pre))  # Sigmoid activation

# Output Layer
o_pre = w_h2_o @ h2 + b_h2_o
exps = np.exp(o_pre - np.max(o_pre)) # Stable Softmax
o = exps / np.sum(exps)
o = np.clip(o, epsilon, 1-epsilon)

At each layer, we do two things:

  1. Linear Combination: Z = W @ A_prev + b. This is a weighted sum. The matrix multiplication (@) aggregates signals from the previous layer, and the bias (b) shifts the result.

  2. Activation: A = g(Z). We pass the result through a non-linear activation function.

A Closer Look at Our Activation Functions

  • Sigmoid (for Hidden Layers): σ(z) = 1 / (1 + e⁻ᶻ)

    • The Math: This function takes any real number z and "squashes" it into a range between 0 and 1.

    • The Intuition: It acts like a dimmer switch or a "gate." A value close to 1 means the neuron is highly "active" and passing its signal forward. A value close to 0 means it's "inactive." Crucially, it's non-linear. Without non-linearity, stacking layers would be pointless; the entire network would collapse into a single, less powerful linear transformation.

  • Softmax (for the Output Layer): S(zᵢ) = eᶻᵢ / Σ eᶻⱼ

    • The Math: It exponentiates each input score (making them all positive) and then divides by the sum of all exponentiated scores.

    • The Intuition: This is why we use Softmax for the output layer in a classification problem. Its output has two beautiful properties:

      1. All output values are between 0 and 1.

      2. All output values sum to 1.
        This transforms the network's raw final scores (o_pre) into a probability distribution. We can interpret the output o as the network's confidence in each digit. We can't use a function like ReLU (max(0,z)) here because its outputs don't sum to 1 and can't be interpreted as probabilities.

Finally, you’ll notice that the value of o is clipped by epsilon. This prevents the computation from breaking: the loss function uses np.log(), and log(0) is negative infinity.
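
To keep these two functions easy to experiment with, here is a small standalone sketch of both (helper functions of my own, purely for illustration; the training loop above inlines the same math):

# Standalone versions of the two activations (illustrative helpers)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))        # squashes any real number into (0, 1)

def softmax(z):
    exps = np.exp(z - np.max(z))       # subtract the max for numerical stability
    return exps / np.sum(exps)

scores = np.array([[2.0], [1.0], [-1.0]])
probs = softmax(scores)
print(probs.ravel())                   # each value lies in (0, 1)
print(probs.sum())                     # 1.0 -- a valid probability distribution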


Cross-Entropy Loss

Now that we have a prediction (o) and a ground truth (label_vec), we can quantify how wrong the network was.

error = -np.sum(label_vec * np.log(o)) # Cross-Entropy Loss
  • The Math: L = -Σ yᵢ log(pᵢ), where y is the true label (our one-hot label_vec) and p is the prediction (o). Since y is 1 for the correct class and 0 for all others, this simplifies to L = -log(p_correct).

  • The Intuition: This measures "surprise." If the network predicts a high probability for the correct digit (e.g., p_correct = 0.95), then -log(0.95) is a very small error. If the network is confidently wrong (e.g., p_correct = 0.01), then -log(0.01) is a very large error. This loss function heavily penalizes confident mistakes.
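
A quick numerical check of those two cases (the probabilities are just illustrative):

# Cross-entropy punishes confident mistakes far more than it penalizes modest errors
print(-np.log(0.95))   # ~0.05 -> tiny loss when the correct class gets high probability
print(-np.log(0.01))   # ~4.61 -> large loss when the correct class gets low probability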


Backpropagation

This is the most mathematically rich part. We use the error to figure out how to adjust every single weight and bias. The goal is to calculate the gradient of the loss function with respect to each parameter (∂L/∂W, ∂L/∂b). The gradient tells us the direction of steepest ascent for the loss, so we take a small step in the opposite direction to reduce the error. This entire process is a practical application of the Chain Rule from calculus.

Step 1: The Output Error Gradient

delta_o = o - label_vec
  • The Math: This is the derivative of the Loss with respect to the pre-activation output scores, ∂L/∂o_pre. For the specific combination of Softmax and Cross-Entropy Loss, the calculus simplifies beautifully to this intuitive form: (prediction - actual). This vector tells us the magnitude and direction of the error for each output neuron.
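
If you want to convince yourself of that simplification without redoing the calculus, a finite-difference check works well. This is a standalone sketch (the variable names are my own, not part of the training loop):

# Numerically verify that d(loss)/d(o_pre) = softmax(o_pre) - label for softmax + cross-entropy
z = np.random.randn(10, 1)            # raw output scores
y = np.zeros((10, 1)); y[3] = 1       # one-hot target (class 3 chosen arbitrarily)

def ce_loss(z):
    p = np.exp(z - np.max(z)); p = p / np.sum(p)
    return -np.sum(y * np.log(p))

grad_fd = np.zeros_like(z)
eps = 1e-6
for i in range(z.size):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps; zm[i] -= eps
    grad_fd[i] = (ce_loss(zp) - ce_loss(zm)) / (2 * eps)

p = np.exp(z - np.max(z)); p = p / np.sum(p)
print(np.allclose(grad_fd, p - y, atol=1e-6))   # True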

Step 2: Update Output Layer Weights & Biases

w_h2_o += -learning_rate * delta_o @ np.transpose(h2)
b_h2_o += -learning_rate * delta_o
  • The Math: To get the gradient for the weights (∂L/∂w_h2_o), the Chain Rule states: ∂L/∂w_h2_o = (∂L/∂o_pre) * (∂o_pre/∂w_h2_o).

  • Connecting to Code: We already have ∂L/∂o_pre (it's delta_o). The term ∂o_pre/∂w_h2_o is simply the activation of the layer that fed into it, h2. Therefore, the full gradient is delta_o @ np.transpose(h2). We then take a small step (-learning_rate) in the opposite direction of this gradient. The bias update is even simpler as its derivative is just 1.
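
A quick way to convince yourself the shapes line up (dummy arrays, purely for illustration): the gradient for a weight matrix always has the same shape as the matrix it updates.

# Shape check with dummy values
delta_o_demo = np.ones((10, 1))
h2_demo = np.ones((32, 1))
print((delta_o_demo @ np.transpose(h2_demo)).shape)   # (10, 32), same shape as w_h2_o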

Step 3: Propagate Error to Hidden Layer 2

delta_h2 = np.transpose(w_h2_o) @ delta_o * (h2 * (1 - h2))
  • The Math: Now we need the error for the next layer back, ∂L/∂h2_pre. The Chain Rule expands: ∂L/∂h2_pre = (∂L/∂o_pre) * (∂o_pre/∂h2) * (∂h2/∂h2_pre).

  • Connecting to Code:

    • (∂L/∂o_pre) * (∂o_pre/∂h2): This is the output error (delta_o) propagated backward through the weights (w_h2_o). This is np.transpose(w_h2_o) @ delta_o. It tells us how much each h2 neuron contributed to the final output error.

    • (∂h2/∂h2_pre): This is the derivative of the Sigmoid activation function, which conveniently is σ(z) * (1 - σ(z)), or in our code, h2 * (1 - h2).

    • The term h2 * (1-h2) has a powerful intuition: neurons that were very certain (output near 0 or 1) have a small derivative and are changed very little. Neurons that were uncertain (output near 0.5) have the largest derivative and are updated the most. The network focuses its learning on its points of uncertainty!
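
You can see this numerically with a few illustrative values:

# The sigmoid derivative s * (1 - s) peaks at s = 0.5 and vanishes near 0 and 1
for s in [0.01, 0.5, 0.99]:
    print(s, s * (1 - s))   # 0.0099, 0.25, 0.0099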

Steps 4-6: Continue the Process Backward
The remaining lines repeat this exact pattern. The error delta_h2 is used to update w_h1_h2 and b_h1_h2, and then it's propagated further back to create delta_h1, which is used to update the final set of weights and biases. This "chain" is what gives the Chain Rule its name.
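
Once training is done, using the network is just the forward pass again. As a minimal sketch of my own (not part of the original script), here is how you might score a single test image with the learned parameters:

# Illustrative inference on one test image using the trained weights and biases
image = x_test[0].reshape(784, 1)
h1 = 1 / (1 + np.exp(-(w_i_h1 @ image + b_i_h1)))
h2 = 1 / (1 + np.exp(-(w_h1_h2 @ h1 + b_h1_h2)))
o_pre = w_h2_o @ h2 + b_h2_o
exps = np.exp(o_pre - np.max(o_pre))
o = exps / np.sum(exps)
print("predicted digit:", np.argmax(o), "| true label:", y_test[0])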


Conclusion

We took a blank file, a handful of mathematical principles, and a collection of pixels, and we forged a system that can learn. But the most valuable thing we built today wasn't a digit classifier, it was intuition.

In a world dominated by powerful frameworks like PyTorch and TensorFlow, it's tempting to jump straight to the high-level commands.

These tools are incredible for productivity and are essential for building state-of-the-art models. But used without a foundational understanding, they can become "magic black boxes" that work in mysterious ways. When they break, we don't know why. When we need to innovate, we don't know how.

This project was our refusal to accept the magic box.

Take this blog with a grain of salt, but I’d rather share a project that’s built from scratch and stays close to intuition (even if that makes it more prone to mistakes) so that I can openly express my creativity in applied mathematics and further the discussion around AI.

code link: https://github.com/HarvsDucs/mnist_from_scratch/blob/main/main.py

I also highly suggest checking out 3blue1brown’s video about backpropagation calculus: https://www.youtube.com/watch?v=tIeHLnjs5U8
