Deep Learning for Coders with fastai and PyTorch: My Journey Through Chapters 4–7

Krupa Sawant

When you first see a deep learning model recognize a cat or a handwritten number, it feels like magic. But in Deep Learning for Coders with fastai and PyTorch, Chapters 4–7 take you from that magic moment to fully understanding how it works under the hood — from raw pixels to building your own neural network from scratch.

What are pixels?

Every image is made up of pixels — tiny squares of color or brightness.
In the MNIST dataset, each image is 28×28 pixels in grayscale, so each pixel has just one value, between 0 (black) and 1 (white) after normalization.

1. The MNIST dataset

MNIST is a classic dataset of 70,000 handwritten digits (0–9).

  • 60,000 images for training

  • 10,000 images for testing

Each image comes with a label telling us which digit it represents — for example, “this is a 3.”

When loaded into Python, each MNIST image becomes a NumPy array:

image.shape  # (28, 28)

We can flatten this into a single vector of length 784 (28×28) so it’s easier to feed into a simple model.
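As a quick sketch (using a random array as a stand-in for a real MNIST image), normalizing and flattening looks like this:

import numpy as np

# A random 28x28 array standing in for a real MNIST image (values 0-255)
image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

# Normalize to the 0-1 range and flatten into a 784-long vector
flat = (image / 255.0).reshape(-1)
print(flat.shape)  # (784,)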

2. Tensors vs NumPy arrays

A tensor is like a NumPy array but with two big advantages:

  1. Runs seamlessly on GPUs for massive speed boosts.

  2. Supports automatic differentiation (autograd) — essential for training neural networks.

In PyTorch, torch.tensor() converts data (including a NumPy array) into a tensor that can be moved to the GPU and can track gradients for autograd.
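A minimal example of the NumPy-to-tensor jump (the GPU step only runs if CUDA is available):

import numpy as np
import torch

arr = np.arange(6.0).reshape(2, 3)          # a plain NumPy array
t = torch.tensor(arr, requires_grad=True)   # copies the data into an autograd-friendly tensor

if torch.cuda.is_available():               # move to the GPU only if one exists
    t = t.to("cuda")

print(t.shape, t.requires_grad)             # torch.Size([2, 3]) True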

Tensors: ranks & shapes

  • Rank = number of dimensions (0D scalar, 1D vector, 2D matrix, etc.).

  • Shape = size of each dimension (e.g., (28, 28) for one MNIST image).

These ideas matter because deep learning models move between shapes constantly — for example, flattening a 2D image into a 1D vector.
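A few quick examples of ranks and shapes in PyTorch:

import torch

scalar = torch.tensor(3.5)              # rank 0, shape torch.Size([])
vector = torch.tensor([1.0, 2.0, 3.0])  # rank 1, shape torch.Size([3])
image  = torch.rand(28, 28)             # rank 2, shape torch.Size([28, 28])

print(scalar.ndim, vector.ndim, image.ndim)  # 0 1 2
print(image.view(-1).shape)                  # torch.Size([784]) -- the flattened image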

3. Averaging “all 3’s”

To build our first classifier, let’s ask what makes a “3” look different from a “7”:

  1. Stack all the “3” images into one big tensor.

  2. Take the average across them.

  3. You’ll see a fuzzy “prototype” 3 — the most common pixel pattern for that digit.

Do the same for 7’s, and you can classify a new image by seeing which prototype it’s closer to.
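A rough sketch of that recipe, assuming threes and sevens are lists of 28×28 image tensors loaded from the dataset:

import torch

# Assume threes and sevens are lists of 28x28 image tensors for those digits
stacked_threes = torch.stack(threes).float() / 255   # shape: (num_threes, 28, 28)
stacked_sevens = torch.stack(sevens).float() / 255

mean3 = stacked_threes.mean(dim=0)   # the fuzzy "prototype" 3, shape (28, 28)
mean7 = stacked_sevens.mean(dim=0)   # the fuzzy "prototype" 7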

4. Using MSE for classification

Even though Mean Squared Error (MSE) is more common in regression, we can use it here for our toy 3-vs-7 classifier:

Lower MSE means our predictions are closer to the correct labels.
It’s not the best choice for classification in general, but it works for building intuition.
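Continuing the sketch above, a toy classifier that compares MSE distances to the two prototypes might look like this:

def mse_distance(img, prototype):
    # mean of squared pixel differences between an image and a prototype
    return ((img - prototype) ** 2).mean()

def predict_is_3(img):
    # closer (lower MSE) to the average 3 than to the average 7 -> call it a "3"
    return mse_distance(img, mean3) < mse_distance(img, mean7)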

5. Broadcasting for metrics

Broadcasting is a PyTorch feature that automatically expands arrays of different shapes so they can work together in a single operation.
It lets us compute differences, averages, or accuracies over thousands of images at once — no for-loops needed.
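A small illustration with stand-in data:

import torch

valid_3s = torch.rand(1000, 28, 28)   # 1,000 images (random stand-in data)
mean3 = torch.rand(28, 28)            # one prototype image

# The (28, 28) prototype is broadcast against the (1000, 28, 28) batch,
# giving one distance per image with no Python loop
dists = ((valid_3s - mean3) ** 2).mean(dim=(-1, -2))
print(dists.shape)  # torch.Size([1000])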

6. The Magic of SGD

Once we have a loss function, the next question is: how do we make it smaller?
That’s where Stochastic Gradient Descent (SGD) comes in — the workhorse of modern deep learning.

Whether you’re training an image classifier, a translation model, or a large language model, the basic loop is the same.


The steps of SGD

  1. Initialize weights randomly.

  2. Forward pass — make predictions from the current weights.

  3. Loss calculation — measure how far off the predictions are.

  4. Backward pass — calculate the gradient (how each weight affects the loss).

  5. Update weights:

    $w_{\text{new}} = w - \alpha \cdot \text{grad}$

    Here $\alpha$ is the learning rate — how big a step we take.

  6. Repeat until the loss stops improving.

SGD is so universal in deep learning that Jeremy Howard calls it “the only thing you really need to know to train all deep learning models.”
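Here is a minimal from-scratch sketch of that loop, fitting a toy line y = 3x + 2 rather than a real model:

import torch

# Toy data: y = 3x + 2 plus a little noise
x = torch.linspace(-1, 1, 100)
y = 3 * x + 2 + 0.1 * torch.randn(100)

params = torch.randn(2, requires_grad=True)   # 1. initialize weights randomly
lr = 0.1                                      # the learning rate (alpha)

for step in range(100):
    w, b = params
    pred = w * x + b                   # 2. forward pass
    loss = ((pred - y) ** 2).mean()    # 3. loss calculation (MSE)
    loss.backward()                    # 4. backward pass: gradients land in params.grad
    with torch.no_grad():
        params -= lr * params.grad     # 5. update: w_new = w - lr * grad
        params.grad.zero_()            # reset gradients before the next step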

From Linear Models to Neural Nets

7. Adding activation functions

If you just stack linear layers, the whole model is still linear — and can’t capture complex patterns.
We fix this with non-linear activation functions:

  • ReLU: Turns negatives into 0, keeps positives as they are — fast and effective.

  • Sigmoid: Squashes outputs into 0–1, good for binary probabilities.

  • Softmax: Turns outputs into probabilities summing to 1 — perfect for multiclass problems.
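A quick look at all three in PyTorch:

import torch
import torch.nn.functional as F

logits = torch.tensor([-1.0, 0.0, 2.0])

print(F.relu(logits))             # tensor([0., 0., 2.])   negatives clipped to zero
print(torch.sigmoid(logits))      # each value squashed independently into (0, 1)
print(F.softmax(logits, dim=0))   # non-negative values that sum to 1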


8. Mini-batching

SGD works even better when we use mini-batches:

  • Instead of all data at once (slow) or one at a time (noisy), we use small groups (e.g., 64 images).

  • This speeds up training on GPUs.

  • It also makes the updates more stable, because each batch is an average of many samples.

Mini-batching is the missing ingredient that turns our pixel-matching toy example into a scalable training process for large datasets.
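A small sketch with stand-in data, showing how a DataLoader hands us mini-batches of 64:

import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.rand(1000, 784)             # 1,000 flattened images (stand-in data)
y = torch.randint(0, 10, (1000,))     # 1,000 labels

dl = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)
xb, yb = next(iter(dl))
print(xb.shape, yb.shape)             # torch.Size([64, 784]) torch.Size([64])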

Chapter 5: From Digits to Pet Breeds and Beyond

After classifying 3’s and 7’s, it’s time for something more challenging:

Can we identify the exact breed of a pet from a photo?

This isn’t just dog vs cat — it’s a multiclass problem with many breeds. That means we need better tools and a more powerful model.


1. Resizing and Presizing

Our images come in all shapes and sizes. Models expect inputs of the same dimensions, so we need to resize them.
But simply resizing can distort important features (e.g., stretching a cat’s face).

Presizing is a better approach:

  1. First resize each image to a relatively large intermediate size (e.g., 460×460), cropping on the full width or height so nothing important gets squashed.

  2. Then, as part of the batch transforms on the GPU, apply augmentations and resize down to the final target size (e.g., 224×224 pixels).

This preserves more useful detail for the model.
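In fastai, presizing is expressed as a pair of transforms, roughly as in the book’s pet-breeds example (assuming fastai is installed):

from fastai.vision.all import Resize, aug_transforms

item_tfms  = Resize(460)                                # per image, on the CPU: crop/resize to a large size
batch_tfms = aug_transforms(size=224, min_scale=0.75)   # per batch, on the GPU: augment + final 224x224 resize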


2. Checking our DataBlocks

In fastai, we define a DataBlock to manage our data pipeline:

  • Where the data is

  • How to get labels

  • How to transform and batch data

We can preview the DataBlock with .show_batch() to confirm:

  • Images look correct

  • Labels are correct

  • No preprocessing mistakes
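A sketch close to the book’s pet-breeds DataBlock (the regex labeller pulls the breed name out of each file name):

from fastai.vision.all import *

path = untar_data(URLs.PETS) / "images"

pets = DataBlock(
    blocks=(ImageBlock, CategoryBlock),                          # images in, breed categories out
    get_items=get_image_files,                                   # where the data is
    splitter=RandomSplitter(valid_pct=0.2, seed=42),             # train/validation split
    get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),   # how to get labels from file names
    item_tfms=Resize(460),
    batch_tfms=aug_transforms(size=224, min_scale=0.75),
)
dls = pets.dataloaders(path)
dls.show_batch(nrows=1, ncols=3)   # eyeball a few images and labels before training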


3. Cross Entropy Loss and Softmax (Why?)

In MNIST, we used MSE for simplicity. But here, we have multiple classes and want the model to output a probability for each breed.

Softmax:

  • Takes the raw scores (logits) from the model.

  • Converts them into probabilities that sum to 1.

Cross Entropy Loss (CCE):

  • Compares the predicted probability for the correct class with 1 (perfect confidence).

  • Punishes low probability for the correct class:

$\text{Loss} = -\log(\text{probability of correct class})$

Why not MSE?

  • MSE works poorly with probabilities — the gradients become too small.

  • CCE is designed for classification with Softmax, giving clearer learning signals.
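A tiny example showing that cross entropy is just the negative log of the softmax probability assigned to the correct class:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw scores for 3 classes, one example
target = torch.tensor([0])                  # index of the correct class

probs = F.softmax(logits, dim=1)            # probabilities that sum to 1
loss  = F.cross_entropy(logits, target)     # same as -log(probs[0, 0]); softmax is built in

print(probs, loss)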


4. Improving the Model — Learning Rate Finder

Choosing a good learning rate is crucial — too low is slow, too high can explode.

Fastai’s learning rate finder:

  • Tries a range of learning rates in one run.

  • Plots loss vs learning rate.

  • Pick the highest rate before the loss spikes.
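With a fastai Learner built on the dls from the DataBlock sketch above (vision_learner is called cnn_learner in older fastai versions), this is a one-liner:

from fastai.vision.all import vision_learner, resnet34, error_rate

learn = vision_learner(dls, resnet34, metrics=error_rate)   # dls from the DataBlock sketch above
learn.lr_find()   # tries a range of learning rates and plots loss vs. LR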


5. Unfreezing and Discriminative Learning Rates

We often start with a pretrained model (e.g., ResNet) and only train the final layer first — this is freezing the earlier layers.
Once the final layer is tuned, we can unfreeze the whole model to fine-tune all layers.

Discriminative learning rates:

  • Use different learning rates for different parts of the network.

  • Early layers (general features like edges) → small LR.

  • Later layers (task-specific features) → larger LR.
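Continuing with the same learner, this is roughly how the book sequences it:

learn.fit_one_cycle(3, 3e-3)        # train only the new head while the body stays frozen
learn.unfreeze()                    # now allow every layer to be updated

# Discriminative learning rates: tiny steps for early layers, larger ones for later layers
learn.fit_one_cycle(6, lr_max=slice(1e-6, 1e-4))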


Choosing the Right Activation, Loss, and Metrics (Chapter 6)

The Pet Breed problem is multiclass, but not all classification problems are the same. Here’s the breakdown:

  • Binary classification: one yes/no answer per example. Activation: Sigmoid (0–1 output). Loss: Binary Cross-Entropy (BCE).

  • Multiclass classification: exactly one correct answer out of many. Activation: Softmax. Loss: Categorical Cross-Entropy (CCE).

  • Multilabel classification: multiple independent yes/no answers per example. Activation: Sigmoid, applied per label. Loss: Binary Cross-Entropy (BCE), per label.
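In PyTorch these pairings come bundled as single loss classes; a small sketch with random stand-in scores:

import torch
import torch.nn as nn

logits = torch.randn(4, 5)    # 4 examples, 5 classes (or 5 independent labels)

# Multiclass: exactly one correct class per example -> softmax + CCE, bundled in CrossEntropyLoss
cce = nn.CrossEntropyLoss()(logits, torch.tensor([1, 0, 4, 2]))

# Multilabel: several independent yes/no labels -> sigmoid + BCE, bundled in BCEWithLogitsLoss
bce = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (4, 5)).float())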

Metrics for Evaluation

Once we choose activation and loss, we still need to measure performance:

  • Accuracy: % correct predictions (best for balanced data).

  • Precision: Out of all predicted positives, how many were correct?

  • Recall: Out of all actual positives, how many did we find?

  • F1-score: Balance of precision and recall.

  • ROC-AUC: Probability a positive is ranked above a negative.
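These are all one-liners with scikit-learn; a toy example with made-up labels:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [0, 0, 1, 1, 1, 0]               # actual labels
y_pred  = [0, 1, 1, 1, 0, 0]               # hard predictions
y_score = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1]   # predicted probabilities for the positive class

print(accuracy_score(y_true, y_pred))      # fraction of correct predictions
print(precision_score(y_true, y_pred))     # correct positives / predicted positives
print(recall_score(y_true, y_pred))        # correct positives / actual positives
print(f1_score(y_true, y_pred))            # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))      # ranking quality of the probabilities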


By the end of Chapter 6, you know not just how to train a classifier, but which architecture and evaluation tools fit your problem type.

Chapter 7 — Training from Scratch in PyTorch

Fastai hides a lot of details — now we build the training loop ourselves.


1. Dataset & DataLoader

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X = X  # inputs, e.g., flattened 784-pixel images
        self.y = y  # integer class labels
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Assuming X_train and y_train tensors have already been prepared
train_dataset = MyDataset(X_train, y_train)
train_dl = DataLoader(train_dataset, batch_size=64, shuffle=True)

2. Model definition

import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(784, 128)  # 784 input pixels -> 128 hidden units
        self.output = nn.Linear(128, 10)   # 128 hidden units -> 10 digit classes

    def forward(self, x):
        x = F.relu(self.hidden(x))   # non-linearity between the two linear layers
        return self.output(x)        # raw logits (CrossEntropyLoss applies softmax itself)

3. Loss & optimizer

model = MyModel()
loss_fn = nn.CrossEntropyLoss()                           # softmax + negative log-likelihood in one
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # plain SGD on all model parameters

4. Training loop

for epoch in range(epochs):
    for xb, yb in train_dl:
        pred = model(xb)              # forward pass
        loss = loss_fn(pred, yb)      # how wrong are we on this mini-batch?
        loss.backward()               # backward pass: compute gradients
        optimizer.step()              # update the weights
        optimizer.zero_grad()         # reset gradients for the next batch

5. Validation

model.eval()             # switch off training-only behaviour (e.g., dropout), if any
with torch.no_grad():    # no gradients needed for evaluation
    val_pred = model(val_xb)
    val_loss = loss_fn(val_pred, val_yb)
    val_acc = (val_pred.argmax(dim=1) == val_yb).float().mean()

From a simple 3 vs 7 digit classifier to a full PyTorch training loop, Chapters 4–7 took us through tensors, loss functions, SGD, activations, batching, and fine-tuning.

By the end, we’re not just using fastai — we understand the core steps every neural network uses to learn.
