Part 5: Striking the Balance — Understanding Underfitting and Overfitting in Linear Models

Abhilash PS

In Part 4, we focused on improving our model. But how do we know if it’s too weak or too aggressive?
In this final post of the series, we’ll explain underfitting, overfitting, and the bias-variance tradeoff — one of the most important ideas in machine learning.

We’ll learn how to visualize it, fix it, and answer questions about it in interviews.

Introduction

When building machine learning models, there are two classic traps that even seasoned data scientists can fall into: underfitting and overfitting. These two issues can silently ruin a model’s performance, yet they are some of the most intuitive concepts once you get the hang of them.

Here, we’ll break down underfitting and overfitting with:

  • Simple definitions and metaphors

  • Hands-on code and visualizations (using Python & NumPy)

  • How to detect and fix both problems

  • A final checklist to evaluate if our model is in the sweet spot

Whether we're just starting out or brushing up on fundamentals, this guide will give us a solid understanding.

The Big Picture: What Are We Trying to Do?

When we train a machine learning model, our goal is to learn patterns from data that generalize well to new, unseen data.

Imagine we're tutoring a student. We want them to understand the concept (generalization), not just memorize answers to specific questions (overfitting) or misunderstand everything (underfitting).

What is Underfitting?

Definition: A model is said to be underfitting when it is too simple to capture the underlying trend in the data.

Symptoms:

  • High training error

  • High test error

  • Poor performance on both seen and unseen data

Analogy:

Imagine fitting a straight line through data that clearly forms a curve. Our model is too naive to catch what’s really happening.

Causes:

  • Model is too simple (e.g., linear model for nonlinear data)

  • Not enough training time (early stopping)

  • Poor features
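
To see these symptoms in numbers, here is a minimal sketch, assuming synthetic data similar to what we plot later in this post: a straight line fitted to clearly quadratic data scores badly on both the training sample and a fresh sample from the same process.

import numpy as np

np.random.seed(0)
x_train = np.linspace(0, 10, 20)
y_train = 3 * x_train**2 + 2 * x_train + 1 + np.random.randn(20) * 15

# Fresh test points from the same noisy quadratic process
x_test = np.sort(np.random.uniform(0, 10, 20))
y_test = 3 * x_test**2 + 2 * x_test + 1 + np.random.randn(20) * 15

coeffs = np.polyfit(x_train, y_train, 1)      # degree 1: too simple for a quadratic trend

train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(f"Train MSE: {train_mse:.1f}, Test MSE: {test_mse:.1f}")   # both stay high: underfitting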

What is Overfitting?

Definition: A model overfits when it memorizes the training data, including noise and outliers, and fails to generalize to new data.

Symptoms:

  • Very low training error

  • Very high test error

Analogy:

Imagine a student who memorizes every answer from the practice test. When they see a new question in the exam, they panic.

Causes:

  • Model is too complex (e.g., very deep tree, high-degree polynomial)

  • Too many parameters for the size of the data

  • Noisy training data

  • Insufficient regularization
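
To see the mirror image in numbers, here is a minimal sketch under the same assumed synthetic setup: a degree-12 polynomial has far too many parameters for 20 noisy points, so training error collapses while the error on fresh data does not.

import numpy as np

np.random.seed(0)
x_train = np.linspace(0, 10, 20)
y_train = 3 * x_train**2 + 2 * x_train + 1 + np.random.randn(20) * 15

# Fresh test points drawn from the same range and the same noisy process
x_test = np.sort(np.random.uniform(0, 10, 20))
y_test = 3 * x_test**2 + 2 * x_test + 1 + np.random.randn(20) * 15

coeffs = np.polyfit(x_train, y_train, 12)     # 13 coefficients for only 20 noisy points

train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(f"Train MSE: {train_mse:.1f}, Test MSE: {test_mse:.1f}")   # low train error, much higher test error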

Bias-Variance Tradeoff

Understanding the Theory Behind the Balance

While it's easy to grasp underfitting and overfitting visually, there's a deeper concept that unites them: the bias-variance tradeoff. This tradeoff helps explain why models behave the way they do as complexity changes.

Definition of Bias (in Machine Learning): Bias refers to the error introduced by approximating a complex problem with a simplified model. In simpler terms, it’s when a model ignores key patterns because it makes strong assumptions.

High Bias → Underfitting

  • Happens when the model is too simple to capture patterns in the data.

  • Tends to make strong assumptions about the data (e.g., assuming all relationships are linear).

  • Leads to consistently poor predictions, both on training and test sets.

Think of a student who didn’t study enough and tries to guess every answer based on a single rule — they’re wrong most of the time.

Definition of Variance (in Machine Learning): Variance measures how sensitive a model is to slight changes in the training data. It reflects how much predictions would change if trained on a different sample from the same source.

High Variance → Overfitting

  • Occurs when the model is too complex and tries to fit every detail of the training data, including noise.

  • Sensitive to even slight changes in the data.

  • Performs well on training data but poorly on unseen data.

Like a student who memorizes every question on a practice test — they fail when the test format changes slightly.
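
To make "sensitive to even slight changes in the data" concrete, here is a small sketch (assuming the same synthetic quadratic data) that nudges a single training label and measures how far each fitted curve moves. The high-degree curve typically shifts far more than the straight line.

import numpy as np

np.random.seed(0)
x = np.linspace(0, 10, 20)
y = 3 * x**2 + 2 * x + 1 + np.random.randn(20) * 15

y_nudged = y.copy()
y_nudged[10] += 10                            # change a single label by a small amount

x_grid = np.linspace(0, 10, 200)
for degree in [1, 12]:
    before = np.polyval(np.polyfit(x, y, degree), x_grid)
    after = np.polyval(np.polyfit(x, y_nudged, degree), x_grid)
    change = np.max(np.abs(before - after))
    print(f"degree {degree:2d}: max change in the fitted curve = {change:.2f}")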

The Ideal Zone: Balance

  • A good model strikes a balance between bias and variance.

  • It is complex enough to capture patterns, but simple enough to ignore noise.

  • This sweet spot often lies somewhere in the middle of the complexity spectrum.

📌 Rule of Thumb: Increasing model complexity reduces bias but increases variance. The goal is to minimize total error, which comes from both.

$$\text{Total Error} = \underbrace{\text{Bias}^2}_{\text{error from wrong assumptions}} + \underbrace{\text{Variance}}_{\text{error from overreacting to noise}} + \text{Irreducible Error}$$
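
We can estimate these terms with a rough Monte Carlo sketch, assuming the same synthetic quadratic data used throughout this post; true_fn, sample_data, and the query point x0 are illustrative choices, not part of any earlier code. The degree-1 fit should show high bias, the degree-12 fit high variance, and degree 2 typically the smallest total.

import numpy as np

np.random.seed(0)

def true_fn(x):                               # the noiseless target (illustrative helper name)
    return 3 * x**2 + 2 * x + 1

def sample_data(n=20, noise=15):              # one fresh training set from the same process
    x = np.linspace(0, 10, n)
    return x, true_fn(x) + np.random.randn(n) * noise

x0 = 5.0                                      # arbitrary query point where we study the error
noise_var = 15 ** 2                           # irreducible error: variance of the label noise

for degree in [1, 2, 12]:
    preds = []
    for _ in range(500):                      # refit on 500 resampled training sets
        x, y = sample_data()
        preds.append(np.polyval(np.polyfit(x, y, degree), x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_fn(x0)) ** 2
    variance = preds.var()
    total = bias_sq + variance + noise_var
    print(f"degree {degree:2d}: bias^2 = {bias_sq:7.1f}  variance = {variance:7.1f}  total ~ {total:7.1f}")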

Visualizing the Problem

Let’s use Python and NumPy to simulate and visualize:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic dataset
np.random.seed(1)
x = np.linspace(0, 10, 20)
y = 3 * x**2 + 2 * x + 1 + np.random.randn(20) * 15

# Fit & predict function
def fit_predict(x, y, degree):
    coeffs = np.polyfit(x, y, degree)
    x_line = np.linspace(min(x), max(x), 200)
    y_line = np.polyval(coeffs, x_line)
    return x_line, y_line

# Plot
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, deg in enumerate([1, 2, 15]):
    x_line, y_line = fit_predict(x, y, deg)
    axes[i].scatter(x, y, color='blue', label='Data')
    axes[i].plot(x_line, y_line, color='red', label=f'Degree {deg}')
    axes[i].set_title(['Underfitting', 'Good Fit', 'Overfitting'][i])
    axes[i].legend()
    axes[i].grid(True)
plt.tight_layout()
plt.show()

This code shows:

  • A linear model struggling to capture the pattern (underfit)

  • A quadratic model doing well (good fit)

  • A complex polynomial model that zigzags wildly (overfit)

Training vs Validation Curve Plot

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic dataset
np.random.seed(1)
x = np.linspace(0, 10, 20)
y = 3 * x**2 + 2 * x + 1 + np.random.randn(20) * 15

# Reshape and split
x = x.reshape(-1, 1)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42)

train_errors = []
val_errors = []
degrees = range(1, 16)

for d in degrees:
    coeffs = np.polyfit(x_train.flatten(), y_train, d)
    model = np.poly1d(coeffs)
    y_train_pred = model(x_train.flatten())
    y_val_pred = model(x_val.flatten())

    train_errors.append(mean_squared_error(y_train, y_train_pred))
    val_errors.append(mean_squared_error(y_val, y_val_pred))

# Plotting
plt.figure(figsize=(10, 5))
plt.plot(degrees, train_errors, label='Training Error', marker='o')
plt.plot(degrees, val_errors, label='Validation Error', marker='o')
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Mean Squared Error')
plt.title('Bias-Variance Tradeoff: Error vs. Model Complexity')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

The chart we’ve generated is a bias-variance tradeoff visualization, showing how model complexity (polynomial degree) affects training and validation error.

X-axis: Model Complexity, represented by the degree of the polynomial (from 1 to 15).

Y-axis: Mean Squared Error (MSE) — lower is better.

Blue Line: Training Error — how well the model fits the data it was trained on.

Orange Line: Validation Error — how well the model performs on unseen data.

Interpretation of the Plot: Error vs Model Complexity

This chart shows how model performance changes as we increase complexity by using higher-degree polynomials (from 1 to 15):

Degrees 1–11 – Sweet Spot or Data Quirk?

  • Both training and validation errors are very low and nearly equal.

  • At first glance, this looks like we’ve nailed the sweet spot — the model is generalizing well.

  • However, with such consistently low error across degrees, it's worth asking:

    “Is the dataset too small or too easy?”

  • This could happen if:

    • The data has a strong, clean pattern.

    • We have too few data points (e.g., only 20 samples).

    • Even simple models can perfectly fit it — which means true underfitting is hard to visualize here.

Degrees 12–15 – Clear Overfitting Zone

  • Validation error spikes dramatically, while training error stays very low.

  • This is classic overfitting:

    • The model starts to memorize every tiny fluctuation in training data — even noise.

    • It loses the ability to generalize to unseen data.

  • This is a clear sign of high variance.

What This Tells Us (for Linear Regression Learners)

  • As we increase model complexity:

    • Training error always goes down (we can always memorize more).

    • Validation error decreases up to a point, then increases again — forming the classic U-shaped curve.

  • The goal is to stop at the lowest point of validation error — that’s your sweet spot.
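
As a small follow-up to the validation-curve code above (this snippet assumes the degrees and val_errors lists from that block are still in scope), we can pick that sweet spot programmatically instead of reading it off the plot:

# Pick the degree with the lowest validation error
best_idx = int(np.argmin(val_errors))
print(f"Best degree by validation error: {degrees[best_idx]} "
      f"(val MSE = {val_errors[best_idx]:.2f})")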

Conclusion

Even linear regression can overfit once it is extended with polynomial features.
This plot helps us visually detect when our model is becoming too complex for the data it’s learning from.

Detecting Underfitting & Overfitting

Compare errors on training and validation data, as in the curve plotted earlier. The table below summarizes the typical signs and fixes:

Aspect            Underfitting                        Overfitting
Training Error    High                                Very Low
Test Error        High                                High
Model Type        Too Simple                          Too Complex
Generalization    Poor on both seen and unseen data   Poor on unseen data
Fixes             Increase complexity, add features   Regularization, simplify, more data

Remedies and Fixes

To Fix Underfitting:

  • Use a more complex model

  • Add more features or transformations

  • Reduce regularization (We will come to this later)

  • Train longer

To Fix Overfitting:

  • Simplify the model (fewer parameters)

  • Use regularization (L1, L2)

  • Get more data

  • Use dropout for neural networks (we will come to this later)

  • Use cross-validation
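
As a small preview of the regularization fix (which the next post covers in detail), here is a minimal sketch using scikit-learn's PolynomialFeatures and Ridge on the same synthetic data; alpha=1.0 is an illustrative choice, not a tuned value. The ridge model usually ends up with a noticeably lower validation error than the unregularized degree-15 fit.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Same synthetic data as the earlier examples
np.random.seed(1)
x = np.linspace(0, 10, 20).reshape(-1, 1)
y = 3 * x.ravel()**2 + 2 * x.ravel() + 1 + np.random.randn(20) * 15
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42)

# Degree-15 polynomial features with and without an L2 penalty (alpha=1.0 is illustrative)
plain = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(), LinearRegression())
ridge = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(), Ridge(alpha=1.0))

for name, model in [("no regularization", plain), ("ridge (L2)", ridge)]:
    model.fit(x_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(x_val))
    print(f"{name}: validation MSE = {val_mse:.1f}")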

Bonus: A Real-World Example

Let’s say we’re predicting exam scores based on hours studied. Our dataset:

Hours Studied (x)   Actual Score (y)
0                   42
1                   47
2                   53
3                   58
4                   67

If our predicted values were: 40, 45, 50, 55, 60 → we’d see residuals increasing (underfitting).
If they were: 42, 47, 53, 58, 67 → perfect predictions (possibly overfitting unless this generalizes well).
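
A quick numeric check of those two prediction sets, using the numbers from the table above:

import numpy as np

y_actual = np.array([42, 47, 53, 58, 67])
y_underfit = np.array([40, 45, 50, 55, 60])   # residuals grow: 2, 2, 3, 3, 7
y_perfect = np.array([42, 47, 53, 58, 67])    # zero residuals on this data

print("Underfit residuals:", y_actual - y_underfit)
print("Underfit MSE:", np.mean((y_actual - y_underfit) ** 2))    # 15.0
print("Perfect-fit MSE:", np.mean((y_actual - y_perfect) ** 2))  # 0.0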

Quick Flashcards

Q: What is underfitting?
A: When the model is too simple to learn the data's structure — high training and test error.

Q: What is overfitting?
A: When the model memorizes the training data, including noise — low train error, high test error.

Q: What causes overfitting?
A: Too complex model, too many parameters, noisy data, not enough regularization.

Q: What is the bias-variance tradeoff?
A: It's the balance between underfitting (high bias) and overfitting (high variance) to minimize total error.

Q: How can you fix underfitting?
A: Use a more complex model, train longer, improve features, reduce regularization.

Q: How can you fix overfitting?
A: Use regularization, collect more data, simplify the model, or use dropout (in neural networks).

Conclusion

Understanding underfitting and overfitting is a foundational skill in machine learning. We don’t need to be a math genius to recognize them. We just need to:

  • Visualize often

  • Track performance on both training and test sets

  • Tweak your models thoughtfully

Once we develop the intuition, spotting these patterns becomes second nature.

What’s next?

We’ve now completed the core 5-part series on linear regression and supervised learning! What’s next? Regularization — our tool to tame overfitting without losing performance. Stay tuned for the next post, where we’ll explore Ridge and Lasso regression, and how to choose the right complexity automatically.

Make your models robust and reliable.


Written by

Abhilash PS