Part 6: A Deep Dive into Linear Regression Assumptions

Abhilash PS

Before diving deeper into machine learning models, it’s critical to understand the assumptions that linear regression rests upon. These assumptions — linearity, independence, constant variance (homoscedasticity), and normality of residuals — form the foundation for reliable, unbiased predictions.

🎓 We touched on these briefly in Part 4: Linear Regression – Key Techniques for Better Model Performance, but here we’ll take a closer look.

In this post, we’ll break each one down with real-world intuition, show how to check them using Python, and explain why they matter.

Linearity

Assumption: The relationship between the independent variable(s) and the dependent variable is linear (a straight-line relationship). In multiple regression, this also implies additivity – each predictor’s effect is linear and adds up with the others’ effects. Essentially, equal-sized changes in a predictor (holding the others constant) produce equal-sized changes in the outcome, scaled by the model’s slope.

In practical terms, linearity means our model form

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \epsilon$$

is correctly capturing the true relationship. If the true relationship is curved (say, quadratic or exponential) and we force a straight-line model, the linear model will systematically misestimate the outcome – underpredicting in some ranges and overpredicting in others. This results in patterns in the errors (residuals) indicating the model is a poor fit. For example, fitting a straight line to data that actually follows a U-shape will lead to a bowed pattern in a plot of residuals versus fitted values.

How can we check linearity? The simplest way is to visualize the data and the model residuals. A scatter plot of observed vs. predicted values (or residuals vs. predicted) should ideally show points forming a random cloud around a straight line (or around zero in the residual plot). If there is a clear curve or structure left in the residuals, it signals non-linearity.

In Python, we can do this easily: after fitting a model, compute predictions and residuals, then plot something like plt.scatter(predicted, residuals) to see if the residuals are randomly scattered. If we detect curvature, we might address it by transforming variables (e.g. taking log or polynomial terms) or using a more appropriate nonlinear model.

Python Code: Checking Linearity in a Regression Model

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Optional: Use a real dataset instead
# For demo, create synthetic slightly non-linear data
np.random.seed(42)
X = np.linspace(0, 10, 100)
y = 3 * X + np.sin(X) * 5 + np.random.normal(0, 2, size=100)  # non-linear component

# Reshape for sklearn
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions and residuals
y_pred = model.predict(X_test)
residuals = y_test - y_pred

# -------------------------------
# Plot: Residuals vs. Predicted
# -------------------------------
plt.figure(figsize=(8, 5))
sns.scatterplot(x=y_pred.flatten(), y=residuals.flatten(), alpha=0.8)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals (y - ŷ)")
plt.title("Residuals vs. Predicted Values\nCheck for Linearity")
plt.grid(True)
plt.tight_layout()
plt.show()

Remember, violating linearity is very serious – a linear model fitted to non-linear data can produce large errors, especially if we extrapolate outside the observed range.

  • X-axis: Predicted values from your linear regression model

  • Y-axis: Residuals (i.e., actual − predicted = y − ŷ)

What we expect in a good model (linearity holds)

  • Points are randomly scattered around the horizontal red line at 0.

  • No pattern, curve, or trend.

  • Spread of residuals is relatively consistent across all predicted values.

What this plot shows (Violation of Linearity)

  • The residuals form a bowed or curved pattern — first positive, then negative, then positive again.

  • This indicates the model systematically underpredicts in some regions and overpredicts in others.

  • It suggests that the actual relationship between the input and output may be non-linear — perhaps quadratic or sinusoidal (as in the example code).

Interpretation summary

The linear regression model may not be appropriate for this dataset as-is. There's evidence of non-linearity in the data — the model is missing some underlying structure (e.g. curvature) that affects predictions.
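One common fix is to give the model room to bend. The sketch below is only illustrative: it regenerates the same synthetic data as above and adds polynomial terms with scikit-learn’s PolynomialFeatures (degree 5 is an arbitrary choice here, not a tuned value). After refitting, the bowed pattern in the residuals should be much weaker.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic, slightly non-linear data as in the linearity check above
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3 * X.flatten() + np.sin(X.flatten()) * 5 + np.random.normal(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# degree=5 is an illustrative choice, not a tuned hyperparameter
poly_model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
poly_model.fit(X_train, y_train)

poly_pred = poly_model.predict(X_test)
poly_residuals = y_test - poly_pred

# Residuals vs. predicted for the polynomial fit: the bowed pattern should largely disappear
plt.figure(figsize=(8, 5))
plt.scatter(poly_pred, poly_residuals, alpha=0.8)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Values – Degree-5 Polynomial Fit")
plt.tight_layout()
plt.show()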

Independence of Errors

Assumption: The residuals (errors) are independent of each other. This means the error from one observation should not predict or influence the error from another. If this assumption holds, each prediction's error is its own story.

This is naturally satisfied if our data points are independent (e.g. a random sample from a population). However, time series data or any inherently ordered data can violate this due to autocorrelation – e.g. today's error might be similar to yesterday's. Violation of independence often shows up as residuals that are correlated with each other, especially in chronological order (one error "influencing" the next).

In ideal cases: Data is collected randomly, so errors are scattered without pattern.
In time-dependent or ordered data: Errors may follow a trend — this is called autocorrelation.

Why does independence matter?

If errors are correlated, our model is likely overlooking some pattern – perhaps a trend or sequence effect that wasn’t modeled. Correlated errors also mean the model’s standard error calculations can be off: you may underestimate the true uncertainty, leading to overconfident predictions and overly optimistic p-values. This is commonly seen in time series, where residuals might follow a pattern over time (e.g. alternating positive/negative or gradual drift), indicating autocorrelation.

How to check independence?

  1. Plot residuals in the order of observations (e.g. residuals vs. time if time series). A random scatter (no obvious runs or trends) suggests independence.

    • X-axis: Time/order/index

    • Y-axis: Residuals

    • A random cloud = independence

    • A pattern or wave = autocorrelation

  2. Statistical tests like the Durbin-Watson test check for autocorrelation: a DW statistic around 2 implies no significant autocorrelation, while values far from 2 signal positive or negative correlation.

  3. An autocorrelation function (ACF) plot of the residuals:

    • Shows how correlated residuals are with lagged versions of themselves

    • If many bars are outside the confidence band, autocorrelation exists

In Python, one can examine the autocorrelation function (ACF) of residuals or use statsmodels.stats.stattools.durbin_watson.

The DW statistic ranges between 0 and 4:

  • ≈ 2 → no significant autocorrelation

  • Closer to 0 → positive autocorrelation (0 means the residuals are perfectly positively correlated – bad!)

  • Closer to 4 → negative autocorrelation (4 means the residuals are perfectly negatively correlated – bad!)

  4. For non-time-series data, independence can be checked by ensuring there’s no clustering of residual signs when the data is sorted in any meaningful way.

If independence is violated, we may need to incorporate the missing pattern into the model (e.g. add a time trend, seasonal dummies, or a lagged variable) or use specialized time series regression methods. Non-independence in residuals often indicates there is information left in the residuals that the model failed to capture – an opportunity to improve the model.

Python Code & Plots

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from statsmodels.stats.stattools import durbin_watson
from statsmodels.graphics.tsaplots import plot_acf

# Synthetic time-ordered data with autocorrelation
np.random.seed(42)
n = 100
x = np.linspace(0, 10, n)
noise = np.random.normal(0, 1, n)
y = 2 * x + np.cumsum(noise)  # Introducing autocorrelation
df = pd.DataFrame({'x': x, 'y': y})

# Fit linear regression
X = sm.add_constant(df['x'])
model = sm.OLS(df['y'], X).fit()
df['y_pred'] = model.predict(X)
df['residuals'] = df['y'] - df['y_pred']

# 1. Residuals vs Time Order Plot
plt.figure(figsize=(10, 4))
plt.plot(df.index, df['residuals'], marker='o', linestyle='-', alpha=0.7)
plt.axhline(0, color='red', linestyle='--')
plt.title("Residuals in Time Order (Check Independence)")
plt.xlabel("Observation Index")
plt.ylabel("Residuals")
plt.tight_layout()
plt.show()

# 2. ACF Plot
plot_acf(df['residuals'], lags=30)
plt.title("Autocorrelation Plot of Residuals")
plt.tight_layout()
plt.show()

# 3. Durbin-Watson Test
dw_stat = durbin_watson(df['residuals'])
print(f"Durbin-Watson Statistic: {dw_stat:.3f}")

Residuals in Time Order (Line Plot)

What You See:

  • A smooth wave-like pattern in the residuals.

  • Residuals don’t jump randomly; instead, they gradually increase or decrease over time.

What This Means:

  • Residuals are correlated with previous residuals — especially the one right before.

  • This is a clear sign of positive autocorrelation.

  • Our model may be missing a time trend, seasonality, or lagged effect.

Autocorrelation Function (ACF) Plot

What You See:

  • Several vertical bars (autocorrelation values at different lags) are well outside the blue confidence band.

  • The correlation at lag 1 is close to 1.0, and it gradually decays.

What This Means:

  • Strong positive autocorrelation.

  • The residuals are highly dependent on their recent past values.

  • This confirms what we saw in the residual line plot.

     Durbin-Watson Statistic: 0.106

Combined Interpretation

Our model violates the independence assumption. Both the time-ordered residual plot and the ACF plot show that the errors are not random but strongly autocorrelated; the Durbin-Watson statistic of 0.106, far below 2, confirms it.

Probable Causes:

  • We are modeling time-ordered data (e.g., time series or sequential observations)

  • The model is not accounting for time, momentum, trend, or repeating patterns

  • Could also occur in panel data (grouped by entity over time)
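If the autocorrelation comes from a missing dynamic term, one simple remedy is to add the previous value of y as an extra regressor. The sketch below is a minimal illustration of that idea on the same synthetic data; including the lag-1 term should pull the Durbin-Watson statistic back toward 2.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Recreate the same autocorrelated data as above
np.random.seed(42)
n = 100
x = np.linspace(0, 10, n)
y = 2 * x + np.cumsum(np.random.normal(0, 1, n))
df = pd.DataFrame({'x': x, 'y': y})

# Baseline model (as before)
base = sm.OLS(df['y'], sm.add_constant(df['x'])).fit()

# Add the previous value of y as an extra regressor (lag-1 term)
df['y_lag1'] = df['y'].shift(1)
df_lag = df.dropna()                     # first row has no lagged value
lagged = sm.OLS(df_lag['y'], sm.add_constant(df_lag[['x', 'y_lag1']])).fit()

print(f"Durbin-Watson (baseline):        {durbin_watson(base.resid):.3f}")
print(f"Durbin-Watson (with lag-1 term): {durbin_watson(lagged.resid):.3f}")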

Homoscedasticity (Constant Variance)

Assumption: Constant Spread of Errors (Homoscedasticity)

In linear regression, we assume that the errors (residuals) have roughly the same spread no matter what the predicted value is.

In simple terms:
Whether the model predicts a small number or a large one, the amount it could be wrong by should stay about the same.

This consistent spread of errors is what we call homoscedasticity.

However, if the errors grow or shrink with the prediction — say, smaller predictions are quite accurate while larger ones tend to be way off — then the assumption is violated. This unequal variability is known as heteroscedasticity.

Why is this important?

When homoscedasticity holds, our model performs consistently across the entire range of predictions, and its statistical outputs — like standard errors, confidence intervals, and p-values — are trustworthy.

But if the assumption is violated:

  • We might overestimate or underestimate how certain our results are.

  • Statistical tests like t-tests or F-tests could produce misleading results.

  • Certain observations (especially those with large variance) could unfairly dominate the model.

It’s worth noting that heteroscedasticity does not bias our coefficient estimates — our model still finds the best-fitting line on average. However, it does distort inference, which means we can’t fully trust our model’s uncertainty estimates or test statistics.

How to check for homoscedasticity?

We again turn to residual plots. Plot residuals vs. fitted values (predictions) and look at the spread of residuals. Ideally, the residuals should form a horizontal band with roughly equal scatter throughout. No clear pattern or trend in the spread means homoscedasticity is likely satisfied. If you see the residuals fan out (e.g. forming a cone shape wider on one side), that's a red flag for heteroscedasticity.

Below is an example residual plot:

Residuals vs Fitted Values: Each point represents a model residual plotted against the predicted value. The residuals are scattered roughly evenly around the horizontal line at 0, with no obvious curve or funnel shape. We want to see a random "cloud" of points like this, indicating the linearity assumption is met (no systematic curvature in residuals) and the homoscedasticity assumption holds (constant variance of residuals across predictions).

If the points in a residual plot show a pattern – say, residuals growing in magnitude as the fitted value increases (widening cone) – that suggests heteroscedasticity. For a more formal check, statistical tests like Breusch-Pagan or Goldfeld-Quandt can be used to detect non-constant variance.

residuals_vs_fitted_heteroscedasticity_demo.py

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Generate synthetic data with increasing variance
np.random.seed(42)
X = np.linspace(1, 10, 100).reshape(-1, 1)
noise = np.random.normal(0, X.flatten())  # more noise for larger X
y = 3 * X.flatten() + noise

# Fit linear regression
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
residuals = y - y_pred

# Plot residuals vs predicted values
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_pred, y=residuals, alpha=0.7)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals (y - ŷ)")
plt.title("Residuals vs Predicted Values – Heteroscedasticity Example")
plt.grid(True)
plt.tight_layout()
plt.show()

In Python, we might use statsmodels.stats.diagnostic.het_breuschpagan. If heteroscedasticity is present, possible fixes include transforming the dependent variable (e.g. using log Y if variability grows with the level of Y) or using methods like robust standard errors or weighted least squares that account for the changing variance.

This residual plot shows classic signs of heteroscedasticity.

Here's an analysis:

  • Pattern Detected:
    The residuals appear to fan out as the predicted values increase — they are tightly clustered around the horizontal line (zero) for small predicted values, but the spread widens as the predictions get larger.

  • Implication:
    This pattern indicates non-constant variance of errors. The variability in prediction errors increases with the magnitude of the predicted value — violating the homoscedasticity assumption.

  • Model Reliability Impact:

    • Standard errors may be underestimated for large values.

    • Confidence intervals and p-values will likely be incorrect.

    • The model appears less precise for larger predictions, which could be dangerous if you’re using it to make decisions at that end of the range.
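To back up the visual check, here is a minimal sketch of the Breusch-Pagan test mentioned above (statsmodels’ het_breuschpagan), followed by a refit with heteroscedasticity-robust (HC3) standard errors as one possible remedy. It regenerates the same fan-shaped synthetic data, so treat it as an illustration rather than a prescribed workflow.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Same synthetic data with increasing variance as in the demo above
np.random.seed(42)
X = np.linspace(1, 10, 100)
y = 3 * X + np.random.normal(0, X)       # noise grows with X

exog = sm.add_constant(X)
ols_fit = sm.OLS(y, exog).fit()

# Breusch-Pagan test: a small p-value is evidence of heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_fit.resid, exog)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")

# One remedy for inference: heteroscedasticity-robust (HC3) standard errors
robust_fit = sm.OLS(y, exog).fit(cov_type='HC3')
print(robust_fit.bse)                    # robust standard errors of the coefficients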

Normality of Residuals

Assumption: Normality of Residuals

In linear regression, we assume that the residuals are approximately normally distributed. That means if we plot all the error terms, they should form a bell-shaped curve centered around zero.

This assumption matters most when we want to draw statistical conclusions from our model — like checking p-values or building confidence intervals. If the residuals follow a normal distribution, we can trust those results. But if the residuals deviate a lot from normality — especially when the dataset is small — those conclusions might not be reliable.

That said, normality isn’t a big deal when we’re just making predictions. Even if the residuals aren’t perfectly normal, the regression line can still give good average predictions. And with a large dataset, the central limit theorem means the coefficient estimates behave approximately normally anyway, so the statistical inferences also tend to hold up.

The Central Limit Theorem says that:

If you take many random samples from any population (even if it's not normally distributed), the average of those samples will follow a normal distribution — as long as the sample size is big enough.
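As a quick standalone illustration of the theorem (separate from the regression example): even when the population is heavily skewed, the averages of repeated samples pile up into a roughly bell-shaped histogram.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# 2,000 sample means, each computed from 50 draws of a skewed (exponential) population
sample_means = rng.exponential(scale=1.0, size=(2000, 50)).mean(axis=1)

plt.hist(sample_means, bins=40, edgecolor='black', alpha=0.7)
plt.title("Means of Exponential Samples (n = 50)")
plt.xlabel("Sample mean")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()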

However, severe non-normality is something to pay attention to:

  • If the errors have long tails, it means big prediction mistakes are happening more often than they should.

  • If the errors are skewed (leaning heavily to one side), it might suggest that the model is missing something — like a non-linear relationship or an important variable.

How to check normality?

To check if residuals are normally distributed, we usually rely on visual tools — mainly histograms and Q-Q plots.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Simulated residuals (you can replace with your model's residuals)
np.random.seed(42)
residuals = np.random.normal(0, 1, 500)

# Histogram
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(residuals, bins=30, edgecolor='black', alpha=0.7)
plt.title("Histogram of Residuals")
plt.xlabel("Residual")
plt.ylabel("Frequency")

# Q-Q Plot
plt.subplot(1, 2, 2)
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q Plot of Residuals")

plt.tight_layout()
plt.show()

Histogram of residuals

A histogram of residuals should look roughly like a bell curve: symmetrical, unimodal (one peak), and centered around zero. If the shape is smooth and balanced, it’s a good sign that the residuals follow a normal distribution. But if the histogram is skewed, lopsided, or sharply peaked, it might suggest outliers, non-linearity, or other modeling issues.

What This Histogram Tells Us

  • Bell-shaped curve: The residuals appear to follow a roughly symmetrical bell curve, centered around 0. This is exactly what we want under the normality assumption in linear regression.

  • Centered at zero: Most of the residuals (errors) are clustered near 0, which means your model tends to be accurate on average.

  • Tails: The tails drop off gradually on both sides. There's a slight right-side tail, but it’s not extreme. No strong skewness or heavy tails are immediately obvious.

What This Means for our Model

The residuals look reasonably normal, so:

  • Our p-values and confidence intervals are likely reliable (especially if our sample size is decent).

  • Our model’s statistical inferences (like t-tests for coefficients) are more trustworthy.

  • No major red flags from the perspective of normality.

Q-Q plot (quantile-quantile plot)

A Q-Q plot (quantile-quantile plot) takes it a step further. It compares the quantiles of your residuals to those of a perfect normal distribution. If the residuals are normal, the points will fall along a straight diagonal line. Deviations from this line — like an “S” curve (skewness) or bowing outward (kurtosis) — are signs of non-normality.

  • Blue Dots: These are the quantiles of your actual residuals.

  • Red Line: This is the theoretical quantile line for a perfect normal distribution.

Interpretation:

  • Most points fall along the red line: That’s great. It suggests that our residuals are very close to normally distributed.

  • Slight deviations at the tails: A few points at the very top and bottom curve away slightly. This is common and usually not a concern unless those deviations are extreme or many.

Our residuals show strong evidence of normality. The points closely follow the diagonal line with only minor deviations at the tails, which is acceptable. That means:

  • We can trust our p-values and confidence intervals.

  • Our model's statistical inferences are reliable.

  • No red flags for non-normality.

Formal statistical tests

There are also formal statistical tests like Shapiro-Wilk, Kolmogorov-Smirnov, or Jarque-Bera, but these can be overly sensitive. With large datasets, even tiny, harmless deviations might trigger a “non-normal” result. That’s why it’s often better to trust your eyes — and use visual tools alongside your understanding of the data and sample size.
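If you still want numbers alongside the plots, here is a minimal sketch that runs SciPy’s Shapiro-Wilk and Jarque-Bera tests on the simulated residuals from the snippet above; just keep the sample-size caveat in mind when reading the p-values.

import numpy as np
from scipy import stats

# Simulated residuals, as in the snippet above (replace with your model's residuals)
np.random.seed(42)
residuals = np.random.normal(0, 1, 500)

shapiro_stat, shapiro_p = stats.shapiro(residuals)
jb_stat, jb_p = stats.jarque_bera(residuals)

print(f"Shapiro-Wilk: statistic={shapiro_stat:.3f}, p-value={shapiro_p:.3f}")
print(f"Jarque-Bera:  statistic={jb_stat:.3f}, p-value={jb_p:.3f}")
# Large p-values are consistent with normality; with very large samples,
# tiny (practically harmless) deviations can still produce small p-values.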

Quick Flashcards

  1. Q: What are the 4 key assumptions of linear regression?
    A: Linearity, Independence, Homoscedasticity, and Normality of errors.

  2. Q: How can we check for linearity in data?
    A: Use scatter plots or residual plots — a curved trend indicates non-linearity.

  3. Q: What is homoscedasticity?
    A: It means the variance of errors (residuals) is constant across all levels of the independent variable(s).

  4. Q: What if residuals show a funnel shape?
    A: This indicates heteroscedasticity, violating the constant variance assumption.

  5. Q: How do we check for independence of errors?
    A: Use a Durbin-Watson test or plot residuals over time — patterns imply dependence.

  6. Q: What if errors are autocorrelated?
    A: It suggests model misspecification or omitted variables in time series data.

  7. Q: Why is normality of residuals important?
    A: For small samples, it ensures valid confidence intervals and hypothesis tests.

  8. Q: How do we check for normality?
    A: Use histograms, Q-Q plots, or statistical tests like the Shapiro-Wilk test.

  9. Q: What happens if the linearity assumption is violated?
    A: The model may consistently under- or over-predict, leading to high bias.

  10. Q: Can you fix assumption violations?
    A: Yes — by transforming variables, adding interaction terms, or using different models (e.g., decision trees).

Summary

This article dives into the key assumptions underpinning linear regression: linearity, independence, homoscedasticity (constant variance), and normality of residuals. Understanding these assumptions is crucial for ensuring reliable predictions and accurate statistical inferences from regression models. We explore each assumption with real-world examples, demonstrate how to check them using Python, and discuss their impact on model performance. Violations of these assumptions can lead to systematic errors, increased uncertainty, and misleading statistical results, emphasizing the importance of careful diagnostic checks in regression analysis.

What’s Next?

We’ve now seen how assumptions lay the groundwork for trustworthy regression models. But even with those boxes checked, not all models are created equal — especially as we start adding more features.

In the next part, we turn to a smarter way of evaluating how well our model explains the data: Adjusted R².

Unlike plain R², Adjusted R² doesn’t blindly reward complexity. It asks — does this extra feature actually help, or is it just adding noise?

We’ll explore how it works, when to use it, and why it’s essential when building models that balance simplicity and performance.

→ See you in Part 7.

