"If you call an outlier an ‘anomaly,’ it sounds cool and mysterious. If you call it a ‘mistake,’ you’re just admitting your data’s drunk."

Outliers are the mischievous frenemies of a dataset.

Classic Definition

Outliers are data points that significantly differ from the rest of your dataset. They may represent rare events, data entry errors, or something truly meaningful — like fraud, breakthrough experiments, or anomalies in the real world.

Understandable Definition

Suppose you live in Bangalore (you will see why soon 😉) and you drive every day to an office 10 km away. The trip normally takes 1 hr (the concept of distance and time is flawed in this city, beep… traffic), but one fine day you get lucky and reach the office in 25 mins (magic 🪄).
So, is your normal travel time the weekly average of ≈53 mins (four 1 hr days and one 25 min day), or is it 1 hr?
A sane Bangalorean knows better than to trust his luck; he knows he may never see a 25 min commute again in his life. That 25 min travel time is an outlier.
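To put numbers on the commute story, here is a minimal sketch. It assumes a five-day work week with four normal 60-minute trips and the one lucky 25-minute trip (that split is an assumption, chosen because it reproduces the ≈53 min average above). The variable names are made up for illustration.

```python
from statistics import mean, median

# Assumed week of commute times in minutes: four normal days, one lucky day.
commute_minutes = [60, 60, 60, 60, 25]

average_time = mean(commute_minutes)    # dragged down by the single 25-minute trip
typical_time = median(commute_minutes)  # ignores the one lucky day

print("Mean commute:", average_time)
print("Median commute:", typical_time)
```

The mean drifts toward the lucky day while the median stays at the usual hour, which is why a single outlier can make an "average" misleading.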

Outliers in a dataset can occur due to:

Variability in the data (natural rare events, which are bound to happen)
Measurement errors or data entry mistakes
Experimental errors

But why should I bang my head for outliers???

Well, outliers are frenemies. If not handled properly, they can cause real trouble.
What do I mean? Well…

Outliers can:

Distort statistical summaries like the mean and standard deviation
Skew visualizations like box plots or histograms
Mislead machine learning models, especially those sensitive to distances or squared errors (e.g., k-NN, linear regression)
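A quick way to see the first point is to compute the summaries with and without one extreme value. This is only a sketch: it borrows the small sample used in the Python examples of this post and uses the population standard deviation from the standard library.

```python
from statistics import mean, median, pstdev

clean = [12, 15, 14, 10, 18, 17, 14, 13]
with_outlier = clean + [100]  # the same values plus one extreme point

for label, values in [("without 100", clean), ("with 100", with_outlier)]:
    # mean and std react strongly to the extra point; the median barely moves
    print(f"{label:>12}: mean={mean(values):.2f}  "
          f"median={median(values)}  std={pstdev(values):.2f}")
```

One extra point out of nine is enough to move the mean well away from where most of the data sits and to inflate the standard deviation many times over.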

But… outliers aren’t always bad!

In fraud detection, outliers are the point.
In healthcare, they can signal life-threatening conditions.
In stock markets, they may represent rare opportunities or risks.

Great, but how to find my frenemies (outliers)?

Nice question. It’s actually easier to do with programming than in real life 🤧.

To find outliers, we can employ

Statistical Methods

1.1. Z-Score Method

Concept:

The Z-score tells you how many standard deviations a data point is from the mean.
A Z-score > 3 or < -3 is often considered an outlier in a normal distribution.

Formula:

$$Z = \frac{X - \mu}{\sigma}$$

Where:

→ X = data point

→ μ = mean of the data

→ σ = standard deviation

Use When:

Data is approximately normally distributed
You are working with univariate data

Python Example:

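from scipy import stats
import numpy as np

data = [12, 15, 14, 10, 18, 17, 14, 13, 100]  # 100 is likely an outlier

# Absolute Z-score of every point (population standard deviation, ddof=0)
z_scores = np.abs(stats.zscore(data))

# With only 9 points, |z| can never exceed sqrt(n - 1) ≈ 2.83, so the usual
# cutoff of 3 would flag nothing here; a lower cutoff is used for this tiny sample.
threshold = 2.5
outliers = np.where(z_scores > threshold)[0]

print("Z-scores:", z_scores)
print("Outlier indices:", outliers)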

Limitation:

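Unreliable when the data is not roughly normal
May miss outliers in skewed data
The mean and standard deviation are themselves inflated by outliers, so an extreme point can hide itself, especially in small samples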
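The sketch below illustrates this limitation on a tiny sample. With the population standard deviation, no single point can score above sqrt(n − 1), so in a 9-point sample the cutoff of 3 is out of reach. As one possible workaround (not part of the method described above, so treat it as an assumption), it also computes a modified Z-score based on the median and the median absolute deviation (MAD), using the commonly cited constant 0.6745 and cutoff 3.5.

```python
from math import sqrt
from statistics import mean, median, pstdev

data = [12, 15, 14, 10, 18, 17, 14, 13, 100]

# Classic Z-score: both the mean and the std are pulled toward the 100.
mu, sigma = mean(data), pstdev(data)
z_scores = [(x - mu) / sigma for x in data]
z_cap = sqrt(len(data) - 1)  # upper limit of |z| for any point when using population std
classic_outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]

# Modified Z-score (assumed workaround): median and MAD resist extreme values.
med = median(data)
mad = median([abs(x - med) for x in data])  # would be 0 if most values were identical
modified_z = [0.6745 * (x - med) / mad for x in data]
robust_outliers = [x for x, m in zip(data, modified_z) if abs(m) > 3.5]

print("Largest |z|:", round(max(abs(z) for z in z_scores), 2), "cap:", round(z_cap, 2))
print("Classic Z-score flags:", classic_outliers)
print("Modified Z-score flags:", robust_outliers)
```

On this sample the classic rule with a cutoff of 3 flags nothing, while the median-based version singles out the 100. If the MAD comes out as zero, this sketch would divide by zero, so real code needs a guard for that case.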

1.2. IQR (Interquartile Range) Method

Concept:

This method is based on the spread of the middle 50% of the data (i.e., between Q1 and Q3).
Any point below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is considered an outlier.

Formula:

$$\text{IQR} = Q3 - Q1$$

$$\text{Lower bound} = Q1 - 1.5 \times \text{IQR}$$

$$\text{Upper bound} = Q3 + 1.5 \times \text{IQR}$$

Use When:

The data is not normally distributed
You want a non-parametric approach

Python Example:

import pandas as pd

data = pd.Series([12, 15, 14, 10, 18, 17, 14, 13, 100])  # 100 is likely an outlier

Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = data[(data < lower_bound) | (data > upper_bound)]

print("IQR:", IQR)
print("Outliers:\n", outliers)

Visual Methods

Box plots: Whiskers extend to the most extreme points within 1.5×IQR of the quartiles; points beyond the whiskers are drawn individually as outliers.
Scatter plots: For multivariate data.
Histogram/density plots: To detect long tails or rare peaks.
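Here is a rough sketch of the box plot and histogram views, assuming matplotlib is installed. The off-screen `Agg` backend and the output filename are assumptions made so the snippet runs without a display; in a notebook you would usually just show the figure.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt

data = [12, 15, 14, 10, 18, 17, 14, 13, 100]

fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Box plot: whis=1.5 (the default) ends the whiskers at the last points
# that lie within 1.5 x IQR of the quartiles.
box = ax_box.boxplot(data, whis=1.5)
ax_box.set_title("Box plot")

# Histogram: a lone bar far from the rest hints at a long tail.
ax_hist.hist(data, bins=20)
ax_hist.set_title("Histogram")

# Points drawn beyond the whiskers ("fliers") are the visually flagged outliers.
flier_values = [float(v) for v in box["fliers"][0].get_ydata()]
print("Points beyond the whiskers:", flier_values)

fig.savefig("outlier_plots.png")  # hypothetical output filename
```

Reading the fliers back from the plot object is optional; it is done here only to show that the box plot applies the same 1.5×IQR rule as the IQR method.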

Model-Based Methods

Isolation Forest, One-Class SVM, DBSCAN: Useful for high-dimensional or unlabeled data.
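As a taste of the model-based route, here is a minimal Isolation Forest sketch, assuming scikit-learn is installed. The `contamination` value (the share of points you expect to be outliers) is a guess set to one point in nine for this toy sample, and `random_state` is fixed only to make the run repeatable. Real use would involve many more rows and usually several features.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# scikit-learn expects a 2-D array: one row per sample, one column per feature.
data = np.array([12, 15, 14, 10, 18, 17, 14, 13, 100], dtype=float).reshape(-1, 1)

# contamination is an assumed outlier share, not something the data tells us.
model = IsolationForest(contamination=1 / 9, random_state=0)
labels = model.fit_predict(data)  # -1 marks an outlier, 1 marks an inlier

flagged = data[labels == -1].ravel()
print("Flagged by Isolation Forest:", flagged)
```

The idea behind the model is that points which can be separated from the rest with very few random splits are probably unusual. The catch is that you have to tell it roughly how many outliers to expect.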

Conclusion

Outliers are like that one friend in every group project — totally unpredictable, possibly brilliant, maybe just confused.

Statistical methods like Z-score and IQR are the bouncers at the data club. If your data point shows up in a clown suit (3 standard deviations away from the mean), they’re getting flagged. But hey, sometimes that clown is the CEO in disguise — so don’t kick them out too fast!

Always ask:

Is this outlier a mistake?
Or is it the main character of your dataset?

So next time your graph looks like it had one too many coffees and spiked, don’t panic — just let the stats do the talking, and maybe… keep the weirdo. They might be onto something.

Outliers in Data Science

WrathWare
