Outliers in Data Science

WrathWare

"If you call an outlier an ‘anomaly,’ it sounds cool and mysterious. If you call it a ‘mistake,’ you’re just admitting your data’s drunk."

Outliers are the mischievous frenemy in a dataset.

Classic Definition

Outliers are data points that significantly differ from the rest of your dataset. They may represent rare events, data entry errors, or something truly meaningful — like fraud, breakthrough experiments, or anomalies in the real world.

Understandable Definition

Suppose you live in Bangalore (you’ll find out why soon 😉) and drive to an office 10 km away every day. It normally takes you 1 hr to get there (the concept of distance and time is flawed in this city, beep… traffic), but one fine day you get lucky and make it in 25 mins (magic 🪄).
So, is your normal travel time the weekly average of ≈53 mins (four 1 hr days plus one 25 min day), or is it 1 hr?
A sane Bangalorean knows better than to trust his luck; he knows he will never see a 25 min commute again in his life. The 25 min travel time is an outlier.
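
To put numbers on the commute, here is a quick sketch. It assumes a five-day work week with four ordinary 60-minute drives and one lucky 25-minute one; the variable names are made up for illustration. The mean gets dragged down by the single lucky day, while the median stays at the usual hour.

```python
import statistics

# Assumed week: four normal 60-minute commutes and one lucky 25-minute run
commute_minutes = [60, 60, 60, 60, 25]

mean_time = statistics.mean(commute_minutes)      # pulled down by the lucky day
median_time = statistics.median(commute_minutes)  # unaffected by the single extreme value

print("Mean commute:", mean_time, "min")
print("Median commute:", median_time, "min")
```

The median matches what the Bangalorean already believes: the normal commute is an hour.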

Outliers in a dataset can occur due to:

  • Variability in the data (natural rare events) (For sure)

  • Measurement errors or data entry mistakes

  • Experimental errors

But why should I bang my head over outliers???

Well, outliers are frenemies. If not addressed properly, they can bring real trouble.
What do I mean? Well…

Outliers can:

  • Distort statistical summaries like the mean and standard deviation

  • Skew visualizations like box plots or histograms

  • Mislead machine learning models, especially those sensitive to distances or squared errors (e.g., k-NN, linear regression)
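
The first point is easy to check. Below is a small sketch using the same sample list that the detection examples further down use, treating 100 as the suspected outlier (that labelling is an assumption for the demo). It compares the summaries with and without that one value.

```python
import numpy as np

# Sample values; the last one (100) is assumed to be the outlier
data = np.array([12, 15, 14, 10, 18, 17, 14, 13, 100])
clean = data[data != 100]

# Compare mean, standard deviation, and median with and without the outlier
for label, values in [("with outlier", data), ("without outlier", clean)]:
    print(f"{label:>16}: mean={values.mean():.2f}, "
          f"std={values.std():.2f}, median={np.median(values):.1f}")
```

One value is enough to move the mean and blow up the standard deviation, while the median does not budge.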

But… outliers aren’t always bad!

  • In fraud detection, outliers are the point.

  • In healthcare, they can signal life-threatening conditions.

  • In stock markets, they may represent rare opportunities or risks.

Great, but how do I find my frenemies (outliers)?

Nice question. It’s actually easier to do with programming than in real life 🤧.

To find outliers, we can employ

  1. Statistical Methods

1.1. Z-Score Method

Concept:

The Z-score tells you how many standard deviations a data point is from the mean.
A point with a Z-score > 3 or < -3 is often considered an outlier when the data is roughly normally distributed.

Formula:

$$Z = \frac{X - \mu}{\sigma}$$

Where:

→ X = data point

→ μ = mean of the data

→ σ = standard deviation

Use When:

  • Data is approximately normally distributed

  • You are working with univariate data

Python Example:

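from scipy import stats
import numpy as np

data = [12, 15, 14, 10, 18, 17, 14, 13, 100]  # 100 is likely an outlier
z_scores = np.abs(stats.zscore(data))  # absolute Z-score of every point

# With only 9 points, |z| cannot exceed sqrt(9 - 1) ≈ 2.83, so the usual
# cutoff of 3 would flag nothing here; a lower cutoff of 2.5 is used instead.
threshold = 2.5
outliers = np.where(z_scores > threshold)[0]  # indices of the flagged points

print("Z-scores:", z_scores)
print("Outlier indices:", outliers)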

Limitation:

  • Sensitive to non-normal distributions

  • May miss outliers in skewed data
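
There is one more catch worth seeing in numbers: the outlier inflates the very mean and standard deviation it is measured against. Here is a minimal sketch that computes the Z-score by hand with NumPy on the sample list from the example above. With the population standard deviation (ddof=0, which is also the default in `scipy.stats.zscore`), no |z| in a sample of n points can exceed sqrt(n − 1), so with 9 values the usual cutoff of 3 is out of reach.

```python
import numpy as np

data = np.array([12, 15, 14, 10, 18, 17, 14, 13, 100])

mu = data.mean()
sigma = data.std()              # population standard deviation (ddof=0)
z_scores = (data - mu) / sigma  # Z = (X - mu) / sigma for every point

# With ddof=0, |z| can never exceed sqrt(n - 1) for a sample of n points
z_cap = np.sqrt(len(data) - 1)

print("Z-score of 100:", round(float(z_scores[-1]), 2))
print("Largest possible |z| for n = 9:", round(float(z_cap), 2))
print("Flagged at |z| > 3:", data[np.abs(z_scores) > 3])
```

So on tiny samples, either lower the cutoff or reach for a method that does not lean on the mean and standard deviation.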

1.2. IQR (Interquartile Range) Method

Concept:

This method is based on the spread of the middle 50% of the data (i.e., between Q1 and Q3).
Any point below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is considered an outlier.

Formula:

$$\text{IQR} = Q3 - Q1$$

$$\text{Lower bound} = Q1 - 1.5 \times \text{IQR}$$

$$\text{Upper bound} = Q3 + 1.5 \times \text{IQR}$$

Use When:

  • The data is not normally distributed

  • You want a non-parametric approach

Python Example:

import pandas as pd

data = pd.Series([12, 15, 14, 10, 18, 17, 14, 13, 100])  # 100 is likely an outlier

Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = data[(data < lower_bound) | (data > upper_bound)]

print("IQR:", IQR)
print("Outliers:\n", outliers)
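
As a rough check, here are the same numbers worked by hand for the series above. This assumes pandas' default linear interpolation for quantiles, which puts Q1 and Q3 on the 3rd and 7th of the 9 sorted values (10, 12, 13, 14, 14, 15, 17, 18, 100):

$$Q1 = 13, \quad Q3 = 17, \quad \text{IQR} = 17 - 13 = 4$$

$$\text{Lower bound} = 13 - 1.5 \times 4 = 7$$

$$\text{Upper bound} = 17 + 1.5 \times 4 = 23$$

Only 100 falls outside the range 7 to 23, so it is the single value flagged.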

Visual Methods

  • Box plots: Whiskers extend to the most extreme points within 1.5×IQR of the quartiles; points beyond them are plotted as outliers.

  • Scatter plots: For multivariate data.

  • Histogram/density plots: To detect long tails or rare peaks.
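
A minimal sketch of the first and last of these with Matplotlib, using the same sample list as before. The output file name is hypothetical, and the headless backend line is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to use an interactive window
import matplotlib.pyplot as plt

data = [12, 15, 14, 10, 18, 17, 14, 13, 100]

fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Box plot: points beyond the whiskers are drawn individually as "fliers"
box = ax_box.boxplot(data)
ax_box.set_title("Box plot")

# Histogram: the outlier shows up as a lonely bar far from the rest
ax_hist.hist(data, bins=20)
ax_hist.set_title("Histogram")

# The flier markers carry the values the box plot treats as outliers
fliers = [float(v) for v in box["fliers"][0].get_ydata()]
print("Points drawn as fliers:", fliers)

fig.tight_layout()
fig.savefig("outlier_plots.png")  # hypothetical output file name
```

The box plot uses the 1.5×IQR whisker rule by default, so it should agree with the IQR method from the previous section.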

Model-Based Methods

  • Isolation Forest, One-Class SVM, DBSCAN: Useful for high-dimensional or unlabeled data.
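
As a taste of the model-based route, here is a hedged sketch with scikit-learn's `IsolationForest` on the same toy list. The `contamination` value is a guess at the share of outliers, not something the data tells us, and a one-column toy dataset is far simpler than the high-dimensional data these models are meant for.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# scikit-learn expects a 2-D array: one row per sample, one column per feature
X = np.array([12, 15, 14, 10, 18, 17, 14, 13, 100]).reshape(-1, 1)

# contamination is the assumed share of outliers (a guess here);
# random_state makes the run repeatable
model = IsolationForest(contamination=0.1, random_state=42)
labels = model.fit_predict(X)  # -1 = outlier, 1 = inlier

flagged = X[labels == -1].ravel()
print("Flagged as outliers:", flagged)
```

Unlike Z-score and IQR, nothing here assumes a particular distribution; the model simply scores how easy each point is to isolate with random splits.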

Conclusion

Outliers are like that one friend in every group project — totally unpredictable, possibly brilliant, maybe just confused.

Statistical methods like Z-score and IQR are the bouncers at the data club. If your data point shows up in a clown suit (3 standard deviations away from the mean), they’re getting flagged. But hey, sometimes that clown is the CEO in disguise — so don’t kick them out too fast!

Always ask:

  • Is this outlier a mistake?

  • Or is it the main character of your dataset?

So next time your graph looks like it had one too many coffees and spiked, don’t panic — just let the stats do the talking, and maybe… keep the weirdo. They might be onto something.


✌️ If you liked it, kindly like, share, and subscribe to the newsletter. ❤️
