Day 12: K-Nearest Neighbors (KNN) – Learning by Proximity


Ever heard of the saying that a person is known by the company they keep? For example, if you tend to hang around scholars in school or college, you are automatically perceived as one.

Similarly, in KNN (short for K-Nearest Neighbours) we decide the category of a new point based on its neighbouring points. We will get to that in detail, so stay tuned.

KNN is a lazy learner because it doesn’t “learn” a model like decision trees or random forests.
Instead, when it gets a new data point, it:

  1. Looks at the K closest points in the training data, based on distance.

  2. Checks which category (label) they belong to.

  3. Votes and predicts the majority class.
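Those three steps fit in a few lines of NumPy. Here is a minimal from-scratch sketch (the knn_predict helper and the toy points are made up purely for illustration; later in this post we will use scikit-learn's KNeighborsClassifier, which does the same job far more efficiently):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Distance from the new point to every training point (Euclidean)
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2. Labels of the K closest training points
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # 3. Majority vote decides the predicted class
    return Counter(nearest_labels).most_common(1)[0][0]

# Tiny made-up dataset: two features, two classes
X_train = np.array([[1, 1], [2, 1], [1, 2], [6, 5], [7, 6], [6, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([2, 2])))  # 0 (all 3 nearest neighbours are class 0)
print(knn_predict(X_train, y_train, np.array([6, 6])))  # 1 (all 3 nearest neighbours are class 1)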


Real Life Analogy

Let's say we ask some of our neighbours for advice. You are new to a neighbourhood and you wonder whether the pizza place across the street is worth trying, whether it is good or bad, something like that.

Suppose, you ask your 3 nearest neighbors (K=3).

  • 2 say “Yes, it's delicious”

  • 1 says “No, it tastes like crap”

So, as the majority says yes, you go and try the place. That is how KNN makes its predictions.


Use KNN when:

  • the dataset is small to medium sized, so computing distances to every point stays cheap

  • the decision boundary is irregular and hard to capture with a simple parametric model

  • you want a quick, intuitive baseline that needs no real training phase

Step by Step Explanation

  • Choose K (number of neighbors to look at)

  • Measure Distance (usually Euclidean Distance)

  • Sort the training points based on distance to the new data point

  • Pick K nearest neighbors

  • Majority vote decides the class

P.S. Here is the formula for the Euclidean distance between two points (x₁, y₁) and (x₂, y₂) in a 2D plane:

d = √( (x₂ − x₁)² + (y₂ − y₁)² )

Hope that takes you back to nostalgic school days of learning mathematics. ;)
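If you want to sanity-check that formula in code, here is a quick comparison against NumPy's built-in norm (the two points are made up):

import numpy as np

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

# Straight from the formula: sqrt((x2 - x1)^2 + (y2 - y1)^2)
manual = np.sqrt((p2[0] - p1[0]) ** 2 + (p2[1] - p1[1]) ** 2)

# NumPy's built-in Euclidean norm gives the same result
built_in = np.linalg.norm(p2 - p1)

print(manual, built_in)  # 5.0 5.0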

Also, take proper care while choosing the value of K. Too small a K makes the model sensitive to noise; too large a K may include irrelevant neighbours. The best approach is to try several values of K and use cross-validation to pick the one that performs best.
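Here is a minimal sketch of that idea using scikit-learn's cross_val_score. The dataset is synthetic (generated with make_classification purely for illustration), and the candidate K values are just an example grid:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up dataset purely for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

# Try several K values and keep the one with the best cross-validated accuracy
for k in [1, 3, 5, 7, 9, 11]:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"K={k}: mean CV accuracy = {scores.mean():.3f}")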


Time to Code

Alright, it's time to code. This will help us understand the concept of KNN in depth.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Let's simulate a simple dataset
data = {
    'Age': [18, 22, 25, 28, 30, 35, 40, 50, 60, 65],
    'Salary': [15000, 20000, 25000, 27000, 30000, 40000, 50000, 60000, 80000, 85000],
    'Bought_Product': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
}

df = pd.DataFrame(data)
df.head()

| | Age | Salary | Bought_Product |
| --- | --- | --- | --- |
| 0 | 18 | 15000 | 0 |
| 1 | 22 | 20000 | 0 |
| 2 | 25 | 25000 | 0 |
| 3 | 28 | 27000 | 0 |
| 4 | 30 | 30000 | 0 |

# Features and label
X = df[['Age', 'Salary']]
y = df['Bought_Product']

# Train-test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling is very important for KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)  # You can try K=1, 5, etc.
knn.fit(X_train_scaled, y_train)
# Predictions
y_pred = knn.predict(X_test_scaled)

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, cmap='Greens')
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

# Predict for a new customer: Age 33, Salary 37000
# Wrap it in a DataFrame with the same column names so the scaler sees matching feature names
new_customer = scaler.transform(pd.DataFrame([[33, 37000]], columns=['Age', 'Salary']))
prediction = knn.predict(new_customer)
print("Prediction (1=Will Buy, 0=Will Not Buy):", prediction[0])

# OUTPUT
# Prediction (1=Will Buy, 0=Will Not Buy): 0

Let's try plotting some decision boundaries.

A decision boundary is an imaginary line (or curve) that a machine learning model draws to separate different classes in the feature space.

In 2D, it looks like a line or curve on a plot.
In 3D, it’s a surface.
In higher dimensions, we can’t visualize it, but it still exists mathematically.

Imagine a school wants to classify students into “Need Support” and “Don’t Need Support” based on:

  • Exam Marks

  • Attendance (%)

If they draw a line that says:

“If marks < 40 and attendance < 60%, they need support”

That line is their decision boundary.
Students on one side = need help.
Students on the other side = doing fine.

We can plot a decision boundary for our code example too.

# Create mesh grid
h = .01
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict over the grid
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
plt.figure(figsize=(8,6))
plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu, alpha=0.6)
plt.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, edgecolor='k', cmap=plt.cm.RdYlBu)
plt.title("KNN Decision Boundary")
plt.xlabel("Age (scaled)")
plt.ylabel("Salary (scaled)")
plt.show()

This is a 2D visualization of a KNN model's classification decision on a dataset with two features:

  • X-axis → Age (scaled)

  • Y-axis → Salary (scaled)

And the KNN model is used to predict binary class labels (e.g., Class 0 and Class 1).

  • 🔴 Red region → All points here are classified as Class 0 by the model.

  • 🔵 Blue region → All points here are classified as Class 1 by the model.

So, if you give a new input point (age and salary), and it falls in:

  • Red zone → It’ll be classified as Class 0

  • Blue zone → It’ll be classified as Class 1

The ⚫ small circled dots are your actual data points from the training dataset. Based on them, the KNN model classifies every location on the grid and colors the background accordingly. The slanted line between the red and blue regions is the decision boundary. It represents the exact points where the model is uncertain; in other words, where it is a tie between Class 0 and Class 1 based on the nearest neighbors.
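To tie this back to our earlier prediction, we can drop the new customer (Age 33, Salary 37000) onto the same plot and see which region it lands in. This snippet is a continuation, not standalone code: it reuses scaler, the grid predictions (xx, yy, Z), X_train_scaled and y_train from the blocks above:

# Overlay the new customer from the prediction step on the decision boundary
new_point = scaler.transform(pd.DataFrame([[33, 37000]], columns=['Age', 'Salary']))

plt.figure(figsize=(8,6))
plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu, alpha=0.6)
plt.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, edgecolor='k', cmap=plt.cm.RdYlBu)
plt.scatter(new_point[:, 0], new_point[:, 1], c='black', marker='X', s=200, label='New customer (33, 37000)')
plt.legend()
plt.title("KNN Decision Boundary with New Customer")
plt.xlabel("Age (scaled)")
plt.ylabel("Salary (scaled)")
plt.show()

Whichever colored region the X lands in is the class KNN will predict for that customer, matching the output we got from knn.predict earlier.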


Key points

  • Lazy learner, it just remembers the data → There is no training phase; predictions are computed from the stored data at query time.

  • Uses distance to find similar points → Uses metrics like Euclidean distance to find nearest neighbors.

  • Great for small-to-medium datasets

  • Needs feature scaling (since it uses distance) → Features on very different scales distort distances; see the quick demo after this list.

  • Choosing the right K value is very important → A lower K can cause overfitting (too sensitive), while a higher K might underfit (too generalized).
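To see why the feature-scaling point matters so much, compare distances with and without scaling on our toy data. Salary is in the tens of thousands while Age is in the tens, so without scaling Salary completely dominates the Euclidean distance (the three customers below are made up for illustration, and the printed values are approximate):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Same Age/Salary values as the dataset above
X = np.array([[18, 15000], [22, 20000], [25, 25000], [28, 27000], [30, 30000],
              [35, 40000], [40, 50000], [50, 60000], [60, 80000], [65, 85000]])

a = np.array([25, 25000])   # reference customer
b = np.array([26, 27000])   # almost the same age, salary 2000 apart
c = np.array([60, 25500])   # 35 years older, salary only 500 apart

# Without scaling, Salary dominates: c looks ~4x closer to a than b does
print(np.linalg.norm(a - b), np.linalg.norm(a - c))   # ~2000.0 vs ~501.2

# After standardizing both features, b is (correctly) the far closer neighbour
scaler = StandardScaler().fit(X)
a_s, b_s, c_s = scaler.transform(np.array([a, b, c]))
print(np.linalg.norm(a_s - b_s), np.linalg.norm(a_s - c_s))  # ~0.11 vs ~2.28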

Next, we move on to a more sophisticated boundary-drawing algorithm, but let's keep that for Day 13.

Until then Ciao!
