KNN - K-Nearest Neighbors

Start date: 08/10/2025 - Doan Ngoc Cuong

  1. Introduction: Lazy Algorithm

KNN (a Lazy Learning Algorithm) is a type of Instance-Based Learning.

  • Instance-Based Learning = a learning approach where the algorithm stores the training data (instances) and uses them directly to make predictions.

KNN is called a lazy learner because when we supply training data to this algorithm, it does not build a model at all: it simply stores the instances and defers all computation until prediction time.
- Example: see the sketch below.

Link Reference: Why is KNN a lazy learner? - GeeksforGeeks
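
A minimal sketch of this lazy behavior, using scikit-learn's KNeighborsClassifier (the toy data is made up for illustration): fit() essentially just stores, and optionally indexes, the training data, while predict() is where the distance computation actually happens.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: 4 points, 2 classes.
X_train = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
y_train = np.array([0, 0, 1, 1])

# "Training" a lazy learner: no parameters are estimated;
# the instances are simply stored (and optionally indexed).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# The real work happens at prediction time: distances from the
# query point to the stored instances are computed, then the
# 3 nearest neighbors vote on the label.
print(knn.predict([[4, 5]]))  # -> [1]
```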

  2. Distance
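
KNN needs a distance metric to define which neighbors are "nearest". A minimal numpy sketch of the two most common choices, Euclidean and Manhattan distance (both special cases of the Minkowski distance):

```python
import numpy as np

def euclidean(a, b):
    # L2 norm / Minkowski with p=2: sqrt(sum of squared differences).
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    # L1 norm / Minkowski with p=1: sum of absolute differences.
    return float(np.sum(np.abs(a - b)))

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b))  # 5.0 (a 3-4-5 right triangle)
print(manhattan(a, b))  # 7.0
```

For reference, scikit-learn's KNeighborsClassifier defaults to the Minkowski metric with p=2, i.e. Euclidean distance.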

  3. Brute Force and KDTree
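
Brute force computes the distance from a query point to every stored training point (O(N·d) per query); a KD-tree partitions the feature space so that most points can be skipped, which speeds up queries in low dimensions. A minimal sketch of choosing either search strategy in scikit-learn, on a synthetic dataset used purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

# Exhaustive search: compares the query against every training point.
brute = KNeighborsClassifier(n_neighbors=5, algorithm="brute").fit(X, y)

# KD-tree: a space-partitioning index built once at fit time.
kdtree = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree").fit(X, y)

# Both strategies find the same neighbors; only the speed differs.
assert (brute.predict(X[:10]) == kdtree.predict(X[:10])).all()
```

In high-dimensional spaces KD-trees degrade toward brute-force performance, which is why scikit-learn's default algorithm="auto" picks the strategy based on the data.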

  4. Classification Report

Here is exactly how macro avg and weighted avg are calculated.

Suppose:

We have 2 classes:

  • Class A → 80 samples

  • Class B → 20 samples

Our model’s results:

| Class | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| A | 0.90 | 0.80 | 0.85 | 80 |
| B | 0.50 | 1.00 | 0.67 | 20 |

Macro average

Formula:

$$\text{Macro avg} = \frac{\text{Metric}_A + \text{Metric}_B}{2}$$

  • Precision (macro) = (0.90 + 0.50) / 2 = 0.70

  • Recall (macro) = (0.80 + 1.00) / 2 = 0.90

  • F1 (macro) = (0.85 + 0.67) / 2 = 0.76


Weighted average

Formula:

$$\text{Weighted avg} = \frac{\text{Metric}_A \cdot \text{Support}_A + \text{Metric}_B \cdot \text{Support}_B}{\text{Total support}}$$

Total support = 80 + 20 = 100

  • Precision (weighted) = (0.90×80 + 0.50×20) / 100 = (72 + 10) / 100 = 0.82

  • Recall (weighted) = (0.80×80 + 1.00×20) / 100 = (64 + 20) / 100 = 0.84

  • F1 (weighted) = (0.85×80 + 0.67×20) / 100 = (68 + 13.4) / 100 = 0.814


Key difference:

  • Macro avg = treats A and B equally (0.70 precision, 0.90 recall, 0.76 F1).

  • Weighted avg = dominated by class A, because it has 4× more samples (0.82 precision, 0.84 recall, 0.81 F1).
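
A quick numpy sketch to verify the arithmetic above (the per-class metrics are the hypothetical values from the table, not real model output):

```python
import numpy as np

# Per-class metrics from the table above (order: class A, class B).
precision = np.array([0.90, 0.50])
recall = np.array([0.80, 1.00])
f1 = np.array([0.85, 0.67])
support = np.array([80, 20])

# Macro average: unweighted mean over classes.
print(precision.mean(), recall.mean(), f1.mean())  # ≈ 0.70, 0.90, 0.76

# Weighted average: mean weighted by each class's share of the support.
weights = support / support.sum()
print(precision @ weights, recall @ weights, f1 @ weights)  # ≈ 0.82, 0.84, 0.814
```

In practice, sklearn.metrics.classification_report computes these same macro avg and weighted avg rows directly from y_true and y_pred.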


  5. How to Choose k

5.1 Basic Principles for Choosing k:

  • Use an odd k for binary classification (prevents tie votes).

  • A theoretical heuristic: k ≈ log(N), where N is the number of training samples, as a starting point.

5.2 Ways to Find the Optimal k

  1. Step 1: Cross Validation

    In k-fold cross validation, the model is trained on (k−1) folds and tested on the remaining fold. This is repeated k times, each time using a different fold for testing (note: this k is the number of folds, not the KNN parameter).

    Why? Cross-validation lets you compare different values of k fairly, using multiple train/test splits => reducing the risk of overfitting to a single split. See the sketch at the end of this section.

  2. Step 2: Error Curve

    An error curve (or accuracy curve) is a plot of model performance versus the parameter value.
    - If k is very low => high variance, possibly high training accuracy but poor generalization (overfitting).
    - If k is very high => low variance, high bias (underfitting).

    Bias and Variance:
    - What is Bias?
    - What is Variance?

    In a model there is a trade-off between bias and variance:
    + Bias:
    - High bias → the model fails to capture the underlying patterns → underfitting.
    - The model performs poorly on both training and test data.

    + Variance:
    - High variance → the model fits the training data too closely and fails on unseen data → overfitting.
    - The model performs well on training data but poorly on test data.

    Link: Bias and Variance in Machine Learning - GeeksforGeeks
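
A minimal sketch combining both steps, assuming scikit-learn and matplotlib on a toy dataset (all names and data here are illustrative): cross-validate KNN over a range of k values, then plot the error curve and pick the k that minimizes the cross-validated error.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Step 1: 5-fold cross validation for each candidate k.
ks = list(range(1, 31, 2))  # odd values only, to avoid tie votes
errors = []
for k in ks:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    errors.append(1 - scores.mean())  # mean cross-validated error

# Step 2: plot the error curve; the best k minimizes CV error.
best_k = ks[errors.index(min(errors))]
plt.plot(ks, errors, marker="o")
plt.xlabel("k (number of neighbors)")
plt.ylabel("cross-validated error")
plt.title(f"KNN error curve (best k = {best_k})")
plt.show()
```

On such a curve, very low k sits at the high-variance (overfitting) end and very high k at the high-bias (underfitting) end, matching the trade-off described above.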
