Week 7: 3. K-Nearest Neighbour

K-Nearest Neighbour (KNN) is an instance-based learning algorithm: rather than building an explicit model, it uses the stored training data directly to classify a new input or to predict a value for it.
How does KNN work?
The KNN algorithm can be summarised in the following steps:
Step 1: Select the number K of neighbors.
Step 2: Calculate the distance (typically the Euclidean distance) between the new data point and every point in the training data.
Step 3: Take the K nearest neighbors, i.e. the K training points with the smallest distances.
Step 4: Among these K neighbors, count the number of data points in each category.
Step 5: Assign the new data point to the category with the largest count among its neighbors.
Our model is ready.
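To make these steps concrete, here is a minimal from-scratch sketch in Python/NumPy; the function name knn_predict and the toy data are purely illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new from its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: majority vote among the neighbors' labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy example: two features, two classes (0 and 1)
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.9]), k=3))  # prints 0
```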
KNN in Classification: Predicting Diabetes Status
Imagine we want to predict if someone has diabetes using things like their glucose level, BMI, blood pressure, and age. The K-Nearest Neighbors method looks at people in the dataset who have similar health measurements. It finds the closest few neighbors to the person and sees how many of them have diabetes. Then, it predicts the person’s diabetes status based on what most of those neighbors have. So basically, it groups people with similar health profiles and makes predictions accordingly.
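As a hedged sketch of how this could look in code, here is KNN classification with scikit-learn's KNeighborsClassifier; the synthetic data below merely stands in for a real diabetes dataset with features such as glucose, BMI, blood pressure and age.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for health measurements (glucose, BMI, blood pressure, age)
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = KNeighborsClassifier(n_neighbors=5)  # look at the 5 most similar people
clf.fit(X_train, y_train)                  # "training" just stores the data
print("Test accuracy:", clf.score(X_test, y_test))
```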
KNN in Regression: Forecasting House Prices
Now, let’s look at a regression problem where we want to predict house prices based on features like size, number of bedrooms, location, and amenities. With K-Nearest Neighbors regression, instead of voting on a category, the algorithm finds the most similar houses in the dataset and predicts a house's price by averaging the prices of those nearest neighbors. So, KNN finds houses like yours and uses their prices to estimate what your house might sell for.
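The regression version looks almost identical in code; this sketch uses scikit-learn's KNeighborsRegressor on synthetic data in place of a real housing dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for house features (size, bedrooms, location score, amenities)
X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = KNeighborsRegressor(n_neighbors=5)   # predicted price = average of the 5 nearest houses
reg.fit(X_train, y_train)
print("R^2 on test set:", reg.score(X_test, y_test))
```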
Distance Metric
Central to the functioning of KNN is the notion of distance metrics, which determine the similarity between instances in the feature space. Commonly used distance metrics include:
1. Euclidean Distance
The Euclidean distance, derived from the Pythagorean theorem, measures the straight-line distance between two points in Euclidean space. It is the most widely used distance metric in KNN and is suitable for continuous features.
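In symbols, for two points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ the Euclidean distance is:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$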
2. Manhattan Distance
Also known as the city block distance or L1 norm, the Manhattan distance calculates the sum of absolute differences between the coordinates of two points. It is particularly useful when dealing with high-dimensional data or features with different scales.
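For the same two points, the Manhattan distance is:

$$d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$$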
3. Hamming Distance
The Hamming distance formula calculates the number of positions at which the corresponding symbols between two strings p and q are different. It is commonly used in situations where the data consists of categorical variables, such as binary features or sequences of symbols.
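For two strings $p$ and $q$ of equal length $n$, the Hamming distance can be written as:

$$d(p, q) = \sum_{i=1}^{n} \mathbf{1}[p_i \neq q_i],$$

where the indicator term is 1 when the symbols at position $i$ differ and 0 otherwise.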
Choosing the appropriate distance metric is crucial for the performance of the KNN algorithm. Factors such as the nature of the data, feature scales, and dimensionality should be considered when selecting the distance metric.
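As a quick sanity check, all three metrics can be computed directly with NumPy; the vectors and strings below are arbitrary examples.

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(((p - q) ** 2).sum())  # straight-line distance: 5.0
manhattan = np.abs(p - q).sum()            # sum of absolute differences: 7.0
print(euclidean, manhattan)

# Hamming distance counts mismatched positions between two equal-length strings
a = np.array(list("karolin"))
b = np.array(list("kathrin"))
print((a != b).sum())                      # 3 positions differ
```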
Selecting the Optimal Value of K
Another critical aspect of KNN is choosing the optimal value of K, the number of nearest neighbors considered during prediction. The choice of K significantly impacts the model's performance, striking a balance between bias and variance.
Small K Values: Choosing a small value of K (e.g., K=1 or K=3) leads to low bias but high variance. The model tends to overfit the training data, capturing noise and outliers.
Large K Values: Conversely, selecting a large value of K (e.g., K=10 or K=20) results in high bias but low variance. The model becomes too simplistic, potentially missing important patterns in the data.
Strategies for Choosing K
Odd K Values: When choosing K, it is advisable to select odd values, particularly in binary classification tasks. This prevents ties in voting and ensures a clear majority when determining the class label.
Cross-Validation: Employ techniques like k-fold cross-validation to evaluate the model's performance for different values of K and choose the one with the optimal trade-off between bias and variance.
Grid Search: Conduct a grid search over a range of K values to identify the value that yields the best performance metrics (e.g., accuracy, precision, recall); a sketch combining grid search with cross-validation follows this list.
Domain Knowledge: Leverage domain knowledge and insights about the problem domain to select a reasonable range of K values. For instance, consider the underlying patterns in the data and the expected complexity of the relationships.
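Here is a hedged sketch of combining cross-validation and grid search with scikit-learn's GridSearchCV; the candidate range of odd K values from 1 to 29 and the synthetic data are just illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=42)

# Try odd values of K from 1 to 29, scoring each with 5-fold cross-validation
param_grid = {"n_neighbors": list(range(1, 30, 2))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best K:", search.best_params_["n_neighbors"])
print("Cross-validated accuracy:", search.best_score_)
```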
Pros of KNN
1. Simplicity
One of the most significant advantages of KNN is its simplicity. The algorithm is easy to understand and implement, making it an ideal choice for beginners in machine learning. Its intuitive nature makes it accessible even to those with a limited technical background.
2. No Assumptions About Data
Unlike parametric models such as linear regression or logistic regression, KNN does not make any assumptions about the underlying distribution of the data. It is a non-parametric algorithm, making it highly flexible and adaptable to various types of data distributions.
3. Adaptability to Non-Linear Data
KNN is well-suited for handling non-linear relationships in the data. It can capture complex patterns and decision boundaries without the need for explicit feature engineering or transformation. This makes it particularly useful in scenarios where the relationship between features and target variables is non-linear.
4. Efficiency in Training
KNN is a lazy learning algorithm, meaning it does not build an explicit model during a training phase. Instead, it simply stores the training instances and defers all computation until a prediction is requested. As a result, training is essentially instantaneous, although the full training set must be kept available in memory for use at prediction time.
5. Interpretability
KNN provides transparent and interpretable results. The predictions are based on the actual instances in the dataset, making it easy to understand the reasoning behind each prediction. This interpretability is valuable in applications where model transparency and explainability are critical.
Cons of KNN
1. Computational Complexity
Despite its cheap training phase, KNN can be computationally expensive during prediction, especially for large datasets or high-dimensional feature spaces. For every query, the algorithm must compute the distance to every stored training instance before it can pick the nearest neighbors, so prediction cost grows with the size of the training set.
2. Sensitive to Noise and Outliers
KNN is sensitive to noisy data and outliers, which can significantly affect the quality of predictions. Outliers can distort the decision boundaries and lead to erroneous classifications. Preprocessing steps such as outlier detection and feature scaling are often necessary to mitigate these issues.
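A common mitigation for the scaling part is to standardise features before computing distances; below is a minimal sketch with scikit-learn's Pipeline utilities and StandardScaler (outlier handling would be an extra, dataset-specific step not shown here).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=4, random_state=42)

# Scale each feature to zero mean and unit variance so that no single
# feature dominates the distance calculation
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```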
3. Need for Optimal K Value
The choice of the optimal value of K, the number of neighbors considered during prediction, is crucial for the performance of the KNN algorithm. Selecting an inappropriate value of K can lead to overfitting or underfitting, impacting the model's accuracy and generalization ability. Determining the optimal K value often requires experimentation and cross-validation.
4. Imbalanced Data
KNN performs poorly on imbalanced datasets, where one class significantly outnumbers the others. In such scenarios, the majority class tends to dominate the predictions, leading to biased results. Techniques such as resampling or using weighted voting schemes are necessary to address class imbalance and improve the model's performance.
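As one example of weighted voting, scikit-learn's KNeighborsClassifier accepts weights="distance", which gives closer neighbors more influence than distant ones; resampling approaches (e.g. SMOTE from the imbalanced-learn package) are a separate option not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=500, n_features=4, weights=[0.9, 0.1], random_state=42)

# weights="distance" lets nearby minority-class points outvote more distant
# majority-class points instead of using a plain head count
clf = KNeighborsClassifier(n_neighbors=7, weights="distance")
clf.fit(X, y)
print(clf.predict(X[:5]))
```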