Navigating Outliers: Detection, Impact, and Strategies
Introduction:
Outliers, those mischievous data points that deviate from the norm, can significantly impact the accuracy and reliability of machine learning models. In this blog post, we delve into the world of outliers, exploring their nature, understanding their impact on various machine learning algorithms, and uncovering strategies to detect and handle them effectively.
Understanding Outliers:
Outliers are data points that exhibit behavior significantly different from the majority in a dataset. Their presence can distort the true representation of the data, making them a crucial consideration in data analysis and modeling.
When to Address Outliers:
While outliers can threaten the integrity of machine learning models, removing them is not a one-size-fits-all decision. Whether to remove or transform outliers depends on the specific problem statement, and careful consideration is essential: in anomaly detection use cases, for example, the outliers themselves are often the points of interest.
Impacted ML Algorithms:
Certain machine learning algorithms, particularly those that learn weights or coefficients from the data, are more susceptible to the influence of outliers. Linear regression, logistic regression, AdaBoost, and deep learning models are examples where outliers can skew results.
Strategies for Outlier Treatment:
Trimming:
Suitable for datasets with only a few outliers, this strategy removes the outliers entirely from the data, which can lead to loss of information.
Capping:
This is the most commonly used strategy because it avoids data loss: instead of dropping an outlier, we replace it with the lower or upper boundary value of the acceptable range.
Treat Like Missing Values:
This method is rarely used. It treats outliers as missing values and then applies any of the usual missing-value strategies (such as mean, median, or mode imputation) to them.
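As a rough sketch of this idea (the synthetic data and the 3-standard-deviation threshold below are illustrative choices, not prescribed by the post), outliers can be flagged as NaN and then imputed like any other missing value:

```python
import numpy as np

# Illustrative data: 200 roughly normal points plus one injected extreme value
rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, size=200), [120.0])

# Flag points more than 3 standard deviations from the mean as "missing"
mean, std = x.mean(), x.std()
x_nan = x.copy()
x_nan[np.abs((x - mean) / std) > 3] = np.nan

# Impute the flagged points with the median of the remaining values
x_imputed = np.where(np.isnan(x_nan), np.nanmedian(x_nan), x_nan)
```

From here, any missing-value strategy applies: median imputation as above, mean imputation, or a model-based imputer.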
Discretization:
This is another rarely used method. It converts a numerical feature into a categorical one by binning its values, so that extreme values simply fall into the highest or lowest bin rather than dominating the feature.
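A minimal illustration, assuming pandas and using made-up age values and bin edges: once the feature is binned, an extreme value lands in the same category as other large values.

```python
import pandas as pd

# Illustrative ages; 110 is an extreme value
ages = pd.Series([22, 25, 31, 45, 52, 110])

# Discretize into three bins; the open-ended last bin absorbs extremes
bins = [0, 30, 60, float("inf")]
labels = ["young", "middle", "senior"]
age_band = pd.cut(ages, bins=bins, labels=labels)

# The outlier 110 falls into the same "senior" bucket as 52,
# so its extreme magnitude no longer matters to downstream models.
```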
Detecting and Handling Outliers for Different Distributions:
For Normal Distribution:
Detection: Use the Z-score. If the Z-score lies between -3 and 3, the point is not considered an outlier. Since a normal distribution follows the bell curve, about 99.7% of data points fall within 3 standard deviations of the mean.
Handling: Options include trimming or capping. Capping is usually preferred because it keeps all data within the 3-standard-deviation range: a point with a Z-score above 3 is replaced by the value at mean + 3 standard deviations, and a point with a Z-score below -3 by the value at mean - 3 standard deviations. Trimming instead removes the outliers from the dataset entirely.
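The trimming and capping steps above can be sketched as follows (synthetic data with two injected outliers; the numbers are illustrative):

```python
import numpy as np

# Roughly normal data plus two injected outliers
rng = np.random.default_rng(42)
x = np.append(rng.normal(100, 10, size=500), [180.0, 20.0])

mean, std = x.mean(), x.std()
lower, upper = mean - 3 * std, mean + 3 * std

# Trimming: drop points outside the +/- 3 standard deviation range
trimmed = x[(x >= lower) & (x <= upper)]

# Capping: clip points to the +/- 3 standard deviation boundaries
capped = np.clip(x, lower, upper)
```

Note that `mean` and `std` are computed on data that still contains the outliers, which inflates them somewhat; a robust variant would estimate them after an initial pass.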
For Skewed Distribution:
Detection: Employ a box plot to identify points beyond the lower and upper ranges.
Handling: Choose between trimming and capping. Capping computes minimum and maximum acceptable values from the quartiles and the interquartile range: the minimum is Q1 - 1.5 x IQR and the maximum is Q3 + 1.5 x IQR, where Q1 is the first quartile (the 25th percentile), Q3 is the third quartile (the 75th percentile), and IQR is the difference between Q3 and Q1.
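A short sketch of the IQR rule on made-up skewed data (the values are illustrative):

```python
import numpy as np

# Illustrative right-skewed data; 40 is an extreme value
x = np.array([3, 4, 5, 5, 6, 7, 8, 9, 10, 12, 40], dtype=float)

# Quartiles and interquartile range
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Detection: points beyond the whiskers of a box plot
outliers = x[(x < lower) | (x > upper)]

# Capping: clip to the Q1 - 1.5*IQR and Q3 + 1.5*IQR boundaries
capped = np.clip(x, lower, upper)
```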
For Other Distributions:
Detection: Utilize a box plot. If outliers are present, they will appear beyond the whiskers of the plot.
Handling: Trimming is always possible but leads to data loss. The alternative is capping, also known as winsorization. Here we choose a percentage, say 1%, and treat all data points above the 99th percentile or below the 1st percentile as outliers. Points above the 99th percentile are assigned the 99th-percentile value, and points below the 1st percentile are assigned the 1st-percentile value.
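A sketch of winsorization at the 1st and 99th percentiles on synthetic skewed data (`scipy.stats.mstats.winsorize` offers a ready-made version of the same idea):

```python
import numpy as np

# Illustrative heavy-tailed data
rng = np.random.default_rng(7)
x = rng.exponential(scale=10, size=1000)

# Cap everything outside the 1st and 99th percentiles
p1, p99 = np.percentile(x, [1, 99])
winsorized = np.clip(x, p1, p99)
```

The choice of 1% is itself a judgment call; a heavier-tailed dataset might warrant 5%, while a clean one might need no capping at all.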
Conclusion:
Outliers demand attention in the realm of data analysis and machine learning. By understanding their impact, recognizing when to address them, and employing effective detection and treatment strategies, we can ensure our models are robust and provide accurate insights.
Written by
Saurabh Naik