What Are Outliers, and How Do You Treat Them?

An outlier is a data point that differs significantly from other observations in a dataset.

Impact of Outliers

Outliers can have a dramatic effect on data analysis. They can:

  • Skew the data distribution and distort the mean and standard deviation.

  • Violate model assumptions such as normality.

  • Influence model estimates and accuracy; in regression models, for example, outliers can affect the slope of the regression line significantly.
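
To make the first point concrete, here is a minimal sketch using only Python's standard `statistics` module. The numbers are made up for illustration, and `summarize` is a hypothetical helper, not part of any library.

```python
import statistics

# Hypothetical measurements; the value 95 plays the role of the outlier.
clean = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11]
with_outlier = clean + [95]

def summarize(values):
    """Return the mean, median, and sample standard deviation of values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

before = summarize(clean)
after = summarize(with_outlier)

# Print each statistic without and with the outlier.
for name in ("mean", "median", "stdev"):
    print(f"{name:>6}: {before[name]:.2f} -> {after[name]:.2f}")
```

A single extreme point pulls the mean from 11.5 to roughly 19.1 and inflates the standard deviation many times over, while the median moves only from 11.5 to 12.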

Techniques to Handle Outliers

Handling outliers appropriately depends on the context and the cause of the outliers. Here are some common techniques:

1. Detection Methods

  • Statistical Rules:

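    • Z-score: The Z-score measures how many standard deviations a data point lies from the mean: z = (x - mean) / standard deviation. A rule of thumb is to label data points as outliers if the Z-score is above 3 or below -3. Because the mean and standard deviation are themselves pulled by extreme values, this rule works best on roughly normal data with a reasonable number of observations.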

    • IQR (Interquartile Range) Score: The IQR is the difference between the third quartile (Q3, the 75th percentile) and the first quartile (Q1, the 25th percentile), that is, IQR = Q3 - Q1. Data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR can be considered outliers.

  • Visualization:

    • Box Plots: These can visually show outliers as points outside the whiskers of the plot.

    • Scatter Plots: Useful in identifying outliers in the context of how data points are clustered or spread.
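
The two statistical rules above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not a definitive implementation: the function names, the sample data, and the choice of the population standard deviation and the "inclusive" quartile method are all assumptions, and other quartile conventions can give slightly different fences.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs((v - mean) / std) > threshold]

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample: 19 ordinary readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 11, 12, 10, 13, 12, 11, 12, 11, 95]

print("Z-score rule:", zscore_outliers(data))
print("IQR rule:    ", iqr_outliers(data))
```

On this sample both rules flag only the value 95. On small samples the Z-score rule can miss obvious outliers, because the outlier itself inflates the standard deviation it is measured against.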

2. Removal

  • Direct Removal: If you're confident that the outlier is due to incorrectly entered or measured data, removing the outlier might be appropriate. However, care must be taken not to arbitrarily remove data points, as they could be genuine extreme values.
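
One cautious way to apply this is to remove only values that are impossible in the domain, and to keep a record of what was dropped. The sketch below assumes a hypothetical column of ages with a valid range of 0 to 120; both the range and the helper name `split_by_validity` are illustrative assumptions.

```python
def split_by_validity(values, valid_min, valid_max):
    """Separate values inside the valid range from those outside it."""
    kept = [v for v in values if valid_min <= v <= valid_max]
    removed = [v for v in values if not (valid_min <= v <= valid_max)]
    return kept, removed

# Hypothetical ages: 290 and -4 look like entry errors, 97 is extreme but plausible.
ages = [23, 31, 27, 45, 38, 290, 52, -4, 97]
kept, removed = split_by_validity(ages, valid_min=0, valid_max=120)

print("kept:   ", kept)
print("removed:", removed)
```

The value 97 stays in the data because it is a plausible genuine extreme, while 290 and -4 are removed and remain available for review.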

3. Transformation

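  • Log Transformation: This compresses the scale of right-skewed data, pulling in high values more than low values. It is defined only for positive values, so data containing zeros usually needs a shifted variant such as log(1 + x).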

  • Square Root or Cube Root Transformation: Similar to log transformation, but less intense. The square root also accepts zeros, and the cube root accepts negative values.
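
As a rough illustration of how strongly each transformation pulls in large values, the sketch below compares the spread (maximum minus minimum) of a made-up, strictly positive sample before and after transforming it. The `spread` helper and the data are assumptions for this example only.

```python
import math

def spread(values):
    """Distance between the largest and smallest value."""
    return max(values) - min(values)

# Hypothetical right-skewed, strictly positive data.
values = [1, 10, 100, 1000, 10000]

log_values = [math.log(v) for v in values]       # natural log, needs v > 0
sqrt_values = [math.sqrt(v) for v in values]     # needs v >= 0
cbrt_values = [v ** (1 / 3) for v in values]     # this form assumes v >= 0

for name, transformed in [("raw", values), ("sqrt", sqrt_values),
                          ("cbrt", cbrt_values), ("log", log_values)]:
    print(f"{name:>4}: spread = {spread(transformed):.2f}")
```

On this sample the log transformation compresses the range the most, followed by the cube root and then the square root.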

4. Imputation

  • Replacing Outliers: If outliers are identified as errors, but removing them is not an option, replacing them with a reasonable value such as the mean, median, or mode of the non-outlier data can be an alternative.
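
A minimal sketch of median imputation follows. It assumes outliers are flagged with the 1.5 × IQR rule and uses the "inclusive" quartile method from Python's `statistics` module; the function name and the sample readings are hypothetical.

```python
import statistics

def impute_outliers_with_median(values, k=1.5):
    """Replace values outside the IQR fences with the median of the remaining values."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # The fill value is computed from the non-outlier data only.
    inliers = [v for v in values if lower <= v <= upper]
    fill = statistics.median(inliers)
    return [v if lower <= v <= upper else fill for v in values]

# Hypothetical sensor readings with one suspected entry error (250).
readings = [20, 22, 21, 23, 250, 22, 21]
imputed = impute_outliers_with_median(readings)
print(imputed)
```

The dataset keeps its original length and order, which matters when each row must stay aligned with other columns.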

5. Binning

  • Data Binning: Grouping values into ranges (bins) means an extreme value counts only as a member of its bin, for example an open-ended top bin, which also reduces the impact of minor observation errors. This is more common in data visualization than in statistical analysis.
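
The sketch below shows one way binning can absorb an extreme value: an open-ended top bin. The income figures, cut points, and labels are hypothetical, and `assign_bin` is an illustrative helper built on the standard `bisect` module.

```python
from bisect import bisect_right
from collections import Counter

def assign_bin(value, edges, labels):
    """Return the label of the bin containing value; each edge starts a new bin."""
    return labels[bisect_right(edges, value)]

# Hypothetical cut points; there is always one more label than there are edges.
edges = [30_000, 60_000, 100_000]
labels = ["under 30k", "30k to 60k", "60k to 100k", "100k and over"]

incomes = [25_000, 42_000, 58_000, 75_000, 120_000, 5_000_000]
binned = [assign_bin(v, edges, labels) for v in incomes]

print(Counter(binned))
```

The income of 5,000,000 lands in the same bin as 120,000, so it no longer dominates a chart or a summary by bin.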

6. Winsorizing

  • Limiting Extreme Values: In this method, outliers are not removed but are instead replaced with the nearest value that is not an outlier. For example, all values above the 95th percentile might be set to the value of the 95th percentile.
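
A minimal sketch of winsorizing at the 5th and 95th percentiles is shown below. It uses `statistics.quantiles` with the "inclusive" method, which interpolates between data points, so the caps may fall between observed values; the `winsorize` function and the sample data are assumptions for illustration.

```python
import statistics

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values to the given lower and upper percentiles."""
    # quantiles(..., n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

# Hypothetical data: the integers 1 to 19 plus one extreme value.
values = list(range(1, 20)) + [500]
winsorized = winsorize(values)

print("max before:", max(values))
print("max after: ", max(winsorized))
```

Every observation is kept, but the extreme value of 500 is pulled down to the 95th-percentile cap, and the smallest value is raised to the 5th-percentile cap.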

7. Using Robust Methods

  • Robust Statistical Measures: Using medians instead of means, or robust models like the Huber regressor, can reduce the influence of outliers.

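  • Machine Learning Models: Some models, such as Random Forests and other tree-based methods, are inherently more robust to outliers in the input features, because their splits depend on the ordering of values rather than on their magnitude.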
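
To give a sense of why a Huber-style model is less sensitive to outliers, the sketch below compares the squared loss with the Huber loss for a small and a large residual. It illustrates the loss function only, not a full regressor, and the threshold `delta=1.0` is an arbitrary choice for this example.

```python
def squared_loss(residual):
    """Ordinary least-squares loss: grows quadratically with the residual."""
    return 0.5 * residual ** 2

def huber_loss(residual, delta=1.0):
    """Quadratic for small residuals, linear once |residual| exceeds delta."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a ** 2
    return delta * (a - 0.5 * delta)

# A small residual (typical point) and a large one (outlier).
for r in (0.5, 10.0):
    print(f"residual {r:>4}: squared = {squared_loss(r):.3f}, huber = {huber_loss(r):.3f}")
```

For the small residual the two losses are identical, while for the residual of 10 the squared loss is 50 and the Huber loss is only 9.5, so an outlier contributes far less to the fit.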

Written by

Sai Prasanna Maharana