Navigating Outliers: Detection, Impact, and Strategies
Introduction:
Outliers, those mischievous data points that deviate from the norm, can significantly impact the accuracy and reliability of machine learning models. In this blog post, we delve into the world of outliers, exploring their nature, understanding their impact on various machine learning algorithms, and uncovering strategies to detect and handle them effectively.
Understanding Outliers:
Outliers are data points that exhibit behavior significantly different from the majority in a dataset. Their presence can distort the true representation of the data, making them a crucial consideration in data analysis and modeling.
When to Address Outliers:
While outliers can threaten the integrity of machine learning models, removing them is not a one-size-fits-all decision. Whether to remove or transform outliers depends on the specific problem statement, and careful consideration is essential: in anomaly detection use cases, for example, the outliers themselves are often the points of interest.
Impacted ML Algorithms:
Certain machine learning algorithms, particularly those that learn weights or coefficients from the data, are more susceptible to the influence of outliers. Linear regression, logistic regression, AdaBoost, and deep learning models are examples where outliers can skew results.
Strategies for Outlier Treatment:
Trimming:
Suitable for datasets with only a few outliers, this strategy removes the outliers entirely from the data, which can lead to loss of information.
Capping:
This is the most commonly used strategy because it avoids data loss: instead of dropping an outlier, we replace it with the lower or upper boundary value of the acceptable range.
Treat Like Missing Values:
This method is rarely used. It treats outliers as missing values and then applies any of the usual missing-value strategies (such as mean, median, or mode imputation) to them.
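As a rough sketch of this idea (the synthetic data and the 3-standard-deviation threshold below are illustrative choices, not prescribed by the post), outliers can be flagged as NaN and then imputed like any other missing value:

```python
import numpy as np

# Illustrative data: 200 roughly normal points plus one injected extreme value
rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, size=200), [120.0])

# Flag points more than 3 standard deviations from the mean as "missing"
mean, std = x.mean(), x.std()
x_nan = x.copy()
x_nan[np.abs((x - mean) / std) > 3] = np.nan

# Impute the flagged points with the median of the remaining values
x_imputed = np.where(np.isnan(x_nan), np.nanmedian(x_nan), x_nan)
```

From here, any missing-value strategy applies: median imputation as above, mean imputation, or a model-based imputer.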
Discretization:
This is another rarely used method. It converts a numerical feature into a categorical one by binning its values, so that extreme values simply fall into the highest or lowest bin rather than dominating the feature.
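A minimal illustration, assuming pandas and using made-up age values and bin edges: once the feature is binned, an extreme value lands in the same category as other large values.

```python
import pandas as pd

# Illustrative ages; 110 is an extreme value
ages = pd.Series([22, 25, 31, 45, 52, 110])

# Discretize into three bins; the open-ended last bin absorbs extremes
bins = [0, 30, 60, float("inf")]
labels = ["young", "middle", "senior"]
age_band = pd.cut(ages, bins=bins, labels=labels)

# The outlier 110 falls into the same "senior" bucket as 52,
# so its extreme magnitude no longer matters to downstream models.
```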
Detecting and Handling Outliers for Different Distributions:
For Normal Distribution:
Detection: Use the Z-score. If the Z-score lies between -3 and 3, the point is not considered an outlier. Since a normal distribution follows the bell curve, about 99.7% of data points fall within 3 standard deviations of the mean.
Handling: Options include trimming or capping. Capping is usually preferred because it keeps all data within the 3-standard-deviation range: a point with a Z-score above 3 is replaced by the value at mean + 3 standard deviations, and a point with a Z-score below -3 by the value at mean - 3 standard deviations. Trimming instead removes the outliers from the dataset entirely.
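The trimming and capping steps above can be sketched as follows (synthetic data with two injected outliers; the numbers are illustrative):

```python
import numpy as np

# Roughly normal data plus two injected outliers
rng = np.random.default_rng(42)
x = np.append(rng.normal(100, 10, size=500), [180.0, 20.0])

mean, std = x.mean(), x.std()
lower, upper = mean - 3 * std, mean + 3 * std

# Trimming: drop points outside the +/- 3 standard deviation range
trimmed = x[(x >= lower) & (x <= upper)]

# Capping: clip points to the +/- 3 standard deviation boundaries
capped = np.clip(x, lower, upper)
```

Note that `mean` and `std` are computed on data that still contains the outliers, which inflates them somewhat; a robust variant would estimate them after an initial pass.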
For Skewed Distribution:
Detection: Employ a box plot to identify points beyond the lower and upper ranges.
Handling: Choose between trimming and capping. Capping computes minimum and maximum acceptable values from the quartiles and the interquartile range: the minimum is Q1 - 1.5 x IQR and the maximum is Q3 + 1.5 x IQR, where Q1 is the first quartile (the 25th percentile), Q3 is the third quartile (the 75th percentile), and IQR is the difference between Q3 and Q1.
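A short sketch of the IQR rule on made-up skewed data (the values are illustrative):

```python
import numpy as np

# Illustrative right-skewed data; 40 is an extreme value
x = np.array([3, 4, 5, 5, 6, 7, 8, 9, 10, 12, 40], dtype=float)

# Quartiles and interquartile range
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Detection: points beyond the whiskers of a box plot
outliers = x[(x < lower) | (x > upper)]

# Capping: clip to the Q1 - 1.5*IQR and Q3 + 1.5*IQR boundaries
capped = np.clip(x, lower, upper)
```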
For Other Distributions:
Detection: Utilize a box plot. If outliers are present, they will appear beyond the whiskers of the plot.
Handling: Trimming is always possible but leads to data loss. The alternative is capping, also known as winsorization. Here we choose a percentage, say 1%, and treat all data points above the 99th percentile or below the 1st percentile as outliers. Points above the 99th percentile are assigned the 99th-percentile value, and points below the 1st percentile are assigned the 1st-percentile value.
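A sketch of winsorization at the 1st and 99th percentiles on synthetic skewed data (`scipy.stats.mstats.winsorize` offers a ready-made version of the same idea):

```python
import numpy as np

# Illustrative heavy-tailed data
rng = np.random.default_rng(7)
x = rng.exponential(scale=10, size=1000)

# Cap everything outside the 1st and 99th percentiles
p1, p99 = np.percentile(x, [1, 99])
winsorized = np.clip(x, p1, p99)
```

The choice of 1% is itself a judgment call; a heavier-tailed dataset might warrant 5%, while a clean one might need no capping at all.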
Conclusion:
Outliers demand attention in the realm of data analysis and machine learning. By understanding their impact, recognizing when to address them, and employing effective detection and treatment strategies, we can ensure our models are robust and provide accurate insights.
Written by
Saurabh Naik