Mastering Machine Learning: A Beginner's Guide

Manav Rastogi

What is Machine Learning (ML)?

Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables computers to learn from data and make decisions without being explicitly programmed. It helps in automating tasks, recognizing patterns, and making predictions.

Example:

Gmail's spam filter automatically classifies emails as spam or important based on past email interactions.

ML is gaining popularity due to several advancements:

  • Advanced Processors: Powerful computing hardware enables complex models to run efficiently.

  • Data is the Fuel: With the explosion of data, ML models can extract insights that were previously impossible.

Difference Between AI, ML, DL, and DS

Term | Description | Example
Artificial Intelligence (AI) | Broad field aiming to make machines intelligent | Self-driving cars 🚗
Machine Learning (ML) | Subset of AI focused on learning from data | Fraud detection 🔍
Deep Learning (DL) | Subset of ML using neural networks | Face recognition 🤳
Data Science (DS) | Encompasses AI/ML/DL along with data analytics | Business intelligence 📊

Types of Machine Learning

1. Supervised Learning

In supervised learning, models are trained on labeled data.

  • Regression: Predicting continuous values (e.g., House price prediction 🏡)

  • Classification: Categorizing data (e.g., Spam detection in emails 📧)
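As a rough illustration of regression on labeled data, the sketch below fits a straight line to a handful of made-up (house size, price) pairs using ordinary least squares. The data and the `fit_line` helper are hypothetical and only meant to show the idea of learning from labeled examples.

```python
# Hypothetical labeled data: house size (square metres) -> price (thousands).
sizes = [50.0, 80.0, 100.0, 120.0]
prices = [150.0, 240.0, 300.0, 360.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sizes, prices)

# Regression output is a continuous value: the predicted price of a 90 m^2 house.
predicted_price = slope * 90.0 + intercept
print(round(predicted_price, 2))
```

A classifier is trained the same way, from labeled examples, but it outputs a category such as "spam" or "not spam" instead of a number.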

2. Unsupervised Learning

Models find hidden patterns in unlabeled data.

  • Clustering: Grouping similar data points (e.g., Customer segmentation)

  • Dimensionality Reduction: Reducing the number of features while keeping most of the information (e.g., PCA used in image compression 🖼️)

3. Semi-Supervised Learning

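A combination of supervised and unsupervised learning: the model is trained on a small amount of labeled data together with a large amount of unlabeled data.

  • Example: Medical image classification where only a small portion of data is labeled 🏥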

4. Reinforcement Learning

An agent learns by taking actions in an environment and receiving rewards, with the goal of maximizing cumulative reward.

  • Example: Training a robot to walk 🤖 or AlphaGo defeating human champions 🎮

Train, Validate, and Test Data

Data Type | Analogy
Training Data | Teacher explaining concepts 📖
Validation Data | Pre-board exam 📝
Test Data | Final board exam 🎓
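A minimal sketch of how such a three-way split could be made, assuming the data fits in a list. The `train_val_test_split` helper and the 60/20/20 proportions are illustrative choices, not something prescribed above.

```python
import random


def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle rows and cut them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


rows = list(range(100))  # stand-in for 100 labeled examples
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))  # 60 20 20
```

The model is fitted on `train`, tuned against `val`, and scored once on `test` at the very end, just as the final exam is taken only once.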

Model Training and Performance

  • Overfitting: Model memorizes training data but fails on new data.

  • Underfitting: Model is too simple to capture patterns.

  • Generalized Model: Performs well on new data.

Scenario | Training Accuracy | Test Accuracy | Problem
Overfitting | High | Low | Poor generalization
Underfitting | Low | Low | Model is too simple
Generalized Model | High | High | Best case scenario

Bias-Variance Tradeoff

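  • Bias: Error caused by overly simple assumptions in the model. High bias shows up as high training error and leads to underfitting.

  • Variance: Sensitivity of the model to the particular training sample. High variance shows up as low training error but high testing error and leads to overfitting.

  • Tradeoff: Making a model more flexible usually lowers bias but raises variance, so the goal is the level of complexity that gives the lowest error on new data.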
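For squared-error loss, this tradeoff is usually written as the standard decomposition below. The notation is an assumption added here: the target is y = f(x) + ε with zero-mean noise of variance σ², f̂ is the trained model, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, which is why reducing one at the expense of the other is the central design choice.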

Handling Missing Data

Types of Missing Data

1️⃣ MCAR (Missing Completely At Random)

👉 Definition: Missing data is independent of both observed and unobserved data. There is no systematic reason for the missing values.

🔹 Example:

  • A survey was conducted, and some respondents accidentally skipped a question due to a printing error.

  • A lab machine randomly fails to record some measurements due to occasional power fluctuations.

2️⃣ MAR (Missing At Random)

👉 Definition: The missing data depends on the observed data but not on the missing data itself.

🔹 Example:

  • In a medical study, older patients tend to skip certain questions about social media usage. The missing values depend on the patient's age (observed data), but not on the social media usage itself (missing data).

  • In an employee salary dataset, higher-level executives might not disclose their salaries. The missing salary values depend on the "Job Title" column (observed data) but, among people with the same job title, not on the salary itself.

3️⃣ MNAR (Missing Not At Random)

👉 Definition: The missing values depend on the value of the missing data itself.

🔹 Example:

  • In a mental health survey, people with severe anxiety may be more likely to skip answering personal questions. The missing data is related to the actual anxiety level.

  • In an income dataset, people with very high salaries may refuse to disclose their income. The missing values depend on the income itself, as higher earners are less likely to report their salaries.

Methods to Handle Missing Values

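  • If fewer than about 1% of rows have missing values, drop those rows.

  • If more than about 40% of a column's values are missing, consider dropping the column.

  • For continuous variables: Impute with the mean, or with the median when the data is skewed or contains outliers.

  • For categorical variables: Impute with the mode (the most frequent category).

  • Use random imputation (filling gaps with values sampled from the observed ones) for extreme cases.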
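The mean, median, and mode strategies above can be sketched with Python's standard library. The `impute` helper, the use of `None` as the missing marker, and the sample columns are assumptions made for illustration.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


ages = [25.0, 30.0, None, 35.0, 90.0]         # continuous, with one outlier (90)
cities = ["Delhi", None, "Mumbai", "Delhi"]   # categorical

print(impute(ages, "mean"))    # [25.0, 30.0, 45.0, 35.0, 90.0]
print(impute(ages, "median"))  # [25.0, 30.0, 32.5, 35.0, 90.0]
print(impute(cities, "mode"))  # ['Delhi', 'Delhi', 'Mumbai', 'Delhi']
```

Note how the outlier pulls the mean fill value up to 45.0 while the median stays at 32.5, which is the usual reason to prefer the median for skewed columns.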

Handling Imbalanced Data

Class Imbalance Solutions

  • Undersampling (Removing data from the majority class)

  • Oversampling (Duplicating data from the minority class)

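  • SMOTE (Synthetic Minority Oversampling Technique): Creating new synthetic minority samples by interpolating between existing minority samples and their nearest neighbors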
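A small sketch of the first two options, random oversampling and random undersampling, using only the standard library. The helper names and the toy fraud labels are hypothetical. SMOTE is not shown here because it needs a nearest-neighbor search.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


rows = list(range(10))
labels = ["legit"] * 8 + ["fraud"] * 2
over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))   # both classes now have 8 rows
print(Counter(under_labels))  # both classes now have 2 rows
```

Resampling is normally applied to the training split only, so that validation and test data keep the real class proportions.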

Outlier Handling

  • Drop Outliers if they are due to errors.

  • Cap the Outliers (Winsorization).

  • Replace with mean/median if reasonable.
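Capping can be sketched as follows. This is one simple way to do it, assuming a nearest-rank percentile; the `winsorize` helper and the salary figures are made up for the example.

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clip values to the chosen lower and upper percentiles (nearest rank)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[round(lower_pct * last)]
    high = ordered[round(upper_pct * last)]
    # Values below `low` are raised to it; values above `high` are lowered to it.
    return [min(max(v, low), high) for v in values]


salaries = [30, 32, 35, 36, 38, 40, 41, 43, 45, 500]  # 500 looks like an outlier
print(winsorize(salaries, 0.1, 0.9))
# [32, 32, 35, 36, 38, 40, 41, 43, 45, 45]
```

Unlike dropping, capping keeps every row, so the dataset size is unchanged while the extreme value no longer dominates statistics such as the mean.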

Feature Extraction 🏗️

Feature extraction transforms raw data into useful features that enhance model performance.

1️⃣ Creating New Features 🔄

  • Example:

    • From a date column 🗓️ (2024-04-02), extract:

      • Day of the week (Tuesday)

      • Month (April)

      • Year (2024)

  • Example:

    • From a text column 📜 ("I love this product!"), extract:

      • Word Count (4)

      • Sentiment Score (Positive)
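Both examples can be reproduced in a few lines. The helper names are hypothetical, and the sentiment score uses a tiny hand-written word list as a stand-in for a real sentiment model, so treat it as a sketch only.

```python
from datetime import date

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")
MONTH_NAMES = ("January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December")

# Hypothetical word lists; a real project would use a trained sentiment model.
POSITIVE_WORDS = {"love", "great", "good"}
NEGATIVE_WORDS = {"hate", "bad", "poor"}


def date_features(iso_date):
    """Split an ISO date string into day-of-week, month, and year features."""
    d = date.fromisoformat(iso_date)
    return {
        "day_of_week": DAY_NAMES[d.weekday()],  # weekday(): Monday is 0
        "month": MONTH_NAMES[d.month - 1],
        "year": d.year,
    }


def text_features(text):
    """Derive a word count and a crude sentiment label from raw text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE_WORDS for w in words) - sum(
        w in NEGATIVE_WORDS for w in words)
    sentiment = "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"
    return {"word_count": len(words), "sentiment": sentiment}


print(date_features("2024-04-02"))
# {'day_of_week': 'Tuesday', 'month': 'April', 'year': 2024}
print(text_features("I love this product!"))
# {'word_count': 4, 'sentiment': 'Positive'}
```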

2️⃣ Transforming Existing Features 🔄

  • Example:

    • Converting height in cm to height in meters.

    • Applying a log transform to sales data to reduce skewness.
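A quick sketch of both transformations, with made-up numbers. `math.log1p` (the log of 1 + x) is used here as an assumption so that a sales value of zero would not cause an error.

```python
import math

heights_cm = [150.0, 165.0, 180.0]
heights_m = [h / 100 for h in heights_cm]  # unit conversion
print(heights_m)  # [1.5, 1.65, 1.8]

# Right-skewed sales figures: one very large value dominates the range.
sales = [10, 100, 1_000, 100_000]
log_sales = [math.log1p(s) for s in sales]

raw_spread = max(sales) - min(sales)
log_spread = max(log_sales) - min(log_sales)
print(raw_spread, round(log_spread, 2))
```

After the transform, the largest value sits only a few units above the smallest one instead of tens of thousands, which is what "reducing skewness" means in practice.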

Scaling Methods

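  • Standardization (Z-score scaling): Rescales each feature to zero mean and unit standard deviation. Useful for algorithms that are sensitive to feature scale, such as distance-based or gradient-based methods.

  • Normalization (Min-Max Scaling): Rescales each feature to the range 0 to 1.

  • Unit Vector Scaling: Divides each sample (row) by its length so that it has unit length.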
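The three methods can be written out directly. This sketch assumes the population standard deviation for standardization and the Euclidean length for unit vectors. The function names are illustrative.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]  # fails if all values are equal


def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # fails if all values are equal


def unit_vector(row):
    """Divide a sample by its Euclidean length so the result has length 1."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


feature = [10.0, 20.0, 30.0]
print(standardize(feature))       # mean 0, standard deviation 1
print(min_max_scale(feature))     # [0.0, 0.5, 1.0]
print(unit_vector([3.0, 4.0]))    # [0.6, 0.8]
```

Standardization and min-max scaling work column by column (per feature), while unit vector scaling works row by row (per sample).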

Data Encoding Techniques

1️⃣ One-Hot Encoding (OHE) 🟦🔴🟩

  • Used for: Nominal (unordered) categorical variables

  • Example: Suppose we have a Color column:

Color | One-Hot Encoding
Red | (1,0,0)
Blue | (0,1,0)
Green | (0,0,1)

  • Real-World Example: Encoding city names for location-based recommendations 🏙️
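The Color table above can be reproduced with a short function. `one_hot_encode` is a hypothetical helper that assigns positions in the order categories first appear.

```python
def one_hot_encode(values, categories=None):
    """Map each value to a tuple with a single 1 at its category's position."""
    if categories is None:
        categories = list(dict.fromkeys(values))  # unique, in first-seen order
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1
        encoded.append(tuple(vector))
    return encoded


colors = ["Red", "Blue", "Green", "Blue"]
print(one_hot_encode(colors))
# [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0)]
```

Each category gets its own column, so no artificial order is implied between Red, Blue, and Green. The cost is one extra column per category.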

2️⃣ Label Encoding 🔢

  • Used for: Ordinal categorical variables (where order matters)

  • Example: Suppose we have a Size column:

Size | Label Encoding
Small | 0
Medium | 1
Large | 2

  • Why? Because "Large" is greater than "Medium," and "Medium" is greater than "Small."

  • Real-World Example: Encoding education levels (Primary → 0, Secondary → 1, College → 2 🎓)
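A minimal sketch of the Size example. The order is passed in explicitly (an assumption of this sketch), because assigning codes alphabetically would place "Large" before "Medium" and "Small" and lose the intended ranking.

```python
SIZE_ORDER = ["Small", "Medium", "Large"]  # smallest to largest


def ordinal_encode(values, order):
    """Replace each category with its position in the given order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


sizes = ["Medium", "Small", "Large", "Small"]
print(ordinal_encode(sizes, SIZE_ORDER))  # [1, 0, 2, 0]
```

The same function covers the education example by passing `["Primary", "Secondary", "College"]` as the order.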

3️⃣ Target Guided Encoding 🎯

  • Used for: Nominal categorical variables, where the categories are ranked by their relationship with the target variable

  • Example: Suppose we are predicting Loan Approval (Yes/No), and we have a Job Type column:

Job Type | Loan Approval Rate (%) | Target Guided Encoding
Manager | 80% | 3
Engineer | 70% | 2
Clerk | 50% | 1

  • Why? The job type is ranked based on its impact on loan approval rates.

  • Real-World Example: Encoding customer segments based on purchase likelihood in e-commerce 🛍️

Model Evaluation Metrics

Conclusion

Machine Learning is revolutionizing industries by making data-driven predictions. From handling missing data and class imbalances to feature engineering, each step is crucial for building an accurate model.

🚀 Keep exploring, keep learning, and let data guide your decisions!

