20 Sklearn Concepts with Before-and-After Examples

Anix Lynch

1. Model Selection (Splitting)

Boilerplate Code:

from sklearn.model_selection import train_test_split

Use Case: Split your data into two groups: one for training the model and another for testing how well it performs.

Goal: Ensure the model doesn't overfit and performs well on unseen data.

Sample Code:

# Splitting the features (X) and labels (y) into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 80% of the data for training, 20% for testing!

Before Example: You've got a big stack of documents and don't know which part to use for learning and which for testing.

Data: [document1, document2, document3, ..., document100]

After Example: Split the data into training and testing sets (80% for training, 20% for testing).

Training data: [document1, document2, ..., document80]
Testing data: [document81, document82, ..., document100]

Challenge: Try changing test_size to 0.3 and see how it affects the split!


2. Preprocessing (Data Preparation)

Boilerplate Code:

from sklearn.preprocessing import StandardScaler, LabelEncoder

Use Case: You need to prepare your data by scaling numbers to a consistent range or encoding categories into numbers, so your model understands it better.

Goal: Ensure all data is in the same format and scale, so the model doesn't get confused.

Sample Code:

# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Encode categorical labels into numbers
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# Now your data is ready for the model to digest!

Before Example: The intern has raw data, with heights in centimeters and animal types in words.

Height: [150cm, 160cm, 170cm]
Animal: ["cat", "dog", "bird"]

After Example: Now, heights are scaled, and animals are encoded into numbers.

Height (scaled): [0.2, 0.5, 0.8]
Animal (encoded): [0, 1, 2]

Challenge: Try using MinMaxScaler instead of StandardScaler to scale values between 0 and 1, and see the difference!

3. Metrics (Evaluation)

Boilerplate Code:

from sklearn.metrics import accuracy_score, precision_score, mean_squared_error

Use Case: After training your model, you need to check how well it performed by evaluating accuracy, precision, or error.

Goal: Measure how well the model's predictions match the actual values.

Sample Code:

# Check how many predictions were correct
accuracy = accuracy_score(y_test, y_pred)

# Measure precision for binary classification
precision = precision_score(y_test, y_pred)

# Calculate the error for regression models
mse = mean_squared_error(y_test, y_pred)

# Now you know how well your model performed!

Before Example: The intern's model makes predictions, but they don't know how well it did.

Predictions: [Yes, No, Yes]
Actual: [Yes, No, No]

After Example: The accuracy score tells them how many predictions were correct!

Accuracy Score: 67% (2 out of 3 correct)

Challenge: Try calculating the f1_score to balance precision and recall!
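
A minimal sketch of that challenge, assuming the same y_test and y_pred arrays used above and binary 0/1 labels:

from sklearn.metrics import f1_score

# F1 is the harmonic mean of precision and recall
f1 = f1_score(y_test, y_pred)
print(f"F1 score: {f1:.2f}")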


4. Linear Models

Boilerplate Code:

from sklearn.linear_model import LogisticRegression, LinearRegression

Use Case: Use a linear model to predict outcomes, whether for classification (Logistic) or regression (Linear).

Goal: Use simple equations to make predictions about future outcomes!

Sample Code:

# Predict binary outcomes with logistic regression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predict continuous values with linear regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# You're all set to make predictions!

Before Example: We want to predict salary based on years of experience but don't have a formula.

Data: [Years: 1, Salary: $30k], [Years: 2, Salary: $40k]

After Example: We now have a simple linear equation to make predictions!

Linear Equation: Salary = 20k + 10k * (Years of Experience)

Challenge: Try using Ridge or Lasso regression to handle overfitting and compare results!
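
If you want to try it, here is a minimal sketch, assuming the same X_train and y_train regression data as above (alpha controls the strength of the penalty):

from sklearn.linear_model import Ridge, Lasso

# Ridge (L2 penalty) and Lasso (L1 penalty) shrink coefficients to curb overfitting
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# Lasso can push some coefficients all the way to zero
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)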

5. Ensemble Methods

Boilerplate Code:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

Use Case: Combine the predictions of multiple models to improve accuracy.

Goal: Boost your model's performance by having multiple models "vote" on the final prediction.

Sample Code:

# Random forest for classification
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# Gradient boosting for classification
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

# Multiple models working together for a better result!

Before Example: We have one model making predictions but are unsure of its accuracy.

Model 1: "Yes"

After Example: Now, multiple models vote, and the majority wins!

Model 1: "Yes", Model 2: "No", Model 3: "Yes" → Final prediction: "Yes"

Challenge: Try using AdaBoostClassifier for a different boosting technique and compare results!
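
A minimal sketch of that challenge, assuming the same X_train/y_train and X_test/y_test split as earlier:

from sklearn.ensemble import AdaBoostClassifier

# AdaBoost trains estimators one after another, focusing on previously misclassified samples
ada = AdaBoostClassifier(n_estimators=100)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))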


6. Support Vector Machines (SVM)

Boilerplate Code:

from sklearn.svm import SVC, SVR

Use Case: Use Support Vector Machines to create decision boundaries for classification or prediction.
The difference between SVC (Support Vector Classifier) and SVR (Support Vector Regressor) is in what kind of task they are used for:

  1. SVC (Support Vector Classifier):

    • Task: Classification

    • Use: Separates data points into discrete categories or classes.

    • Example: Classifying images as either cats or dogs.

    • Goal: Draw a decision boundary that separates classes as clearly as possible.

  2. SVR (Support Vector Regressor):

    • Task: Regression

    • Use: Predicts continuous values, like numbers or measurements.

    • Example: Predicting the price of a house based on features like size, location, etc.

    • Goal: Fit a curve that predicts continuous values with minimal error.

In both cases, they find a boundary or line (in SVC) or a curve (in SVR) that optimizes the separation or prediction based on the data.

Goal: Separate different classes with a boundary that maximizes the gap between them.

Sample Code:

# Classify with Support Vector Classifier
svc = SVC()
svc.fit(X_train, y_train)

# Predict continuous values with Support Vector Regressor
svr = SVR()
svr.fit(X_train, y_train)

# Now you're ready to separate classes with clear boundaries!

Before Example: We work with a mixed dataset of cats and dogs.

Data: [cats, dogs, mixed up]

After Example: A decision boundary now neatly separates them into groups.

Boundary: [cats on one side, dogs on the other]

Challenge: Try changing the kernel parameter (e.g., 'linear', 'rbf') and see how it changes the decision boundary!
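
One way to try the challenge, assuming the same train/test split as before:

from sklearn.svm import SVC

# Compare a straight-line boundary with the default curved (RBF) one
for kernel in ['linear', 'rbf']:
    svc = SVC(kernel=kernel)
    svc.fit(X_train, y_train)
    print(kernel, "accuracy:", svc.score(X_test, y_test))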


7. Nearest Neighbors

Boilerplate Code:

from sklearn.neighbors import KNeighborsClassifier

Use Case: Classify new data points based on the nearest neighbors.

Goal: Make predictions by finding the closest examples and using them as a guide.

Sample Code:

# Classify with K-nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Now your model will classify based on the nearest neighbors!

Before Example: We are unsure how to classify new data.

New Data: ? | Neighbors: [cat, cat, dog]

After Example: The model classifies based on the majority of the nearest neighbors!

Prediction: "cat" (since most neighbors are cats)

Challenge: Try changing the number of neighbors (e.g., n_neighbors=3 or 7) and see how it affects the predictions!
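
A small sketch for the challenge, assuming the usual X_train/y_train and X_test/y_test split:

from sklearn.neighbors import KNeighborsClassifier

# Try a few neighborhood sizes and compare test accuracy
for k in [3, 5, 7]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(f"k={k}: accuracy={knn.score(X_test, y_test):.2f}")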



8. Decision Trees

Boilerplate Code:

from sklearn.tree import DecisionTreeClassifier

Use Case: Use if-then rules to classify data step-by-step, creating a decision-making flow.

Goal: Build a tree of decisions to arrive at a prediction based on different features.

Sample Code:

# Classify using a decision tree
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Your decision tree is ready to make predictions based on rules!

Before Example: We don't know how to make classification decisions without a structure.

Is it furry? Yes → Is it a pet? Yes → ?

After Example: The decision tree now helps classify data based on clear rules!

If furry → yes, If pet → yes → classify as "cat"

Challenge: Try adjusting the max_depth of the tree and see how it changes performance. Can you avoid overfitting?
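
A rough way to explore that, assuming the same train/test split (comparing train vs. test accuracy is a quick overfitting check):

from sklearn.tree import DecisionTreeClassifier

# A shallow tree is less likely to memorize the training data
for depth in [3, 5, None]:  # None lets the tree grow until the leaves are pure
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    print(f"max_depth={depth}: train={dt.score(X_train, y_train):.2f}, test={dt.score(X_test, y_test):.2f}")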


9. Cross-Validation

Boilerplate Code:

from sklearn.model_selection import cross_val_score

Use Case: Test the model multiple times by splitting the data differently each time to get more reliable performance metrics.

Goal: Ensure your model's performance isn't dependent on a single data split.

Sample Code:

# Evaluate model using cross-validation
scores = cross_val_score(model, X, y, cv=5)

# Average score across 5 different splits of the data
average_score = scores.mean()

Before Example: We test the model just once and get a performance score, but it's hard to know if the score is reliable.

Score from one test: 85%

After Example: By testing across multiple splits, we get a more reliable average score!

Scores: [85%, 90%, 80%, 88%, 84%] → Average: 85%

Challenge: Experiment with different cv values (e.g., cv=3, cv=10) and see how it impacts the results!


10. Hyperparameter Tuning

Boilerplate Code:

from sklearn.model_selection import GridSearchCV

Use Case: Optimize your model by finding the best hyperparameters through an automated search.

Goal: Fine-tune your model by testing different hyperparameters and selecting the best combination.

Sample Code:

# Define the hyperparameters to search
param_grid = {'n_estimators': [100, 200], 'max_depth': [3, 5, 7]}

# Use grid search to find the best parameters
grid_search = GridSearchCV(RandomForestClassifier(), param_grid)
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_

Before Example: We use default hyperparameters for the model and don't know if they're optimal.

Parameters: n_estimators = 100, max_depth = 3

After Example: Now we run a grid search and find the best parameters for the model!

Best Parameters: n_estimators = 200, max_depth = 5

Challenge: Try changing the range of hyperparameters in param_grid and see how the best parameters change!


11. Pipelines (Sequential Workflow)

Boilerplate Code:

from sklearn.pipeline import Pipeline

Use Case: Combine multiple steps into one single workflow. You can connect data preprocessing, feature engineering, and model training in a single pipeline!

Goal: Automate and streamline your machine learning workflow by creating a sequential process that ties multiple steps together.

Sample Code:

# Create a pipeline with a scaler and a logistic regression model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Now the pipeline scales the data and trains the model in one go!

Before Example: We manually scale the data, then pass it to the model, then predict, one step at a time.

Manually scale → Manually pass to the model → Manually predict

After Example: With a pipeline, it's all automatic!

Pipeline automatically scales, trains, and predicts in sequence!

Challenge: Try adding PCA into the pipeline between the scaler and the model and see how it impacts the model's performance!
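
A sketch of the challenge, assuming the same training data; the n_components value here is just an illustration:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Scale, reduce dimensions, then classify, all in one pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('classifier', LogisticRegression())
])
pipeline.fit(X_train, y_train)
print("Accuracy with PCA step:", pipeline.score(X_test, y_test))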


12. Polynomial Features (Feature Engineering)

Boilerplate Code:

from sklearn.preprocessing import PolynomialFeatures

Use Case: Expand your feature set by adding polynomial terms (e.g., squares, cubes), creating more complex relationships between features.

Goal: Enrich the dataset by transforming simple features into polynomial ones to capture more complex patterns.

Sample Code:

# Expand features by adding polynomial terms (e.g., x², x³)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# You've created new polynomial features!

Before Example: We have simple features but know there are hidden patterns to capture.

Features: [x1, x2]

After Example: After expanding with polynomial features, new patterns are created!

Expanded Features: [1, x1, x2, x1², x1*x2, x2²]

Challenge: Try increasing the degree (e.g., degree=3) and see how it affects the complexity of the model.


13. Feature Selection (Selecting Important Features)

Boilerplate Code:

from sklearn.feature_selection import SelectKBest

Use Case: Select only the most important features based on statistical tests, filtering out the unnecessary ones.

Goal: Narrow down your data to the top K features that are most relevant for the model.

Sample Code:

# Select the top 5 best features
selector = SelectKBest(k=5)
X_selected = selector.fit_transform(X, y)

# Now you've narrowed down to the most important features!

Before Example: We have too many features, but not all are relevant for the model.

Features: [A, B, C, D, E, F]

After Example: After using feature selection, we keep only the top 5 features!

Selected Features: [A, B, D, E, F]

Challenge: Try selecting a different number of features (e.g., k=3) and see how it impacts model performance!


14. Scaling Data (MinMaxScaler)

Boilerplate Code:

from sklearn.preprocessing import MinMaxScaler

Use Case: Scale your data to fit between a given range (usually 0 and 1) to avoid any feature dominating due to scale differences.

Goal: Ensure all features are within the same range for better model performance.

Sample Code:

# Scale features to the range [0, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Now your data is scaled to the same range!

Before Example: Our data has values with very different ranges (e.g., heights in cm and incomes in thousands).

Height: [150, 160, 170]
Income: [30k, 50k, 60k]

After Example: After scaling, all the data is within the range [0, 1].

Height (scaled): [0.2, 0.5, 0.8]
Income (scaled): [0.4, 0.6, 1.0]

Challenge: Try using StandardScaler and compare the results with MinMaxScaler.


15. Outlier Detection (Isolation Forest)

Boilerplate Code:

from sklearn.ensemble import IsolationForest

Use Case: Detect outliers in your dataset: those rare and unusual points that don't fit the pattern.

Goal: Identify and remove outliers so they don't skew your model.

Sample Code:

# Use Isolation Forest to detect outliers
iso_forest = IsolationForest()
iso_forest.fit(X_train)

# Predict outliers (-1 means it's an outlier, 1 means it's not)
outlier_predictions = iso_forest.predict(X_test)

Before Example: The data has hidden outliers that could throw off the model's performance.

Data: [1, 2, 2, 3, 100]

After Example: The outlier is detected and can be dealt with!

Outlier detected: 100

Challenge: Try adjusting contamination (e.g., contamination=0.05) to change how sensitive the model is to outliers.
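
A minimal sketch of that, assuming the same X_train as above:

from sklearn.ensemble import IsolationForest

# contamination is the expected fraction of outliers in the data
iso_forest = IsolationForest(contamination=0.05, random_state=42)
outlier_flags = iso_forest.fit_predict(X_train)  # -1 = outlier, 1 = inlier
print("Outliers flagged:", (outlier_flags == -1).sum())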


16. Dimensionality Reduction (PCA)

Boilerplate Code:

from sklearn.decomposition import PCA

Use Case: Reduce the number of features while keeping most of the important information. Perfect when you have too many features!

Goal: Compress your dataset into fewer dimensions without losing too much information.

Sample Code:

# Reduce to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# You've reduced the data to just 2 dimensions!

Before Example: We have a dataset with too many features, making analysis difficult.

Features: [Feature1, Feature2, ..., Feature100]

After Example: We use PCA to reduce the data to fewer dimensions.

Reduced Features: [PrincipalComponent1, PrincipalComponent2]

Challenge: Try using different numbers of components (e.g., n_components=3) and see how much variance is retained!
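
A quick way to check that, assuming the same X as above:

from sklearn.decomposition import PCA

# explained_variance_ratio_ shows how much information each component keeps
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
print("Variance retained per component:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())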


17. Clustering (KMeans)

Boilerplate Code:

from sklearn.cluster import KMeans

Use Case: Automatically group your data into clusters based on similarity.

Goal: Discover hidden patterns by grouping similar data points together, no labels needed!

Sample Code:

# Cluster the data into 3 groups
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

# Get the cluster labels for each data point
labels = kmeans.predict(X)

# Now you've grouped the data into clusters!

Before Example: We have unlabeled data and no way to classify it into meaningful groups.

Data: [Data Point 1, Data Point 2, Data Point 3, ...]

After Example: We group the data into clusters, revealing hidden patterns!

Cluster 1: [Data Point 1, Data Point 3]
Cluster 2: [Data Point 2]

Challenge: Try changing the number of clusters (n_clusters=5) and see how it affects the groupings.
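
One way to compare cluster counts, assuming the same X; inertia_ (the within-cluster sum of squares) usually drops as clusters are added:

from sklearn.cluster import KMeans

# Fit KMeans with different numbers of clusters and compare inertia
for k in [2, 3, 5]:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    print(f"n_clusters={k}: inertia={kmeans.inertia_:.1f}")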


18. Model Calibration (Calibrated Classifier)

In the context of machine learning, model calibration is the process of adjusting a model's predicted probabilities to make them more realistic.

For example, if a model predicts a 70% chance of rain but it only rains 40% of the time when the model says 70%, then the probabilities aren't well-calibrated (or well-adjusted). Calibration fixes this so that the probabilities reflect reality more closely.

Boilerplate Code:

from sklearn.calibration import CalibratedClassifierCV

Use Case: Adjust your model's probability predictions to make them more reliable and realistic.

Goal: Fine-tune predicted probabilities so they're accurate and trustworthy.

Sample Code:

# Calibrate a classifier by wrapping the base model (an SVC here)
calibrated_model = CalibratedClassifierCV(SVC())
calibrated_model.fit(X_train, y_train)

# Predict probabilities for test data
calibrated_probs = calibrated_model.predict_proba(X_test)

# Now your probabilities are calibrated and more accurate!

Before Example: The model gives probabilities, but they're not very reliable.

Predicted Probabilities: [0.7, 0.4, 0.9]

After Example: With calibration, the probabilities are more realistic and trustworthy!

Calibrated Probabilities: [0.6, 0.5, 0.8]

Challenge: Try calibrating different classifiers (e.g., RandomForestClassifier) and compare the changes in probability predictions!


19. Nearest Centroid Classifier

Boilerplate Code:

from sklearn.neighbors import NearestCentroid

Use Case: Classify new data points based on the centroid of the closest class, a simple but effective classifier.

Goal: Assign a class to each point by calculating the centroid (center) of each class and finding which class's centroid is closest.

Sample Code:

# Use nearest centroid classifier
nc = NearestCentroid()
nc.fit(X_train, y_train)

# Predict labels for the test data
y_pred = nc.predict(X_test)

# Now your data is classified based on the nearest centroid!

Before Example: We are unsure how to classify new data and need a simple approach.

New Data: [Unknown] | Centroids: [Centroid1, Centroid2]

After Example: The model classifies based on which centroid is closest!

Assigned Class: [Class 1]

Challenge: Try running NearestCentroid on a dataset with more than two classes and see how it handles multiple centroids.


20. Voting Classifier (Combining Classifiers)

Boilerplate Code:

from sklearn.ensemble import VotingClassifier

Use Case: Combine multiple models to make a final prediction based on the majority vote. Perfect for improving accuracy!

Goal: Improve prediction performance by having multiple models vote on the outcome, and the majority wins!

Sample Code:

# Define multiple classifiers
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = SVC()

# Use VotingClassifier to combine them
voting_clf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)], voting='hard')
voting_clf.fit(X_train, y_train)

# Now your final predictions are based on the majority vote!

Before Example: We use one model, but we're unsure of its accuracy.

Model 1: "Yes"

After Example: Multiple models vote, and the majority rules!

Model 1: "Yes", Model 2: "No", Model 3: "Yes" → Final prediction: "Yes"

Challenge: Try using voting='soft' to combine the probabilities instead of the hard majority and compare the results.
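
A sketch of the soft-voting variant, assuming the same split as above; every model needs predict_proba, which is why SVC gets probability=True here:

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Soft voting averages the predicted probabilities instead of counting votes
soft_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier()),
                ('svc', SVC(probability=True))],
    voting='soft'
)
soft_clf.fit(X_train, y_train)
print("Soft voting accuracy:", soft_clf.score(X_test, y_test))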


Bonus point:

All of these methods (Cross-Validation, Voting Classifier, Boosting, Stacking, and Ensemble Methods) use multiple models, but they do so in different ways. Here's a simplified breakdown of the differences:

1. Cross-Validation

  • Purpose: It's a validation technique, not a model combination method.

  • How it works: You split your dataset into multiple parts (folds) and train the model on different subsets to evaluate its performance on unseen data. This helps to ensure your model generalizes well.

  • Key Point: You're using one model but evaluating it in different ways on different data splits.

2. Voting Classifier

  • Purpose: Combines multiple models and makes a decision by "voting."

  • How it works: You train several models and let each one "vote" on the prediction. There are two types:

    • Hard Voting: Each model gives a class prediction, and the most common class wins.

    • Soft Voting: Each model gives a probability for each class, and the probabilities are averaged.

  • Key Point: It aggregates decisions from multiple models to improve accuracy by combining their strengths.

3. Boosting

  • Purpose: Sequentially trains models, with each new model trying to fix the errors of the previous one.

  • How it works: Models are trained one after another, and each new model focuses on correcting mistakes made by the previous models. Examples include AdaBoost and Gradient Boosting.

  • Key Point: It's a sequential process where each model builds on the mistakes of the previous ones, making the final model stronger.

4. Stacking

  • Purpose: Combines predictions from multiple models by using another model (called a meta-model) to make the final prediction.

  • How it works: Multiple base models are trained, and their predictions are used as inputs to a meta-model. The meta-model then makes the final prediction.

  • Key Point: It layers models, using the predictions of several models as inputs for another model to make a final decision (see the code sketch at the end of this section).

5. Ensemble Methods

  • Purpose: General term for using multiple models to make better predictions.

  • How it works: Any technique that combines multiple models (e.g., Voting, Boosting, Stacking, etc.) is called an ensemble method. The idea is to reduce error and increase accuracy by using the strengths of different models.

  • Key Point: An umbrella term that includes techniques like bagging (e.g., Random Forests), boosting, and stacking.


Summary of Differences:

  • Cross-Validation: Splits data to evaluate the performance of a single model.

  • Voting Classifier: Combines models by letting them vote on the final decision.

  • Boosting: Sequentially builds models, where each new model corrects the previous one's errors.

  • Stacking: Uses the predictions of multiple models as inputs to a new model.

  • Ensemble Methods: General term for techniques that combine multiple models to improve performance.

Each of these techniques has a different way of leveraging multiple models to improve accuracy or robustness!
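
To close, scikit-learn has a built-in StackingClassifier for the layering idea described under Stacking above; a minimal sketch, assuming the same X_train/y_train and X_test/y_test split used earlier:

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Base models make predictions; a logistic regression meta-model combines them
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier()), ('svc', SVC())],
    final_estimator=LogisticRegression()
)
stack.fit(X_train, y_train)
print("Stacking accuracy:", stack.score(X_test, y_test))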
