20 Sklearn concepts with Before-and-After Examples
Table of contents
- 1. Model Selection (Splitting)
- 2. Preprocessing (Data Preparation)
- 3. Metrics (Evaluation)
- 4. Linear Models
- 5. Ensemble Methods
- 6. Support Vector Machines (SVM)
- 7. Nearest Neighbors
- 8. Decision Trees
- 9. Cross-Validation
- 10. Hyperparameter Tuning
- 11. Pipelines (Sequential Workflow)
- 12. Polynomial Features (Feature Engineering)
- 13. Feature Selection (Selecting Important Features)
- 14. Scaling Data (MinMaxScaler)
- 15. Outlier Detection (Isolation Forest)
- 16. Dimensionality Reduction (PCA)
- 17. Clustering (KMeans)
- 18. Model Calibration (Calibrated Classifier)
- 19. Nearest Centroid Classifier
- 20. Voting Classifier (Combining Classifiers)
1. Model Selection (Splitting)
Boilerplate Code:
from sklearn.model_selection import train_test_split
Use Case: Split your data into two groups: one for training the model and another for testing how well it performs.
Goal: Ensure the model doesn't overfit and performs well on unseen data.
Sample Code:
# Splitting the features (X) and labels (y) into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 80% of the data for training, 20% for testing!
Before Example: You've got a big stack of documents and don't know which part to use for learning and which for testing.
Data: [document1, document2, document3, ..., document100]
After Example: Split the data into training and testing sets (80% for training, 20% for testing).
Training data: [document1, document2, ..., document80]
Testing data: [document81, document82, ..., document100]
Challenge: Try changing test_size to 0.3 and see how it affects the split!
2. Preprocessing (Data Preparation)
Boilerplate Code:
from sklearn.preprocessing import StandardScaler, LabelEncoder
Use Case: You need to prepare your data by scaling numbers to a consistent range or encoding categories into numbers, so your model understands it better.
Goal: Ensure all data is in the same format and scale, so the model doesn't get confused.
Sample Code:
# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Encode categorical labels into numbers
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
# Now your data is ready for the model to digest!
Before Example: We have raw data, with heights in centimeters and animal types in words.
Height: [150cm, 160cm, 170cm]
Animal: ["cat", "dog", "bird"]
After Example: Now, heights are scaled, and animals are encoded into numbers.
Height (scaled): [0.2, 0.5, 0.8]
Animal (encoded): [0, 1, 2]
Challenge: Try using MinMaxScaler instead of StandardScaler to scale values between 0 and 1, and see the difference!
3. Metrics (Evaluation)
Boilerplate Code:
from sklearn.metrics import accuracy_score, precision_score, mean_squared_error
Use Case: After training your model, you need to check how well it performed by evaluating accuracy, precision, or error.
Goal: Measure how well the model's predictions match the actual values.
Sample Code:
# Check how many predictions were correct
accuracy = accuracy_score(y_test, y_pred)
# Measure precision for binary classification
precision = precision_score(y_test, y_pred)
# Calculate the error for regression models
mse = mean_squared_error(y_test, y_pred)
# Now you know how well your model performed!
Before Example: Our model makes predictions, but we don't know how well it did.
Predictions: [Yes, No, Yes]
Actual: [Yes, No, No]
After Example: The accuracy score tells us how many predictions were correct!
Accuracy Score: 67% (2 out of 3 correct)
Challenge: Try calculating the f1_score to balance precision and recall!
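A minimal sketch of this challenge, assuming the same y_test and y_pred as in the sample code above:
# Balance precision and recall in a single metric
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred)  # assumes binary labels; pass average='macro' for multi-class
print(f"F1 score: {f1:.2f}")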
4. Linear Models
Boilerplate Code:
from sklearn.linear_model import LogisticRegression, LinearRegression
Use Case: Use a linear model to predict outcomes, whether for classification (Logistic) or regression (Linear).
Goal: Use simple equations to make predictions about future outcomes!
Sample Code:
# Predict binary outcomes with logistic regression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
# Predict continuous values with linear regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
# You're all set to make predictions!
Before Example: We want to predict salary based on years of experience but don't have a formula.
Data: [Years: 1, Salary: $30k], [Years: 2, Salary: $40k]
After Example: We now have a simple linear equation to make predictions!
Linear Equation: Salary = $20k + $10k * (Years of Experience)
Challenge: Try using Ridge or Lasso regression to handle overfitting and compare results!
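Here's one possible take on the challenge, assuming the same train/test split as above; the alpha values are just illustrative starting points:
from sklearn.linear_model import Ridge, Lasso
# Ridge shrinks coefficients (L2 penalty); Lasso can zero some out entirely (L1 penalty)
ridge = Ridge(alpha=1.0)   # alpha controls regularization strength (illustrative value)
ridge.fit(X_train, y_train)
lasso = Lasso(alpha=0.1)   # illustrative value
lasso.fit(X_train, y_train)
print(ridge.score(X_test, y_test), lasso.score(X_test, y_test))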
5. Ensemble Methods
Boilerplate Code:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
Use Case: Combine the predictions of multiple models to improve accuracy.
Goal: Boost your model's performance by having multiple models "vote" on the final prediction.
Sample Code:
# Random forest for classification
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
# Gradient boosting for classification
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
# Multiple models working together for a better result!
Before Example: We have one model making predictions but are unsure of its accuracy.
Model 1: "Yes"
After Example: Now, multiple models vote, and the majority wins!
Model 1: "Yes", Model 2: "No", Model 3: "Yes" → Final prediction: "Yes"
Challenge: Try using AdaBoostClassifier for a different boosting technique and compare results!
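A quick sketch of the challenge, assuming the same train/test split as the earlier sections:
from sklearn.ensemble import AdaBoostClassifier
# Boosting: each new weak learner focuses on the samples the previous ones got wrong
ada = AdaBoostClassifier(n_estimators=100)  # illustrative number of weak learners
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))  # compare with rf and gb above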
6. Support Vector Machines (SVM)
Boilerplate Code:
from sklearn.svm import SVC, SVR
Use Case: Use Support Vector Machines to create decision boundaries for classification or prediction.
The difference between SVC (Support Vector Classifier) and SVR (Support Vector Regressor) is in what kind of task they are used for:
SVC (Support Vector Classifier):
Task: Classification
Use: Separates data points into discrete categories or classes.
Example: Classifying images as either cats or dogs.
Goal: Draw a decision boundary that separates classes as clearly as possible.
SVR (Support Vector Regressor):
Task: Regression
Use: Predicts continuous values, like numbers or measurements.
Example: Predicting the price of a house based on features like size, location, etc.
Goal: Fit a curve that predicts continuous values with minimal error.
In both cases, they find a boundary or line (in SVC) or a curve (in SVR) that optimizes the separation or prediction based on the data.
Goal: Separate different classes with a boundary that maximizes the gap between them.
Sample Code:
# Classify with Support Vector Classifier
svc = SVC()
svc.fit(X_train, y_train)
# Predict continuous values with Support Vector Regressor
svr = SVR()
svr.fit(X_train, y_train)
# Now you're ready to separate classes with clear boundaries!
Before Example: We work with a mixed dataset of cats and dogs.
Data: [cats, dogs, mixed up]
After Example: A decision boundary now neatly separates them into groups.
Boundary: [cats on one side, dogs on the other]
Challenge: Try changing the kernel parameter (e.g., 'linear', 'rbf') and see how it changes the decision boundary!
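One way to try this: loop over a few kernels and compare test accuracy (assuming the same split as above; accuracy here is a rough proxy for how the boundary changes):
from sklearn.svm import SVC
for kernel in ['linear', 'rbf', 'poly']:
    svc = SVC(kernel=kernel)
    svc.fit(X_train, y_train)
    print(kernel, svc.score(X_test, y_test))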
7. Nearest Neighbors
Boilerplate Code:
from sklearn.neighbors import KNeighborsClassifier
Use Case: Classify new data points based on the nearest neighbors.
Goal: Make predictions by finding the closest examples and using them as a guide.
Sample Code:
# Classify with K-nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Now your model will classify based on the nearest neighbors!
Before Example: We are unsure how to classify new data.
New Data: ? | Neighbors: [cat, cat, dog]
After Example: The model classifies based on the majority of the nearest neighbors!
Prediction: "cat" (since most neighbors are cats)
Challenge: Try changing the number of neighbors (e.g., n_neighbors=3 or n_neighbors=7) and see how it affects the predictions!
8. Decision Trees
Boilerplate Code:
from sklearn.tree import DecisionTreeClassifier
Use Case: Use if-then rules to classify data step-by-step, creating a decision-making flow.
Goal: Build a tree of decisions to arrive at a prediction based on different features.
Sample Code:
# Classify using a decision tree
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
# Your decision tree is ready to make predictions based on rules!
Before Example: We don't know how to make classification decisions without a structure.
Is it furry? Yes → Is it a pet? Yes → ?
After Example: The decision tree now helps classify data based on clear rules!
If furry → yes, if pet → yes → classify as "cat"
Challenge: Try adjusting the max_depth of the tree and see how it changes performance. Can you avoid overfitting?
9. Cross-Validation
Boilerplate Code:
from sklearn.model_selection import cross_val_score
Use Case: Test the model multiple times by splitting the data differently each time to get more reliable performance metrics.
Goal: Ensure your model's performance isn't dependent on a single data split.
Sample Code:
# Evaluate model using cross-validation
scores = cross_val_score(model, X, y, cv=5)
# Average score across 5 different splits of the data
average_score = scores.mean()
Before Example: We test the model just once and get a performance score, but it's hard to know if the score is reliable.
Score from one test: 85%
After Example: By testing across multiple splits, we get a more reliable average score!
Scores: [85%, 90%, 80%, 88%, 84%] → Average: 85%
Challenge: Experiment with different cv values (e.g., cv=3, cv=10) and see how it impacts the results!
10. Hyperparameter Tuning
Boilerplate Code:
from sklearn.model_selection import GridSearchCV
Use Case: Optimize your model by finding the best hyperparameters through an automated search.
Goal: Fine-tune your model by testing different hyperparameters and selecting the best combination.
Sample Code:
# Define the hyperparameters to search
param_grid = {'n_estimators': [100, 200], 'max_depth': [3, 5, 7]}
# Use grid search to find the best parameters
grid_search = GridSearchCV(RandomForestClassifier(), param_grid)
grid_search.fit(X_train, y_train)
# Get the best parameters
best_params = grid_search.best_params_
Before Example: We use default hyperparameters for the model and don't know if they're optimal.
Parameters: n_estimators = 100, max_depth = 3
After Example: Now we run a grid search and find the best parameters for the model!
Best Parameters: n_estimators = 200, max_depth = 5
Challenge: Try changing the range of hyperparameters in param_grid and see how the best parameters change!
11. Pipelines (Sequential Workflow)
Boilerplate Code:
from sklearn.pipeline import Pipeline
Use Case: Combine multiple steps into a single workflow. You can connect data preprocessing, feature engineering, and model training in one pipeline!
Goal: Automate and streamline your machine learning workflow by creating a sequential process that ties multiple steps together.
Sample Code:
# Create a pipeline with a scaler and a logistic regression model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
# Fit the pipeline
pipeline.fit(X_train, y_train)
# Now the pipeline scales the data and trains the model in one go!
Before Example: We manually scale the data and then pass it to the model.
Manually scale → Manually pass to the model → Manually predict
After Example: With a pipeline, it's all automatic!
The pipeline automatically scales, trains, and predicts in sequence!
Challenge: Try adding PCA into the pipeline between the scaler and the model and see how it impacts the model's performance!
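A sketch of the challenge: the same pipeline with a PCA step slipped between the scaler and the classifier (n_components=2 is just an illustrative choice):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),        # illustrative number of components
    ('classifier', LogisticRegression())
])
pipeline_pca.fit(X_train, y_train)
print(pipeline_pca.score(X_test, y_test))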
12. Polynomial Features (Feature Engineering)
Boilerplate Code:
from sklearn.preprocessing import PolynomialFeatures
Use Case: Expand your feature set by adding polynomial terms (e.g., squares, cubes), creating more complex relationships between features.
Goal: Enrich the dataset by transforming simple features into polynomial ones to capture more complex patterns.
Sample Code:
# Expand features by adding polynomial terms (e.g., x², x³)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
# You've created new polynomial features!
Before Example: We have simple features but know there are hidden patterns to capture.
Features: [x1, x2]
After Example: After expanding with polynomial features, new patterns are created!
Expanded Features: [1, x1, x2, x1², x1*x2, x2²]
Challenge: Try increasing the degree (e.g., degree=3) and see how it affects the complexity of the model.
13. Feature Selection (Selecting Important Features)
Boilerplate Code:
from sklearn.feature_selection import SelectKBest
Use Case: Select only the most important features based on statistical tests, filtering out the unnecessary ones.
Goal: Narrow down your data to the top K features that are most relevant for the model.
Sample Code:
# Select the top 5 best features
selector = SelectKBest(k=5)
X_selected = selector.fit_transform(X, y)
# Now you've narrowed down to the most important features!
Before Example: We have too many features, but not all are relevant for the model.
Features: [A, B, C, D, E, F]
After Example: After using feature selection, we keep only the top 5 features!
Selected Features: [A, B, D, E, F]
Challenge: Try selecting a different number of features (e.g., k=3) and see how it impacts model performance!
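A minimal sketch of the challenge; note that SelectKBest defaults to the f_classif ANOVA test, so this assumes a classification problem:
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=3)  # keep only the top 3 features
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask showing which features were kept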
14. Scaling Data (MinMaxScaler)
Boilerplate Code:
from sklearn.preprocessing import MinMaxScaler
Use Case: Scale your data to fit within a given range (usually 0 to 1) to avoid any feature dominating due to scale differences.
Goal: Ensure all features are within the same range for better model performance.
Sample Code:
# Scale features to the range [0, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Now your data is scaled to the same range!
Before Example: Our data has values with very different ranges (e.g., heights in cm and incomes in thousands).
Height: [150, 160, 170]
Income: [30k, 50k, 60k]
After Example: After scaling, all the data is within the range [0, 1].
Height (scaled): [0.0, 0.5, 1.0]
Income (scaled): [0.0, 0.67, 1.0]
Challenge: Try using StandardScaler and compare the results with MinMaxScaler.
15. Outlier Detection (Isolation Forest)
Boilerplate Code:
from sklearn.ensemble import IsolationForest
Use Case: Detect outliers in your dataset: those rare and unusual points that don't fit the pattern.
Goal: Identify and remove outliers so they don't skew your model.
Sample Code:
# Use Isolation Forest to detect outliers
iso_forest = IsolationForest()
iso_forest.fit(X_train)
# Predict outliers (-1 means it's an outlier, 1 means it's not)
outlier_predictions = iso_forest.predict(X_test)
Before Example: We have data with hidden outliers that could throw off the model's performance.
Data: [1, 2, 2, 3, 100]
After Example: The outlier is detected and can be dealt with!
Outlier detected: 100
Challenge: Try adjusting contamination (e.g., contamination=0.05) to change how sensitive the model is to outliers.
16. Dimensionality Reduction (PCA)
Boilerplate Code:
from sklearn.decomposition import PCA
Use Case: Reduce the number of features while keeping most of the important information. Perfect when you have too many features!
Goal: Compress your dataset into fewer dimensions without losing too much information.
Sample Code:
# Reduce to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# You've reduced the data to just 2 dimensions!
Before Example: We have a dataset with too many features, making analysis difficult.
Features: [Feature1, Feature2, ..., Feature100]
After Example: We use PCA to reduce the data to fewer dimensions.
Reduced Features: [PrincipalComponent1, PrincipalComponent2]
Challenge: Try using different numbers of components (e.g., n_components=3) and see how much variance is retained!
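To see how much variance is retained, you can read explained_variance_ratio_ after fitting; a rough sketch of the challenge:
pca3 = PCA(n_components=3)
X_pca3 = pca3.fit_transform(X)
# Fraction of the original variance captured by each component, and in total
print(pca3.explained_variance_ratio_)
print(pca3.explained_variance_ratio_.sum())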
17. Clustering (KMeans)
Boilerplate Code:
from sklearn.cluster import KMeans
Use Case: Automatically group your data into clusters based on similarity.
Goal: Discover hidden patterns by grouping similar data points together, with no labels needed!
Sample Code:
# Cluster the data into 3 groups
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
# Get the cluster labels for each data point
labels = kmeans.predict(X)
# Now you've grouped the data into clusters!
Before Example: We have unlabeled data and no way to classify it into meaningful groups.
Data: [Data Point 1, Data Point 2, Data Point 3, ...]
After Example: We group the data into clusters, revealing hidden patterns!
Cluster 1: [Data Point 1, Data Point 3]
Cluster 2: [Data Point 2]
Challenge: Try changing the number of clusters (e.g., n_clusters=5) and see how it affects the groupings.
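One hedged way to compare cluster counts is the "elbow" heuristic: fit KMeans for several values of n_clusters and watch the inertia (within-cluster sum of squares) drop; the range below is illustrative:
from sklearn.cluster import KMeans
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    print(k, km.inertia_)  # look for the "elbow" where improvement levels off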
18. Model Calibration (Calibrated Classifier)
In the context of machine learning, model calibration is the process of adjusting a model's predicted probabilities to make them more realistic.
For example, if a model predicts a 70% chance of rain but it only rains 40% of the time when the model says 70%, then the probabilities aren't well-calibrated (or well-adjusted). Calibration fixes this so that the probabilities reflect reality more closely.
Boilerplate Code:
from sklearn.calibration import CalibratedClassifierCV
Use Case: Adjust your model's probability predictions to make them more reliable and realistic.
Goal: Fine-tune predicted probabilities so they're accurate and trustworthy.
Sample Code:
# Calibrate a classifier (scikit-learn >= 1.2 takes `estimator=`; older versions used `base_estimator=`)
calibrated_model = CalibratedClassifierCV(estimator=SVC())
calibrated_model.fit(X_train, y_train)
# Predict probabilities for test data
calibrated_probs = calibrated_model.predict_proba(X_test)
# Now your probabilities are calibrated and more accurate!
Before Example: The model gives probabilities, but they're not very reliable.
Predicted Probabilities: [0.7, 0.4, 0.9]
After Example: With calibration, the probabilities are more realistic and trustworthy!
Calibrated Probabilities: [0.6, 0.5, 0.8]
Challenge: Try calibrating different classifiers (e.g., RandomForestClassifier) and compare the changes in probability predictions!
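A sketch of the challenge, assuming the same split as above; method='isotonic' is one of the two built-in calibration methods ('sigmoid' is the other):
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
calibrated_rf = CalibratedClassifierCV(RandomForestClassifier(), method='isotonic', cv=5)
calibrated_rf.fit(X_train, y_train)
print(calibrated_rf.predict_proba(X_test)[:5])  # compare with the uncalibrated forest's probabilities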
19. Nearest Centroid Classifier
Boilerplate Code:
from sklearn.neighbors import NearestCentroid
Use Case: Classify new data points based on the centroid of the closest class, a simple but effective classifier.
Goal: Assign a class to each point by calculating the centroid (center) of each class and finding which class's centroid is closest.
Sample Code:
# Use nearest centroid classifier
nc = NearestCentroid()
nc.fit(X_train, y_train)
# Predict labels for the test data
y_pred = nc.predict(X_test)
# Now your data is classified based on the nearest centroid!
Before Example: We're unsure how to classify new data and need a simple approach.
New Data: [Unknown] | Centroids: [Centroid1, Centroid2]
After Example: We classify based on which centroid is closest!
Assigned Class: [Class 1]
Challenge: Try running NearestCentroid on a dataset with more than two classes and see how it handles multiple centroids.
20. Voting Classifier (Combining Classifiers)
Boilerplate Code:
from sklearn.ensemble import VotingClassifier
Use Case: Combine multiple models to make a final prediction based on the majority vote. Perfect for improving accuracy!
Goal: Improve prediction performance by having multiple models vote on the outcome, and the majority wins!
Sample Code:
# Define multiple classifiers
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = SVC()
# Use VotingClassifier to combine them
voting_clf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)], voting='hard')
voting_clf.fit(X_train, y_train)
# Now your final predictions are based on the majority vote!
Before Example: We use one model, but we're unsure of its accuracy.
Model 1: "Yes"
After Example: Multiple models vote, and the majority rules!
Model 1: "Yes", Model 2: "No", Model 3: "Yes" → Final prediction: "Yes"
Challenge: Try using voting='soft' to combine the probabilities instead of the hard majority and compare the results.
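A sketch of the soft-voting variant. One gotcha worth flagging: SVC must be created with probability=True so it can supply class probabilities for averaging:
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
soft_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier()),
                ('svc', SVC(probability=True))],  # probability=True is required for soft voting
    voting='soft'  # average predicted probabilities instead of counting votes
)
soft_clf.fit(X_train, y_train)
print(soft_clf.score(X_test, y_test))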
Bonus point:
All of these methods (Cross-Validation, Voting Classifier, Boosting, Stacking, and Ensemble Methods) involve multiple models or model runs, but they do so in different ways. Here's a simplified breakdown of the differences:
1. Cross-Validation
Purpose: It's a validation technique, not a model combination method.
How it works: You split your dataset into multiple parts (folds) and train the model on different subsets to evaluate its performance on unseen data. This helps to ensure your model generalizes well.
Key Point: You're using one model but evaluating it in different ways on different data splits.
2. Voting Classifier
Purpose: Combines multiple models and makes a decision by "voting."
How it works: You train several models and let each one "vote" on the prediction. There are two types:
Hard Voting: Each model gives a class prediction, and the most common class wins.
Soft Voting: Each model gives a probability for each class, and the probabilities are averaged.
Key Point: It aggregates decisions from multiple models to improve accuracy by combining their strengths.
3. Boosting
Purpose: Sequentially trains models, with each new model trying to fix the errors of the previous one.
How it works: Models are trained one after another, and each new model focuses on correcting mistakes made by the previous models. Examples include AdaBoost and Gradient Boosting.
Key Point: It's a sequential process where each model builds on the mistakes of the previous ones, making the final model stronger.
4. Stacking
Purpose: Combines predictions from multiple models by using another model (called a meta-model) to make the final prediction.
How it works: Multiple base models are trained, and their predictions are used as inputs to a meta-model. The meta-model then makes the final prediction.
Key Point: It layers models, using the predictions of several models as inputs for another model to make a final decision.
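Unlike the other techniques above, stacking hasn't appeared in code yet, so here's a minimal sketch using scikit-learn's StackingClassifier (the choice of base models and meta-model is just illustrative):
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier()), ('svc', SVC())],  # base models
    final_estimator=LogisticRegression()  # meta-model trained on the base models' predictions
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))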
5. Ensemble Methods
Purpose: General term for using multiple models to make better predictions.
How it works: Any technique that combines multiple models (e.g., Voting, Boosting, Stacking, etc.) is called an ensemble method. The idea is to reduce error and increase accuracy by using the strengths of different models.
Key Point: An umbrella term that includes techniques like bagging (e.g., Random Forests), boosting, and stacking.
Summary of Differences:
Cross-Validation: Splits data to evaluate the performance of a single model.
Voting Classifier: Combines models by letting them vote on the final decision.
Boosting: Sequentially builds models, where each new model corrects the previous one's errors.
Stacking: Uses the predictions of multiple models as inputs to a new model.
Ensemble Methods: General term for techniques that combine multiple models to improve performance.
Each of these techniques has a different way of leveraging multiple models to improve accuracy or robustness!