A Deep Dive into Supervised Learning Models in Scikit-learn
Supervised learning is one of the most popular types of machine learning, in which a model learns from labeled data. In this post, we will explore some widely used supervised learning models in the Scikit-learn library, highlighting their advantages, key parameters, and usage.
1. Linear Regression
Linear Regression is one of the simplest algorithms used for regression tasks. It assumes a linear relationship between input features and output.
Advantages:
Simple and easy to interpret.
Efficient for small to medium-sized datasets.
Provides a baseline to compare more complex models.
Handles collinearity well with regularization (Ridge, Lasso); see the sketch after the example below.
from sklearn.linear_model import LinearRegression
reg = LinearRegression(fit_intercept=True) # Create a Linear Regression instance
reg.fit(X_train, y_train) # Train the model on the training set
score = reg.score(X_test, y_test) # Evaluate on the test set (returns the R² score for regressors)
The closer the score is to 1, the better the model's performance.
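The regularized variants mentioned above, Ridge and Lasso, share the same estimator API. A minimal sketch (the alpha values here are illustrative, not tuned):
from sklearn.linear_model import Ridge, Lasso
ridge = Ridge(alpha=1.0) # L2 penalty; shrinks correlated coefficients toward zero
ridge.fit(X_train, y_train)
lasso = Lasso(alpha=0.1) # L1 penalty; can zero out coefficients entirely
lasso.fit(X_train, y_train)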
2. Logistic Regression
Logistic Regression is commonly used for binary classification problems and can be extended to multiclass classification.
Advantages:
Effective for binary classification tasks.
Provides probabilities for predictions.
Can be extended to multiclass problems with multi_class='ovr' or multi_class='multinomial'.
Regularization helps prevent overfitting.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(multi_class='ovr', max_iter=3000) # Create Logistic Regression instance
clf.fit(X_train, y_train) # Train the model
pred = clf.predict(X_test.iloc[0].values.reshape(1, -1)) # Predict on one sample
proba = clf.predict_proba(X_test.iloc[0].values.reshape(1, -1)) # Get class probabilities
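Beyond single-sample predictions, a quick way to assess the classifier is its accuracy on the held-out set; a minimal sketch:
from sklearn.metrics import accuracy_score
y_pred = clf.predict(X_test) # Predict labels for the whole test set
print(accuracy_score(y_test, y_pred)) # Fraction of correctly classified samples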
3. Decision Tree Classifier
Decision trees are easy to interpret and can handle both numerical and categorical data. They are also capable of modeling non-linear relationships.
Advantages:
Easy to understand and interpret.
Requires little data preparation.
Can handle both numerical and categorical data.
Capable of modeling non-linear relationships.
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
clf = DecisionTreeClassifier(max_depth=5) # Create Decision Tree instance
clf.fit(X_train, y_train) # Train the model
plot_tree(clf) # Visualize the decision tree
plt.show() # Render the plot (requires matplotlib)
The max_depth parameter can be adjusted to control the complexity of the tree.
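One quick way to see the effect of max_depth is to compare training and test accuracy at a few depths; a large gap suggests overfitting. A minimal sketch (the depth values are illustrative):
for depth in [2, 5, 10]:
    clf = DecisionTreeClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    print(depth, clf.score(X_train, y_train), clf.score(X_test, y_test)) # Train vs. test accuracy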
4. Bagging Regressor
Bagging (Bootstrap Aggregating) reduces a model's variance by training multiple base estimators (decision trees by default) on bootstrap samples of the data and combining their predictions.
Advantages:
Reduces variance and helps prevent overfitting.
Improves model stability and accuracy.
Effective for high-variance models such as decision trees.
from sklearn.ensemble import BaggingRegressor
reg = BaggingRegressor(n_estimators=10) # Create a Bagging Regressor with 10 decision trees
reg.fit(X_train, y_train) # Train the model
predictions = reg.predict(X_test[0:10]) # Predict on multiple samples
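The base model can also be set explicitly. A minimal sketch using shallow trees (the estimator parameter requires scikit-learn >= 1.2; older versions call it base_estimator):
from sklearn.tree import DecisionTreeRegressor
reg = BaggingRegressor(estimator=DecisionTreeRegressor(max_depth=5), n_estimators=50) # 50 bootstrap-trained shallow trees
reg.fit(X_train, y_train)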
5. Random Forest Regressor
Random Forest is an ensemble learning method that combines multiple decision trees to make predictions. It's great for both regression and classification tasks.
Advantages:
Reduces variance and prevents overfitting.
Improves model stability and accuracy.
Handles missing data well (note: scikit-learn's forests only accept NaN inputs in recent releases; older versions require imputation first).
Provides feature importance for feature selection (see the sketch after the code below).
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=100, max_depth=10) # Create Random Forest Regressor
rfr.fit(X_train, y_train) # Train the model
score = rfr.score(X_test, y_test) # Evaluate the model's performance
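The feature importances mentioned above are exposed on the fitted model through the feature_importances_ attribute; a minimal sketch (assumes X_train is a pandas DataFrame with named columns):
importances = rfr.feature_importances_ # One value per input feature; they sum to 1
print(sorted(zip(importances, X_train.columns), reverse=True)) # Most influential features first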
6. Random Forest Classifier
The Random Forest Classifier is similar to the Random Forest Regressor but is used for classification tasks. It uses majority voting among multiple trees for class predictions.
Advantages:
Works well with categorical features.
Improves accuracy and stability.
Handles missing values well (with the same scikit-learn version caveat noted for the regressor above).
Provides feature importance.
Suitable for large datasets with many features.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10) # Create Random Forest Classifier
rfc.fit(X_train, y_train) # Train the model
score = rfc.score(X_test, y_test) # Evaluate model accuracy
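The majority vote described above also yields class probabilities, averaged over the trees in the forest; a minimal sketch:
proba = rfc.predict_proba(X_test[:5]) # Averaged per-tree class probabilities for five samples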
7. Support Vector Classifier (SVC)
The SVC algorithm attempts to find the optimal hyperplane that separates different classes in a high-dimensional space.
Advantages:
Effective for high-dimensional spaces.
Uses different kernel functions for separation (linear, radial basis function, polynomial).
from sklearn.svm import SVC
model = SVC(kernel='rbf') # Create SVC instance with radial basis function kernel
model.fit(X_train, y_train) # Train the model
y_pred = model.predict(X_test) # Predict on the test set
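SVMs are sensitive to feature scale, so inputs are usually standardized first. A minimal sketch using a Pipeline (the C value is illustrative):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0)) # Standardize features, then fit the SVC
model.fit(X_train, y_train)
y_pred = model.predict(X_test) # The pipeline applies the same scaling at predict time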
8. AdaBoost Classifier
AdaBoost is a boosting algorithm used to improve the performance of weak classifiers. It works well with decision trees and other models.
Advantages:
Improves the accuracy of weak learners (e.g., Decision Tree).
Performs better than individual models like decision trees.
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier() # Create AdaBoost Classifier
clf.fit(X_train, y_train) # Train the model
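By default, AdaBoostClassifier boosts decision stumps (depth-1 trees); the weak learner and the number of boosting rounds can be set explicitly. A minimal sketch (the estimator parameter requires scikit-learn >= 1.2; older versions use base_estimator):
from sklearn.tree import DecisionTreeClassifier
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), n_estimators=100) # 100 boosting rounds over stumps
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test) # Accuracy of the boosted ensemble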
9. K-Nearest Neighbors Regressor
The K-Nearest Neighbors (KNN) algorithm is used for both classification and regression. For regression, it finds the k closest training samples and predicts the average of their target values.
Advantages:
Simple and easy to implement.
Non-parametric and does not assume any underlying data distribution.
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5, metric='cosine') # Create KNN Regressor (cosine distance; scikit-learn falls back to brute-force search for this metric)
knn.fit(X_train, y_train) # Train the model
prediction = knn.predict(target_user_x) # Predict on new data (target_user_x: a 2-D array of query samples, assumed defined elsewhere)
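The choice of n_neighbors strongly affects the bias-variance trade-off; a minimal sketch of picking it by cross-validation (the candidate values are illustrative):
from sklearn.model_selection import GridSearchCV
search = GridSearchCV(KNeighborsRegressor(metric='cosine'), {'n_neighbors': [3, 5, 10, 20]}, cv=5) # 5-fold cross-validation
search.fit(X_train, y_train)
print(search.best_params_) # The k value with the best cross-validated score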
Conclusion
In this post, we covered some of the most commonly used machine learning models in Scikit-learn, including their advantages and basic usage. Whether you're working on regression, classification, or boosting tasks, Scikit-learn provides a variety of tools to help you build, train, and evaluate your models efficiently.
For more information, check out the Scikit-learn documentation.