A Deep Dive into Supervised Learning Models in Scikit-learn
Supervised learning is one of the most popular types of machine learning, in which a model learns from labeled data. In this post, we will explore some widely used supervised learning models in the Scikit-learn library, highlighting their advantages, key parameters, and usage.
1. Linear Regression
Linear Regression is one of the simplest algorithms used for regression tasks. It assumes a linear relationship between input features and output.
Advantages:
Simple and easy to interpret.
Efficient for small to medium-sized datasets.
Provides a baseline to compare more complex models.
Handles collinearity well with regularization (Ridge, Lasso); see the sketch after the example below.
from sklearn.linear_model import LinearRegression
reg = LinearRegression(fit_intercept=True) # Create a Linear Regression instance
reg.fit(X_train, y_train) # Train the model on the training set
score = reg.score(X_test, y_test) # Evaluate on the test set (returns the R² score for regressors)
The closer the score is to 1, the better the model's performance.
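The regularized variants mentioned above, Ridge and Lasso, share the same estimator API. A minimal sketch (the alpha values here are illustrative, not tuned):
from sklearn.linear_model import Ridge, Lasso
ridge = Ridge(alpha=1.0) # L2 penalty; shrinks correlated coefficients toward zero
ridge.fit(X_train, y_train)
lasso = Lasso(alpha=0.1) # L1 penalty; can zero out coefficients entirely
lasso.fit(X_train, y_train)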
2. Logistic Regression
Logistic Regression is commonly used for binary classification problems and can be extended to multiclass classification.
Advantages:
Effective for binary classification tasks.
Provides probabilities for predictions.
Can be extended to multiclass problems with multi_class='ovr' or multi_class='multinomial'.
Regularization helps prevent overfitting.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(multi_class='ovr', max_iter=3000) # Create Logistic Regression instance
clf.fit(X_train, y_train) # Train the model
pred = clf.predict(X_test.iloc[0].values.reshape(1, -1)) # Predict on one sample
proba = clf.predict_proba(X_test.iloc[0].values.reshape(1, -1)) # Get class probabilities
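Beyond single-sample predictions, a quick way to assess the classifier is its accuracy on the held-out set; a minimal sketch:
from sklearn.metrics import accuracy_score
y_pred = clf.predict(X_test) # Predict labels for the whole test set
print(accuracy_score(y_test, y_pred)) # Fraction of correctly classified samples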
3. Decision Tree Classifier
Decision trees are easy to interpret and can handle both numerical and categorical data. They are also capable of modeling non-linear relationships.
Advantages:
Easy to understand and interpret.
Requires little data preparation.
Can handle both numerical and categorical data.
Capable of modeling non-linear relationships.
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
clf = DecisionTreeClassifier(max_depth=5) # Create Decision Tree instance
clf.fit(X_train, y_train) # Train the model
plot_tree(clf) # Visualize the decision tree
plt.show() # Render the plot (requires matplotlib)
The max_depth parameter can be adjusted to control the complexity of the tree.
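One quick way to see the effect of max_depth is to compare training and test accuracy at a few depths; a large gap suggests overfitting. A minimal sketch (the depth values are illustrative):
for depth in [2, 5, 10]:
    clf = DecisionTreeClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    print(depth, clf.score(X_train, y_train), clf.score(X_test, y_test)) # Train vs. test accuracy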
4. Bagging Regressor
Bagging (Bootstrap Aggregating) reduces a model's variance by training multiple base estimators (decision trees by default) on bootstrap samples of the data and combining their predictions.
Advantages:
Reduces variance and helps prevent overfitting.
Improves model stability and accuracy.
Effective for high-variance models such as decision trees.
from sklearn.ensemble import BaggingRegressor
reg = BaggingRegressor(n_estimators=10) # Create a Bagging Regressor with 10 decision trees
reg.fit(X_train, y_train) # Train the model
predictions = reg.predict(X_test[0:10]) # Predict on multiple samples
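The base model can also be set explicitly. A minimal sketch using shallow trees (the estimator parameter requires scikit-learn >= 1.2; older versions call it base_estimator):
from sklearn.tree import DecisionTreeRegressor
reg = BaggingRegressor(estimator=DecisionTreeRegressor(max_depth=5), n_estimators=50) # 50 bootstrap-trained shallow trees
reg.fit(X_train, y_train)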
5. Random Forest Regressor
Random Forest is an ensemble learning method that combines multiple decision trees to make predictions. It's great for both regression and classification tasks.
Advantages:
Reduces variance and prevents overfitting.
Improves model stability and accuracy.
Handles missing data well (note: scikit-learn's forests only accept NaN inputs in recent releases; older versions require imputation first).
Provides feature importance for feature selection (see the sketch after the code below).
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=100, max_depth=10) # Create Random Forest Regressor
rfr.fit(X_train, y_train) # Train the model
score = rfr.score(X_test, y_test) # Evaluate the model's performance
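The feature importances mentioned above are exposed on the fitted model through the feature_importances_ attribute; a minimal sketch (assumes X_train is a pandas DataFrame with named columns):
importances = rfr.feature_importances_ # One value per input feature; they sum to 1
print(sorted(zip(importances, X_train.columns), reverse=True)) # Most influential features first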
6. Random Forest Classifier
The Random Forest Classifier is similar to the Random Forest Regressor but is used for classification tasks. It uses majority voting among multiple trees for class predictions.
Advantages:
Works well with categorical features.
Improves accuracy and stability.
Handles missing values well (with the same scikit-learn version caveat noted for the regressor above).
Provides feature importance.
Suitable for large datasets with many features.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10) # Create Random Forest Classifier
rfc.fit(X_train, y_train) # Train the model
score = rfc.score(X_test, y_test) # Evaluate model accuracy
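The majority vote described above also yields class probabilities, averaged over the trees in the forest; a minimal sketch:
proba = rfc.predict_proba(X_test[:5]) # Averaged per-tree class probabilities for five samples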
7. Support Vector Classifier (SVC)
The SVC algorithm attempts to find the optimal hyperplane that separates different classes in a high-dimensional space.
Advantages:
Effective for high-dimensional spaces.
Uses different kernel functions for separation (linear, radial basis function, polynomial).
from sklearn.svm import SVC
model = SVC(kernel='rbf') # Create SVC instance with radial basis function kernel
model.fit(X_train, y_train) # Train the model
y_pred = model.predict(X_test) # Predict on the test set
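SVMs are sensitive to feature scale, so inputs are usually standardized first. A minimal sketch using a Pipeline (the C value is illustrative):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0)) # Standardize features, then fit the SVC
model.fit(X_train, y_train)
y_pred = model.predict(X_test) # The pipeline applies the same scaling at predict time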
8. AdaBoost Classifier
AdaBoost is a boosting algorithm used to improve the performance of weak classifiers. It works well with decision trees and other models.
Advantages:
Improves the accuracy of weak learners (e.g., Decision Tree).
Performs better than individual models like decision trees.
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier() # Create AdaBoost Classifier
clf.fit(X_train, y_train) # Train the model
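By default, AdaBoostClassifier boosts decision stumps (depth-1 trees); the weak learner and the number of boosting rounds can be set explicitly. A minimal sketch (the estimator parameter requires scikit-learn >= 1.2; older versions use base_estimator):
from sklearn.tree import DecisionTreeClassifier
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), n_estimators=100) # 100 boosting rounds over stumps
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test) # Accuracy of the boosted ensemble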
9. K-Nearest Neighbors Regressor
The K-Nearest Neighbors (KNN) algorithm is used for both classification and regression. For regression, it finds the k closest training samples and predicts the average of their target values.
Advantages:
Simple and easy to implement.
Non-parametric and does not assume any underlying data distribution.
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5, metric='cosine') # Create KNN Regressor (cosine distance; scikit-learn falls back to brute-force search for this metric)
knn.fit(X_train, y_train) # Train the model
prediction = knn.predict(target_user_x) # Predict on new data (target_user_x: a 2-D array of query samples, assumed defined elsewhere)
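The choice of n_neighbors strongly affects the bias-variance trade-off; a minimal sketch of picking it by cross-validation (the candidate values are illustrative):
from sklearn.model_selection import GridSearchCV
search = GridSearchCV(KNeighborsRegressor(metric='cosine'), {'n_neighbors': [3, 5, 10, 20]}, cv=5) # 5-fold cross-validation
search.fit(X_train, y_train)
print(search.best_params_) # The k value with the best cross-validated score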
Conclusion
In this post, we covered some of the most commonly used machine learning models in Scikit-learn, including their advantages and basic usage. Whether you're working on regression, classification, or boosting tasks, Scikit-learn provides a variety of tools to help you build, train, and evaluate your models efficiently.
For more information, check out the Scikit-learn documentation.