The Effect of Preprocessing on Supervised Learning

Aaryan meena
2 min read

We will fit an SVC to the cancer dataset, first without scaling and then with different scaling algorithms. First, let's fit the SVC on the original data for comparison:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

svm = SVC(C=100)
svm.fit(X_train, y_train)
print("Test set accuracy: {:.2f}".format(svm.score(X_test, y_test)))
Test set accuracy: 0.63
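The poor accuracy is largely due to the very different ranges of the features, which distort the distance computations inside the RBF-kernel SVM. As a quick sanity check (a sketch, not part of the original code), we can compare the per-feature maxima:

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
# The per-feature maxima span several orders of magnitude: some features
# top out below 1 while others reach into the thousands.
per_feature_max = cancer.data.max(axis=0)
print("Smallest feature maximum: {:.3f}".format(per_feature_max.min()))
print("Largest feature maximum: {:.1f}".format(per_feature_max.max()))
```

With ranges this uneven, the largest features dominate the kernel, which is exactly what scaling fixes.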

Now, let’s scale the data using MinMaxScaler before fitting the SVC:

# preprocessing using 0-1 scaling
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)
# transform the training and test data with the same fitted scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# learning an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)
# scoring on the scaled test set
print("Scaled test set accuracy: {:.2f}".format(
svm.score(X_test_scaled, y_test)))
Scaled test set accuracy: 0.97

As we saw before, the effect of scaling the data is quite significant. Even though scaling the data doesn’t involve any complicated math, it is good practice to use the scaling mechanisms provided by scikit-learn instead of reimplementing them yourself, as it’s easy to make mistakes even in these simple computations.
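One way to avoid such mistakes entirely is to chain the scaler and the classifier in a Pipeline, so the scaler is always fit on the training data only and the same transformation is automatically applied at prediction time. This is a sketch of that approach, not part of the original walkthrough:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# The pipeline fits MinMaxScaler on X_train only, then applies the same
# fitted transformation inside score(), so leakage mistakes are impossible.
pipe = make_pipeline(MinMaxScaler(), SVC(C=100))
pipe.fit(X_train, y_train)
print("Pipeline test accuracy: {:.2f}".format(pipe.score(X_test, y_test)))
```

Pipelines also make it safe to use cross-validation or grid search, since the scaler is refit inside each training fold.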

# preprocessing using zero mean and unit variance scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# learning an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)
# scoring on the scaled test set
print("SVM test accuracy: {:.2f}".format(svm.score(X_test_scaled, y_test)))
SVM test accuracy: 0.96
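Both scalers give similar results on this dataset. To compare several of scikit-learn's scalers side by side, a small loop like the following works (a sketch; RobustScaler is an extra option beyond the two discussed above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

for scaler in [MinMaxScaler(), StandardScaler(), RobustScaler()]:
    X_tr = scaler.fit_transform(X_train)  # fit on training data only
    X_te = scaler.transform(X_test)       # reuse the fitted statistics
    svm = SVC(C=100).fit(X_tr, y_train)
    print("{}: {:.2f}".format(type(scaler).__name__, svm.score(X_te, y_test)))
```

RobustScaler uses the median and interquartile range instead of the mean and variance, which makes it less sensitive to outliers.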

Now that we’ve seen how simple data transformations for preprocessing work, let’s move on to more interesting transformations using unsupervised learning.

The article explores the impact of scaling on the performance of a Support Vector Classifier (SVC) using the cancer dataset. Initially, fitting the SVC without scaling results in a test accuracy of 0.63. When MinMaxScaler is applied, accuracy significantly improves to 0.97. Using StandardScaler, the accuracy is 0.96. The importance of using scikit-learn’s reliable scaling functions over manual computations is emphasized, paving the way for exploring more complex data transformations.
