Machine Learning Chapter 9: Dimensionality Reduction


Welcome to Chapter 9 - Dimensionality Reduction!
In Chapter 3 - Classification, we worked with datasets that had only two independent variables, both to make the Machine Learning models easier to visualize and because we can often reduce the number of variables to two using Dimensionality Reduction.
There are two types of Dimensionality Reduction:
Feature Selection
Feature Extraction
[Feature Selection includes techniques like Backward Elimination and Forward Selection, covered in chapter 2 - Regression]
In this part, we'll explore Feature Extraction techniques:
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
Kernel PCA
Principal Component Analysis (PCA)
Here's a brief overview of PCA:
It is one of the most popular dimensionality reduction algorithms.
It is one of the most widely used unsupervised learning techniques.
It is used for visualization, feature extraction, noise filtering, stock market prediction, and gene analysis.
The goal of PCA is to detect correlations between variables; if a strong correlation is found, you can reduce the dimensionality.
Again, with PCA, the goal is to reduce a D-dimensional dataset by projecting the data onto a K-dimensional subspace, where K < D.
These are the main ideas behind the PCA algorithm; a minimal code sketch follows below.
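To make the D-to-K projection idea concrete, here's a minimal, self-contained sketch using scikit-learn; the data here is random and purely illustrative:
```python
# Minimal PCA sketch: project a D-dimensional dataset onto a K-dimensional subspace.
# The data is randomly generated and only serves to illustrate the idea.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 13))    # 100 samples, D = 13 features

pca = PCA(n_components=2)         # K = 2
X_reduced = pca.fit_transform(X)  # shape: (100, 2)

# How much of the original variance each principal component retains
print(pca.explained_variance_ratio_)
```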
Now, click here to open the interactive demo.
On the left you have a 2D dataset and on the right its 1D projection (you can move the points on the left, and the projection on the right updates accordingly).
Here, a 3D dataset has been reduced to a 2D one.
EXTRA: I'm also providing two YouTube videos; PCA is much easier to grasp when you can visualize it.
PCA in Python
Get your code and dataset from here.
About the dataset:
Each row represents a wine, and for each wine, we have different details—various features and characteristics. These include alcohol level, malic acid, ash alkalinity, magnesium, total phenols, and flavonoids. As you can see, there are many features for each wine. Now, I'll explain the dependent variable for these wines. We have the customer segment, which is the last column showing which segment the wines belong to.
Let me explain what happens in business terms. This dataset is from the UCI ML repository, so all credit goes to this great platform for datasets. However, I changed the last column, customer segment, to make it more relevant for business. This turns the case study into a business-focused one.
Imagine this dataset is for a wine merchant who has many different wines to sell and a large group of customers. The wine shop owner hired you as a data scientist to do some initial clustering work. At first, we had all these features without the last column, customer segment. We have features from alcohol to proline, and the wine shop owner asked you to perform clustering to find different customer segments based on their wine preferences. There are three customer segments, each representing a group of customers with similar wine tastes.
That was the first task (you can try it yourself if you want, but here we'll focus on reducing the number of features). The wine shop owner liked how you identified the three segments. Now, the owner wants to make the dataset simpler by using fewer features and also create a predictive model. This model will be trained on the current data, including the features and the customer segment, so that for each new wine in the shop, we can use this model on the simpler dataset to predict which customer segment the new wine fits into.
Once we know which customer segment a wine belongs to, we can suggest it to the right customers. This works like a recommendation system. For each new wine in the shop, our model will find the best customer segment, making sure it is a good match.
Implementation and visualization:
We don't need to re-implement everything; that would waste time. Instead, we'll focus on reducing dimensions. Here's the plan: I'll show you the implementation, but the only part we'll write from scratch is Applying PCA, and I'll show you the dataset before and after the transformation.
Copy-paste the code from Importing libraries through Feature scaling into the Spyder IDE (you get Spyder with Anaconda).
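For reference, here's a rough sketch of what that preprocessing block does. The file name Wine.csv, the 80/20 split, and random_state=0 are assumptions on my part and may differ from the original code:
```python
# Sketch of the preprocessing steps (importing libraries -> feature scaling).
# Assumes the dataset is stored as 'Wine.csv' with the customer segment in the last column.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dataset = pd.read_csv('Wine.csv')
X = dataset.iloc[:, :-1].values   # the wine features
y = dataset.iloc[:, -1].values    # customer segment (1, 2 or 3)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training set only
X_test = sc.transform(X_test)        # reuse the same scaling on the test set
```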
Now, if we run this code in Spyder, X_train
looks like this (with lots of feature columns), and X_test
looks like this:
Our goal is to reduce the number of columns/features. Let's start with PCA.
Applying PCA:
[I chose the logistic regression model from our classification toolkit, but any model will work; you'll see it gives great results, though you can pick another if you prefer. We'll use the PCA class from the scikit-learn API.]
Using the pca object, we fit_transform the training set; on the test set we only transform, without fitting, because that data is reserved for evaluating the model later.
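Here's a minimal sketch of the Applying PCA step, assuming the scaled X_train and X_test from the preprocessing block above:
```python
# Applying PCA: keep two principal components.
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)  # fit on the training set and transform it
X_test = pca.transform(X_test)        # transform only -- no fitting on the test set

print(X_train.shape)  # only two columns left
```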
Now, for the fun part: if we run the cell in Spyder, X_train
looks like this:
We have only 2 features now thanks to PCA; the number of features, and with it the number of dimensions, has been reduced. The same goes for X_test.
So that's the before-vs-after view of the dataset. Now you can run all of your code.
We achieved an impressive accuracy of 97%.
The wine shop owner had good intuition in suggesting dimensionality reduction. It not only simplifies your dataset but can also improve the final results when combined with a predictive model. That's exactly what happened here: we reached an outstanding accuracy of 97%, with only one incorrect prediction.
By the way, this is interesting. This is the first time you're seeing a confusion matrix with three rows and three columns. That's because we have three classes, right? We have three customer segments: one, two, and three. So, we have three classes to predict.
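For completeness, here's a hedged sketch of the training and evaluation step; I'm assuming a plain LogisticRegression with default settings, which may not match the original code exactly:
```python
# Train the classifier on the two principal components and evaluate it.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))  # 3x3 matrix: one row/column per customer segment
print(accuracy_score(y_test, y_pred))    # about 0.97 in the run described above
```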
Now, we should see amazing results on the graphs. PC one is on the X-axis, and PC two is on the Y-axis. You'll notice the different prediction regions. The green dots, blue dots, and red dots represent the real observations.
Here is the test set.
Even with new data, our logistic regression model combined with dimensionality reduction did a great job of separating the three classes.
We can clearly see the one mistake in the confusion matrix. It shows a green wine, which actually belongs to customer segment two, but the model predicted it as segment one. But that's okay. Any business owner or data scientist would be thrilled with just one mistake.
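If you'd like to reproduce the plot, here's a rough sketch of the decision-region visualization for the test set; the colors, grid step, and labels are my own choices, not necessarily those of the original code:
```python
# Decision regions of the classifier in the (PC1, PC2) plane, with the test points on top.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

colors = ('red', 'green', 'blue')
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(
    np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
    np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01))
plt.contourf(X1, X2,
             classifier.predict(np.c_[X1.ravel(), X2.ravel()]).reshape(X1.shape),
             alpha=0.4, cmap=ListedColormap(colors))
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=colors[i], label=j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
```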
Let's see if we can improve this with other dimensionality reduction techniques like linear discriminant analysis. To do better, we need to reach 100% accuracy. We'll check if the features from LDA can create a boundary that separates all three classes well.
This will be challenging, but it's possible. Then, we'll see if we can do the same with kernel PCA, our final dimensionality reduction technique.
Kernel PCA
Kernel PCA is another form of dimensionality reduction. Again, we'll only work through the Applying Kernel PCA step, since we're already familiar with the rest of the code, right?
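Here's a minimal sketch of the Applying Kernel PCA step, assuming the same scaled X_train and X_test as before; the RBF kernel is a common choice here, though the original code may use different parameters:
```python
# Applying Kernel PCA: the only lines that really change compared to plain PCA.
from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=2, kernel='rbf')
X_train = kpca.fit_transform(X_train)  # fit on the training set and transform it
X_test = kpca.transform(X_test)        # transform only on the test set
```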
The implementation is over. That's all there was to change.
As expected, we got 100% accuracy! I've rarely seen the non-kernel version of a model beat the kernel version; it can happen, but it's very rare.
Here is the plot for the training data
and for the test data.
Can you see the difference? The points are closer to one another, and that's what gave us the high accuracy.
Get your code and dataset from here.
Linear Discriminant Analysis (LDA)
LDA is a commonly used dimensionality reduction technique. It's used as a preprocessing step for pattern classification and ML algorithms. The goal of LDA is to project a feature space onto a smaller subspace while keeping the class-discriminatory information intact.
Both PCA and LDA are linear transformation techniques used for dimensionality reduction. So where does LDA differ from PCA? LDA is a supervised technique, because it uses the dependent variable, while PCA is unsupervised.
In PCA, we look for the component axes that maximize the variance of the data, ignoring the class labels.
In LDA, we aim to maximize the separation between the classes, and I think this visualization clearly shows the difference between the two.
In short, PCA looks at the dataset as a whole, while LDA tries to highlight the differences between the classes in the data.
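To make the supervised-vs-unsupervised distinction concrete, here's a tiny comparison of the two fit calls, assuming the scaled (pre-reduction) X_train and y_train from earlier:
```python
# The practical difference in code: PCA is fit on the features alone,
# while LDA also needs the class labels.
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

PCA(n_components=2).fit(X_train)           # unsupervised: features only
LDA(n_components=2).fit(X_train, y_train)  # supervised: features + dependent variable
```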
Additional: When would you use PCA rather than LDA in classification?
Steps:
Applying LDA
We need to apply the fit_transform method to the training set and then only the transform method to the test set, for the same reason as before: to avoid information leakage from the test set.
With PCA, the fit_transform method took only X_train as input, because PCA needs nothing but the features to apply the dimensionality reduction. LDA is different: it needs not only the features but also the dependent variable, because the class labels enter the LDA equations.
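Here's a minimal sketch of the Applying LDA step, again assuming the scaled X_train, y_train, and X_test from the preprocessing block:
```python
# Applying LDA: note that fit_transform takes both the features and the labels.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA(n_components=2)
X_train = lda.fit_transform(X_train, y_train)  # the dependent variable is required here
X_test = lda.transform(X_test)                 # test set: transform only
```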
We achieved 100% accuracy! In other words, our logistic regression model was able to perfectly classify all three classes by separating them completely.
We had one misclassified point with PCA, but with LDA we don't have that. Awesome!
So we reduced the dimensions and are even more accurate than with PCA.
Find your code and dataset from here.