Principal Component Analysis

PCA (Principal Component Analysis) is used for feature extraction, especially when feature selection isn't helpful: for example, when the features are not strongly correlated with the target, or when we can't remove any feature without losing important information.
Instead of selecting and removing features one by one, PCA creates new features (called principal components) that are linear combinations of the original ones. These components are ranked by how much variance they capture, i.e. how much the data spreads in a particular direction. The first few components usually capture most of the meaningful patterns in the data. By projecting the data onto these new axes, PCA lets us work with fewer dimensions, which makes models faster, reduces noise, and helps with visualization, all while keeping the core structure of the data intact.
HOW IT WORKS:
1. Makes new features (principal components). These are linear combinations of the original features, e.g. PC1 = 0.5×feature1 + 0.7×feature2 - 0.1×feature3 ...
2. Orders them by importance (how much variance they capture). PC1 captures the most variance, PC2 the next most, and so on.
3. Keeps only a subset (the top k components). You choose the top k components that explain, say, 95% of the total variance and drop the rest (see the sketch after this list).
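To make this concrete, here is a minimal sketch with scikit-learn (the dataset and variable names below are invented for illustration): fit PCA, inspect the component weights and variance ratios, then keep just enough components to reach 95% of the total variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                              # 200 samples, 5 features
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)    # make two features correlated

pca = PCA().fit(X)

# Each row of components_ is one linear combination of the original
# features, e.g. PC1 = w1*feature1 + w2*feature2 + ...
print(pca.components_[0])

# Components come already ordered by the share of variance they capture.
print(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 95%.
k = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95)) + 1
X_reduced = PCA(n_components=k).fit_transform(X)
print(k, X_reduced.shape)
```

scikit-learn can also do the 95% selection in one step with `PCA(n_components=0.95)`.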
But why variance?
PCA uses variance because it measures how much the data spreads out — more spread means more information or patterns.
The mean only tells where the data is centered, not how it varies.
PCA cares about differences, not just averages — so variance is the right tool.
So PCA chooses the direction with maximum variance, to capture as much information as possible in a single axis.
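A quick way to convince yourself (a toy sketch on synthetic 2-D data, invented for illustration): project the points onto many candidate directions and measure the variance of each projection. The direction with the largest spread is exactly the one PCA would pick as PC1.

```python
import numpy as np

rng = np.random.default_rng(0)
# A correlated 2-D cloud: most of its spread lies along roughly 45 degrees.
X = rng.multivariate_normal([0, 0], [[3, 2], [2, 2]], size=500)

best_angle, best_var = 0.0, -np.inf
for angle in np.linspace(0, np.pi, 180, endpoint=False):
    direction = np.array([np.cos(angle), np.sin(angle)])  # unit vector
    projections = X @ direction                           # 1-D shadow of every point
    if projections.var() > best_var:
        best_angle, best_var = angle, projections.var()

print(f"max-variance direction is near {np.degrees(best_angle):.0f} degrees, variance = {best_var:.2f}")
```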
Math Intuition
- Features may have different scales (e.g., height in cm, income in lakhs), and PCA is sensitive to scale. So, convert data to zero mean and unit variance (z-score):
$$x_{\text{scaled}} = \frac{x - \text{mean}(x)}{\text{std}(x)}$$
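In code this is one call to scikit-learn's StandardScaler, which implements exactly the z-score above (the tiny height/income matrix is made up to show the effect):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0,  5.0],   # height in cm, income in lakhs
              [160.0, 12.0],
              [180.0,  8.0]])

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))   # ~0 for every column
print(X_scaled.std(axis=0))    # ~1 for every column
```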
✅ Step 1: PCA wants to reduce dimensions
👉 The goal is to keep the data simple (fewer features), but still retain as much information as possible.
✅ Step 2: It can’t use the target variable
👉 PCA is unsupervised, so it doesn’t know the labels. It can’t rely on accuracy or loss, so it looks inside the data only.
✅ Step 3: Information is hidden in variation
👉 If a feature or direction has more variance, it means values are more spread out, not constant. This spread means there's more information or pattern to capture.
✅ Step 4: Low variance = noise or redundancy
👉 A direction where data barely changes (low variance) doesn’t add much knowledge. It's often just noise or repetition.
✅ Step 5: PCA finds the direction with maximum variance
👉 PCA looks at all possible directions through the data. It picks the one where the data spreads the most; that’s the first principal component.
✅ Step 6: It projects data onto that direction
👉 Once that best direction is found, each data point is projected onto it, turning many features into one.
✅ Step 7: Repeat for next best directions
👉 Then it finds the second-best direction (second most variance), which must be orthogonal (90°) to the first.
And so on…
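All seven steps fit in a few lines of NumPy. A from-scratch sketch (synthetic data, names invented for illustration): center and scale, take the eigenvectors of the covariance matrix as the orthogonal max-variance directions, sort them by eigenvalue (the variance each one captures), and project onto the top k.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # a largely redundant feature

# Step 0: zero mean, unit variance.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Steps 5 and 7: eigenvectors of the covariance matrix are the
# orthogonal max-variance directions; eigenvalues are their variances.
cov = np.cov(Xs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)           # returned in ascending order
order = np.argsort(eigvals)[::-1]                # re-sort: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 6: project the data onto the top-k directions.
k = 2
X_reduced = Xs @ eigvecs[:, :k]

print(eigvals / eigvals.sum())                   # share of variance per component
print(X_reduced.shape)                           # (100, 2)
print(np.dot(eigvecs[:, 0], eigvecs[:, 1]))      # ~0: the directions are orthogonal
```

In practice, libraries such as scikit-learn compute this via the SVD of the centered data matrix rather than forming the covariance matrix explicitly, which is numerically more stable.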