Standardization (Z-Scores): Making Data "Apples to Apples" 🍎
Table of contents
- Explanation
- Sample Output
- Interpretation
- Why are test scores between two subjects not apples to apples?
- 1. Different Scales and Units
- 2. Different Difficulty Levels
- 3. Different Distributions
- 4. Different Learning Objectives
- 5. Different Student Populations
- Why Standardization Helps
- Example
- 10 scenarios: what might go wrong if we don't use Z-scores?
We'll use a school grading system across different subjects as our analogy.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Example data: test scores in different subjects
data = {
'math_score': [65, 70, 80, 90, 95],
'history_score': [75, 80, 85, 90, 95]
}
df = pd.DataFrame(data)
# Initialize the score standardizer
score_standardizer = StandardScaler()
# Fit and transform the scores
standardized_scores = score_standardizer.fit_transform(df)
# Convert standardized scores back to DataFrame
df_standardized = pd.DataFrame(standardized_scores, columns=['math_z_score', 'history_z_score'])
print("Original Scores:")
print(df)
print("\nStandardized Scores (Z-scores):")
print(df_standardized)
Explanation
In this school grading analogy:
We have test scores from two different subjects: math and history.
The StandardScaler acts like a "score standardizer" that converts the raw scores into standardized scores (Z-scores): each score x becomes z = (x − mean) / standard deviation. After standardization, each score is expressed in terms of how many standard deviations it is from the mean score for that subject.
Sample Output
Original Scores:
math_score history_score
0 65 75
1 70 80
2 80 85
3 90 90
4 95 95
Standardized Scores (Z-scores):
math_z_score history_z_score
0 -1.315587 -1.414214
1 -0.877058 -0.707107
2 0.000000 0.000000
3 0.877058 0.707107
4 1.315587 1.414214
Interpretation
In the original data, math and history have different score distributions.
After standardization:
The mean score for each subject becomes 0, and the standard deviation becomes 1.
Positive Z-scores indicate scores above the mean.
Negative Z-scores indicate scores below the mean.
The magnitude indicates how many standard deviations away from the mean the score is.
This standardization allows you to compare performance across different subjects on the same scale, even if the original scoring systems were different. For example:
A student with a math Z-score of 0.877058 and a history Z-score of 0.707107 performed above average in both subjects, but relatively better in math.
The lowest scores in the two subjects (65 in math, 75 in history) have similar Z-scores (about -1.32 and -1.41), indicating they are similarly far below average in their respective subjects.
Why are test scores between two subjects not apples to apples?
Test scores between two subjects are often not directly comparable, for several reasons:
1. Different Scales and Units
Different Maximum Scores: Math tests might have a maximum score of 100, whereas history tests might have a maximum score of 50.
Different Grading Systems: One subject might use a percentage-based grading system, while another might use letter grades or a different numerical scale.
2. Different Difficulty Levels
Subject Complexity: The inherent difficulty of the subjects might differ. For example, a score of 80 in math might represent a higher level of achievement than a score of 80 in history, depending on the curriculum and complexity of the material.
Test Design: The design and difficulty of the tests themselves might vary, making direct comparisons unfair.
3. Different Distributions
Score Distribution: The distribution of scores can be different for each subject. For example, math scores might be normally distributed, while history scores might be skewed.
Variability in Scores: The variability (standard deviation) of scores can differ, indicating that one subject might have more spread out scores than another.
4. Different Learning Objectives
Skills Assessed: Different subjects assess different skills and knowledge areas. Math might test problem-solving and analytical skills, while history might test memorization and understanding of historical events.
Educational Goals: The goals of education in each subject might differ, making it inappropriate to compare scores directly.
5. Different Student Populations
Student Strengths: Students might have different strengths and weaknesses across subjects. A student might excel in math but struggle in history, or vice versa.
Classroom Environment: The teaching methods and classroom environments might differ, affecting student performance in each subject.
Why Standardization Helps
Standardization (converting scores to Z-scores) addresses these issues in the following ways; a short code sketch follows the list:
Removing Units: Standardization removes the units of measurement, putting all scores on a common scale.
Centering and Scaling: It centers the scores around the mean and scales them by the standard deviation, allowing for direct comparison of relative performance.
Handling Different Distributions: It accounts for different score distributions, making comparisons fairer.
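To make the computation concrete, here is a minimal sketch that applies the Z-score formula by hand with NumPy to the math scores from the example above (StandardScaler uses the population standard deviation, i.e., ddof=0):
import numpy as np
# Math scores from the earlier example
scores = np.array([65, 70, 80, 90, 95], dtype=float)
# z = (x - mean) / std, matching what StandardScaler computes
z = (scores - scores.mean()) / scores.std(ddof=0)
print(z)  # approximately [-1.32, -0.88, 0.00, 0.88, 1.32]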
Example
Consider two students with the following scores:
Student A: 80 in math, 85 in history
Student B: 90 in math, 75 in history
Without standardization, it's hard to say who performed better overall because the scores are not directly comparable. After standardization, you might find:
Student A: Z-score of 0.5 in math, 0.7 in history
Student B: Z-score of 1.0 in math, -0.3 in history
Now it's easier to see that Student B performed well above average in math but below average in history, while Student A performed above average in both subjects.
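As a sketch of how such Z-scores could arise, assume some class-wide means and standard deviations; the statistics below are invented purely to reproduce the hypothetical Z-scores above:
# Hypothetical class-wide statistics (invented for illustration)
math_mean, math_std = 70.0, 20.0
history_mean, history_std = 78.0, 10.0
students = {'A': {'math': 80, 'history': 85},
            'B': {'math': 90, 'history': 75}}
for name, s in students.items():
    z_math = (s['math'] - math_mean) / math_std
    z_history = (s['history'] - history_mean) / history_std
    print(f"Student {name}: math z = {z_math:.1f}, history z = {z_history:.1f}")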
10 scenarios: what might go wrong if we don't use Z-scores?
Here are examples of what can go wrong if we skip Z-scores (or standardization), one for each of ten common scenarios:
Comparability across features:
Drama: A company uses height (in cm) and weight (in kg) to predict health risks. Without standardization, weight dominates the model, leading to incorrect risk assessments for tall, lean individuals.
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Example data: height (in cm) and weight (in kg)
data = {'height_cm': [160, 170, 180, 190], 'weight_kg': [55, 70, 85, 100]}
df = pd.DataFrame(data)
# Initialize the scaler
scaler = StandardScaler()
# Fit and transform the data
scaled_data = scaler.fit_transform(df)
# Convert scaled data back to DataFrame
df_scaled = pd.DataFrame(scaled_data, columns=['height_cm', 'weight_kg'])
print(df_scaled)
Improved model performance:
Drama: A neural network struggles to learn from a dataset with salary (in thousands) and age (in years). The model focuses mainly on salary due to its larger scale, completely missing important age-related patterns.
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Example data: salary (in thousands) and age (in years)
data = {'salary_k': [50, 60, 70, 80], 'age_years': [25, 35, 45, 55]}
df = pd.DataFrame(data)
# Initialize the scaler
scaler = StandardScaler()
# Fit and transform the data
scaled_data = scaler.fit_transform(df)
# Convert scaled data back to DataFrame
df_scaled = pd.DataFrame(scaled_data, columns=['salary_k', 'age_years'])
print(df_scaled)
Reducing bias:
Drama: In a credit scoring model, income (in thousands) overshadows other important factors like credit history, unfairly disadvantaging lower-income applicants with excellent credit.
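A minimal sketch under assumed data (the income and credit-history values below are illustrative, not a real credit model):
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Illustrative data: income (in thousands) and years of clean credit history
data = {'income_k': [30, 45, 90, 150], 'credit_history_years': [12, 3, 1, 8]}
df = pd.DataFrame(data)
# After standardization, both features contribute on a comparable scale,
# so income no longer dominates purely by magnitude
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_scaled)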
Normalization of distributions:
Drama: A machine learning model for predicting house prices performs poorly because it assumes roughly normal inputs; the heavily skewed raw prices lead to inaccurate predictions. (Note that standardization centers and rescales values but does not remove skew by itself; for strong skew it is often paired with a transform such as a log.)
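A minimal sketch of that pairing, with made-up right-skewed prices:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Illustrative right-skewed house prices (in thousands)
df = pd.DataFrame({'price_k': [120.0, 135.0, 150.0, 170.0, 210.0, 900.0]})
# Log-transform to reduce skew, then standardize to zero mean / unit variance
log_prices = np.log(df[['price_k']])
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(log_prices), columns=['log_price_z'])
print(df_scaled)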
Faster convergence:
Drama: A gradient descent algorithm takes weeks to converge on a large dataset with widely varying feature scales, causing significant delays in product development and missed deadlines.
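A toy sketch of the effect (the data, learning rates, and step counts are illustrative assumptions): the raw design forces a tiny learning rate for stability, so after the same number of steps it is still far from the optimum, while the standardized design tolerates a much larger rate.
import numpy as np
def gd_mse(X, y, lr, steps=1000):
    # Batch gradient descent for least squares; returns the final MSE
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return np.mean((X @ w - y) ** 2)
rng = np.random.default_rng(0)
# Feature 1 lives on [0, 1]; feature 2 lives on [0, 1000]
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])
y = 2.0 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(0, 0.1, 200)
# Standardize each column, then add an intercept column to both designs
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
raw = np.column_stack([np.ones(len(X)), X])
std = np.column_stack([np.ones(len(X)), X_std])
# The standardized run gets close to the noise floor; the raw run does not
print("raw MSE:         ", gd_mse(raw, y, lr=1e-7))
print("standardized MSE:", gd_mse(std, y, lr=0.1))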
Improved interpretability:
Drama: Analysts misinterpret the importance of different factors in a linear regression model for customer churn, because raw coefficients reflect feature units rather than influence, leading to misguided retention strategies and increased customer loss.
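A sketch with invented churn-style data: on raw features the coefficient magnitudes mislead, while coefficients on standardized features are directly comparable.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
rng = np.random.default_rng(1)
# Illustrative data: monthly spend (dollars) and number of support calls
X = pd.DataFrame({'monthly_spend': rng.uniform(10, 500, 200),
                  'support_calls': rng.integers(0, 10, 200)})
# Assumed relationship: spend matters more per standard deviation,
# but its per-unit coefficient is tiny
y = 0.01 * X['monthly_spend'] + 0.1 * X['support_calls'] + rng.normal(0, 0.2, 200)
# Raw coefficients reflect units: support_calls looks ~10x more important
print(LinearRegression().fit(X, y).coef_)
# On standardized features, coefficient magnitudes are comparable,
# and monthly_spend turns out to have the larger standardized effect
X_std = StandardScaler().fit_transform(X)
print(LinearRegression().fit(X_std, y).coef_)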
Handling outliers:
Drama: A few extremely high-value transactions in a fraud detection system cause the model to ignore smaller, more common fraudulent activities, resulting in significant financial losses.
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Example data: transaction amounts (in dollars) with an outlier
data = {'transaction_amount': [10, 20, 30, 1000]}
df = pd.DataFrame(data)
# Initialize the scaler
# Note: StandardScaler itself is sensitive to outliers (they inflate the
# mean and std); for heavy outliers, sklearn's RobustScaler, which uses
# the median and IQR, is often a better choice
scaler = StandardScaler()
# Fit and transform the data
scaled_data = scaler.fit_transform(df)
# Convert scaled data back to DataFrame
df_scaled = pd.DataFrame(scaled_data, columns=['transaction_amount'])
print(df_scaled)
Consistency across datasets:
Drama: Researchers combining medical data from multiple hospitals reach incorrect conclusions about treatment effectiveness due to inconsistent scaling across datasets.
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Example data: two datasets with different scales
data1 = {'feature1': [1, 2, 3, 4], 'feature2': [10, 20, 30, 40]}
data2 = {'feature1': [5, 6, 7, 8], 'feature2': [50, 60, 70, 80]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Fit the scaler on one dataset and reuse it for the other, so both are
# scaled consistently; fitting separately on each would hide the real
# difference in scales between them (fitting on the pooled data also works)
scaler = StandardScaler()
scaled_data1 = scaler.fit_transform(df1)
scaled_data2 = scaler.transform(df2)
# Convert scaled data back to DataFrames
df_scaled1 = pd.DataFrame(scaled_data1, columns=['feature1', 'feature2'])
df_scaled2 = pd.DataFrame(scaled_data2, columns=['feature1', 'feature2'])
print(df_scaled1)
print(df_scaled2)
Prerequisite for certain algorithms:
Drama: A data scientist attempts to use PCA for dimensionality reduction on non-standardized data, resulting in a completely distorted representation of the dataset and misleading insights.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Example data: features for PCA
data = {'feature1': [1, 2, 3, 4], 'feature2': [100, 200, 300, 400]}
df = pd.DataFrame(data)
# Initialize the scaler
scaler = StandardScaler()
# Fit and transform the data
scaled_data = scaler.fit_transform(df)
# Convert scaled data back to DataFrame
df_scaled = pd.DataFrame(scaled_data, columns=['feature1', 'feature2'])
print(df_scaled)
# With standardized features, PCA weighs both features fairly instead of
# letting the large-scale feature dominate the components
pca = PCA(n_components=1)
print(pca.fit_transform(df_scaled))
Improved numerical stability:
Drama: A financial model dealing with both large monetary values and small percentage changes experiences numerical overflow, leading to critical errors in investment decisions and substantial financial losses.
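A tiny sketch of the failure mode, with made-up portfolio values:
import numpy as np
# Illustrative data: large portfolio values (in dollars)
values = np.array([1.0e9, 5.0e9, 2.0e10])
# Exponentiating raw magnitudes this large overflows float64 and yields inf
print(np.exp(values))  # [inf inf inf], with an overflow warning
# After standardization the magnitudes are small and numerically well-behaved
z = (values - values.mean()) / values.std()
print(np.exp(z))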
These examples illustrate how failing to standardize data can lead to various problems in data analysis and machine learning applications.