Understanding Mathematical Transforms in Machine Learning

Arbash Hussain

Mathematical transformations are essential in feature engineering, a key step in the machine learning process. Before using data in a model, raw datasets often need preprocessing to improve their quality, clarity, and predictive power. Many real-world datasets face issues like skewed distributions, unequal variances (heteroscedasticity), or non-linearity, which can harm model performance.

This is where feature transformations such as Log, Reciprocal, Power, Box-Cox, and Yeo-Johnson come in. These methods help normalize data, stabilize variance, reduce the influence of outliers, and improve linearity, making features more suitable for machine learning algorithms. Applied well, they speed up model convergence, make features easier to interpret, and increase predictive accuracy.

In this blog, I will provide a detailed overview of these transformations, including their formulas, advantages, limitations, and best use cases.


Why Do We Need These Mathematical Transformations in ML?

  1. Normalization & Standardization

    • Ensures all features contribute equally by bringing them to a common scale.

    • Helps gradient-based optimization algorithms (e.g., gradient descent) converge faster.

  2. Handling Skewness & Outliers

    • Log, Box-Cox, and other transformations can normalize skewed data distributions.

    • Reduces the impact of extreme values that might affect model stability.

  3. Feature Engineering

    • Transforms raw features into more meaningful representations.

    • Polynomial and interaction transformations can introduce new patterns.

  4. Improving Model Interpretability

    • Some transformations make it easier to interpret the relationships between variables, improving explainability.

Types of Mathematical Transformations in ML

  1. Function Transformations

    • Apply mathematical functions to reshape the data distribution.

    • Examples: Log, Exponential, Square Root, Reciprocal, Sigmoid, and Tanh transformations.

    • Use case: Helps with skewed data and improves feature representation.

  2. Quantile-Based Transformations

    • Transforms data based on its rank, ensuring a uniform or normal distribution.

    • Examples: Quantile Transformer, Rank Transformation, Normal Score Transformation.

    • Use case: Handles outliers and non-Gaussian distributions effectively.

  3. Power Transformations

    • Applies power functions to stabilize variance and make data more normal-like.

    • Examples: Box-Cox and Yeo-Johnson transformations.

    • Use case: Useful for data with heteroscedasticity (varying spread).


1. Function Transformations

Function transformations apply mathematical functions to each data point to alter its distribution. These transformations help reduce skewness, normalize data, and make patterns more evident.

Common Function Transformations:

a) Log Transformation

  • Formula: \(x' = \log(x)\)

  • Use case: Handles right-skewed data and compresses large values.

  • Example: Best for right-skewed data where large values need compression.

    • Raw data: [1, 10, 100, 1000]

    • Log-transformed (base 10): [0, 1, 2, 3]
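
A quick sanity check of those numbers with numpy (the worked example uses a base-10 log; the natural log `np.log` is just as common in practice, and `np.log1p` handles zeros):

```python
import numpy as np

raw = np.array([1, 10, 100, 1000], dtype=float)
print(np.log10(raw))   # [0. 1. 2. 3.] -- matches the worked example
print(np.log1p(raw))   # log(1 + x), safe when the data contains zeros
```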

b) Exponential Transformation

  • Formula: \(x' = e^x\)

  • Use case: Expands small differences in low-value data points, often used to reverse log transformations.

  • Example: Best for heavily left-skewed data, where many values cluster near the maximum and small differences need expanding.

    • Raw data: [-2, -1, 0, 1, 2]

    • Exponential-transformed: [0.135, 0.368, 1.000, 2.718, 7.389]
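
Reproducing the numbers above with numpy:

```python
import numpy as np

raw = np.array([-2, -1, 0, 1, 2], dtype=float)
print(np.round(np.exp(raw), 3))  # [0.135 0.368 1.    2.718 7.389]
```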

c) Square Root Transformation

  • Formula: \(x' = \sqrt{x}\)

  • Use case: Similar to the log transformation but preserves more of the original data structure; unlike the log, it is defined at zero.

  • Example: Best for right-skewed data with large differences, since it shrinks the gap between large and small values.

    • Raw data: [1, 4, 9, 16, 25]

    • Square-Root-transformed: [1.000, 2.000, 3.000, 4.000, 5.000]
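
The same check with numpy (note that `np.sqrt` accepts 0, which `np.log` does not):

```python
import numpy as np

raw = np.array([1, 4, 9, 16, 25], dtype=float)
print(np.sqrt(raw))  # [1. 2. 3. 4. 5.]
```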

d) Reciprocal Transformation

  • Formula: \(x' = \frac{1}{x}\)

  • Use case: Helps normalize large values but cannot handle zeros.

  • Example: Best for heavy-tailed distributions, when we want to shrink large values aggressively. Note that it reverses the ordering of positive values: the largest raw value becomes the smallest transformed value.

    • Raw data: [1, 2, 5, 10, 100]

    • Reciprocal-transformed: [1.000, 0.500, 0.200, 0.100, 0.010]
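
With numpy (guarding against division by zero is left to the caller):

```python
import numpy as np

raw = np.array([1, 2, 5, 10, 100], dtype=float)
print(1.0 / raw)  # [1.   0.5  0.2  0.1  0.01] -- ordering is reversed
```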

e) Sigmoid & Tanh Transformations

  • Sigmoid: Squeezes values between 0 and 1.

$$x' = \frac{1}{1+e^{-x}}$$

  • Tanh: Squeezes values between -1 and 1.

$$x' = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

  • Use case: Common in deep learning for feature scaling.
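
Both are one line of numpy each; the outputs below illustrate the bounded ranges:

```python
import numpy as np

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
sigmoid = 1.0 / (1.0 + np.exp(-x))  # bounded to (0, 1)
tanh = np.tanh(x)                   # bounded to (-1, 1)
print(np.round(sigmoid, 3))  # [0.047 0.269 0.5   0.731 0.953]
print(np.round(tanh, 3))     # [-0.995 -0.762  0.     0.762  0.995]
```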

When to Use Function Transformations?

  • When data is highly skewed (e.g., income, population, prices).

  • When large variations exist between minimum and maximum values.

  • When features need to be bounded (e.g., probabilities).


2. Quantile-Based Transformations

Quantile-based transformations map data to a new distribution based on rank instead of absolute values. This ensures a uniform or normal distribution, regardless of the original data shape.

Common Quantile-Based Transformations:

a) Quantile Transformation

  • Maps feature values to a uniform or normal distribution based on percentiles.

  • Example:

    • Raw data: [10, 50, 200, 5000]

    • Transformed (Uniform): [0.1, 0.4, 0.7, 1.0]
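
The numbers above are illustrative; exact outputs depend on the implementation. scikit-learn's QuantileTransformer (named in the overview) is one concrete option, sketched here on a toy array:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.array([[10], [50], [200], [5000]], dtype=float)

# n_quantiles must not exceed the number of samples
qt = QuantileTransformer(n_quantiles=4, output_distribution="uniform")
print(qt.fit_transform(X).ravel())  # -> [0, 1/3, 2/3, 1]: evenly spaced ranks

# output_distribution="normal" maps the same ranks to Gaussian quantiles instead
```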

b) Rank Transformation

  • Replaces each value with its rank among all values.

  • Example:

    • Raw data: [100, 200, 150]

    • Rank-transformed: [1, 3, 2]
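
scipy's rankdata does exactly this (ties receive averaged ranks by default):

```python
from scipy.stats import rankdata

print(rankdata([100, 200, 150]))  # [1. 3. 2.]
```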

c) Normal Score Transformation

  • Similar to rank transformation but maps values to a normal (Gaussian) distribution.
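
A minimal sketch of one common recipe, assuming the usual rank-based construction: rank the values, rescale the ranks into (0, 1), then apply the inverse normal CDF. `QuantileTransformer(output_distribution="normal")` automates the same idea.

```python
import numpy as np
from scipy.stats import norm, rankdata

x = np.array([10.0, 50.0, 200.0, 5000.0])
u = (rankdata(x) - 0.5) / len(x)   # ranks rescaled into (0, 1)
print(np.round(norm.ppf(u), 3))    # Gaussian scores: [-1.15 -0.319  0.319  1.15]
```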

When to Use Quantile-Based Transformations?

  • When data contains many outliers or is heavily skewed.

  • When non-Gaussian distributions need to be converted into normal distributions.

  • When robust preprocessing is needed before regression models.


3. Power Transformations

Power transformations apply specific mathematical functions to stabilize variance, reduce skewness, and make data resemble a normal distribution.

Common Power Transformations:

a) Box-Cox Transformation

The Box-Cox transformation is a generalized power transformation that can handle a broad range of data distributions.

Formula:

$$x'(\lambda) = \begin{cases} \frac{x^\lambda - 1}{\lambda}, & \text{if } \lambda \neq 0 \\ \log(x), & \text{if } \lambda = 0 \end{cases}$$

Purpose:

Box-Cox transformation optimizes the value of λ to make the data as normal as possible. It is particularly useful for heteroscedastic data.

Advantages:

  • Can handle a wide range of data distributions.

  • Automatically determines the optimal λ using maximum likelihood estimation.

  • Improves normality and stabilizes variance.

Disadvantages:

  • Requires strictly positive values.

  • Finding the optimal λ can be computationally intensive.

When to Use:

  • When the normality assumption is critical.

  • When a simple log or power transformation does not work well.

  • When variance is highly unstable.
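
A minimal sketch with scipy's boxcox, which fits \(\lambda\) by maximum likelihood (the toy data here is made up for illustration and must be strictly positive):

```python
import numpy as np
from scipy.stats import boxcox

x = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 100.0])  # right-skewed, all > 0
x_bc, lam = boxcox(x)          # lambda estimated by maximum likelihood
print(f"lambda = {lam:.3f}")
print(np.round(x_bc, 3))       # noticeably less skewed than x
```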

b) Yeo-Johnson Transformation

Unlike Box-Cox, the Yeo-Johnson transformation can handle both positive and negative values.

Formula:

$$x'(\lambda) = \begin{cases} \frac{(x + 1)^\lambda - 1}{\lambda}, & \text{if } x \geq 0, \lambda \neq 0 \\ \log(x + 1), & \text{if } x \geq 0, \lambda = 0 \\ \frac{-(|x| + 1)^{2 - \lambda} + 1}{2 - \lambda}, & \text{if } x < 0, \lambda \neq 2 \\ -\log(|x| + 1), & \text{if } x < 0, \lambda = 2 \end{cases}$$

Purpose:

The Yeo-Johnson transformation is an extension of Box-Cox that works for both positive and negative values, making it suitable for datasets with mixed-sign distributions.

Advantages:

  • Works for both positive and negative values.

  • Provides normality improvements like Box-Cox.

  • Optimizes \(\lambda\) to best fit the data.

Disadvantages:

  • Requires numerical optimization to find the best \(\lambda\).

  • May not be as interpretable as simpler transformations.
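
A minimal sketch using scikit-learn's PowerTransformer, whose default method is Yeo-Johnson (the mixed-sign toy data is made up for illustration; Box-Cox would reject it):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[-5.0], [-1.0], [0.0], [2.0], [40.0]])  # contains negatives and zero

pt = PowerTransformer(method="yeo-johnson")  # standardizes the output by default
X_yj = pt.fit_transform(X)
print(f"lambda = {pt.lambdas_[0]:.3f}")
print(np.round(X_yj.ravel(), 3))
```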

When to Use Power Transformations?

  • When data is highly skewed and needs to be closer to a normal distribution.

  • When variance is not constant (heteroscedasticity) and needs stabilization.

  • When linear models perform poorly due to non-linearity in features.


Choosing the Right Transformation

| Transformation Type | When to Use? | Handles Outliers? | Works with Negative Values? |
| --- | --- | --- | --- |
| Function Transformations | When data is moderately skewed and needs rescaling or bounding (e.g., probabilities, percentages) | No | No (except Tanh, Sigmoid) |
| Power Transformations | When data is highly skewed and variance needs stabilization for regression and ML models | Somewhat | Yes (Yeo-Johnson) |
| Quantile Transformations | When data has extreme skewness or outliers and needs to be uniformly or normally distributed | Yes | Yes |

Conclusion

Mathematical transformations play a fundamental role in machine learning, particularly in preprocessing. Selecting the right transformation depends on the data distribution and the modeling objectives. I hope this guide is useful to you. If so, please like and follow. Your feedback and engagement are highly appreciated, so feel free to share your thoughts or ask questions in the comments section.

