Understanding Mathematical Transforms in Machine Learning

Arbash Hussain

Mathematical transformations are essential in feature engineering, a key step in the machine learning process. Before using data in a model, raw datasets often need preprocessing to improve their quality, clarity, and predictive power. Many real-world datasets face issues like skewed distributions, unequal variances (heteroscedasticity), or non-linearity, which can harm model performance.

This is where feature transformations such as Log, Reciprocal, Power, Box-Cox, and Yeo-Johnson come in. These methods help normalize data, stabilize variance, reduce the influence of outliers, and improve linearity, making features more suitable for machine learning algorithms. Applied well, they speed up model convergence, make features easier to interpret, and increase predictive accuracy.

In this blog, I will provide a detailed overview of these transformations, including their formulas, advantages, limitations, and best use cases.


Why Do We Need These Mathematical Transformations in ML?

  1. Normalization & Standardization

    • Ensures all features contribute equally by bringing them to a common scale.

    • Helps gradient-based optimization algorithms (e.g., gradient descent) converge faster.

  2. Handling Skewness & Outliers

    • Log, Box-Cox, and other transformations can normalize skewed data distributions.

    • Reduces the impact of extreme values that might affect model stability.

  3. Feature Engineering

    • Transforms raw features into more meaningful representations.

    • Polynomial and interaction transformations can introduce new patterns.

  4. Improving Model Interpretability

    • Some transformations make it easier to interpret the relationships between variables, improving explainability.

Types of Mathematical Transformations in ML

  1. Function Transformations

    • Apply mathematical functions to reshape the data distribution.

    • Examples: Log, Exponential, Square Root, Reciprocal, Sigmoid, and Tanh transformations.

    • Use case: Helps with skewed data and improves feature representation.

  2. Quantile-Based Transformations

    • Transforms data based on its rank, ensuring a uniform or normal distribution.

    • Examples: Quantile Transformer, Rank Transformation, Normal Score Transformation.

    • Use case: Handles outliers and non-Gaussian distributions effectively.

  3. Power Transformations

    • Applies power functions to stabilize variance and make data more normal-like.

    • Examples: Box-Cox and Yeo-Johnson transformations.

    • Use case: Useful for data with heteroscedasticity (varying spread).


1. Function Transformations

Function transformations apply mathematical functions to each data point to alter its distribution. These transformations help reduce skewness, normalize data, and make patterns more evident.

Common Function Transformations:

a) Log Transformation

  • Formula: \(x' = \log(x)\)

  • Use case: Handles right-skewed data and compresses large values.

  • Example: Best for right-skewed data where large values need compression.

    • Raw data: [1, 10, 100, 1000]

    • Log-transformed (base 10): [0, 1, 2, 3]
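
A quick sanity check of those numbers with numpy (the worked example uses a base-10 log; the natural log `np.log` is just as common in practice, and `np.log1p` handles zeros):

```python
import numpy as np

raw = np.array([1, 10, 100, 1000], dtype=float)
print(np.log10(raw))   # [0. 1. 2. 3.] -- matches the worked example
print(np.log1p(raw))   # log(1 + x), safe when the data contains zeros
```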

b) Exponential Transformation

  • Formula: \(x' = e^x\)

  • Use case: Expands small differences in low-value data points, often used to reverse log transformations.

  • Example: Best for heavily left-skewed data, where many values cluster near the maximum and small differences need expanding.

    • Raw data: [-2, -1, 0, 1, 2]

    • Exponential-transformed: [0.135, 0.368, 1.000, 2.718, 7.389]
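
Reproducing the numbers above with numpy:

```python
import numpy as np

raw = np.array([-2, -1, 0, 1, 2], dtype=float)
print(np.round(np.exp(raw), 3))  # [0.135 0.368 1.    2.718 7.389]
```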

c) Square Root Transformation

  • Formula: \(x' = \sqrt{x}\)

  • Use case: Similar to the log transformation but preserves more of the original data structure; unlike the log, it is defined at zero.

  • Example: Best for right-skewed data with large differences, since it shrinks the gap between large and small values.

    • Raw data: [1, 4, 9, 16, 25]

    • Square-Root-transformed: [1.000, 2.000, 3.000, 4.000, 5.000]
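
The same check with numpy (note that `np.sqrt` accepts 0, which `np.log` does not):

```python
import numpy as np

raw = np.array([1, 4, 9, 16, 25], dtype=float)
print(np.sqrt(raw))  # [1. 2. 3. 4. 5.]
```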

d) Reciprocal Transformation

  • Formula: \(x' = \frac{1}{x}\)

  • Use case: Helps normalize large values but cannot handle zeros.

  • Example: Best for heavy-tailed distributions, when we want to shrink large values aggressively. Note that it reverses the ordering of positive values: the largest raw value becomes the smallest transformed value.

    • Raw data: [1, 2, 5, 10, 100]

    • Reciprocal-transformed: [1.000, 0.500, 0.200, 0.100, 0.010]
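
With numpy (guarding against division by zero is left to the caller):

```python
import numpy as np

raw = np.array([1, 2, 5, 10, 100], dtype=float)
print(1.0 / raw)  # [1.   0.5  0.2  0.1  0.01] -- ordering is reversed
```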

e) Sigmoid & Tanh Transformations

  • Sigmoid: Squeezes values between 0 and 1.

$$x' = \frac{1}{1+e^{-x}}$$

  • Tanh: Squeezes values between -1 and 1.

$$x' = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

  • Use case: Common in deep learning for feature scaling.
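
Both are one line of numpy each; the outputs below illustrate the bounded ranges:

```python
import numpy as np

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
sigmoid = 1.0 / (1.0 + np.exp(-x))  # bounded to (0, 1)
tanh = np.tanh(x)                   # bounded to (-1, 1)
print(np.round(sigmoid, 3))  # [0.047 0.269 0.5   0.731 0.953]
print(np.round(tanh, 3))     # [-0.995 -0.762  0.     0.762  0.995]
```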

When to Use Function Transformations?

  • When data is highly skewed (e.g., income, population, prices).

  • When large variations exist between minimum and maximum values.

  • When features need to be bounded (e.g., probabilities).


2. Quantile-Based Transformations

Quantile-based transformations map data to a new distribution based on rank instead of absolute values. This ensures a uniform or normal distribution, regardless of the original data shape.

Common Quantile-Based Transformations:

a) Quantile Transformation

  • Maps feature values to a uniform or normal distribution based on percentiles.

  • Example:

    • Raw data: [10, 50, 200, 5000]

    • Transformed (Uniform): [0.1, 0.4, 0.7, 1.0]
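
The numbers above are illustrative; exact outputs depend on the implementation. scikit-learn's QuantileTransformer (named in the overview) is one concrete option, sketched here on a toy array:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.array([[10], [50], [200], [5000]], dtype=float)

# n_quantiles must not exceed the number of samples
qt = QuantileTransformer(n_quantiles=4, output_distribution="uniform")
print(qt.fit_transform(X).ravel())  # -> [0, 1/3, 2/3, 1]: evenly spaced ranks

# output_distribution="normal" maps the same ranks to Gaussian quantiles instead
```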

b) Rank Transformation

  • Replaces each value with its rank among all values.

  • Example:

    • Raw data: [100, 200, 150]

    • Rank-transformed: [1, 3, 2]
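
scipy's rankdata does exactly this (ties receive averaged ranks by default):

```python
from scipy.stats import rankdata

print(rankdata([100, 200, 150]))  # [1. 3. 2.]
```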

c) Normal Score Transformation

  • Similar to rank transformation but maps values to a normal (Gaussian) distribution.
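
A minimal sketch of one common recipe, assuming the usual rank-based construction: rank the values, rescale the ranks into (0, 1), then apply the inverse normal CDF. `QuantileTransformer(output_distribution="normal")` automates the same idea.

```python
import numpy as np
from scipy.stats import norm, rankdata

x = np.array([10.0, 50.0, 200.0, 5000.0])
u = (rankdata(x) - 0.5) / len(x)   # ranks rescaled into (0, 1)
print(np.round(norm.ppf(u), 3))    # Gaussian scores: [-1.15 -0.319  0.319  1.15]
```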

When to Use Quantile-Based Transformations?

  • When data contains many outliers or is heavily skewed.

  • When non-Gaussian distributions need to be converted into normal distributions.

  • When robust preprocessing is needed before regression models.


3. Power Transformations

Power transformations apply specific mathematical functions to stabilize variance, reduce skewness, and make data resemble a normal distribution.

Common Power Transformations:

a) Box-Cox Transformation

The Box-Cox transformation is a generalized power transformation that can handle a broad range of data distributions.

Formula:

$$x'(\lambda) = \begin{cases} \frac{x^\lambda - 1}{\lambda}, & \text{if } \lambda \neq 0 \\ \log(x), & \text{if } \lambda = 0 \end{cases}$$

Purpose:

Box-Cox transformation optimizes the value of λ to make the data as normal as possible. It is particularly useful for heteroscedastic data.

Advantages:

  • Can handle a wide range of data distributions.

  • Automatically determines the optimal λ using maximum likelihood estimation.

  • Improves normality and stabilizes variance.

Disadvantages:

  • Requires strictly positive values.

  • Finding the optimal λ can be computationally intensive.

When to Use:

  • When the normality assumption is critical.

  • When a simple log or power transformation does not work well.

  • When variance is highly unstable.
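
A minimal sketch with scipy's boxcox, which fits \(\lambda\) by maximum likelihood (the toy data here is made up for illustration and must be strictly positive):

```python
import numpy as np
from scipy.stats import boxcox

x = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 100.0])  # right-skewed, all > 0
x_bc, lam = boxcox(x)          # lambda estimated by maximum likelihood
print(f"lambda = {lam:.3f}")
print(np.round(x_bc, 3))       # noticeably less skewed than x
```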

b) Yeo-Johnson Transformation

Unlike Box-Cox, the Yeo-Johnson transformation can handle both positive and negative values.

Formula:

$$x'(\lambda) = \begin{cases} \frac{(x + 1)^\lambda - 1}{\lambda}, & \text{if } x \geq 0, \lambda \neq 0 \\ \log(x + 1), & \text{if } x \geq 0, \lambda = 0 \\ \frac{-(|x| + 1)^{2 - \lambda} + 1}{2 - \lambda}, & \text{if } x < 0, \lambda \neq 2 \\ -\log(|x| + 1), & \text{if } x < 0, \lambda = 2 \end{cases}$$

Purpose:

The Yeo-Johnson transformation is an extension of Box-Cox that works for both positive and negative values, making it suitable for datasets with mixed-sign distributions.

Advantages:

  • Works for both positive and negative values.

  • Provides normality improvements like Box-Cox.

  • Optimizes \(\lambda\) to best fit the data.

Disadvantages:

  • Requires numerical optimization to find the best \(\lambda\).

  • May not be as interpretable as simpler transformations.
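
A minimal sketch using scikit-learn's PowerTransformer, whose default method is Yeo-Johnson (the mixed-sign toy data is made up for illustration; Box-Cox would reject it):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[-5.0], [-1.0], [0.0], [2.0], [40.0]])  # contains negatives and zero

pt = PowerTransformer(method="yeo-johnson")  # standardizes the output by default
X_yj = pt.fit_transform(X)
print(f"lambda = {pt.lambdas_[0]:.3f}")
print(np.round(X_yj.ravel(), 3))
```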

When to Use Power Transformations?

  • When data is highly skewed and needs to be closer to a normal distribution.

  • When variance is not constant (heteroscedasticity) and needs stabilization.

  • When linear models perform poorly due to non-linearity in features.


Choosing the Right Transformation

| Transformation Type | When to Use? | Handles Outliers? | Works with Negative Values? |
| --- | --- | --- | --- |
| Function Transformations | When data is moderately skewed and needs rescaling or bounding (e.g., probabilities, percentages) | No | No (except Tanh, Sigmoid) |
| Power Transformations | When data is highly skewed and variance needs stabilization for regression and ML models | Somewhat | Yes (Yeo-Johnson) |
| Quantile Transformations | When data has extreme skewness or outliers and needs to be uniformly or normally distributed | Yes | Yes |

Conclusion

Mathematical transformations play a fundamental role in machine learning, particularly in preprocessing. Selecting the right transformation depends on the data distribution and the modeling objectives. I hope this guide is useful to you. If so, please like and follow. Your feedback and engagement are highly appreciated, so feel free to share your thoughts or ask questions in the comments section.

