What is correlation, when does it arise, and how do you handle it? Explained with a dataset. Also, what is a correlation matrix?
Correlation is a statistical measure of how strongly two variables are linearly related, that is, how closely two quantities move in relation to one another. The correlation coefficient (most commonly Pearson's r) ranges from -1 to +1, where:
+1 indicates a perfect positive linear relationship,
-1 indicates a perfect negative linear relationship, and
0 indicates no linear relationship (the variables may still be related in a nonlinear way).
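For reference, and assuming the coefficient meant here is the usual Pearson correlation coefficient, it can be written for n paired observations (x_i, y_i) with sample means x̄ and ȳ as:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator measures how the two variables vary together, and the denominator rescales it by each variable's own spread, which is what keeps the result between -1 and +1.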
When Correlation Arises
Correlation arises in various scenarios, especially when variables have some form of inherent connection or dependency, such as:
Physical properties related to dimensions (like height and weight, where taller individuals often weigh more).
Economic indicators (like GDP and consumer spending).
Behavioural data (like study time and exam scores, where more study time can correlate with higher scores).
Handling Correlation
In data analysis and modelling, high correlation among predictor variables can lead to multicollinearity in regression analysis, which makes coefficient estimates unstable and hard to interpret. To handle correlation effectively:
Data Exploration: Use scatter plots, correlation coefficients, or advanced statistical tests to understand the nature and strength of relationships.
Variable Selection: Avoid using highly correlated predictors together in the same model.
Dimensionality Reduction Techniques: Use techniques like Principal Component Analysis (PCA) to reduce the number of variables in a dataset by transforming them into a new set of variables that are uncorrelated and retain most of the information in the original dataset.
Regularization Techniques: Use methods like Ridge or Lasso regression, which can handle multicollinearity by adding a penalty to the regression model.
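As a rough illustration of the variable selection step, here is a minimal sketch in plain Python. The helper names (`pearson_r`, `drop_correlated`), the 0.9 threshold, and the sample predictors are all assumptions made for this example, not part of any library. The filter is greedy: it keeps a column only if it is not too strongly correlated with any column already kept, so the result depends on column order.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Greedy filter: keep a column only if |r| with every kept column is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson_r(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical predictors: height_in is just height_cm expressed in inches,
# so the two columns carry almost identical information.
predictors = {
    "height_cm": [175, 180, 170, 165, 172],
    "height_in": [68.9, 70.9, 66.9, 65.0, 67.7],
    "hours_exercise": [4, 3, 5, 2, 3],
}

selected = drop_correlated(predictors, threshold=0.9)
print(selected)
```

Here `height_in` is dropped because it is nearly a rescaled copy of `height_cm`, while `hours_exercise` is kept. In practice the threshold is a judgment call, and which variable of a correlated pair to keep is often decided by domain knowledge rather than column order.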
Example with Dataset
Suppose we have a dataset from a fitness study tracking physical metrics and activity levels:
| Participant | Age | Weight (kg) | Height (cm) | Hours of Exercise per Week |
| 1 | 25 | 70 | 175 | 4 |
| 2 | 30 | 82 | 180 | 3 |
| 3 | 22 | 68 | 170 | 5 |
| 4 | 28 | 75 | 165 | 2 |
In this dataset:
Weight and Height are likely to have a positive correlation.
Age and Hours of Exercise might not have a strong correlation.
Correlation Matrix
A correlation matrix is a table where the variables are shown on both rows and columns, and the cell values are the correlation coefficients between the variables.
It helps to visualize and identify any correlations between variables at a glance.
Example of a correlation matrix for the variables in the above dataset (the values are illustrative, not computed from the four sample rows shown):
| | Age | Weight | Height | Hours of Exercise |
| Age | 1 | -0.10 | -0.20 | -0.50 |
| Weight | -0.10 | 1 | 0.85 | 0.15 |
| Height | -0.20 | 0.85 | 1 | 0.10 |
| Hours of Exercise | -0.50 | 0.15 | 0.10 | 1 |
Here, you can see that:
Weight and Height have a high correlation (0.85), suggesting a strong positive relationship.
Age and Hours of Exercise have a moderate negative correlation (-0.50), suggesting that older participants tend to exercise less.
This matrix is pivotal in deciding which variables might be redundant or require adjustments before further analysis, such as in regression modelling.
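To show how such a matrix is built, the sketch below computes Pearson coefficients for every pair of columns, using only the four sample rows from the fitness table. The `pearson_r` helper and the variable names are assumptions made for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# The four participants from the sample table, one list per column.
data = {
    "Age": [25, 30, 22, 28],
    "Weight": [70, 82, 68, 75],
    "Height": [175, 180, 170, 165],
    "Hours of Exercise": [4, 3, 5, 2],
}

names = list(data)

# matrix[a][b] holds the correlation between column a and column b.
matrix = {a: {b: pearson_r(data[a], data[b]) for b in names} for a in names}

# Print one row per variable, rounded to two decimals.
for a in names:
    print(f"{a:>18}", "  ".join(f"{matrix[a][b]:6.2f}" for b in names))
```

The diagonal is always 1 and the matrix is symmetric, as in the example table. On these four rows alone the coefficients differ from the illustrative values (for instance, Weight vs Height comes out near 0.48 and Height vs Hours of Exercise at 0.20), which is expected: with so few observations the estimates are very noisy. If pandas is available, `DataFrame.corr()` produces the same table in one call.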
Written by Sai Prasanna Maharana