What is Multicollinearity, when does it arise, and how can it be explained with a dataset example?
Multicollinearity is a statistical concept in regression analysis where two or more independent variables in a model are highly correlated, meaning that one variable can be linearly predicted from the others with a high degree of accuracy. When multicollinearity is present, it inflates the variance of the estimated coefficients, making it difficult to determine the effect of each independent variable on the dependent variable because of the shared variance among the correlated variables.
When Multicollinearity Arises
Multicollinearity usually occurs due to:
High Correlation between Variables: When variables used in the model are inherently related or measure similar attributes.
Data Collection Methods: Overlapping data collection methods or derived variables (e.g., calculating one variable from another).
Inadequate Data: Too few observations (data points) compared to the number of variables, or when the data is not sufficiently diverse.
Improper Model Specification: Including unnecessary or redundant variables in the model.
Example with Dataset
Let’s consider a dataset from the real estate market to explain multicollinearity:
| House Price | Size (sq ft) | Number of Rooms | Number of Bedrooms | Age of Home (Years) |
| --- | --- | --- | --- | --- |
| $350,000 | 2,000 | 7 | 4 | 10 |
| $450,000 | 2,500 | 9 | 5 | 5 |
| $300,000 | 1,800 | 6 | 3 | 15 |
| $500,000 | 3,000 | 10 | 6 | 3 |
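If you want to follow along in code, here is the same toy table as a pandas DataFrame (the column names below are my own shorthand, not part of the original table):

```python
import pandas as pd

# Toy real-estate dataset from the table above
homes = pd.DataFrame({
    "price":     [350_000, 450_000, 300_000, 500_000],
    "size_sqft": [2_000, 2_500, 1_800, 3_000],
    "rooms":     [7, 9, 6, 10],
    "bedrooms":  [4, 5, 3, 6],
    "age_years": [10, 5, 15, 3],
})
```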
In this dataset, you might encounter multicollinearity between several variables:
Size and Number of Rooms: Larger homes (more square footage) typically have more rooms.
Number of Rooms and Number of Bedrooms: More rooms in a house generally include more bedrooms.
Age of Home and Size or Room Count: Note that a correlation between age and price alone is not multicollinearity, since price is the dependent variable. However, age may correlate with size or room count because of typical construction patterns in the era a house was built, and that correlation among predictors does contribute to multicollinearity.
Impact and Detection
The presence of multicollinearity can be detected using:
Correlation Matrix: Visual inspection of correlation coefficients between variables. High correlation coefficients (near +1 or -1) indicate potential problems.
Variance Inflation Factor (VIF): A measure of how much the variance of an estimated regression coefficient is inflated by correlation among the predictors. A VIF of 5 or more is usually considered concerning (a code sketch of both checks follows this list).
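Here is a minimal sketch of both checks using pandas and statsmodels on the `homes` DataFrame from above. One caveat: the toy table has only four rows, which is fewer observations than predictors, so the VIFs will come out infinite on this data. Treat the code as a template to run on a realistically sized dataset:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Predictors only; price is the dependent variable
X = homes[["size_sqft", "rooms", "bedrooms", "age_years"]]

# 1. Correlation matrix: coefficients near +1 or -1 flag trouble
print(X.corr().round(2))

# 2. VIF: add an intercept column first, since the auxiliary
#    regression behind each VIF should include one
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i)
     for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif.round(2))  # values above ~5 suggest problematic collinearity
```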
Addressing Multicollinearity
To handle multicollinearity, you might:
Remove highly correlated predictors: Decide which variable has more significance to your model and remove the other.
Principal Component Analysis (PCA): Transform the correlated variables into a set of linearly uncorrelated variables.
Regularization Techniques: Methods like Ridge or Lasso regression can reduce the impact of multicollinearity by penalizing large coefficients (see the sketch below).
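As a minimal sketch of the regularization route, here is a Ridge fit with scikit-learn, reusing `X` and `homes` from above. The choice of `alpha=1.0` is arbitrary, and standardizing first matters because the penalty is sensitive to the scale of each predictor:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ridge shrinks correlated coefficients toward each other instead of
# letting them blow up with opposite signs, stabilizing the estimates
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, homes["price"])

# Coefficients on the standardized predictors
print(model.named_steps["ridge"].coef_)
```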