What is Multicollinearity, and when does it arise? Explain it with a dataset example.

Multicollinearity is a statistical concept in regression analysis where two or more independent variables in a model are highly correlated, meaning one variable can be linearly predicted from the others with a high degree of accuracy. When multicollinearity is present, it inflates the standard errors of the coefficient estimates, making it difficult to determine the individual effect of each independent variable on the dependent variable, because the correlated variables share explanatory variance.

When Multicollinearity Arises

Multicollinearity usually occurs due to:

  • High Correlation between Variables: When variables used in the model are inherently related or measure similar attributes.

  • Data Collection Methods: Overlapping data collection methods or derived variables, e.g., computing one variable from others, which creates an exact linear dependence (see the sketch after this list).

  • Inadequate Data: Too few observations (data points) compared to the number of variables, or when the data is not sufficiently diverse.

  • Improper Model Specification: Including unnecessary or redundant variables in the model.
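
To make the derived-variable case concrete, here is a minimal sketch (the column names and sizes are illustrative assumptions, not from any real dataset): a total-area column computed by adding two other columns is an exact linear combination of them, so the design matrix loses a rank and the predictors become perfectly multicollinear.

```python
import numpy as np

rng = np.random.default_rng(0)
living_sqft = rng.uniform(800, 3000, size=100)
garage_sqft = rng.uniform(200, 600, size=100)

# Derived variable: computed exactly from the other two columns.
total_sqft = living_sqft + garage_sqft

# The three columns are linearly dependent, so the matrix has
# rank 2, not 3 -- perfect multicollinearity.
X = np.column_stack([living_sqft, garage_sqft, total_sqft])
print(np.linalg.matrix_rank(X))  # 2
```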

Example with Dataset

Let’s consider a dataset from the real estate market to explain multicollinearity:

| House Price | Size (sq ft) | Number of Rooms | Number of Bedrooms | Age of Home (Years) |
| --- | --- | --- | --- | --- |
| $350,000 | 2,000 | 7 | 4 | 10 |
| $450,000 | 2,500 | 9 | 5 | 5 |
| $300,000 | 1,800 | 6 | 3 | 15 |
| $500,000 | 3,000 | 10 | 6 | 3 |

In this dataset, you might encounter multicollinearity between several variables (the sketch after this list computes the correlations directly):

  1. Size and Number of Rooms: Larger homes (more square footage) typically have more rooms.

  2. Number of Rooms and Number of Bedrooms: More rooms in a house generally include more bedrooms.

  3. Age of Home (possibly): the age of a home correlating with price is not itself multicollinearity, since price is the dependent variable; but age might correlate indirectly with size or room count because of typical construction patterns in the era the house was built.
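
To see this numerically, here is a minimal sketch that loads the four rows above into a pandas DataFrame (the column names are my own shorthand) and prints the pairwise correlations among the predictors:

```python
import pandas as pd

# The four houses from the table above.
df = pd.DataFrame({
    "price":     [350_000, 450_000, 300_000, 500_000],
    "size_sqft": [2_000, 2_500, 1_800, 3_000],
    "rooms":     [7, 9, 6, 10],
    "bedrooms":  [4, 5, 3, 6],
    "age_years": [10, 5, 15, 3],
})

# Pairwise correlations among the predictors; coefficients near
# +1 or -1 flag candidate multicollinearity.
print(df[["size_sqft", "rooms", "bedrooms", "age_years"]].corr().round(2))
```

With only four rows the exact numbers mean little, but even here size, rooms, and bedrooms correlate almost perfectly with one another.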

Impact and Detection

The presence of multicollinearity can be detected using:

  • Correlation Matrix: Visual inspection of correlation coefficients between variables. High correlation coefficients (near +1 or -1) indicate potential problems.

  • Variance Inflation Factor (VIF): A measure of how much the variance of an estimated regression coefficient is inflated because the predictors are correlated. A VIF of 5 or more (some texts use a cutoff of 10) is usually considered concerning; a sketch of the computation follows this list.
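
A minimal sketch of the VIF computation using statsmodels, assuming it is installed. Note that with only four rows, using all four predictors plus an intercept would give an exact fit (infinite VIFs), so this toy demo keeps three predictors; on a real dataset you would include them all.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Three predictors from the table above (age dropped: with four rows
# and four predictors plus an intercept, the fit would be exact).
X = pd.DataFrame({
    "size_sqft": [2000, 2500, 1800, 3000],
    "rooms":     [7, 9, 6, 10],
    "bedrooms":  [4, 5, 3, 6],
})

# VIF regresses each predictor on all the others; add_constant
# supplies the intercept those auxiliary regressions need.
Xc = sm.add_constant(X)
for i, col in enumerate(Xc.columns):
    if col == "const":
        continue
    print(f"{col}: VIF = {variance_inflation_factor(Xc.values, i):.1f}")
```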

Addressing Multicollinearity

To handle multicollinearity, you might:

  • Remove highly correlated predictors: Decide which variable has more significance to your model and remove the other.

  • Principal Component Analysis (PCA): Transform the correlated variables into a set of linearly uncorrelated variables.

  • Regularization Techniques: Methods like Ridge or Lasso regression reduce the impact of multicollinearity by penalizing large coefficients; see the sketch after this list.
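
As an illustration of the regularization route, here is a minimal Ridge regression sketch using scikit-learn (alpha=1.0 is an arbitrary, untuned choice, and four rows is far too little data for a real fit): standardizing first and then applying the L2 penalty keeps the correlated coefficients from blowing up in opposite directions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy version of the table above; a real fit needs far more rows.
X = np.array([
    [2000, 7, 4, 10],
    [2500, 9, 5, 5],
    [1800, 6, 3, 15],
    [3000, 10, 6, 3],
], dtype=float)
y = np.array([350_000, 450_000, 300_000, 500_000], dtype=float)

# Standardize so the penalty treats every predictor on the same
# scale, then shrink the coefficients with an L2 penalty.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)

cols = ["size_sqft", "rooms", "bedrooms", "age_years"]
print(dict(zip(cols, model[-1].coef_.round(1))))
```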
