Address Multicollinearity using Principal Component Analysis
Multicollinearity refers to a condition in which the independent variables are correlated with each other. Multicollinearity can cause problems when you fit the model and interpret the results. The independent variables of a dataset should be uncorrelated with each other to avoid the problem of multicollinearity.
In this article, you can read why multicollinearity is a problem and how to remove multicollinearity from a dataset using Principal Component Analysis (PCA).
Multicollinearity inflates the variance of the coefficient estimates and can distort the interpretation of the model, as it undermines the statistical significance of the independent variables.
If some of the independent variables in a dataset are highly correlated with each other, the result is multicollinearity. A small change in any of those features can then affect model performance to a great extent. In other words, the coefficients of the model become very sensitive to small changes in the independent variables.
To handle or remove multicollinearity in a dataset, we first need to confirm that the dataset is in fact multicollinear. There are various signs of the presence of multicollinearity in the data, some of them being:
- Getting very high standard errors for regression coefficients
- The overall model is significant, but none of the coefficients are significant
- Large changes in coefficients when adding predictors
- High Variance Inflation Factor (VIF) and Low Tolerance
In this article, we will see how to detect multicollinearity in data using a correlation matrix and PCA, and how to remove it using PCA. The basic idea is to run a PCA on all predictors and examine the eigenvalues: the condition index, the square root of the ratio of the largest eigenvalue to the smallest, will be high if multicollinearity is present.
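As a minimal sketch of the condition-index check described above (the column data here is synthetic, purely for illustration): we standardize the predictors, take the eigenvalues of their correlation matrix, and compute the square root of the largest-to-smallest ratio. A value above roughly 30 is commonly read as serious multicollinearity.

```python
import numpy as np

# Illustrative predictor matrix: two nearly collinear columns plus one independent column
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([
    x1,
    x1 + rng.normal(scale=0.01, size=200),  # almost a copy of x1
    rng.normal(size=200),
])

# Standardize, then take the eigenvalues of the correlation matrix
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals = np.linalg.eigvalsh(np.corrcoef(Xs, rowvar=False))

# Condition index: sqrt of the largest over the smallest eigenvalue
condition_index = np.sqrt(eigvals.max() / eigvals.min())
print(condition_index)
```

Because two of the three columns are nearly identical, the smallest eigenvalue is close to zero and the condition index is far above the usual warning threshold.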
About the Data:
For the analysis, the dataset used is the Diamonds dataset downloaded from Kaggle. This classic dataset contains the prices (target variable) and 9 other attributes of almost 54,000 diamonds.
Preprocessing of the dataset:
The dataset has 9 independent features, and ‘price’ is the target variable. Before proceeding to correlation analysis, we need to encode the categorical variables ‘cut’, ‘color’, and ‘clarity’.
Correlation Analysis:
To find the Pearson correlation coefficients between all the variables in the dataset:

```python
data.corr(method='pearson')
```

Supported methods of correlation:
* pearson (default)
* kendall
* spearman
From the above correlation heatmap, we can observe that the independent variables ‘x’, ‘y’, ‘z’, and ‘carat’ are highly correlated with each other (Pearson coefficient > 0.9), and hence conclude that multicollinearity is present in the data.
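The highly correlated pairs can also be listed programmatically from the correlation matrix. The sketch below uses a synthetic frame that mimics how ‘carat’, ‘x’, ‘y’, and ‘z’ move together (the real values come from the Kaggle CSV):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: carat, x, y, z move together; depth does not
rng = np.random.default_rng(1)
carat = rng.uniform(0.2, 2.0, size=500)
data = pd.DataFrame({
    "carat": carat,
    "x": carat * 3 + rng.normal(scale=0.05, size=500),
    "y": carat * 3 + rng.normal(scale=0.05, size=500),
    "z": carat * 2 + rng.normal(scale=0.05, size=500),
    "depth": rng.normal(60, 2, size=500),
})

corr = data.corr(method="pearson")

# Every unordered pair of columns whose |r| exceeds the 0.9 threshold
high_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
print(high_pairs)
```

All six pairs among the four collinear columns clear the threshold, while ‘depth’ appears in none of them.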
We could also drop a few of the highly correlated features to remove multicollinearity, but that may result in a loss of information, and it is not a feasible technique for high-dimensional data. The idea instead is to reduce the dimensionality of the data using the PCA algorithm and drop the components with low variance.
Principal Component Analysis (PCA) is a common feature extraction technique in data science that employs matrix factorization to project the data into a lower-dimensional space.
To extract features from the dataset using the PCA technique, we first need to find the percentage of variance explained as the dimensionality decreases.
Notation:
λᵢ: i-th eigenvalue (the variance captured by the i-th principal component)
d: number of dimensions of the original dataset
k: number of dimensions of the new feature space
The fraction of variance explained by the first k components is (λ₁ + … + λ_k) / (λ₁ + … + λ_d).
Computing np.cumsum(pca.explained_variance_ratio_), the total variance of the data captured by the 1st principal component is 0.46, by the first two components 0.62, and by the first six components 0.986. Looking at the individual eigenvalues, the variance captured by the 1st component is 4.21, by the 2nd 1.41, by the 3rd 1.22, and by the last 0.0156.
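A sketch of this step with scikit-learn is shown below. The design matrix here is synthetic (a rank-4 signal mixed into 9 features) so the snippet runs standalone; on the diamonds data you would pass the encoded predictor columns instead, and the 0.46 / 0.62 / 0.986 figures above come from that data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in design matrix; in practice X = data.drop(columns="price").values
rng = np.random.default_rng(2)
base = rng.normal(size=(1000, 4))      # 4 underlying factors
mix = rng.normal(size=(4, 9))          # spread across 9 observed features
X = base @ mix + rng.normal(scale=0.1, size=(1000, 9))

# Standardize before PCA so large-scale features do not dominate
pca = PCA().fit(StandardScaler().fit_transform(X))
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose components keep at least 98% of the variance
k = int(np.searchsorted(cumvar, 0.98) + 1)
print(k, cumvar)
```

On this synthetic data only a handful of components are needed, mirroring how 6 components suffice for the diamonds dataset.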
Since 98.6% of the total variance is captured by the first 6 principal components alone, we keep only 6 components of the PCA and compute a correlation heatmap on them to observe the multicollinearity.
From the above correlation heatmap, it can now be observed that the principal components are uncorrelated with each other.
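This property can be verified numerically: the sample correlations between PCA scores are zero up to floating-point error, because the components are orthogonal directions of the centered data. A self-contained sketch on synthetic collinear data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the standardized diamonds predictors, with built-in collinearity
rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
X = np.column_stack([
    x1,
    x1 * 2 + rng.normal(scale=0.1, size=500),
    rng.normal(size=500),
    rng.normal(size=500),
])

# Project onto the principal components and correlate the scores pairwise
scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))
corr = np.corrcoef(scores, rowvar=False)

# Off-diagonal entries are numerically zero: the components are uncorrelated
off_diag = corr - np.diag(np.diag(corr))
print(np.abs(off_diag).max())
```

The largest off-diagonal correlation is on the order of machine precision, confirming that multicollinearity is gone in the component space.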
Hence, by reducing the dimensionality of the data using PCA, 98.6% of the variance is preserved and the multicollinearity of the data is removed.
There are various methods to remove multicollinearity from a dataset. In this article, we discussed the PCA dimensionality reduction technique, which removes multicollinearity from the dataset while preserving the maximum variance. One disadvantage of this technique is that the interpretability of the features is lost.
Thank You for Reading