## Address Multicollinearity using Principal Component Analysis

Multicollinearity refers to a condition in which the **independent variables are correlated with each other**. Multicollinearity can cause problems when you fit the model and interpret the results. Ideally, the variables of the dataset should be independent of each other, so multicollinearity has to be detected and removed.

In this article, you can read why multicollinearity is a problem and how to remove it from a dataset using Principal Component Analysis (PCA).

Multicollinearity inflates the variance of the estimated regression coefficients, and can also affect the interpretation of the model, as **it undermines the statistical significance of the independent variables**.

For a dataset, if some of the independent variables are highly correlated with each other, the result is multicollinearity. A small change in any of the features can then affect the model performance to a great extent. In other words, the coefficients of the model become very sensitive to small changes in the independent variables.

To handle or remove multicollinearity in the dataset, we first need to confirm whether the dataset is multicollinear in nature. There are various ways to detect the presence of multicollinearity in the data, some of which are:

* **Very high standard errors for regression coefficients**

* **The overall model is significant, but none of the individual coefficients are**

* **Large changes in coefficients when adding predictors**

* **High Variance Inflation Factor (VIF) and low tolerance**
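The VIF check is easy to compute directly: the VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. A minimal sketch using only NumPy follows; the `vif` helper and the synthetic variables are illustrative (on the real dataset you would pass the predictor columns instead):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of a 2-D array X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the remaining columns (plus an intercept).
    """
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic example: x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

print(vif(X))  # x1 and x2 get large VIFs; x3 stays close to 1
```

A common rule of thumb is that a VIF above 5 (or, more leniently, 10) signals problematic multicollinearity.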

In this article, we will see how to find multicollinearity in data using a Correlation Matrix and PCA, and remove it using PCA. The basic idea is to run a PCA on all predictors and examine the eigenvalues: the square root of the ratio of the largest eigenvalue to the smallest, known as the Condition Index, will be high if multicollinearity is present.
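As a sketch of that idea (the `condition_indices` helper and the synthetic data are illustrative; a largest index above roughly 30 is a common rule of thumb for serious multicollinearity):

```python
import numpy as np

def condition_indices(X):
    """Condition indices from a PCA of the standardized predictors.

    The eigenvalues of the correlation matrix are the PCA variances;
    the condition index of each component is sqrt(largest / eigenvalue).
    """
    eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    return np.sqrt(eigvals[0] / eigvals)

# Synthetic example: the second column almost duplicates the first.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
X = np.column_stack([
    a,
    0.98 * a + rng.normal(scale=0.05, size=300),
    rng.normal(size=300),
])

print(condition_indices(X))  # the last index is large -> collinearity
```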

## About the Data:

For further analysis, the dataset used is the Diamonds dataset downloaded from Kaggle. This classic dataset contains the price (target variable) and 9 other attributes of almost 54,000 diamonds.

## Preprocessing of the dataset:

The dataset has 9 independent features, with **‘price’** as the target label. Before proceeding to statistical correlation analysis, we need to encode the categorical variables **‘cut’**, **‘color’**, and **‘clarity’**.
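A minimal encoding sketch with pandas, using a tiny hand-made stand-in for the real Kaggle file (only the column names come from the dataset; the rows are fabricated for illustration):

```python
import pandas as pd

# Tiny stand-in for the diamonds.csv file from Kaggle.
data = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": ["Ideal", "Premium", "Premium"],
    "color": ["E", "E", "I"],
    "clarity": ["SI2", "SI1", "VS2"],
    "price": [326, 326, 334],
})

# Ordinal-encode each categorical column so corr() and PCA
# can work on purely numeric data.
for col in ["cut", "color", "clarity"]:
    data[col] = data[col].astype("category").cat.codes

print(data.dtypes)  # all columns are now numeric
```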

## Correlation Analysis:

To find the Pearson correlation coefficient between every pair of variables in the dataset, call `data.corr(method='pearson')`, where `method` can be one of:

* pearson (default)

* kendall

* spearman
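As a sketch on synthetic stand-ins for the diamond measurements (the column names mirror the dataset, but the values here are fabricated so the example is self-contained):

```python
import numpy as np
import pandas as pd

# Make 'y', 'z', and 'carat' move together with 'x', mimicking the
# near-collinear size measurements in the real data; 'depth' is independent.
rng = np.random.default_rng(2)
x = rng.normal(6, 1, size=500)
data = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.05, size=500),
    "z": 0.6 * x + rng.normal(scale=0.05, size=500),
    "carat": 0.2 * x + rng.normal(scale=0.05, size=500),
    "depth": rng.normal(61, 1.5, size=500),
})

corr = data.corr(method="pearson")
print(corr.round(2))
# To visualize as a heatmap: sns.heatmap(corr, annot=True)
# with seaborn/matplotlib.
```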

From the above correlation heatmap, we can observe that the independent variables **‘x’, ‘y’, ‘z’**, and **‘carat’** are **highly correlated (Pearson coefficient > 0.9)** with each other; hence we conclude the presence of multicollinearity in the data.

We could also drop a few of the highly correlated features to remove multicollinearity, but that may result in loss of information and is not feasible for data with high dimensionality. The idea instead is to reduce the dimensionality of the data using the PCA algorithm, discarding the components with low variance.

**Principal Component Analysis (PCA)** is a common feature extraction technique in data science that employs matrix factorization to project the data into a lower-dimensional space.

To extract features from the dataset using the PCA technique, we first need to find the percentage of variance explained as the dimensionality decreases.
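A sketch of that step with scikit-learn, again on synthetic stand-ins for the encoded features (the 95% threshold is an illustrative choice, not from the original analysis):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic predictors: the 4th column nearly duplicates the 1st,
# mimicking the collinear measurements in the diamonds data.
rng = np.random.default_rng(3)
base = rng.normal(size=(500, 3))
X = np.column_stack([base, base[:, 0] + rng.normal(scale=0.05, size=500)])

# Standardize first, then fit a full PCA.
Z = StandardScaler().fit_transform(X)
pca = PCA().fit(Z)

# Cumulative percentage of variance explained per component count.
cum = np.cumsum(pca.explained_variance_ratio_)
for i, c in enumerate(cum, start=1):
    print(f"{i} components explain {c:.1%} of the variance")

# Keep just enough components to cover, say, 95% of the variance;
# the near-duplicate column makes the last component nearly useless.
n_keep = int(np.searchsorted(cum, 0.95) + 1)
print("components kept:", n_keep)
```

The transformed, decorrelated predictors would then come from `PCA(n_components=n_keep).fit_transform(Z)`.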