Avoid multicollinearity using the pd.get_dummies hack
Multicollinearity is a condition in which the independent variables are correlated with each other. Multicollinearity can cause problems when you fit the model and interpret the results. To avoid this problem, the variables of a dataset should be independent of each other.
To handle or remove multicollinearity in a dataset, we first need to confirm that the dataset is actually multicollinear. Some common signs of multicollinearity in the data are:
- Getting very high standard errors for regression coefficients
- The overall model is significant, but none of the coefficients are significant
- Large changes in coefficients when adding predictors
- High Variance Inflation Factor (VIF) and Low Tolerance
To read more about how to remove multicollinearity from a dataset using Principal Component Analysis, see my separate article on the topic.
In this article, we will see how to find multicollinearity among categorical features using a correlation matrix, and how to remove it.
About the Data:
The dataset used for this analysis is Churn Modelling from Kaggle. The problem statement is a binary classification problem, and the data has both numerical and categorical columns.
For this article, we will only look at collinearity between the categorical features “Geography” and “Gender”.
Machine learning models can only be trained on numerical features, so categorical features must first be converted into numbers. pd.get_dummies is a convenient way to do this: it one-hot encodes categorical variables, creating one binary dummy column per category.
Here, pd.get_dummies one-hot encodes the categorical features “Geography” and “Gender”.
Syntax:
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
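As a minimal sketch, assuming the Kaggle CSV is saved locally as Churn_Modelling.csv (the filename is an assumption; adjust the path to your copy), encoding the two categorical columns looks like this:

import pandas as pd

# Load the Churn Modelling dataset (filename assumed)
data = pd.read_csv('Churn_Modelling.csv')

# One-hot encode the two categorical features; the original columns
# are replaced by one dummy column per category
data = pd.get_dummies(data, columns=['Geography', 'Gender'])

print(data.filter(like='Geography_').columns.tolist())
# e.g. ['Geography_France', 'Geography_Germany', 'Geography_Spain']
print(data.filter(like='Gender_').columns.tolist())
# e.g. ['Gender_Female', 'Gender_Male']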
Correlation Heatmap:
To find the Pearson correlation coefficient between all the numerical variables in the dataset:
data.corr(method='pearson')
Method of correlation (a heatmap sketch follows the list below):
* pearson (default)
* kendall
* spearman
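Continuing from the encoding sketch above (seaborn and matplotlib are assumed to be available), a minimal sketch to compute the matrix over the dummy columns and plot it as a heatmap:

import matplotlib.pyplot as plt
import seaborn as sns

# Restrict to the dummy columns; cast bools to ints so .corr()
# treats them as numeric
dummy_cols = ['Geography_France', 'Geography_Germany', 'Geography_Spain',
              'Gender_Female', 'Gender_Male']
corr = data[dummy_cols].astype(int).corr(method='pearson')

sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation between one-hot encoded features')
plt.show()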
From the above correlation matrix, it is clearly seen that the one-hot features encoded with pd.get_dummies are highly correlated with one another.
Correlation coefficient scale:
* +1: strong correlation in the positive direction
* -1: strong correlation in the negative direction
* 0: no correlation
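To make the scale concrete, here is a tiny sketch with invented toy data: with only two gender categories, the two dummies always sum to 1, so their Pearson correlation is exactly -1.

import pandas as pd

# Toy example: complementary dummies are perfectly negatively correlated
gender = pd.Series(['Male', 'Female', 'Female', 'Male', 'Female'])
dummies = pd.get_dummies(gender, prefix='Gender').astype(int)

print(dummies.corr(method='pearson'))
#                Gender_Female  Gender_Male
# Gender_Female            1.0         -1.0
# Gender_Male             -1.0          1.0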
To avoid or remove multicollinearity after one-hot encoding with pd.get_dummies, you can drop one of the categories from each encoded feature, which removes the exact linear dependence between the dummy columns. pandas provides this feature through the drop_first=True argument of pd.get_dummies.
For example, if you have a variable gender, you don’t need both a male and a female dummy. If male=1 the person is male, and if male=0 the person is female. When a feature has hundreds of categories, dropping the first column makes little difference.
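A minimal sketch of the same idea on the toy data from before:

import pandas as pd

gender = pd.Series(['Male', 'Female', 'Female', 'Male', 'Female'])

# drop_first=True drops the first category ('Female' here), so a
# single Gender_Male dummy carries all the information
dummies = pd.get_dummies(gender, prefix='Gender', drop_first=True)
print(dummies.columns.tolist())  # ['Gender_Male']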
Correlation Heatmap for drop_first=True:
With drop_first=True, one category from each categorical column is dropped, and the correlation heatmap clearly shows that the remaining dummy features are no longer perfectly correlated.
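As a closing sketch (same assumed filename as before), re-encoding the full dataset with drop_first=True and re-checking the heatmap:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('Churn_Modelling.csv')  # filename assumed
data = pd.get_dummies(data, columns=['Geography', 'Gender'], drop_first=True)

# 'Geography_France' and 'Gender_Female' are gone after drop_first=True
dummy_cols = ['Geography_Germany', 'Geography_Spain', 'Gender_Male']
corr = data[dummy_cols].astype(int).corr(method='pearson')

sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation after drop_first=True')
plt.show()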