How to avoid multicollinearity in Categorical Data?

December 27, 2020

Avoid multicollinearity using pd.get_dummies hack

Satyam Kumar
Photo by Eric Prouzet on Unsplash

Multicollinearity refers to a condition in which the independent variables are correlated with each other. Multicollinearity can cause problems when you fit the model and interpret the results. The variables of the dataset should be independent of each other to avoid the problem of multicollinearity.

To handle or remove multicollinearity in a dataset, we first need to confirm that the dataset is actually multicollinear. Some common signs of multicollinearity in the data are:

  • Getting very high standard errors for regression coefficients
  • The overall model is significant, but none of the coefficients are significant
  • Large changes in coefficients when adding predictors
  • High Variance Inflation Factor (VIF) and low Tolerance (see the short VIF sketch after this list)

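As a quick numeric check for the last point, the sketch below computes the Variance Inflation Factor for each predictor with statsmodels. The DataFrame name X and the rough VIF threshold are illustrative assumptions, not code from the original article.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    # Return the Variance Inflation Factor for every column of X,
    # where X contains only numeric predictor columns.
    return pd.DataFrame({
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    })

# As a rule of thumb, columns with VIF well above ~5-10 are usually treated as collinear.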

For more on removing multicollinearity from a dataset using Principal Component Analysis, see my earlier article on that topic.

In this article, we will see how to find multicollinearity between categorical features using the correlation matrix, and how to remove it.

About the Data:

The dataset used for this analysis is Churn Modelling from Kaggle. It is a binary classification problem with both numerical and categorical columns.

For this article, we will only look at collinearity between the categorical features “Geography” and “Gender”.

(Image by Author)
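As a minimal sketch of the setup (the file name Churn_Modelling.csv and the column selection are assumptions based on the Kaggle dataset, not code shown in the original article):

import pandas as pd

# Load the Kaggle Churn Modelling dataset (file name assumed).
data = pd.read_csv("Churn_Modelling.csv")

# Keep only the two categorical features discussed here.
data = data[["Geography", "Gender"]]
print(data.head())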

Machine learning models can only be trained on numerical features, so categorical features must be converted first. pd.get_dummies is a convenient way to convert categorical variables into numerical ones: it one-hot encodes the categorical variables.

pd.get_dummies one-hot encodes the categorical features “Geography” and “Gender”.

Syntax:
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
(Image by Author), Encoding using pd.get_dummies()
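A minimal sketch of this encoding step, assuming the data DataFrame from the loading snippet above:

# One-hot encode the two categorical columns; drop_first defaults to False.
encoded = pd.get_dummies(data, columns=["Geography", "Gender"], drop_first=False)
print(encoded.head())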

Correlation HeatMap:

To find the Pearson correlation coefficient between all the numerical variables in the dataset:

data.corr(method='pearson')

Method of correlation:
  • pearson (default)
  • kendall
  • spearman
(Image by Author), Correlation Matrix with drop_first=False
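The heatmap itself is not reproduced here; below is a sketch of how it could be generated with seaborn (the plotting library is an assumption, since the article only shows the resulting image):

import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation between the one-hot encoded columns, shown as a heatmap.
corr = encoded.corr(method='pearson')
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation matrix, drop_first=False')
plt.show()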

From the above correlation matrix, it is clear that the one-hot encoded features produced by pd.get_dummies are highly correlated with one another.

(Image by Author), Correlation Matrix with drop_first=False for categorical features
Correlation coefficient scale:
+1: perfect positive correlation
-1: perfect negative correlation
0: no correlation

To avoid or remove multicollinearity after one-hot encoding with pd.get_dummies, you can drop one of the categories from each variable, thereby removing the collinearity between the resulting dummy features. pandas provides this option through the drop_first=True argument of pd.get_dummies.

For example, for a gender variable you don’t need both a male and a female dummy: if male=1 the person is male, and if male=0 the person is female. When a variable has hundreds of categories, dropping one column loses very little information.

(Image by Author), Encoding using pd.get_dummies() with drop_first=True
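A minimal sketch of the same encoding with drop_first=True (variable names are illustrative):

# Re-encode the categorical columns, dropping the first dummy of each
# variable to break the collinearity between its dummy columns.
encoded_reduced = pd.get_dummies(data, columns=["Geography", "Gender"], drop_first=True)
print(encoded_reduced.head())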

Correlation HeatMap for drop_first=True:

(Image by Author), Correlation Matrix with drop_first=True
(Image by Author), Correlation Matrix with drop_first=True for categorical features

With drop_first=True, one category from each categorical column is dropped, and the correlation heatmap clearly shows that the remaining dummy features are no longer highly correlated.
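To verify this numerically, the correlation matrix can be recomputed on the reduced dummy set (a sketch, continuing the assumed variable names above):

# The large off-diagonal correlations seen with drop_first=False disappear.
print(encoded_reduced.corr(method='pearson').round(2))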
