Avoid multicollinearity using the pd.get_dummies hack
Multicollinearity is a condition in which the independent variables are correlated with each other. Multicollinearity can cause problems when you fit the model and interpret the results. To avoid this problem, the variables of a dataset should be independent of each other.
To handle or remove multicollinearity in a dataset, we first need to confirm that the dataset is actually multicollinear. Some common signs of multicollinearity in the data are:
- Getting very high standard errors for regression coefficients
- The overall model is significant, but none of the coefficients are significant
- Large changes in coefficients when adding predictors
- High Variance Inflation Factor (VIF) and Low Tolerance
To read more about how to remove multicollinearity from a dataset using Principal Component Analysis, see my separate article on the topic.
In this article, we will see how to find multicollinearity among categorical features using a correlation matrix, and how to remove it.
About the Data:
The dataset used for this analysis is Churn Modelling from Kaggle. The problem statement is a binary classification problem, and the data has both numerical and categorical columns.
For this article, we will only look at collinearity between the categorical features “Geography” and “Gender”.
Machine learning models can only be trained on numerical features, so categorical features must first be converted into numbers. pd.get_dummies is a convenient way to do this: it one-hot encodes categorical variables, creating one binary dummy column per category.
Here, pd.get_dummies one-hot encodes the categorical features “Geography” and “Gender”.
Syntax:
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
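As a minimal sketch, assuming the Kaggle CSV is saved locally as Churn_Modelling.csv (the filename is an assumption; adjust the path to your copy), encoding the two categorical columns looks like this:

import pandas as pd

# Load the Churn Modelling dataset (filename assumed)
data = pd.read_csv('Churn_Modelling.csv')

# One-hot encode the two categorical features; the original columns
# are replaced by one dummy column per category
data = pd.get_dummies(data, columns=['Geography', 'Gender'])

print(data.filter(like='Geography_').columns.tolist())
# e.g. ['Geography_France', 'Geography_Germany', 'Geography_Spain']
print(data.filter(like='Gender_').columns.tolist())
# e.g. ['Gender_Female', 'Gender_Male']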
Correlation Heatmap:
To find the Pearson correlation coefficient between all the numerical variables in the dataset:
data.corr(method='pearson')
Method of correlation (a heatmap sketch follows the list below):
* pearson (default)
* kendall
* spearman
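Continuing from the encoding sketch above (seaborn and matplotlib are assumed to be available), a minimal sketch to compute the matrix over the dummy columns and plot it as a heatmap:

import matplotlib.pyplot as plt
import seaborn as sns

# Restrict to the dummy columns; cast bools to ints so .corr()
# treats them as numeric
dummy_cols = ['Geography_France', 'Geography_Germany', 'Geography_Spain',
              'Gender_Female', 'Gender_Male']
corr = data[dummy_cols].astype(int).corr(method='pearson')

sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation between one-hot encoded features')
plt.show()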
From the above correlation matrix, it is clearly seen that the one-hot features encoded with pd.get_dummies are highly correlated with one another.
Correlation coefficient scale:
* +1: strong correlation in the positive direction
* -1: strong correlation in the negative direction
* 0: no correlation
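To make the scale concrete, here is a tiny sketch with invented toy data: with only two gender categories, the two dummies always sum to 1, so their Pearson correlation is exactly -1.

import pandas as pd

# Toy example: complementary dummies are perfectly negatively correlated
gender = pd.Series(['Male', 'Female', 'Female', 'Male', 'Female'])
dummies = pd.get_dummies(gender, prefix='Gender').astype(int)

print(dummies.corr(method='pearson'))
#                Gender_Female  Gender_Male
# Gender_Female            1.0         -1.0
# Gender_Male             -1.0          1.0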
To avoid or remove multicollinearity after one-hot encoding with pd.get_dummies, you can drop one of the categories from each encoded feature, which removes the exact linear dependence between the dummy columns. pandas provides this feature through the drop_first=True argument of pd.get_dummies.
For example, if you have a variable gender, you don’t need both a male and a female dummy. If male=1 the person is male, and if male=0 the person is female. When a feature has hundreds of categories, dropping the first column makes little difference.
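A minimal sketch of the same idea on the toy data from before:

import pandas as pd

gender = pd.Series(['Male', 'Female', 'Female', 'Male', 'Female'])

# drop_first=True drops the first category ('Female' here), so a
# single Gender_Male dummy carries all the information
dummies = pd.get_dummies(gender, prefix='Gender', drop_first=True)
print(dummies.columns.tolist())  # ['Gender_Male']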
Correlation Heatmap for drop_first=True:
With drop_first=True, one category from each categorical column is dropped, and the correlation heatmap clearly shows that the remaining dummy features are no longer perfectly correlated.
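As a closing sketch (same assumed filename as before), re-encoding the full dataset with drop_first=True and re-checking the heatmap:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('Churn_Modelling.csv')  # filename assumed
data = pd.get_dummies(data, columns=['Geography', 'Gender'], drop_first=True)

# 'Geography_France' and 'Gender_Female' are gone after drop_first=True
dummy_cols = ['Geography_Germany', 'Geography_Spain', 'Gender_Male']
corr = data[dummy_cols].astype(int).corr(method='pearson')

sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation after drop_first=True')
plt.show()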