What is correlation?

Correlation describes the mutual relationship between two or more features. Suppose you want to purchase a house. A property dealer shows you several houses, and you observe that the house price increases as the size of the house increases. Here, the size of the house is strongly correlated with the price.

Now suppose you are a professional player and a huge recession hits white-collar jobs. This recession won't affect your earnings, because a white-collar recession has nothing to do with your profession. In this case, there is no correlation between the two features.

In the chi-squared test, we decide whether a categorical feature is related to the target variable by looking at the p-value.

H0 :- There is no relationship between the categorical feature and the target variable

H1 :- There is some relationship between the categorical feature and the target variable

If **p-value ≥ 0.05**, we fail to reject the null hypothesis: there is no evidence of a relationship between the target variable and the categorical feature.

If **p-value < 0.05**, we reject the null hypothesis: there is some relationship between the target variable and the categorical feature, and we keep that feature for the rest of the machine learning pipeline. Let's get started…
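As a quick, self-contained illustration of this decision rule (separate from the churn walkthrough below), here is a sketch using `scipy.stats.chi2_contingency` on a made-up contingency table of feature categories versus target classes:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Synthetic counts: rows are feature categories, columns are target classes.
# The numbers are invented purely to demonstrate the decision rule.
table = np.array([[90, 10],
                  [30, 70]])

stat, p_value, dof, expected = chi2_contingency(table)

if p_value < 0.05:
    decision = "reject H0: feature and target are related"
else:
    decision = "fail to reject H0: no evidence of a relationship"

print(p_value, decision)
```

Because the two rows have very different class proportions, the p-value comes out far below 0.05 and we reject H0.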

Note :- The chi-squared test works only with a discrete target variable. If the target variable is continuous, we should do binning first and then apply the chi-squared test.
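The binning step mentioned in the note can be sketched with `pd.cut`. The `price` column here is synthetic and used only for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up continuous variable standing in for a continuous target
toy_df = pd.DataFrame({"price": rng.uniform(50, 500, size=100)})

# Discretize into 3 equal-width bins so the chi-squared test
# can treat the target as categorical
toy_df["price_bin"] = pd.cut(toy_df["price"], bins=3,
                             labels=["low", "medium", "high"])

print(toy_df["price_bin"].value_counts())
```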

## Step 1 : Acquiring the data set and importing the essential libraries

```python
# importing all the essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split

df = pd.read_csv("https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/churn_data_st.csv", sep=",")
df.head()
```

## Step 2 : Feature Encoding

a. First, we will identify all the features that hold categorical variables.

`df.dtypes`

We will drop customerID because it has no impact on the target variable.

b. Extract all the features holding categorical variables and then perform feature encoding on each of them.

```python
# Keep only the categorical features we will encode, plus the target.
# Note: this subset (and the "Churn" target column) is assumed from the
# encodings below; adjust the column names to match your data set.
cat_df = df[["gender", "Contract", "PaperlessBilling", "Churn"]].copy()

cat_df["gender"] = cat_df["gender"].map({"Female": 1, "Male": 0})
cat_df["Contract"] = cat_df["Contract"].map({'Month-to-month': 0, 'One year': 1, 'Two year': 2})
cat_df["PaperlessBilling"] = cat_df["PaperlessBilling"].map({"Yes": 0, "No": 1})
cat_df.head()
```

## Step 3 : Applying Chi Squared test

```python
x = cat_df.iloc[:, :-1]  # independent variables
y = cat_df.iloc[:, -1]   # target variable

f_score = chi2(x, y)  # returns the chi-squared statistics and the p-values
f_score
```

```
(array([2.63667886e-01, 1.11578017e+03, 1.53480111e+02]),
 array([6.07611392e-001, 1.22794132e-244, 3.00847449e-035]))
```

```python
# printing the p-value for each categorical feature
p_value = pd.Series(f_score[1], index=x.columns)
p_value.sort_values(ascending=True, inplace=True)
p_value
```

```
Contract            1.227941e-244
PaperlessBilling     3.008474e-35
gender               6.076114e-01
dtype: float64
```

Let's understand the p-values with the help of a visualization.

```python
p_value.plot(kind="bar")
plt.xlabel("Features", fontsize=20)
plt.ylabel("p_values", fontsize=20)
plt.title("Chi-squared test based on p-value")
plt.show()
```

Looking at the plot above, we can see that the **gender** feature has a p-value of approximately 0.6, i.e. **p-value > 0.05**, hence gender has no significant relationship with the target variable.

So we will select only **Contract** and **PaperlessBilling** for further machine learning modeling.
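This selection can also be done programmatically by thresholding the p-values. The values below are the ones reported above, reused here in a small standalone sketch:

```python
import pandas as pd

# p-values from the chi-squared test output above
p_value = pd.Series({
    "Contract": 1.227941e-244,
    "PaperlessBilling": 3.008474e-35,
    "gender": 6.076114e-01,
})

# Keep only the features with a statistically significant association
selected_features = p_value[p_value < 0.05].index.tolist()
print(selected_features)  # Contract and PaperlessBilling survive; gender is dropped
```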

Note :- This technique is based on hypothesis testing. It is possible that some data is important even though the chi-squared test does not show much significance; in such a scenario, our domain knowledge plays a pivotal role.