What is correlation?

Correlation describes the mutual relationship between two or more features. Suppose you want to purchase a house. A property dealer shows you several houses, and you observe that the house price increases as the size of the house increases. Here, the size of the house is strongly correlated with the price.

Now suppose you are a professional player and a huge recession hits white-collar jobs. This recession won't affect your earnings, because a white-collar recession has nothing to do with your profession. In this case, there is no correlation between the two features.

In the chi-squared test, we decide whether a categorical feature is related to the target variable by looking at the p-value.

H0 :- There is no relationship between the categorical feature and the target variable

H1 :- There is some relationship between the categorical feature and the target variable

If **p-value ≥ 0.05**, we fail to reject the null hypothesis: there is no evidence of a relationship between the target variable and the categorical feature.

If **p-value < 0.05**, we reject the null hypothesis: there is some relationship between the target variable and the categorical feature, and we keep that feature for the rest of the machine learning pipeline. Let's get started…
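As a quick, self-contained illustration of this decision rule (separate from the churn walkthrough below), here is a sketch using `scipy.stats.chi2_contingency` on a made-up contingency table of feature categories versus target classes:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Synthetic counts: rows are feature categories, columns are target classes.
# The numbers are invented purely to demonstrate the decision rule.
table = np.array([[90, 10],
                  [30, 70]])

stat, p_value, dof, expected = chi2_contingency(table)

if p_value < 0.05:
    decision = "reject H0: feature and target are related"
else:
    decision = "fail to reject H0: no evidence of a relationship"

print(p_value, decision)
```

Because the two rows have very different class proportions, the p-value comes out far below 0.05 and we reject H0.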

Note :- The chi-squared test works only with a discrete target variable. If the target variable is continuous, we should do binning first and then apply the chi-squared test.
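The binning step mentioned in the note can be sketched with `pd.cut`. The `price` column here is synthetic and used only for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up continuous variable standing in for a continuous target
toy_df = pd.DataFrame({"price": rng.uniform(50, 500, size=100)})

# Discretize into 3 equal-width bins so the chi-squared test
# can treat the target as categorical
toy_df["price_bin"] = pd.cut(toy_df["price"], bins=3,
                             labels=["low", "medium", "high"])

print(toy_df["price_bin"].value_counts())
```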

## Step 1 : Acquiring the data set and importing the essential libraries

```python
# importing all the essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split

df = pd.read_csv("https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/churn_data_st.csv", sep=",")
df.head()
```

## Step 2 : Feature Encoding

a. First, we will identify all the features that hold categorical variables.

`df.dtypes`

We will drop customerID because it has no impact on the target variable.

b. Extract all the features holding categorical variables and then perform feature encoding on each of them.

```python
# Keep only the categorical features we will encode, plus the target.
# Note: this subset (and the "Churn" target column) is assumed from the
# encodings below; adjust the column names to match your data set.
cat_df = df[["gender", "Contract", "PaperlessBilling", "Churn"]].copy()

cat_df["gender"] = cat_df["gender"].map({"Female": 1, "Male": 0})
cat_df["Contract"] = cat_df["Contract"].map({'Month-to-month': 0, 'One year': 1, 'Two year': 2})
cat_df["PaperlessBilling"] = cat_df["PaperlessBilling"].map({"Yes": 0, "No": 1})
cat_df.head()
```

## Step 3 : Applying Chi Squared test

```python
x = cat_df.iloc[:, :-1]  # independent variables
y = cat_df.iloc[:, -1]   # target variable

f_score = chi2(x, y)  # returns the chi-squared statistics and the p-values
f_score
```

```
(array([2.63667886e-01, 1.11578017e+03, 1.53480111e+02]),
 array([6.07611392e-001, 1.22794132e-244, 3.00847449e-035]))
```

```python
# printing the p-value for each categorical feature
p_value = pd.Series(f_score[1], index=x.columns)
p_value.sort_values(ascending=True, inplace=True)
p_value
```

```
Contract            1.227941e-244
PaperlessBilling     3.008474e-35
gender               6.076114e-01
dtype: float64
```

Let's understand the p-values with the help of a visualization.

```python
p_value.plot(kind="bar")
plt.xlabel("Features", fontsize=20)
plt.ylabel("p_values", fontsize=20)
plt.title("Chi-squared test based on p-value")
plt.show()
```

Looking at the plot above, we can see that the **gender** feature has a p-value of approximately 0.6, i.e. **p-value > 0.05**, hence gender has no significant relationship with the target variable.

So we will select only **Contract** and **PaperlessBilling** for further machine learning modeling.
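This selection can also be done programmatically by thresholding the p-values. The values below are the ones reported above, reused here in a small standalone sketch:

```python
import pandas as pd

# p-values from the chi-squared test output above
p_value = pd.Series({
    "Contract": 1.227941e-244,
    "PaperlessBilling": 3.008474e-35,
    "gender": 6.076114e-01,
})

# Keep only the features with a statistically significant association
selected_features = p_value[p_value < 0.05].index.tolist()
print(selected_features)  # Contract and PaperlessBilling survive; gender is dropped
```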

Note :- This technique is based on hypothesis testing. It is possible that some data is important even though the chi-squared test does not show much significance; in such a scenario, our domain knowledge plays a pivotal role.