NEO Share

Categorical Feature Selection Using the Chi-Squared Test

December 18, 2020 by systems

akhil anand

What is correlation?


Correlation describes the mutual relationship between two or more features. Suppose you want to buy a house: a property dealer shows you several houses, and you notice that the price increases as the size of the house increases. Here, the size of the house is strongly correlated with its price.

Now suppose you are a professional sports player and a severe recession hits white-collar jobs. This recession will not affect your earnings, because a recession in white-collar jobs has nothing to do with your profession. In this case there is no correlation between the two features.
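The two situations above can be sketched numerically. This is a minimal illustration using the Pearson correlation coefficient from NumPy; every number and variable name below is made up for the example and does not come from any real data set.

```python
import numpy as np

# Hypothetical, made-up values purely for illustration.
house_size_sqft = np.array([800, 1000, 1200, 1500, 1800, 2200])
house_price = np.array([95, 120, 138, 170, 210, 250])        # rises with size
white_collar_jobs = np.array([100, 90, 80, 70, 60, 50])      # falls during the recession
player_income = np.array([61, 58, 61, 61, 58, 61])           # unrelated to the job market

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
size_price_corr = np.corrcoef(house_size_sqft, house_price)[0, 1]
income_jobs_corr = np.corrcoef(player_income, white_collar_jobs)[0, 1]

print(f"size vs price:  {size_price_corr:.3f}")    # close to 1: strong correlation
print(f"income vs jobs: {income_jobs_corr:.3f}")   # close to 0: no linear correlation
```

Pearson correlation applies to numeric features only. For categorical features, the chi-squared test described next plays a similar role.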

In the chi-squared test, we use the p-value to decide whether a categorical feature is related to the target variable.

H0 :- There is no relationship between the categorical feature and the target variable.

H1 :- There is some relationship between the categorical feature and the target variable.

If p-value ≥ 0.05, we fail to reject the null hypothesis: the test found no significant evidence of a relationship between the categorical feature and the target variable.

If p-value < 0.05, we reject the null hypothesis: there is evidence of some relationship between the categorical feature and the target variable, and we keep such features for the rest of the machine learning pipeline. Let's get started.

Note :- The chi-squared test works only with a discrete target variable. If the target variable is continuous, bin it first and then apply the chi-squared test.
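One possible way to bin a continuous target is quantile binning with pandas. The sketch below is only an illustration: the series `monthly_spend`, its values, and the three labels are hypothetical and are not part of the churn data set used later.

```python
import pandas as pd

# Hypothetical continuous target; the values are made up for illustration.
monthly_spend = pd.Series(
    [12.5, 18.0, 25.3, 31.0, 44.9, 52.1, 67.8, 80.2, 95.0],
    name="monthly_spend",
)

# Quantile binning: three classes holding roughly the same number of rows.
spend_band = pd.qcut(monthly_spend, q=3, labels=["low", "medium", "high"])

# The binned series is categorical and can serve as a discrete target.
band_counts = spend_band.value_counts().to_dict()
print(band_counts)
```

pd.cut with explicit bin edges is an alternative when the class boundaries should follow domain thresholds rather than quantiles.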

Step 1 : Acquiring the data set and importing all the essential libraries

# importing all the essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
--------------------------------------------------------------------
# load the churn data set and look at the first five rows
df=pd.read_csv("https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/churn_data_st.csv",sep=",")
df.head()
Top five rows of dataset

Step 2 : Feature Encoding

a. First, we identify all the features that hold categorical values.

df.dtypes
Figure 1

We will drop customerID because it is only an identifier and has no impact on the target variable.

b. Extract all the categorical features and then encode each of them.

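# Assumption: the original listing never defines cat_df. Here it is taken to be
# the categorical (object-dtype) columns of df, without customerID and with the
# target variable as the last column.
cat_df=df.drop(columns=["customerID"]).select_dtypes(include="object").copy()
cat_df["gender"]=cat_df["gender"].map({"Female":1,"Male":0})
cat_df["Contract"]=cat_df["Contract"].map({'Month-to-month':0, 'One year':1, 'Two year':2})
cat_df["PaperlessBilling"]=cat_df["PaperlessBilling"].map({"Yes":0,"No":1})
cat_df.head()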
Encoded Output

Step 3 : Applying the chi-squared test

x=cat_df.iloc[:,:-1]  # independent variables (encoded categorical features)
y=cat_df.iloc[:,-1]   # target variable
chi_scores=chi2(x,y)  # returns (chi-squared statistics, p-values), one entry per feature
chi_scores
[out] >> (array([2.63667886e-01, 1.11578017e+03, 1.53480111e+02]),
array([6.07611392e-001, 1.22794132e-244, 3.00847449e-035]))
--------------------------------------------------------------------
# p-value of each categorical feature, smallest first
p_value=pd.Series(chi_scores[1],index=x.columns)
p_value.sort_values(ascending=True,inplace=True)
p_value
[out] >>Contract 1.227941e-244
PaperlessBilling 3.008474e-35
gender 6.076114e-01
dtype: float64
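scikit-learn documents chi2 for non-negative features such as booleans or frequencies, so with integer-coded categories the statistic can depend on which codes were chosen. As a cross-check, the classical chi-squared test of independence can be run on a contingency table, which does not depend on the encoding. The sketch below uses scipy.stats.chi2_contingency on a small hypothetical data frame. The toy values are made up, and the column name "Churn" is an assumption, since the article does not name the target column.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data (not the churn data set): 40 made-up customers.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 20 + ["Two year"] * 20,
    "Churn": ["Yes"] * 15 + ["No"] * 5 + ["Yes"] * 3 + ["No"] * 17,
})

# Contingency table: rows are feature categories, columns are target classes.
observed = pd.crosstab(toy["Contract"], toy["Churn"])

# Test of independence; the categories need no numeric encoding.
chi2_stat, p, dof, expected = chi2_contingency(observed)

alpha = 0.05
decision = "reject H0" if p < alpha else "fail to reject H0"
print(observed)
print(f"chi2={chi2_stat:.3f}, p={p:.5f}, dof={dof}: {decision}")
```

On a real data set, the same call can be repeated for each categorical feature against the target, and the resulting p-values compared with the ones returned by chi2.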

Let’s understand the p-values with the help of a visualization.

p_value.plot(kind="bar")
plt.xlabel("Features",fontsize=20)
plt.ylabel("p_values",fontsize=20)
plt.title("Chi-squared test based on p-value")
plt.show()
Figure 2

From the plot above, we can see that the gender feature has a p-value of approximately 0.6, which is greater than 0.05, so gender has no significant relationship with the target variable.

So we will select only Contract and PaperlessBilling for further machine learning modeling.
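The selection step can also be done in code instead of by reading the plot. This is a minimal sketch: the p-values are copied from the output shown in Step 3, and the threshold name `alpha` and the list name `selected` are introduced here for illustration.

```python
import pandas as pd

# p-values copied from the Step 3 output of the article.
p_value = pd.Series({
    "Contract": 1.227941e-244,
    "PaperlessBilling": 3.008474e-35,
    "gender": 6.076114e-01,
})

alpha = 0.05  # significance level used throughout the article

# Keep only the features whose p-value is below the significance level.
selected = p_value[p_value < alpha].index.tolist()
print(selected)
```

The resulting list can then be used to subset the feature columns before model training.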

Note :- This technique is based on hypothesis testing. A feature may still be important even when the chi-squared test shows little significance for it; in such scenarios, domain knowledge plays a pivotal role.
