Relationship association between categorical features
The chi-square test is a non-parametric hypothesis test used to check whether two categorical features in bivariate data are associated. Non-parametric tests are often called distribution-free because they make very few assumptions about the data; in particular, they do not require the data to be normally distributed. They are useful when the target variable is not normally distributed, for example when it is ordinal or nominal, or when outliers are present. A different chi-square test compares the variance of a sample with the variance of the population from which the sample was taken, which is why it is called the hypothesis test for population variance.
To test whether one categorical variable is associated with, or has an effect on, another categorical variable, we state the two hypotheses shown below:
H0: Two categorical variables are independent of each other.
H1: Two categorical variables are not independent of each other.
H0 and H1 are the null hypothesis and the alternative hypothesis, respectively.
If the test leads us to reject the null hypothesis, we accept the alternative hypothesis that the two categorical variables are associated. The decision is based on the p-value: at a 5% significance level, a p-value below 0.05 means we reject the null hypothesis and conclude that the variables are associated, while a p-value above 0.05 means we do not have enough evidence to reject independence. Note that a small p-value is evidence that an association exists; it does not measure how strong the association is.
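The formula for chi-square is shown below:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed count in cell $i$ and $E_i$ is the count expected in that cell if the null hypothesis is true.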
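To make the formula concrete, here is a minimal Python sketch that sums (observed - expected)^2 / expected over all cells. The function name `chi_square_statistic` and the counts are hypothetical and chosen only for illustration; they are not the values from the table used later in this article.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all cells, given two flat lists of equal length."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical observed and expected counts (illustration only).
observed = [30, 20, 15, 35]
expected = [22.5, 27.5, 22.5, 27.5]

statistic = chi_square_statistic(observed, expected)
print(round(statistic, 3))
```

A larger value means the observed counts sit further from what independence would predict.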
A chi-square distribution with k degrees of freedom is the distribution of a sum of k squared standard normal (Z) variables, which is why it is sometimes described as a Z-squared distribution. The diagram of the chi-square distribution is shown below:
This test applies only to categorical data such as gender (male, female), color (red, green, orange, etc.), and other nominal or ordinal categories.
Many learners are unsure which test to choose, so it helps to present this in a simple, meaningful way. The tree below gives a hint on how to choose a test for bivariate data.
We will take an example of a preference between ice-cream and chocolate in adults and children. The two hypotheses are given below:
- H0: Age and preference for ice-cream and chocolate are independent.
- H1: Age and preference for ice-cream and chocolate are not independent.
Consider the table for the analysis, as shown below:
The next step is to add the row and column totals, then compute the expected value of each cell as (row total * column total) / grand total.
Now we have both the observed and the expected values. We calculate the chi-square contribution of each cell by applying the formula we saw above.
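The expected-value step can be sketched in a few lines of Python. The 2x2 counts below are hypothetical (rows stand for adults and children, columns for ice-cream and chocolate) and are not the article's actual table; `expected_counts` is an illustrative helper name.

```python
def expected_counts(table):
    """Expected count per cell under independence:
    (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical observed counts (illustration only).
# Rows: adults, children. Columns: ice-cream, chocolate.
observed = [[30, 20],
            [15, 35]]

expected = expected_counts(observed)
for row in expected:
    print(row)
```

Each expected value is what we would see in that cell if age had no influence on preference, given the same row and column totals.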
After adding all the cell values, the overall chi-square value is 4.102. Like the z statistic in a z-test, this value is a test statistic that we compare against a critical value. To get the critical chi-square value, we need the degrees of freedom (DOF), which is the number of rows minus one multiplied by the number of columns minus one.
Here we have two rows and two columns, so the DOF is as shown below:
DOF = (rows - 1) * (columns - 1) = (2 - 1) * (2 - 1) = 1
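The same rule works for a table of any shape. A small sketch, with `degrees_of_freedom` as a hypothetical helper name and placeholder tables that only carry the shape:

```python
def degrees_of_freedom(table):
    """DOF for a test of independence: (rows - 1) * (columns - 1)."""
    rows = len(table)
    columns = len(table[0])
    return (rows - 1) * (columns - 1)


# Only the shape matters here, so the cell values are placeholders.
dof_2x2 = degrees_of_freedom([[0, 0], [0, 0]])
dof_3x4 = degrees_of_freedom([[0] * 4 for _ in range(3)])
print(dof_2x2, dof_3x4)
```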
After finding the degrees of freedom, we can look up the critical chi-square value using the alpha value. The alpha value (the significance level) is one minus the chosen confidence level; for example, a 95% confidence level gives an alpha of 5%.
The alpha value can be chosen from the table shown in the photo.
In the chi-square table, with a 5% alpha value and one degree of freedom, the critical value is 3.84. We can also see this value on the chi-square distribution plot.
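If no table is at hand, the critical value for one degree of freedom can be reproduced with the Python standard library. This sketch relies on the fact that a chi-square variable with one degree of freedom is a squared standard normal variable, so the critical value is the square of the two-sided z critical value. It is valid only for DOF = 1, and the function name is hypothetical; for other degrees of freedom a statistics library such as SciPy would be the usual choice.

```python
from statistics import NormalDist


def critical_chi_square_1dof(alpha):
    """Critical chi-square value for ONE degree of freedom only.

    Chi-square with 1 DOF is Z squared, so the upper-alpha critical
    value equals the square of the two-sided z critical value.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z ** 2


critical_5_percent = critical_chi_square_1dof(0.05)
print(round(critical_5_percent, 2))  # matches the table value of 3.84
```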
We can see that the chi-square value (4.102) is greater than the critical value (3.84), so we reject the null hypothesis. If we instead choose an alpha value of 1%, the critical value is 6.64. The p-value therefore lies between 1% and 5%: with a 5% significance level we still reject the null hypothesis, but with a 1% alpha value the chi-square value is less than the critical value, so we fail to reject the null hypothesis.
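We can also compute the p-value directly instead of bracketing it between two table entries. The sketch below uses the article's chi-square value of 4.102 and a closed form that holds only for one degree of freedom; `p_value_1dof` is a hypothetical helper name.

```python
import math


def p_value_1dof(chi_square):
    """Upper-tail p-value for a chi-square statistic with ONE degree of freedom.

    For 1 DOF, P(Z^2 > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_square / 2))


p_value = p_value_1dof(4.102)  # chi-square value from the example above
print(f"p-value: {p_value:.4f}")

for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha}: {decision}")
```

The p-value comes out at roughly 0.043, which agrees with the table-based conclusion: reject at 5%, fail to reject at 1%.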
Types of Chi-Square test
- Test for independence (two-way chi-square test): good for checking the association between two categorical variables.
- Test for goodness of fit (one-way chi-square test): good for checking whether observed values differ from theoretical (expected) values.
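As a rough illustration of the goodness-of-fit variant, the sketch below checks hypothetical counts of three colors against the theory that all three are equally likely. The counts, names, and proportions are assumptions made up for this example. With three categories the DOF is 3 - 1 = 2, and for exactly two degrees of freedom the p-value has the simple closed form exp(-x / 2).

```python
import math


def goodness_of_fit(observed, expected_proportions):
    """Return the chi-square statistic and expected counts for a one-way test."""
    total = sum(observed)
    expected = [total * p for p in expected_proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


# Hypothetical counts for red, green, orange (illustration only).
observed_colors = [50, 30, 20]
equal_proportions = [1 / 3, 1 / 3, 1 / 3]

statistic, expected = goodness_of_fit(observed_colors, equal_proportions)

# Closed form valid ONLY for 2 degrees of freedom (3 categories - 1).
p_value = math.exp(-statistic / 2)

print(f"chi-square = {statistic:.3f}, p-value = {p_value:.5f}")
```

Here the p-value is far below 0.05, so for these made-up counts we would reject the idea that the three colors are equally likely.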
Conclusion:
The chi-square test is a good choice when we have categorical features in a data set and want to check whether they are associated.
Reach me on my LinkedIn. Mail me at amitprius@gmail.com.