A general problem in applied machine learning is discovering whether the input features are relevant to the outcome to be predicted. **This is the problem of feature selection.**

In classification problems where the input variables are also categorical, we can use statistical tests to determine whether the output variable is dependent on or independent of the input variables. If independent, then the input variable may be irrelevant to the problem and can be removed from the dataset.

Pearson’s chi-squared statistical hypothesis test is an example of a test for independence between categorical variables.

## Contingency Table

A **contingency table**, sometimes called a two-way frequency **table**, is a tabular summary with at least two rows and two columns used in **statistics** to present categorical data in terms of frequency counts.
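As a minimal sketch of how such a table can be built in practice (assuming pandas is available, and using a hypothetical handful of raw observations), `pandas.crosstab` counts the co-occurrences of two categorical columns:

```python
import pandas as pd

# hypothetical raw observations of two categorical variables
data = pd.DataFrame({
    'Sex': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'Interest': ['Science', 'Math', 'Art', 'Science', 'Art', 'Math'],
})

# crosstab builds the two-way frequency (contingency) table
table = pd.crosstab(data['Sex'], data['Interest'])
print(table)
```

The resulting table has one row per category of `Sex` and one column per category of `Interest`, with each cell holding a frequency count.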

For example, a table of Sex (rows) by Interest (columns) with contrived counts might look as follows:

```
        Science  Math  Art
Male         20    30   15
Female       20    15   30
```

The degrees of freedom for the chi-squared distribution is calculated based on the size of the contingency table as:

**degrees of freedom:** (rows - 1) * (cols - 1)
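For the 2x3 Sex-by-Interest table above, this works out as:

```python
# degrees of freedom = (rows - 1) * (cols - 1)
rows, cols = 2, 3  # the 2x3 Sex-by-Interest table above
dof = (rows - 1) * (cols - 1)
print(dof)  # 2
```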

In terms of a p-value and a chosen significance level (alpha), the test can be interpreted as follows:

If **p-value <= alpha**: significant result, reject the null hypothesis (H0), dependent.

If **p-value > alpha**: not a significant result, fail to reject the null hypothesis (H0), independent.

The Pearson’s chi-squared test for independence can be calculated in Python using the chi2_contingency() SciPy function.

The function takes an array as input representing the contingency table for the two categorical variables. It returns the calculated statistic and **p-value** for interpretation as well as the calculated **degrees of freedom** and table of expected frequencies.

`stat, p, dof, expected = chi2_contingency(table)`
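As a quick sketch of the call (using the contrived Sex-by-Interest counts from earlier, and assuming SciPy is installed):

```python
from scipy.stats import chi2_contingency

# contrived Sex (rows) by Interest (columns) counts from above
table = [[20, 30, 15],
         [20, 15, 30]]

stat, p, dof, expected = chi2_contingency(table)
print('stat=%.3f, p=%.3f, dof=%d' % (stat, p, dof))
print(expected)
```

For these particular counts the proportions differ noticeably between the rows, so the test reports dependence (the p-value is well below the usual 5% significance level).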

We can interpret the statistic by retrieving the critical value from the chi-squared distribution for the probability and number of degrees of freedom.

For example, a probability of 95% can be used, suggesting that the finding of the test is quite likely given the assumption of the test that the variables are independent. If the statistic is less than or equal to the critical value, we fail to reject this assumption; otherwise, it can be rejected.

```python
# interpret test-statistic
prob = 0.95
critical = chi2.ppf(prob, dof)
if abs(stat) >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
```

We can also interpret the p-value by comparing it to a chosen significance level, which would be 5%, calculated by inverting the 95% probability used in the critical value interpretation.

```python
# interpret p-value
alpha = 1.0 - prob
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
```

We can tie all of this together and demonstrate the chi-squared significance test using a contrived contingency table.

A contingency table is defined below that has a different number of observations for each population (row), but a similar proportion across each group (column). Given the similar proportions, we would expect the test to find that the groups are similar and that the variables are independent (fail to reject the null hypothesis or H0).

```python
table = [[10, 20, 30],
         [6, 9, 17]]
```

**The complete example is listed below.**

```python
# chi-squared test with similar proportions
from scipy.stats import chi2_contingency
from scipy.stats import chi2

# contingency table
table = [[10, 20, 30],
         [6, 9, 17]]
print(table)
stat, p, dof, expected = chi2_contingency(table)
print('dof=%d' % dof)
print(expected)

# interpret test-statistic
prob = 0.95
critical = chi2.ppf(prob, dof)
print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat))
if abs(stat) >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

# interpret p-value
alpha = 1.0 - prob
print('significance=%.3f, p=%.3f' % (alpha, p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
```

Running the example first prints the contingency table. The test is calculated and the degrees of freedom (dof) is reported as 2.

Next, the calculated expected frequency table is printed, and an eyeball check of the numbers confirms that it closely matches the observed contingency table.

The critical value is calculated and interpreted, finding that indeed the variables are independent (fail to reject H0). The interpretation of the p-value makes the same finding.

```
[[10, 20, 30], [6, 9, 17]]
dof=2
[[10.43478261 18.91304348 30.65217391]
 [ 5.56521739 10.08695652 16.34782609]]
probability=0.950, critical=5.991, stat=0.272
Independent (fail to reject H0)
significance=0.050, p=0.873
Independent (fail to reject H0)
```

Thank you to Regex Software for providing the winter internship.
