## An intuitive mathematical introduction

## Using local mutual information to find biased terms in NLP datasets, and why it should be preferred over pointwise mutual information

[In my last article, I scratched the surface of the different reasons NLP datasets become biased. Feel free to take a look, as this article builds upon it!]

As seen earlier, datasets tend to become biased when certain terms are associated with one particular label. Models trained on such datasets capture this association and perform poorly when the context of these terms is inverted. For example, a model that has mostly seen the term *kill* in ‘hateful’ tweets will be *biased* toward predicting any new tweet containing this term as ‘hateful’, even when it is actually ‘non-hateful’.

In this article, I will take this a step further by isolating the terms in a dataset that are most likely to introduce bias.

I will go through two common derivatives of mutual information that quantify how strongly terms correlate with labels.
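To make the idea of term–label correlation concrete, here is a minimal sketch of how the two quantities can be computed from co-occurrence counts, using the standard definitions of pointwise mutual information (PMI) and local mutual information (LMI, i.e. PMI weighted by the joint probability). The toy corpus and token-level counting scheme are my own assumptions for illustration, not taken from the article.

```python
import math
from collections import Counter

# Hypothetical toy corpus of (text, label) pairs, for illustration only
docs = [
    ("they will kill us all", "hateful"),
    ("kill the lights please", "non-hateful"),
    ("i hate this kill them", "hateful"),
    ("what a lovely day", "non-hateful"),
]

# Count (term, label) co-occurrences at the token level
pair_counts = Counter()
term_counts = Counter()
label_counts = Counter()
total = 0
for text, label in docs:
    for term in text.split():
        pair_counts[(term, label)] += 1
        term_counts[term] += 1
        label_counts[label] += 1
        total += 1

def pmi(term, label):
    # PMI(t, l) = log2( p(t, l) / (p(t) * p(l)) )
    p_tl = pair_counts[(term, label)] / total
    p_t = term_counts[term] / total
    p_l = label_counts[label] / total
    return math.log2(p_tl / (p_t * p_l))

def lmi(term, label):
    # LMI(t, l) = p(t, l) * PMI(t, l) — weighting by the joint
    # probability down-weights rare terms whose PMI is inflated
    p_tl = pair_counts[(term, label)] / total
    return p_tl * pmi(term, label)
```

On this toy corpus, `pmi("kill", "hateful")` is positive because *kill* appears more often under ‘hateful’ than chance would predict, while `pmi("kill", "non-hateful")` is negative; the LMI scores preserve those signs but scale them by how frequent the pair actually is.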