Importance of #StratifiedKfold in #machinelearningmodels

When we train an ML model, we split the entire dataset into training_set and test_set using the train_test_split() function from sklearn. The problem is that every value of the random_state parameter produces a different split, so we get a different accuracy each time and can't point to one accuracy for our model.

The train_test_split() function splits the dataset into training_set and test_set by #randomsampling.
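To see the first problem in action, here is a minimal sketch. The synthetic dataset from make_classification and the LogisticRegression model are assumptions picked only for illustration; any dataset and classifier would do. The accuracy usually changes as random_state changes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (an assumption, not a real dataset).
X, y = make_classification(
    n_samples=200, n_features=5, weights=[0.8, 0.2], random_state=0
)

accuracies = {}
for seed in range(5):
    # Same data, same model; only the random split changes.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracies[seed] = model.score(X_test, y_test)

for seed, acc in accuracies.items():
    print(f"random_state={seed}: accuracy={acc:.2f}")
```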

What is random sampling and stratified sampling?

Suppose you want to run a survey and decide to call 1000 people from a particular state. If you pick them purely at random, you might end up with 1000 men, 1000 women, or 900 women and 100 men. Based on those 1000 opinions you can't decide what the entire state thinks of your product. This is random sampling.

In stratified sampling, suppose the population of that state is 51.3% male and 48.7% female. To choose 1000 people from that state, you pick 513 men (51.3% of 1000) and 487 women (48.7% of 1000), i.e. 513 men + 487 women (total = 1000 people), and ask their opinion. This group of people represents the entire state. This is called stratified sampling.
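The quota arithmetic for the survey can be sketched in a few lines. The names population_share and quotas are made up for this example.

```python
# Share of each group in the state's population (figures from the example).
population_share = {"male": 0.513, "female": 0.487}
sample_size = 1000

# Each group gets a quota proportional to its share of the population.
quotas = {group: round(share * sample_size)
          for group, share in population_share.items()}

print(quotas)                # {'male': 513, 'female': 487}
print(sum(quotas.values()))  # 1000
```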

Why is random sampling not preferred in machine learning?

Let's consider a binary classification problem. Suppose our dataset consists of 100 samples, of which 80 belong to the negative class {0} and 20 to the positive class {1}.

Random sampling:

Suppose we use random sampling to split the dataset into training_set and test_set in an 8:2 ratio. In the worst case we might get all 80 negative-class {0} samples in training_set and all 20 positive-class {1} samples in test_set. The model then never sees a positive example during training, so when we test it on test_set we will obviously get a bad accuracy score. Even in less extreme cases, the class proportions in the two sets are left to chance.
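A rough sketch of how the class mix drifts under random sampling, using the 80/20 dataset from the example. The feature array X is a placeholder assumption, since only the labels matter here. The all-positive test_set is very unlikely in practice, but the number of positives in test_set is not guaranteed to be 4.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 80 negative {0} and 20 positive {1} samples, as in the example.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder features (assumption)

positives_in_test = []
for seed in range(5):
    # Plain random split: the labels are not taken into account.
    _, _, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    positives_in_test.append(int(y_test.sum()))
    print(f"random_state={seed}: {int(y_test.sum())} positives "
          f"in a test_set of {len(y_test)}")
```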

Stratified sampling:

In stratified sampling, training_set consists of 64 negative-class {0} samples (80% of 80) and 16 positive-class {1} samples (80% of 20), i.e. 64{0} + 16{1} = 80 samples, which keeps the class proportions of the original dataset. Similarly, test_set consists of 16 negative-class {0} samples (20% of 80) and 4 positive-class {1} samples (20% of 20), i.e. 16{0} + 4{1} = 20 samples, which also keeps the proportions of the entire dataset. With this kind of train-test split the model sees both classes during training, and the accuracy measured on test_set is a more trustworthy estimate.
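In sklearn, one way to get this split is the stratify parameter of train_test_split(). A minimal sketch on the same 80/20 labels follows; the feature array X is again a placeholder assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder features (assumption)

# stratify=y keeps the 80:20 class ratio in both parts of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_counts = np.bincount(y_train)  # samples per class in training_set
test_counts = np.bincount(y_test)    # samples per class in test_set

print(train_counts)  # [64 16]
print(test_counts)   # [16  4]
```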

What is the solution for the mentioned problems?

The solution to the first problem, where we get a different accuracy score for each value of the random_state parameter, is K-Fold Cross-Validation: the model is trained and tested on k different splits and the scores are averaged. But K-Fold Cross-Validation also suffers from the second problem, i.e. the folds are built without looking at the class labels.
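To illustrate the second problem with plain K-Fold, here is a small sketch on the 80/20 labels, deliberately kept sorted by class and split without shuffling. This is an assumed worst case, where one fold ends up holding every positive sample. Passing shuffle=True mixes the rows, but the class ratio per fold is still left to chance.

```python
import numpy as np
from sklearn.model_selection import KFold

# Labels sorted by class: 80 negatives first, then 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder features (assumption)

kfold = KFold(n_splits=5)  # no shuffling, labels are ignored

# Count the positive samples that land in each test fold.
positives_per_fold = [int(y[test_idx].sum())
                      for _, test_idx in kfold.split(X)]

print(positives_per_fold)  # [0, 0, 0, 0, 20]
```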

The solution to both the first and the second problem is Stratified K-Fold Cross-Validation.

What is Stratified K-Fold Cross-Validation?

Stratified k-fold cross-validation works the same way as plain k-fold cross-validation, except that it builds each fold by stratified sampling instead of random sampling, so every fold keeps the class proportions of the full dataset.
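A minimal sketch with sklearn's StratifiedKFold on the same 80/20 labels. The random features and the LogisticRegression model are assumptions made only so that there is something to score. Every test fold gets 16 negatives and 4 positives, and the mean of the fold scores gives one accuracy figure to report.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

y = np.array([0] * 80 + [1] * 20)

# Toy features (assumption): noise, shifted for the positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + 2.0 * y[:, None]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# (negatives, positives) in each test fold.
fold_counts = [tuple(np.bincount(y[test_idx], minlength=2).tolist())
               for _, test_idx in skf.split(X, y)]
print(fold_counts)  # five folds, each (16, 4)

# One accuracy per fold; the mean is the figure to report.
scores = cross_val_score(LogisticRegression(), X, y, cv=skf)
print(scores.mean())
```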