We usually assume that the training set, development set, and test set all come from the same distribution. But this is not a strict requirement; in certain cases the sets can come from different distributions.
For example, let's consider a car-detector application.
The initial dataset is created by collecting images from users through a mobile application.
From this we create the train, development, and test sets. If there are 10,000 images in total, we might split them into 5,000 for train, 3,000 for dev, and 2,000 for test.
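The split above can be sketched as a simple shuffle-and-slice over the image list. This is a minimal illustration, assuming the images are represented by hypothetical file names; any real pipeline would work with actual image paths or arrays.

```python
import random

def split_dataset(items, train_n=5000, dev_n=3000, test_n=2000, seed=42):
    """Shuffle and slice a list of examples into train/dev/test sets."""
    assert len(items) >= train_n + dev_n + test_n
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = list(items)
    rng.shuffle(shuffled)
    train = shuffled[:train_n]
    dev = shuffled[train_n:train_n + dev_n]
    test = shuffled[train_n + dev_n:train_n + dev_n + test_n]
    return train, dev, test

# hypothetical file names standing in for the 10,000 user images
images = [f"img_{i}.jpg" for i in range(10000)]
train, dev, test = split_dataset(images)
print(len(train), len(dev), len(test))  # 5000 3000 2000
```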
Setting the test set aside for now, let's build an initial model using the train set. Suppose accuracy is 90% on train and 80% on development.
We take 97% as the optimum, acceptable accuracy. Here the entire dataset comes from a single distribution, so both bias and variance need improvement. If we can reduce the avoidable bias to zero, i.e. reach 97% train accuracy using bias-reducing techniques, then we can simply add more data to the model and expect the variance (the gap between train and development accuracy) to shrink.
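The decomposition described above can be written down directly: avoidable bias is the gap between the optimum and the train accuracy, and variance is the gap between train and development accuracy. A minimal sketch, using the figures from the text:

```python
def diagnose(train_acc, dev_acc, optimum_acc=0.97):
    """Decompose the error into avoidable bias and variance."""
    avoidable_bias = optimum_acc - train_acc  # how far training is from optimum
    variance = train_acc - dev_acc            # train-to-dev generalisation gap
    return avoidable_bias, variance

bias, var = diagnose(train_acc=0.90, dev_acc=0.80)
print(round(bias, 2), round(var, 2))  # 0.07 0.1
```

Here the 7-point bias gap and 10-point variance gap both exceed zero, which is why both bias- and variance-reduction are called for.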
Now suppose the data available to build the model is not enough: we expect real-time data to include kinds of images the users did not provide, so we bring in data from other sources. We take 20,000 car images from the internet. Of the 10,000 images provided by users, we take 5,000 and add them to the internet images, forming a new train set of 25,000 images, and keep the remaining 5,000 for the development and test sets.
Since this internet-heavy train set has a new distribution, we have to ensure a model built from it still works as expected, and we cannot rely on the development set alone to tell us why it fails. So from the new train set, split off 5,000 images as a train-development set and keep it aside. Now use the remaining 20,000 images to train a model. Say its accuracy is 95% on the train set; testing it against the train-development set yields 85%. Then the model clearly needs improvement before any distribution question arises: train and train-development share the same distribution, so this gap reflects poor generalisation within the training distribution itself, and adding yet more data of the same kind will not fix it on its own. Apply techniques to close this gap; if it still does not improve, sample from the train and train-development sets and perform manual error analysis.
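The construction of these splits can be sketched as follows. The 2,500/2,500 division of the remaining user images into dev and test is an assumption; the text only says the remaining 5,000 go to development and test.

```python
import random

def build_mismatched_splits(user_imgs, internet_imgs, seed=0):
    """Combine user and internet images into one training pool, then carve
    out a train-dev set that shares the training distribution.
    NOTE: the 2,500/2,500 dev/test split below is an assumed detail."""
    rng = random.Random(seed)
    users = list(user_imgs)
    rng.shuffle(users)
    pool = list(internet_imgs) + users[:5000]  # 20,000 + 5,000 = 25,000
    rng.shuffle(pool)
    train_dev = pool[:5000]   # same distribution as the train set
    train = pool[5000:]       # 20,000 examples for training
    dev = users[5000:7500]    # user distribution only
    test = users[7500:10000]
    return train, train_dev, dev, test

users = [f"user_{i}.jpg" for i in range(10000)]       # placeholder names
internet = [f"web_{i}.jpg" for i in range(20000)]
train, train_dev, dev, test = build_mismatched_splits(users, internet)
print(len(train), len(train_dev), len(dev), len(test))  # 20000 5000 2500 2500
```

The key design point is that train-dev is carved out of the *shuffled* training pool, so it mirrors the train distribution, while dev and test contain only user images, mirroring what the model will face in production.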
This train-development set is created to diagnose how much of the error is caused by bias, by variance, and by data mismatch. Since it shares the training distribution, a model built on the train set should be able to generalise over the train-development set.
Once the train-development accuracy reaches ~95%, test on the development set. Say the resulting development accuracy is 85%. This is a data-mismatch problem, since the model was able to generalise to the train-development set (same distribution as training) but not to the development set.
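The full diagnosis across the three accuracy gaps can be summarised in one small function. The exact figures below are illustrative, taken from the running example (train accuracy is assumed to sit at the ~95% level the train-dev set reached):

```python
def error_sources(optimum, train_acc, train_dev_acc, dev_acc):
    """Attribute the accuracy drop at each stage to its likely cause."""
    return {
        "avoidable_bias": optimum - train_acc,     # training below optimum
        "variance": train_acc - train_dev_acc,     # same-distribution gap
        "data_mismatch": train_dev_acc - dev_acc,  # distribution-shift gap
    }

gaps = error_sources(optimum=0.97, train_acc=0.95,
                     train_dev_acc=0.95, dev_acc=0.85)
# data_mismatch dominates here: 0.10 versus 0.02 bias and 0.00 variance
```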
To address a data-mismatch problem, start by analysing which features differ between the train set and the development set. Then try to find more training data that better matches the development set, although in practice there may be constraints on collecting more and more data.
When matching data is scarce, one approach is to create synthetic datasets artificially. In a voice-recognition system, for example, training data can be created by mixing multiple audio sources to mimic the real environment. Say the development data contains recordings made in open air.
Then we can mix wind sounds into clean recorded audio. But this can bias the model toward the particular wind sounds we repeated: as humans we may not be sensitive to a repeated wind clip, but a neural network model can easily latch onto it. So as a caution, the artificial source of data (wind noise) has to be diverse and mixed in proportion to the original data (clean recordings).
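A minimal sketch of this mixing idea, assuming signals are plain lists of samples (real audio I/O, resampling, and loudness matching are omitted). The point is structural: the noise is drawn from a pool of distinct clips rather than one clip reused everywhere.

```python
import random

def mix_wind(clean, wind_clips, gain=0.3, rng=None):
    """Overlay a randomly chosen wind clip onto a clean recording.
    Drawing from many distinct clips (instead of reusing one) keeps
    the model from overfitting to a single repeated noise signature."""
    rng = rng or random.Random()
    wind = rng.choice(wind_clips)              # pick a different clip each call
    n = min(len(clean), len(wind))
    return [clean[i] + gain * wind[i] for i in range(n)]

clean = [0.0, 0.5, -0.5, 0.25]                # toy waveform samples
wind_clips = [[0.1, -0.1, 0.1, -0.1],         # several distinct noise clips
              [0.05, 0.05, -0.05, -0.05]]
noisy = mix_wind(clean, wind_clips, rng=random.Random(1))
```

In a real pipeline, `wind_clips` would hold many recordings of varied wind conditions, and the mixing ratio would be tuned so synthetic noise does not dominate the clean data.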
For the car-recogniser app, you could likewise consider images from video games, but such images cover only a very small portion of the variation found in the real world.