In general, we know that Machine Learning, and especially Deep Learning, requires BIG data to work well. Nobody can definitively say how much data is needed; it is specific to the problem at hand. Also, apart from FAANG (Facebook, Apple, Amazon, Netflix, and Google), most companies do not have access to that kind of BIG data.
So, what happens when you don’t have BIG data?
Overfitting.
Overfitting, as you know, is a well-known phenomenon in Machine Learning. When the data is not BIG enough, Machine Learning models overfit and do not generalize well in the real world. For the non-ML folks, let us try and get some intuition about overfitting.
Say you’re preparing for an exam. You got hold of a nice little guide or book, and you trained yourself for the exam by memorizing every word in that book. You are confident that your answers will be 100% accurate. In the exams, though, questions are based on the application of subject knowledge and not just theory. In such a situation, your chances of being 100% accurate reduce drastically.
Similarly, ML models, especially complex ones like Neural Networks, overfit small training data by memorizing it without learning the underlying patterns. In such a situation, the model performs exceptionally well on the training data but fails miserably on the test data or in the real world.
Another important consideration is that in a small dataset it is challenging to spot outliers (significantly different observations) and to detect noise (random fluctuations). Both can adversely impact the performance of your ML model.
Now that we have understood the pitfalls of not having big data, let’s put ourselves into various data situations and see what we can do about them. We will cover a lot of different tools and techniques. Some of them are big topics on their own; the purpose here is not to go into depth but to build a simple intuition for each technique.
In this article, we will look at the following data situations and discuss possible solutions.
- Small Data
- No Data
- Rare Data
- Costly Data
- Imbalanced Data
1. Small Data
You are working with only a few hundred or a few thousand examples.
Again, the exact threshold is specific to the problem at hand, but let’s use these numbers for the discussion here. So let’s see what you can do when you’re working with small data.
1.1 Data Augmentation
Data augmentation refers to increasing the number of data points by adding variations to your data. This technique prevents overfitting and helps your model generalize better.
For image data, data augmentation can be done by the following (see the code sketch after this list):
- Modifying lighting conditions
- Random cropping
- Horizontal flipping
- Applying transformations like translation, rotation, or shearing
- Zoom-in, zoom-out
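As a rough illustration, here is a minimal sketch of such an augmentation pipeline using torchvision transforms. The specific parameter values (crop size, jitter strength, rotation range) are arbitrary choices for illustration, not recommendations.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline for image training data.
train_transforms = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.3),                  # vary lighting conditions
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),                   # random crop + zoom in/out
    T.RandomHorizontalFlip(p=0.5),                                 # horizontal flip
    T.RandomAffine(degrees=15, translate=(0.1, 0.1), shear=10),   # rotation, translation, shearing
    T.ToTensor(),
])
```

You would typically pass such a transform to your training dataset so that each image is randomly perturbed every time it is loaded, effectively multiplying the variety the model sees.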
You can also apply data augmentation to text data by the following (see the sketch after this list):
- Back translation
- Synonym replacement
- Random insertion, swap, and deletion
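To make the random swap and deletion operations concrete, here is a minimal pure-Python sketch. The function names and probabilities are mine, not from any particular library, and this is only meant to convey the idea.

```python
import random

def random_swap(words, n=1):
    """Randomly swap the positions of two words, n times."""
    words = words.copy()
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    """Delete each word with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

sentence = "small datasets make models overfit".split()
print(random_swap(sentence))      # e.g. ['small', 'overfit', 'make', 'models', 'datasets']
print(random_deletion(sentence))  # e.g. ['small', 'make', 'models', 'overfit']
```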
In addition to these, there are techniques to learn an optimal data augmentation policy for your dataset, referred to as AutoAugment. AutoAugment uses a search algorithm to find the best policy such that the neural network yields the highest validation accuracy on a target dataset.
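If you just want to apply an already-learned AutoAugment policy rather than run the search yourself, recent versions of torchvision ship some of the published policies. A minimal sketch, assuming torchvision 0.10 or later, and using the ImageNet policy purely as an example:

```python
import torchvision.transforms as T

# Apply a pre-learned AutoAugment policy (here, the one searched on ImageNet).
autoaugment_transforms = T.Compose([
    T.AutoAugment(policy=T.AutoAugmentPolicy.IMAGENET),
    T.ToTensor(),
])
```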