Data Blending: A Simple Yet Powerful Tool to Counter Overfitting
Dr. Eren Unlu, Data Scientist and Machine Learning Engineer @Datategy, Paris
I don’t know why, but nothing feels better to me than delving into a good old tabular data challenge. Most of the time, our workload at the company shifts solely towards complex image segmentation, video analysis, and some hefty time-series forecasting. But I just miss the simplicity of 2D tabular data. Maybe it’s just a longing for laziness, for 30-second training times, or pure nostalgia for my junior years in machine learning. But I am sure there are a lot of folks out there who share this tabular taste of mine 🙂 .
Hidden Adversarial Attacks on Image Recognition
You cannot just get everything in life; life is nothing but a proper management of trade-offs. And Artificial Neural Networks (ANNs) come with several costs at the expense of their augmented performance, the most notable being overfitting.
There is a pack of tools and strategies for countering overfitting in ANNs, from simple dropout to data augmentation and carefully scrutinized training policies. But no matter how many verifications you make with test datasets, believe me: your ANN is always overfit, even if you don’t notice it for your use case and the results satisfy you.
The most notable proof of this omnipresent overfitting of ANNs is the phenomenon of hidden adversarial noise, a well-studied concept in deep learning based computer vision systems [1]. Even if you add a very small random noise to the pixels, so small that you cannot perceive the change visually, it drastically perturbs the neural network’s performance. This is almost entirely due to the omnipresent overfit effect: infinitesimally small disturbances in the input can yield unbounded gradients over the chaotic pathways of ANNs.
There are numerous methods to counter these types of effects, already constituting a vast literature: metric learning based approaches such as contrastive losses (some involving complex discriminator networks), elegant data augmentation techniques, pseudo-Bayesian Monte Carlo dropout, and so on.
One very simple data augmentation strategy to counter the hidden adversarial noise effect is to simply blend two (or more) images, generating a totally new type of data point with a fuzzy class [2]. The simplest form of blending is just a weighted sum of two images of two different classes (e.g. 0.6 × (pixels of a cat) + 0.4 × (pixels of a dog) ⇒ class = 60% cat, 40% dog).
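In code, this blending is essentially a one-liner. A minimal numpy sketch with two dummy “images” (the array names and sizes are purely illustrative):

```python
import numpy as np

# Two "images" of different classes, as float arrays in [0, 1]
cat_pixels = np.random.rand(32, 32, 3)
dog_pixels = np.random.rand(32, 32, 3)

w = 0.6  # blending weight for this pair
blended_pixels = w * cat_pixels + (1.0 - w) * dog_pixels
blended_label = np.array([w, 1.0 - w])  # fuzzy class: 60% cat, 40% dog
```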
What would happen if we simply applied this to a tabular dataset?
When I was reading these papers, I wondered how this would affect the performance of neural networks on tabular datasets. And when we need a representative dataset for binary classification with mixed categorical and continuous features, where do we turn? Yes, you guessed it right! Let’s try this simple technique on Titanic.
Preprocessing
Let’s keep everything simple; this is just a self-educative experiment. I keep only 2 categorical features (Sex (male or female) and Pclass (1, 2, 3)) and 2 continuous features (Fare and Age).
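A minimal preprocessing sketch along these lines (the column names follow the standard Kaggle Titanic CSV; the file path is illustrative):

```python
import pandas as pd

# Standard Kaggle Titanic training file; the path is illustrative
df = pd.read_csv("titanic_train.csv")

# Keep 2 categorical and 2 continuous features, plus the target
df = df[["Sex", "Pclass", "Fare", "Age", "Survived"]].dropna()
df["Sex"] = (df["Sex"] == "male").astype(int)  # encode Sex as 0/1

X = df[["Sex", "Pclass", "Fare", "Age"]].to_numpy(dtype=float)
y = df["Survived"].to_numpy()  # hard 0/1 labels
```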
A Class for Data Processing and Scaling
I wrote a class tasked with every operation necessary to prepare both the original dataset and the synthetic one before feeding them to the neural networks. One important remark here: the single most recurrent mistake that newcomers to machine learning make is to include test samples while fitting the scalers (sklearn’s convenient fit_transform function is to blame), so we make sure we don’t do this.
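A minimal sketch of such a class, assuming a StandardScaler over the features (the class and method names here are illustrative, not the original code):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

class TitanicDataProcessor:
    """Scales features with training statistics only, and serves either
    the plain training set or a randomly blended version of it."""

    def __init__(self, X_train, y_train, X_test):
        self.scaler = StandardScaler()
        # Fit the scaler on the training samples only, never on the test set
        self.X_train = self.scaler.fit_transform(X_train)
        self.X_test = self.scaler.transform(X_test)
        self.y_train = y_train

    def get_regular(self):
        # The plain (scaled) training set
        return self.X_train, self.y_train

    def get_blended(self):
        # Shuffle two copies of the training set and mix them with
        # per-sample random weights, fully vectorized in numpy
        n = self.X_train.shape[0]
        idx_a = np.random.permutation(n)
        idx_b = np.random.permutation(n)
        w = np.random.rand(n, 1)  # one blending weight per synthetic sample
        X_blend = w * self.X_train[idx_a] + (1 - w) * self.X_train[idx_b]
        y_blend = w[:, 0] * self.y_train[idx_a] + (1 - w[:, 0]) * self.y_train[idx_b]
        return X_blend, y_blend  # same size as the original training set
```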
Let us deconstruct the code a little bit. We have 2 separate functions: one returns the training dataset directly, and one returns a randomly blended version. As you can see, I simply shuffle two copies of the training dataset and take a random weighted sum of them; you can think of this as a vectorial way of mixing pairs of different samples. We want to leverage the extreme parallelism of numpy as much as possible (one humble piece of advice to you folks in Python: consistently ask yourselves, “Can I do this in a vectorial manner with numpy?”). Also, note that in both cases (blended or not) the datasets are of equal size, for the sake of fairness in our experiment.
Going Extreme: Testing Overfitting Robustness with a Very Small Number of Training Samples
By far the most under-appreciated machine learning model in scikit-learn is the Multilayer Perceptron (an artificial neural network). Most folks out there are not even aware it exists. When a high-level neural network framework is required we tend to recall Keras first, but if you deal with simple tabular data and don’t need to delve into the details of your model, MLPRegressor and MLPClassifier are your one-line solution. Their default parameters are sensible and early stopping is available out of the box. You just need to define your layers as a simple list of neuron counts.
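For instance, a two-hidden-layer network really is a single line (the layer sizes here are illustrative):

```python
from sklearn.neural_network import MLPClassifier

# Two hidden layers of 64 and 32 neurons, defined as a simple list
clf = MLPClassifier(hidden_layer_sizes=[64, 32], early_stopping=True)
```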
Note that for the regular case (binary classification) we need MLPClassifier (the sklearn module will automatically set itself up with a sigmoid output activation and binary cross-entropy loss). The blended case requires MLPRegressor, as we now have fuzzy targets such as “0.45 survived”. (We will simply round the regressor’s outputs on the test inputs to predict their classes.)
In order to test the potential of simple linear blending, I want to perform the experiments with a small number of training data points (90% test, 10% training). Here’s the snippet for the 10 validation rounds:
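A minimal sketch of this loop, reusing the illustrative processor class from above (the hyperparameters are assumptions, not the original settings):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.metrics import accuracy_score

regular_acc, blended_acc = [], []

for seed in range(10):  # 10 independent rounds
    # Deliberately tiny training set: 10% train, 90% test
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.1, random_state=seed, stratify=y)

    proc = TitanicDataProcessor(X_tr, y_tr, X_te)

    # Regular case: hard 0/1 labels with a classifier
    X_r, y_r = proc.get_regular()
    clf = MLPClassifier(hidden_layer_sizes=[64, 32],
                        early_stopping=True, random_state=seed)
    clf.fit(X_r, y_r)
    regular_acc.append(accuracy_score(y_te, clf.predict(proc.X_test)))

    # Blended case: fuzzy targets with a regressor, rounded at test time
    X_b, y_b = proc.get_blended()
    reg = MLPRegressor(hidden_layer_sizes=[64, 32],
                       early_stopping=True, random_state=seed)
    reg.fit(X_b, y_b)
    preds = (reg.predict(proc.X_test) > 0.5).astype(int)
    blended_acc.append(accuracy_score(y_te, preds))

print("regular :", np.mean(regular_acc))
print("blended :", np.mean(blended_acc))
```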
And these are the test accuracies of each of the 10 experiments (orange: blended, blue: regular). Clearly, even a very simple 2-sample linear blending can improve the accuracy slightly.
Conclusion
I just wanted to test the simple image blending strategy on tabular data, and clearly the approach has potential there. The advantages against overfitting are twofold: (i) by introducing fuzziness, we force our neural networks to learn over smoother covariate spaces rather than rigid ones; (ii) we can assume that yet-unseen test samples will fall into the feature ranges between training data points, and this method increases the likelihood that those in-between regions are covered during training.
However, note that generating synthetic training data is a vast discipline, of which this blending is probably the simplest possible instance. Enter also metric learning, contrastive losses, variational autoencoders, and the rest.
Cheers,
[1] Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
[2] Tokozume, Y., Ushiku, Y., & Harada, T. (2017). Learning from between-class examples for deep sound recognition. arXiv preprint arXiv:1711.10282.
[3] https://vitalab.github.io/article/2018/04/17/BCLearningClass.html