This had happened so many times that — machine learning models being naturally lazy — the model gave up trying to learn and instead memorized the data, something a neural network has no trouble doing at all. Hence, when presented with data it had truly never seen before, it flunked.
The model had already seen the validation data in previous training sessions, so it scored an almost-perfect 0.9968 — fooling me. This is, if you recall the performance of the model earlier, why the validation score could be higher than the train score, which is a rather bizarre and rare phenomenon.
So here is one important lesson that has withstood the test of time throughout every endeavor: if it seems too good to be true, it probably is. (This is, in particular, a great way to identify data leakage.)
This is one sort of data leakage. Other ways data can leak include:
- Preprocessing. If you process the data before splitting it, information may leak. For instance, if you use something like the mean computed across the entire dataset (say, to normalize features), the train set will carry information about the validation set and vice versa.
- Time. Simple random train/validation splitting isn’t appropriate for forecasting problems, where time is involved. That is, if one wants to predict `C` based on `A` and `B`, the model should be trained on `[A, B] → C` and not something like `[C, A] → B`. Otherwise the model uses the value of `C`, a data point in the future, to predict `B`, a point in the past.
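Both cases in the list above can be sketched in a few lines. This is a minimal, dependency-free illustration under stated assumptions, not the code from the experiment described here: the helper names `standardize_without_leakage` and `chronological_split` are hypothetical, and the toy data is made up. The first helper computes the mean and standard deviation from the train split only and reuses them on the validation split. The second splits time-ordered records at a cutoff instead of shuffling, so every validation point lies after every training point.

```python
import statistics


def standardize_without_leakage(train, valid):
    """Scale both splits using statistics computed from the train split only."""
    mean = statistics.fmean(train)
    std = statistics.pstdev(train) or 1.0  # fall back to 1.0 on constant data
    scale = lambda xs: [(x - mean) / std for x in xs]
    return scale(train), scale(valid)


def chronological_split(records, valid_fraction=0.25):
    """Split (timestamp, value) records so validation is strictly later than train."""
    ordered = sorted(records, key=lambda r: r[0])
    cut = int(len(ordered) * (1 - valid_fraction))
    return ordered[:cut], ordered[cut:]


# Toy data (made up for illustration).
train_x, valid_x = [1.0, 2.0, 3.0], [10.0, 11.0]
train_scaled, valid_scaled = standardize_without_leakage(train_x, valid_x)

# Records arrive out of order; the split sorts them by timestamp first.
records = [(t, t * 0.5) for t in (5, 1, 4, 2, 3, 8, 7, 6)]
train_rec, valid_rec = chronological_split(records)
```

In a real project the same idea usually shows up as "fit the transformer on train, then apply it to validation and test", whatever library does the fitting.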
While it’s always good to be suspicious when something seems too good to be true, leakage may also be more subtle. That is, the score increases by a large but not blatantly odd amount; however, when the model predicts on test data, it performs rather poorly.
The best thing to do, really, is to pay attention whenever data is split into train/validation/test sets, and to do the splitting early on. Setting seeds, furthermore, is a strong safeguard, since a fixed seed reproduces the same split in every training session.
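One way to make the split both early and reproducible is to derive it from a fixed seed before any other processing. The sketch below uses only Python's standard `random` module; the function name `split_indices` and the seed value `42` are hypothetical choices for illustration. Because the same seed always yields the same partition, a row that was validation data in one training session cannot quietly become training data in the next.

```python
import random


def split_indices(n_rows, valid_fraction=0.2, seed=42):
    """Return (train_idx, valid_idx) for n_rows, deterministic for a given seed."""
    rng = random.Random(seed)  # local generator, independent of global random state
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_valid = int(n_rows * valid_fraction)
    return sorted(indices[n_valid:]), sorted(indices[:n_valid])


# Two separate "sessions" produce the identical split.
train_a, valid_a = split_indices(100)
train_b, valid_b = split_indices(100)
```

Storing the resulting indices on disk is an even stricter variant of the same safeguard, since it no longer depends on the shuffling code staying unchanged.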
Data leakage — in a more fluid sense — can be beneficial. If the model sees what it isn’t supposed to see, but what it sees aids its generalization and learning, then data leakage is positive.
For example, consider Kaggle’s ranking system, which consists of a public and a private leaderboard. Before the competition ends, the user has access to all of the train set and part of the test set (in this case 25%, which is used to determine the position on the public leaderboard). However, when the competition ends, the model is evaluated on the other 75% of the test set to determine the position on the final private leaderboard.
If we’re able, however, to use that 25% of the test set to improve the score on the private leaderboard, this counts as data leakage.
A relatively common practice in Kaggle competitions is to do some sort of preprocessing on the 75% test data with knowledge of the other 25% test data. For instance, one could employ PCA for dimensionality reduction. This usually results in an improvement of model performance, since generally more data is better. It’s always worth looking out for how to improve data connectivity; generally, never label any sufficiently complex occurrence as something purely bad (or good).
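As a rough sketch of that idea, the block below fits an unsupervised preprocessing step on train and test features together. To stay dependency-free it swaps PCA for plain standardization, which is a different and much simpler transform, and it assumes the test features are available without their labels. The names `fit_scaler` and `transform` and all values are hypothetical.

```python
import statistics


def fit_scaler(values):
    """Learn mean and standard deviation from unlabeled feature values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # fall back to 1.0 on constant data
    return mean, std


def transform(values, scaler):
    """Apply a previously fitted (mean, std) pair to a list of values."""
    mean, std = scaler
    return [(v - mean) / std for v in values]


# Made-up feature values; no labels are involved anywhere in this block.
train_features = [1.0, 2.0, 3.0, 4.0]
test_features = [6.0, 8.0]

# The unsupervised step is fit on train and test features together.
scaler = fit_scaler(train_features + test_features)
train_scaled = transform(train_features, scaler)
test_scaled = transform(test_features, scaler)
```

Only the feature distribution is shared here. Target values from the test set never enter the fit, which is what separates this kind of leakage from the harmful kinds described earlier.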
In summary: data scientists should strive always to provide their models with as much data as possible, but to keep data with different purposes apart. And — lest your hope and time be shattered — program pessimistically. After all, if you expect the worst, you’ll either remain unsurprised or be pleasantly startled.