
Most ML engineers are familiar with the quote, “Garbage in, garbage out”. Your model can only perform so well when the data it is trained on poorly represents the target population. What do I mean by ‘representative’? It refers to how well the training data mimics the target population: the proportions of the different classes, the point estimates (like the mean or median), and the variability (like the variance, standard deviation, or interquartile range) of the training data should match those of the target population.
Generally, the larger the dataset, the more likely it is to be representative of the target population you want to generalize to. But this may not always be the case, especially if the sampling method is flawed. For instance, say you want to generalize to an entire school of students, from the 1st standard to the 10th, but 80% of your training data consists of students from the 2nd standard. If the school’s actual student distribution is nowhere near 80% 2nd-standard students, and the quantity you want to predict really does vary with the natural differences between the standards, your model will be biased towards the 2nd standard.
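Here’s a rough sketch of stratified sampling with pandas, assuming a hypothetical students.csv with a ‘standard’ column (the file name, column name, and sampling fraction are just placeholders):

```python
import pandas as pd

# Hypothetical roster of students; 'standard' holds the class (1 to 10).
students = pd.read_csv("students.csv")

# Draw a 10% sample whose proportions across standards mirror the school's
# actual roster, instead of letting one standard dominate the training data.
sample = (
    students
    .groupby("standard", group_keys=False)
    .apply(lambda g: g.sample(frac=0.1, random_state=42))
)

print(sample["standard"].value_counts(normalize=True))
```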
It is crucial to have a good understanding of the distribution of your target population in order to devise the right data collection techniques. Once you have the data, study it (the exploratory data analysis phase) to determine its distribution and how representative it actually is.
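As a minimal example of that kind of check, assuming a training CSV with ‘age’ and ‘standard’ columns, and assuming you know (or can estimate) the target population’s class proportions:

```python
import pandas as pd

train = pd.read_csv("train.csv")  # assumed to have 'age' and 'standard' columns

# Point estimates and variability of a numeric feature:
# count, mean, std, min, quartiles (IQR = 75% - 25%), max.
print(train["age"].describe())

# Class proportions in the training sample vs. the (assumed known)
# proportions in the target population, side by side.
train_props = train["standard"].value_counts(normalize=True).sort_index()
target_props = pd.Series({s: 0.10 for s in range(1, 11)})  # e.g. an even split
print(pd.DataFrame({"train": train_props, "target": target_props}))
```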
Outliers, missing values, and outright wrong or false data are some of the other considerations you might have. Should you cap outliers at a certain value? Or remove them entirely? How about normalizing the values? Should you include records with some missing values? Or replace the missing values with the mean or the median? Does the data collection method support the integrity of the data? Data cleaning is probably the most important step after data collection.
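For illustration, here’s roughly what a couple of those cleaning decisions could look like in pandas (the file, column name, and thresholds here are all made up):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # assumed to have a numeric 'income' column

# Cap (winsorize) outliers at the 1st and 99th percentiles instead of dropping them.
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(lower=low, upper=high)

# Replace missing values with the median, which is less sensitive
# to extreme values than the mean.
df["income"] = df["income"].fillna(df["income"].median())

# Or drop rows that are missing too many fields to be trustworthy at all
# (here: keep rows with at least 70% of their fields present).
df = df.dropna(thresh=int(df.shape[1] * 0.7))
```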
The quote “Garbage in, garbage out” is also applicable when it comes to feature engineering. Some features are gonna carry more weight in the prediction than others.
Measures like correlation coefficients, variance, and dispersion ratios are widely used to rank the importance of each feature. One common mistake novice Data Scientists make is using Principal Component Analysis to reduce dimensions across features that are not inherently continuous. I mean, technically you can, but ideally, you should not. PCA assumes the features with the highest variance are the ones with the highest impact, which, of course, is not necessarily true. Categorical features that have been artificially encoded generally don’t end up as variable as the continuous ones, and so get undervalued in terms of their relevance.
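Here’s a small, contrived illustration of why that happens; the columns and numbers are made up, but the pattern holds whenever a one-hot encoded column sits next to a continuous one with a large spread:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# One continuous feature with a large spread, one one-hot encoded categorical flag.
df = pd.DataFrame({
    "salary": rng.normal(50_000, 15_000, 1_000),            # variance ~ 2.25e8
    "dept_sales": rng.integers(0, 2, 1_000).astype(float),  # variance ~ 0.25
})

print(df.var())  # the encoded column's variance is negligible in comparison

# Run PCA on the raw (unscaled) features: the first component is dominated by
# 'salary' purely because of its scale, no matter how predictive 'dept_sales' is.
pca = PCA(n_components=2).fit(df)
print(pca.explained_variance_ratio_)  # roughly [1.0, 0.0]
print(pca.components_)
```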
Sometimes, combining known features into a new one can have a greater impact than keeping them separate. Oftentimes, having too many features with low relevance can lead to overfitting, while having too few can lead to underfitting. Finding the best combination of features comes with experience and knowledge of the domain. It could be the difference between an okay model and a near-perfect model, and by extension, an okay ML engineer and a pretty darn good one.
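A toy example of what such a combination could look like (the features and the BMI formula are just an illustration, not a recipe):

```python
import pandas as pd

# Two raw features that matter mostly through their ratio.
df = pd.DataFrame({
    "weight_kg": [70, 85, 60, 95],
    "height_m": [1.75, 1.80, 1.65, 1.70],
})

# A single engineered feature (here, BMI) can carry more signal for, say,
# a health-related target than weight and height do separately.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
print(df)
```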
Unlike the previous mistakes, where the data was our focus, this one really comes down to the algorithm used for the model, although the effects can still be alleviated to some extent by addressing the issues discussed above.
Overfitting is when the model fits the training data too closely and cannot generalize to the target population. The more complex a model is, generally, the better it is at detecting subtle patterns in the training dataset. But the collected data is rarely a perfect representation of the target population, so a highly complex algorithm like a deep neural net can end up learning its quirks where a simpler one, say a lower-order polynomial, would not. Use a model too simple for the problem, though, and it will not be able to learn the underlying patterns well enough. This, of course, is called underfitting.
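A quick sketch of the effect, using polynomial regression on a made-up noisy curve; the degrees chosen here are arbitrary:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 3, 40)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(0, 0.2, 40)  # noisy non-linear target

# Degree 1 underfits, degree 15 chases the noise; something in between generalizes best.
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"degree={degree:2d}  mean CV R^2 = {score:.3f}")
```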
One way to compensate for overfitting is to impose a penalty on the model’s weights, based on how far each weight strays from a value we set before training (which could just as well be zero, if we want the model to largely disregard a feature). This effectively allows us to control the complexity of the algorithm at a finer scale and helps find the sweet spot between overfitting and underfitting. This is what we call regularization of the model, and the strength of the penalty is a hyperparameter: it is not learned as part of the model, but it affects the model’s ability to generalize and is set before training.
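As a minimal sketch, here’s what turning that penalty knob can look like with scikit-learn’s Ridge regression on synthetic data (the alpha value here is arbitrary, not a recommendation):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: many features, only a handful actually informative.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

plain = LinearRegression().fit(X_train, y_train)

# alpha is the penalty strength (the hyperparameter): the larger it is,
# the harder the weights are pulled towards zero.
regularized = Ridge(alpha=10.0).fit(X_train, y_train)

print("plain       R^2:", plain.score(X_test, y_test))
print("regularized R^2:", regularized.score(X_test, y_test))
```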
But it doesn’t end here. After extensive tuning of the hyperparameters, you may find that your model predicts with an accuracy of 95% on the test dataset. But now you run the risk of having overfit to that particular test set, and the model may not generalize to real-world data once it is deployed. The common solution is to carve out another set of data from the training dataset, a validation set, and compare the different hyperparameter tunings on it, so that the test set is only touched once, for the final evaluation. This train-validate-test routine generally renders a model that works great, but it ultimately depends on the size and quality of the data you have and the complexity of the problem.
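A bare-bones version of that split with scikit-learn, using a synthetic dataset as a stand-in for yours (the 60/20/20 ratio is just one common choice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=0)  # stand-in for your dataset

# Carve out a test set that is touched only once, at the very end...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# ...then split the rest into training and validation sets; the validation set
# is what the different hyperparameter settings get compared on.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Roughly 60% train / 20% validation / 20% test.
print(len(X_train), len(X_val), len(X_test))
```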
Most ML models require a sh*t ton of data. And unless you have a pre-trained model that only needs some fine-tuning, you are gonna have to find a way to feed your model enough data. Even for simple tasks like telling oranges from bananas, the model needs at least a few thousand example images to learn from. This is a massive bottleneck in the pipeline. More than any other factor, the efficiency of today’s ML models and the efficacy of their applications are choked by the lack of enough data.
This is why companies like Facebook, Google, and Apple are so keen on collecting as much data as possible from their users (not gonna debate the ethical concerns of that practice in this article). Data augmentation techniques like cropping, padding, and horizontal flipping have been critical in squeezing as much training potential as possible out of the available dataset, but these can only do so much. This study from Microsoft illustrates how very different ML models performed similarly, with their accuracy showing a pretty strong positive correlation with the size of the training data (number of words).
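For reference, a typical augmentation pipeline with torchvision might look something like this (the crop size, padding, and flip probability are illustrative, not tuned values):

```python
from torchvision import transforms

# Every epoch sees a slightly different version of each training image,
# which stretches a limited dataset further.
train_transforms = transforms.Compose([
    transforms.RandomCrop(32, padding=4),    # pad the borders, then crop at a random offset
    transforms.RandomHorizontalFlip(p=0.5),  # mirror the image half of the time
    transforms.ToTensor(),
])

# Typically handed to a dataset, e.g.:
# dataset = torchvision.datasets.CIFAR10(root="data", train=True, download=True,
#                                        transform=train_transforms)
```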