Learning with Small Data: Part 2

This is the second post in a series focused on learning with small data, or in other words, on how to deal with the Cold Start Problem when your application is such that data is not widely available. In the previous post I discussed how to plan your data collection so that even if you have a small amount, you can start extracting value from it as soon as possible.

In this post, I will discuss which models you can use when dealing with small amounts of data. In the upcoming posts in the series I will discuss how to improve your model using priors and how to evaluate your results.

After obtaining your data, the next choice you need to make is the type of model you want to train. Although the amount of data is not the only reason to choose one model over another, some algorithms are much more data hungry than others. Deep Learning models are one example: because of the large number of layers (and usually the large number of hidden units per layer) in a typical DNN, the model has many parameters to fit, so with a small amount of data it is far more likely to overfit and fail to generalize once you get more real-world samples.

Before comparing algorithms, notice that one of the few advantages of having a small amount of data is that it is usually possible to label it, which means that you can train a supervised model rather than an unsupervised one, as is sometimes necessary when dealing with large amounts of data. There are two major categories of supervised learning algorithms: generative and discriminative models.

Generative models attempt to learn the full joint probability distribution: if the datapoints are described by a vector x and the labels by y, generative models learn P(y,x). They are called “generative” because sampling from the learned distribution can be used to generate synthetic data points. Common examples of Generative Models include Naïve Bayes, Hidden Markov Models, and Bayesian Networks.
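To make the "generative" part concrete, here is a minimal sketch in plain Python. It assumes binary features and the Naïve Bayes factorization P(y,x) = P(y) · Π P(x_i|y); the function names (fit_joint, sample), the toy data, and the Laplace smoothing constant alpha are all made up for illustration and do not come from the cited paper.

```python
import random

def fit_joint(X, y, alpha=1.0):
    """Estimate P(y) and P(x_j = 1 | y) from binary features by counting.

    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    n = len(y)
    classes = sorted(set(y))
    prior = {c: sum(1 for label in y if label == c) / n for c in classes}
    cond = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        cond[c] = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(len(X[0]))
        ]
    return prior, cond

def sample(prior, cond, rng):
    """Draw one synthetic (x, y) pair from the learned joint P(y, x)."""
    classes = list(prior)
    label = rng.choices(classes, weights=[prior[c] for c in classes])[0]
    x = [1 if rng.random() < p else 0 for p in cond[label]]
    return x, label

# Toy data: two binary features, two classes (invented for illustration).
X = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
y = [1, 1, 1, 0, 0, 0]

prior, cond = fit_joint(X, y)
rng = random.Random(0)
synthetic = [sample(prior, cond, rng) for _ in range(5)]
print(prior)      # estimated class probabilities P(y)
print(cond)       # estimated P(x_j = 1 | y) for each class
print(synthetic)  # synthetic (x, y) pairs drawn from the model
```

Because the model stores the whole joint distribution, the same fitted parameters can be used both to classify and to generate new points; a discriminative model only supports the former.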

Discriminative models, in turn, learn the posterior probabilities directly, in order to only discriminate between classes, i.e., they learn P(y|x). Popular discriminative models include Logistic Regression, Decision Trees, and traditional Neural Networks.
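The two families can be put side by side in a couple of formulas. This is a sketch using Naïve Bayes and Logistic Regression as representatives; the symbols w, b and σ are notation introduced here, not taken from the original text. A generative classifier obtains the posterior indirectly, through Bayes' rule applied to the joint distribution it has learned, whereas Logistic Regression parametrizes the posterior directly:

```latex
% Generative: posterior recovered from the learned joint P(y, x)
P(y \mid x) = \frac{P(y, x)}{\sum_{y'} P(y', x)}
            = \frac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')},
\qquad
\text{with Na\"ive Bayes assuming } P(x \mid y) = \prod_{i} P(x_i \mid y).

% Discriminative: posterior modeled directly (binary Logistic Regression)
P(y = 1 \mid x) = \sigma(w^{\top} x + b) = \frac{1}{1 + e^{-(w^{\top} x + b)}}.
```

In the first case the model must estimate P(x|y) and P(y); in the second, nothing about the distribution of x itself is ever modeled.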

Figure 1: Example extracted from Ng, A., & Jordan, M., 2001. The continuous line represents the testing error of Naïve Bayes as the number of training samples increases; the dashed lines represent the testing error of Logistic Regression under the same circumstances. See more examples in the cited paper.

Although there are no strict rules to this, researchers have found [Ng, A., & Jordan, M., 2001] that Generative Models tend to perform better than Discriminative Models when the amount of training data is small. In the cited article, the authors compared Naïve Bayes and Logistic Regression on several datasets from the UCI repository while varying the amount of data used for training. Although the trend does not hold in every case (as shown in the figure), Naïve Bayes tends to outperform Logistic Regression when data is scarce; the difference shrinks as the amount of data increases, and eventually Logistic Regression overtakes Naïve Bayes. The likely explanation is that Discriminative Models spend all of their modeling capacity on the task at hand (i.e., on predicting the most likely label) and so keep improving as the amount of data increases, whereas Generative Models usually depend on strong assumptions (e.g., prior probabilities) to model the full distribution, and therefore do better when the amount of data is limited (provided that the assumptions are reasonable).
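If you want to run this kind of comparison yourself, the sketch below trains both models on growing training sets and reports the test error of each. It is not a reproduction of the paper's experiments: the data is synthetic, the features are conditionally independent given the class by construction (which favors Naïve Bayes), and every name and constant (make_data, fit_nb, fit_lr, the probabilities 0.7 and 0.4, the learning rate) is a hypothetical choice made for this example. Results on your own data may look quite different.

```python
import math
import random

def make_data(n, rng, d=8):
    """Synthetic binary data: each feature is 1 with probability 0.7 under
    class 1 and 0.4 under class 0 (hypothetical numbers for this sketch)."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        p = 0.7 if label == 1 else 0.4
        X.append([1 if rng.random() < p else 0 for _ in range(d)])
        y.append(label)
    return X, y

def fit_nb(X, y, alpha=1.0):
    """Bernoulli Naive Bayes with Laplace smoothing; returns a predict function."""
    d = len(X[0])
    log_prior, log_on, log_off = {}, {}, {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log((len(rows) + alpha) / (len(y) + 2 * alpha))
        probs = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                 for j in range(d)]
        log_on[c] = [math.log(p) for p in probs]
        log_off[c] = [math.log(1 - p) for p in probs]

    def predict(x):
        score = {c: log_prior[c] + sum(log_on[c][j] if v else log_off[c][j]
                                       for j, v in enumerate(x))
                 for c in (0, 1)}
        return 1 if score[1] > score[0] else 0
    return predict

def fit_lr(X, y, steps=300, lr=0.1):
    """Logistic Regression fitted by batch gradient descent, no regularization."""
    d, n = len(X[0]), len(y)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for x, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - label  # predicted prob minus label
            gb += err
            for j in range(d):
                gw[j] += err * x[j]
        b -= lr * gb / n
        w = [wj - lr * g / n for wj, g in zip(w, gw)]

    def predict(x):
        return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0
    return predict

def error_rate(predict, X, y):
    """Fraction of misclassified points."""
    return sum(predict(x) != label for x, label in zip(X, y)) / len(y)

rng = random.Random(0)
X_test, y_test = make_data(1000, rng)
results = {}
for n in (10, 30, 100, 300):
    X_train, y_train = make_data(n, rng)
    results[n] = (error_rate(fit_nb(X_train, y_train), X_test, y_test),
                  error_rate(fit_lr(X_train, y_train), X_test, y_test))
    print(f"n={n:4d}  NB error={results[n][0]:.3f}  LR error={results[n][1]:.3f}")
```

A single run with one random seed is noisy, especially at n=10; averaging over several seeds and training subsets gives a more trustworthy learning curve.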

In summary, when you are dealing with small amounts of data, it is a good strategy to try Generative Models first (e.g., Naïve Bayes or a Bayesian Network), as they are likely to perform better than Discriminative Models such as Logistic Regression, and even better than some state-of-the-art models such as Deep Neural Networks (due to their need for large amounts of data). In the next post, I will discuss how we can improve our model using priors (i.e., the strong assumptions mentioned above) by comparing what are known as MLE and MAP estimators.
