Realizability and Sample Complexity in Machine Learning
Machine Learning from First Principles: Blog Post 4
In the last blog we looked at the problem of Overfitting, which arises when a hypothesis learns the training samples exactly as they are and fails to generalize to unseen samples. We also saw that in order to address this issue, we can introduce a larger class of hypotheses H in which the best hypothesis resides. This is called Inductive Bias in ML.
The way to think about this is: when you had only a single hypothesis that perfectly fit the data, there were no other options to choose from. As the model trained itself, it twisted and turned until it perfectly fit the training data in order to converge. With Inductive Bias, we introduce other possibilities to choose from, so the data now has more candidate hypotheses to fit to. Because these candidates come from a pre-specified class, the model is no longer free to twist and turn the fitting curve whichever way it wants to suit the training data, and hence we avoid overfitting. With the assumptions encoded in the class, the fitting curve cannot be completely arbitrary in shape, and this gives better generalization than before. The best hypothesis lies in the hypothesis class H and can be found by minimizing the empirical loss over every hypothesis in H.
H is a pre-specified class of hypotheses.
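To make the idea of "pick the hypothesis with minimum empirical loss from a pre-specified H" concrete, here is a minimal sketch in Python. The threshold hypotheses, the toy training set and the helper names are illustrative assumptions for this example only, not something defined in this series.

```python
# A minimal sketch of Empirical Risk Minimization over a pre-specified finite
# hypothesis class H. The threshold hypotheses and the toy data are illustrative
# assumptions, not part of the original post.

def make_threshold_hypothesis(t):
    """A 1-D threshold classifier: predict 1 if x >= t, else 0."""
    return lambda x: 1 if x >= t else 0

# Pre-specified finite hypothesis class H (the inductive bias: only thresholds).
H = {t: make_threshold_hypothesis(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)}

# Toy training sample S = [(x_i, y_i)], assumed to be labelled by some true f.
S = [(0.1, 0), (0.3, 0), (0.6, 1), (0.8, 1), (0.9, 1)]

def empirical_loss(h, sample):
    """Fraction of training points that h misclassifies (0-1 loss)."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

# ERM: pick the hypothesis in H with the minimum empirical loss.
best_t = min(H, key=lambda t: empirical_loss(H[t], S))
print(best_t, empirical_loss(H[best_t], S))  # 0.5 with empirical loss 0.0
```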
In order to move forward, we will make some assumptions that will later be relaxed.
Assumption 2 is called realizability and it states that the true function f, which always gives the indisputably correct answer, belongs to the hypothesis class H, which contains a finite number of hypotheses (as stated by Assumption 1). Assumption 3 states that the samples obtained for training are drawn from the distribution 'Px' independently and identically distributed (i.i.d.), and that 'f(xi)' gives the correct label 'yi' for every sample 'i' among the total of 'n' samples.
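Stated compactly, in the notation used above (writing 'Px' as P_x), the three assumptions are:

```latex
\textbf{Assumption 1:}\; |H| < \infty
\qquad
\textbf{Assumption 2 (realizability):}\; f \in H
\qquad
\textbf{Assumption 3:}\; x_1,\dots,x_n \overset{\text{i.i.d.}}{\sim} P_x,\;\; y_i = f(x_i)\ \text{for } i = 1,\dots,n
```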
Equation 1 is the Empirical Risk Minimizer. It picks the hypothesis for which the empirical loss is minimum. But the empirical loss, which is the loss calculated on the observed examples, is not always equal to the real loss, which is the loss under the true distribution, a distribution that is unknown to us. To see how the real loss of this hypothesis behaves when all we have is the observed data, consider the following.
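Since equation 1 is easiest to read in symbols, here is a sketch of it together with the two losses being compared, writing h_S for the ERM output on the sample S, L_S for the empirical loss and L_(P,f) for the real loss (notation chosen to match the description above):

```latex
h_S \in \operatorname*{argmin}_{h \in H} L_S(h),
\qquad
L_S(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\big[h(x_i) \neq y_i\big],
\qquad
L_{(P,f)}(h) = \Pr_{x \sim P_x}\big[h(x) \neq f(x)\big]
```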
Equation 2 is the probability that the real loss 'L(p,f)' of the Empirical Risk Minimizer 'hs' is greater than epsilon.
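In symbols, the quantity in equation 2 is the probability, over the random draw of the n training samples, of this bad event:

```latex
\Pr_{S \sim P_x^{\,n}}\big[\, L_{(P,f)}(h_S) > \epsilon \,\big]
```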
And that is equal to what is stated in equation 3. The breakdown is as follows:
- The first part simply restates this probability in terms of the observed samples, because we don't actually have access to the distribution 'p' and only have the training data in the sample space 's'.
- The second part states that this probability is less than or equal to the probability that there exists some h in H whose empirical loss is 0 and whose real loss is greater than or equal to epsilon. This is true because that set of hypotheses also includes the ERM hypothesis whenever the event in equation 2 occurs (under realizability the ERM always has empirical loss 0).
- The last part is just a restructuring of the above: using the union bound, the probability that some such h exists is split into a sum of probabilities over the individual "bad" hypotheses. The full chain of inequalities is sketched right after this list.
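Putting the three points together, with H_B denoting the set of "bad" hypotheses whose real loss is at least epsilon, the chain of inequalities behind equation 3 reads (a sketch in the notation above):

```latex
\Pr_{S \sim P_x^{\,n}}\big[L_{(P,f)}(h_S) > \epsilon\big]
\;\le\;
\Pr_{S \sim P_x^{\,n}}\big[\exists\, h \in H:\ L_S(h) = 0 \,\wedge\, L_{(P,f)}(h) \ge \epsilon\big]
\;\le\;
\sum_{h \in H_B} \Pr_{S \sim P_x^{\,n}}\big[L_S(h) = 0\big],
\qquad
H_B := \{\, h \in H : L_{(P,f)}(h) \ge \epsilon \,\}
```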
Looking at the probability of a single hypothesis h having empirical loss 0: this can only happen when h behaves exactly like the indisputable function f on every training sample, as stated in equation 5.
This is equal to the product of the individual probabilities of the form h(xi) = f(xi), because the samples are drawn independently. And each such individual probability is the same, because each one is simply the probability of h(x) being equal to f(x) for a single draw x from the distribution.
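Written out, with the i.i.d. assumption (Assumption 3) justifying the product:

```latex
\Pr_{S}\big[L_S(h) = 0\big]
= \Pr\big[\,\forall i:\ h(x_i) = f(x_i)\,\big]
= \prod_{i=1}^{n} \Pr_{x_i \sim P_x}\big[h(x_i) = f(x_i)\big]
= \Big(\Pr_{x \sim P_x}\big[h(x) = f(x)\big]\Big)^{n}
```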
The last probability can be written as P[h(x) = f(x)] = 1 − P[h(x) != f(x)], and P[h(x) != f(x)] is exactly the real loss. Since we are only considering hypotheses whose real loss is greater than or equal to epsilon (a minor tweak of the strict inequality in equation 2), this probability is at most 1 − epsilon.
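Using the standard inequality 1 − epsilon ≤ e^(−epsilon), this gives, for every bad hypothesis h in H_B:

```latex
\Pr_{x \sim P_x}\big[h(x) = f(x)\big] \;=\; 1 - L_{(P,f)}(h) \;\le\; 1 - \epsilon \;\le\; e^{-\epsilon}
\qquad\Longrightarrow\qquad
\Pr_{S}\big[L_S(h) = 0\big] \;\le\; (1-\epsilon)^{n} \;\le\; e^{-\epsilon n}
```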
Putting all of this together, what we now have is the following.
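In symbols, summing the per-hypothesis bound over the at most |H| bad hypotheses:

```latex
\Pr_{S \sim P_x^{\,n}}\big[L_{(P,f)}(h_S) > \epsilon\big]
\;\le\;
\sum_{h \in H_B} \Pr_{S}\big[L_S(h) = 0\big]
\;\le\;
|H|\, e^{-\epsilon n}
```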
The |H| comes from the summation above and denotes the cardinality of H. If we upper bound this whole expression by delta and solve for n, we get the following.
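Requiring the right-hand side to be at most delta and solving for n gives the expression referred to as equation 12:

```latex
|H|\, e^{-\epsilon n} \;\le\; \delta
\quad\Longleftrightarrow\quad
n \;\ge\; \frac{1}{\epsilon}\,\ln\frac{|H|}{\delta}
```

As an illustrative plug-in (numbers chosen only for this example): with |H| = 1000, epsilon = 0.1 and delta = 0.05, this asks for n ≥ ln(20000)/0.1 ≈ 99, so roughly 100 samples are enough for the guarantee.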
Equation 12 is called the sample complexity and it is a function of delta and epsilon: it tells us how many samples n are enough so that, with probability at least 1 − delta, the real loss of the ERM hypothesis is at most epsilon. Introducing PAC learnability again,