
Now let's move towards relaxing the realizability assumption.
Continuing our papaya example, let's assume that the labels are no longer produced by the indisputable labelling function f; instead, each papaya's tastiness label is drawn according to the conditional distribution P(Y | X). As you observe the papayas, you get to see their tastiness labels. Because there is no longer a true tastiness labelling function f, we certainly cannot assume that such an f lies in H, and hence the realizability assumption no longer holds. The loss function has to change as well, since there is no true labelling function to compare against: the new loss function needs to compare the prediction to the label Y that we actually observe for each papaya drawn from the distribution.
L_{P(X,Y)}(h) = P[h(x) ≠ y]. The true loss of a hypothesis h is the probability that the prediction h(x) differs from the label y that we observe for a papaya drawn from the distribution. The empirical loss (the loss observed on the sample we actually have) would thus be

L_S(h) = (1/n) · Σ_{i=1}^{n} 1[h(x_i) ≠ y_i],

where n is the number of training examples. When we had the indisputable labelling function f in the realizable case, the best we could do was compare the prediction made by our hypothesis with it, i.e. check whether h(x) ≠ f(x). But now, the best we can do is compare against the observed label, i.e. check whether h(x) ≠ y.
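To make the empirical loss concrete, here is a minimal Python sketch. The feature representation (colour, softness), the hypothesis h, and the sample values are all hypothetical and chosen purely for illustration.

```python
import numpy as np

def empirical_loss(h, X, y):
    """Empirical 0-1 loss: the fraction of the n training examples
    on which the hypothesis h disagrees with the observed label."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Hypothetical hypothesis, purely for illustration:
# call a papaya tasty (1) if its softness exceeds 0.5.
h = lambda x: int(x[1] > 0.5)

X = np.array([[0.2, 0.7], [0.9, 0.3], [0.5, 0.6]])  # (colour, softness)
y = np.array([1, 0, 0])                             # observed tastiness labels
print(empirical_loss(h, X, y))                      # -> 0.333..., 1 of 3 wrong
```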
Writing the loss out step by step,

L_{P(X,Y)}(h) = P[h(x) ≠ y]
= E[ 1[h(x) ≠ y] ]
= E_x[ E_{y|x}[ 1[h(x) ≠ y] ] ]
= E_x[ P[y = 1 | x] · 1[h(x) ≠ 1] + P[y = 0 | x] · 1[h(x) ≠ 0] ]

The first line, L_{P(X,Y)}(h) = P[h(x) ≠ y], is the same loss we encountered before. In the second line, E is the expectation and 1[·] is the indicator function, which equals 1 when its argument is true and 0 otherwise. The expectation over the pair (x, y) can be broken into an outer expectation over x and an inner expectation over y given x: think of it as computing the expected value over y while keeping x fixed, and then doing this for every x. That is the third line. In the last line, the inner expectation over y is expanded over its two possible values, 1 and 0: the term with P[y = 0 | x] counts the cases where h(x) ≠ 0, and the term with P[y = 1 | x] counts the cases where h(x) ≠ 1. For any fixed x, only one of the two indicator terms can be active at a time, since h(x) is either 0 or 1.
Now the smallest loss we can get, or to put it another way, the best performing prediction for any given sample x, comes from choosing the minimum of the two probability terms P[y = 1 | x] and P[y = 0 | x] and selecting the corresponding prediction h(x), so that only the smaller term contributes to the loss. This is the best thing we can do with the loss function we have.
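As a quick, purely hypothetical sanity check of this rule: suppose that for some papaya x we have P[y = 1 | x] = 0.7, so P[y = 0 | x] = 0.3. Predicting h(x) = 1 contributes a conditional loss of P[y = 0 | x] = 0.3, while predicting h(x) = 0 contributes 0.7. The minimum of the two probability terms is 0.3, so the best choice for this x is h(x) = 1.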
The optimal predictor that we can get is

h(x) = 1 if P[Y = 1 | X = x] ≥ P[Y = 0 | X = x], and 0 otherwise
     = 1 if P[Y = 1 | X = x] ≥ 1/2, and 0 otherwise.

The first step says that if the probability of Y = 1 given x is greater than or equal to the probability of Y = 0 given x, we label the sample as 1, i.e. if P[Y = 1 | X = x] ≥ P[Y = 0 | X = x] we label the sample as 1, or True, or Tasty in this case.
The second step uses the complement rule of probability, P[Y = 0 | X = x] = 1 − P[Y = 1 | X = x], and we finally get: if P[Y = 1 | X = x] ≥ 1/2, we label the sample as 1 (True), and 0 otherwise.
The predictor we obtained here is called the Bayes Optimal Predictor and is the gold standard of classification.
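As a small illustration, here is a Python sketch of this predictor. It assumes, unrealistically, that we have access to the true conditional probability P[Y = 1 | X = x] (written as eta below); in practice this distribution is unknown, which is exactly why the Bayes Optimal Predictor serves as a benchmark rather than something we can learn directly.

```python
def bayes_optimal_predictor(eta, x):
    """Bayes optimal prediction for a binary label: predict 1
    whenever eta(x) = P[Y = 1 | X = x] >= 1/2, else predict 0."""
    return 1 if eta(x) >= 0.5 else 0

# Hypothetical conditional distribution, for illustration only:
# softer papayas (second feature) are more likely to be tasty.
eta = lambda x: min(1.0, max(0.0, x[1]))

print(bayes_optimal_predictor(eta, [0.2, 0.7]))  # -> 1 (tasty)
print(bayes_optimal_predictor(eta, [0.9, 0.3]))  # -> 0 (not tasty)
```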