The Machine Learning Experimentation Tradeoff: A Framework
If you're a company, you're continually looking for ways to earn more profit. When a company wants to expand or change its current business (in ways big or small), a common approach is experimentation.
Companies can experiment to find out whether a change works; if a change does look promising, they can roll it into their broader business. Especially for digital companies, experimentation is a driving force of innovation and growth.
A common — and relatively simple — test is the A/B test. Half of users are randomly directed towards layout A, and the other half are randomly directed towards layout B. Then, after the experiment is over, the results of users in layout A and B can be compared to see which layout performs better.
By choosing an A/B test, you're choosing information in the experimentation trade-off between information and profit. The A/B test is about as scientifically fundamental as a testing method gets; comparing a control group against a treatment group is the bedrock of experimental science. As a result, the statistical significance you get from an A/B test is the most trustworthy and rigorous kind.
(Technically, information is a form of profit, just delayed and less concrete, though not necessarily less valuable. It's worth keeping the information-versus-profit framing for now, though.)
However, experiments do not happen in some irrelevant galaxy far, far away; the experiment affects real customers. If you're a company, especially one without many resources to spare, you want to minimize losses at all times, including during experiments.
So, for instance, if a new website layout is only 25% successful while the original layout is 75% successful, then during an A/B test only 50% of website visits are successful, compared to 75% originally. Running the test hurts the company.
(0.5 × 0.25) + (0.5 × 0.75) = 0.5  # success rate during the A/B test
In fact, Harvard Business Review found that at Google and Bing, only about 10% to 20% of experiments generate positive results. Given such a low success rate, should companies be running so many of these high-information but also high-risk A/B tests?
Multi-Armed Bandit (MAB) tests put a spin on A/B tests to address exactly this concern: A/B tests may not be very profitable. MAB tests are not static in the proportion of the population assigned to each group.
That is,
- A/B tests are static; they always keep a 50% / 50% split between the control group and the treatment group.
- MAB tests change; they may begin at 50% / 50%, but by the end of the test the proportion may become 10% / 90%, for example, if the second group performs better.
In this case, if we choose MAB over A/B tests, we are both at an advantage and a disadvantage.
- Advantage: Profit. In the MAB test above, by the end we were sending 40 percentage points more of the population to the better-performing group. This means that throughout the test, the MAB “saved” a significant number of users who would otherwise have gone to the lower-performing group.
- Disadvantage: Information. In the MAB test, the final results are not split by an equal proportion, so making conclusions in a statistically rigorous way is not as simple. In fact, if the experimenter wants to be especially rigorous, depending on the circumstances one may not even be able to draw a conclusion from the experiment.
For instance, say that across an entire MAB experiment, 30% of participants were in group A with a 40% success rate, and the other 70% were in group B with a 33% success rate. Can you confidently decide whether the 7-point difference in success rates is a real effect, or just noise amplified by the 40-point gap between the sample sizes?
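One way to sanity-check a result like this is to run a standard significance test on each group's successes and failures. The sketch below uses a chi-square test of independence over a few hypothetical total sample sizes (the scenario above doesn't specify one, so all counts here are purely illustrative):

```python
# Sketch: is 40% vs. 33% conclusive when the split is 30% / 70%?
# Total sample sizes are hypothetical; the answer depends heavily on them.
from scipy.stats import chi2_contingency

for n_total in (500, 1000, 2000):
    n_a, n_b = int(0.30 * n_total), int(0.70 * n_total)   # 30% / 70% split
    succ_a, succ_b = int(0.40 * n_a), int(0.33 * n_b)     # 40% vs. 33% success rates
    table = [[succ_a, n_a - succ_a],    # group A: successes, failures
             [succ_b, n_b - succ_b]]    # group B: successes, failures
    _, p_value, _, _ = chi2_contingency(table)
    print(f"n = {n_total}: p-value = {p_value:.3f}")
```

With a small experiment the 7-point gap may well fail to reach significance, while with a much larger one it may pass; the uneven split makes the answer far less obvious than it would be in a balanced A/B test.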
It's useful to compare MAB and A/B tests within this profit-information framework. Using it, one can arrive at the difficult decision of which testing type to use in a more structured way.
First, though, one must recognize A/B tests as a subset of MAB tests. Let's define a more rigorous (but still relatively simple) view of MAB tests from an exploration/exploitation perspective.
- MAB tests are δ percent explorative. That is, for δ percent of users they assign groups randomly in order to collect more data, which strengthens the statistical significance of the results.
- MAB tests are 100 − δ percent exploitative. That is, for the rest of the users they send traffic to the group they believe performs better, in order to maximize profit.
So, A/B tests are MAB tests where δ = 100 percent: they explore all the time and do no exploitation at all. In this framework, we view MAB models as a blend of exploration and exploitation.
As a note, it can be misleading to take this too literally. The amount of exploration and exploitation a MAB model does is dynamic — it changes — but think of this framework more as “how much of a blend is the model between a purely explorative test and a purely exploitative test?” In reality, δ represents something like “how confident does the model need to be to exploit?”
In that case, the higher δ is, the more statistical confidence is needed before the model decides to exploit, meaning it will be more hesitant to exploit and will spend more time exploring. On the other hand, if δ is lower, the model is willing to exploit with little confidence.
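To make the blend concrete, here is a minimal sketch of such an assignment rule, treating δ literally as the percentage of traffic that explores; the function name and structure are illustrative, not any standard library API:

```python
import random

def assign_group(delta, successes, trials):
    """Choose a group for the next user under a delta-percent-explore rule.

    delta: percent of traffic that explores (0 to 100), treated literally here.
    successes, trials: per-group counts observed so far.
    """
    if random.random() < delta / 100:         # explore: assign a random group
        return random.randrange(len(trials))
    # exploit: pick the group with the best observed success rate so far
    rates = [s / t if t else 0.0 for s, t in zip(successes, trials)]
    return max(range(len(rates)), key=lambda i: rates[i])
```

This is essentially the epsilon-greedy strategy, one of the simplest bandit policies; more sophisticated MAB algorithms such as Thompson sampling or UCB adapt how aggressively they exploit based on their uncertainty rather than a fixed δ.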
Back to the framework at hand: we can begin to tune δ to navigate the trade-off between exploration (statistical significance) and exploitation (profit).
Let's take an example scenario in which the population is the same but the test is run with different values of δ.
- At δ = 100, as discussed above, this is an A/B test. While profit is not even a consideration, the data collected is very statistically significant.
- At δ = 75, the model is mostly exploring, with a bit of exploitation. Whereas the groups may have been 50% / 50% at the beginning of the test, with this small amount of exploitation the model slowly leans towards 40% / 60% (the second group is more successful).
- At δ = 50, the model explores and exploits in equal measure. Whereas the groups may have been 50% / 50% at the beginning of the test, the model relatively quickly reaches 30% / 70%.
- At δ = 25, the model is mostly exploiting, with only a bit of exploration. Whereas the groups may have been 50% / 50% at the beginning of the test, with this large amount of exploitation the model surges quickly towards 25% / 75%.
- At δ = 0, the model is doing pure exploitation. Whereas the groups may have been 50% / 50% at the beginning of the test, with no exploration at all the model instantly collapses to 100% / 0%. (A purely exploitative model follows whatever the first data point says, and in this case the first data point collected happened to suggest that the first group did better.)
These numbers are fictional, of course, but they are quasi-realistic, and we can analyze the fictional results that this framework produces.
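To see how splits like these could arise, here is a rough simulation sketch of the same δ-blend rule, using made-up true success rates of 40% and 50% for the two groups; the splits it produces depend entirely on those assumptions and will not match the fictional numbers above exactly:

```python
import random

def run_test(delta, p_true=(0.40, 0.50), n_users=10_000, seed=0):
    """Simulate a delta-percent-explore test and report the final traffic split."""
    rng = random.Random(seed)
    successes, trials = [0, 0], [0, 0]
    for _ in range(n_users):
        if rng.random() < delta / 100:              # explore: random group
            g = rng.randrange(2)
        else:                                       # exploit: best observed group
            rates = [s / t if t else 0.0 for s, t in zip(successes, trials)]
            g = max(range(2), key=lambda i: rates[i])  # ties break toward group A
        trials[g] += 1
        successes[g] += rng.random() < p_true[g]    # simulated user outcome
    split = [t / n_users for t in trials]
    return split, sum(successes) / n_users

for delta in (100, 75, 50, 25, 0):
    split, overall = run_test(delta)
    print(f"delta = {delta:3d}: split {split[0]:.0%} / {split[1]:.0%}, success rate {overall:.1%}")
```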
As an experimenter, analyzing results based on the value of δ makes the trade-off a bit clearer. There are two things you are likely noticing:
- Statistical significance. For practical purposes, is there really much of a difference in drawing statistically significant conclusions between the 40% / 60% split at δ = 75 and the 50% / 50% split at δ = 100? (See the sketch after this list.) Problems do seem to emerge at δ = 50, though, since the gap between group sizes is much larger and can threaten statistical validity and significance.
- Profit. Ironically, past a certain point, exploitation actually causes a loss, because the model has not explored enough and ends up exploiting false ideas. For instance, at δ = 0 our MAB test sends 100% of users to the wrong group. In general, anything below δ = 50 seems too hasty, since we would intuitively want the model to spend at least as much time exploring as it spends exploiting the knowledge learned from that exploration.
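A rough way to quantify the first observation is to compare the standard error of the difference in success rates under different splits. The sketch below assumes a hypothetical 5,000 total users and a baseline success rate of about 35%, both purely illustrative:

```python
import math

def se_of_difference(share_a, n_total=5000, p=0.35):
    """Standard error of the difference between two observed success rates."""
    n_a = share_a * n_total
    n_b = (1 - share_a) * n_total
    return math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))

for share in (0.50, 0.40, 0.30, 0.10):
    print(f"{share:.0%} / {1 - share:.0%} split: SE of difference = {se_of_difference(share):.4f}")
```

Under these assumptions, moving from a 50/50 split to 40/60 barely widens the error, while a heavily skewed split widens it substantially, which matches the intuition above.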
Based on these two observations alone, our experimenter would come to the conclusion that somewhere around δ = 75 is the best testing model.
Of course, these numbers don’t apply to all scenarios, and there are good reasons to choose both high and low values of δ.
For instance, a news outlet may prioritize immediate profit over statistical significance because of the nature of its business. The statistical significance of an A/B test is derived from the amount of exploration it does, and exploration takes time. Given that the news changes quickly, by the time a test has explored enough to exploit confidently, the news has already moved on and the learnings from exploration no longer apply.
On the other hand, if an experimentation team at a large company has several months to develop and run a test on a new feature, an A/B test may be more suitable. Profit during the test is not a significant concern, and since large companies usually have large platforms, statistical significance for the hypothesis is very important: the results of the test will decide the interface for the company's entire audience. The team can do extensive post-processing analysis of the results and be sure of a decision before it is exploited.
There are, of course, many other factors outside of the profit vs. information paradigm to consider in deciding the value of δ for your experimentation model. However, thinking through the lens of profit vs. information can make finding δ a bit clearer.