Omitted Variable Bias in Machine Learning models for Marketing and how to avoid it

This isn’t a highly technical article explaining the maths of Omitted Variable Bias (OVB). Plenty of brave individuals have already taken that approach, and their work can be read here (1) or here (2). Instead, this article discusses what OVB is in plain English and what it implies for the world of marketing and data.

Let’s start with the basics: what is OVB? We could define it technically:

When doing regression analysis while omitting variables that affect the relationship between the dependent variable and included explanatory variables, researchers don’t get the true relationship. Therefore, the regression coefficients are hopelessly biased, and all statistics are inaccurate.

(1)

Instead, we’re going to explain it more simply: if you develop a model that makes predictions from some relevant factors, but not all of them, its predictions will never be entirely reliable, because you cannot make an accurate prediction without access to all the relevant information. Worse, whenever the missing factors move together with the ones you did include, the model will tend to credit the included factors with effects that really belong to the missing ones.

It’s like trying to predict the temperature just by looking out of the window, with no more information than what you see. Sometimes your prediction will be right, because sunny often does mean hot, but a significant number of times you will be wrong: for instance, if you predict that it’s hot just because it’s sunny in the middle of winter, or during the summer in Alaska. So, in this imaginary scenario, there are at least two extra variables that we should be considering: location and season.

Another interesting example that a marketer could relate to:

Let’s imagine that last spring a swimwear brand saw that sales were really low and decided to change its media/creative agency. The new agency started working with the brand and, right when its first campaign aired, sales spiked. The brand was so impressed with the new agency’s performance that it extended the contract for 3 years.

A few months later, the brand analysed the data in greater detail and realised that during spring it had held the highest market share in its history, and that the share had kept improving throughout the season. Once the new agency started, the share dropped, and it is now back to where it was a year ago.

How could this have happened? Because they were omitting the most important variable in their sales: the weather. It had been awful during spring, and summer was starting right when the new campaign aired. There was good weather for the first time that year, so everybody rushed out to buy a bathing suit, which they hadn’t done before because they wouldn’t have been able to use it. In their hasty decision, maybe made by people who didn’t even live in the country where this happened, they had completely missed this. They let go of an agency that was actually delivering better results than the new one, with whom they now had a 3-year contract, and lost significant revenue as a result.
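The swimwear story can be reproduced with a small simulation. The sketch below is purely illustrative: every number in it (the effect sizes, the temperature trend, the launch day) is an assumption made up for the example, not data from a real brand. It generates daily sales that depend on the weather and on the new campaign, then compares the campaign effect estimated without the weather against the one estimated with it.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Assumed "true" effects, invented for this example only.
TRUE_WEATHER_EFFECT = 10.0    # extra units sold per degree Celsius
TRUE_CAMPAIGN_EFFECT = -20.0  # the new agency actually sells fewer units

days = range(120)
# Temperature rises as spring turns into summer (hypothetical trend plus noise).
temperature = [10 + 0.15 * d + random.gauss(0, 3) for d in days]
# The new agency's campaign airs from day 60, just as the weather improves.
new_campaign = [1.0 if d >= 60 else 0.0 for d in days]
sales = [
    100 + TRUE_WEATHER_EFFECT * t + TRUE_CAMPAIGN_EFFECT * c + random.gauss(0, 10)
    for t, c in zip(temperature, new_campaign)
]


def cov(a, b):
    """Population covariance of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


# Model 1: sales explained by the campaign only (weather omitted).
naive_campaign = cov(new_campaign, sales) / cov(new_campaign, new_campaign)

# Model 2: sales explained by campaign AND temperature.
# Solves the two-variable least-squares normal equations by hand.
s_cc = cov(new_campaign, new_campaign)
s_tt = cov(temperature, temperature)
s_ct = cov(new_campaign, temperature)
s_cy = cov(new_campaign, sales)
s_ty = cov(temperature, sales)
det = s_cc * s_tt - s_ct ** 2
adjusted_campaign = (s_tt * s_cy - s_ct * s_ty) / det
adjusted_weather = (s_cc * s_ty - s_ct * s_cy) / det

print(f"Campaign effect, weather omitted:  {naive_campaign:+.1f}")
print(f"Campaign effect, weather included: {adjusted_campaign:+.1f}")
print(f"Weather effect, weather included:  {adjusted_weather:+.1f}")
```

With the weather left out, the campaign appears to lift sales, because it absorbs the effect of the warmer days it happened to coincide with. Once temperature is included, the estimate moves back towards the negative effect that was built into the simulation.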

These were simple imaginary scenarios, which I hope have convinced you that Omitted Variable Bias isn’t just some “mathematical thing”, but a real-world challenge to which companies should pay close attention if they want to make effective decisions. In the examples above the missing variable was obvious, but it isn’t always so easy to spot. How can we design a robust model that takes all (or as many as possible) of the relevant variables into account?

(2)

By doing a thorough data discovery (4) process in which all stakeholders are on board and all processes are mapped. Before chasing any machine learning application, we must find out what the relevant variables might be by:

● Looking into our customers’ journey and analysing their interactions and behaviour at each stage, with the help of the stakeholders who are there along the way, both internal and external (even customers). This is proper Journey Analytics.

● Considering the 8 Ps of marketing (Product, Place, Price, Promotion, People, Processes, Physical evidence, Productivity & quality) and assessing how each of them might be relevant as an input for predictive modelling.

Once this is done, we will have a list of the critical variables that is as comprehensive as we can make it, and we can start designing and building a data warehouse, if there isn’t one already, and then, finally, start building the model. During this process, we must not forget everything we have worked on before. On the contrary, this is the moment at which, by exploring the data and the model’s results, we can find out whether the outcome we are modelling is well explained by the variables we included. If it isn’t, we are still missing something. We can then deploy an imperfect model (if it’s good enough), prototype quickly, and follow up with iterations that refine it or take us back to earlier stages of ideation.
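One possible way to look for a missing variable while exploring the model’s results is to check whether what the model fails to explain (its residuals) lines up with a candidate variable that was left out. The sketch below is a minimal illustration under invented assumptions: the variables price, promotion and unrelated, and all the numbers, are hypothetical. It only works for candidates that have actually been measured, which is why the discovery work described above comes first.

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

n = 200
price = [random.uniform(20, 40) for _ in range(n)]          # included in the model
promotion = [random.choice([0.0, 1.0]) for _ in range(n)]   # candidate, left out
unrelated = [random.gauss(0, 1) for _ in range(n)]          # candidate with no real effect
# Hypothetical sales: driven by price and promotion, not by `unrelated`.
sales = [
    500 - 8 * p + 60 * promo + random.gauss(0, 15)
    for p, promo in zip(price, promotion)
]


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Population covariance of two equally long lists."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def correlation(a, b):
    """Pearson correlation coefficient."""
    return cov(a, b) / (cov(a, a) ** 0.5 * cov(b, b) ** 0.5)


# Fit the incomplete model: sales explained by price only.
slope = cov(price, sales) / cov(price, price)
intercept = mean(sales) - slope * mean(price)
residuals = [y - (intercept + slope * x) for x, y in zip(price, sales)]

# Compare the unexplained part of sales with each candidate variable.
r_promotion = correlation(residuals, promotion)
r_unrelated = correlation(residuals, unrelated)

print(f"Residuals vs promotion: {r_promotion:+.2f}")
print(f"Residuals vs unrelated: {r_unrelated:+.2f}")
```

A strong correlation between the residuals and a candidate suggests the model is missing it, while a correlation near zero suggests that candidate adds little. This is a rough screening step, not proof: a candidate that is tightly linked to a variable already in the model can hide from this check.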

In a nutshell, to build a model that resembles reality we must first identify the right inputs, and for that we need to involve every stakeholder in the ideation process. Not doing so could lead to incomplete models that misguide us instead of assisting our decision making. Developing and relying on a more complete and accurate model will lead us to more effective data-driven decisions, which in the end will help us attain our main goals: more customers, and more satisfied customers.
