Measuring Success of Machine Learning Products

Failure is part of the learning process. Unfortunately, it is also part of the machine learning development process far too often. ML projects can be doomed from conception by a misalignment between product metrics and model metrics. Today, many skilled practitioners can build highly accurate models, so weak modeling skill is rarely the pitfall. Instead, teams tend to build accurate models that are not useful for the product and therefore fail to meet business objectives.

Image by Isaac Smith on Unsplash

In defining success, it is crucial to consider the differences between business performance and model performance. The most straightforward way to put this is that business performance is a function of many variables, not just model performance. With poor model performance, business performance will be inadequate, but good model performance does not guarantee good business performance!

Image by Free To Use Sounds on Unsplash

To evaluate business performance, start with a product or feature goal, for example increasing the revenue of an e-commerce site. Once the goal is defined, assign a product metric to evaluate success. This metric should be separate from any model metric and should quantify only the product’s success. Product metrics vary: the number of users a feature attracts and the click-through rate (CTR) of recommendations are both valid examples.

At the end of the day (and fiscal period), product metrics are what matter. They represent the goals of the product. Any other metric should be treated as a tool for optimizing product metrics. Typically, a project aims to improve a single product metric, but its impact is often measured against several others. Some of these are guardrail metrics: metrics that must not fall below a set threshold. For example, an ML project might aim to increase a product metric such as the number of users while keeping other metrics, such as average session length, stable.
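The guardrail idea can be sketched as a simple launch check. This is a minimal illustration, not a prescribed method: the `Guardrail` class, the `evaluate_launch` function, the metric name, and all numbers are hypothetical and chosen only to show the logic.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    """A metric that must not drop below a floor (hypothetical structure)."""
    name: str
    floor: float


def evaluate_launch(primary_before, primary_after, observed, guardrails):
    """Return (ship, violations).

    ship is True only if the primary product metric improved and no
    guardrail metric fell below its floor.
    """
    violations = [g.name for g in guardrails if observed[g.name] < g.floor]
    improved = primary_after > primary_before
    return improved and not violations, violations


# Illustrative numbers only, not real data.
guardrails = [Guardrail("avg_session_minutes", floor=4.0)]
ship, violations = evaluate_launch(
    primary_before=10_000,   # number of users before the change
    primary_after=11_200,    # number of users after the change
    observed={"avg_session_minutes": 3.6},
    guardrails=guardrails,
)
print(ship, violations)  # False ['avg_session_minutes']
```

Here the user count went up, but the change would still be held back because the session guardrail was breached.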

Measuring the effectiveness of an ML approach requires tracking model performance. Product metrics cannot be quantified until a product that uses ML has been deployed, so while the product is being built, offline metrics (model metrics) are what define success. A good offline metric can be evaluated without exposing the model to users. It should also correlate positively with the product metrics and business goals.

Suppose you are developing a feature that offers suggestions as users type queries into an online retail store. The success of this feature can be measured by CTR (the product metric). To create the suggestions, you could build a model that predicts the words a user will type next and displays those predictions. The model’s performance can then be defined by word-level accuracy: how often the model predicts the correct next set of words. In this scenario, the model would need extremely high accuracy to increase the product’s CTR, because a single error in word prediction would render a suggestion useless.
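To see why word-level accuracy can look healthy while suggestions remain mostly useless, the sketch below compares per-word accuracy with the share of suggestions that are correct in full. The function names and the three example queries are made up for illustration.

```python
def per_word_accuracy(predicted, typed):
    """Fraction of individual words predicted correctly, position by position."""
    correct = total = 0
    for p, t in zip(predicted, typed):
        p_words, t_words = p.split(), t.split()
        total += len(t_words)
        correct += sum(a == b for a, b in zip(p_words, t_words))
    return correct / total


def exact_match_rate(predicted, typed):
    """Fraction of suggestions where every word matches what the user typed."""
    hits = sum(p.split() == t.split() for p, t in zip(predicted, typed))
    return hits / len(typed)


# Hypothetical suggestions versus what users actually went on to type.
predicted = ["red running shoes", "wireless phone charger", "blue denim jacket"]
typed = ["red running shoes", "wireless phone case", "black denim jacket"]

print(round(per_word_accuracy(predicted, typed), 2))  # 0.78
print(round(exact_match_rate(predicted, typed), 2))   # 0.33
```

Seven of nine words are right, yet only one suggestion in three is usable, which is the gap between the model metric and what drives clicks.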

A better approach would be to train a model that classifies the user’s input into the categories of your catalog and suggests the few most likely categories. The number of categories in a catalog is far smaller than the number of words in the English language, which makes this a much easier model metric to optimize. Moreover, only one of the suggested categories needs to be correct to generate a click.
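An offline metric that matches the category approach is a top-k hit rate: a prediction counts as correct if the true category appears anywhere among the k suggestions shown. The sketch below assumes the classifier returns a score per category; the function name, categories, and scores are hypothetical.

```python
def top_k_hit_rate(scores, true_categories, k=3):
    """Fraction of queries whose true category is among the k highest-scoring ones."""
    hits = 0
    for category_scores, truth in zip(scores, true_categories):
        top_k = sorted(category_scores, key=category_scores.get, reverse=True)[:k]
        hits += truth in top_k
    return hits / len(true_categories)


# Hypothetical classifier scores for three partial queries.
scores = [
    {"shoes": 0.6, "jackets": 0.3, "phones": 0.1},
    {"shoes": 0.2, "jackets": 0.1, "phones": 0.7},
    {"shoes": 0.5, "jackets": 0.4, "phones": 0.1},
]
true_categories = ["jackets", "phones", "phones"]

print(round(top_k_hit_rate(scores, true_categories, k=1), 2))  # 0.33
print(round(top_k_hit_rate(scores, true_categories, k=2), 2))  # 0.67
```

Showing two categories instead of one doubles the hit rate in this toy data, because the model no longer has to rank the right answer first.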

Image by Glen Carrie on Unsplash

Business performance often gets lost in the hype around ML model performance, so it is important to stay clear about which metrics actually need to be optimized.

Beyond business performance and model performance, other measures of success are needed, including model freshness and speed. Freshness matters because models age as data distributions change. Speed matters in any software: autonomous vehicles would not be feasible if it took several seconds to process data and generate predictions.
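Both freshness and speed can be tracked with very little code. The sketch below is one possible way to do it, under stated assumptions: `model_age_days` and `latency_percentiles` are hypothetical helpers, the `predict` argument stands in for whatever inference call a real system uses, and acceptable thresholds depend entirely on the product.

```python
import math
import time
from datetime import datetime, timezone


def model_age_days(trained_at, now=None):
    """Whole days elapsed since the model was trained (a simple freshness proxy)."""
    now = now or datetime.now(timezone.utc)
    return (now - trained_at).days


def latency_percentiles(predict, inputs, percentiles=(50, 95)):
    """Time each call to predict and return nearest-rank percentiles in milliseconds."""
    timings_ms = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        timings_ms.append((time.perf_counter() - start) * 1000)
    timings_ms.sort()
    result = {}
    for p in percentiles:
        rank = max(1, math.ceil(p / 100 * len(timings_ms)))
        result[p] = timings_ms[rank - 1]
    return result


# Stand-in for a real model call; replace with actual inference.
def fake_predict(x):
    return x * 2


trained_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(model_age_days(trained_at, now=datetime(2024, 1, 31, tzinfo=timezone.utc)))  # 30
print(latency_percentiles(fake_predict, range(1000)))
```

Reporting a high percentile such as p95 alongside the median is a common choice because occasional slow predictions are what users notice.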

For more on this topic, I highly recommend Building Machine Learning Powered Applications by Emmanuel Ameisen. It covers the skills necessary to design, build, and deploy applications powered by machine learning.
