Art and Science of Testing Machine Learning Models

What is a model?

A model means different things in different contexts. Let us specifically try to understand what a model is in the context of data and statistics. Here, a model is a mathematical representation (or approximation) of the selected (or underlying) data, expressed in terms of variables and constants.
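
To make ‘variables and constants’ concrete, here is a minimal sketch: a straight line fitted to a handful of points by least squares. The data and the helper names (fit_line, predict) are made up for illustration and are assumptions, not anything taken from a real data set.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (hypothetical helper)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up observations: x is the variable, y is what we want to approximate.
xs = [1, 2, 3, 4, 5]
ys = [110, 205, 290, 410, 495]

# The two fitted constants are the whole "model" of this data.
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Apply the fitted model to a new value of the variable."""
    return slope * x + intercept

print(f"y = {slope:.1f} * x + {intercept:.1f}")
print(predict(6))
```

The line does not pass through every point. It only approximates the data, which is the sense of ‘model’ used in this article.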

What is Machine Learning?

Simply put, it is machines finding a pattern in a given set of data and applying that pattern to another set of data they are presented with. How a machine ‘finds a pattern’ varies with the technique one uses: either we train the machine, or we let the machine find the pattern on its own.

There are obviously differences, but let us stick to machine learning for now.

Machine learning models do not have a group of formulas or a set of rules, given as steps, that determine the outcome or prediction. It is this nature of machine learning that throws a curveball at testing.

In machine learning, we make the system learn and understand the data. What we know in machine learning is the problem statement, the ‘why do we need this’, and not the ‘how do we do this.’

Traditional functional testing is built upon a set of requirements that are also used to build the software. These requirements are detailed and are expected to be stated without ambiguity, so that anyone reading them understands them in the same way. Given the inherent nature of machine learning, such requirements are absent.

Insurance case:

Let us consider an example of pricing an insurance policy for, say, your car. If an insurance company wants to introduce this new product, it has to go through the rigor of statistically demonstrating that its pricing is prudent and that it intends to make some money in the process. The company may take published data, existing sets of ‘rates’ as they are called, and the variables it considers (anything from the gender of the owner to where the car is located), and build a statistical model. That model is operationalized as a list of steps with formulas that end in an amount to be charged to the insured: the premium.
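
A minimal sketch of what such an operationalized rating might look like follows. Every number and name in it (BASE_RATE, AREA_FACTOR, age_factor, premium) is a made-up assumption and not an actual rate. The point is only that each step is an explicit formula, so a tester can work out the expected premium by hand and compare.

```python
# All rates and factors below are hypothetical, for illustration only.
BASE_RATE = 500.0
AREA_FACTOR = {"urban": 1.25, "rural": 1.0}

def age_factor(driver_age):
    """Step 1: look up a loading based on the driver's age band."""
    if driver_age < 25:
        return 1.5
    if driver_age < 65:
        return 1.0
    return 1.25

def premium(driver_age, area):
    """Step 2: multiply the base rate by each factor to get the premium."""
    return round(BASE_RATE * age_factor(driver_age) * AREA_FACTOR[area], 2)

# Traditional functional tests: expected values derived from the written rules.
assert premium(30, "rural") == 500.0
assert premium(22, "urban") == 937.5   # 500 * 1.5 * 1.25
assert premium(70, "rural") == 625.0   # 500 * 1.25 * 1.0
```

Because the steps are explicit, an expected value such as 937.5 can be derived independently of the code and checked exactly.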

Let us say that the insurance company has gathered data over the years, ranging from detailed data about the insured to some obscure data. Now, if the company wants to understand this trove of data using machine learning, it can easily build a model and find patterns involving multiple variables. So far, easy.

Here is the fun part: how does one assure that the built machine learning model is right and represents the underlying data?

Let us say that the model has found a relationship between baldness and accidents: bald men have fewer accidents. How does one assure that this model is right?

Claim fraud use case:

Let us consider another practical case: machine learning to understand fraudulent claims. Let us further narrow it down to one machine learning technique: supervised learning.

What is supervised learning?

It is a technique where examples of inputs and their corresponding outputs are provided, and the system builds a model based on these input-output pairs. Then, for any subsequent input, the system produces an output.

To test this, we separate the data into two blocks. One is used to train the model: we tell the model that this is a fraud and this is not a fraud. The other is used to test the built model: we use another set of ‘known’ data and check whether the model is correct or wrong.

It sounds pretty easy — we have something to compare against. So, what is the issue?

There are many.

Are the two data sets similar? If not, won’t the model change with the inclusion of new data sets?
How does one even separate the data sets? Where is that boundary?
What level of error is acceptable?
If the expected values in the test data are not known, how does one interpret the output?
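
Two of these questions can at least be turned into numbers. The sketch below compares the fraud rate of the two blocks as a crude similarity check, and breaks ‘errors’ down into false positives and false negatives. The labels are made up, and the tolerances MAX_RATE_GAP and MIN_RECALL are assumptions that the business, not the tester alone, would have to set.

```python
def fraud_rate(labels):
    """Fraction of records labelled as fraud."""
    return sum(labels) / len(labels)

def confusion(actual, predicted):
    """Counts of true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    return tp, fp, fn, tn

def precision_recall(actual, predicted):
    """Precision: how many flagged claims were fraud. Recall: how many frauds were flagged."""
    tp, fp, fn, _ = confusion(actual, predicted)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels for illustration (True means fraud).
train_labels = [True, False, False, False, True, False, False, False]
test_actual = [True, True, True, False, False, False, False, False]
test_predicted = [True, True, False, True, False, False, False, False]

MAX_RATE_GAP = 0.15  # assumed tolerance for "the two data sets are similar"
MIN_RECALL = 0.6     # assumed acceptable level of missed frauds

rate_gap = abs(fraud_rate(train_labels) - fraud_rate(test_actual))
precision, recall = precision_recall(test_actual, test_predicted)

print("fraud-rate gap between blocks:", rate_gap, "ok" if rate_gap <= MAX_RATE_GAP else "too large")
print("precision:", round(precision, 3))
print("recall:", round(recall, 3), "ok" if recall >= MIN_RECALL else "too low")
```

Comparing a single rate is a deliberately simple stand-in for a proper comparison of distributions. Splitting errors into two kinds matters here because a missed fraud and a wrongly flagged honest claim rarely cost the same.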

Thus, testing machine learning is neither straightforward nor simple.

How does one test a machine learning model?

We are still restricting ourselves to functional testing of machine learning, and not even adding the complexity of iterations, changes to models, repeated tests, automation, integration, performance, engineering-related tests and data-pipeline-related tests. All of these are important, but functional tests ascertain machine learning’s raison d’être.

Let us start with the skillset.

One may wonder whether these are the skills a developer needs. Yes, you are right. Does that mean only a developer can become a tester? Not really.

A tester should understand, in depth, the business problem that machine learning is attempting to solve, or the business case that it is attempting to address. This provides critical context that is usually lost in a mere problem statement or use case. This tells the tester ‘what’ is being built.

Next, programming. There are many toolsets for implementing machine learning. One cannot simply test without some basic understanding of how the solution is built. It could be as simple as understanding loops in R (yes, loops), but it helps if the tester can actually review the code to a certain extent. This gives the tester insight into the ‘how.’

Next, math. Machine learning is not only math, but it is also not ‘without’ math. One need not know multivariate calculus at the depth needed to solve theoretical physics problems. However, the tester needs to know the basics of calculus, probability and statistics. This tells the tester ‘why’ a model is built in a particular way.

Only when we have a basic understanding of the what, how and why can we begin to devise a testing approach and strategy.

What this also tells us is that every problem is unique, every solution is different, and the underlying math driving each solution is different.

Coming back to the insurance fraud problem, the solution and the math for an automobile line of business will be different from those for marine cargo. The underlying data will also be different, as one insures an automobile every year whereas cargo is insured for a journey.

Traditional testing techniques like boundary value analysis, equivalence partitioning and pairwise testing can still be applied to machine learning models, but they should be adapted. In fact, understanding these concepts helps us devise an appropriate testing strategy.
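
One possible adaptation is sketched below. Since a model has no exact expected value per input, the checks assert properties instead: scores stay within range, representatives of each partition are ordered sensibly, and the score does not jump at the edges of the valid input range. The function fraud_score is a hypothetical stand-in for a trained model, and MAX_CLAIM, the partitions and the ordering expectation are all assumptions.

```python
import math

MAX_CLAIM = 50_000.0  # assumed upper limit of a valid claim amount

def fraud_score(claim_amount):
    """Hypothetical stand-in for a trained model: a score between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(claim_amount - 5_000.0) / 1_000.0))

# Equivalence partitioning: one representative per assumed class of input.
PARTITIONS = {"small": 500.0, "typical": 5_000.0, "large": 20_000.0}

# Boundary values: each edge of the valid range paired with its nearest neighbour.
BOUNDARIES = [(0.0, 0.01), (MAX_CLAIM - 0.01, MAX_CLAIM)]

def check_partitions(score):
    """Scores must be valid, and (by assumption) larger claims never score lower."""
    values = [score(x) for x in PARTITIONS.values()]
    in_range = all(0.0 <= v <= 1.0 for v in values)
    ordered = values == sorted(values)
    return in_range and ordered

def check_boundaries(score, max_jump=0.01):
    """A tiny change in input at an edge must not cause a large change in score."""
    return all(abs(score(a) - score(b)) <= max_jump for a, b in BOUNDARIES)

print("partitions ok:", check_partitions(fraud_score))
print("boundaries ok:", check_boundaries(fraud_score))
```

The checks take the scoring function as a parameter, so the same tests can be rerun unchanged whenever the model is retrained.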

Testing ML is an art because there is, as yet, no standard set of rules, checklists or methodologies for approaching any and all machine learning (or data science) solutions.

Testing ML is a science because one cannot wing it just because there is no set of rules. One needs to build a structured, reusable, repeatable and evidence-based testing strategy, where a ‘hunch’ is converted into coverage and an ‘assumption’ is translated into test hypotheses.
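
As a rough sketch of what ‘hunch into hypothesis’ could look like, each hunch below is written as a named check that is run against every record, and the records that contradict it are kept as evidence. The model function is a hypothetical stand-in for a trained model, and the field names (claim_id, amount, prior_claims) and both hypotheses are assumptions made up for illustration.

```python
def model(record):
    """Hypothetical stand-in for a trained fraud model; returns a score in [0, 1]."""
    score = 0.1 + 0.2 * min(record["prior_claims"], 3)
    if record["amount"] > 10_000:
        score += 0.2
    return min(score, 1.0)

def ignores_claim_id(predict, record):
    """Assumption under test: an identifier carries no signal."""
    return predict(dict(record, claim_id="SOME-OTHER-ID")) == predict(record)

def prior_claims_never_lower_score(predict, record):
    """Hunch under test: one more prior claim should not reduce the score."""
    more = dict(record, prior_claims=record["prior_claims"] + 1)
    return predict(more) >= predict(record)

HYPOTHESES = {
    "score ignores claim_id": ignores_claim_id,
    "prior claims never lower score": prior_claims_never_lower_score,
}

# Made-up records for illustration.
RECORDS = [
    {"claim_id": "C-1", "amount": 2_000, "prior_claims": 0},
    {"claim_id": "C-2", "amount": 15_000, "prior_claims": 2},
    {"claim_id": "C-3", "amount": 9_000, "prior_claims": 5},
]

def run_hypotheses(predict, records):
    """Return, per hypothesis, the records that contradict it (the evidence)."""
    return {
        name: [r for r in records if not check(predict, r)]
        for name, check in HYPOTHESES.items()
    }

report = run_hypotheses(model, RECORDS)
for name, failures in report.items():
    print(name, "->", "holds" if not failures else f"fails for {failures}")
```

Adding a new hunch means adding one function and one entry in HYPOTHESES, which keeps the strategy reusable and repeatable across model versions.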

What’s the fun in writing about machine learning and not making a prediction?

Testing machine learning solutions will remain in this grey area, a combination of art and science. But one thing is certain: the budget for testing ML will be higher than that for traditional applications.
