I found that many university courses, books or online training either taught me how to build software or how to train machine learning (ML) models, but few blended both worlds. Having worked on various AI projects in the IBM Garage, I want to share how my experience in using test-driven development (TDD) helped me build better AI-powered applications.
This blog post uses an AI-powered web app as an example of how to apply TDD in AI projects. In the web app, people can upload images taken inside and outside of houses. The website then displays whether the photo was taken outside or, if taken inside, in which room.
Software developers use test-driven development (TDD) to make sure that code works as expected. Instead of aiming for the perfect solution in the first pass, the code and tests are built iteratively, one feature at a time. In TDD, developers first write a failing test and then just enough code to make it pass.
Automated tests and TDD help developers in several ways:
- Instant feedback: You instantly know whether the code fulfills the specification and handles the edge cases, both of which are encoded as tests.
- Test-driven debugging: Detect problems early on and pin down which parts of the code are not working as expected.
- Change with confidence: Other developers can implement new features or refactor without the fear of breaking the product.
- Tackle challenges: Solving simpler cases first gives you confidence when tackling tougher challenges.
- Little to no wasted effort: You only write the code needed to implement a requirement.
I regularly start projects with a simple test: people can visit the website in their browser (not just on my computer) via an app that is deployed to the cloud. This is the minimum functionality needed to interact with our AI application.
Why start small when there is a plethora of model code on GitHub, published together with state-of-the-art papers? While that code usually lets you train a model and reproduce the metrics listed in the paper, making the model accessible to an application, often via an API endpoint, is rather difficult and time-consuming. Many of these tools are grounded in research, and productionizing them or their artifacts is often not a priority. This simple first task therefore puts our focus on deployment early on and gives us the foundation to build upon. Also consider that if you postpone deploying the model because “it takes a long time,” chances are deployment poses a higher risk than building the algorithm itself.
In pseudo code, this simple test and code could look like:
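A minimal sketch of what that could look like, written here in Python rather than pseudo code and assuming pytest, requests and Flask; the URL and names are placeholders, not from a real project:

```python
# test_deployment.py: the first failing test. Can people reach the deployed app?
import requests

APP_URL = "https://room-classifier.example.com"  # placeholder for your cloud URL


def test_app_is_reachable():
    response = requests.get(APP_URL)
    assert response.status_code == 200


# app.py: just enough code to make the test pass once deployed to the cloud
from flask import Flask

app = Flask(__name__)


@app.route("/")
def index():
    return "Room classifier is up"
```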
Next, we build the happy path where data flows through our system: image inputs from the UI to an algorithm, and predictions from the algorithm back to the UI. What is the simplest algorithm we could implement? One that simply returns a constant or random value.
While this might seem trivially easy, it will help your overall project:
- Alignment: “Show don’t tell” ensures that people are not talking past each other.
- Data collection: You can collect data the same way people will use the product, reducing bias. For example, in the wild people will take photos with old smartphones in subpar lighting, so you should collect your test data that way too.
- Collect feedback: Prove that the product is valuable to customers and get feedback to identify issues early on.
- Integration: Integrate between the UI and ML sides of the project from the beginning. That way there is always a deployed version you can show.
Again, tests and implementation could look like:
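For instance, again sketched in Python with pytest; the room labels and the sample photo path are made-up placeholders:

```python
# classifier.py: the simplest possible algorithm, always predict one class
def predict(image_bytes):
    # No model yet; the UI can already send an image and display a label.
    return "outside"


# test_happy_path.py: data flows from the UI through the algorithm and back
from classifier import predict

ROOM_LABELS = {"outside", "kitchen", "bathroom", "bedroom", "living room"}


def test_prediction_flows_through_the_system():
    with open("photos/backyard.jpg", "rb") as f:  # illustrative sample photo
        label = predict(f.read())
    assert label in ROOM_LABELS
```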
With an unbalanced data distribution (e.g. ~60% of photos are taken outside), this algorithm is already better than a random coin flip.
This will set the benchmark for all future algorithms:
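For example, a benchmark test along these lines, where labelled_test_photos is an assumed pytest fixture yielding (image, label) pairs collected the same way real users take photos:

```python
# test_benchmark.py: accuracy measured on photos taken "in the wild"
from classifier import predict


def test_accuracy_beats_current_baseline(labelled_test_photos):
    hits = sum(predict(image) == label for image, label in labelled_test_photos)
    accuracy = hits / len(labelled_test_photos)
    # The constant "outside" classifier already reaches ~60%, so that is the bar.
    assert accuracy >= 0.6
```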
What to test for?
Writing this test raises the question: what should we actually test for? While conventional code behaves deterministically, AI algorithms are probabilistic. The success of machine learning models is often measured with algorithm-specific metrics such as accuracy, recall or mAP, and we can spend days or weeks optimizing them. Unfortunately, these metrics rarely predict the overall product’s commercial success. Therefore, we should primarily test for product-specific metrics that align with our business goals.
In the same way that TDD keeps developers from wasting time writing unneeded code, it can also keep us from wasting time optimizing algorithms unnecessarily. Optimize, and write a failing test, for issues identified in testing with real end users, not for hypothetical edge cases that might never occur. Instead of aiming for perfection (100% accuracy or test coverage), aim to meet the business objectives.
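As an illustration, a product-level test might check a user-facing requirement such as response time instead of an algorithm metric; the fixture name and the two-second budget below are assumptions, not figures from the project:

```python
# A product-level test: users should get an answer quickly during the upload flow.
import time

from classifier import predict


def test_prediction_is_fast_enough(sample_photo_bytes):
    # sample_photo_bytes is an assumed fixture; 2 seconds is an illustrative budget
    # derived from the business goal, not from an algorithm-specific metric.
    start = time.time()
    predict(sample_photo_bytes)
    assert time.time() - start < 2.0
```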
Now that we have made it work, we will make it better! We find the biggest bottlenecks in our happy path, create a list of possible improvements and test them. Here, TDD fits naturally with how research and data science work:
Instead of casually formulating a hypothesis and verifying it manually, why not make it explicit and automate it? Re-use your automated hypothesis tests to conduct experiments quickly, measure their impact and incrementally improve the algorithm.
A hypothesis to improve the algorithm could be: if I classify images with a high proportion of green as “outside” and the rest as “kitchen”, accuracy will improve.
Which we could implement in pseudo code as:
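One possible sketch in Python using Pillow; the threshold value and the “kitchen” fallback label are illustrative assumptions:

```python
# classifier.py: hypothesis, a high proportion of green pixels means "outside"
import io

from PIL import Image

GREEN_THRESHOLD = 0.3  # illustrative value, to be tuned on the collected photos


def green_proportion(image_bytes):
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    pixels = list(image.getdata())
    green_pixels = sum(1 for r, g, b in pixels if g > r and g > b)
    return green_pixels / len(pixels)


def predict(image_bytes):
    if green_proportion(image_bytes) > GREEN_THRESHOLD:
        return "outside"
    return "kitchen"  # the most common indoor room in our imbalanced data
```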
As there is also an imbalance among the indoor pictures, this algorithm gets us to ~80% accuracy without any deep learning. This raises the threshold for all future algorithms:
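The benchmark test stays the same; only the threshold moves up to the new baseline (the ~80% figure comes from the result above, the fixture remains an assumption):

```python
# test_benchmark.py: the bar is raised for every future algorithm
from classifier import predict


def test_accuracy_beats_current_baseline(labelled_test_photos):
    hits = sum(predict(image) == label for image, label in labelled_test_photos)
    accuracy = hits / len(labelled_test_photos)
    # The green-pixel heuristic reaches ~80%; no future model may fall below it.
    assert accuracy >= 0.8
```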