How committing to transparency made us deliver better AI products
There has been quite a bit of debate around black box AI models being applied to real-world problems. Common AI models have become so large and complex that even the developers and product managers building them don’t know exactly what kind of decisions they will make.
This has resulted in all sorts of unwanted outcomes, some with potentially severe negative consequences for society. Examples include automated decisions with a racist or sexist bias, such as in Apple’s infamous credit card launch in late 2019, and outright dangerous behavior, such as a Tesla misinterpreting a 35 miles per hour speed limit sign after it had been altered with a small piece of tape.
Even with these risks, the vast majority of people still believe in the power and potential of AI. This article looks at three real-life cases in which harmful consequences of black box AI were mitigated. It closes with a 6-step transparency guideline that we have implemented in our own development process at Slimmer AI to avoid the pitfalls of black box models.
There are techniques and best practices to open up AI models and peek under the hood. Sometimes this is done by selecting a simpler, so-called white box model that is easier to explain. Other times, this entails using novel techniques or grabbing another AI model with the sole task of explaining the first one. The subfield of AI that is concerned with improving model transparency is called explainable AI (or XAI for short). As is perhaps evident from the name, XAI helps human beings understand why the machine reached a particular decision. If humans need to act on decisions made by the system, it’s often very important for these outcomes to be explainable.
Within explainable AI, there is a noteworthy distinction between global interpretability, which enables you to understand how the model generally responds in any situation, and local explanations, which show why a decision was made in one specific case. Global interpretability is very useful for developers and business operators because it helps with debugging, understanding the model and regulatory compliance. Local explanations are also important to end users who want to know how an algorithmic decision was reached in their specific case and whether it used fair criteria in the process.
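To make the global versus local distinction concrete, here is a minimal sketch using a made-up linear scoring model. The feature names, weights and applicant are all hypothetical, and ranking features by weight size only makes sense if the features are on comparable scales.

```python
# Hypothetical linear scoring model: every name and number here is made up.
WEIGHTS = {"income": 0.8, "open_debts": -1.5, "years_customer": 0.3}
BIAS = -0.2


def predict_score(sample):
    """Linear score: bias plus the weighted sum of the feature values."""
    return BIAS + sum(WEIGHTS[name] * value for name, value in sample.items())


def global_importance():
    """Global view: rank features by the absolute size of their weight."""
    return sorted(WEIGHTS, key=lambda name: abs(WEIGHTS[name]), reverse=True)


def local_explanation(sample):
    """Local view: each feature's contribution to this one prediction."""
    return {name: WEIGHTS[name] * value for name, value in sample.items()}


applicant = {"income": 2.0, "open_debts": 0.0, "years_customer": 4.0}
print("global ranking:", global_importance())
print("local contributions:", local_explanation(applicant))
print("score:", predict_score(applicant))
```

In this toy setup, open_debts is the most influential feature globally, yet it contributes nothing to this particular applicant's score because the applicant has no open debts. That gap between the two views is exactly why both are worth reporting.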
Do we have time for explainable AI?
In a typical AI development workflow, putting in the effort to open up a model can often feel like time you cannot afford. Machine learning engineers have done their data analyses, feature engineering, model selection and (hyper)parameter tuning in several cycles and finally landed on a model that gives satisfactory, one might even say impressive, results. Product managers have become excited about the model’s performance and its implications for business outcomes. Everyone is on fire to get this thing into production.
Why pause now and take the time off to play around with the outcomes?
Because, in my five years as a machine learning engineer at Slimmer AI, a company with more than 10 years of experience building dozens of AI solutions, I have seen quite a few cases where explainable AI saved the day.
Case 1: Don’t copy that
One of the AI products in the Slimmer AI Science team classifies scientific documents based on relevant pharmaceutical information. When development started a few years ago, we were all excited that the first results seemed very promising. With both very high precision and recall, we were able to pinpoint which documents mattered.
To our customer, however, an automated decision wasn’t enough. Their workflow, like the pharmaceutical industry itself, is heavily regulated. They needed to know why a document was classified in one category and not another.
Initially, we used LIME (paper, code) to get an approximation of which words in the preprocessed text had contributed most to the results. LIME fits a simple white box model on the feature space directly surrounding a data sample. This way, it learns what subtle differences in input values cause one prediction over another and hence which features (in this case, words) are most important.
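The perturbation idea behind LIME can be sketched in a few lines. This is a simplified illustration, not the LIME library and not the production classifier: the black box is a hypothetical stand-in, all on/off combinations of the words are enumerated instead of sampled, and LIME's proximity weighting is left out.

```python
from itertools import product


def black_box_probability(tokens):
    """Hypothetical stand-in for a real classifier (an assumption for illustration).
    It secretly leans on the stray token '169'."""
    score = 0.1
    if "169" in tokens:
        score += 0.6
    if "dosage" in tokens:
        score += 0.2
    return score


def word_effects(tokens, predict):
    """Estimate each word's effect by switching words on and off.

    Every on/off combination of the words is scored by the black box. Over this
    full set of combinations, the mean prediction with a word minus the mean
    prediction without it equals that word's coefficient in a least-squares
    linear surrogate. Assumes the tokens are unique.
    """
    masks = list(product([0, 1], repeat=len(tokens)))
    preds = [predict([t for t, keep in zip(tokens, mask) if keep]) for mask in masks]
    effects = {}
    for i, token in enumerate(tokens):
        with_word = [p for mask, p in zip(masks, preds) if mask[i] == 1]
        without_word = [p for mask, p in zip(masks, preds) if mask[i] == 0]
        effects[token] = sum(with_word) / len(with_word) - sum(without_word) / len(without_word)
    return effects


tokens = ["patients", "received", "daily", "dosage", "169"]
effects = word_effects(tokens, black_box_probability)
for token, effect in sorted(effects.items(), key=lambda item: -abs(item[1])):
    print(f"{token:>10}  {effect:+.2f}")
```

Enumerating every combination only works for a handful of words. For real documents, random sampling plus a weighted linear fit, as LIME does, is the practical route.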
To everyone’s surprise, one of the most important “words” that popped up in several texts was the number 169 at the end of the last sentence. We were baffled by this random-looking number and checked the original text for clues.
It turned out that texts from one specific source almost always included a copyright mark in HTML tags at the end of the text (169 is the character code of the copyright symbol, as in the HTML entity &#169;). After preprocessing, only the number 169 remained. Texts from this source had a higher likelihood of belonging to one specific category, so the model had learned to use the copyright reference to discern between categories.
While it might be a good idea to include a feature that specifies the source of the text, our model would not have been robust if we had put it into production in its current form. All it would take is for this one data source to remove the copyright symbol, or for another source to add it to their texts as well, and our model’s predictions would be incorrect.
We improved our preprocessing by properly removing whole HTML tags, increasing our confidence in the outcomes of this new model in production.
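A hedged sketch of how such a leak can happen and be closed is shown below. Both cleaning functions and the sample text are hypothetical, written only to reproduce the effect described above. They are not the actual preprocessing code.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")


def naive_clean(text):
    """Hypothetical flawed version: strips tags but leaves HTML entities in place,
    so the digits of '&#169;' survive as a token."""
    text = TAG_PATTERN.sub(" ", text)
    return re.findall(r"[a-z0-9]+", text.lower())


def robust_clean(text):
    """Strip tags, then decode entities so '&#169;' becomes the copyright
    symbol, which the tokenizer drops along with other punctuation."""
    text = TAG_PATTERN.sub(" ", text)
    text = html.unescape(text)
    return re.findall(r"[a-z0-9]+", text.lower())


raw_text = "The compound reduced symptoms. <p>&#169;</p>"
print(naive_clean(raw_text))
print(robust_clean(raw_text))
```

Tags are stripped before entities are decoded so that an escaped angle bracket in the text is not mistaken for the start of a tag.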
Case 2: Do you see what I see?
In a pan-European medical partnership we were the designated party to automatically detect skin infections surrounding the driveline tube of patients with a ventricular assist device. The final AI product was an app through which patients could check the wound surrounding their driveline. In case of moderate or severe infection, immediate contact with a physician would be warranted.
In the initial dataset, photos of the driveline entering a patient’s body were categorized by human physicians into one of four categories ranging from no to severe skin infection. Because the physicians sometimes disagreed, it wasn’t possible to obtain a perfect score. Moreover, the photos were often cluttered with tissues and bandages and the dataset was relatively small.
With such ambiguous data, it was even more important to make sure the model made sensible decisions.
While deep neural networks often get a bad reputation for being opaque black boxes, a technique called Grad-CAM (paper, code) enables you to visualize what the network is paying attention to when making its decision. For many use cases, this is a helpful technique to determine whether or not you should trust a prediction at face value.
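The core of Grad-CAM is a small computation: average the gradients of the class score over each feature map to get one weight per channel, take the weighted sum of the feature maps, and keep only the positive part. The sketch below shows just that step on tiny hand-written arrays. It assumes the activations and gradients of the last convolutional layer have already been extracted with a deep learning framework, and it skips the final upsampling to image size.

```python
def grad_cam(activations, gradients):
    """Combine feature maps into a Grad-CAM heatmap.

    activations[k][i][j]: feature map k of the last convolutional layer.
    gradients[k][i][j]:   gradient of the class score w.r.t. that activation.
    Both are assumed to be precomputed elsewhere.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = gradient averaged over all spatial positions.
    weights = [sum(map(sum, grad)) / (height * width) for grad in gradients]
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, feature_map in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * feature_map[i][j]
    # ReLU: keep only regions that push the class score up.
    return [[max(0.0, value) for value in row] for row in heatmap]


# Toy example with two 2x2 feature maps (made-up numbers).
activations = [
    [[1.0, 0.0], [0.0, 0.0]],  # channel 0 fires top-left
    [[0.0, 0.0], [0.0, 2.0]],  # channel 1 fires bottom-right
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],      # channel 0 supports the class
    [[-1.0, -1.0], [-1.0, -1.0]],  # channel 1 argues against it
]
heatmap = grad_cam(activations, gradients)
print(heatmap)
```

Only the top-left cell lights up here: the bottom-right region is active but counts against the class, so the ReLU removes it from the heatmap.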
Grad-CAM showed that even though our model often made very sensible predictions, it occasionally based its decisions not on the wound surrounding the driveline, but solely on the driveline itself! Presumably the small size of the dataset and the prominent appearance of the driveline had given the model the false impression that the orientation of the driveline contained an important clue.
This insight prompted us to automatically filter out the driveline from the photos before presenting them to our model. We tested and compared several masking and filling techniques, each time using Grad-CAM to analyze the impact. Eventually, we were able to settle on a specific filling technique that made the model’s predictions a lot more robust.
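As an illustration of what one filling technique could look like, the sketch below replaces masked pixels with the mean of the unmasked ones. This mean fill is an assumption chosen for simplicity, not necessarily the technique that was finally selected, and how the driveline mask is obtained is out of scope here.

```python
def fill_masked_pixels(image, mask):
    """Replace masked pixels with the mean of the unmasked pixels.

    image: 2D list of grayscale values.
    mask:  2D list of booleans, True where the object to hide is located.
    """
    kept = [
        pixel
        for row, mask_row in zip(image, mask)
        for pixel, masked in zip(row, mask_row)
        if not masked
    ]
    if not kept:
        raise ValueError("mask covers the whole image")
    fill_value = sum(kept) / len(kept)
    return [
        [fill_value if masked else pixel for pixel, masked in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


# Toy 2x3 image with a bright vertical "tube" in the middle column.
image = [[10, 200, 30], [20, 220, 40]]
mask = [[False, True, False], [False, True, False]]
filled = fill_masked_pixels(image, mask)
print(filled)
```

Running the explanation technique again on the filled images is what tells you whether the model's attention has actually moved to the wound.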
Case 3: Can’t touch this
In late 2018, we partnered with an institution to help identify customers at risk of getting into severe debt. The institution intended to reach out to these customers early on to get them on a customized payment plan before their finances worsened even further.
Their database consisted of a multiyear fine and transaction history of millions of customers. They also had access to another database from a different institution, containing additional personal information about their customers.
The domain experts involved intuitively felt that this second database would provide valuable information. However, due to privacy concerns, the organization was wary of combining these two databases unless this proved to be significantly beneficial to the cause.
We trained two models on the data in the first database: a simple logistic regression and a gradient boosted trees model (XGBoost), both using a large collection of engineered features. XGBoost easily outperformed logistic regression, giving us confidence that a more complex AI model was justified.
Next, we tested if we could make these predictions even better by adding the second database. To everyone’s surprise, the resulting model did not lead to any better results.
We therefore computed the feature importances of both models. This showed which input characteristics had the largest impact on predicting who would end up with severe debts.
For the model using only the first database, one particular customer fine in combination with several payment behaviors proved to be most important. For the model using both databases, it turned out that one specific characteristic from the new database was by far the highest contributing feature.
This meant that the second database indeed contained information that significantly contributed to making a correct prediction, just as the domain experts had suspected. But the same information was already captured in the model using only the first database, albeit through a more complex combination of several manually engineered features.
This led to a sigh of relief, as the client we had partnered with felt they could safely focus on the first database and uphold their value of privacy by removing any dependency on the second database.
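Feature importances can be computed in several ways. One model-agnostic option is permutation importance: shuffle one feature column, measure how much the score drops, and repeat. The sketch below uses this as a stand-in. It is not necessarily the method used in this project, and the toy model, feature names and data are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the right label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = {}
    for feature in rows[0]:
        drops = []
        for _ in range(n_repeats):
            column = [row[feature] for row in rows]
            rng.shuffle(column)
            shuffled = [{**row, feature: value} for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances[feature] = sum(drops) / n_repeats
    return importances


def toy_model(row):
    """Hypothetical model that only looks at the number of late payments."""
    return int(row["late_payments"] >= 2)


rows = [
    {"late_payments": 0, "household_size": 1},
    {"late_payments": 1, "household_size": 4},
    {"late_payments": 3, "household_size": 2},
    {"late_payments": 4, "household_size": 1},
    {"late_payments": 0, "household_size": 3},
    {"late_payments": 2, "household_size": 2},
    {"late_payments": 5, "household_size": 4},
    {"late_payments": 1, "household_size": 3},
]
labels = [toy_model(row) for row in rows]
importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

A feature the model never uses, such as household_size in this toy setup, gets an importance of exactly zero, which is the kind of evidence that helps argue a data source can be dropped.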
In the three different problems from the examples above, three different approaches to explainable AI were used. This shows that explainable AI is not just a single trick you can apply to each use case; it is a whole subfield of AI that is waiting to be mastered.
The main takeaway is that even though it is essential to have concrete metrics such as F1 scores and accuracy, we must also take the time to get to know and understand our models. No metric will capture the insights you can gather from opening up your model and getting an understanding of why it makes the predictions that it does.
To make the most of explainable AI, we use a 6-step transparency guideline at Slimmer AI when diving into a new AI adventure:
1. Agree on the goal
Before doing any data prep or model selection, align both engineers and business stakeholders on what the model is going to optimize. When building a predictive engine, how do we represent each category? And what are the business consequences if the model misses an instance or when it makes a mistake? The answers to these questions will determine how the model should be fine-tuned and help establish your first step towards a more transparent model.
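One way to turn the business consequences of errors into a concrete model setting is to pick the decision threshold that minimizes the total cost. The sketch below assumes a binary classifier that outputs scores. The scores, labels and cost figures are made up for illustration.

```python
def total_cost(scores, labels, threshold, cost_missed, cost_false_alarm):
    """Total cost of all errors when predicting positive for score >= threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label == 1 and not predicted_positive:
            cost += cost_missed  # false negative
        elif label == 0 and predicted_positive:
            cost += cost_false_alarm  # false positive
    return cost


def best_threshold(scores, labels, cost_missed, cost_false_alarm):
    """Try every observed score (plus 'never predict positive') as threshold."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, cost_missed, cost_false_alarm),
    )


scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

# Missing a case is ten times worse than a false alarm: a low threshold wins.
print(best_threshold(scores, labels, cost_missed=10, cost_false_alarm=1))
# A false alarm is ten times worse than a miss: a high threshold wins.
print(best_threshold(scores, labels, cost_missed=1, cost_false_alarm=10))
```

The same model ends up with a different operating point depending on which mistake hurts more, which is why this conversation belongs before any tuning starts.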
2. Explore the data
Next, get to know your data through descriptive statistics and correlation plots. This will already provide some insight into class distributions, outliers and the most relevant features. The results from this step will often cause you to refine the goal from step one. If the data contains personal information, rank the risk of using each feature or specific samples with respect to discriminatory outcomes. Unfair bias should be checked at the end of the development cycle, but now is the time to make a first selection.
3. Start simple
It’s best to start off with the simplest baseline model you can think of to avoid overfitting on your data. Additionally, simple models are often more interpretable out of the box. Although customers and marketing may love the idea of a deep neural network, if a good ol’ linear regression can do the job just as well, then it’s best to stick with the basics.
4. Get some perspective
Before presenting evaluation scores to your team or other stakeholders, look at the relative importance of each feature. Does the ranking make sense? Do domain experts agree with it? Could any of these features indicate unfair bias? Present these global model explanations alongside your evaluation scores and bias report to stakeholders to improve the model’s trustworthiness.
5. Zoom in
Next, look at the local explanations of both correctly and incorrectly predicted samples. Is it plausible that the samples were misclassified given these features and their values? Often your features can be further fine-tuned based on the information you gather during this review. Once you’re satisfied with the model’s results, report indicative sample explanations so that stakeholders gain an intuitive understanding of the model’s limitations. If possible, present explanations alongside predictions once your model goes into production.
6. Keep monitoring
Once your model is in production, make sure to continuously check for changes in the input data which could cause your model’s performance to degrade over time. Monitoring data drift and real-time performance will increase confidence in the model’s robustness.
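As a minimal sketch of a drift check on a single numeric feature, the Population Stability Index (PSI) compares how values are distributed over bins at training time and in production. The bin count, the equal-width binning and the example values below are assumptions. The alert threshold is a project-specific choice.

```python
import math


def population_stability_index(expected, actual, n_bins=4, eps=1e-6):
    """PSI between a reference sample and a new sample of one numeric feature.

    Bins are equal-width over the range of the reference sample. Values outside
    that range fall into the first or last bin. Empty bins get a tiny share
    (eps) to avoid taking the logarithm of zero.
    """
    low, high = min(expected), max(expected)
    width = (high - low) / n_bins
    if width == 0:
        width = 1.0  # all reference values identical: everything lands in bin 0

    def bin_shares(values):
        counts = [0] * n_bins
        for value in values:
            index = int((value - low) / width)
            counts[min(max(index, 0), n_bins - 1)] += 1
        return [max(count / len(values), eps) for count in counts]

    expected_shares = bin_shares(expected)
    actual_shares = bin_shares(actual)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_shares, actual_shares)
    )


training_values = [1, 2, 3, 4, 5, 6, 7, 8]
stable_values = [1, 2, 3, 4, 5, 6, 7, 8]
shifted_values = [6, 7, 7, 8, 8, 8, 9, 9]

print(population_stability_index(training_values, stable_values))
print(population_stability_index(training_values, shifted_values))
```

Identical distributions give a PSI of zero and the value grows as the production data moves away from the training data, so it can be tracked per feature on a dashboard or wired to an alert.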
I hope these examples have made you excited to embrace explainable AI in your own workflow. It’s a topic I’m very passionate about, and at Slimmer AI we are constantly monitoring and testing novel ways to improve our models’ transparency. Our current R&D focuses on whether some of our high-performing tree ensembles could be replaced by a more transparent, white box Explainable Boosting Machine.
What is your favorite technique? And do you have any other examples where explainable AI saved the day? Please comment here or reach out to me to discuss!