

Artificial Intelligence, or more accurately Machine Learning, has spread across industries and business functions, granted with vast differences in maturity and success. The returns are coming from organizations that have built innovative models, stood up an ML and DataOps foundation to support continuous improvement, and shaped a business model that can leverage both appropriately. As more companies reach this point, it is becoming clear that not all models are equal, and in some cases legal or brand damage can result if models show signs of bias or unfairness.
“My bet for the next 10 years is that the most competitive businesses will be Responsible AI-driven companies.”
— Lofred Madzou, AI project lead at the World Economic Forum
This paper assumes you have the data scientists, ML engineers, MLOps, and everything else in place to productionize your models. You have implemented a few critical models, or a system of many narrowly focused models working in conjunction, but you lie awake fearing that little is in place to ensure ethical and fair results from the models, suspecting you could be managing a time bomb unless some governance is put in place, and wondering, "What can be done?"
Bias in machine learning, and by loose extension its unethical use, typically stems from five areas:
- Bias in the Organization — Is the business, industry, or culture already rife with bias or discrimination in the way it operates? One may not even know whether this is the case, but in areas such as policing or college admissions there have been enough studies to show inherent biases exist. These should be dealt with before an ML system is built around them; otherwise they will be replicated, and typically amplified, by the model.
- Bias in the Problem — Is the problem defined in a way that will consciously or unconsciously discriminate? For example, excluding mortgage applicants because they are not using the website and did not approve sharing PII. A model might prioritize "safer" digital candidates, where it has a complete view of the applicant's personal data and past transactions, over those who submit a paper application in a branch.
- Bias in the Data — Is there bias in the training data, introduced by sampling or collection issues? Profile the data and interrogate it along potential risk areas (e.g., gender or race), looking for imbalances or outliers that could lead to bias in the trained ML system (a minimal profiling sketch follows this list).
- Bias in the Model — Is there trained bias caused by the model's design or tuning, beyond what was inherited from the areas above? Some algorithms make it easier than others to peek into the model's workings, but in general ML models are treated as black boxes, making a review of their inner workings impractical.
- Model Misuse & Incorrect Generalization — Was the model extended to use cases or data sets it was not intended for during training? The data describing how the world works changes over time, potentially producing incorrect and biased results from models that once operated as intended. These issues cause problems because most ML models are narrow by design and do not generalize well as the underlying data and use cases change.
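To make the "Bias in the Data" check concrete, here is a minimal profiling sketch, assuming a pandas DataFrame; the column names ("gender", "outcome") and the under-representation threshold are hypothetical and would need to match your own schema and risk areas.

```python
# Minimal sketch: profile training data for group imbalance before modeling.
# Column names and the 50%-of-largest-group threshold are illustrative only.
import pandas as pd

def profile_group_balance(df: pd.DataFrame, sensitive_col: str, label_col: str) -> pd.DataFrame:
    """Report representation and positive-label rates per group."""
    representation = df[sensitive_col].value_counts(normalize=True)
    positive_rate = df.groupby(sensitive_col)[label_col].mean()
    report = pd.DataFrame({
        "share_of_rows": representation,
        "positive_label_rate": positive_rate,
    })
    # Flag groups that are badly under-represented relative to the largest group.
    report["under_represented"] = (
        report["share_of_rows"] < 0.5 * report["share_of_rows"].max()
    )
    return report

# Toy example: one group is both under-represented and has a lower label rate.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M", "M", "M", "M"],
    "outcome": [1, 0, 1, 1, 1, 0, 1, 1],
})
print(profile_group_balance(df, sensitive_col="gender", label_col="outcome"))
```

A report like this will not prove the data is fair, but it surfaces the imbalances worth interrogating before training.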
Consider a hypothetical university with an online learning platform. You might see the following set of ML models working together to improve student and university outcomes. Rather than one monolithic model, the head of analytics has a complex set of models working together as part of a student experience plan. This is becoming more and more common, as it is often easier to maintain many narrow models than one big one.
A typical long list of applied models might look like this:
- Planning & optimization models to create a personalized syllabus for students
- Classification & regression models to monitor and forecast students based on progress
- Conversation agents for forums and student support
- Sentiment-analysis to detect student emotions and determination
- Recommender systems to suggest additional courses and further readings
- Classifiers and NLP techniques for automatic e-assessment of assignments
- Etc….
In today's world, the risk of something going wrong becomes entirely plausible: offering certain courses based on gender, or grading with an unconscious preference toward particular student segments or response styles.
While the example above is hypothetical, during COVID-19 in the UK the Office of Qualifications and Examinations Regulation (Ofqual) trained a model to grade students entering university. A poorly formed problem, a flawed objective definition, and poor data led to 40% of students receiving worse grades than their teachers had predicted. The resulting chaos around the "F#$% the Algorithm" protests is a textbook example of not factoring in bias risks from the start [14]. Upon inspection, the model disproportionately hurt working-class and disadvantaged communities and inflated scores for students from private schools!
To add complexity, imagine two or more highly correlated features such as 'time spent answering questions' and 'primary language spoken'. A model should, ideally, infer topic mastery from 'time spent answering questions' rather than 'primary language spoken', even though the two may be heavily correlated.
Showing that a model places substantially more importance on defensible features and less on controversial ones such as economic status, nationality, or gender would be very useful for the university when backing up a model's robustness and impartiality.
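As a rough illustration of how one might back up that claim, the sketch below uses scikit-learn's permutation_importance on a synthetic data set; the feature names ("time_spent", "primary_language") and the data-generating process are hypothetical stand-ins, not the university's actual model.

```python
# Sketch: compare the importance a trained model places on a defensible
# feature versus a correlated sensitive one. Data is synthetic by construction.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 1000
time_spent = rng.normal(30, 10, n)                              # defensible signal
primary_language = (time_spent + rng.normal(0, 8, n) > 30).astype(int)  # correlated, sensitive
# Mastery is driven by time spent, not language, in this toy setup.
mastery = (time_spent + rng.normal(0, 5, n) > 30).astype(int)

X = np.column_stack([time_spent, primary_language])
model = RandomForestClassifier(random_state=0).fit(X, mastery)

result = permutation_importance(model, X, mastery, n_repeats=10, random_state=0)
for name, importance in zip(["time_spent", "primary_language"], result.importances_mean):
    print(f"{name}: {importance:.3f}")
```

If the sensitive feature's importance rivals the defensible one's despite the correlation, that is exactly the kind of finding worth documenting and challenging before the model goes live.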
In general, our poor head of analytics needs to guard against the following scenarios [5], where AI:
- Unfairly allocates opportunities, resources, or information
- Fails to provide the same quality of service
- Reinforces existing societal stereotypes
- Denigrates people by being actively offensive
- Over- or under-represents groups
Much like in law, where the burden of proof falls on the disruptor, a new set of burdens and obligations is arising for data scientists looking to innovate. This is an area that all the major players, including DARPA and Big Tech, are investing in heavily.
We explored the risks associated with the data and analytics used within an analytics-driven organization in a previous post. With GDPR and a resurgence of government regulation and focus on consumer rights, the risk is real for organizations that used to hoard data and build intrusive ML models, a common practice for more than a decade. Now organizations need to put safeguards in place against:
- compliance/regulator fines
- impact on brand/reputational strength
- public perception of discrimination or unethical behavior
A recent Wing VC survey of data scientists [11] found model explainability was respondents' top ML challenge, cited by more than 45 percent of respondents, with data labeling a distant second (29 percent) and model deployment and data quality checks rounding out the top four.
At Slalom we approach this proactively, so that ML systems can be designed to be as transparent as possible with an awareness of the risks present, leading to systems that can operate in a sustainable and fair fashion.
The more complex a model or an ensemble of models becomes, the harder it is to look under the hood and question the fairness, or even the logic, being used. It is similar to the workings of the human mind: it is far too difficult to guess what is going on among all the tiny chemical reactions, so we base judgments of intelligence and ethics on the actions observed from the black box (i.e., our brain). The original Turing Test sought to define a general test for intelligence using only the observed signals coming out of the black box.
Following this approach, there are roughly two main and related approaches to ML model transparency: SHapley Additive exPlanations (SHAP) [12] and Local Interpretable Model-agnostic Explanations (LIME). In general, they tweak the inputs a little at a time and measure the impact on the model's output, building up a per-feature indication of which features have the greatest influence on the model's prediction or classification. This should sound a lot like feature selection when building an ML model.
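As a rough, hand-rolled illustration of that perturb-and-measure idea (a simplification, not SHAP's or LIME's actual algorithm), the sketch below nudges one feature at a time and records how much the prediction moves:

```python
# Simplified perturbation-based sensitivity: tweak each feature of a single
# input and measure the average change in the model's output.
import numpy as np

def single_feature_sensitivity(predict_fn, x, scale=0.1, n_samples=100, seed=0):
    """Average absolute change in prediction when each feature is perturbed."""
    rng = np.random.default_rng(seed)
    base = predict_fn(x.reshape(1, -1))[0]
    sensitivities = []
    for j in range(len(x)):
        deltas = []
        for _ in range(n_samples):
            x_perturbed = x.copy()
            # Small random nudge proportional to the feature's magnitude.
            x_perturbed[j] += rng.normal(0, scale * (abs(x[j]) + 1e-8))
            deltas.append(abs(predict_fn(x_perturbed.reshape(1, -1))[0] - base))
        sensitivities.append(np.mean(deltas))
    return np.array(sensitivities)
```

With a hypothetical scikit-learn-style classifier, you could call it as, for example, single_feature_sensitivity(lambda X: model.predict_proba(X)[:, 1], X_test[0]) to see which features most move the predicted probability for that one row.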
SHAP builds on the work of game theorist Lloyd Shapley, who explored how important each player is to the overall cooperative outcome, and what payoff he or she can reasonably expect.
“Traditionally, influence measures have been studied for feature selection, i.e. informing the choice of which variables to include in the model [8]. Recently, influence measures have been used as explainability mechanisms [1, 7, 9] for complex models. Influence measures explain the behaviour of models by indicating the relative importance of inputs and their direction.” [7]
For example, for a binary classifier, a high positive Shapley value for a feature implies it is pushing the prediction toward "1", while a negative Shapley value implies the feature is contributing to an outcome more likely to be "0".
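A minimal sketch with the shap package, assuming it is installed; the classifier, data set, and feature names are synthetic stand-ins used purely to show how the sign of a SHAP value is read.

```python
# Sketch: SHAP values for a binary classifier. Positive values push the
# prediction toward class "1", negative values toward class "0".
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
feature_names = ["time_spent", "assignments_completed", "forum_posts"]
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 500) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # fast, exact for tree-based models
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

# Read the sign of each value for the first row.
for name, value in zip(feature_names, shap_values[0]):
    direction = "toward class 1" if value > 0 else "toward class 0"
    print(f"{name}: {value:+.3f} ({direction})")
```

In practice you would run this against held-out or production data and aggregate the values across rows to see which features drive the model overall.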
In 2016, a group from Carnegie Mellon University introduced a family of Quantitative Input Influence (QII) measures that capture the degree of influence of inputs on the outputs of systems. Leveraging Shapley values and other techniques, they calculate the marginal influence a feature has on the outcome, both by itself and jointly with other features.
Coming up with a set of generalizable QIIs and reports is a step toward systematically monitoring models to understand whether bias is creeping in, or at least whether the "right" inputs are the most influential ones according to human judgment.
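A rough Monte Carlo sketch of that marginal-influence idea (not the CMU authors' exact procedure): the Shapley-style contribution of one feature is estimated by averaging how much revealing it changes the prediction across random coalitions of the other features. The function name and sampling choices are illustrative.

```python
# Monte Carlo estimate of a Shapley-style marginal influence for feature j.
# "Unknown" features are filled in from a background data set.
import numpy as np

def shapley_estimate(predict_fn, x, background, j, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    d = len(x)
    contributions = []
    for _ in range(n_samples):
        order = rng.permutation(d)
        pos = np.where(order == j)[0][0]
        known = order[:pos]                      # features "revealed" before j
        b = background[rng.integers(len(background))].copy()
        x_without = b.copy()
        x_without[known] = x[known]              # coalition without feature j
        x_with = x_without.copy()
        x_with[j] = x[j]                         # same coalition plus feature j
        contributions.append(
            predict_fn(x_with.reshape(1, -1))[0]
            - predict_fn(x_without.reshape(1, -1))[0]
        )
    return float(np.mean(contributions))
```

Averaging these per-feature estimates across many rows gives a crude, model-agnostic influence report of the kind a monitoring process could track over time.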
Modern tools provide some predefined metrics (akin to QIIs) to cover data and model biases.
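For instance, two commonly predefined metrics, demographic parity difference and disparate impact ratio, can be computed directly from model predictions and a sensitive attribute. The 0/1 group coding and the "80% rule" threshold in the comments are conventions used here for illustration, not requirements of any particular tool.

```python
# Sketch of two predefined bias metrics computed on model predictions.
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Difference in positive-prediction rates between two groups (coded 0/1)."""
    rate_a = y_pred[group == 0].mean()
    rate_b = y_pred[group == 1].mean()
    return rate_b - rate_a

def disparate_impact_ratio(y_pred, group):
    """Ratio of positive-prediction rates; values below ~0.8 are often flagged."""
    rate_a = y_pred[group == 0].mean()
    rate_b = y_pred[group == 1].mean()
    return min(rate_a, rate_b) / max(rate_a, rate_b)

# Toy predictions for two groups of four students each.
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(demographic_parity_difference(y_pred, group))  # -0.5
print(disparate_impact_ratio(y_pred, group))         # ~0.33
```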
Note how such data- and prediction-level metrics do not even look at the actual inner workings of the model itself; they are black-box measurements. Some of the latest tools coming out of the big tech companies include ways of peering into neural networks to see how the weights and connections work together. For example, Google's XRAI builds a heatmap of sorts based on the importance of an input node, or of any layer in a neural network.
Bringing this to the real world, various software tools became available in 2020. One such tool seeking to address this is Truera, which raised $12 million in VC funding for its AI explainability platform. While useful for deep learning, it is equally relevant for all classification and regression models out there.
"We study the problem of explaining a rich class of behavioural properties of deep neural networks. Distinctively, our influence-directed explanations approach this problem by peering inside the network to identify neurons with high influence on a quantity and distribution of interest." [4]
Truera has some great case studies that can help provide real-world references [6][7]. Their platform will likely be a strong tool for assessing bias in ML.
In addition to Truera and other niche tools in the market, Google Cloud and Amazon Web Services have announced their own solutions: Google Cloud's Explainable AI and Amazon SageMaker Clarify, the latter announced during re:Invent 2020. They implement many of the concepts above, and more, letting you explore the influence of selected features on an ML model and analyze data sets for potential biases.