Now that we have understood machine learning bias and seen the drastic impact it can have on our lives, let us look at the different types of bias, or the different factors that cause machine learning bias.
Data Bias
When certain elements of a dataset are more heavily weighted or represented than others, the resulting machine learning bias can be attributed to the data.
Data bias can further be specified into the following types:
1. Sample bias / selection bias: the data used to train the system is either not large enough or not representative enough.
For example, a facial recognition system trained primarily on images of white men will have considerably lower accuracy for women and for people of other ethnicities.
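One simple guard against sample bias is to audit how each group is represented before training. The sketch below uses made-up counts and a hypothetical 15% threshold purely for illustration; real projects would read group labels from the dataset's metadata.

```python
from collections import Counter

# Hypothetical demographic group of each training image (illustrative counts).
training_groups = (
    ["white_male"] * 800
    + ["white_female"] * 120
    + ["black_female"] * 40
    + ["black_male"] * 40
)

counts = Counter(training_groups)
total = sum(counts.values())

# Flag any group making up less than 15% of the training set (assumed threshold).
underrepresented = {
    group: count / total
    for group, count in counts.items()
    if count / total < 0.15
}
print(underrepresented)
```

A model trained on this set would see eight times as many white men as any other group, which is exactly the skew behind the facial recognition example above.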
2. Prejudice Bias /Association Bias: the data used to train the system reflects existing prejudices, stereotypes, and faulty societal assumptions, thereby introducing those same real-world biases into the machine learning itself.
For example, training on data about medical professionals that includes only female nurses and male doctors would perpetuate a real-world gender stereotype about healthcare workers in the computer system.
3. Exclusion bias: the deletion of valuable data thought to be unimportant. It can also occur through the systematic exclusion of certain information.
For example, imagine you have a dataset of customer sales in America and Canada. 98% of the customers are from America, so you choose to delete the location data, thinking it is irrelevant. However, this means your model will not pick up on the fact that your Canadian customers spend twice as much.
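The sales example can be checked in a few lines: before dropping a column, group by it and see whether it carries signal. The data below is fabricated to match the scenario in the text (98% American customers, Canadians spending twice as much).

```python
import pandas as pd

# Hypothetical sales data matching the example: 98 US customers, 2 Canadian.
sales = pd.DataFrame({
    "location": ["US"] * 98 + ["CA"] * 2,
    "purchase": [100.0] * 98 + [200.0] * 2,
})

# Before deleting the location column, check whether it predicts anything.
avg_by_location = sales.groupby("location")["purchase"].mean()
print(avg_by_location)
# Canadians average twice the US spend, so dropping "location"
# would silently exclude a pattern the model could have learned.
```

The column looked 98% redundant, yet the group means show it encodes a real difference; that gap between "mostly the same value" and "no information" is where exclusion bias creeps in.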
4. Measurement bias: the data collected for training differs from the data encountered in the real world, or faulty measurements distort the data.
For example, in an image recognition dataset, the training data may be collected with one type of camera while the production data comes from a different camera. Measurement bias can also arise from inconsistent annotation during the data labeling stage of a project.
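Inconsistent annotation can often be detected by comparing label rates across annotators on the same batch of items. The labels and the gap threshold below are invented for illustration.

```python
from collections import Counter

# Hypothetical labels from two annotators on the same 100 images.
annotator_a = ["cat"] * 50 + ["dog"] * 50
annotator_b = ["cat"] * 80 + ["dog"] * 20

def label_fractions(labels):
    """Return each label's share of the total."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in counts}

frac_a = label_fractions(annotator_a)
frac_b = label_fractions(annotator_b)

# A large gap in label rates on identical data suggests the annotators
# are applying different guidelines, a source of measurement bias.
gap = {label: abs(frac_a[label] - frac_b.get(label, 0.0)) for label in frac_a}
print(gap)
```

Here both labels show a 30-point disagreement, which would warrant reconciling the annotation guidelines before training on the merged labels.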
There are also types of machine learning bias whose origins are not in the data. Examples include:
1. Algorithm bias: a problem within the algorithm that performs the calculations powering the machine learning computations, causing it to either favor or unfairly penalize a certain section of the population.
2. Anchoring bias: occurs when choices about metrics and data are based on personal experience or a preference for a specific set of data. By "anchoring" to this preference, models are built on the preferred set, which could be incomplete or even contain incorrect data, leading to invalid results.
For example, if the facility collecting the data specializes in a particular demographic or comorbidity, the data set will be heavily weighted towards that information. If this set is then applied elsewhere, the generated model may recommend incorrect procedures or ignore possible outcomes because of the limited availability of the original data source.
3. Confirmation bias / observer bias: the tendency to choose source data or model results that align with currently held beliefs or hypotheses. The model's generated results and output can in turn strengthen the end user's confirmation bias, leading to bad outcomes.