Intelligence is the ability to learn from experience, solve problems, and use knowledge to adapt to new situations (David G. Myers, Psychology, 12th Edition). I refer to artificial intelligence (AI) systems as a collection of advanced technologies that allow machines to sense, comprehend, act, and learn. Machine learning and statistical models are often at the heart of these AI or data-driven systems.
I repeatedly encounter three major pitfalls when designing statistical or machine learning models for data-driven systems:
1. not knowing the business value and the definition of "good"
2. selecting wrong or biased data
3. designing models that are too complex and costly to maintain
In the following, I will focus only on the selection bias problem.
Machine learning systems, much like humans, learn through iterations with data. The quality, amount, preparation, and selection of data are critical to a machine learning solution's success.
A famous statement in machine learning is "garbage in, garbage out." Of course, I fully agree with this statement. However, it is often not clear what garbage actually is.
There are obvious data quality problems, such as missing data or outliers. But even before we can judge the quality of data, we have to select a dataset that is representative of the business application.
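As an illustration, the basic quality checks mentioned above can be sketched in a few lines. This is a minimal, hypothetical example using only Python's standard library; the sensor readings are made up, and `None` marks a missing value:

```python
# Minimal data-quality checks on a hypothetical list of sensor readings.
readings = [21.3, 20.8, None, 22.1, 98.6, 21.0, None, 20.5]

# Locate missing values and work only with the observed ones.
missing = [i for i, v in enumerate(readings) if v is None]
values = [v for v in readings if v is not None]

# Flag outliers with a simple two-standard-deviation rule.
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
outliers = [v for v in values if abs(v - mean) > 2 * std]

print(f"missing indices: {missing}")  # → [2, 6]
print(f"outliers: {outliers}")        # → [98.6]
```

Checks like these catch the obvious problems; they say nothing yet about whether the dataset is representative, which is the harder question.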
Any AI engine, like any statistical method, computes only on the datasets it has seen. However, every underlying dataset is the product of human decisions. Human biases occur in the selection and curation of data, and they will show up in an AI system's outputs.
The video summarizes an AI inclusion problem; it shows the biased output for the search term 'family'.
From a societal or ethical perspective, this is a significant problem, often summarized as an AI inclusion problem. Note that under the term AI inclusion, many different aspects are typically discussed, including development, social impact, policy implications, and legal issues concerning AI systems; see, e.g., aiandinclusion.
Technically speaking, what happens in this video is a selection bias problem.
Selection bias occurs when the samples used to build the model are not fully representative of the cases the model will encounter in the future. Selection bias is not unique to AI systems; it lies at the heart of any statistical evaluation.
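Selection bias can be made concrete with a small simulation. The sketch below (a minimal illustration with made-up numbers, using only Python's standard library) estimates a population mean from a representative random sample versus a sample drawn only from the upper half of the distribution, as an online survey that over-represents high earners might:

```python
import random

random.seed(0)

# Hypothetical population: household incomes, log-normally distributed.
population = [random.lognormvariate(10, 0.5) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# Representative sample: a uniform random draw from the population.
random_sample = random.sample(population, 1_000)

# Biased sample: only people above the median respond to the survey.
threshold = sorted(population)[len(population) // 2]
biased_pool = [x for x in population if x > threshold]
biased_sample = random.sample(biased_pool, 1_000)

mean = lambda xs: sum(xs) / len(xs)
print(f"true mean:          {true_mean:10.0f}")
print(f"random-sample mean: {mean(random_sample):10.0f}")
print(f"biased-sample mean: {mean(biased_sample):10.0f}")
```

The random sample lands close to the true mean, while the biased sample systematically overestimates it. Collecting more biased data does not help, because the error is structural, not statistical noise.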
In a data-driven world, automated processes and decisions are based on statistical models. Thus, a basic understanding of statistics is mandatory to judge results and outputs critically.
In my current data science lecture class, I notice a strong wish for content on deep learning systems.
However, we have to teach the basics before rushing into modeling. Within our data science education programs, we should focus more on statistics, and thus on the critical judgment of AI systems, instead of teaching fancy algorithms.
What do you think: would explicitly teaching selection bias help tackle the bigger AI inclusion problem?