Imagine an autonomous vehicle traffic sign detector whose accuracy plummets in rain or on other unexpected inputs. With machine learning (ML) an increasingly integral part of our daily lives, it is crucial that developers identify such potentially dangerous scenarios before real-world deployment. Rigorous performance evaluation and testing of models have thus become a high priority in the ML community, where an understanding of how and why ML system failures might occur can help with reliability, model refinement, and identifying appropriate human oversight and engagement actions.
The process of identifying and characterizing ML failures and shortcomings is, however, extremely complex, and there is currently no effective universal approach for doing so. To address this, a Microsoft research team recently introduced Error Analysis, a responsible AI toolkit for describing and explaining system failures.
The Error Analysis toolkit performs two main functions:
- Identify cohorts with high error rates versus benchmarks and visualize how the error rates are distributed.
- Diagnose the root causes of the errors by visually diving deeper into data and model characteristics (via its embedded interpretability capabilities).
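For orientation, the snippet below is a minimal sketch of how the dashboard is typically launched from Python, based on the examples in the Responsible AI Widgets repository. The `raiwidgets` import and the `ErrorAnalysisDashboard` parameter names are assumptions and may differ between versions; consult the repository for the exact signature.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from raiwidgets import ErrorAnalysisDashboard  # assumed package/entry point

# Train any scikit-learn style model on a toy dataset.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Point the dashboard at the model, the evaluation data and the true labels
# so it can compute and visualize error cohorts.
# Parameter names are illustrative and may differ across versions.
ErrorAnalysisDashboard(model=model,
                       dataset=X_test,
                       true_y=y_test,
                       features=list(data.feature_names))
```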
Error Analysis starts with error identification, visualized either through error heatmaps or through decision trees guided by errors. With the heatmap option, users select input features of interest, and the heatmap highlights cells with higher error rates in a darker red colour. Microsoft explains that analysis in this error identification view is largely guided by users’ knowledge or hypotheses about which features might be most important for understanding failures.
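To make the heatmap computation concrete, the sketch below (an illustration, not the toolkit’s code) bins two user-selected features and reports the error rate per cell, which is the quantity the heatmap shades in red. The feature names and synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical evaluation frame: two selected features plus true and predicted labels.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "weather":   rng.choice(["clear", "rain", "fog"], size=1000),
    "sign_size": rng.uniform(10, 120, size=1000),   # pixels, hypothetical feature
    "y_true":    rng.integers(0, 2, size=1000),
})
# Simulate a model that is wrong ~15% of the time.
df["y_pred"] = np.where(rng.random(1000) < 0.85, df["y_true"], 1 - df["y_true"])
df["error"] = (df["y_true"] != df["y_pred"]).astype(int)

# Bin the continuous feature, then compute the mean error per (weather, size-bin) cell.
df["size_bin"] = pd.cut(df["sign_size"], bins=4)
heatmap = df.pivot_table(index="weather", columns="size_bin",
                         values="error", aggfunc="mean", observed=False)
print(heatmap.round(3))  # darker red in the UI corresponds to higher values here
```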
The binary tree visualization, meanwhile, leverages input features to maximally separate model errors from successes. This approach is useful in real-world cases, where errors are often caused by more than one feature and it is difficult to discover which combination of features conspired to cause a critical failure. With the tree map, users can explore the following information (a small sketch of the idea follows the list):
- Error rate: the proportion of instances in the node for which the model is incorrect, shown through the intensity of the node's red colour.
- Error coverage: the proportion of all errors that fall into the node, shown through the node's fill rate.
- Data representation: the number of instances in the node, shown through the thickness of the node's incoming edge together with the instance count displayed in the node.
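The same idea can be approximated outside the toolkit. The sketch below (illustrative only, not the toolkit's implementation) fits a shallow decision tree to a binary "model was wrong" label and then reports error rate, error coverage and instance count per leaf, i.e. the three quantities listed above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
errors = (model.predict(X_test) != y_test).astype(int)  # 1 = model was wrong

# A shallow surrogate tree trained on the error labels splits the data into
# cohorts whose feature combinations best separate failures from successes.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_test, errors)
leaves = surrogate.apply(X_test)

total_errors = errors.sum()
for leaf in np.unique(leaves):
    mask = leaves == leaf
    error_rate = errors[mask].mean()                            # shade of red in the UI
    error_coverage = errors[mask].sum() / max(total_errors, 1)  # node fill rate
    print(f"leaf {leaf}: n={mask.sum()}, "
          f"error_rate={error_rate:.2f}, error_coverage={error_coverage:.2f}")
```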
After identifying cohorts with higher error rates, Error Analysis diagnoses the underlying errors using four methods: Data Exploration, Global Explanation, Local Explanation and Perturbation Exploration.
Data Exploration surfaces dataset statistics and feature distributions, detecting cohorts that are underrepresented or whose feature distributions differ significantly from the rest of the data. Global Explanation identifies the top K features that most influence the model's predictions for a selected data cohort. Local Explanation surfaces feature importance values for individual predictions, helping to identify missing features or label noise that could lead to prediction errors. Perturbation Exploration observes how predictions change when feature values of selected data points are perturbed.
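To make the perturbation step concrete, the sketch below (illustrative only, with a hypothetical choice of feature index and perturbation size) perturbs one feature of a selected data point and compares the model's predicted probabilities before and after the change.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Pick a misclassified point (if any) and perturb a single feature of interest.
wrong = np.where(model.predict(X_test) != y_test)[0]
idx = wrong[0] if len(wrong) else 0
feature = 0  # hypothetical feature index chosen by the user

perturbed = X_test[idx].copy()
perturbed[feature] *= 1.10  # apply a 10% change to the selected feature

before = model.predict_proba(X_test[idx:idx + 1])[0]
after = model.predict_proba(perturbed.reshape(1, -1))[0]
print("prediction before perturbation:", before.round(3))
print("prediction after perturbation: ", after.round(3))
```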
It’s hoped the new toolkit will enable researchers and practitioners to more efficiently and accurately identify and diagnose error patterns, an important step in the development of robust ML systems. The toolkit is integrated into Microsoft’s Responsible AI Widgets OSS repository on the project’s GitHub. Additional information is available on the Error Analysis website.