In the last year, the industry has put greater focus on moving Machine Learning (ML) models out of the labs and into real-life deployments. In their enthusiasm for building ever cooler ML models, ML teams have not paid sufficient attention to some of the critical Software Engineering aspects that we have learned the hard way over the last decades. An ML model that is to become part of business software should be of high quality (functionally), robust, compliant and explainable. This article is a quick tour of these aspects that define quality in the data science lifecycle (DSLC).
In practice, the DSLC has been looked at as three mini lifecycles — data management, model build and deployment management. In this article, I am breaking the deployment management phase further into two sub-parts — Acceptance into Service (AIS) and In-life Service (ILS). AIS includes the necessary independent validation & verification functions like integration testing, business acceptance testing etc., whereas ILS is more of an operations function that maintains & monitors the ML model in the production system.
In an enterprise setup, typically no valuable project is a one-off. A delivered application has to be maintainable and continuously developed over a long period of time, and no single component can deliver the business value on its own. The DSLC should therefore be designed with a well-thought-out configuration management plan that gives all project artefacts a long shelf life. Even though we will not discuss them explicitly, these fundamental aspects must be kept in mind for the rest of this discussion.
The following broad aspects should be part of the DSLC for taking an ML model live.
- Functional validation: Ensure functional performance of the ML model through adequate testing, including unit, integration and business acceptance testing. Code-level validations are also an essential part of the end-to-end validation strategy.
- ML Fairness: The ML model is fair and neither favors nor disfavors any single group or collection of groups
- Secure DSLC: The data science lifecycle follows secure data and model building practices
The above three requirements, together with Data Privacy (which we are not covering in this article), fundamentally form the quality foundation of an ML model. Let us look at them individually.
Early identification and handling of issues is essential to keep the cost of quality at optimal levels. A data engineer’s view of quality will revolve around standard data quality measures. Ask any ML model engineer about testing and the answer will mostly involve test and validation data sets. But the DSLC has to provide holistic quality assurance for the delivered ML model.
Data management phase
As part of the data pipeline activities, data engineers have long included checks for basic data quality goals like duplicate detection, missing values etc. Apache NiFi and Airflow, for example, have validator mechanisms in place. Testing in this phase is more mature than in the other phases thanks to the long history of BigData/BI/Analytics projects. The current focus is to move away from batch-mode thinking and bring the data management phase as close to CI/CD practice as possible. In the last couple of years, specialized frameworks like Great Expectations and TensorFlow Data Validation (TFDV) have been built with a focus on early testing of data. These frameworks offer specialized testing features while bringing the necessary separation of validation requirements from the ‘programming’ part of the data pipeline.
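As an illustration, here is a minimal sketch of pipeline-level checks using the Pandas-flavoured Great Expectations API (the exact API differs across versions, and the file name and column names below are made up for this example):

```python
import great_expectations as ge

# Load a batch of pipeline output as a Great Expectations dataset
# ("orders.csv" and the column names below are hypothetical).
df = ge.read_csv("orders.csv")

# Declare expectations that encode the basic data quality goals:
df.expect_column_values_to_be_unique("order_id")           # no duplicates
df.expect_column_values_to_not_be_null("customer_id")      # no missing keys
df.expect_column_values_to_be_between("amount", 0, 10000)  # sane value range

# Validate the batch; a failed expectation can be used to fail the pipeline run.
results = df.validate()
print(results.success)
```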
Model build phase
An essential step that will reduce nightmare diagnostic runs during model build time is to check the data before starting ML model training. While a level of assurance on data quality should have been obtained during ingestion, the data has passed through many levels of transformation in the data pipeline to get to this point. Defensive programming at this stage is highly recommended, with checks like data shape validations, schema checks, data anomaly detection etc. These can be achieved with a mix of inline Python code and data library function calls — e.g. TFDV functions like display_anomalies/validate_statistics.
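A minimal sketch of such defensive checks, mixing plain pandas assertions with TFDV calls (the dataframe, column names, thresholds and schema file are assumptions for illustration):

```python
import pandas as pd
import tensorflow_data_validation as tfdv

def check_training_data(df: pd.DataFrame) -> pd.DataFrame:
    """Defensive checks run just before model training starts."""
    # Shape and schema sanity checks (columns and thresholds are illustrative).
    expected_cols = {"age", "income", "label"}
    assert expected_cols.issubset(df.columns), "missing expected columns"
    assert len(df) > 1000, "suspiciously small training set"
    assert df["label"].isna().sum() == 0, "labels must not be missing"
    assert df["age"].between(0, 120).all(), "age outside plausible range"

    # Statistical/anomaly checks with TFDV against a previously saved schema.
    stats = tfdv.generate_statistics_from_dataframe(df)
    schema = tfdv.load_schema_text("training_schema.pbtxt")
    anomalies = tfdv.validate_statistics(stats, schema)
    tfdv.display_anomalies(anomalies)  # shows nothing if no anomalies were found
    return df
```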
Build-time validations with the test dataset ensure that the ML model works within the constraints of the original data distribution. An ML model’s readiness to be promoted from development to production does not depend only on aggregate performance metrics like F1 score, AUC, ROC etc. Theoretically, a valid scenario could be over-represented among the failures even in a high-performing ML model. Only when the data scientist studies the results and the associated test data (sample by sample) can she/he find this skew. This very important aspect of manually validating ML model performance is too often skipped by ML engineers.
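To make this concrete, here is a minimal sketch of slicing the test-set failures by a feature to spot scenarios that are over-represented among the failures (the dataframe and column names are hypothetical):

```python
import pandas as pd

def failure_breakdown(test_df: pd.DataFrame, y_true, y_pred, slice_col: str) -> pd.DataFrame:
    """Compare how often each slice appears among failures vs. in the whole test set."""
    df = test_df.copy()
    df["correct"] = (pd.Series(y_true).values == pd.Series(y_pred).values)
    failures = df[~df["correct"]]

    overall_share = df[slice_col].value_counts(normalize=True)
    failure_share = failures[slice_col].value_counts(normalize=True)

    report = pd.DataFrame({"overall_share": overall_share,
                           "failure_share": failure_share}).fillna(0.0)
    # A slice whose failure_share is much larger than its overall_share is
    # over-represented among failures and deserves a manual, sample-by-sample look.
    report["over_representation"] = (report["failure_share"]
                                     / report["overall_share"].replace(0, float("nan")))
    return report.sort_values("over_representation", ascending=False)

# Example (hypothetical): failure_breakdown(test_df, y_true, y_pred, slice_col="customer_segment")
```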
Deployment Management — AIS phase
Integration testing: In projects where an ML-based application replaces an existing non-ML application, the rule of thumb for integration testing should be that the success criteria do not change on account of the ML introduction. Any change at the test case level to accommodate the ML introduction has to be clearly analyzed and signed off by the test manager.
Acceptance testing: The real value of acceptance testing comes when the end user develops a level of trust in the application before its launch. Developing trust in an application with an ML model at its core can be harder. Due to its non-deterministic nature, an ML-model-based system might look temperamental compared to a traditional, programmed one. The preferred approach is to give the user a way to play around with the ML model through simple-to-use interfaces that give her/him an intuition for the ML model behavior. Behavior can be inferred by letting the user change feature values within understood limits (create controlled perturbations) and see the impact on the predictions. If the user creates a perturbation expecting the inference to change from the original prediction and it does not, the behavior *might* be a cause for concern. The emphasis is intentional: unexpected ML model behavior need not be an issue in all cases. Coupling these kinds of experiments with ML model explainability will help the user develop a deeper appreciation of the workings of the ML model and build confidence in the new system. IBM’s AIX360 is a fantastic framework for this. In addition, setting up a way to run counterfactuals is another way to peer into the mysterious workings of complex ML models. Google’s What-If Tool and Cortex Certifai are two tools that provide a nice framework for achieving this objective.
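A minimal sketch of such a controlled-perturbation check, assuming a scikit-learn-style classifier and illustrative feature names (the model, sample and values are all hypothetical):

```python
import pandas as pd

def probe_perturbation(model, sample: pd.DataFrame, feature: str, new_value) -> None:
    """Perturb one feature of a single sample and show how the prediction moves.

    `model` is any trained classifier exposing predict_proba (scikit-learn style);
    the single-row sample, feature name and new value come from the business user.
    """
    original = sample.copy()
    perturbed = sample.copy()
    perturbed[feature] = new_value

    p_before = model.predict_proba(original)[0]
    p_after = model.predict_proba(perturbed)[0]

    print(f"Prediction before perturbing {feature}: {p_before}")
    print(f"Prediction after  perturbing {feature}: {p_after}")
    # The user compares this movement with their domain expectation; a mismatch
    # is a prompt for deeper investigation, not automatically a defect.

# Example (hypothetical): probe_perturbation(model, applicant_row, "monthly_income", 2500)
```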
Deployment Management — ILS phase
Inference-time data input has to be continuously assessed as well. Checking for training-serving data distribution skew can help identify a problem before it hits the deployed ML model. To make this assessment, we need a continuous comparison of inference data against the baselined training data linked to the live ML model. A trend on the skew provides a measure of drift over a period of time and helps us take the necessary action before the system falls over. TFDV provides handy generate_statistics/validate_statistics functions to create the training baseline and validate the inference data. For unstructured data, newer approaches using histogram & centroid based baselines are starting to find use in this area.
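A minimal sketch of such a skew check with TFDV (the file paths, the feature name and the threshold are assumptions for illustration):

```python
import tensorflow_data_validation as tfdv

# Baseline: statistics and schema computed once from the training data.
train_stats = tfdv.generate_statistics_from_csv("train.csv")
schema = tfdv.infer_schema(train_stats)

# Allow at most a small L-infinity distance between the training and serving
# distributions of a categorical feature before flagging skew.
tfdv.get_feature(schema, "payment_type").skew_comparator.infinity_norm.threshold = 0.01

# Periodically: statistics from a window of logged inference requests.
serving_stats = tfdv.generate_statistics_from_csv("serving_log.csv")

skew_anomalies = tfdv.validate_statistics(statistics=train_stats,
                                          schema=schema,
                                          serving_statistics=serving_stats)
tfdv.display_anomalies(skew_anomalies)  # empty output means no skew detected
```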
Usage of initial live testing like A/B testing, Multi-Armed Bandits, canary releases etc. before a full-fledged roll-out is becoming more prevalent and is maturing fast. As the real test of ML model performance is with live data, it is highly advisable to set up these kinds of live randomized controlled experiments wherever feasible. The success of these tests clearly depends on the selection of KPIs, cohort selection and the length of testing. As the data is live and real-time, it might be tempting to react to the metrics quickly. But because of the statistical idiosyncrasies which are the main reason we take up these types of testing in the first place, it is better to let the tests run their course, study the results and then take action.
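For example, a simple end-of-test check for an A/B experiment on a conversion-style KPI might look like the following sketch (the counts are hypothetical, and the two-proportion z-test is just one of several reasonable choices):

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Return the two-sided p-value for a difference between two conversion rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * norm.sf(abs(z))

# Hypothetical results after letting the test run its full course:
p_value = two_proportion_ztest(successes_a=540, n_a=10000,   # current system
                               successes_b=610, n_b=10000)   # ML-based variant
print(f"p-value = {p_value:.4f}")  # act only if the evidence is strong enough
```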
- Usage of specialized testing frameworks like Great Expectations & TFDV during the data pipeline phase is highly recommended
- Build-time validation is a combination of ML model performance metrics and manual checks of failure-mode scenarios to ensure proper ML behavior
- During the acceptance testing phase, business users should be given the right tools to play around with and develop trust in ML models
- A proper MLOps strategy includes monitoring for data drift using baselined training data and live inference data to prevent live ML model failure
- Use randomized controlled experiments to ensure the ML model performs well under live traffic conditions
We shall be covering ML Fairness and Secure DSLC in parts 2 & 3…