Model selection, deployment, and cost estimation
In Part 1 of this series we covered the fundamentals that every AI product manager needs to know. Part 2 covered team management and the ML model development process. Now it’s time to select the best model and deploy it to production.
Model selection — the accuracy vs inference time vs size tradeoff
Choosing the correct machine learning model for production deployment is a tradeoff between model accuracy, size and inference time (inference is the process of the model generating results). Balancing the tradeoff depends on what your feature or application needs to do for the user.
As a PM you need to understand this tradeoff and consider the right balance while selecting the production model. For example, while selecting a recommendation model for a shopping or food delivery app where users want instant results, you should prioritize models with fast response time. While selecting a model that detects cancer in medical images, you should prioritize accuracy over everything else. For a model deployed on edge computing hardware, size is a major constraint so optimize for that.
Set model performance benchmarks beforehand and ensure that all stakeholders are aware of them.
Your team can then channel their efforts in the right direction. Non-technical stakeholders often insist that the model have 100% accuracy, instant inference and a size of just a few MB. Realistically, this is not always possible.
Model deployment — saving the trained model
During model experimentation, a trained model is usually assigned to an object which is then used for generating results using the test set within the same script. In production, it’s just not practical to train the model all over again and then use that object every time you want to run inference. Some models are huge and take hours or days to train. Instead, the trained model needs to be saved as a file. While running inference, you need to load the model file, assign it to an object and then run a prediction function to which you pass input data as arguments.
Pickle is a popular and easy-to-use Python module for saving ML model files. It converts the Python object containing the ML model into a character stream. The ‘pickled’ model object is stored as a file that can be used to reconstruct the model in a different Python script. Yasoob Khalid has a great article explaining pickle. Joblib is similar to Pickle and works better for large numpy array objects. Both Pickle and Joblib are widely used with Python’s Scikit-learn machine learning library.
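The save-and-reload round trip can be sketched in a few lines. A toy stand-in class is used here in place of a real trained model; a fitted Scikit-learn estimator would be pickled the same way:

```python
import pickle

# Stand-in for a trained model; in practice this would be a fitted
# Scikit-learn estimator (any picklable Python object works).
class ThresholdModel:
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, x):
        return int(x > self.threshold)

model = ThresholdModel(threshold=0.5)

# Training script: serialize the trained model to a file.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Inference script: reconstruct the model and run a prediction.
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict(0.8))  # → 1
```

Joblib's interface is nearly identical: `joblib.dump(model, "model.joblib")` and `joblib.load("model.joblib")`. Note that the class definition must be importable in whichever script loads the file.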
For models built using Tensorflow, an open source machine learning platform, a .pb file format is used to store the model and (if needed) neural network weights. Pytorch, another similar platform, uses Python’s pickle module to save models as .pt or .pth files. To convert a Pytorch .pth model file into a .pb file for Tensorflow or vice versa, engineers use ONNX — Open Neural Network Exchange format.
Model deployment — inference pipeline
While experimenting with several models, your team would have performed pre-processing, testing and result documentation manually and as independent tasks. Now it’s time to get them all to work together.
The process of a machine learning model generating results based on real-world input data is known as inference. An inference pipeline is a linear sequence of functions or containers that each perform a specific task. When fed live data via a suitable ingestion pipeline, the inference pipeline takes care of data pre-processing, generating results and any result post-processing that may be required. In a pipeline, the first function or container, such as one that handles pre-processing, passes its output as input to the second, which runs the ML model, and so on. The pipeline outputs the model inference results, which can be sent to the application to show to the user.
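A minimal sketch of such a pipeline, with each stage as a plain Python function (the stage names and toy logic are illustrative; in production each stage might run in its own container):

```python
# Each stage takes the previous stage's output as its input.

def preprocess(raw):
    # e.g. clean and normalize the raw input
    return raw.strip().lower()

def run_model(features):
    # stand-in for loading a saved model and calling its predict method
    return {"label": "positive" if "great" in features else "negative"}

def postprocess(result):
    # shape the model output for the application
    return f"Sentiment: {result['label']}"

def inference_pipeline(raw, stages):
    out = raw
    for stage in stages:
        out = stage(out)
    return out

print(inference_pipeline("  This product is GREAT!  ",
                         [preprocess, run_model, postprocess]))
# → Sentiment: positive
```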
Docker, a software containerization tool, is incredibly useful for ML deployment. Docker packages all the application code, libraries, model files, and OS dependencies into a Docker container image. This image can be hosted in a container image registry such as Docker Hub or AWS ECR. From there the image can be downloaded into a different environment, such as a cloud server or another engineer’s machine, and used to run a Docker container that executes the code exactly as it did during development on the original engineer’s system. Docker is a convenient and frequently used tool for deploying the inference pipeline to production.
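A hypothetical Dockerfile for packaging an inference service might look like the following (the file names are illustrative):

```dockerfile
# Hypothetical Dockerfile for an inference service
FROM python:3.11-slim
WORKDIR /app

# Install Python dependencies first so they are cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the saved model file and the inference pipeline code
COPY model.pkl inference.py ./

CMD ["python", "inference.py"]
```

Building with `docker build -t inference-service .` and pushing the image to a registry makes the identical environment reproducible on any host.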
The same data pre-processing steps that you performed before model training need to be done on the live data before inference. Otherwise, your model will try to predict results based on data that has a different structure or format from the training set.
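One way to keep pre-processing consistent is to fit the transformation parameters on the training data once, persist them, and reload the same parameters at inference time. A hand-rolled standard scaler is used below for illustration; Scikit-learn users would simply pickle the fitted scaler object itself:

```python
import json

train_values = [10.0, 12.0, 14.0, 16.0]

# Training time: compute normalization parameters from the training set
# and persist them alongside the model.
mean = sum(train_values) / len(train_values)
std = (sum((v - mean) ** 2 for v in train_values) / len(train_values)) ** 0.5
with open("scaler.json", "w") as f:
    json.dump({"mean": mean, "std": std}, f)

# Inference time: reload the SAME parameters — never refit on live data.
with open("scaler.json") as f:
    params = json.load(f)

def scale(x, p):
    return (x - p["mean"]) / p["std"]

print(scale(13.0, params))  # → 0.0
```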
Here is a great resource by Shreya Ghelani describing an end-to-end ML workflow using AWS SageMaker, a service that manages everything from data preparation to training to model deployment.
Model deployment — application integration
Once the inference pipeline outputs model results, they need to be presented to the user in a meaningful way. There are several ways to show results depending on your application, such as highlighting a region of interest in an image, predicting future revenue in a table or displaying a recommended restaurant as a card. It’s your job as a PM to understand the best way for the user to consume results.
For sending inference results to the application, REST APIs and WebSockets are common and convenient methods. WebSockets are preferred for real-time communication such as a chatbot, while a REST API is ideal for large data volumes sent infrequently. You may also want to use an API to write results to a database and display them when required.
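As a rough sketch of the REST approach using only Python’s standard library (the `predict` function is a hypothetical stand-in, and production services typically use a framework such as Flask or FastAPI behind a proper server):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in; real code would load a saved model file
# and call its prediction method.
def predict(features):
    return {"recommendation": "hand sanitizer", "score": 0.87}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, run inference, return JSON results
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(predict(features)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet

# port 0 = let the OS pick a free port
server = HTTPServer(("127.0.0.1", 0), InferenceHandler)
```

Calling `server.serve_forever()` starts the service; the application then POSTs, for example, cart contents as JSON and renders the returned recommendation.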
Containers and their use in microservice-based architectures are incredibly useful concepts to know, not just for ML deployment but for any software application. ML models and applications can run as separate microservices in their own containers that communicate with each other. This type of container orchestration can be done using Kubernetes.
Maintaining models in production — Data drift
A simple e-commerce app function such as adding a product to the cart and making a purchase will work reliably and consistently unless changes are made to the code. A product recommendation model deployed within the app will not. This is because of a phenomenon known as data drift, in which a model gives increasingly erroneous results over time. In production, the input data on which the model generates predictions starts to vary, or drift away, from the data the model was trained on.
Consider a shopping app that uses a recommendation model to show you additional products based on what you have added to the cart. In 2020 the app lists Covid-19 face masks which sell a lot. Ideally, you would want the recommendation model to suggest something like hand sanitizers as relevant products for those buying masks. But the recommendation model has never been trained on this new data and so won’t know that in 2020, face masks and hand sanitizers are closely related products. To fix this issue we have to periodically retrain and re-deploy our models.
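Drift can be made concrete with even a crude check that compares a feature’s distribution in recent live data against the training set. A relative mean shift is used here purely for illustration; real monitoring systems use statistical tests such as Kolmogorov–Smirnov:

```python
# Crude drift signal: how far has the live mean of a feature moved
# relative to the mean seen at training time?

def drift_score(train_mean, live_values):
    live_mean = sum(live_values) / len(live_values)
    return abs(live_mean - train_mean) / abs(train_mean)

# e.g. average daily sales of "face mask" during training vs now
train_mean = 20.0
live = [180.0, 210.0, 195.0]  # sales spiked after the training period

score = drift_score(train_mean, live)
print(score > 0.5)  # large shift → flag for retraining; prints True
```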
Maintaining models in production — retraining
Continue collecting and storing data with the ingestion pipeline even when your model is in production. This data will be used to train a new version of the model to tackle data drift. Once the model is re-trained, it needs to be deployed to production once again. If you are using Docker for model deployment, you can use a container image registry like Docker Hub to maintain versions of the model and deploy the container with the latest one.
The frequency of training a new model depends on the performance of your model in production over time. Closely monitor whatever KPIs you had set to measure model performance. Retrain the model once you see the metrics drop to a level where your users are not getting a useful model output.
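A sketch of such a retraining trigger, assuming click-through rate on model outputs is the monitored KPI (the threshold and window are illustrative and should come from the benchmarks agreed with your stakeholders):

```python
KPI_FLOOR = 0.10  # retrain when CTR drops below 10%

def should_retrain(recent_ctr_readings, floor=KPI_FLOOR, window=7):
    # Trigger only on a sustained drop, not a one-day blip:
    # require a full window of consecutive below-floor readings.
    window_vals = recent_ctr_readings[-window:]
    return len(window_vals) == window and all(v < floor for v in window_vals)

healthy = [0.14, 0.13, 0.15, 0.12, 0.13, 0.14, 0.13]
degraded = [0.09, 0.08, 0.09, 0.07, 0.08, 0.09, 0.08]

print(should_retrain(healthy))   # → False
print(should_retrain(degraded))  # → True
```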
Cost estimation — projecting and controlling cloud costs
Machine learning is expensive, although costs are dropping every year with better cloud infrastructure and technology. Before you start any work, be sure to project monthly cloud costs for development and for keeping the system running. There is no point in implementing machine learning if the cost is going to exceed the potential savings or value addition.
Unchecked ML project costs have a bad habit of spiralling out of control and jeopardizing the entire project. GPU instances used for training deep learning models for computer vision and NLP are especially expensive. Here are some best practices that can help you save compute costs (AWS is used as an example but similar options are available on other cloud platforms as well):
- Set up alerts for when cloud costs cross 75% of the monthly budget.
- Create rituals that remind your engineers to turn off instances once they’re done with them for the day. Daily rituals turn into habits.
- If your team has fixed working hours, set instances to auto start and stop during that time.
- If you are running inference on a GPU that needs to be up 24×7, go for AWS Reserved Instances with a commitment of 1 to 3 years. This can bring down your costs by as much as 75% compared to on-demand instances.
- If you’re training massive models and don’t have strict deadlines, try AWS Spot instances which use currently available spare EC2 capacity and bring down costs by as much as 90% as compared to on-demand instances.
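A back-of-the-envelope comparison of the three pricing models (the hourly rate is a hypothetical placeholder; look up real prices for your instance type and region):

```python
# Rough monthly cost comparison for a GPU instance running 24×7.
ON_DEMAND_HOURLY = 3.00  # hypothetical $/hour — check real pricing
HOURS_PER_MONTH = 730

on_demand = ON_DEMAND_HOURLY * HOURS_PER_MONTH  # baseline
reserved = on_demand * (1 - 0.75)               # ~75% savings (reserved)
spot = on_demand * (1 - 0.90)                   # ~90% savings (spot)

print(f"on-demand ${on_demand:.0f}/mo, reserved ${reserved:.0f}/mo, "
      f"spot ${spot:.0f}/mo")
```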
Machine learning often involves a massive amount of data, which means high cloud storage and database costs. Delete data that you no longer need. If you can’t delete the data but don’t access it frequently, move it to a cold storage tier such as Amazon S3 Glacier.
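On AWS, the move to Glacier can be automated with an S3 lifecycle configuration; a sketch with an illustrative prefix and day counts:

```json
{
  "Rules": [
    {
      "ID": "archive-old-training-data",
      "Filter": { "Prefix": "raw-data/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 730 }
    }
  ]
}
```

Applied to a bucket, this rule moves objects under `raw-data/` to Glacier after 90 days and deletes them after two years, with no manual housekeeping.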
Pro tip for computer vision projects: storing high resolution camera data as video consumes up to 4x less space than saving it as individual frames. Always store videos to save cost, splitting them into frames only when needed for annotation and training. Delete the frames once their job is done; you can always extract them from the videos again.