Machine Learning Operations (MLOps) is a field created by applying DevOps principles to machine learning. A question that might arise is, why do we need to apply DevOps principles to machine learning? To answer that, we need to understand the machine learning workflow. To build a machine learning model that solves a real-world problem, we need to extract data, preprocess the data, transform the data, perform data analysis, then handle model selection, metric selection, model training, model testing, hyperparameter tuning, model validation, and finally model deployment. Handling these tasks manually can prove daunting and can create issues when changes are made frequently. Implementing MLOps means applying automation and monitoring to all steps in the construction of a machine learning model/system, including integration, testing, releasing, deployment, and infrastructure management. More information can be found here.
“Kubeflow is an open-source platform, built on Kubernetes, that aims to simplify the development and deployment of machine learning systems. Described in the official documentation as the ML toolkit for Kubernetes, Kubeflow consists of several components that span the various steps of the machine learning development lifecycle. These components include notebook development environments, hyper parameter tuning, feature management, model serving, and, of course, machine learning pipelines.”
Pipelines are used to automate and orchestrate the steps in the workflow for creating a machine learning model. Older approaches put the entire workflow for a model in a single script, so each model to be tested needed its own script. Pipelines are made up of components: containers that each house one step of the workflow, making that step usable in other pipelines if need be. These components act independently of one another. For example, everything needed for data preprocessing, i.e. the libraries and the code script, lives only in the preprocessing component; trying to access the libraries and script of the preprocessing component from another component is not possible. Because each component is self-contained, it is a lot easier to reuse a component from one pipeline in another pipeline that needs the same step, so there is no need for multiple similar scripts for different models.
There are two ways to create a pipeline:
Using light weight components.
Using reusable components.
For this article, I will focus on reusable components. Speaking briefly, lightweight components are used for fast and easy deployment; one of their setbacks is that they aren't reusable.
Building a reusable pipeline with Kubeflow:
To access Kubeflow, you need to deploy it to a cluster. I use the Google Cloud Platform for deployment, but other cloud services such as AWS and IBM can be used, and you can even deploy locally on your personal computer. I won't be covering how to deploy Kubeflow on clusters, but I will provide a link at the end of this article.
Before deploying to Kubeflow, you need to make sure all your components are ready. Let's go through the process.
Data Preprocessing Component:
For this example I will create a three-component pipeline consisting of a preprocessing component, a model training component, and a model testing component. I will use a project I participated in on the subject of road safety; details and code can be found in my Git repo, linked at the end of this article.
The preprocessing script
The task is to create a model that can predict accident severity based on certain features. In this script we are basically just cleaning up the data for model training. We will also create a Dockerfile, so we can build and push an image to Dockerhub.
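As a rough sketch of what such a preprocessing script does (the real script and the actual column names are in the repo linked at the end of this article; the toy data below is purely illustrative):

```python
import pickle

import pandas as pd

# Toy stand-in for the road-safety data; the real script reads the
# project's CSV with its actual feature columns.
raw = pd.DataFrame({
    "weather": ["rain", "clear", "rain", None],
    "speed_limit": [30, 50, 70, 30],
    "severity": [2, 1, 3, 2],
})

# Basic clean-up: drop rows with missing values, label-encode text columns.
clean = raw.dropna().copy()
for col in clean.select_dtypes("object"):
    clean[col] = clean[col].astype("category").cat.codes

# Serialize the result so the next component in the pipeline can consume it.
with open("clean_data.pickle", "wb") as f:
    pickle.dump(clean, f)
```

The important part for the pipeline is the last step: the component writes its result to a file, which is how downstream components will receive it.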
Don’t worry if you do not understand what is going on; a link to an article on how to build and push to Dockerhub is provided at the end of this article, and it goes into great detail about what is happening above.
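A minimal Dockerfile for the preprocessing component might look like this (the base image, file names, and tag are illustrative, not the repo's exact ones):

```dockerfile
FROM python:3.8-slim
WORKDIR /app
# Install the libraries the script needs, then copy the script itself.
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY preprocess.py .
# The container runs the preprocessing step when it starts.
ENTRYPOINT ["python", "preprocess.py"]
```

From the component's directory, the image is then built with `docker build -t <your-dockerhub-username>/preprocess:v1 .` and pushed with `docker push <your-dockerhub-username>/preprocess:v1`.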
Your local directory should look like this:
Now, all you have to do is change to the data_preprocess directory in your command prompt, then build an image and push it to Dockerhub (there are various ways to push to Dockerhub; I do mine locally). A link on how to build and push to Dockerhub is provided at the end of this article.
The same process is followed for model training and testing. The only difference is that, unlike the data preprocessing component, the model training and testing components both receive inputs. To receive these inputs in the script, we do the following.
Model Training Component:
Here, we use argparse to handle the inputs. The same procedure is used to build and push to Dockerhub, with the component's own Dockerfile.
It should be noted that the outputs from the functions are serialized objects.
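A sketch of the pattern the training script follows: argparse receives the path to the upstream component's output, and the trained model is pickled for the next component. The argument name, the inline data, and the placeholder "model" here are illustrative; the real training code is in the repo.

```python
import argparse
import pickle

import pandas as pd

# Stand-in for the upstream component's output; in the real pipeline this
# file arrives from the preprocessing step.
pd.DataFrame({"speed_limit": [30, 50, 70],
              "severity": [1, 2, 3]}).to_pickle("clean_data.pickle")

# argparse turns the container's command-line arguments into named inputs.
parser = argparse.ArgumentParser(description="Model training component")
parser.add_argument("--data", type=str,
                    help="path to the pickled, cleaned DataFrame")
# In the container, the arguments come from the pipeline via sys.argv;
# the explicit list here is only so the sketch runs on its own.
args = parser.parse_args(["--data", "clean_data.pickle"])

df = pd.read_pickle(args.data)

# Placeholder "model": the real project fits a classifier on the features.
model = {"majority_class": int(df["severity"].mode()[0])}

# The output is a serialized object, written to a file so the testing
# component can pick it up.
with open("model.pickle", "wb") as f:
    pickle.dump(model, f)
```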
To confirm your builds and pushes were successful, check that the images appear on your Dockerhub account:
Creating The Pipeline:
Once Kubeflow is deployed, click on Notebook Servers, create a notebook, and connect to it. When connected, the interface is the same as the Jupyter notebook interface. Create a new .ipynb file and run the following:
!python -m pip install --user --upgrade pip
!pip install kfp
Once the installation is done, comment out the lines of code above, then restart the kernel and clear the outputs.
Note: kfp is the SDK used to create the components and the pipeline.
After all that has been done, the next thing is to import the necessary libraries and create the components.
import kfp
from kfp import dsl
To create the components that access the Docker images from Dockerhub, we do this.
The image argument is the name of your Docker image on Dockerhub; that is how the dsl.ContainerOp function accesses your code. The arguments list holds the input variables passed to your code script, and the file_outputs dictionary maps names to the serialized objects or .txt files your script writes.
Now that all the container ops have been created, we connect them to form a pipeline using the following:
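A sketch of the wiring, assuming the kfp v1 SDK and component factory functions named preprocess_op, train_op, and test_op (placeholder names for whatever you called yours):

```python
from kfp import dsl

@dsl.pipeline(
    name="Road safety pipeline",
    description="Preprocess the data, train a model, then test it.",
)
def road_safety_pipeline():
    preprocess = preprocess_op()
    # Feeding one op's file_outputs into the next op's arguments is what
    # links the components and fixes the execution order.
    train = train_op(preprocess.outputs["clean_data"])
    test_op(train.outputs["model"])
```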
All is set; now we run the experiment using the following code.
client = kfp.Client()
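Continuing from the client created above (inside the cluster, kfp.Client() needs no endpoint argument), the run can be submitted in one call; "road_safety_pipeline" is a placeholder for whatever you named your pipeline function:

```python
# Compiles the pipeline function and submits it as a run in one call.
run = client.create_run_from_pipeline_func(
    road_safety_pipeline,
    arguments={},                     # pipeline parameters, if any
    experiment_name="road-safety",    # any experiment name you like
)
```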
Once done correctly, it displays two links. Click Run details to show the run process of the pipeline; if successful (assuming there are no bugs or libraries you forgot to import), the finished run will look like the pipeline graph displayed at the beginning of this article.
With this, you should be able to create your own reusable pipeline. Here are some helpful links.
Although the pipeline created in this video was built with lightweight components, the tutorial shows how to deploy Kubeflow on clusters using the Google Cloud Platform.