
You can assess your MLOps maturity by considering the level of automation and reproducibility that you have in your AI projects. One approach proposed in this article is to define three levels:
- MLOps level 0 is a manual process for AI initiatives.
- MLOps level 1 brings ML pipeline automation.
- MLOps level 2 supports a full CI/CD pipeline for all ML activities.
Now how can you leverage this maturity model in your organization? First, you need to understand your current MLOps maturity level, which you can do through a maturity assessment exercise. The assessment will help you identify your strengths and weaknesses in ML, not only from a process point of view but also in terms of the people and tools in place.
Doing a maturity assessment means nothing if you don’t plan to act on it. The goal of an assessment is not to obtain a score but to identify where you are and where you want to go. So the next step is to identify the target MLOps level you wish to reach. Once you have a target, you can do a gap analysis to determine what is missing between your current state and your objective. This information will help you define the steps, the roadmap, to reach your target.
The gap analysis is critical to understand how to change your organization from a people, process, and technology perspective.
Now let’s assume that your MLOps process is defined and that you are finally ready to leverage tools to support your AI practices.
You will be looking for a technical solution that supports the entire machine learning lifecycle, from discovery and feasibility to model experimentation and operationalization. Ideally, the tool automates some or all aspects of AI experimentation, model training and evaluation, model deployment, and model monitoring.
There are multiple options on GCP to support this need. As is often the case, there is no one-size-fits-all solution. You will have to decide based on your objectives and the teams in place, and it is absolutely fine to select multiple tools to address different needs.
The first option to consider is Google Cloud AutoML. Why AutoML? Because it will take care of many of the activities an AI team would typically have to conduct.
With AutoML, you provide data, and it trains a model relevant to the problem you are trying to solve. You can then serve the trained model for online or batch predictions. AutoML offers a shortcut that accelerates several typical ML tasks such as data exploration, data preparation, feature engineering, model selection, model training, model evaluation, and hyperparameter tuning. In doing so, it streamlines the AI process and automates tasks that are usually part of MLOps.
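To give you an idea of how little code this requires, here is a minimal sketch of requesting an online prediction from a trained AutoML model with the `google-cloud-automl` Python client. The project ID, model ID, and payload are placeholders; the payload structure depends on the AutoML product you use (Tables, Vision, Natural Language, and so on), and method signatures vary slightly across client library versions.

```python
from google.cloud import automl

# Placeholder identifiers: replace with your own project and AutoML model.
project_id = "my-gcp-project"
model_id = "TCN1234567890"

prediction_client = automl.PredictionServiceClient()

# Fully qualified resource name of the trained model.
model_name = automl.AutoMlClient.model_path(project_id, "us-central1", model_id)

# Example payload for a text classification model; Vision or Tables
# models expect a different payload shape.
payload = {"text_snippet": {"content": "some input text", "mime_type": "text/plain"}}

response = prediction_client.predict(name=model_name, payload=payload)
for result in response.payload:
    print(result.display_name, result.classification.score)
```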
Another option for MLOps on Google Cloud is to create a pipeline using GCP services. As mentioned before, MLOps is all about the automation of AI tasks to support an end-to-end lifecycle. There are fully managed GCP services that you can use to automate data extraction, data preparation, and model training. Let’s take a simple example with four essential GCP services: BigQuery, Cloud Storage, AI Platform, and Cloud Composer.
You can use BigQuery for exploratory data analysis and data preparation (structured or semi-structured data). You can store unstructured data (images, video, audio) in Cloud Storage. You can use AI Platform to train a model on a regression or classification problem. And you can use Cloud Composer to orchestrate the flow from beginning (data acquisition) to end (serving and monitoring).
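To make this concrete, here is a hedged sketch of a Cloud Composer (Airflow) DAG that prepares a training set in BigQuery and then submits a training job to AI Platform. The project, dataset, bucket, and trainer package names are illustrative, and the operator import paths shown here match Airflow 1.10-era Composer environments; newer Airflow versions moved these operators to provider packages.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.mlengine_operator import MLEngineTrainingOperator

# Illustrative names: replace with your own project, dataset, and bucket.
PROJECT_ID = "my-gcp-project"

default_args = {"start_date": datetime(2021, 1, 1)}

with DAG("ml_training_pipeline",
         default_args=default_args,
         schedule_interval="@weekly",
         catchup=False) as dag:

    # Step 1: materialize the training set in BigQuery.
    prepare_data = BigQueryOperator(
        task_id="prepare_training_data",
        sql=f"SELECT * FROM `{PROJECT_ID}.raw.events` WHERE label IS NOT NULL",
        destination_dataset_table=f"{PROJECT_ID}.ml.training_data",
        write_disposition="WRITE_TRUNCATE",
        use_legacy_sql=False,
    )

    # Step 2: submit a training job to AI Platform with a packaged trainer.
    train_model = MLEngineTrainingOperator(
        task_id="train_model",
        project_id=PROJECT_ID,
        job_id="training_{{ ds_nodash }}",
        package_uris=["gs://my-bucket/trainer/trainer-0.1.tar.gz"],
        training_python_module="trainer.task",
        training_args=["--table", f"{PROJECT_ID}.ml.training_data"],
        region="us-central1",
    )

    # Composer runs the steps in dependency order.
    prepare_data >> train_model
```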
I just covered some simple GCP service options to support basic MLOps processes. For more advanced needs, you may want to consider MLOps frameworks. The most popular ones are currently Kubeflow Pipelines and TensorFlow Extended (TFX). Both are open-source frameworks and are fully supported on GCP.
Kubeflow is a Kubernetes-native machine learning toolkit. With Kubeflow Pipelines, you can build and deploy portable, scalable, end-to-end ML workflows based on containers. Kubeflow started as an internal Google project; it has since been open-sourced and can now run anywhere Kubernetes is supported (in the cloud or on-premises).
With Kubeflow, the first step is to create a pipeline definition using the Kubeflow Pipelines SDK. Once the pipeline and its different components have been defined, you create a Kubeflow experiment to run the pipeline. The various components are then executed as containers, in a workflow, on a Kubernetes cluster.
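Here is a minimal sketch of that flow using the v1-style Kubeflow Pipelines SDK (`kfp`). The container images are placeholders; in practice, each component would point at your own preprocessing and training images.

```python
import kfp
from kfp import dsl


def preprocess_op():
    # Each step of the pipeline runs as a container on the cluster.
    return dsl.ContainerOp(
        name="preprocess",
        image="gcr.io/my-project/preprocess:latest",  # placeholder image
        arguments=["--output", "gs://my-bucket/prepared-data"],
    )


def train_op():
    return dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/train:latest",  # placeholder image
        arguments=["--data", "gs://my-bucket/prepared-data"],
    )


@dsl.pipeline(name="my-ml-pipeline", description="A minimal two-step ML workflow.")
def my_pipeline():
    preprocess = preprocess_op()
    train = train_op()
    train.after(preprocess)  # enforce execution order between the steps


if __name__ == "__main__":
    # Assumes a reachable Kubeflow Pipelines endpoint; this compiles the
    # pipeline and creates a run inside an experiment.
    client = kfp.Client()
    client.create_run_from_pipeline_func(my_pipeline, arguments={})
```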
Another popular MLOps framework is TensorFlow Extended (TFX). TFX started in 2017 as an internal initiative at Google and quickly became the framework of choice for end-to-end ML at scale, not only within Google but also across other Alphabet entities. In 2019, TFX became publicly available as open-source software.
TFX is a configuration framework that provides components to define, launch, and monitor a TensorFlow machine learning system. The TFX library offers multiple components, and you assemble them based on your MLOps needs to create a directed acyclic graph (the DAG is your TFX pipeline). A pipeline will typically combine data ingestion and validation, data transformation, ML modeling, model analysis, and model serving steps.
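As a hedged sketch, here is what a small TFX pipeline definition can look like. The GCS paths and the trainer module are placeholders, and component constructors have evolved across TFX releases, so treat this as illustrative rather than definitive.

```python
from tfx.components import CsvExampleGen, SchemaGen, StatisticsGen, Trainer
from tfx.orchestration import pipeline
from tfx.proto import trainer_pb2

# Placeholder paths: replace with your own data and pipeline locations.
DATA_ROOT = "gs://my-bucket/data"
PIPELINE_ROOT = "gs://my-bucket/tfx-pipeline"

# Ingest CSV files and convert them to TF examples.
example_gen = CsvExampleGen(input_base=DATA_ROOT)

# Compute statistics and infer a schema for data validation.
statistics_gen = StatisticsGen(examples=example_gen.outputs["examples"])
schema_gen = SchemaGen(statistics=statistics_gen.outputs["statistics"])

# Train a TensorFlow model defined in a user-provided module file.
trainer = Trainer(
    module_file="gs://my-bucket/trainer_module.py",  # placeholder module
    examples=example_gen.outputs["examples"],
    schema=schema_gen.outputs["schema"],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),
    eval_args=trainer_pb2.EvalArgs(num_steps=100),
)

# The component list defines the DAG; TFX infers the edges from the
# artifact dependencies between components.
my_pipeline = pipeline.Pipeline(
    pipeline_name="my-tfx-pipeline",
    pipeline_root=PIPELINE_ROOT,
    components=[example_gen, statistics_gen, schema_gen, trainer],
)
```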
While TFX provides everything needed to define a pipeline, you must rely on an external orchestrator to run it. At the time of writing, the TFX-supported orchestrators are Apache Airflow, Apache Beam, and Kubeflow. This step is pretty straightforward, as it is mostly a matter of configuration: in the pipeline definition, you specify which orchestrator you want to use.
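For example, handing the pipeline sketched above to Apache Beam is essentially a choice of runner (runner module paths differ slightly across TFX versions, so check your release):

```python
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

# Hand the pipeline definition to the orchestrator of your choice;
# an Airflow or Kubeflow runner follows the same pattern.
BeamDagRunner().run(my_pipeline)
```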
Kubeflow Pipelines and TFX are both supported on Google Cloud through AI Platform Pipelines, so you can choose what is best for your MLOps implementation.
Using AI Platform Pipelines, you have full visibility into the different steps of your workflow. You can also integrate with other GCP services, such as Cloud Storage or BigQuery as a data source, Dataflow for transformations, or AI Platform to train and serve a model.
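For instance, once an AI Platform Pipelines instance is deployed, you can connect to it with the same `kfp` client by pointing it at the instance's host URL. The URL below is a placeholder; copy the actual one from your instance in the Cloud Console.

```python
import kfp

# Placeholder endpoint: use the URL of your AI Platform Pipelines instance.
client = kfp.Client(host="https://xxxx-dot-us-central1.pipelines.googleusercontent.com")

# Submit the pipeline defined earlier to the hosted instance.
client.create_run_from_pipeline_func(my_pipeline, arguments={})
```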