by Tom Corcoran, a Solution Architect at Red Hat
Many organisations are experiencing challenges in creating a streamlined and effective workflow for their Artificial Intelligence and Machine Learning (AI/ML) workloads. Gartner asserted as recently as 2019 that 80% of AI projects are run by practitioners whose talents don’t scale within organisations, ultimately leading to a failure to realise business value.
Why do we see this widespread shortfall in the realisation of business value? I would divide the reasons into three broad categories.
1. Many parties involved
- Data Engineers take raw data from many disparate sources and prepare or cleanse it into a state that data scientists’ models can consume.
- Domain Experts may also be required in the data preparation stages in order to extract and engineer important features and discard others.
- Data Scientists train and retrain models.
- Application Developers add intelligence to their apps by making real-time calls to the model for inference (a minimal sketch of such a call follows this list).
- IT Operations are responsible for the smooth functioning of the system, including an effective handoff between parties.
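To make the Application Developer's part in this concrete, here is a minimal sketch of a real-time inference call in Python. The endpoint URL, request payload and response shape are hypothetical; the specifics depend entirely on the model server in use.

```python
import requests

# Hypothetical in-cluster model-serving endpoint; real URLs and
# payload schemas vary by model server.
MODEL_URL = "http://model-server.ml-apps.svc:8080/v1/predict"

def credit_risk_score(applicant: dict) -> float:
    """Call the deployed model and return its prediction."""
    response = requests.post(
        MODEL_URL, json={"instances": [applicant]}, timeout=2.0
    )
    response.raise_for_status()
    # Assumed response shape: {"predictions": [0.87]}
    return response.json()["predictions"][0]

print(credit_risk_score({"income": 52000, "existing_debt": 9400}))
```

The point is less the call itself than the handoff it implies: the developer needs a stable, versioned endpoint from the data science team.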
Without an efficient mechanism for handing off work between these groups, they can become silos. The resulting difficulties and delays mean lost opportunities and models that are out of date by the time they reach production.
2. Large number of tasks, responsibilities and disciplines that need to be executed effectively
Related to the number of parties or personas is the number of tasks and responsibilities that need to be executed. In their paper Hidden Technical Debt in Machine Learning Systems, Google outlines the many responsibilities that must be taken care of for a successful and sustainable AI/ML workflow system. With the proliferation of tooling currently available, it's relatively easy to create functioning AI/ML models. But as Google highlights, the ML models represent only a small portion of what's required for an end-to-end, functioning system.
3. The highly iterative nature of AI/ML workflows
This image depicts, at a high level, the end-to-end AI/ML workflow, which appears to be quite a linear flow.
But the nature of AI/ML is highly iterative. The data that was used to train the model can go out of date quickly; just look at the effect of the worldwide recession in 2020 on previously sound credit decisioning models. Tasks at any stage of the workflow may need to be sent back to any of the previous stages, so the workflow looks more like this:
Because the workflow is so iterative, it exacerbates any weaknesses in:
- your workflow handoff capabilities, and
- your breadth of coverage across the stages.
When one puts these challenges into the context of hybrid cloud architectures, where data, AI/ML models and consuming applications span the data centre, public cloud and, increasingly, the edge, the complexity and the potential for business challenges increase further.
So how are the organisations that are realising the potential of AI/ML navigating all of this differently from those who are falling short? We'll address that next…
McKinsey, in The State of AI in 2020, places effective data strategies amongst the key differentiators for those who are profiting most from AI/ML, and executing such a strategy calls for an end-to-end platform. I believe Kubernetes is the best manifestation of such a platform. In fact, a common differentiator we're seeing between those who succeed and those who don't is the use and maximisation of the value of Kubernetes.
But first, let me provide you with an interesting parallel to AI/ML workflow practices: DevOps in the realm of software engineering. DevOps-inspired technical and organisational practices are producing tremendous advances in organisations' ability to deliver software faster and more safely. Those practices include
- Continuous Integration and Continuous Delivery (CI/CD), GitOps and widespread automation
- Shifting left on security
- Adoption of a generative organisational culture, where risk taking and experimentation are encouraged, as espoused by Westrum (see the Accelerate reference below)
Containers and Kubernetes as a platform provide the ideal technical basis for delivering DevOps, especially an enterprise-grade Kubernetes distribution such as Red Hat OpenShift. But when one considers AI/ML's
- multiple personas and groups between which work handoffs occur
- the higher number of tasks and responsibilities that need to be executed in AI/ML
- the even greater level of iteration in AI/ML compared with software engineering
the payoffs from an effective Kubernetes-based DevOps approach for AI/ML (a.k.a. MLOps) are even greater.
Let's take a look at some of the specific ways a great Kubernetes-based platform can help in this area. First, what does Kubernetes bring to any workload?
- Containers. Kubernetes is an orchestration engine for containers. Containers, which encapsulate all required runtime artifacts, binaries and so on, are usually built from declarative configuration such as Dockerfile syntax, or via agility-enabling techniques such as Red Hat's source-to-image. This makes them easy to construct and recreate, and well suited to frequently iterated-upon workloads such as AI/ML models.
- Reliability. Kubernetes enables reliability: autoscaling responds to changes in traffic levels, while workload redundancy and placement across machines (or nodes) provide high availability and resiliency. A minimal autoscaling sketch follows this item.
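As an illustration of that autoscaling, here is a minimal sketch using the official Kubernetes Python client to attach a HorizontalPodAutoscaler to a hypothetical model-serving Deployment; the names, namespace and thresholds are assumptions for the example.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

# Scale a hypothetical "model-server" Deployment between 2 and 10
# replicas, targeting 70% average CPU utilisation.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="model-server-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="model-server"
        ),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=70,
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="ml-apps", body=hpa
)
```

With min_replicas set to 2, the serving layer keeps a redundant replica for resiliency even at low traffic.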
- Hardware Utilisation Efficiency. One of the biggest problems besetting AI/ML work is lengthy wait times for IT to provision hardware and software. (See the next item, Self Service, for a discussion of software provisioning.) On the hardware side, Kubernetes brings efficiency to provisioning. With Kubernetes clusters, there is effectively a central pool of nodes, along with their associated resources such as GPUs, CPUs and memory. This centralised pool results in much greater utilisation efficiency: resources are claimed by the Kubernetes scheduler when needed, then returned to the pool when done. This is particularly important for expensive and scarce resources such as GPUs and can result in significant cost savings. A brief sketch of a pod claiming a GPU follows this item.
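The sketch below, again using the Kubernetes Python client, shows a training pod claiming a single GPU from the shared pool. The image, namespace and the nvidia.com/gpu resource name (exposed by the NVIDIA device plugin) are assumptions for illustration.

```python
from kubernetes import client, config

config.load_kube_config()

# A one-off training pod that claims a single GPU from the cluster pool.
# When the pod completes, the scheduler returns the GPU to the pool.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="fraud-model-training"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="quay.io/example/fraud-trainer:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "4", "memory": "16Gi"},
                    limits={"nvidia.com/gpu": "1"},  # requires the NVIDIA device plugin
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="ml-team", body=pod)
```

Once the pod completes, the scheduler frees the GPU for the next workload; nothing sits idle on a dedicated workstation.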
- Self Service. Kubernetes platforms, in particular enterprise-grade distributions such as OpenShift, expose self-service catalogues of Kubernetes operators and tools for data engineers, data scientists and software developers. This facility eliminates tickets, extended delays and limited access to scarce resources, e.g. Spark clusters. Furthermore, OpenShift provides a reference architecture of proven AI/ML tools called the Open Data Hub, though Kubernetes and OpenShift are not restrictive or opinionated in the set of containerised tools that can be used. These capabilities are bringing revolutionary improvements in agility, efficiency and, indeed, job satisfaction amongst these expensive professionals.
- Fast Feedback and Rework Loops. Kubernetes supports the rapid iteration loops that are so critical for AI/ML workloads, in three principal ways:
- Code and image artifacts are usually placed in source and image repositories where they are easily retrievable, enabling quick access for rework.
- Kubernetes-based CI/CD systems (discussed above) enable rapid, safe and secure redeployment of reworked models, speeding up the required iterations.
- Kubernetes can host tools that automate retraining and redeployment, such as Apache Airflow, freeing up valuable data science time for other activities such as training new models. A minimal sketch follows this list.
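As a minimal sketch of that third point, here is an Airflow DAG that retrains and redeploys a model on a weekly schedule. The task bodies are placeholders, since the real steps depend on your training code and deployment target.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def retrain_model():
    # Placeholder: pull fresh data, retrain and validate the model,
    # then push the resulting image to a registry.
    ...

def redeploy_model():
    # Placeholder: roll out the new model image, e.g. by patching
    # the serving Deployment via the Kubernetes API.
    ...

with DAG(
    dag_id="weekly_model_refresh",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    retrain = PythonOperator(task_id="retrain", python_callable=retrain_model)
    redeploy = PythonOperator(task_id="redeploy", python_callable=redeploy_model)
    retrain >> redeploy  # redeploy only after a successful retrain
```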
- De Facto Standard. Kubernetes is becoming the de facto standard for cloud-native software development and deployment. Utilising the same platform for AI/ML brings efficiency to organisations' enterprise IT efforts in the following ways:
- The same Kubernetes-based skill set amongst IT Operations can be applied to both workload types, driving efficiencies in skilling and training these professionals.
- Standardisation and proximity between deployed models and the intelligent applications that consume them for inference. This unified platform and approach for software and models results in a stronger, more robust system for both.
- Hybrid and Multi-Cloud. Certain Kubernetes distributions, such as OpenShift from Red Hat, enable hybrid cloud by providing an identical experience across clouds, the data centre and the edge. A brief sketch of deploying to multiple clusters follows.
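To make that identical experience concrete, here is a brief hypothetical sketch: the same Deployment applied, unchanged, to a data-centre cluster, a public-cloud cluster and an edge cluster via their kubeconfig contexts. The context names, namespace and image are assumptions for the example.

```python
from kubernetes import client, config

# Hypothetical kubeconfig contexts: a data-centre cluster, a public
# cloud cluster and an edge cluster.
CONTEXTS = ["onprem-openshift", "aws-openshift", "edge-site-1"]

def deploy_everywhere(deployment: client.V1Deployment) -> None:
    """Apply the same Deployment manifest to every cluster."""
    for ctx in CONTEXTS:
        api_client = config.new_client_from_config(context=ctx)
        apps = client.AppsV1Api(api_client)
        apps.create_namespaced_deployment(namespace="ml-apps", body=deployment)
        print(f"deployed to {ctx}")

# A minimal model-serving Deployment, identical for every cluster.
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="model-server"),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels={"app": "model-server"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "model-server"}),
            spec=client.V1PodSpec(containers=[client.V1Container(
                name="server", image="quay.io/example/model-server:latest"
            )]),
        ),
    ),
)
deploy_everywhere(deployment)
```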
In this post, we described the iterative, multi-persona workflow that AI/ML requires, and the challenges this can create in realising AI/ML business returns. Leading industry authorities such as McKinsey advocate a platform-based approach to address these challenges. I argue that Kubernetes platforms, and in particular Red Hat OpenShift, are the most effective platform type in this regard. See the references below for other blogs in which I explore some of these capabilities in more depth and provide examples of organisations succeeding through OpenShift Container Platform.
Forsgren, Humble and Kim, Accelerate
McKinsey, The State of AI in 2020
Google, Hidden Technical Debt in Machine Learning Systems
Gartner, Top Strategic Predictions for 2019
Kubernetes: the Savior of AI/ML Business value?
Business Centric AI/ML With Kubernetes — Part 2: Data Preparation
Business Centric AI/ML With Kubernetes — Part 3: GPU Acceleration
Business Centric AI/ML With Kubernetes — Part 4: ML/OPs