The process of building and training Machine Learning models is always in the spotlight. There is a lot of talk about different Neural Network architectures or new frameworks that facilitate the idea-to-implementation transition.
While these are the heart of an ML engine, the circulatory system, which enables nutrients to move around and connects everything, is often missing. But what comprises this system? What gives the pipeline its pulse?
The most important component in an ML pipeline works silently in the background and provides the glue that binds everything together.
Despite the AI craze, most projects never make it to production. In 2015, Google published a seminal paper called “Hidden Technical Debt in Machine Learning Systems.” If you have been working in ML for more than six months, you have already seen the figure below.
In this work, the authors warn that it is dangerous to focus only on the powerful ML tools available today and take them for granted; I quote: “Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems.”
There are many components in an end-to-end ML system, and each has an important role to play. Data Collection and Validation provide the oil for the machine, Feature Extraction acts as the filtration system, the Serving Infrastructure delivers the actual service, and Monitoring gives a real-time overview of the engine.
However, today we talk about a box that is not present in the figure. We talk about the component that works silently in the background, gathering information and providing the glue that binds everything together. In my opinion, this is the most important element in an ML pipeline: the Metadata store.
What is a Metadata store, you ask, and why is it that important? It is a library for recording and retrieving metadata associated with ML workflows. What were the inputs to a pipeline step? What artifacts did the step produce? Where do they live? What is their type?
This story talks about a specific Metadata store implementation: the ML Metadata (MLMD) library by Google, an integral part of TensorFlow Extended (TFX) that also works as a stand-alone library. Using this example will help us better understand the need for such a component.
The MLMD library
MLMD helps us analyze all the parts of an ML pipeline and their interconnections instead of looking at the boxes in the figure in isolation. It provides the full lineage of every event that happened, and most importantly, the full history of our model. Among other things, MLMD can help us answer:
- Which dataset did the model train on?
- What were the hyperparameters used to train the model?
- What were the metrics of the model?
- Which pipeline run created the model?
- Have we trained any other model using this dataset?
- How do the different models compare?
- Which version of a specific ML framework created this model?
MLMD needs a database to store the information and dependencies of every step. To this end, it exposes an API to perform the necessary operations on several entities in an SQL database. Currently, MLMD supports SQLite and MySQL. However, in most cases, you won’t ever care about the DBMS that is running underneath.
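As a quick illustration, here is a minimal sketch of how one might connect to a metadata store backed by a local SQLite file using the ml_metadata Python package; the database filename is just a placeholder, and the store object it creates is reused in the snippets that follow.

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Configure a connection to a local SQLite database (placeholder path).
connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = "metadata.sqlite"
connection_config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE

# The store object exposes the API we use in the rest of this walkthrough.
store = metadata_store.MetadataStore(connection_config)
```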
The most important entities created and stored by MLMD are:
- Artifacts that are generated by the pipeline steps (e.g., the trained model)
- Metadata about the executions (e.g., the step itself)
- Metadata about the context (e.g., the whole pipeline)
MLMD in action
Let’s now walk through a typical ML workflow and integrate MLMD into the pipeline steps. Initially, we need to create two Artifacts: one to represent the dataset and one for the model. To this end, we should first register the relevant ArtifactTypes. Think of it like this: the ArtifactType is the class, and the Artifact is the object.
Let’s see the ArtifactType representing the dataset. In our declaration, we specify that each dataset Artifact should have two custom properties: a day and a split. Similarly, the model Artifact has a version and a name.
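A minimal sketch of registering these two ArtifactTypes with the store created above could look like the following; the type names (“DataSet”, “SavedModel”) and the variable names are illustrative choices, not prescribed by MLMD.

```python
# ArtifactType for datasets, with two custom properties: day and split.
data_type = metadata_store_pb2.ArtifactType()
data_type.name = "DataSet"
data_type.properties["day"] = metadata_store_pb2.INT
data_type.properties["split"] = metadata_store_pb2.STRING
data_type_id = store.put_artifact_type(data_type)

# ArtifactType for models, with a version and a name.
model_type = metadata_store_pb2.ArtifactType()
model_type.name = "SavedModel"
model_type.properties["version"] = metadata_store_pb2.INT
model_type.properties["name"] = metadata_store_pb2.STRING
model_type_id = store.put_artifact_type(model_type)
```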
On top of that, other properties are passed directly to every Artifact. Think of it as the inheritance property in object-oriented programming. For example, each Artifact should have a uri pointing to the physical object. Thus, let’s create an Artifact for the dataset.
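Continuing the sketch, the dataset Artifact could be created like this; the uri and the property values are placeholders.

```python
# A dataset Artifact of the "DataSet" type; the uri points to the physical files.
data_artifact = metadata_store_pb2.Artifact()
data_artifact.type_id = data_type_id
data_artifact.uri = "path/to/the/dataset"
data_artifact.properties["day"].int_value = 1
data_artifact.properties["split"].string_value = "train"
[data_artifact_id] = store.put_artifacts([data_artifact])
```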
Next, let’s create an ExecutionType and the corresponding Execution object to track the steps in our pipeline. Let’s create a trainer Execution object to represent the training step and set its status to running.
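A possible sketch, tracking the step’s status through a custom state property (an assumption of this example, not something MLMD imposes):

```python
# ExecutionType for the training step, with a custom state property.
trainer_type = metadata_store_pb2.ExecutionType()
trainer_type.name = "Trainer"
trainer_type.properties["state"] = metadata_store_pb2.STRING
trainer_type_id = store.put_execution_type(trainer_type)

# An Execution object representing one run of the training step.
trainer_run = metadata_store_pb2.Execution()
trainer_run.type_id = trainer_type_id
trainer_run.properties["state"].string_value = "RUNNING"
[run_id] = store.put_executions([trainer_run])
```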
Now, we would like to specify that the dataset Artifact we created before is an input to the Execution step named “Trainer.” We can do that by declaring an Event entity.
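Declaring the input Event could look like this, reusing the ids returned by the previous snippets:

```python
# Declare the dataset Artifact as an input of the "Trainer" Execution.
input_event = metadata_store_pb2.Event()
input_event.artifact_id = data_artifact_id
input_event.execution_id = run_id
input_event.type = metadata_store_pb2.Event.DECLARED_INPUT
store.put_events([input_event])
```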
When the training step is done, it produces a model. Let’s define a model Artifact and set it as an output of the Execution step “Trainer.”
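A sketch of the model Artifact and its output Event; again, the uri and the property values are placeholders.

```python
# A model Artifact of the "SavedModel" type.
model_artifact = metadata_store_pb2.Artifact()
model_artifact.type_id = model_type_id
model_artifact.uri = "path/to/the/model"
model_artifact.properties["version"].int_value = 1
model_artifact.properties["name"].string_value = "MNIST-v1"
[model_artifact_id] = store.put_artifacts([model_artifact])

# Declare the model Artifact as an output of the "Trainer" Execution.
output_event = metadata_store_pb2.Event()
output_event.artifact_id = model_artifact_id
output_event.execution_id = run_id
output_event.type = metadata_store_pb2.Event.DECLARED_OUTPUT
store.put_events([output_event])
```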
Finally, the “Trainer” step is done, and we can set its status to “COMPLETED.”
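In this sketch, that simply means updating the custom state property of the Execution we created earlier:

```python
# Mark the "Trainer" Execution as completed by updating its state property.
trainer_run.id = run_id
trainer_run.properties["state"].string_value = "COMPLETED"
store.put_executions([trainer_run])
```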
To get the whole picture, let’s bind everything together and record the complete lineage of our model Artifact, using Attribution and Association entities.
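A sketch of this final step could look like the following; the ContextType name, the note property, and the experiment name are illustrative.

```python
# A ContextType and a Context instance representing the experiment.
experiment_type = metadata_store_pb2.ContextType()
experiment_type.name = "Experiment"
experiment_type.properties["note"] = metadata_store_pb2.STRING
experiment_type_id = store.put_context_type(experiment_type)

my_experiment = metadata_store_pb2.Context()
my_experiment.type_id = experiment_type_id
my_experiment.name = "exp1"  # placeholder experiment name
my_experiment.properties["note"].string_value = "My first experiment."
[experiment_id] = store.put_contexts([my_experiment])

# Attribution: the model Artifact belongs to the experiment Context.
attribution = metadata_store_pb2.Attribution()
attribution.artifact_id = model_artifact_id
attribution.context_id = experiment_id

# Association: the "Trainer" Execution belongs to the experiment Context.
association = metadata_store_pb2.Association()
association.execution_id = run_id
association.context_id = experiment_id

store.put_attributions_and_associations([attribution], [association])
```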
These last few lines create a Context entity for the experiment, link the “Trainer” Execution to it, and attribute the model to it as its output. That’s all; MLMD takes care of the rest, so you will be able to track everything as we saw in the first section.
While the training code is the heart of an ML engine, the circulatory system which connects everything is often missing. There are many components in an end-to-end ML system, and each has an important role to play.
However, today we talked about a component that works silently in the background and provides the glue that binds everything together—the Metadata store.
We saw how MLMD implements this idea, its core concepts, and how we could use it in a simple ML setting. To get started, see the installation instructions here. However, you won’t see its full potential if you install it locally. Instead, it is better to use it in a complete, cloud-native environment. Thus, I would suggest working inside a MiniKF instance, where everything is preconfigured. See the story below to get started:
My name is Dimitris Poulopoulos, and I’m a machine learning engineer working for Arrikto. I have designed and implemented AI and software solutions for major clients such as the European Commission, Eurostat, IMF, the European Central Bank, OECD, and IKEA.
If you are interested in reading more posts about Machine Learning, Deep Learning, Data Science, and DataOps, follow me on Medium, LinkedIn, or @james2pl on Twitter. Also, visit the resources page on my website, a place for great books and top-rated courses, to start building your own Data Science curriculum!