Metaflow: the Swiss Army knife of data science

And how we use it for serverless data pipeline tasks

Almost all data science projects require some ETL/ELT (extract, transform and load) tasks. There are a lot of tools for this job, such as Airflow, Luigi and Beam, and each of them has its pros and cons. Most of these solutions also come with a non-trivial learning curve.

But sometimes we don’t want to kill an ant with a tank: we just need to load a file, run queries, replace some tables and do something useful with that data. So what can we do when we simply want to orchestrate a set of tasks in a lightweight and flexible way, without having to learn a complex framework? For those situations we found that Metaflow fits most of our requirements very well.

As the Metaflow team says:

Metaflow is a human-friendly Python library that helps scientists and engineers build and manage real-life data science projects. Metaflow was originally developed at Netflix to boost productivity of data scientists who work on a wide variety of projects from classical statistics to state-of-the-art deep learning.

We ❤️ Metaflow because it allows us to create workflows in a very simple way. The structure of these workflows is quite flexible, and our team has been using it to build workflows for preprocessing data, training models, scoring predictions, etc.

This is because Metaflow integrates very well with almost all Python libraries, it can be used on your local computer (or on a virtual machine), and it handles the parallel execution of steps for you. Metaflow is also very well integrated with AWS, where you can scale your workflows very easily, but if you are not using AWS this can be a drawback because you lose some useful functionality.

An advantage of Metaflow is that it’s very easy to use. Our data scientists have started using it and it wasn’t too hard for them to get used to it. This is because a Metaflow flow looks (and feels) like any normal Python program, unlike Airflow or Luigi, which are more complex frameworks and require more code overhead. A simple Metaflow pipeline looks like this:

In this workflow we have a start step, then we run the a and b steps in parallel (all handled by Metaflow), then we need a join step (this is required by Metaflow), which allows us to collect the results of the parallel steps, and finally we have an end step. Visually it looks like this:

The hello.py flow

You can run this workflow by simply typing:

python hello.py run

And our console should look like this:

Metaflow output

This is a very basic workflow (and code sample), but it shows how easy it is to use Metaflow! There are a lot of nice features, which you can read about in the Metaflow documentation.

Some of my favorites are:

Resuming your workflow from where it failed.
Saving all the parameters, outputs and metadata from each step.
Generating a visualization of the graph of operations.

At Spike we use BigQuery as our Data Warehouse (DW) in some of our projects. Often we need to feed our DW with new data and then run a series of queries that generate summary tables from our raw tables.

As I said before, there are plenty of tools and frameworks for this. But these tools require a server that is always on and someone to maintain it, and we don’t want that if we really just need to load data and update our tables once a day or once a week.

For those situations, we started to use Metaflow as our pipeline orchestrator. We simply put all the queries in the flow, and sometimes also the code required to read, validate and upload the new data into BigQuery, and then run the queries needed to update everything.

To do that, we simply used the Python BigQuery API and created a wrapper around it to make some operations cleaner:
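A minimal sketch of what such a wrapper could look like, assuming the google-cloud-bigquery client library. The class name BigQueryHelper and its method names are hypothetical, and loading files is left out to keep the example short:

```python
class BigQueryHelper:
    """Hypothetical thin wrapper around a BigQuery client."""

    def __init__(self, client=None):
        if client is None:
            # Imported lazily so the class can be exercised with a stub client.
            from google.cloud import bigquery

            client = bigquery.Client()
        self.client = client

    def run_query(self, sql):
        """Run one SQL statement and block until it finishes."""
        return self.client.query(sql).result()

    def replace_table(self, table_id, select_sql):
        """Create the table, or overwrite it, from the result of a SELECT."""
        ddl = f"CREATE OR REPLACE TABLE `{table_id}` AS\n{select_sql}"
        return self.run_query(ddl)

    def delete_table(self, table_id):
        """Drop a table, ignoring the case where it does not exist."""
        self.client.delete_table(table_id, not_found_ok=True)
```

Accepting the client as an argument is a design choice of this sketch: it makes the wrapper easy to test locally without touching a real project.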

With this wrapper we can simply create our Metaflow pipeline, and in each step we can update, create or delete the required tables. If we need parallel tasks, Metaflow will handle the scheduling and coordination of each process. Our pipelines typically look like this (of course, we’ll usually have more steps):

An advantage of using Metaflow is that we can run this pipeline locally without having to configure any server; we just need to have Metaflow installed like any normal Python library. Once we have our pipeline created, the next question is “How do we schedule the operation of this pipeline? 🤔”.

This turns out to be unexpectedly simple: we just need to put all our code into a Docker container and upload it to Google Container Registry, which allows us to host private Docker images. Then we can run our container in a serverless way using Google AI-Platform. I’m not going to go into details about how to do that now, but you can read about it in an article that I wrote a few weeks ago.

So we use the Google Python API to call AI-Platform. We basically have a wrapper around the Python API that allows us to send jobs to AI-Platform, and this service manages all the infrastructure required to run our container, so we don’t have to worry about VMs, configurations, etc. We simply say “run this container, please” and Google does the rest.

Our wrapper of the API looks like this:
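A minimal sketch of this kind of wrapper, assuming the google-api-python-client library and the AI Platform Training REST API (service `ml`, version `v1`). The function names, the default region and the BASIC scale tier are assumptions made for illustration:

```python
def build_job_spec(job_id, image_uri, region="us-central1", args=None):
    """Describe a job that runs a custom container.

    job_id must be unique within the project; region and scale tier are
    illustrative defaults.
    """
    return {
        "jobId": job_id,
        "trainingInput": {
            "scaleTier": "BASIC",
            "region": region,
            # The container image to run, e.g. one stored in Container Registry.
            "masterConfig": {"imageUri": image_uri},
            "args": list(args or []),
        },
    }


def submit_job(project_id, job_spec, ml_client=None):
    """Send the job to the service and return the API response."""
    if ml_client is None:
        # Imported lazily so the function can be exercised with a stub client.
        from googleapiclient import discovery

        ml_client = discovery.build("ml", "v1", cache_discovery=False)
    request = ml_client.projects().jobs().create(
        parent=f"projects/{project_id}", body=job_spec
    )
    return request.execute()
```

Keeping the job description separate from the submission call makes it possible to inspect or log the request body before anything is sent.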

We use a Cloud Function to trigger our containers: we pass the Cloud Function the URI of our container in Google Container Registry. With only one function we can send jobs from our different pipelines.
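The entry point of such a function could be sketched as follows. This assumes an HTTP-triggered Python Cloud Function, which receives a Flask request object. The JSON field name image_uri and the make_trigger factory are assumptions made for illustration, and submit stands for whatever callable actually sends the job:

```python
def make_trigger(submit):
    """Build an HTTP entry point that forwards a container URI to `submit`."""

    def trigger_pipeline(request):
        # `request` is the Flask request object passed to HTTP Cloud Functions.
        payload = request.get_json(silent=True) or {}
        image_uri = payload.get("image_uri")  # field name is an assumption
        if not image_uri:
            return ("Missing 'image_uri' in the JSON body", 400)
        submit(image_uri)
        return (f"Submitted job for {image_uri}", 200)

    return trigger_pipeline
```

In the deployed module, the entry point would then be a module-level name such as `trigger_pipeline = make_trigger(my_submit_function)`, where my_submit_function is hypothetical. Because the container URI arrives in the request, the same function can serve every pipeline.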

Our function is called via HTTP (not the best solution, but it works), so we can finally schedule it using Google Cloud Scheduler with the frequency needed. If something goes wrong, we can check the AI-Platform logs to debug.

In general, our process looks like this:

Pipeline orchestration

The whole process is managed by Google Cloud Platform and we don’t have to worry about servers, configurations, etc. We can happily enjoy our coffee each morning ☕️

Conclusions

Metaflow is a great tool, and we have been using it for different tasks in our data science projects. For this particular problem (building data pipelines) it fits very well when you need something lightweight, without having to use a complex framework or pay for infrastructure that won’t be used often.

Please feel free to make comments or suggestions.
