It’s not all bad: once you get to grips with Glue, it can become very powerful. The DynamicFrame library (which wraps Spark DataFrames) is awesome for reading and writing data to Glue tables. With one line of code (a few more if you insist on PEP-8 compliance…) you can write a nearly unlimited amount of data to a partitioned table, something which still boggles my mind! This makes Glue a great tool if you have large-scale datasets that suit DynamicFrames, but inefficient at best for anything less.
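For anyone who hasn’t seen it, the “one line” I mean is the DynamicFrame writer. A minimal sketch, assuming you’re inside a Glue job where the awsglue libraries are available; the database, table, bucket and partition key names below are placeholders:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read an existing Glue table into a DynamicFrame (names are placeholders)
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="events"
)

# The one-liner: write it all back to S3 as a partitioned table
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/processed/", "partitionKeys": ["event_date"]},
    format="parquet",
)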
As a quick example of a common pattern: I spend a lot of my time creating ETL jobs that generate new datasets from raw data stored in S3, with the transformed data written back to S3 for later analysis. Sounds simple, right?
Assuming I have my code locally on my machine and I want to create a Glue Job to run this code, I would have to:
- Upload my script to S3 via the AWS CLI (let’s hope it works the first time… there’s a rough sketch of the scripted parts after this list)
- Go to the AWS Console
- Go to the Glue page
- Create a new Job
- Configure an IAM role for the job to run with the relevant permissions
- Input the location of your script in S3
- Set up an ENI if required to ensure data access across VPCs
- Include the zipped dependencies for any required libraries (no C extensions though, so good luck using pandas!)
- Add a schedule
- Run the Job
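Even the bits of that list you can script add up to a fair amount of ceremony. A rough boto3 sketch of just the upload and job-creation steps (the bucket, role ARN and job name below are placeholders):

import boto3

# Placeholder names throughout: bucket, key, role ARN and job name
s3 = boto3.client("s3")
s3.upload_file("my_etl_script.py", "my-glue-scripts", "jobs/my_etl_script.py")

glue = boto3.client("glue")
glue.create_job(
    Name="my-etl-job",
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-glue-scripts/jobs/my_etl_script.py",
        "PythonVersion": "3",
    },
)

And that still leaves the IAM permissions, networking, dependency zips and the schedule to sort out by hand.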
And this doesn’t even go into how to develop the script, I mean just look at the instructions for attaching a development endpoint to Pycharm! (https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-pycharm.html)
None of this sets Glue up as a good tool for day-to-day data science or machine learning exploration. To quickly summarise against the specification defined above, Glue is:
- Difficult to quickly iterate on new script versions due to a lack of good local testing
- Very hard to debug effectively due to Spark logging
- Poorly integrated with software engineering processes, due to remote code editing and clunky deployment
- Easy to integrate with S3 and the Glue Data Catalog
So what other options are there? Let’s introduce our contender. Metaflow!
Metaflow is a fairly new tool, making its entrance in late 2019. It’s designed and built by Netflix as an open-source tool that gives power back to data scientists, reducing complexity and increasing iteration speed. It comprises a Python/R client library that orchestrates the user’s builds and a set of AWS resources (in the form of a CloudFormation stack) that need to be deployed by the user. So, enough teasing: let’s explore why I’m now using Metaflow for my Python ETL jobs.
Developing new code on Metaflow is a breeze
Metaflow adds some new decorators to provide its functionality, but the base of the orchestration uses established Python features: classes and methods (as a “classically trained” software engineer, this fills me with great happiness). A Flow is a class, and the steps within that Flow are methods, each of which links to the next step to execute. A Flow or step can then have decorators applied to it to add further configuration.
# An example of a Metaflow Flow running locally
from metaflow import FlowSpec, step

class TestFlow(FlowSpec):

    @step
    def start(self):
        print("This is the start step!")
        self.next(self.process)  # Runs the process step next

    @step
    def process(self):
        print("This is the process step!")
        self.next(self.end)  # Then the end step

    @step
    def end(self):
        print("This is the end step!")

if __name__ == "__main__":
    TestFlow()  # This initialises the Flow, which then runs the start step
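Assuming this lives in a file called test_flow.py (a name I’ve made up for this example), you kick it off locally with python test_flow.py run, and Metaflow executes each step in order on your machine.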
I don’t think I’ve ever found a tool quite as seamless as Metaflow for switching up the remote deployment of code. With one decorator on my steps, I can go from running my Flow on my laptop to a 64-core behemoth on EC2. It’s that simple.
# Same Flow as before, but this one will run in the cloud!
from metaflow import FlowSpec, batch, step

class TestFlow(FlowSpec):

    @batch(cpu=64, memory=2000)  # Each step gets these resources
    @step
    def start(self):
        print("This is the start step!")
        self.next(self.process)

    @batch(cpu=64, memory=2000)
    @step
    def process(self):
        print("This is the process step!")
        self.next(self.end)

    @batch(cpu=64, memory=2000)
    @step
    def end(self):
        print("This is the end step!")

if __name__ == "__main__":
    TestFlow()
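If I’m remembering the CLI correctly, you don’t even need to edit the file to do this: running python test_flow.py run --with batch:cpu=64,memory=2000 attaches the decorator to every step at launch time.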
Metaflow is easy to manage
So this one may be a little hard to swallow if you’re not familiar with AWS, but bear with me. Metaflow provides a CloudFormation stack to deploy its infrastructure. This is the slimmest form of the infrastructure, providing all the required parts for using the platform. This single stack is easy to deploy and utilize yourself, and only costs around $40 a month to maintain. It is also Metaflow’s biggest downside, however: it requires some knowledge of AWS to maintain at scale. But bringing up and tearing down a single stack is trivial and something I’d expect most people to be able to do.
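To give a feel for what “trivial” means here, bringing the stack up (or tearing it down) is roughly one call. A sketch with a placeholder template URL, since the real template and its parameters live in Metaflow’s docs:

import boto3

cfn = boto3.client("cloudformation")

# StackName and TemplateURL are placeholders; use the template from Metaflow's docs
cfn.create_stack(
    StackName="metaflow",
    TemplateURL="https://example.com/metaflow-cfn-template.yml",
    Capabilities=["CAPABILITY_IAM"],  # The stack creates IAM roles
)

# And when you're done with it:
# cfn.delete_stack(StackName="metaflow")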
Data can be stored directly into Metaflow
Another great feature of Metaflow is its metadata storage, which lets different steps in a Flow share data and persists that data to S3 once the execution finishes. In my earlier example of creating an ETL pipeline, I wouldn’t need to store the data in S3 myself, as Metaflow would do this for me. So when I come back to perform my analysis, I can just get the latest run of my Flow and extract the dataset. This makes Metaflow very powerful for both ad-hoc and scheduled data “rollups” without having to manage complex databases.
# A portion of a Flow showing the metadata storage and retrieval
@step
def start(self):
    print("This is the start step!")
    self.message = "Metaflow metadata is cool!"  # Saved to S3
    self.next(self.process)

@step
def process(self):  # Loads message from S3 back into self
    print(f"Let's print the previous step's message: {self.message}")
    self.next(self.end)
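Pulling that artifact back out later, for example in an analysis notebook, is just a couple of lines with the Client API. A minimal sketch, assuming the Flow is called TestFlow and has at least one finished run:

from metaflow import Flow

# Grab the most recent run of the Flow and read the stored artifact
run = Flow("TestFlow").latest_run
print(run.data.message)  # "Metaflow metadata is cool!"

Swap message for a dataset and you have the “rollup” pattern I described above.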
Excluding the infrastructure cost, running jobs on Metaflow is cheap
As Metaflow uses AWS Batch under the hood, you can configure it to use Spot pricing. This makes instance cost virtually negligible, and as Metaflow can handle retries automatically, there is little risk posed by Spot instances going offline. With this, I ran a 64-core Flow for just under an hour for $1.10 in compute, a stark contrast to a 16-DPU Glue job at around $7 for an hour (16 DPUs at Glue’s $0.44 per DPU-hour).