Versioned Data Lakes, Declarative DAGs, and Shared SQL with SQLPad.
Data will power every piece of our existence in the near future. I collect “Data Points” to help understand this future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
The three data points for today are next-gen data lakes with lakeFS, declarative DAGs with boundary-layer, and fast data engineer onboarding with SQLPad.
1 lakeFS, versioning and branching data
lakeFS is a tool that provides a layer on top of your AWS S3 or GCS data lake. It allows automatic versioning and branching of your data. The team provides lots of best practices, e.g. showing how to set up a data mesh using lakeFS. It’s open-source and evolves pretty fast, so I suggest you take a look at it!
I suggest you first take a look at the docs, which are really well written, then head over to the blog post about data quality, and finally maybe take a look at how to use lakeFS with Apache Airflow.
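To make the branching idea a bit more concrete: as far as I understand the lakeFS S3-compatible endpoint, the repository takes the place of the bucket and the branch (or commit) is the first segment of the object path. The helper below is only a sketch of that addressing scheme. The function name, repository name, and branch names are made up for illustration, and it does not talk to lakeFS at all.

```python
def lakefs_uri(repository: str, ref: str, path: str) -> str:
    """Build an S3-style URI for an object on a given lakeFS branch or commit.

    Assumption: objects are addressed as s3://<repository>/<ref>/<path>.
    """
    return f"s3://{repository}/{ref}/{path.lstrip('/')}"


# Hypothetical repository and branches: the same object path,
# read once from "main" and once from an experiment branch.
object_path = "events/2021/01/part-0.parquet"
production = lakefs_uri("example-repo", "main", object_path)
experiment = lakefs_uri("example-repo", "experiment-1", object_path)

print(production)
print(experiment)
```

The point is that switching between versions of the data is just a change of the branch segment, so existing S3-based readers should not need much more than a different path.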
2 boundary-layer, declarative DAGs
DAGs, or directed acyclic graphs, have become the concept that data scientists and data engineers alike use to describe their data pipelines. A data pipeline represented by a DAG usually contains two things: the “graph”, meaning the steps and the logic chaining them together, and the possibly complex transformation logic inside the steps.
Mixing the two violates the “Single Level of Abstraction” principle and thus makes DAGs really hard to understand. With the “Composed Method”, developers aim to keep all the code in one place on the same “level”. Since DAGs often contain two or more separate levels, this can be solved by extracting one from the other. Declarative DAG tools aim to do just that, and boundary-layer, the DAG tool from Etsy, seems to be the most promising. It’s built for Apache Airflow DAGs and lets you declare the DAG in YAML, keeping it apart from the step logic. The YAML then compiles down to an Apache Airflow DAG.
FWIW, the composed method can of course be used in a normal DAG written in plain old Python. The benefit of declarative DAG tools is that they enforce this method, not that they are the only way to do it.
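Here is a minimal sketch of that separation in plain Python. It deliberately uses neither Airflow nor boundary-layer: the graph is pure data (the part a YAML file would hold), the step logic lives in ordinary functions, and a tiny runner built on the standard library's `graphlib` (Python 3.9+) ties them together. All step names and the `run` helper are hypothetical.

```python
from graphlib import TopologicalSorter


# Level 1: step logic. One plain function per step, no graph knowledge.
def extract(_upstream):
    return [1, 2, 3]


def transform(upstream):
    return [value * 10 for value in upstream["extract"]]


def load(upstream):
    return sum(upstream["transform"])


STEPS = {"extract": extract, "transform": transform, "load": load}

# Level 2: the graph. Pure data mapping each step to its upstream steps,
# which is the part a declarative tool would keep in a YAML file.
GRAPH = {"extract": [], "transform": ["extract"], "load": ["transform"]}


def run(graph, steps):
    """Run every step after its upstream steps and collect the results."""
    results = {}
    for name in TopologicalSorter(graph).static_order():
        upstream = {dep: results[dep] for dep in graph[name]}
        results[name] = steps[name](upstream)
    return results


results = run(GRAPH, STEPS)
print(results["load"])
```

Reading `GRAPH` tells you what the pipeline looks like, reading the functions tells you what each step does, and neither needs the other to be understood.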
3 SQLPad, shared SQL and fast onboarding
I remember being set up as a data guy: getting some SQL editor, asking someone to tell me the connection strings I needed, getting to know the databases, etc. When working on a ticket, I usually had to hack together completely new SQL.
But versioning and configuring connections and SQL is actually really easy! And having a nice-looking UI and being able to share credentials etc. speeds up development quite a bit. I’ve been using SQLPad for querying and simple visualizations for quite some time and have enjoyed it.
You can use a combination of versioned and seed connections to have both a versioned set of data and a “custom set” for each developer if needed.
I simply like to use SQLPad as a local SQL editor running inside Docker, with versioned connections and queries that can be shared with the team. But you can of course also deploy SQLPad and put the data onto some persistent storage.
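As a rough sketch of what “versioned connections” can look like: to my knowledge SQLPad can read connections from environment variables named `SQLPAD_CONNECTIONS__<connectionId>__<fieldName>`, so a small script can render them into an env file that lives in the repo next to your Docker setup. Please check the naming scheme and the available fields against the SQLPad docs for your version. The helper name, the connection id, and all values below are made up.

```python
def sqlpad_connection_env(connection_id: str, fields: dict) -> list:
    """Render one connection as KEY=value lines for an env file.

    Assumption: SQLPad picks up SQLPAD_CONNECTIONS__<id>__<field> variables.
    """
    prefix = f"SQLPAD_CONNECTIONS__{connection_id}__"
    return [f"{prefix}{field}={value}" for field, value in fields.items()]


# Hypothetical connection. Keep passwords out of version control and
# inject them separately, e.g. from a secret store.
lines = sqlpad_connection_env(
    "warehouse",
    {
        "name": "Team warehouse",
        "driver": "postgres",
        "host": "db",
        "database": "analytics",
    },
)

print("\n".join(lines))
```

New team members then only need to start the container with that env file instead of asking around for connection strings.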
In other news
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated, but you can always hit the unsubscribe button!