Python has a rich ecosystem of open-source libraries that ease data science development. Famous libraries such as Pandas, NumPy, and Scikit-Learn provide usable, flexible, high-level APIs, but they focus on offering a vast API surface while largely ignoring scalability: they are designed to run on a single core and on data that fits in memory.
In other words, these libraries struggle to load larger-than-memory datasets and to perform exploration and visualization on them. The Dask library comes to the rescue: it offers an API similar to that of Pandas and NumPy and accelerates the workflow by parallelizing operations across all CPU cores.
In this article, we take a deep dive into the Dask framework and how it works under the hood.
Note: At the end of this article, you will find benchmark timings comparing the Pandas and Dask libraries, where you can observe how Dask performs better than Pandas.
Dask is an open-source Python library that provides a high-performance implementation behind an API similar to that of Pandas and NumPy, along with the ability to accelerate the workflow by parallelizing operations across all CPU cores. It is a flexible library for multi-core and distributed parallel execution on larger-than-memory datasets. It has two main components:
- High-Level Collections: Dask provides high-level collections such as arrays and data frames that extend the familiar NumPy and Pandas APIs to distributed environments. These collections run in parallel on top of dynamic task schedulers.
- Dynamic Task Scheduling: An alternative to the threading and multiprocessing libraries, Dask provides a dynamic task scheduler that executes task graphs in parallel.
Dask comes with five high-level collections: Dask Array, Dask DataFrame, Dask Bag, Dask Delayed, and Futures. Any computation on these high-level collections generates a task graph, which is then executed in parallel by the dynamic task scheduler.
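To make this concrete, here is a minimal sketch of the pattern (the array shape and chunk size are arbitrary): building a collection only records a task graph, and calling .compute() hands that graph to the scheduler.

```python
import dask.array as da

# Building the collection records a task graph of chunked
# operations; no computation happens yet.
x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))
total = (x + 1).sum()

# .compute() hands the task graph to the dynamic task scheduler,
# which executes the chunk-wise tasks in parallel.
print(total.compute())
```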
Each of these five collections extends a familiar interface from Pandas, NumPy, or core Python. Let's look at each of them in turn.
Dask Array:
Dask Array is a data structure that consists of multiple NumPy arrays stored in chunks. In other words, a Dask Array is broken up into many small NumPy arrays. Its API closely mirrors that of the NumPy library.
Any operation performed on a Dask Array is, under the hood, performed on the underlying NumPy arrays. Dask Array uses blocked algorithms to carry out computations, exploiting multiple cores of the system to parallelize them. Breaking a large NumPy array into smaller chunks allows users to perform out-of-memory computation and load larger-than-memory datasets.
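As a small illustration (the array shape and chunk size are again arbitrary), each reduction below is computed block by block on the NumPy chunks and then combined:

```python
import dask.array as da

# A 100,000 x 1,000 array split into ten NumPy blocks of 10,000 rows each.
x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))

# The mean is computed per block and combined, so the full array
# never needs to fit in memory at once.
col_means = x.mean(axis=0)
print(col_means.compute()[:5])

# Under the hood, each chunk is a plain NumPy array.
print(type(x.blocks[0, 0].compute()))  # <class 'numpy.ndarray'>
```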
Dask DataFrame:
Similar to Dask Array, a Dask DataFrame consists of multiple Pandas data frames stored in chunks. A Dask DataFrame is partitioned row-wise into small Pandas data frames, grouping rows by the index for efficiency.
Because the Dask DataFrame is broken into smaller Pandas data frames, computations run in parallel using the built-in blocked algorithms. Dask also uses memory mapping: rather than loading the entire dataset at once, it merely points to the location of the data, in coordination with the operating system.
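A minimal sketch of this partitioning (in practice the data would more likely be read lazily with something like dd.read_csv; the column names here are made up):

```python
import pandas as pd
import dask.dataframe as dd

# A pandas frame stands in for a dataset that would normally be
# loaded lazily from disk.
pdf = pd.DataFrame({"group": ["a", "b"] * 50_000,
                    "value": range(100_000)})

# Partition the rows into four small pandas data frames.
ddf = dd.from_pandas(pdf, npartitions=4)

# The familiar pandas API: each partition is processed in parallel
# and the per-partition results are combined.
print(ddf.groupby("group")["value"].mean().compute())
```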
Dask Bag:
Dask Bag is a high-level parallel collection of Python objects, used to deal with semi-structured or unstructured datasets. Like other Dask collections, Dask Bag is lazily evaluated and can easily be parallelized over a cluster of machines.
Initial data massaging and processing is typically done with Python lists, dicts, and sets, because the initial data might be JSON, CSV, XML, or any other format that does not enforce strict datatypes. Dask Bag is well suited to exactly this kind of task.
In a nutshell, Dask Bag can be summed up as:
dask.bag = map, filter, toolz + parallel execution
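For instance, filtering and mapping over semi-structured records might look like this (the records are invented for illustration):

```python
import dask.bag as db

# Semi-structured records, e.g. parsed from JSON lines.
records = [{"name": "alice", "age": 31},
           {"name": "bob", "age": 17},
           {"name": "carol", "age": 25}]

bag = db.from_sequence(records, npartitions=2)

# map and filter build a lazy task graph that runs in parallel
# across the partitions when computed.
adults = bag.filter(lambda r: r["age"] >= 18).map(lambda r: r["name"])
print(adults.compute())  # ['alice', 'carol']
```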
Dask Delayed:
Dask Delayed is a Dask collection that can be used to parallelize custom functions and loops. Delayed functions can be used to parallelize existing codebases or to build complex systems.
Rather than executing a function on the fly, Dask defers its execution, placing the function and its arguments into a task graph. As soon as .compute() is called, the graph is executed in parallel.
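A classic minimal example (inc and add are toy functions used only to show the graph structure):

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

# Nothing runs here -- each call just adds a node to the task graph.
a = inc(1)
b = inc(2)
total = add(a, b)

# inc(1) and inc(2) do not depend on each other, so the scheduler
# can run them in parallel before add() combines their results.
print(total.compute())  # 5
```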
Dynamic Task Scheduling:
Dask's dynamic task schedulers are designed to scale from a single personal laptop up to clusters with thousands of nodes.
All the high-level collections in Dask have APIs that generate task graphs, where each node in the graph is a normal Python function and the edges between nodes are normal Python objects, created as outputs by one task and used as inputs by another. After these task graphs are generated, a scheduler is required to execute them on parallel hardware.
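The same task graph can be handed to different schedulers; for example (the array sizes are arbitrary):

```python
import dask.array as da

x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
y = (x + x.T).mean()

# Run on the single-threaded scheduler (handy for debugging)...
print(y.compute(scheduler="synchronous"))

# ...or on the threaded scheduler, which spreads the same graph
# across all CPU cores.
print(y.compute(scheduler="threads"))
```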
The benchmark timings above were recorded on a laptop with the following configuration: RAM (8 GB), Disk (1 TB), Processor (i5 8th Gen @ 1.6 GHz).
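The exact benchmark script is not reproduced here, but a timing comparison along these lines could be used to replicate the results (the CSV path and column name are placeholders):

```python
import time
import pandas as pd
import dask.dataframe as dd

PATH = "large_dataset.csv"  # placeholder for a large file on disk

t0 = time.perf_counter()
pandas_result = pd.read_csv(PATH)["value"].mean()
print(f"pandas: {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
dask_result = dd.read_csv(PATH)["value"].mean().compute()
print(f"dask:   {time.perf_counter() - t0:.2f}s")
```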