Process Dataset with 200 Million Rows using Vaex

The following common data exploration operations were run on a dataset of 200 million rows, on a Windows machine with 8 GB of RAM:

Read Data
Data Shape
Data describe
Value Counts
Group by column and aggregation
10th percentile computation
Visualizing a column
Apply function
Adding a new column
Filter data frame

Read Data:

The experiment follows the best practice for the tool, which for Vaex means using the binary HDF5 format. The CSV files therefore need to be converted to HDF5 first so that Vaex can perform at its best. Vaex required 33 minutes to convert the 2313 partitions of the CSV data to HDF5.
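A rough sketch of this conversion step is shown below. It assumes the CSV partitions can be matched with a glob pattern and that each partition fits in memory on its own; the helper names and the input pattern are illustrative and not taken from the original experiment. The Vaex calls used are vaex.from_csv and DataFrame.export_hdf5.

```python
import glob
import os


def hdf5_path_for(csv_path, out_dir):
    """Map 'some/dir/analysis_7.csv' to '<out_dir>/analysis_7.hdf5'."""
    stem = os.path.splitext(os.path.basename(csv_path))[0]
    return os.path.join(out_dir, stem + ".hdf5")


def convert_partitions(csv_pattern, out_dir):
    """Convert every CSV file matching csv_pattern to HDF5, one file at a time."""
    import vaex  # imported here so hdf5_path_for works without Vaex installed

    os.makedirs(out_dir, exist_ok=True)
    written = []
    for csv_path in sorted(glob.glob(csv_pattern)):
        # Assumption: a single partition is small enough to load into memory.
        df = vaex.from_csv(csv_path)
        target = hdf5_path_for(csv_path, out_dir)
        df.export_hdf5(target)
        written.append(target)
    return written
```

For example, convert_partitions("200M_data_csv/analysis_*.csv", "200M_data_hdf5") would write one HDF5 file per CSV partition; the CSV folder name here is hypothetical.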

Now to read the HDF5 data from disk:

import vaex
df = vaex.open("200M_data_hdf5/analysis_*.hdf5")

Vaex required 6 minutes to read the entire dataset.
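The article does not state how the timings were measured. One simple way to take comparable wall-clock measurements is a small timer like the sketch below; the timed helper is a hypothetical name, and it uses only the Python standard library.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the wall-clock time of the enclosed block, optionally storing it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the measurement for a later summary
        print(f"{label}: {elapsed:.3f} s")
```

Placing the vaex.open call above inside a with timed("open"): block prints the elapsed time in seconds once the call returns.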

Data Shape:

Computing the number of rows in the dataset takes Vaex effectively no time (measured as 0 ns). The dataset has around 200 million rows.

Data describe:

The .describe() function generates descriptive statistics for each column, summarizing the central tendency, dispersion, and shape of its distribution while excluding NaN values.

Vaex took around 15 minutes to compute the descriptive statistics of each column.

Value Counts:

The .value_counts() function computes the frequency distribution of the categorical column ‘name’ in the Vaex data frame.

Vaex took around 3.5 minutes to return with the frequency distribution of the ‘name’ column.
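A minimal sketch of this step, assuming df is the data frame opened earlier and that the column is called ‘name’ as in the text. The helper name and the top parameter are illustrative; value_counts on a Vaex column returns a pandas Series, sorted by descending count by default.

```python
def name_frequencies(df, top=10):
    """Return the `top` most frequent values of the 'name' column."""
    counts = df["name"].value_counts()  # one pass over the column
    return counts.head(top)
```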

Group by:

Like the pandas API, Vaex provides a function for grouping and aggregation. Here the data is grouped by the ‘name’ column and the mean of column ‘x’ is computed for each group.

Vaex took around 2.5 minutes to run this grouping and aggregation.

A second test groups by the ‘id’ column and computes the mean of two columns, ‘x’ and ‘y’.

Vaex took around 11.5 minutes to run this grouping and aggregation.
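Both groupings above could be written with one small helper, sketched here under the assumption that df is the data frame opened earlier. The helper name is illustrative. To my understanding, the string "mean" is a shorthand accepted in the agg dictionary, with each key naming the column being aggregated; vaex.agg.mean("x") is the more explicit form.

```python
def group_mean(df, by, columns):
    """Group df by `by` and compute the mean of each column in `columns`."""
    agg = {column: "mean" for column in columns}
    return df.groupby(by=by, agg=agg)
```

The two experiments then correspond to group_mean(df, "name", ["x"]) and group_mean(df, "id", ["x", "y"]).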

10th percentile computation:

Vaex has a percentile_approx function that computes an approximation of a given percentile.

Vaex took 46.8 seconds to compute the 10th percentile of the ‘id’ column.
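As a sketch, assuming df is the data frame opened earlier, the call might look like this. The wrapper name is illustrative; percentile_approx takes the column expression and the percentage to compute.

```python
def tenth_percentile(df, column="id"):
    """Approximate 10th percentile of `column` (defaults to 'id', as in the text)."""
    return df.percentile_approx(column, percentage=10)
```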

Visualizing a column:

Plotting a histogram of a very large dataset is problematic because traditional data analysis tools are not optimized for data of this size.

Using the plot1d function in Vaex to plot a histogram of a numerical column took 3.5 minutes.

Apply function:

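Like the Pandas API, Vaex has an apply function that applies a Python function to the values of the data frame. The function tested here returns the list of vowels in each value of the ‘name’ column.

Vaex took almost no time (132 ms) for the 200 million records of the ‘name’ column, since apply returns a lazily evaluated expression rather than computing every result immediately.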
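A sketch of what such a function might look like, assuming df is the data frame opened earlier. The vowels helper is a guess at the original function, which is not shown in the text. Note that apply only builds an expression here; the Python function itself runs when the values are actually needed.

```python
def vowels(name):
    """Return the vowels of `name` in order of appearance, keeping their case."""
    return [ch for ch in name if ch.lower() in "aeiou"]


def vowels_expression(df):
    """Build a lazy expression that applies `vowels` to each 'name' value."""
    return df.apply(vowels, arguments=[df["name"]])
```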

Adding a new column:

Adding a column takes Vaex almost no time because the new column is not materialized immediately; instead, the Vaex expression system stores only the expression that defines it.

Adding a new column took Vaex about 251 ms.
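The text does not say which column was added, so the sketch below uses a hypothetical x_plus_y column built from ‘x’ and ‘y’, assuming df is the data frame opened earlier. The assignment stores an expression; no new data is written at this point.

```python
def add_sum_column(df, new_name="x_plus_y"):
    """Add a column defined as x + y (hypothetical example column)."""
    df[new_name] = df["x"] + df["y"]  # in Vaex this registers an expression only
    return df
```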

Filter data frame:

Like the pandas API, Vaex has a concept of selection for filtering the data on a given condition. Vaex does not filter the data frame immediately; instead, it generates an expression.

Vaex takes almost no time (273 ms) to apply the filter, which reduced the data frame from 200 million to 98 million rows.
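The filter condition used in the experiment is not given, so the sketch below assumes a simple condition on ‘x’ purely for illustration, with df being the data frame opened earlier. Indexing a Vaex data frame with a boolean expression returns a filtered view rather than a copy of the data.

```python
def keep_positive_x(df):
    """Return the rows where x > 0 (assumed condition, not from the article)."""
    return df[df["x"] > 0]
```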

(Image by Author) Time taken by each Vaex operation on the 200-million-row dataset

For this article, we generated 200 million records of artificial time-series data with 4 columns, nearly 12 GB in size. With the Pandas library, it is impossible to read this dataset and explore or visualize it on a machine with 8 GB of RAM.

A Vaex data frame can easily read the data and perform the required exploration and visualization. The only requirement is that Vaex works best with HDF5 data, so the CSV files need to be converted to HDF5 format first.

Also, most of the popular Pandas API is available in Vaex, which makes it a very useful library for working with large datasets.

[1] Vaex Documentation: https://vaex.readthedocs.io/en/latest/
