How to Become a Pandas Jedi

Let’s suppose, for a particular branch, we need to calculate the highest difference in total sales between two consecutive days. It might help us understand the factors that increase the sales. We can also find out if there is an extraordinary situation.

It is better to design our solution before writing the code. We can approach this tasks as follows:

Filter the data points (i.e. rows) that belong to the branch of interest
Select the date and total sales amount columns
Group the data points by date and calculate the sum
Sort the results by date
Calculate the difference between two consecutive dates
Select the maximum

It seems like a highly complicated operation. However, pandas provides versatile and powerful functions that allow us to tackle down such tasks easily.

Here is our solution:

df[df.branch == 'A'][['date','total']]
.groupby('date', as_index=False)
.sum()['total'].diff().max()2888.53

Let’s elaborate on the code. The first line filters data points that belong to branch A and selects the date and total columns. The second line groups the rows by date. The output of the group by function is automatically sorted by date because we use the date as the grouping column. If not, we could use the sort_values function after the groupby function.

The third line calculates the total amount for each group (i.e. each day). Then, we select the total column and apply the diff function. It calculates the difference between a row and its previous row. Since the rows are consecutive days, we end up getting the difference between the sales amount of two consecutive days. Finally, we use the max function to see the maximum difference.

We know the highest difference but it is not useful without knowing the date. Thus, we should improve our solution a little.

df_sub = df[df.branch == 'A'][['date','total']]
.groupby('date', as_index=False).sum()df_sub['diff'] = df_sub['total'].diff()df_sub.sort_values(by='diff', ascending=False, inplace=True)

The first operation creates a dataframe that contains the total daily sales of branch A. The second one creates a column with the difference between the totals of two consecutive days. Finally, we sort the rows based on the difference in descending order.

The first row of df_sub shows the date with the highest difference with its previous day.

(image by author)

We can confirm the result by checking the total amounts on 2019–02–16 and 2019–02–17.

(image by author)

It is important to note that this might not be the only solution for this task. Pandas provides various functions and techniques which make it a versatile and powerful tool for data analysis. Thus, we may come up with multiple solutions for a given task.

Footer