Let’s suppose, for a particular branch, we need to calculate the highest difference in total sales between two consecutive days. It might help us understand the factors that increase the sales. We can also find out if there is an extraordinary situation.
It is better to design our solution before writing the code. We can approach this tasks as follows:
- Filter the data points (i.e. rows) that belong to the branch of interest
- Select the date and total sales amount columns
- Group the data points by date and calculate the sum
- Sort the results by date
- Calculate the difference between two consecutive dates
- Select the maximum
It seems like a highly complicated operation. However, pandas provides versatile and powerful functions that allow us to tackle down such tasks easily.
Here is our solution:
df[df.branch == 'A'][['date','total']]
Let’s elaborate on the code. The first line filters data points that belong to branch A and selects the date and total columns. The second line groups the rows by date. The output of the group by function is automatically sorted by date because we use the date as the grouping column. If not, we could use the sort_values function after the groupby function.
The third line calculates the total amount for each group (i.e. each day). Then, we select the total column and apply the diff function. It calculates the difference between a row and its previous row. Since the rows are consecutive days, we end up getting the difference between the sales amount of two consecutive days. Finally, we use the max function to see the maximum difference.
We know the highest difference but it is not useful without knowing the date. Thus, we should improve our solution a little.
df_sub = df[df.branch == 'A'][['date','total']]
.groupby('date', as_index=False).sum()df_sub['diff'] = df_sub['total'].diff()df_sub.sort_values(by='diff', ascending=False, inplace=True)
The first operation creates a dataframe that contains the total daily sales of branch A. The second one creates a column with the difference between the totals of two consecutive days. Finally, we sort the rows based on the difference in descending order.
The first row of df_sub shows the date with the highest difference with its previous day.
We can confirm the result by checking the total amounts on 2019–02–16 and 2019–02–17.
It is important to note that this might not be the only solution for this task. Pandas provides various functions and techniques which make it a versatile and powerful tool for data analysis. Thus, we may come up with multiple solutions for a given task.