Data science has grown tremendously in recent years, driven by advances in data collection, storage, and processing.
The potential to create value from data has attracted many industries. More and more businesses have adopted data-centric strategies and processes in their operations.
The ever-growing demand has also motivated developers and the open-source community to create new tools for data science. As a result, people who work in the field of data science have many libraries, frameworks, and tools at their disposal.
Some of these tools are designed to perform the same tasks, just in different programming languages. Some are more efficient than others. Some focus on one particular task. The undeniable truth is that we have many tools to choose from.
You may argue that it is better to stick to one tool for a particular task. I, however, prefer to have at least a couple of options and to be able to compare tools directly.
In this article, I will explain how I learn new tools. My strategy is based on comparison: I focus on how a given task can be accomplished with different tools.
This way, I clearly see the differences as well as the similarities between them. Furthermore, it helps me build an intuition for how the creators of these tools approach particular problems.
Let’s say I’m comfortable with the Pandas library in Python and want to learn the dplyr library in R. I try to perform the same tasks with both libraries.
Consider the following dataset about a marketing campaign.
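The original dataset is not reproduced here, so the following is a minimal stand-in built with Pandas. Only the column names AmountSpent and Salary come from the article's code; the Customer column and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the marketing campaign dataset.
# Only the AmountSpent and Salary column names are taken from the article;
# the Customer column and the values themselves are invented.
subset = pd.DataFrame({
    'Customer': [1, 2, 3, 4],
    'AmountSpent': [755, 1318, 296, 2436],
    'Salary': [47500, 63600, 13500, 85600],
})
print(subset)
```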
I would like to create a new column that contains the ratio of the amount spent to the salary. Here is how it can be done with Pandas and with dplyr.
# Pandas
subset['spent_ratio'] = subset['AmountSpent'] / subset['Salary']

# dplyr
mutate(subset, spent_ratio = AmountSpent / Salary)
Let’s do another example that compares Pandas and SQL. Consider a dataset that contains groceries and their prices.
We want to calculate the average item price for each store. This task can be accomplished with both Pandas and SQL as follows.
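The groceries dataset is also not reproduced here, so below is a hypothetical items table built with Pandas. The store_id and price column names come from the article's code; the item names and prices are made up, chosen so that each store's average matches the results shown below.

```python
import pandas as pd

# Hypothetical stand-in for the groceries dataset.
# Column names store_id and price are from the article; the rows are
# invented, with prices picked so the per-store averages match the
# output shown in the article (1.833333, 3.82, 3.65).
items = pd.DataFrame({
    'store_id': [1, 1, 1, 2, 2, 3, 3],
    'item': ['milk', 'bread', 'eggs', 'cheese', 'coffee', 'butter', 'tea'],
    'price': [1.00, 2.00, 2.50, 3.64, 4.00, 3.50, 3.80],
})
print(items)
```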
# Pandas
items[['store_id', 'price']].groupby('store_id').mean()

             price
store_id
1         1.833333
2         3.820000
3         3.650000

# SQL
mysql> select store_id, avg(price)
    -> from items
    -> group by store_id;

+----------+------------+
| store_id | avg(price) |
+----------+------------+
|        1 |   1.833333 |
|        2 |   3.820000 |
|        3 |   3.650000 |
+----------+------------+