data:image/s3,"s3://crabby-images/28f5a/28f5a96d653a227da8bdce11e86c3cfaf7882a3c" alt=""
data:image/s3,"s3://crabby-images/e3704/e3704bdc8b9184103ffebcb3ed4bf4e917dabbb7" alt="Soner Yıldırım"
How SQL can be used as a data analysis tool
SQL is a language used for managing data in relational databases that store data in tabular form with labelled rows and columns. We query data from a relational database with the select statement of SQL. The select statement is highly versatile and flexible in terms of data transformation and filtering operations.
In that sense, SQL can be considered as a data analysis tool. The advantage of using SQL for data transformation and filtering is that we only retrieve the data we need. It is more practical and efficient than retrieving all the data and then applying these operations.
In this article, we will use SQL statements and functions to analyze YouTube trending video statistics. The dataset is available on Kaggle. I created an SQL table that contains a small part of this dataset.
Note: I’m using MySQL as the database management system. Although SQL syntax is mostly same for all database management systems, there might be small differences.
The table is called “trending” and it has the following structure.
We have the dates a video is published and becomes trending. We also have the title and channel of the video. The views and likes are the other two features the dataset contains.
Regarding all these features (i.e. columns) we can do a bunch of different operations. For instance, a simple one can be finding the top 5 channels in terms of the number of trending videos.
mysql> select channel_title, count(*) as number_of_videos
-> from trending
-> group by channel_title
-> order by number_of_videos desc
-> limit 5;+-----------------+------------------+
| channel_title | number_of_videos |
+-----------------+------------------+
| Washington Post | 28 |
| Netflix | 28 |
| ESPN | 27 |
| TED-Ed | 27 |
| CNN | 27 |
+-----------------+------------------+
We select the channel title column and count the number of rows. The “as” keyword is used to assign a new name to the aggregated columns. The group by clause is used to group the videos (i.e. rows) based on channels. Finally, we sort the results in descending order using the order by clause and display the first 5.
The number of videos seems to be too low because I only included the ones published in January, 2018.