In Part I of the analysis, we examined how a video's view count relates to its number of likes, dislikes, and comments, and fitted two linear models based on the findings. However, those models do not tell the whole story. We therefore turn next to categorical variables such as tags, channel, and category. Lastly, we look at when the videos were published.
YouTube tags are words and phrases used to give YouTube context about a video. They are an important ranking factor in YouTube’s search algorithm. The word cloud below shows that content creators repeatedly used tags like “funny” and “comedy”, suggesting that entertainment videos dominate the trending list. This makes sense, as videos from this category represent 24.33% of the dataset.
Next, we expand the dataset with one logical column for each of the 500 most frequently used tags. Each column is “TRUE” if the tag is among the video's tags and “FALSE” otherwise. Then, we fit a Random Forest model with the formula below:
Fitted with 500 trees and using the selected categorical variables as predictors, the Random Forest model reaches an R-squared value of 0.76, slightly worse than the two linear models. However, the variable importance chart reveals that certain channels tend to get more views, probably because of a large subscriber base. The chart further confirms the popularity of entertainment and music videos.
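To make the tag-expansion step that feeds this model concrete, here is a minimal Python sketch. It is not the code behind the analysis: the function name `add_tag_indicators`, the `tags` key, and the assumption that each video's tags are already split into a list are all hypothetical choices made for illustration.

```python
from collections import Counter


def add_tag_indicators(videos, top_n=500):
    """Build one True/False indicator per frequent tag for every video.

    `videos` is assumed to be a list of dicts with a "tags" list (hypothetical layout).
    Returns the list of the `top_n` most frequent tags and one indicator row per video.
    """
    # Count each tag at most once per video so repeated tags do not inflate the ranking.
    counts = Counter(tag for video in videos for tag in set(video["tags"]))
    top_tags = [tag for tag, _ in counts.most_common(top_n)]

    rows = []
    for video in videos:
        video_tags = set(video["tags"])
        rows.append({tag: tag in video_tags for tag in top_tags})
    return top_tags, rows


# Toy data, invented for the example.
sample = [
    {"tags": ["funny", "comedy"]},
    {"tags": ["funny", "music"]},
    {"tags": ["funny", "comedy", "news"]},
]
top_tags, rows = add_tag_indicators(sample, top_n=2)
```

With `top_n=500` on the full dataset, the resulting rows would correspond to the 500 logical columns used as predictors.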
Next, we look at when the trending videos were published to see whether a particular time of day, or even a particular hour, is preferable for getting more views. After splitting the day into four categories, we see that most videos are published in the afternoon, i.e. between 12:01 pm and 6 pm UTC.
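The four-way split of the day can be sketched as below. Only the afternoon window (12:01 pm to 6 pm UTC) comes from the text. The other three boundaries and the labels "night", "morning", and "evening" are assumptions chosen to mirror it, and the function name `part_of_day` is hypothetical.

```python
from datetime import datetime


def part_of_day(published_at):
    """Map a UTC publish time to one of four buckets, at minute resolution.

    Only the afternoon window is taken from the article; the rest is assumed.
    """
    minutes = published_at.hour * 60 + published_at.minute
    if 0 < minutes <= 6 * 60:
        return "night"      # assumed: 12:01 am to 6 am
    if 6 * 60 < minutes <= 12 * 60:
        return "morning"    # assumed: 6:01 am to 12 pm
    if 12 * 60 < minutes <= 18 * 60:
        return "afternoon"  # from the article: 12:01 pm to 6 pm
    return "evening"        # assumed: 6:01 pm to midnight, inclusive


bucket = part_of_day(datetime(2018, 1, 1, 15, 30))
```

Counting the bucket labels over all videos would then give the distribution described above.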
Getting more granular by looking at the number of videos published by the hour, we observe that the time period between 3 pm and 5 pm UTC is particularly popular for publishing YouTube videos that end up trending.
Looking at the most popular publishing hour by category reveals that music videos are most often uploaded to YouTube at 5 am UTC.
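One way to find the most popular publishing hour per category is to count publish hours within each category and keep the most frequent one. The sketch below assumes a hypothetical record layout with `category` and `published_at` keys, and the sample rows are invented for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


def most_popular_hour_by_category(videos):
    """Return {category: most frequent UTC publish hour} for the given videos."""
    hours = defaultdict(Counter)
    for video in videos:
        hours[video["category"]][video["published_at"].hour] += 1
    # most_common(1) returns a one-element list holding the (hour, count) pair with the highest count.
    return {category: counter.most_common(1)[0][0] for category, counter in hours.items()}


# Toy data, invented for the example.
sample = [
    {"category": "Music", "published_at": datetime(2018, 1, 1, 5, 0)},
    {"category": "Music", "published_at": datetime(2018, 1, 2, 5, 10)},
    {"category": "Music", "published_at": datetime(2018, 1, 3, 16, 0)},
    {"category": "Sports", "published_at": datetime(2018, 1, 1, 16, 30)},
]
popular_hours = most_popular_hour_by_category(sample)
```

If two hours tie within a category, this sketch keeps whichever was seen first, so a real analysis may want an explicit tie-breaking rule.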
We also split the channels into four categories based on the aggregate number of views on their videos to see if the channel size plays a role in determining the time to publish a video. The chart below confirms it does not.
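The text does not say how the four channel-size groups were defined. As one possible reading, the sketch below sums views per channel and splits the channels at the quartiles of those totals. The quartile rule, the group labels, and the `channel` and `views` keys are all assumptions.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import quantiles


def channel_size_groups(videos):
    """Assign each channel to one of four groups by its aggregate view count."""
    totals = defaultdict(int)
    for video in videos:
        totals[video["channel"]] += video["views"]

    # Three cut points split the channel totals into four quartile groups (assumed rule).
    cuts = quantiles(totals.values(), n=4)
    labels = ["small", "medium", "large", "very large"]  # hypothetical labels
    return {channel: labels[bisect_left(cuts, total)] for channel, total in totals.items()}


# Toy data, invented for the example.
sample = [
    {"channel": "a", "views": 4},
    {"channel": "a", "views": 6},
    {"channel": "b", "views": 20},
    {"channel": "c", "views": 30},
    {"channel": "d", "views": 40},
]
groups = channel_size_groups(sample)
```

Publish times can then be compared across the four groups, for example by reusing a time-of-day bucket within each group.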
Finally, we fit another Random Forest model based on the observations with the following formula:
The aggregate Random Forest model outperforms all the others, with an R-squared value of 0.98. The variable importance chart confirms that user engagement is important for increasing a video's view count. Additionally, entertainment, music, and sports videos are likely to get more views. Lastly, even though most videos are published in the afternoon, those published at night seem to perform better.
The model comparison table summarizes the performance of the four models.
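For reference, the R-squared values used to compare the models measure the share of variance in the view count that a model explains. The sketch below shows the standard definition in plain Python. How the reported values were actually computed (for example, on training data or on held-out data) is not stated in the text, so this is only an illustration, and the function name `r_squared` is hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes `actual` is not constant, otherwise SS_tot is zero.
    """
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


perfect = r_squared([1, 2, 3, 4], [1, 2, 3, 4])   # predictions match exactly
baseline = r_squared([1, 2, 3], [2, 2, 2])        # always predicting the mean
```

A model that always predicts the mean scores 0 and a perfect model scores 1, which is the scale on which 0.76 and 0.98 should be read.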
In summary, the analysis shows that, up to a certain point, even dislikes help increase a video's view count. As expected, more likes are associated with more views. A rising comment count, by contrast, eventually stops contributing to the view count. Furthermore, some videos do exceptionally well, and those are often entertainment and music videos. Although the afternoon is the most popular time of day to publish, videos uploaded overnight are more important for predicting the view count. A next step could be to narrow the focus to the outperforming videos and explore the driving forces behind their performance.
Link to the dataset: YouTube trending videos dataset
Link to the code: YouTube trending videos analysis