- Introduction
- Model Comparison
- Summary
- References
Because Spotify and other music streaming services are incredibly popular and widely used, I wanted to apply Data Science techniques and Machine Learning algorithms to this product to predict song popularity. I personally use this product, and what I apply here could be applied to other services as well. I will be examining nearly every popular Machine Learning algorithm and picking the best one based on success metrics or criteria, which often come down to some form of calculated error. The goal of the final model is to predict a song's popularity from its current and historical features. Keep reading if you would like a tutorial on how to use Data Science to predict the popularity of a song.
I will be discussing the Python library that I used, along with the data, parameters, models compared, results, and code below.
Using the power of PyCaret [3], you can now test every popular Machine Learning algorithm against one another (or most of them, at least). For this problem, I will be comparing MAE, MSE, RMSE, R2, RMSLE, MAPE, and TT (Sec), which is the time it takes for the model to finish training. Some of the benefits of using PyCaret, as stated by the developers, are increased productivity, ease of use, and business readiness, all of which I can personally attest to.
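If you want to follow along, PyCaret is installed from PyPI with the standard install command below, run once in your environment (a fresh virtual environment is my suggestion, not a requirement from the article):
# install PyCaret (run in your terminal, or prefix with ! in a notebook)
pip install pycaret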
The dataset [4] that I am using is from Kaggle, and you can download it quickly and easily. It is about 17MB of Spotify data spanning the years 1921 to 2020 and covering 160,000+ tracks; in total, it consists of 174,389 rows and 19 columns. Below is a screenshot of the first few rows along with the first columns:
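If you want to reproduce that view, a quick pandas check works. This is a minimal sketch, and the file name spotify_data.csv is a placeholder for wherever you saved the Kaggle download:
# read the Kaggle CSV and confirm its size
# (spotify_data.csv is a placeholder path)
import pandas as pd

spotify = pd.read_csv('spotify_data.csv')
print(spotify.shape)  # should be roughly (174389, 19)
print(spotify.head())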
After we eventually pick the best model, we can look at the most important features. I am using the interpret_model() function of PyCaret, which is based on the popular SHAP library (see the sketch after this list for how to call it). Here are all of the possible columns, which double as the features:
['acousticness',
'artists',
'danceability',
'duration_ms',
'energy',
'explicit',
'id',
'instrumentalness',
'key',
'liveness',
'loudness',
'mode',
'name',
'popularity',
'release_date',
'speechiness',
'tempo',
'valence',
'year']
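Once you have a trained model, the call is a one-liner. This is only a sketch: it assumes a model named catboost created with create_model('catboost') later in the article, and uses interpret_model()'s plot and observation arguments:
# global feature importance: SHAP summary plot (tree-based models)
# (assumes catboost = create_model('catboost') has already run)
interpret_model(catboost)

# explain a single prediction with SHAP's 'reason' plot
interpret_model(catboost, plot = 'reason', observation = 1)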
Here are the most important features using SHAP:
All of the columns are used as features, except for the target variable, which is the popularity column. As you can see, the top three features are year, instrumentalness, and loudness. As a future improvement, it would be better to keep each categorical feature in a single column rather than breaking it out into tens of one-hot-encoded columns, and then feed the data into the CatBoost model so that target encoding can be applied instead of one-hot encoding. To perform this action, we would confirm or change the key column (and any other similar columns) to be treated as categorical.
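Here is a minimal sketch of what that override could look like; categorical_features is a real setup() parameter, while the choice of columns to flag is my assumption for this dataset:
# force these columns to be treated as categorical instead of numeric
# (the column list is an assumption; adjust for your own data)
regression = setup(data = spotify_sample,
                   target = 'popularity',
                   session_id = 100,
                   categorical_features = ['key', 'mode', 'explicit'])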
These are the parameters that I used in the setup() of PyCaret. The Machine Learning problem is a regression one, the data comes from Spotify, and the target variable is the popularity field. For reproducibility, you can establish a session_id. There are a ton more parameters, but these are the ones that I used, and PyCaret does a great job of automatically detecting information from your data, like picking which features are categorical, which it will confirm with you in the setup().
I will be comparing 19 Machine Learning algorithms; some are incredibly popular, while others I had actually not heard of, so it will be interesting to see which one wins on this dataset. For the success criteria, I am comparing all of the metrics MAE, MSE, RMSE, R2, RMSLE, MAPE, and TT (Sec), which PyCaret automatically ranks.
Here are all of the models that I compared (a snippet after this list shows how to look up their PyCaret IDs):
- Linear Regression
- Lasso Regression
- Ridge Regression
- Elastic Net
- Orthogonal Matching Pursuit
- Bayesian Ridge
- Gradient Boosting Regressor
- Extreme Gradient Boosting
- Random Forest Regressor
- Decision Tree Regressor
- CatBoost Regressor
- Light Gradient Boosting Machine
- Extra Trees Regressor
- AdaBoost Regressor
- K Neighbors Regressor
- Lasso Least Angle Regression
- Huber Regressor
- Passive Aggressive Regressor
- Least Angle Regression
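Every estimator above has a short ID string in PyCaret (for example, 'catboost', or 'et' for the Extra Trees Regressor). The models() helper in pycaret.regression returns that lookup table; a minimal sketch:
# dataframe of available regressors; the index is the ID used by create_model()
available = models()
print(available['Name'])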
It is important to note that I am just using a sample of the data, so the order of these algorithms may change if you use all of the data when you test this code yourself. I used only 1,000 rows instead of the total ~170,000 rows.
As you can see, CatBoost was ranked first, having the best MSE, RMSE, and R2. However, it did not have the best MAE, RMSLE, or MAPE, and it was not the fastest. Therefore, you should establish what you mean by success in terms of these metrics. For example, if training time is essential, then you will want to rank that higher, or if MAE matters more to you, you might want to pick the Extra Trees Regressor as the winner instead.
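compare_models() sorts by R2 by default, but you can re-rank the leaderboard with its sort argument; here is a quick sketch if MAE is what matters to you:
# rank the leaderboard by MAE instead of the default R2
best_by_mae = compare_models(sort = 'MAE')

# or create the Extra Trees Regressor directly by its ID
et = create_model('et')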
Overall, you can see that even with a small sample of the dataset, we fared pretty well. The popularity target variable has a range of 0 to 91, so with an MAE of 9.7 popularity units, we would be off by a difference of only about 10 on average, which is not too bad out of 91. However, a model trained on just a sample would probably not generalize that well, so if you train on the full dataset you can expect all of the error metrics to decrease significantly (which is good), but unfortunately, you will also see the training time increase dramatically.
One of the neat features of PyCaret is the ability to remove algorithms from your compare_models() training. I would start on a small sample of the dataset, see which algorithms generally take longer, and then exclude those when you compare with all of the original data, since some of these could take hours to train depending on the dataset.
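In code, this is the exclude argument of compare_models(); the IDs below are only an illustration of models that tend to be slower, not the ones the article actually dropped:
# skip estimators that were slow on the sample run (IDs are illustrative)
compare_models(exclude = ['catboost', 'xgboost', 'rf'])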
In the screenshot below, I am printing the dataframe with the predictions and the actual values. For example, we can see that the original popularity is compared side-by-side to the Label column, which is the prediction. You can see that some predictions were better than others: the first two predictions were great, while the last prediction was quite poor.
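You can build the same side-by-side view yourself: predict_model() returns the hold-out rows with a Label column appended, so selecting the two columns is enough. A minimal sketch:
# hold-out predictions: actual popularity next to the predicted Label
predictions = predict_model(catboost)
print(predictions[['popularity', 'Label']].head())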
Here is the Python code that you can try testing yourself, from importing libraries, reading in your data, sampling your data (only if you want), setting up your regression, comparing models, creating your final model, making predictions, and visualizing feature importance [9]:
# import libraries
from pycaret.regression import *
import pandas as pd

# read in your Spotify data
spotify = pd.read_csv('file location of your data on your computer.csv')

# use a sample of the dataset (you can use any amount)
spotify_sample = spotify.sample(1000)

# set up your regression parameters
regression = setup(data = spotify_sample,
                   target = 'popularity',
                   session_id = 100)

# compare models
compare_models()

# create a model
catboost = create_model('catboost')

# predict on the test set
predictions = predict_model(catboost)

# interpret the model
interpret_model(catboost)
Using Data Science models to predict a variable can seem quite overwhelming, but we have seen how, with a few lines of code, we can compare several Machine Learning algorithms efficiently. We have also seen how easy it is to set up different types of data, both numeric and categorical. For the next steps, I would apply this to the entire dataset, confirm data types, and make sure to remove inaccurate models, as well as models that take too long to train.
In summary, we now know how to perform the following to determine song popularity:
- import libraries
- read in data
- setup your model
- compare models
- pick and create the best model
- predict using the best model
- interpret feature importance
I want to give thanks and admiration to Moez Ali for developing this awesome Data Science library.
I hope you found my article both interesting and useful. Please feel free to comment down below if you applied this library to a dataset or if you use other techniques. Do you prefer one over the other? What do you think about automatic Data Science?
I am not affiliated with any of these companies.
Please feel free to check out my profile and other articles, as well as reach out to me on LinkedIn.
[1] Photo by Cezar Sampaio on Unsplash, (2020)
[2] Photo by Markus Spiske on Unsplash, (2020)
[3] Moez Ali, PyCaret, (2021)
[4] Yamac Eren Ay on Kaggle, Spotify Dataset, (2021)
[5] M. Przybyla, Dataframe Screenshot, (2021)
[6] M. Przybyla, SHAP Feature Importance Screenshot, (2021)
[7] M. Przybyla, Model Comparison Screenshot, (2021)
[8] M. Przybyla, Predictions Screenshot, (2021)
[9] M. Przybyla, Python Code, (2021)
[10] Photo by bruce mars on Unsplash, (2018)