![](https://neoshare.net/wp-content/uploads/2021/02/1aB6GSkNqWQcCj35AFIoTkA.jpeg)
For as long as we, as people, have been speaking, we have been singing. Presumably for just as long, we've been getting tunes and melodies stuck in our heads, wondering either how to get them out or how to make them live there forever. Like many, when I was young, the lure of music grabbed me, settled in, and has done nothing but grow ever since. Late night drives with friends, belting out our favorite songs, each word as it passes your lips cementing a memory that will rush back any time you hear that simple tune for the rest of your life. Smiling and standing in a crowd of strangers, all connected to someone on a stage they've likely never met, making a noise that they enjoy. The power of music is fascinating, often just as much as the music itself is mesmerizing. But for every song that brings back memories of a first love, friends from a time gone by, or achievements and life milestones that we may cherish, everyone has also been the victim of a radio playing a song until it became irritating. Later, you find out that the song you could not bear to ever choose to listen to is one of the 40 most popular songs in the entire country, and you wonder to yourself, "How could everybody have bad taste but me?"

Maybe that's a little more melodramatic than the average response, but the question of whether a song's popularity can be broken down, classified, and understood at a "building blocks" level has captivated more than just me, I'm sure. So when given an opportunity to examine a topic of my choosing, it will come as no surprise to anyone who has ever seen my music room, filled wall to wall with albums, memorabilia, and instruments, that I would, again, be digging into something musical. The discovery of a dataset detailing Spotify songs and streaming data was seemingly made just for this, and I had to know, once and for all: can you predict the popularity of a song if you know what it's made of?
To begin with, the dataset was particularly clean, requiring only slight modifications. The most important of these concerned the target, popularity, which was stored as an integer between 1 and 100, an incredibly broad range. To get an accurate view of whether the model could predict at least somewhat close to the actual popularity, I broke those values down into 5 categories, based upon the existing values, in intervals of 20. Also, since Spotify is driven by customer searches, new features for the length of the name of both the song and the performer were created.
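A minimal sketch of that feature engineering looks something like the following. The column names (`popularity`, `song_name`, `performer`) and the file name are assumptions for illustration; the actual dataset may label them differently.

```python
import pandas as pd

# Hypothetical file and column names -- adjust to the actual dataset.
df = pd.read_csv('spotify_songs.csv')

# Bin the 1-100 popularity score into 5 classes in intervals of 20.
df['popularity_class'] = pd.cut(
    df['popularity'],
    bins=[0, 20, 40, 60, 80, 100],
    labels=[0, 1, 2, 3, 4],
).astype(int)

# Spotify is search-driven, so the length of each name may matter.
df['song_name_length'] = df['song_name'].str.len()
df['performer_name_length'] = df['performer'].str.len()
```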
After cleaning up our dataset and splitting things apart, it comes time to establish our baseline, and we get our first serious indicator of how unbalanced the dataset is. While unfortunate, that doesn't mean we can't learn anything from it! With a baseline accuracy of about 38.5%, it shouldn't be too hard to improve on that, but we clearly have our work cut out for us.
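Here is a sketch of the split and the majority-class baseline, continuing from the frame above; the exact split proportions in the original notebook may differ.

```python
from sklearn.model_selection import train_test_split

target = 'popularity_class'
X = df.drop(columns=[target, 'popularity'])  # drop the raw score to avoid leakage
y = df[target]

# Hold out a test set, then carve a validation set out of the remainder.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42)

# Baseline: always guess the majority class (~38.5% of observations here).
baseline_accuracy = y_train.value_counts(normalize=True).max()
print(f'Baseline accuracy: {baseline_accuracy:.3f}')
```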
For any given task, there's a multitude of models one could choose to use. Here we look at a few and why they would, or would not, work:
Logistic Regression
Utilizing Ordinal Encoding, a Simple Imputer and Logistic Regression, we build this model, but see that it isn't really suited to this type of problem. Since our target is not binary in its classification, this isn't the most appropriate use of such a model. It can't hurt to run it and see exactly why it doesn't work, though; we'll have other, more fitting models to illustrate what does work well.
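A minimal sketch of that pipeline, assuming the `category_encoders` implementation of `OrdinalEncoder` (which accepts DataFrames directly); the original notebook's exact setup may vary:

```python
from category_encoders import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Encode categorical columns as integers, fill missing values,
# then fit the logistic regression.
model_lr = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    LogisticRegression(max_iter=1000),
)
model_lr.fit(X_train, y_train)
print('Training accuracy:  ', model_lr.score(X_train, y_train))
print('Validation accuracy:', model_lr.score(X_val, y_val))
```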
Random Forest Classifier
A better choice than logistic regression, and what we will be focusing on for the bulk of our analysis, this model works fairly well on its own, but continues to improve after tuning the hyperparameters. This was done in this particular exercise using a RandomizedSearchCV, which can be seen in more detail in the notebook itself and when we compare results.
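A sketch of that tuning step follows; the hyperparameter ranges here are illustrative stand-ins, not the ones from the original notebook.

```python
from category_encoders import OrdinalEncoder
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(random_state=42),
)

# Distributions to sample hyperparameters from (illustrative ranges).
param_distributions = {
    'randomforestclassifier__n_estimators': randint(100, 500),
    'randomforestclassifier__max_depth': [10, 20, 30, None],
    'randomforestclassifier__max_features': ['sqrt', 'log2', None],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=20,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train, y_train)
print('Best hyperparameters:', search.best_params_)
print('Best CV accuracy:    ', search.best_score_)
```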
Gradient Boosting Classifier
Similar to the Random Forest Classifier, as both utilize decision trees, the key difference is that Gradient Boosting trains its trees in sequence, rather than in parallel as the Random Forest Classifier does. While this is computationally a bit more taxing and requires slightly more time, it can in many cases offer a more accurate model. We will examine in a moment the margins it offers on this particular exercise.
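Swapping in scikit-learn's gradient booster is a one-line change to the same pipeline; again, a sketch with assumed settings rather than the notebook's exact configuration:

```python
from category_encoders import OrdinalEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Each new tree is fit to the errors of the ensemble so far,
# which is why training is sequential rather than parallel.
model_gb = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    GradientBoostingClassifier(n_estimators=200, random_state=42),
)
model_gb.fit(X_train, y_train)
print('Training accuracy:  ', model_gb.score(X_train, y_train))
print('Validation accuracy:', model_gb.score(X_val, y_val))
```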
XGBoost Classifier
Finally, this particular model is similar enough to the Gradient Booster that it may seem like overkill to even examine it, but my own curiosity got the best of me. While seemingly more popular due to its enhanced training speed, the inquisitive part of my mind was hoping to see whether that speed came at the expense of accuracy.
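XGBoost ships a scikit-learn compatible wrapper, so it drops into the same pipeline; a sketch, assuming the class labels are already encoded 0 through 4 as XGBoost expects:

```python
from category_encoders import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

# Same preprocessing, different booster; labels 0-4 from the binning
# step satisfy XGBoost's 0..n-1 class-label requirement.
model_xgb = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    XGBClassifier(n_estimators=200, n_jobs=-1, random_state=42),
)
model_xgb.fit(X_train, y_train)
print('Training accuracy:  ', model_xgb.score(X_train, y_train))
print('Validation accuracy:', model_xgb.score(X_val, y_val))
```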
Here is the meat of the exercise, where we can start to see how our models performed.
Logistic Regression
As discussed above, this dataset is not conducive to using a Logistic Regression model. The model is geared more towards binary classification tasks, which this exercise is not, and with our training and validation accuracy only negligibly better than the baseline, we can see that this is not an effective model here.
Gradient Boosting
Again, as discussed above, our dataset is imbalanced; however, these results provide some insight into how the models themselves work. We see a steady increase in both training and validation accuracy over the baseline, roughly a 28% improvement! But how does Gradient Boosting stack up against…
XGBoost Classifier
Clearly, there is very little difference between XGBoost and Gradient Boosting: accuracy is lower by a negligible amount, but the model is much more computationally and time efficient. While these factors may not matter much for myself or others learning on their computers at home, they can become critical at high levels with extensive data, and there is a lot of value in learning these inner workings at a smaller, more manageable level.
Random Forest (RandomizedSearchCV)
Finally, we have the results from our Random Forest model, passed through a RandomizedSearchCV for the best hyperparameters. Not only did this particular model perform the best overall, but after shuffling the data around with multiple random splits, it was also the most consistent. Keep in mind that while the other results shown were on training and validation data, this result was on the test data. With an improvement of almost 30% over the baseline (~28.71%), this model is clearly working the best.
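Scoring that final model is a one-liner once the search has run; a minimal sketch, reusing the fitted `search` object from the tuning step above:

```python
# Score the tuned Random Forest pipeline on the held-out test set.
best_model = search.best_estimator_
print('Test accuracy:', best_model.score(X_test, y_test))
```

We'll discuss more about potential limitations to this dataset and potential solutions moving forward in a moment, but first, let's take a look at how the features are affecting these predictions.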
This graphic shows how a random observation is broken down, in terms of how each feature affects the value of the prediction. In red, you can see features that push the value of the prediction higher, while in blue are features that push the value of the prediction lower. What we see is that the largest factors (though there are still many at play) are the values found in 'Release Year' and 'Song Name Length'. We can, however, zoom in even further and examine each of those features individually.
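A plot like this reads as a force plot from the `shap` library; here is a minimal sketch of how one could be generated for the tuned pipeline above (reusing `best_model` and `X_val` from the earlier snippets; the original notebook may have produced its plot differently):

```python
import shap

# Separate the preprocessing steps from the fitted forest itself.
rf = best_model.named_steps['randomforestclassifier']
preprocessing = best_model[:-1]

# Encode one random observation exactly as the model saw it.
row = preprocessing.transform(X_val.sample(1, random_state=42))

# For a multiclass forest, TreeExplainer returns one array of SHAP
# values per class; plot the forces for the predicted class.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(row)
predicted = int(rf.predict(row)[0])

shap.force_plot(
    explainer.expected_value[predicted],
    shap_values[predicted][0],
    feature_names=list(X_val.columns),
    matplotlib=True,
)
```

Let's start with 'Release Year'.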
As we can see by isolating the feature, as the year increases, so too does its effect on the prediction, up until right around 2001, when it begins to decline again. An important facet to keep in mind is that this data is merely for Spotify, not the popularity of the songs overall. We'll talk more about this in a moment, after we analyze these features a little more.
Here we can notice that the longer a song's name, the less popular it seems to become, up to a point. Right around the 40 character mark it levels out, but there is a noticeable decrease up until then.
This interaction plot shows the same information as the last two line charts, but in a different way and a bit more inclusively, as both features are used together. We can definitely see that songs from the early 2000s with short names are the sweet spot for popularity.
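Curves like the three above can be reproduced with partial dependence plots. A minimal sketch using scikit-learn's inspection module, assuming the engineered columns are named `release_year` and `song_name_length`:

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Two one-way plots plus the two-way interaction; `target` selects
# which popularity class (here the top bin) the dependence is
# computed for, since a multiclass model has one curve per class.
PartialDependenceDisplay.from_estimator(
    best_model,
    X_val,
    features=['release_year', 'song_name_length',
              ('release_year', 'song_name_length')],
    target=4,
)
plt.show()
```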
Unfortunately, it's hard to really consider this undertaking a success in terms of achieving our goal of predicting popularity. The dataset itself, while a wealth of interesting information, is limited by nature. While I would like to attempt this again, perhaps using synthetic data to smooth out the imbalance, it's important to note as well that the features of the 'popular' observations are incredibly similar to those of the 'unpopular' data. We notice that the most important feature, in terms of predictive power, is the release year, as the data is skewed towards more recent songs. This is likely a result of using Spotify, a service that debuted in 2008, to measure popularity rather than, say, historical radio airplay. There are notable exceptions on the radio airplay side as well: in 1994, the heavy metal band Pantera released 'Far Beyond Driven' with no songs on the radio, yet it still achieved #1 on the album sales charts and was certified platinum (sales in excess of 1 million units). This is important to note because each song from that album (and others like it) is included in this dataset, but offers little historical insight. Spotify users also tend to trend towards a specific age group (between 18–29 years old) living in the United States. Music being something that transcends cultures, age groups, and any other personal boundary one may find themselves within, perhaps this exercise was doomed from the start. The lessons we can learn from this exercise, however, make it hard to call it a failure; but perhaps the mystery of what gives a song that particular 'X factor' will elude us for just one more day.