We chose the lyrics subset of the Million Song Dataset (MSD), which consists of 233,662 songs with lyrics. Using the Spotify API, we collected the top search results for playlists matching phrases such as 'Summer', 'Love', 'Party', 'Breakup', 'Rock', 'Country', 'Romance', 'Rap', 'Metal', 'Hip-Hop', 'Latin', 'Blues', 'Soul', 'Classic', 'Pop', 'Jazz', 'Folk', and 'R&B', which have a significant song overlap with the MSD. In total, we obtained 169 playlists amounting to 11,159 unique songs.
We extracted 14 audio features for each song using the Spotify API, as shown below:
The class distribution of playlists is shown below:
We also performed topic modelling on the lyrics using Latent Dirichlet Allocation (LDA) and extracted 20 additional features, as described in the subsection below.
2.1 Preprocessing and Feature Extraction
The lyrics obtained from the Million Song Dataset were already partially preprocessed: the words had been stemmed and the lyrics converted to bag-of-words format. We further removed stopwords and words shorter than 3 characters to obtain better topics.
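The stopword and short-word filtering can be sketched as follows. The stopword list here is a tiny stand-in; in practice a full list such as NLTK's English stopwords would be used:

```python
# Hypothetical minimal stopword list; a real pipeline would use e.g. NLTK's.
STOPWORDS = {"the", "and", "is", "a", "of", "to", "in"}

def clean_tokens(tokens):
    """Drop stopwords and words shorter than 3 characters."""
    return [t for t in tokens if t not in STOPWORDS and len(t) >= 3]

lyrics = "the love of the night is a fire in my heart".split()
print(clean_tokens(lyrics))  # ['love', 'night', 'fire', 'heart']
```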
2.1.1 Topic Modelling using Latent Dirichlet Allocation (LDA)
We applied topic modelling with Latent Dirichlet Allocation (LDA) to the preprocessed lyrics and extracted the probability of each topic in every song. LDA is a generative statistical model that infers a fixed number of unobserved topics, which helps in analyzing similarity between playlists. We applied LDA to the lyrical corpus with different values for the number of topics, ranging from 3 to 30, and evaluated the resulting topics using the C_v coherence score. A model with 20 topics gave the best C_v score of 0.58, and we extracted the per-topic probabilities of each song from that model.
20 new columns were added to our dataset containing the topic-wise probabilities for each song.
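As a sketch of this step, scikit-learn's `LatentDirichletAllocation` can produce the per-song topic probabilities (the C_v coherence used for model selection is typically computed with gensim, which scikit-learn does not provide). The corpus below is an invented stand-in, and 2 topics replace the 20 used on the real data:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in corpus; the real input is the stemmed MSD bag-of-words lyrics.
docs = [
    "love heart night love",
    "death blood fight death",
    "party dance night dance",
    "love night heart",
]

# Lyrics are already bag-of-words, so CountVectorizer mirrors that format.
X = CountVectorizer().fit_transform(docs)

# 20 topics were used in our experiments; 2 suffice for this toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # shape: (n_docs, n_topics)

# Each row is a probability distribution over topics; these rows become
# the new per-song feature columns appended to the dataset.
print(doc_topic.shape)        # (4, 2)
print(doc_topic.sum(axis=1))  # each row sums to ~1
```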
A small visualization of the LDA topics is shown below:
- The demo is an interactive visualisation of the topics obtained from topic modelling. The diagram denotes the importance of each topic, represented by the size of its bubble, and also shows the 30 most salient terms in the overall lyrics corpus. Topics that are close together in the 2D space of the visualisation have words that are usually closely associated in real-life songs.
- For example, in our visualisation the topics associated with darker genres are close to each other. Topics 7 and 14 concern satanic evil and power (dark metal songs), while topic 9, which lies close to them in this space, is about killing, death, blood, and fighting (another dark topic). Topic 6 is about love and lies far away from these three. Topics 12 and 18 correspond to non-English songs and are far from the English topics.
- Clicking on any topic shows a ranked list of the words in that topic. The interactive visualisation also has a slider that adjusts a parameter lambda. When lambda is closer to zero, words are ranked by how exclusive they are to the given topic (their lift); when lambda is closer to one, words are ranked by how probable they are to appear within the given topic.
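The ranking behind the slider can be made concrete. Assuming a pyLDAvis-style visualisation, the relevance of a word w in topic t is λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)); a minimal sketch with invented word statistics:

```python
import numpy as np

def relevance(p_w_given_t, p_w, lam):
    """pyLDAvis-style relevance: lambda blends within-topic
    probability (lam = 1) with lift/exclusivity (lam = 0)."""
    return lam * np.log(p_w_given_t) + (1 - lam) * np.log(p_w_given_t / p_w)

# Hypothetical stats: a corpus-wide common word vs. a topic-exclusive word.
common = relevance(p_w_given_t=0.05, p_w=0.05, lam=0.0)      # lift = 1
exclusive = relevance(p_w_given_t=0.01, p_w=0.001, lam=0.0)  # lift = 10

# At lambda = 0 the exclusive word outranks the common one.
print(common, exclusive)
```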
2.1.2 Dimensionality Reduction using Principal Component Analysis (PCA)
Principal Component Analysis reduces the dimensionality of a dataset by transforming a large set of variables into a smaller set that still contains most of the information in the original data. The main steps in PCA are standardization, computation of the covariance matrix, and computation of its eigenvectors to identify the principal components. PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of the ellipsoid corresponds to one principal component. For our project, we work with 30 principal components.
The figure below shows the explained variance of the principal components.
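The steps above can be sketched with scikit-learn. The input here is random stand-in data with 34 features (an assumption based on the 14 audio plus 20 topic features), and 30 components are kept as in our setup:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 34))  # stand-in for the per-song feature matrix

# Standardize first, since PCA is sensitive to feature scale.
X_std = StandardScaler().fit_transform(X)

# Keep 30 principal components, as in our project.
pca = PCA(n_components=30)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                          # (200, 30)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```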
2.1.3 Visualizing High Dimensional Data using t-SNE
t-Distributed Stochastic Neighbour Embedding (t-SNE) is an unsupervised, non-linear technique primarily used for data exploration and for visualizing high-dimensional data. Unlike PCA, t-SNE is not a linear projection: it uses the local relationships between points to create a low-dimensional mapping, which allows it to capture non-linear structure. t-SNE builds a probability distribution, using the Gaussian distribution, that defines the relationships between the points in the high-dimensional space.
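A minimal sketch of producing a 2-D t-SNE embedding with scikit-learn. The input is random data standing in for the per-song features, and the perplexity value (the effective neighbourhood size used for the Gaussian similarities) is an assumed setting:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))  # e.g. the 30 PCA components per song

# Map to 2 dimensions for plotting; perplexity must be < n_samples.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (100, 2)
```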
2.1.4 Data Standardization
Data standardization is a scaling technique in which values are centered around the mean with unit standard deviation: the mean of the attribute becomes zero and the resulting distribution has a standard deviation of one. The formula for standardization is:

x' = (x − x̄) / σ

where x̄ is the mean of the feature values and σ is their standard deviation.
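Standardization maps directly to code; a minimal NumPy sketch (the sample values are invented):

```python
import numpy as np

def standardize(x):
    """(x - mean) / std: zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std()

x = np.array([2.0, 4.0, 6.0, 8.0])
z = standardize(x)
print(z.mean())  # ~0.0
print(z.std())   # ~1.0
```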