Visually, you can see patterns where the euphonious playlist typically has low energy, high acousticness, and low valence, whereas the happy playlist has high energy, low acousticness, and higher valence. The other attributes had very mixed graphs with no consistent pattern unique to either playlist. Removing the music attributes that don't show distinct patterns results in a stronger model because there's less noise, making it easier for the model to distinguish between the two playlists.
4. Refining Datasets
Now that I had narrowed my dataset down to these three definitive attributes, I started to create and refine my final dataset. For this, I found that the most efficient option was Google Sheets.
I first imported the .csv files for each playlist into separate tabs of the sheet. I then took the three chosen attributes from each dataset (energy, acousticness, and valence) and merged them into one dataset, removing all other attributes (song names, artists, danceability, etc.). Since the attribute values were originally small decimals, I multiplied each column by 100 to work with whole-number values.
Finally, I added a “Target” column that identifies which playlist each data point belongs to: 1 = euphonious playlist; 2 = happy playlist. The target column serves as the output (playlist), while the attributes energy, acousticness, and valence serve as the inputs of the model.
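For anyone who prefers code over spreadsheets, a rough pandas sketch of the same preparation might look like this. The file names here are placeholders, not what I actually used; the real merging was done in Google Sheets.

import pandas as pd

# Placeholder file names; keep only the three chosen attributes
eupho = pd.read_csv('euphonious.csv')[['energy', 'acousticness', 'valence']].assign(target=1)
happy = pd.read_csv('happy.csv')[['energy', 'acousticness', 'valence']].assign(target=2)

# Stack both playlists into one dataset
happy_eupho = pd.concat([eupho, happy], ignore_index=True)

# Scale the small decimal values up to whole numbers
cols = ['energy', 'acousticness', 'valence']
happy_eupho[cols] = (happy_eupho[cols] * 100).round().astype(int)

happy_eupho.to_csv('happy_eupho.csv', index=False)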
5. Now it’s time to create and train the model!
Essentially, all we are doing is splitting the entire dataset into two subsets (training and testing) and fitting a model to the training set, which will then be used to classify new data points. Training the model means fitting it to the training inputs (attributes) and their corresponding outputs (playlists).
After merging all my data together, I downloaded the sheet as a csv file and uploaded it into Jupyter Notebook.
Normalize (Standardize) the data: Normalizing the data means putting all numeric values in each column on a common scale so that differences between values are not distorted. Fundamentally, we're creating z-scores: each value x becomes z = (x − μ) / σ, where μ and σ are the column's mean and standard deviation, so every standardized column ends up with a mean of 0 and a standard deviation of 1. For example, a value of 80 in a column with mean 50 and standard deviation 15 becomes (80 − 50) / 15 = 2.
import pandas as pd
from sklearn.preprocessing import StandardScaler

happy_eupho = pd.read_csv('happy_eupho.csv')  # the merged dataset

scaler = StandardScaler()
scaler.fit(happy_eupho.drop('target', axis=1))  # fit on the input columns only
scaled_features = scaler.transform(happy_eupho.drop('target', axis=1))
happy_eupho_feat = pd.DataFrame(scaled_features, columns=happy_eupho.columns[:-1])
“happy_eupho” is the name of my csv
Above, I imported StandardScaler from sklearn and created a standard scaler object that will normalize the dataset. I fit the object to the data after dropping the “target” column, because that column is the output. I then standardized (.transform) those same columns and, optionally, created a pandas DataFrame to hold the standardized values.
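To double-check that the scaling worked (a quick sanity check of my own, not from the original notebook), you can verify that each standardized column now has a mean of about 0 and a standard deviation of about 1:

print(happy_eupho_feat.mean().round(2))  # each column should print ~0.0
print(happy_eupho_feat.std().round(2))   # each column should print ~1.0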
Splitting data into training and testing sets: The KNN algorithm needs two sets: a training set and a testing set. Both come from the dataset itself; one share of the data is used to train the model, and the rest is held back to test the model's accuracy.
from sklearn.model_selection import train_test_split

X = happy_eupho_feat
y = happy_eupho['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=30, shuffle=True)
To actually split your dataset, you need to import train_test_split from sklearn and create the variables that represent your input (X) and output (y). Since we want to know which playlist is the best fit for certain songs, our playlist (“target” column) will be the output (y), and the input (X) will be the music attributes/features (in my case: energy, acousticness, and valence). The random state seeds the random number generator; it doesn't change the algorithm's behavior, it only makes the split reproducible, so it can be any value. Shuffle makes sure the split is randomized so each subset is representative of the entire dataset. The test size is the fraction of the dataset put toward the testing set, which I chose as 0.3, meaning 70% of my dataset is going to be used for training.
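A quick shape check (my addition, not in the original notebook) confirms the split proportions:

# row counts should be roughly 70% / 30% of the dataset
print(X_train.shape, X_test.shape)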
Training the model: Now that we have 70% of the dataset set aside for training, it's time to finally use KNeighborsClassifier from sklearn to train the model.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7, p=2, metric='euclidean')
knn.fit(X_train, y_train)
A standard way of choosing the K value is to take the square root of n (the number of data points in the entire dataset). In my case, K = 7 (7 nearest neighbors). We then fit the knn object to X_train and y_train (the inputs and outputs that make up the training set).
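If you want to compute that rule of thumb rather than eyeball it, here's a small sketch. The odd-number bump is my own addition; it prevents ties in a two-class vote:

import math

k = round(math.sqrt(len(happy_eupho)))  # square root of the dataset size
if k % 2 == 0:
    k += 1  # keep K odd so the two playlists can't tie
print(k)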
Testing set and evaluating the predictions: The model is now trained; it's time to put it to the test and check its accuracy using the Classification Report.
prediction = knn.predict(X_test)
prediction

from sklearn.metrics import classification_report

print(classification_report(y_test, prediction))
The prediction array holds the trained model's output for every song in the testing set: the playlist it predicts each song should belong to. To see how accurate the model is, I'm going to use the Classification Report, which gives us the Precision, Recall, and F1 values (in-depth explanation here) by comparing the model's predictions against the true labels of the testing set. Right now, I will just focus on the average accuracy value, which is the percentage of predictions that were correct.
On the first try, the average accuracy was 0.82, which is alright but could be better. This means that 82 percent of the model's predictions matched the songs' actual playlists.
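If you only want that single number without the full report, sklearn's accuracy_score gives it directly:

from sklearn.metrics import accuracy_score

# accuracy = correct predictions / total predictions
print(accuracy_score(y_test, prediction))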
Since I wasn't satisfied with this accuracy rate, I decided to vary the k-value to see which one returns better accuracy. With help from this article, I learned how to make an error chart, which graphs the error rate for every k-value from 1 to 39 using the outputs of a for-loop.
import numpy as np

error_rate = []
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    prediction_i = knn.predict(X_test)
    # fraction of test songs this k classifies incorrectly
    error_rate.append(np.mean(prediction_i != y_test))
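To actually draw the chart from those error rates, here's a minimal matplotlib sketch (the styling is my guess, not necessarily the original chart's):

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, marker='o')
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()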
Using this chart, I tried values near 5 and 10 and settled on K = 10, which returned the lowest error rate and the highest average accuracy: 0.87.
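To reproduce that final result, the model just needs to be refit with the new K (a sketch of the step implied above):

# retrain with the best-performing K and re-check the report
knn = KNeighborsClassifier(n_neighbors=10, p=2, metric='euclidean')
knn.fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))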