Exploring the MovieLens 100k dataset with SGD, autograd, and the surprise package.
By Gavin Smith and XuanKhanh Nguyen
This project was the third project for our machine learning class this semester. The project aims to train a machine learning algorithm on the MovieLens 100k dataset for movie recommendation by optimizing the model’s predictive power. We were given a clean, preprocessed version of the MovieLens 100k dataset containing 943 users’ ratings of 1682 movies. The input to our prediction system is a (user id, movie id) pair. Our predictor’s output is a scalar rating y in the range [1, 5], where a rating of 1 is the worst possible and a rating of 5 is the best. Our main task is to predict the ratings of all user-movie pairs. We build the recommendation system using four different models.
- Problem 1 uses the Simple Baseline Model with SGD and Autograd. This is the simplest possible baseline: a model that makes the same scalar prediction for a movie’s rating no matter which user or movie is considered.
- Problem 2 uses the One-Scalar-Per-Item Baseline with SGD and Autograd.
- Problem 3 uses One-Vector-Per-Item Collaborative Filtering with SGD and Autograd.
- Problem 4 is open-ended. We were given a dataset where ratings have been omitted.
For all problems, our task is to obtain the best possible prediction results on the validation set and the test set, as measured by mean absolute error (MAE). Methodologies are explained in each section, along with the corresponding figures.
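The Problem 1 setup can be sketched in a few lines. This is a minimal stand-in, with a hand-derived gradient where the project used the autograd package, and an illustrative ratings array rather than the real MovieLens data:

```python
import numpy as np

# Simple baseline: a single scalar mu trained by gradient descent on
# squared error. The project computed this gradient with autograd;
# here it is written out by hand. The ratings below are made up.
ratings = np.array([4.0, 3.0, 5.0, 2.0, 4.0, 3.0])

mu = 0.0          # initial constant prediction
step_size = 0.1   # learning rate
for epoch in range(300):
    grad = 2.0 * np.mean(mu - ratings)  # d/dmu of mean((mu - y)^2)
    mu -= step_size * grad

print(mu)  # converges toward the mean rating
```

With full batches the update contracts toward the mean geometrically, which matches the steady convergence seen for the large-batch runs below.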
Figure 1a: Simple Baseline Model Epoch vs. Mean Absolute Error
Figure 1a shows how the mean absolute error of the simple baseline model, which predicts a single value mu for every example, changes over training. The left plot, with a batch size of 10000, converges to an MAE value steadily because each gradient estimate uses many examples; the right plot, with a smaller batch size of 100, converges more erratically. We choose n_epoch=10 for this problem. For batch_size=10000, learning becomes visible after about 4 epochs, and the MAE then approaches a constant value as the number of epochs keeps increasing. For batch_size=100, the curve becomes less noisy after about 4 epochs.
1b: How would you compute the optimal μ value for this model?
The optimal mu value can be computed directly with “print(ag_np.mean(ag_np.asarray(train_tuple[2])))”, which takes the mean rating over all training examples. This computation produces 3.58, the same value our model learns. This makes sense: if we must predict a single value for every example, the mean is the constant that minimizes the squared training error.
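A quick numerical check of that claim: nudging mu away from the mean in either direction raises the mean squared error. Here `train_ratings` is an illustrative stand-in for `train_tuple[2]` from the starter code:

```python
import numpy as np

# Verify the mean is the best constant prediction under squared error.
train_ratings = np.array([5.0, 3.0, 4.0, 2.0, 4.0, 3.0, 5.0, 1.0])

mu_opt = train_ratings.mean()
mse = lambda mu: np.mean((mu - train_ratings) ** 2)

print(mu_opt, mse(mu_opt))  # moving mu either way increases mse
```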
Figure 2a: Scalar Pattern Model Epoch vs. Mean Absolute Error
Figure 2a shows how the mean absolute error of the scalar pattern model changes over time with stochastic gradient descent. The figure on the left uses a batch size of 10000; because the batch size is larger, the error decreases at a steadier rate before settling around 0.75. The right graph shows the same process with a batch size of 100: the error reaches roughly 0.75 much faster, but its value fluctuates more because each batch is smaller. We choose n_epoch=250 for this problem. For batch_size=10000, learning becomes visible after about 200 epochs, and the MAE then approaches a constant value as the number of epochs keeps increasing. For batch_size=100, the curve becomes less noisy after about 200 epochs. To choose the step size, we first pick a large value (step_size=1.5) and observe that the training loss diverges, then repeatedly decrease the step size by 0.25. We find that step sizes of 0.75 or less do not make the training loss diverge, so for this problem we choose a step size of 0.75.
Table 2b: Selected Movies Titles Versus cj
Table 2b shows each of the movies from the selected list and its learned per-movie rating-adjustment parameter cj, in order. We pick the model with the best MAE on the validation set (batch size 10000, 250 epochs, step size 0.75); the listed results come from this model. From this list, we can see that movies with a larger positive cj value tend to be universally “lauded” movies, or “classics”, such as Star Wars or Indiana Jones, while movies with a lower value tend to be less so, like Jurassic Park 2 or Scream 2; such sequels and horror films do not always appeal to a wide audience. In our problem, a movie’s bias describes how well it is rated compared to the average across all movies. This value depends only on the movie and does not take into account the interaction between a user and the movie. Therefore, a large positive value means the movie is likely to be rated highly by most people, and a large negative value means the movie is likely to be rated low.
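The intuition behind cj can be sketched in closed form: each movie’s bias is roughly its average rating minus the global mean. The SGD model learns these values jointly; the tiny ratings dictionary below is made up purely for illustration:

```python
import numpy as np

# Per-movie bias c_j ~= (movie's mean rating) - (global mean mu).
# Toy data: a widely liked classic vs. a niche sequel.
ratings_by_movie = {
    "Star Wars (1977)": [5, 5, 4, 5],
    "Scream 2 (1997)": [2, 3, 2],
}
all_ratings = [r for rs in ratings_by_movie.values() for r in rs]
mu = np.mean(all_ratings)

c = {title: np.mean(rs) - mu for title, rs in ratings_by_movie.items()}
print(c)  # lauded movies get positive c_j, disliked ones negative
```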
Figure 3a: One-Vector-Per-Item Model, Epoch vs. Mean Absolute Error, α=0
This figure shows how the mean absolute error of the one-vector-per-item collaborative filtering model varies over time, with each graph showing the model for a different number of dimensions K used to represent users and movies. In every graph, early in training the models underfit the data, as both the training and validation set errors are very high. All the models then start to overfit after about 250 epochs, as the training set error keeps decreasing while the validation set error starts to increase again. As K increases, the validation set error shows a more defined dip around 350 epochs, and the model overfits much faster for larger K values.
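The prediction rule behind this model combines the global mean, per-user and per-movie biases, and a K-dimensional dot product between a user vector u_i and a movie vector v_j. A minimal sketch with illustrative (not learned) parameter values:

```python
import numpy as np

# One-vector-per-item collaborative filtering prediction:
#   yhat(i, j) = mu + b_i + c_j + u_i . v_j
# All values below are made up for illustration.
K = 2
mu = 3.5
b_user = np.array([0.2, -0.1])                        # per-user biases
c_movie = np.array([0.4, -0.3, 0.0])                  # per-movie biases
U = np.array([[0.5, -0.2], [0.1, 0.3]])               # user vectors, (n_users, K)
V = np.array([[0.6, 0.1], [-0.4, 0.2], [0.0, 0.5]])   # movie vectors, (n_movies, K)

def predict(i, j):
    """Predicted rating for (user i, movie j)."""
    return mu + b_user[i] + c_movie[j] + U[i] @ V[j]

print(predict(0, 0))
```

Larger K gives the dot product more capacity, which is consistent with the faster overfitting seen in the figure.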
Figure 3b: One-Vector-Per-Item Model, Epoch vs. Mean Absolute Error, α = 0.1
This figure shows how the mean absolute error changes with the number of epochs for our model with an alpha value of 0.1. We chose this alpha value by training models over a grid of alpha values, and we found that alpha=0.1 reduced the mean absolute error the most. With this alpha value, we were able to reduce the mean absolute error below that of the model trained with an alpha value of 0. For this problem, the batch size is fixed at 1000, and 0.75 is the largest step size that does not make the training loss diverge. We use early stopping to find the parameters that perform best on the validation set.
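The early-stopping rule amounts to snapshotting the parameters whenever the validation MAE improves and reporting the best snapshot at the end. A minimal sketch with a synthetic validation curve (the real curves are the ones plotted in the figures):

```python
# Early stopping: keep the epoch with the lowest validation MAE.
# The curve below is synthetic, shaped like the dip-then-rise in the figures.
val_mae_per_epoch = [0.95, 0.88, 0.82, 0.79, 0.78, 0.80, 0.83, 0.87]

best_mae = float("inf")
best_epoch = None
for epoch, mae in enumerate(val_mae_per_epoch):
    if mae < best_mae:            # validation error improved: save a snapshot
        best_mae, best_epoch = mae, epoch

print(best_epoch, best_mae)  # the epoch where validation MAE bottoms out
```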
Table 3c: Mean Absolute Error on Validation and Test Sets
This table shows the mean absolute error for each of the models we trained in questions 1 through 3 on the validation set and the test set. To determine the best version of each model, we minimized the error on the validation set. For M1, this is easy because we are just choosing a single fixed value for every guess, so there is only one optimal value. For M2, we chose the model with a batch size of 10000 because it minimized the mean squared error on the validation set. We used early stopping for each of the M3 models and searched over different alpha values, batch sizes, and step sizes to determine which version of the model produced the smallest mean absolute error. We recommend a K value of 50 because it had the smallest error; it could be worth trying models with even larger K values, since the error appears to decrease as K grows. The best model we found is M3 with a K value of 50 and an alpha value of 0.1; this model reduces both the validation set and test set errors. Additionally, an L2 penalty was added to the M3 models: L2 regularization on the vectors u and v forces their values to stay small, which helps avoid overfitting.
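The L2-penalized objective for M3 can be sketched as the squared error on observed ratings plus alpha times the squared norms of the factor matrices. Shapes and values below are illustrative, not the trained parameters:

```python
import numpy as np

# L2-regularized training loss for the factorization model:
#   loss = MSE(y, yhat) + alpha * (||U||^2 + ||V||^2)
alpha = 0.1
U = np.array([[0.5, -0.2], [0.1, 0.3]])   # user factor matrix (made up)
V = np.array([[0.6, 0.1], [-0.4, 0.2]])   # movie factor matrix (made up)

def loss(y_true, y_pred):
    mse = np.mean((y_true - y_pred) ** 2)
    penalty = alpha * (np.sum(U ** 2) + np.sum(V ** 2))
    return mse + penalty

y_true = np.array([4.0, 3.0])
y_pred = np.array([3.8, 3.1])
print(loss(y_true, y_pred))
```

The penalty term grows with the vector norms, so the optimizer trades a little training error for smaller, better-generalizing factors.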
Figure 3d: vj Values of Selected Movies
This figure shows a two-dimensional embedding of the learned per-movie vectors vj for the movies in the select_movies data set, created with the best M3 model using K=2 factors. In this graph, movies are placed according to their factor vectors. Some similar movies tend to be grouped together, although the grouping is not always obvious; in the lower right, for example, we can see horror movies like Scream and Nightmare on Elm Street. We also notice that movies with similar ratings end up closer in the embedding space, presumably because our M3 model has learned that those movies are associated with similar ratings. To check this observation, we calculated average ratings from the ratings_all_development_set dataset: the average rating for Sleepless in Seattle is 3.55, while the average rating for While You Were Sleeping is 3.56, and these two movies are indeed placed close together in the embedding space.
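The closeness claim can be made concrete with a Euclidean distance between 2-D movie vectors. The vectors below are hypothetical stand-ins for the vj values plotted in Figure 3d, not the actual learned ones:

```python
import numpy as np

# Hypothetical 2-D movie embeddings (NOT the learned v_j values).
v = {
    "Sleepless in Seattle (1993)": np.array([0.80, 0.30]),
    "While You Were Sleeping (1995)": np.array([0.78, 0.34]),
    "Scream (1996)": np.array([-0.60, -0.50]),
}

def dist(a, b):
    """Euclidean distance between two movies' embedding vectors."""
    return float(np.linalg.norm(v[a] - v[b]))

# The two similarly rated romantic comedies sit much closer to each other
# than either sits to the horror movie.
print(dist("Sleepless in Seattle (1993)", "While You Were Sleeping (1995)"))
```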
4a: Open-Ended Recommendation Challenge Method Description
For our open-ended recommendation model, we chose the KNNWithMeans algorithm from the surprise package. KNNWithMeans works like the regular K-nearest-neighbors algorithm, calculating the similarity between K points and returning the prediction most in line with those points. It differs in that you must specify a minimum and maximum K value, because the algorithm only uses points whose similarity is positive; it would not make sense to use points that are negatively correlated. This model also accounts for each user’s mean rating, which helps normalize the data. We chose this model because a good way of recommending movies to someone is to suggest choices most like the movies they already enjoy, and since KNN predicts based on the similarity between data points, we thought it would be the best choice. Another reason we chose this model is that other options like SVD took significantly longer to train, which would have left less time for in-depth hyperparameter searching.
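The core of the KNNWithMeans prediction rule can be sketched in pure numpy: the target user’s mean rating plus a similarity-weighted average of the neighbors’ mean-centered ratings, using only neighbors with positive similarity. The toy numbers below are illustrative, not from the dataset:

```python
import numpy as np

# Mean-centered neighborhood prediction (the idea behind KNNWithMeans):
#   yhat = mean(target user) + sum_k sim_k * (r_k - mean_k) / sum_k sim_k
target_mean = 3.5                             # target user's mean rating
sims = np.array([0.9, 0.4, -0.3])             # similarities to candidate neighbors
neighbor_ratings = np.array([4.5, 4.0, 1.0])  # neighbors' ratings of the item
neighbor_means = np.array([4.0, 3.5, 2.0])    # neighbors' own mean ratings

mask = sims > 0                               # drop non-positive similarities
w = sims[mask]
dev = neighbor_ratings[mask] - neighbor_means[mask]
prediction = target_mean + np.sum(w * dev) / np.sum(w)
print(prediction)
```

Subtracting each neighbor’s mean before averaging is what normalizes away the fact that some users rate everything high or low.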
To train our model, we used a grid search to find optimal hyperparameter values, along with 5-fold cross-validation to validate the model. The hyperparameters we searched over were k, the maximum number of points to compare against; min_k, the minimum number of points to compare against; and sim_options, which controls how the similarity between points is computed. We found that a k value of 50, a min_k of 2, and the Pearson similarity option were best, as judged by which values minimized the mean absolute error on the validation folds of the cross-validation.
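The structure of that search is a loop over the Cartesian product of the grid, scoring each combination and keeping the best. In the sketch below, `score` is a hypothetical stand-in for “mean MAE over 5 cross-validation folds” (in the project this came from surprise’s grid search); it is rigged to prefer the values the real search selected, purely for illustration:

```python
from itertools import product

# Hyperparameter grid matching the search described above.
param_grid = {"k": [10, 50, 200], "min_k": [1, 2, 5], "sim": ["pearson", "cosine"]}

def score(params):
    # DUMMY scorer standing in for cross-validated MAE; lower is better.
    # It is constructed to bottom out at k=50, min_k=2, sim="pearson".
    return (abs(params["k"] - 50) / 100
            + abs(params["min_k"] - 2) / 10
            + (0.0 if params["sim"] == "pearson" else 0.01))

best = min(
    (dict(zip(param_grid, vals)) for vals in product(*param_grid.values())),
    key=score,
)
print(best)
```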
Figure 4b: Model Hyperparameter Selection
This figure shows the three hyperparameters we tuned for our KNNWithMeans model. The left-most graph shows how the mean absolute error changes as k, the number of neighbors compared against, increases. As k grows, the model overfits less, and both training and validation error decrease. Around a k value of 200, the model starts to underfit, because the prediction becomes more reliant on data points that are not similar to the point being predicted. From the second graph, we can see that any increase in the minimum number of neighbors increases the error, which suggests an optimal value of 1. Lastly, the right graph shows that the similarity option has almost no influence on our model’s performance, though the Pearson option has a slight edge over the other two.
Table 4c: Model Mean Absolute Error Performance
This table shows the mean absolute error of our model on the held-out data set used for the leaderboard, as well as its performance on the validation set and test set during training.
4d: Analysis of Model Performance
Table 4c shows that the mean absolute error of our model on the held-out data set is lower than the error on our cross-validation folds. One likely reason is that the held-out data set is much smaller (10000 ratings) than our training data (89992 ratings); the held-out ratings may not be fully representative of the training data and may simply happen to be easier to predict, which would explain the lower error.
When comparing these numbers to our models’ results in Table 3c, we can see that our leaderboard mean absolute error is the lowest, while our test error is higher than any of the M3 errors. One explanation is the difference between the two approaches: SVD decomposes the original rating matrix and uses the reconstructed dense matrix to predict the rating for a (user, item) pair, whereas KNN makes a prediction by finding a cluster of users similar to the input user and averaging those users’ ratings. SVD does a better job of learning the training data, which is why we see a smaller error on the validation set.
4e: Model Limitations and Opportunities for Future Work
One limitation we faced was that more complex models like SVD took far too long for us to tune effectively; with access to more computing power, using such models would be more feasible. Another possible improvement is exploring different loss functions: when training our model we only used mean absolute error, but the model might perform better when very incorrect guesses are weighted more heavily, as with mean squared error. Additionally, we would want to consider other user features, such as age, gender, nationality, and spoken language, as well as item features like the main actors, the duration of the movie, and its language. With such features, we could model the fact that people who speak a certain language are more likely to view movies in that language, or that older users are more likely to watch classic movies. To make a prediction, we would then look at the user’s profile and predict the rating from it.