A practical introduction to demographic filtering, content-based filtering, and collaborative filtering
Tommy
In this article, let’s walk through a project that shows how a Machine Learning algorithm, the Recommender System, suggests the next movie you might want to watch. The approach is not limited to movies: it works for any digital object chosen individually for each user, for instance, books, web pages, music, messages, products, and dating preferences, and several companies already use it to improve the customer experience on their digital platforms.
This project implements three types of recommender systems:
- Demographic Filtering offers users with similar demographic backgrounds the same movies, namely those that are popular and well-rated, regardless of genre or any other factor. Because it does not consider each person’s individual taste, its results are simple, but it is easy to implement.
- Content-Based Filtering considers the object’s contents, which for a movie means the actors, directors, description, genre, etc. It therefore gives users recommendations that are closer to their individual preferences.
- Collaborative Filtering focuses on the user’s preference data and recommends movies by matching it against the viewing history of other users with similar preferences. It does not require the movies’ metadata.
Now that we understand how a Recommender System works, let’s jump-start our first Recommender System project with TMDB’s movie dataset, which can be downloaded from Kaggle. The dataset consists of 2 files: the Credits file and the Movies file. The Credits file is 38MB and has 4 features: the movie’s ID, the title, the cast members’ names (on-screen members), and the crew members’ names (backstage members). The Movies file is 5.4MB and contains more features: the movie’s budget, genre, homepage, ID, keywords, original language, original title, production companies, production countries, release date, revenue, runtime (in minutes), status (released or rumored), tagline, title, average vote, and vote count.
First, as usual, we import the libraries we need to get started:
import numpy as np
import pandas as pd
If you are using Google Colab, don’t forget to upload the Movies and Credits files to Colab as follows:
from google.colab import files
uploaded = files.upload()
Then load the files into variables with Pandas’ read_csv function and check their sizes:
credits = pd.read_csv('tmdb_5000_credits.csv')
movies = pd.read_csv('tmdb_5000_movies.csv')
print(credits.shape)
print(movies.shape)
The output shows that both files have 4,803 rows, with 4 features in the Credits file and 20 features in the Movies file.
Since there are two files, we should merge them on the movie ID. Before merging, let’s rename the Credits file’s “movie_id” column to “id” so that both files share an identical “id” feature, and then check the merged file’s size. The Credits file’s title column is renamed to “tittle” at the same time, which keeps it from colliding with the Movies file’s own “title” column during the merge.
credits.columns = ['id','tittle','cast','crew']
movies= movies.merge(credits,on='id')
print(movies.shape)
The merged file now contains 23 features.
As we have seen, demographic filtering is one of the simplest Recommender Systems: it only offers users the best-rated and most popular movies. Simple as it is, we still need an appropriate formula to decide which movies are best rated, because some movies have a 9.0 rating from only 5 votes, which is not fair to movies that are rated slightly lower but have many more votes.
A well-known formula for this is the weighted rating provided by IMDB. It takes into account the movie’s number of votes, the minimum number of votes required to be considered, the movie’s average rating, and the mean rating across the whole dataset:
W = (v / (v + m)) * R + (m / (v + m)) * C
Where:
- W = Weighted Rating
- v = number of votes for the movie
- m = minimum number of votes necessary to be considered
- R = average rating of the movie
- C = the mean rating across the whole dataset
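To make the effect of the formula concrete, here is a minimal sketch that applies it to the situation described above: one movie rated 9.0 by only 5 voters and another rated 8.0 by 5,000 voters. The values of m and C are assumptions chosen for illustration (they are close to the ones this article derives from the dataset), and both movies are made up.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating: blends a movie's own average R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only: m and C are assumed, the two movies are hypothetical.
m, C = 1301, 6.092

few_votes = weighted_rating(v=5, R=9.0, m=m, C=C)       # high rating, almost no votes
many_votes = weighted_rating(v=5000, R=8.0, m=m, C=C)   # lower rating, many votes

print(round(few_votes, 2), round(many_votes, 2))  # 6.1 7.61
```

With so few votes, the first movie is pulled almost all the way back to the global mean, while the second keeps most of its own rating, which is the fairness property the formula is meant to provide.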
To obtain W, we need each of those elements. The number of votes (v) and the average rating (R) are already provided in the dataset as “vote_count” and “vote_average”, so no further calculation is needed for them. Next, we need C, the mean rating across all movies, which can be computed as follows:
C= movies['vote_average'].mean()
If we print the value of C, we get approximately 6.092.
Next, we need to determine m, the number of votes a movie needs in order to be considered as a recommendation. We can choose any number, but here we use the 85th percentile as the cutoff, which means a movie needs more votes than 85% of all movies to be considered. Let’s find out:
m= movies['vote_count'].quantile(0.85)
The result shows that a movie needs at least 1301 votes to be considered in the recommendation.
Using the value of m, we can eliminate the movies with fewer than 1301 votes as follows:
demograph_movies = movies.copy().loc[movies['vote_count'] >= m]
Now only 721 out of 4,803 movies remain, namely those with at least 1301 votes.
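The percentile cutoff can be sketched without Pandas to show what happens under the hood. The example below assumes linear interpolation between the two nearest sorted values, which is the default behaviour of Pandas’ quantile; the vote counts are made up for illustration.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)               # fractional position in the sorted list
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])

# Hypothetical vote counts for seven movies.
vote_counts = [10, 50, 200, 800, 1500, 3000, 9000]

cutoff = quantile(vote_counts, 0.85)
qualified = [v for v in vote_counts if v >= cutoff]

print(round(cutoff), qualified)  # 3600 [9000]
```

A higher percentile keeps fewer, more widely voted movies; a lower one keeps more movies but lets thinly rated ones back in.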
Now that we have all the elements, let’s define the IMDB weighted rating formula as a function:
def weighted_rating(a, m=m, C=C):
    v = a['vote_count']
    R = a['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
Then we can store the formula’s results in the demographic recommendation data by creating a new feature called “score”:
demograph_movies['score'] = demograph_movies.apply(weighted_rating, axis=1)
Afterwards, we need to sort the movies based on the weighted rating score in descending order.
demograph_movies = demograph_movies.sort_values('score', ascending=False)
Now let’s look at the top 10 movies according to the demographic recommendation algorithm we have just built with the IMDB formula:
demograph_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)
It turns out that The Shawshank Redemption tops the chart, followed by Fight Club and Pulp Fiction. They are indeed great movies. However, this recommendation applies to everyone regardless of each user’s preferred genre or other preferences, so it is far from perfect.
Unlike demographic filtering, content-based filtering considers the elements of each movie before recommending it to users, such as the movie’s description, genre, cast, crew, etc. This way, users are more likely to receive recommendations that align with their favorite movies.
Recommendation Based on Movie’s Description
Let’s start by recommending movies that have similar descriptions. In the Movies dataset, the description is stored in the “overview” feature, which we can inspect as follows:
movies['overview'].head(10)
Since we are dealing with sentences here, it is wiser to adopt one of the NLP (Natural Language Processing) techniques called TF-IDF, which is short for Term Frequency-Inverse Document Frequency. TF-IDF measures the importance of each word by computing TF and IDF, which in their textbook form are:
TF = (number of times the word appears in a document) / (total number of words in that document)
IDF = log(total number of documents / number of documents that contain the word)
TF-IDF is then simply the product of the two:
TF-IDF = TF*IDF
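As a rough illustration of the TF*IDF product, the sketch below computes it by hand for three made-up one-line overviews. It assumes the plain textbook definitions (TF as a word’s share of the document, IDF as the log of total documents over documents containing the word); Scikit-Learn’s TfidfVectorizer uses a smoothed IDF and normalises each row, so its numbers will differ.

```python
import math
from collections import Counter

# Three hypothetical movie overviews.
docs = [
    "a hero saves the city",
    "a villain attacks the city",
    "a hero falls in love",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many overviews does each word appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    tf = tokens.count(term) / len(tokens)        # share of the document taken by the word
    idf = math.log(n_docs / doc_freq[term])      # rarer across documents means larger IDF
    return tf * idf

first = tokenized[0]
scores = {term: round(tf_idf(term, first), 3) for term in first}
print(scores)
```

The word “a” appears in every overview, so its score is zero, while “saves” appears in only one and receives the highest score in the first overview.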
The Scikit-Learn library provides a TF-IDF implementation, which can be imported with the following code:
from sklearn.feature_extraction.text import TfidfVectorizer
Before we run TF-IDF, we need to do the necessary NLP pre-processing, such as removing stop words (very common words that carry little meaning on their own, for instance, “a”, “the”, “but”, “what”, “or”, “how”, and “and”). We do this when creating the vectoriser:
tfidf = TfidfVectorizer(stop_words='english')
And we also need to replace NaN with an empty string:
movies['overview'] = movies['overview'].fillna('')
Then we can apply TF-IDF vectorisation to the movies’ overview and check its size:
tfidf_overview = tfidf.fit_transform(movies['overview'])
tfidf_overview.shape
The resulting shape shows that more than 20,000 distinct words are used to describe the 4,803 movies in the “overview” feature.
Now that the overviews have been turned into TF-IDF vectors, we can measure how similar two movies are. Several measures exist, such as Euclidean distance, Pearson correlation, and cosine similarity. For simplicity, we will use cosine similarity. Because TfidfVectorizer L2-normalises each row by default, the dot product of two rows already equals their cosine similarity, so it can be obtained with the linear_kernel() function from the sklearn library.
First, we need to import linear_kernel from sklearn:
from sklearn.metrics.pairwise import linear_kernel
Then we can find out the cosine similarity through it.
cos_sim = linear_kernel(tfidf_overview, tfidf_overview)
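To see why a plain dot product can stand in for cosine similarity here, the following sketch compares the two on a pair of made-up vectors. It is only an illustration in pure Python and assumes the vectors are L2-normalised first, which is what TfidfVectorizer does by default.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two hypothetical term-weight vectors.
a = [0.0, 2.0, 1.0]
b = [1.0, 1.0, 0.0]

direct = cosine_similarity(a, b)                    # cosine computed from the definition
via_dot = dot(l2_normalize(a), l2_normalize(b))     # dot product of unit-length vectors

print(round(direct, 4), round(via_dot, 4))
```

Both values agree, which is why the cheaper linear kernel is sufficient once the rows have unit length.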
This way, we have computed the similarity between every pair of movie descriptions in the dataset. However, before we create a function that returns recommendations based on these similarities, we need a reverse mapping from each movie title to its index:
indices = pd.Series(movies.index, index=movies['title']).drop_duplicates()
Then, we can start building a function for movie recommendation based on their descriptions as follows:
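def des_recommendations(title, cos_sim=cos_sim):
    # Look up the row index of the movie with this title
    idx = indices[title]
    # Pair every movie index with its similarity to the chosen movie
    sim_scores = list(enumerate(cos_sim[idx]))
    # Sort from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0 (the movie itself) and keep the next 15
    sim_scores = sim_scores[1:16]
    movie_indices = [i[0] for i in sim_scores]
    return movies['title'].iloc[movie_indices]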
In the description-based recommendation algorithm above, we first obtain the movie’s index from its title, then collect its cosine similarity to every other movie, sort the movies in descending order of similarity, skip the first entry (the movie itself) and keep the next 15, get the indices of those movies, and finally return the titles of the top 15 recommendations.
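The same steps can be traced on a tiny example. The sketch below uses a hand-written 4x4 similarity matrix and hypothetical titles instead of the real dataset, and returns the top 2 rather than the top 15.

```python
# Hypothetical titles and a made-up symmetric similarity matrix.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
sim = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.6, 0.4, 1.0],
]
index_of = {title: i for i, title in enumerate(titles)}  # reverse mapping: title -> row index

def recommend(title, top_n=2):
    idx = index_of[title]
    scores = list(enumerate(sim[idx]))                    # (movie index, similarity) pairs
    scores.sort(key=lambda pair: pair[1], reverse=True)   # most similar first
    scores = scores[1:top_n + 1]                          # drop the movie itself at position 0
    return [titles[i] for i, _ in scores]

result = recommend("Movie A")
print(result)  # ['Movie C', 'Movie B']
```

Dropping position 0 relies on a movie always being most similar to itself, which holds here because the diagonal of the matrix is 1.0.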
Let’s try the movie recommendation for Minions:
des_recommendations('Minions')
We get cartoon and kids’ movies as recommendations.
If we try The Dark Knight:
des_recommendations('The Dark Knight')
We get mostly other Batman movies as recommendations.
I would say this type of Recommender System provides recommendations that are much more relevant than those of the demographic filtering system.
Unlike Content-Based Filtering, which recommends movies based only on the elements of other movies, Collaborative Filtering (CF) offers a more personal experience because it takes the users’ ratings into account. Before moving further, we need to understand its two types: user-based filtering and item-based filtering. As the names suggest, user-based filtering measures how similar users are to each other based on their ratings, whereas item-based filtering measures how similar items are to each other based on the ratings they received. In both cases, the similarity can be calculated with the Pearson Correlation or Cosine Similarity formulas.
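A minimal sketch of the user-based variant is shown below. It assumes cosine similarity computed over the movies two users have both rated and a similarity-weighted average as the prediction, which is one simple choice among several; the users, movies, and ratings are all made up.

```python
import math

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}.
toy_ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m1": 4, "m2": 5, "m3": 2, "m4": 5},
    "carol": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}

def user_similarity(u, v):
    """Cosine similarity between two users over the movies both have rated."""
    common = set(toy_ratings[u]) & set(toy_ratings[v])
    if not common:
        return 0.0
    dot = sum(toy_ratings[u][i] * toy_ratings[v][i] for i in common)
    norm_u = math.sqrt(sum(toy_ratings[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(toy_ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of the ratings other users gave to the item."""
    num = den = 0.0
    for other, their in toy_ratings.items():
        if other == user or item not in their:
            continue
        s = user_similarity(user, other)
        num += s * their[item]
        den += s
    return num / den if den else None

pred = predict("alice", "m4")
print(round(pred, 2))
```

Alice’s tastes are much closer to Bob’s than to Carol’s, so Bob’s high rating for the unseen movie dominates the prediction.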
Therefore, CF can predict how much a user will like a certain movie even though the user hasn’t rated it yet. To get started with the CF project, we need to download another dataset from Kaggle, specifically the “ratings_small.csv” dataset, because the previous dataset does not contain a User ID feature, which is essential for CF.
ratings = pd.read_csv('ratings_small.csv')
We will also need the scikit-surprise library for its SVD and other functions. If you have not installed it yet, you can run the following command:
pip install scikit-surprise
Then, we can import the surprise modules:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
Because we are dealing with a large amount of user and product data, scalability and sparsity can become problems. We mitigate them by using Singular Value Decomposition (SVD), and we check the model’s performance by assessing the RMSE (Root Mean Square Error) and MAE (Mean Absolute Error). Note: the lower the RMSE and MAE, the better the model performs on the dataset.
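For reference, the two error measures can be computed by hand as in the sketch below. The actual and predicted ratings are made up purely to show the arithmetic.

```python
import math

# Hypothetical true ratings and model predictions for four user-movie pairs.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

errors = [p - a for p, a in zip(predicted, actual)]

rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # penalises large errors more
mae = sum(abs(e) for e in errors) / len(errors)             # average size of the error

print(rmse, mae)  # 0.75 0.625
```

Both are expressed in rating points, so an RMSE of 0.75 means the predictions are typically off by somewhat less than one point.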
reader = Reader()
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5)
The results show that both the MAE and the RMSE are below 1 with SVD, which is an acceptable range.
Since the model performs well, let’s train it on the full dataset:
train = data.build_full_trainset()
svd.fit(train)
Let’s look at the ratings given by the user with ID 1:
ratings[ratings['userId'] == 1]
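And let’s make a prediction for user ID 1 and movie ID 302. The third argument, 3, is the true rating (r_ui), which Surprise only carries along for reference and does not use to compute the estimate: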
svd.predict(1, 302, 3)
We get an estimated rating of 2.87. The exact value can vary between runs because SVD starts from a random initialisation.
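Behind that estimate, Surprise’s SVD model combines a global mean rating, a bias for the user, a bias for the movie, and the dot product of the learned user and movie factor vectors, then clips the result to the rating scale. The sketch below reproduces only that final arithmetic; every number in it is invented, and the function name is hypothetical rather than part of the library.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """Estimate = global mean + user bias + item bias + dot(user factors, item factors)."""
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    return min(high, max(low, raw))   # keep the estimate inside the rating scale

# Made-up parameters for one user and one movie, with two latent factors.
est = svd_estimate(mu=3.5, b_u=-0.4, b_i=0.2, p_u=[0.3, -0.1], q_i=[0.5, 0.8])
print(round(est, 2))  # 3.37
```

In the trained model these parameters are learned from the ratings data, and real models typically use many more latent factors than the two shown here.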
Demographic, Content-Based, and Collaborative Filtering are distinct Recommender Systems that each consider different elements. Demographic Filtering is the simplest of the three, while Content-Based and Collaborative Filtering give more personalised movie recommendations.
Thank you for reading, and I hope you enjoyed it.