In a previous article, I wrote about the three types of recommender systems.
However, it is extremely important to understand how we can evaluate recommender systems automatically. This allows us to iterate rapidly on the system we are working on, compare different experiment parameters, and choose the best-performing configuration.
Earlier, many recommendation algorithms were evaluated purely on their prediction accuracy. However, this is not the best way to evaluate recommender systems. Even though predicting users’ exact needs is crucial, it is not enough in most cases.
“A lot of times, people don’t know what they want until you show it to them” — Steve Jobs
That is why recommendation systems have shifted their focus from being the most accurate engines to being the engines that best help people discover more.
Platforms like Netflix, Amazon, or Spotify use their recommendation engines not only to predict the specific interests of a user but also to help them discover new tastes.
Due to their social nature, humans tend to explore new things.
A decade ago, this type of recommendation came from family, friends, or colleagues. Lately, recommender systems have taken over this responsibility and have been trying to show people different recommendations, based on their profiles, to help them find new interests.
This strategy is a win-win for both commercial applications and users: customers expand their horizons with new interests, while companies earn more from sales.
This article will cover the evaluation of recommender systems using three different strategies: offline testing, online testing, and user studies, and it will also describe the most common evaluation metrics.
One of the easiest ways to evaluate a recommender engine is offline testing. Offline testing is applied to an existing data set, and the model is evaluated using performance metrics such as prediction accuracy. Another method is conducting an experiment on real users in a live environment.
Finally, it is also possible to run a user study, where a group of people is asked to test the proposed solution and answer a questionnaire after the experiment.
While there are many metrics for measuring the performance of recommendation algorithms, accuracy is the one mostly used in academic research. However, researchers are not completely sure whether accuracy directly reflects user intention. That is why recent research focuses on metrics like coverage, novelty, diversity, Discounted Cumulative Gain (DCG), Normalized DCG (nDCG), and so on.
Offline evaluations are done using historical data that contains user actions. The main idea of offline evaluation is to simulate the user’s interactions with the recommender system. Performing offline evaluation has an advantage since it does not require real-time user interaction, so it is possible to test multiple recommender algorithms on the same data. On the other hand, offline tests only reflect user behavior at the time the data was created.
User behavior tends to change, and performing offline evaluation often will not give proper information on how well recommendations fit users’ interests in general. That is why offline evaluation results should not be seen as the real-life performance of the recommendation algorithm. One good use case for offline evaluation is finding the best possible combination of parameters by tuning the algorithms and picking the best one.
There are multiple ways to simulate user behavior. One can gather a particular user session and the items in it. Then, for each item in the session, the algorithm produces a list of ranked recommendations, and it is possible to compare the predicted items with the items that were actually clicked by the user. In this method, the algorithm receives one item at a time and creates a recommendation. Another possibility is dividing the items in the session into bins and feeding the algorithm one bin at a time to predict the items in the next bin.
How to simulate user behavior is often decided based on the domain. While predicting the next item in the user session might be useful for e-commerce websites, creating a recommended playlist in Spotify might require the algorithm to consider more than one song in the user’s history.
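As a minimal sketch of the first approach (next-item prediction), assuming a hypothetical `recommend(history, k)` function that returns a ranked list of item IDs, an offline replay over recorded sessions could look like this:

```python
from typing import Callable, List, Sequence

def hit_rate_at_k(
    sessions: Sequence[Sequence[str]],
    recommend: Callable[[List[str], int], List[str]],
    k: int = 10,
) -> float:
    """Replay each session item by item and check whether the item the user
    actually clicked next appears in the top-k recommendations."""
    hits, total = 0, 0
    for session in sessions:
        for i in range(1, len(session)):
            history = list(session[:i])     # items seen so far
            next_item = session[i]          # item the user clicked next
            ranked = recommend(history, k)  # ranked recommendations from the model
            hits += int(next_item in ranked[:k])
            total += 1
    return hits / total if total else 0.0

# Toy example: a "recommender" that simply repeats the last seen item
sessions = [["a", "b", "b"], ["b", "c", "d"]]
print(hit_rate_at_k(sessions, lambda history, k: history[-1:], k=1))  # 0.25
```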
Online evaluation is one of the best ways of seeing how users interact with the recommendation engine. The real-life performance of a recommendation system depends on a variety of factors.
During an online evaluation, real users interact with the system, so it is possible to directly understand the actual user intent and the success of the recommendation model.
First, a subset of the total users is separated as a test group. This separation is generally random so that both groups will be similar to each other. While test users are shown the recommendations from the new algorithm, the others continue to interact with the current recommendation system.
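As a rough sketch of such a random split (the hash-based assignment and the 10% test ratio are assumptions for illustration, not a prescribed setup):

```python
import hashlib

def assign_group(user_id: str, test_ratio: float = 0.10) -> str:
    """Deterministically assign a user to the 'test' or 'control' group by
    hashing the user ID, so the same user always sees the same variant."""
    bucket = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return "test" if bucket < test_ratio * 100 else "control"

print(assign_group("user-42"))  # always returns the same group for this user
```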
It is crucial that there is only one change in the new setup so that the effect of the applied change can be understood clearly. If the test is meant to measure recommendation accuracy, there should not be any simultaneous changes to the user interface of the web page. There must also be a certain amount of time between changes so that the tester can understand whether the proposed algorithm improves or worsens the recommendation quality.
Even though online evaluations give the most accurate success rate of the proposed recommendation system, they are risky to apply. If the proposed algorithm produces irrelevant recommendations, it might cause test users to leave the website and eventually reduce the income of the business.
Applying online evaluation gives the true performance of the recommendation system since real users test it. However, it might cause businesses to lose customers in case of poor recommendations. That is why online evaluation must be done only after fine-tuning the algorithms with offline testing and selecting the best-performing one.
As stated before, accuracy is not the only metric for evaluating the performance of recommendations. When it comes to real-life scenarios, algorithms must consider user needs. It is debatable whether recommendations should only contain products close to what the user already knows or also give customers some off-topic choices. Even though it mostly depends on the domain, in order to keep users in the online shop and let them discover new possibilities, recommendations should show different items as well. When this is the case, it is not possible to evaluate the performance using accuracy alone.
One can use other metrics like Coverage, Novelty, Diversity, MRR, DCG or nDCG to evaluate the performance of the recommendations.
Coverage is the percentage of items in the catalog for which the implemented recommendation algorithm can produce recommendations. Coverage is essential from the perspective of new items being recommended: it gives an idea of how quickly a new item will start appearing in recommendation lists.
New items will reduce the coverage metric since they need to be purchased or rated by at least a couple of people first. Depending on the recommendation technology used, the list of items for which the system can produce predictions varies.
If the system uses only collaborative filtering, items must have ratings above a decided threshold. Similarly, some systems may apply filters, and only items that satisfy the given filters can be used for prediction. One can denote the list of items that can be used for recommendation as I_s, a subset of the possible items in the inventory I. Then, coverage can easily be calculated by dividing the number of possible items by the total number of items:
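With I_s as the set of items the system can recommend and I as the full inventory, this reads:

coverage = ( |I_s| / |I| ) × 100%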
Diversity is a measure of how different your recommendations are from each other. Consider a customer who has just finished watching the first movie of a trilogy on Netflix. A low-diversity recommender would recommend only the next parts of the trilogy or films by the same director. On the other hand, high diversity can be achieved by recommending items completely at random. Diversity may seem like a subjective measure, but it can be calculated using the similarity between recommended items. One can calculate the pairwise similarity between the recommended items and take the average of the similarity scores. Consider the average similarity score of the top 10 recommendations as sim10. Diversity can then be defined as the opposite of the average similarity score and calculated with the following formula:
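With sim10 as the average pairwise similarity of the top 10 recommendations:

diversity = 1 − sim10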
Considering only diversity as the measure of quality can be misleading. That is why it should be used as a supplement alongside other measurements.
Novelty is a measure of how popular the items the system recommends are. Novelty may seem similar to diversity. One can achieve high novelty by recommending items at random, since most items are not in the popular-item list. Even though it is possible to calculate novelty, it is hard to use it as a solid metric on its own. Customers often want to see related products in their recommendations; this builds user trust in the recommender system. If the recommender system only recommends items that do not reflect the user’s interests, it might damage user trust and cause customers to leave the online store.
The system should keep a balance in the recommendation list between items the user might like and items the user does not know but could discover. This is important because familiar items make customers trust the recommender system, while unfamiliar items let them discover new things. Achieving a good novelty score can also help with the long-tail problem, which is why it is an important metric to keep track of.
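There is no single agreed-upon formula for novelty; one common formulation (an assumption here, not the only option) scores a recommendation list R by the mean self-information of its items, where p(i) is the fraction of users who have interacted with item i:

novelty(R) = (1/|R|) × Σ_{i in R} −log2 p(i)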
MRR (Mean Reciprocal Rank) is a metric for evaluating a ranked list of predictions for a given query. For a single list, the reciprocal rank is 1/rank, where rank is the position of the first correctly predicted item. If there is no correct prediction, the reciprocal rank is 0. When there are multiple queries Q, the Mean Reciprocal Rank is the mean of all the reciprocal ranks.
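With rank_i as the position of the first correct item for query i (and 1/rank_i taken as 0 when there is no correct item), this reads:

MRR = (1/|Q|) × Σ_{i=1..|Q|} (1/rank_i)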
Recommendation algorithms produce an ordered list of recommendations, and it is possible to judge the result based on that ordering. There are two assumptions to keep in mind when evaluating a ranked list of recommendation results:
- Highly relevant items are more valuable than marginally relevant items, and
- The greater the ranked position of a relevant item (of any relevance level), the less valuable it is for the user, because it is less likely that the user will examine it.
It is possible to grade recommended items by their relevance. Consider relevance scores from 0 to 3 (0 is the lowest and 3 is the highest relevance). One can then replace each recommended item with its relevance value, so that every item gets a score of 0, 1, 2, or 3.
Consider the top 5 recommendations produced by the system as R and corresponding relevance scores as G:
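For illustration, take the relevance scores of the five recommended items to be:

G = [3, 3, 2, 0, 1]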
Cumulative Gain (CG) can be calculated by summing the relevance scores in the recommendation list:
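With rel_i denoting the relevance score of the item at position i, the cumulative gain of the top p recommendations is:

CG_p = rel_1 + rel_2 + … + rel_p = Σ_{i=1..p} rel_i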
Once the cumulative gain formula is applied to the vector G, we obtain the CG of the recommendation list. In this case, the total CG score of the vector is:
3 + 3 + 2 + 0 + 1 = 9
Additionally, one can create a vector of cumulative gains and read off the cumulative gain at any given position. Consider G′ to be the vector of cumulative gains:
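Continuing with the example G above, each entry of G′ is the running sum of the relevance scores:

G′ = [3, 6, 8, 8, 9]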
It is possible to get the CG at position i by reading the value from the vector G′ directly. For example, CG(3) = 8.
A major drawback of CG is that it does not take ordering into account. One can end up with the same CG score using a different ordering of the recommended items. For example:
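For illustration, compare two lists of five recommendations whose relevance scores are:

G_R1 = [3, 3, 2, 0, 1] → CG = 9
G_R2 = [0, 1, 2, 3, 3] → CG = 9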
Even though R1 recommends the two most relevant items in the first two positions, it has the same CG score as R2. If we consider only CG, we would conclude that R1 is just as good as R2, which is not completely correct.
In order to fix this problem, the position of the recommended item must be included in the formula. The calculation then divides the relevance score of each item by the logarithm of its position. This is known as Discounted Cumulative Gain (DCG), and it penalizes the recommender algorithm for listing the most relevant items at lower ranks:
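A common formulation of the discount (variants differ in the exact logarithm used) divides the relevance at position i by log2(i + 1):

DCG_p = Σ_{i=1..p} rel_i / log2(i + 1)

Applied to the illustrative lists above, DCG(R1) ≈ 3/1 + 3/1.58 + 2/2 + 0 + 1/2.58 ≈ 6.28, while DCG(R2) ≈ 0 + 1/1.58 + 2/2 + 3/2.32 + 3/2.58 ≈ 4.08, so DCG now rewards R1 for placing the most relevant items first.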
In this article, I presented multiple criteria for evaluating recommendation systems. It is clear that there is no single, absolute metric for evaluating a system, because the right performance metric depends heavily on the business model itself.
If you have any questions or ideas on how to improve this article, let’s discuss them below.