Evaluating Recommender Systems

Evaluation Methodology

Approach 1

A recommender system is a machine learning system. One can train it using prior user behavior, and then use it to make predictions about items new users might like.

So on paper at least, one can evaluate a recommender system just like any other machine learning system.

Here's how it works:

One can measure a recommender system's ability to predict how people rated things in the past. But to keep it honest, we start by splitting up the ratings data into a training set and a testing set. Usually the training set is bigger, say 80 or 90 percent of all of the data. So we train the recommender system using only the training data. This is where it learns the relationships it needs between items or between users.

Once it's trained, we can ask it to make predictions about how a new user might rate some item they've never seen before. So to measure how well it does, we take the data we reserved for testing. These are ratings that our recommender system has never seen before.

For example, let's say one rating in our test set says that the user actually rated the movie "Up" five stars. We just ask the recommender system how it thinks this user would rate "Up" without telling it the answer. And then we can measure how close it came to the real rating.

If we do this over enough people, we can end up with a meaningful number that tells us how good the recommender system is at recommending things, or more specifically, recommending things people already watched and rated.

Below is a visualization of this methodology:

That's really all one can do if it is not possible to test things out in an online system.
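For concreteness, here is a minimal sketch of this train/test workflow. The text doesn't name a library, so the scikit-surprise package, the SVD algorithm, and the built-in MovieLens 100k ratings are assumptions made purely for illustration:

```python
# A minimal sketch of offline train/test evaluation, assuming the
# scikit-surprise library and the built-in MovieLens 100k ratings.
from surprise import SVD, Dataset, accuracy
from surprise.model_selection import train_test_split

# Load the ratings and hold out 20% of them as the test set.
data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=0.20, random_state=1)

# Train only on the training ratings.
algo = SVD(random_state=1)
algo.fit(trainset)

# Predict the held-out ratings the model has never seen...
predictions = algo.test(testset)

# ...and measure how close the predictions came to the real ratings.
accuracy.rmse(predictions)
accuracy.mae(predictions)
```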

Approach 2

Another, fancier way of evaluating is a technique called k-fold cross-validation. It's the same idea as train/test, but instead of a single training set, we create many randomly assigned training sets.

Each individual training set, or fold, is used to train your recommender system independently, and then we measure the accuracy of the resulting systems against the test set.

We end up with a score of how accurately each fold ends up predicting user ratings, and we can average them together.

Below is a visualization of this methodology:

This methodology obviously takes a lot more computing power, but the advantage is that one doesn't end up over-fitting to a single training set.

If the training data is small, we run the risk of optimizing for the specific ratings that happen to be in the training set, instead of learning patterns that generalize to the test set.

So k-fold cross-validation provides some insurance against that, and ensures that we create a recommender system that works for any set of ratings, not just the ones in the training set we happened to choose.
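As a rough sketch, the same evaluation can be run with k-fold cross-validation. Again, scikit-surprise, the SVD algorithm, the MovieLens 100k data, and k = 5 are assumptions made for the example:

```python
# A sketch of k-fold cross-validation, assuming scikit-surprise,
# the built-in MovieLens 100k data, and 5 folds.
from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('ml-100k')
algo = SVD(random_state=1)

# Train and test on each of the 5 folds, then report per-fold
# RMSE/MAE scores and their averages.
results = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
print(results['test_rmse'].mean(), results['test_mae'].mean())
```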

Limitations

To reiterate, train/test and k-fold cross-validation are ways to measure the accuracy of a recommender system. That is, how accurately one can predict the ratings users gave to movies they have already seen.

But there is an important point. By using train/test, all we can do is test our ability to predict how people rated movies they already saw.

That's not the point of a recommender system. We want to recommend new things to people that they haven't seen, but would find interesting. However, that's fundamentally impossible to test offline.

So researchers who can't just test out new algorithms on real people, say on Netflix or Amazon, have to make do with approaches like these.

Mean Absolute Error (MAE)

MAE is given by the following equation:

MAE = \frac{\sum_{i=1}^{n} \left| \bar{r}_i - r_i \right|}{n} \\ \text{where } n \equiv \text{number of ratings in the test set,} \\ r_i \equiv \text{actual rating of the } i^{th} \text{ test set instance,} \\ \bar{r}_i \equiv \text{predicted rating of the } i^{th} \text{ test set instance.}
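A direct translation of this formula into plain Python (the function name and the example rating lists are made up for illustration):

```python
def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    assert len(predicted) == len(actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# e.g. mae([4.5, 3.0, 2.5], [5.0, 3.0, 1.0]) -> 0.666...
```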

Root Mean Squared Error (RMSE)

RMSE is given by the following equation:

RMSE = \sqrt{\frac{\sum_{i=1}^{n} \left( \bar{r}_i - r_i \right)^2}{n}} \\ \text{where } n \equiv \text{number of ratings in the test set,} \\ r_i \equiv \text{actual rating of the } i^{th} \text{ test set instance,} \\ \bar{r}_i \equiv \text{predicted rating of the } i^{th} \text{ test set instance.}
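And the same for RMSE, again as a plain-Python illustration with made-up example ratings:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    assert len(predicted) == len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# e.g. rmse([4.5, 3.0, 2.5], [5.0, 3.0, 1.0]) -> 0.9128...
```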

Evaluating Top-N Recommenders

Top-N Hit Rate

To compute top-N hit rate, we generate top-N recommendations for all of the users in the test set. If one of the items in a user's top-N recommendations is something they actually rated, we consider that a hit. It means we actually managed to show the user something they found interesting enough to watch on their own already, so we'll consider that a success.

Hit rate is computed by adding up all of the hits across the top-N recommendations of every user in the test set and dividing by the number of users.

\text{hit rate} = \frac{\text{hits}}{\#\text{users}}
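A minimal plain-Python sketch of this calculation; the top_n and test_items dictionaries are hypothetical inputs representing each user's top-N list and the items they actually rated in the test set:

```python
def hit_rate(top_n, test_items):
    """top_n: dict of user id -> list of recommended item ids.
    test_items: dict of user id -> set of item ids the user actually rated."""
    hits = 0
    for user, recs in top_n.items():
        # A hit means at least one recommended item was actually rated by the user.
        if any(item in test_items.get(user, set()) for item in recs):
            hits += 1
    return hits / len(top_n)
```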

Hit rate itself is easy to understand, but measuring it is a little bit tricky. We can't use the same train/test or cross-validation approach we used for measuring accuracy, because we're not measuring accuracy on individual ratings.

We're measuring the accuracy of top-N lists for individual users. A clever way around this is called leave-one-out cross-validation. What we do is compute the top-N recommendations for each user in our training data, after intentionally removing one of that user's rated items from the training data. We then test our recommender system's ability to recommend that left-out item in the top-N results created for that user in the testing phase. So, for each user, we measure our ability to recommend an item that was left out of the training data somewhere in their top-N list.

The trouble is that it's a lot harder to get one specific movie right than to just get any one of n recommendations right, so hit rate with leave-one-out tends to be very small and difficult to measure unless you have very large data sets to work with.

But it's a much more user-focused metric when you know your recommender system will be producing top-N lists in the real world, which most of them do.
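As an illustration, here is a sketch of measuring hit rate with leave-one-out cross-validation. The scikit-surprise library, the SVD algorithm, and N = 10 are assumptions; the top-N lists are built by predicting ratings for every item each user has not rated in the training set and keeping the ten highest, which is one common recipe rather than the only option:

```python
# A sketch of leave-one-out hit rate, assuming scikit-surprise,
# the built-in MovieLens 100k data, SVD, and top-10 lists.
from collections import defaultdict
from surprise import SVD, Dataset
from surprise.model_selection import LeaveOneOut

data = Dataset.load_builtin('ml-100k')

# Hold out exactly one rating per user for testing.
loo = LeaveOneOut(n_splits=1, random_state=1)
trainset, testset = next(iter(loo.split(data)))

algo = SVD(random_state=1)
algo.fit(trainset)

# Predict ratings for every (user, item) pair NOT in the training set,
# then keep each user's 10 highest-predicted items as their top-N list.
all_predictions = algo.test(trainset.build_anti_testset())
top_n = defaultdict(list)
for uid, iid, _, est, _ in all_predictions:
    top_n[uid].append((iid, est))
for uid, ratings in top_n.items():
    ratings.sort(key=lambda x: x[1], reverse=True)
    top_n[uid] = [iid for iid, _ in ratings[:10]]

# A hit means the single left-out item shows up in that user's top-N list.
hits = sum(1 for uid, iid, _ in testset if iid in top_n[uid])
print("Hit rate:", hits / len(testset))
```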

Average Reciprocal Hit Rate (ARHR)

A variation on hit rate is average reciprocal hit rate, or ARHR for short. This metric is just like hit rate, but it accounts for where in the top-N list your hits appear. So you end up getting more credit for successfully recommending an item in the top slot than in the bottom slot.

Again, this is a more user-focused metric, since users tend to focus on the beginning of lists. The only difference is that instead of summing up the number of hits, we sum up the reciprocal rank of each hit. So if we successfully predict a recommendation in slot three, that only counts as one third, but a hit in slot one of our top-N list receives the full weight of 1.0.

ARHR = \frac{\sum_{i=1}^{n} \frac{1}{rank_i}}{\#\text{users}}
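A plain-Python sketch of ARHR, mirroring the hit-rate example above; the top_n and left_out inputs are hypothetical (each user's ranked top-N list and the single item held out for that user under leave-one-out):

```python
def average_reciprocal_hit_rate(top_n, left_out):
    """top_n: dict of user id -> ranked list of recommended item ids.
    left_out: dict of user id -> the single item id held out for that user."""
    total = 0.0
    for user, item in left_out.items():
        recs = top_n.get(user, [])
        if item in recs:
            rank = recs.index(item) + 1   # 1-based position in the list
            total += 1.0 / rank           # slot 1 counts 1.0, slot 3 counts 1/3
    return total / len(left_out)
```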

Cumulative Hit Rate (cHR)
