Matrix factorization for predicting attractiveness ratings


I applied matrix factorization methods to build a personalized recommendation system for women, using a ratings dataset from a dating website.

Compared with a “highest average rating” recommender system, the personalized recommender system reduced the fraction of suboptimal recommendations by an average of ~3% across users, and by as much as 7% for users who rated large numbers of profiles.

On online dating websites, users with the highest average ratings are in high demand. By contrast, users whose profiles are recommended by a personalized recommender system are less likely to be overwhelmed by suitors, and so more likely to reciprocate a suitor’s interest. So a personalized recommender system would likely increase the fraction of suitable matches found by more than the numbers above suggest.

The code corresponding to this post is on GitHub.

The data

The dataset was made available by Vaclav Petricek. It contains 17 million ratings of profiles by ~135k users in total. Ratings are on a scale from 1 to 10.

A quick look at the dating website shows that profiles consist mainly of photos, suggesting that the ratings can be interpreted primarily as ratings of attractiveness.

I restricted consideration to ratings by women of men’s profiles. As is standard in predictive modeling, I split the data three ways, using 60% of the data for a train set, 20% for a validation set for hyperparameter tuning, and 20% for a test set.
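A 60 / 20 / 20 split like this can be sketched as follows. This is a minimal illustration, assuming the split is made uniformly at random over individual ratings (the post does not say how the split was drawn); the function name `split_ratings` and the fixed seed are hypothetical.

```python
import numpy as np

def split_ratings(n_ratings, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return shuffled index arrays for the train, validation and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_ratings)            # random order of rating indices
    n_train = int(round(fractions[0] * n_ratings))
    n_valid = int(round(fractions[1] * n_ratings))
    train = order[:n_train]
    valid = order[n_train:n_train + n_valid]
    test = order[n_train + n_valid:]              # whatever remains
    return train, valid, test

# Toy usage with 1000 ratings (assumed size, for illustration only).
train_idx, valid_idx, test_idx = split_ratings(1000)
```

The validation indices would be used only for choosing hyperparameters, and the test indices only for the final error estimate.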

The algorithm

The algorithm that I used is a matrix factorization method similar to the algorithms that performed best on the Netflix Prize. The idea is to approximate the ratings matrix as a product of two much smaller matrices, one of which represents different “types” of users, and the other of which represents different “types” of items. These matrices can then be multiplied together to predict the ratings that have not yet been filled in.

Having done this, we can recommend to each user the items for which that user’s predicted rating is highest. In our case, instead of predicting ratings of movies by users, we’re predicting ratings of user profiles by users.
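To make the idea concrete, here is a small sketch of predicting ratings from two factor matrices and picking the top recommendations. The factor values are random placeholders, and the sizes (4 users, 5 profiles, 2 latent “types”) and the helper `top_recommendations` are assumptions for illustration, not values from the post.

```python
import numpy as np

# Hypothetical factors: 4 users x 2 latent "types", and 2 "types" x 5 profiles.
rng = np.random.default_rng(0)
user_factors = rng.normal(size=(4, 2))
item_factors = rng.normal(size=(2, 5))

# The product gives a predicted rating for every (user, profile) pair,
# including pairs that were never rated.
predicted = user_factors @ item_factors

def top_recommendations(predicted_row, already_rated, k):
    """Indices of the k highest-predicted profiles the user has not yet rated."""
    scores = np.array(predicted_row, dtype=float)
    rated = np.array(sorted(already_rated), dtype=int)
    scores[rated] = -np.inf                       # never re-recommend rated profiles
    return np.argsort(-scores)[:k]

# Recommend 2 profiles to user 0, who has already rated profile 1.
recs = top_recommendations(predicted[0], already_rated={1}, k=2)
```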

The specific matrix factorization method that I used is SOFT-IMPUTE, from “Spectral Regularization Algorithms for Learning Large Incomplete Matrices” by Mazumder, Hastie and Tibshirani (2010). The authors write:

Our semidefinite-programming algorithm is readily scalable to large matrices; for example SOFT-IMPUTE takes a few hours to compute low-rank approximations of a 10^6 ×10^6 incomplete matrix with 10^7 observed entries, and fits a rank-95 approximation to the full Netflix training set in 3.3 hours. Our methods achieve good training and test errors and exhibit superior timings when compared to other competitive state-of-the-art techniques.
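The core SOFT-IMPUTE iteration is simple to state: fill the missing entries with the current estimate, take an SVD, shrink the singular values by a threshold, and repeat. The sketch below shows that loop with a dense NumPy SVD. It is not the code behind this post, and it is only practical for small matrices; the scalable version in the paper exploits sparse-plus-low-rank structure instead. The parameter names `lam`, `n_iters` and `tol` are my own.

```python
import numpy as np

def soft_impute(X, observed, lam, n_iters=100, tol=1e-5):
    """Dense Soft-Impute sketch.

    X        : ratings matrix (entries where `observed` is False are ignored)
    observed : boolean mask of the same shape as X
    lam      : soft-threshold applied to the singular values
    """
    Z = np.zeros(X.shape, dtype=float)
    for _ in range(n_iters):
        # Keep observed ratings, fill the rest with the current estimate.
        filled = np.where(observed, X, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s_shrunk = np.maximum(s - lam, 0.0)       # soft-threshold singular values
        Z_new = (U * s_shrunk) @ Vt
        # Relative squared change, guarded against division by zero.
        change = np.linalg.norm(Z_new - Z) ** 2 / max(np.linalg.norm(Z) ** 2, 1e-12)
        Z = Z_new
        if change < tol:
            break
    return Z

# Toy usage: a rank-1 matrix with one hidden entry.
X = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
observed = np.ones(X.shape, dtype=bool)
observed[3, 2] = False
Z = soft_impute(X, observed, lam=0.1)
```

Larger values of `lam` zero out more singular values and so give a lower-rank estimate; in practice `lam` is the kind of hyperparameter one would tune on the validation set.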

It’s plausible that SOFT-IMPUTE outperforms the algorithms used in the paper that originally accompanied the dataset, “Recommender System for Online Dating Service” (2007), but an apples-to-apples comparison isn’t possible, because the authors didn’t specify a train / test split in their paper.


The root mean squared error of the “highest average rating” recommender system was ~1.73 points on the test set, on a scale from 1 to 10. By contrast, the personalized recommender system had a root mean squared error of only ~1.66 points.
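For reference, root mean squared error can be computed as below. The baseline here assumes that the “highest average rating” system predicts each profile’s mean training rating, falling back to the global mean for profiles with no training ratings; that reading, the function names, and the toy numbers are assumptions, not details from the post.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

def profile_mean_baseline(train_profiles, train_ratings, test_profiles):
    """Predict each test rating as the profile's mean rating in the train set."""
    global_mean = float(np.mean(train_ratings))
    totals, counts = {}, {}
    for profile, rating in zip(train_profiles, train_ratings):
        totals[profile] = totals.get(profile, 0.0) + rating
        counts[profile] = counts.get(profile, 0) + 1
    # Profiles unseen in training fall back to the global mean (an assumption).
    return [totals[p] / counts[p] if p in counts else global_mean
            for p in test_profiles]

# Toy data: profile ids and ratings on the 1-10 scale.
baseline_pred = profile_mean_baseline(
    train_profiles=[0, 0, 1, 1, 2],
    train_ratings=[8, 10, 3, 5, 7],
    test_profiles=[0, 1, 3],
)
baseline_rmse = rmse(baseline_pred, [9, 6, 7])
```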

I estimated the practical significance of this change in terms of the decrease in the fraction of “bad” recommendations. For each user, I defined “bad” recommendations as follows.

  • I counted the number of profiles in the test set that the user gave her highest rating. Call this number N.
  • I calculated the fraction of the recommender system’s top N recommendations that the user did not give her highest rating.
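The two steps above can be sketched for a single user as follows. This assumes that the recommender ranks only the profiles the user rated in the test set (otherwise the true rating would be unknown), and that ties in predicted rating are broken by position; both are assumptions, as is the function name.

```python
import numpy as np

def bad_recommendation_fraction(actual, predicted):
    """Fraction of the top-N predicted profiles that did not get the user's top rating.

    actual, predicted : one user's true and predicted ratings for the same
                        test-set profiles, in the same order.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    top_rating = actual.max()
    n = int(np.sum(actual == top_rating))          # N: profiles given the top rating
    top_n = np.argsort(-predicted, kind="stable")[:n]
    return float(np.mean(actual[top_n] != top_rating))

# Toy user: three profiles rated 10, so N = 3.
fraction = bad_recommendation_fraction(
    actual=[10, 10, 7, 4, 10],
    predicted=[9.1, 6.0, 8.5, 3.0, 9.4],
)
```

In this toy case the top three predictions include one profile that the user rated 7, so one of the three recommendations counts as bad.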

The average fraction of bad recommendations from the “highest average rating” recommender system was 24.8%. By comparison, the average fraction of bad recommendations from the personalized recommender system was 24.1%, a relative reduction of ~3% (0.7 percentage points out of 24.8).

The percent reduction in bad recommendations increases with the number of profiles that a user has rated.

This makes sense: the more profiles a user has rated, the more we know about how that user’s preferences differ from the average user’s preferences.