Bayesian adjustment of Yelp ratings

(The code corresponding to this post is on GitHub.)

Yelp helps not only customers, but also small business owners. Despite this, some business owners feel unfairly maligned by negative reviews. For example, in answer to the Quora question “How Reliable Are Yelp Reviews?”, Dean Vinsomnia wrote:

I am a small business owner on yelp. I had a disgruntled customer write a horendously dishonest and inaccurate review on yelp. Yelp is not conserned with mediation. They feel that 1 star rating are 1 star for a reason. This same user also gave the humane society 2 stars because she felt they sold her a stupid cat (as opposed to a smart one) yet this review stuck….

Such complaints don’t give enough credit to readers, who know that individual reviewers can be unreliable, so that a single positive or negative review should be taken with a grain of salt. Most people would choose a restaurant with a 4.5 star average based on 10 ratings over a restaurant with a single 5 star rating.

But I’ve heard friends and family trying to figure out how much weight to give small numbers of reviews when deciding where to go, often without coming to a confident conclusion. People really can be misled by small numbers of reviews, because they lack a good intuitive sense of how large regression to the mean is in this context.

Yelp released a dataset consisting of ~1 million reviews, together with other information on the reviewers and on the businesses being reviewed, which we can use to explore this question. Using methods from Bayesian statistics, we can use our knowledge of all business ratings to estimate what each business’s average rating would be if it were rated by a huge number of people rather than by just a few people. (The numerical rating may not be the most important part of a review, but it’s important, and also relatively easy to analyze.)
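
To make the idea concrete, here is a minimal sketch of one common way to do this kind of adjustment: shrink each business’s sample mean toward the overall mean, treating the prior as a number of “pseudo-reviews”. This is not necessarily the model behind the results in this post; the function name `shrunk_mean` and the values 3.7 (overall mean) and 5 (prior weight) are made up for illustration.

```python
def shrunk_mean(ratings, prior_mean, prior_weight):
    """Pull the sample mean of `ratings` toward `prior_mean`.

    `prior_weight` acts like a number of imaginary reviews that all
    equal `prior_mean`; both values are assumptions, not fitted here.
    """
    n = len(ratings)
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + n)


# Hypothetical prior: overall mean of 3.7 stars, worth 5 pseudo-reviews.
single_five = shrunk_mean([5], prior_mean=3.7, prior_weight=5)
ten_reviews = shrunk_mean([4] * 5 + [5] * 5, prior_mean=3.7, prior_weight=5)

print(round(single_five, 2))  # about 3.92
print(round(ten_reviews, 2))  # about 4.23
```

With these assumed values, the restaurant with a single 5 star rating ends up ranked below the one with a 4.5 star average based on 10 ratings, which matches the intuition described above.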

Let’s say that we just look at the ratings given between 2014 and 2015, and restaurants in categories with at least 100 restaurants. The distribution of the average number of stars by business (simple arithmetic mean, with no rounding) is then:


Without even checking, we can be sure that the fraction of restaurants with a given star average jumps at exactly 5 stars not because some restaurants are so amazing that nobody on earth could rate them below 5 stars, but because these restaurants have been reviewed by too few people. And indeed, checking the numerical data shows that 99% of the restaurants with an average rating of 5 also have 7 or fewer reviews.
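
A back-of-the-envelope calculation shows why perfect averages are concentrated among rarely reviewed restaurants. Suppose, purely as an assumption, that each review of a very good restaurant is 5 stars independently with probability 0.4. Then the chance that all n reviews are 5 stars is 0.4^n. The helper name `prob_all_five` is hypothetical.

```python
def prob_all_five(p_five, n_reviews):
    """Chance that n independent reviews are all 5 stars.

    Assumes each review is 5 stars with the same probability `p_five`,
    independently of the others (a simplification).
    """
    return p_five ** n_reviews


# p_five = 0.4 is an illustrative guess, not a value estimated from the data.
for n in (1, 3, 7, 20):
    print(n, prob_all_five(0.4, n))
```

Under this assumption the chance of a perfect average is already below 0.2% after 7 reviews, so almost every restaurant still sitting at exactly 5 stars must have very few reviews.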

To get a sense for how much of the variation in average ratings is coming from some restaurants not having been reviewed by enough people, we can try plotting average rating against review count:


This isn’t so helpful: we can’t see what’s going on very clearly because of the small number of restaurants with more than 250 reviews. We could simply remove all restaurants with more than 250 reviews and then look again, but we’d have a similar problem: the main thing that we’d see is that a small number of restaurants have a lot of reviews. We can do better by comparing the business average with the logarithm of the number of reviews:


This spreads out the data nicely on the vertical axis. The two most visible things are

  1. The fractal-like character of the lower portion of the graph, which comes from the fact that users give ratings that are exactly 1, 2, 3, 4 or 5 stars rather than somewhere in between two of them.
  2. The distribution of average ratings of those restaurants that have log(Review_Count) of at least 3, that is, the restaurants with at least 20 reviews (the logarithm here is natural, and e^3 ≈ 20).
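
Both observations can be checked with a few lines of Python. This is an illustrative sketch rather than code from the post, and the helper name `attainable_averages` is hypothetical. A restaurant with n reviews can only have an average that is a multiple of 1/n, which is what produces the lattice-like pattern at low review counts, and e^3 is just over 20.

```python
import math
from fractions import Fraction


def attainable_averages(n_reviews):
    """All distinct averages possible when each rating is an integer from 1 to 5."""
    # The sum of n integer ratings can be any integer from n (all 1s) to 5n (all 5s).
    return [Fraction(total, n_reviews)
            for total in range(n_reviews, 5 * n_reviews + 1)]


for n in (1, 2, 3):
    values = attainable_averages(n)
    print(n, len(values), [str(v) for v in values])

# A natural log of 3 corresponds to roughly 20 reviews.
print(math.exp(3))
```

With n reviews there are only 4n + 1 possible averages, so the points are widely spaced for small n and fill in as n grows.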

Let’s take a closer look at (2), filtering for restaurants with at least 20 reviews, and plotting the density of restaurants of a given rating average:


The density plot reveals a noticeably stronger central tendency than is present when we don’t filter for restaurants that have been reviewed a lot, which is what one expects on general statistical grounds. But the fact that the peak occurs at exactly 4 stars also suggests that the phenomenon is coming in part from a general tendency of people to give established restaurants that they review 4 stars. It makes sense intuitively: in a competitive marketplace, the businesses that succeed are the restaurants that are “above average” in some sense, that is, restaurants that were better than the ones people had been going to before they discovered the ones that they currently go to. Hence people rate the ones that they go to 4 or higher. On the other hand, 5 out of 5 is the highest possible score, so giving it can feel like an overly strong endorsement.

We’re interested in determining the “true” quality of a restaurant with a small number of ratings – what its average rating would be if a huge number of people were to rate it. Now we run into a stumbling block:

  • We know that those restaurants that do go on to get at least 20 reviews likely had quality distributed according to the green graph above before they got 20 reviews.
  • We can estimate the fraction of restaurants that once had 5 reviews that went on to get 20 or more reviews.
  • We know that the quality of restaurants that won’t go on to get 20 or more reviews is, on average, lower than the quality of restaurants that do go on to get lots of reviews.

But it’s not clear how large we should expect the difference in the third bullet point to be, and this substantially muddies the picture.

Still, even without doing further analysis, it is in fact possible to use off-the-shelf software to predict future ratings from past ratings better than is possible just by using each restaurant’s current moving average.

It’s not obvious what the best measure of predictive power to use is, but correlation is the simplest. For the subset of data that I looked at, just using the moving average gave a correlation of 0.357, whereas a simple Bayesian model taking regression to the mean into account gave a correlation of 0.361. If we take the average of the two models’ predictions, the correlation with the number of stars rises to 0.364. Since averaging the two models’ predictions is a naive thing to do, it’s likely that one would get a further improvement by improving the off-the-shelf model. One also sees a small boost by taking into account the category of the restaurant, and a larger boost by looking at trends in ratings over time for each restaurant, and then taking into account the amount of variation that one would expect to see by chance in those trends across restaurants.
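
For readers who want to run this kind of comparison themselves, here is a minimal sketch of the evaluation step: compute the Pearson correlation between each set of predictions and the ratings that were actually given, then do the same for the simple average of the two sets of predictions. The helper names `pearson` and `blend` and the toy numbers are invented for illustration; they are not the data or the code behind the figures quoted above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def blend(preds_a, preds_b):
    """Element-wise average of two lists of predictions."""
    return [(a + b) / 2 for a, b in zip(preds_a, preds_b)]


# Toy, made-up values: the stars actually given, and two models' predictions.
actual = [4, 5, 3, 4, 2, 5]
moving_avg_pred = [5.0, 4.0, 3.5, 4.5, 1.0, 4.2]
bayes_pred = [4.2, 4.0, 3.7, 4.1, 3.0, 4.1]

print("moving average:", pearson(moving_avg_pred, actual))
print("bayesian model:", pearson(bayes_pred, actual))
print("blend of both: ", pearson(blend(moving_avg_pred, bayes_pred), actual))
```

Correlation ignores the scale of the predictions, which is one reason it is only a rough measure here: a model can correlate well with future ratings while still being systematically too high or too low.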

I’ll write more later. Most of my code is here: