Prediction of Yelp Rating using Yelp Reviews
Kartik Lunkad
May 12, 2015
Abstract
Yelp provides two main ways for users to
review businesses: reviews and stars.
Traditionally, businesses have focused on
their star rating to assess whether users
like their service or not. But reviews contain
huge amounts of data that is critical for
businesses and that they can take advantage
of. In this paper, we explore how reviews
can be used to predict the rating of a
business.
1 Introduction
Recommender systems have come a long
way in terms of modeling ratings for various
purposes such as predicting the future
rating of the product/business, identifying
the customer segment who is most
interested in the product and measuring
the success of a product/business. But
interestingly, very little work has been done
in the field of analyzing the reviews which
are provided by the users. These reviews
should not be ignored since they are a rich
source of information for the businesses.
In this paper, we look at these reviews to
predict the rating of the business. We have
focused only on restaurants for the purpose
of this research. Ratings tend to be biased
by each user's notion of what the rating for
a restaurant should be, and reviews can vary
widely in length, content and style. We try
to remove this bias by predicting the rating
purely from the content and style of the
reviews.
2 Related Work
There has been some previous work on
extracting information from user-written
reviews. This work began when Yelp started
its Dataset Challenge a few years ago.
One work focused on identifying the
subtopics in the reviews that are important
to users beyond the quality of the food [1].
The authors used online LDA, a generative
probabilistic model for collections of
discrete data such as text corpora.
Another interesting work I found in this
area was personalizing the ratings based on
the different topics extracted across
different user reviews [2]. This was done
using a modified, semantic-driven LDA.
The third work, which was closest to this
paper, focused on predicting the rating
using sentiment analysis [3]. Its scope was
limited to a single user and close to 1,000
reviews, so it did not provide the holistic
approach taken in this paper.
3 Data Collection
The data for the project was collected was
provided by Yelp themselves for a Yelp
dataset challenge which is conducted to
provide opportunities to explore a real
world dataset.
3. The size of the dataset itself is in millions of
records, but we have focused on specific
section of restaurants for the purpose of
the project.
4 Procedure Outline
The objective of this paper is to train a
classifier that can predict the rating of a
restaurant from reviews written by users.
This section outlines that process. Data
preparation is covered in Section 5, which
explains how the data is brought into a
form that can be used to build the models
and how it is divided into development,
cross-validation and test sets.
Section 6 presents a baseline performance,
using Naïve Bayes, SVM and Logistic
Regression with default settings on the
cross-validation dataset. In Section 7,
exploratory data analysis and error analysis
are performed on the development dataset,
along with feature engineering and feature
selection. Parametric optimization is
performed in Section 8; this includes a
comparison of baseline and optimized
performance. Finally, the optimized model
is trained on the cross-validation dataset
and used to classify instances in the test
dataset. The results are presented in
Section 9.
5 Data Preparation
I divided the data into three sets: a
development set, used for data exploration;
a cross-validation set; and a test set, to
be used after optimization.
We have taken close to 20,000 records of
Yelp's restaurant data. The development set
has close to 4,000 records, the
cross-validation set close to 14,000 and the
test set close to 2,000.
Yelp provides 5 entity types: business,
review, user, check-in & tip. We have
focused on the business and the review
entities for the project.
The business entity contains attributes such
as type, business_id, name, full_address,
city, state, latitude, longitude, stars,
review_count, categories, open, hours etc.
The review entity contains attributes such
as type, business_id, user_id, stars, text,
date & votes.
I identified a list of restaurants from the
business entity and then collected all the
reviews for those restaurants from the
review entity.
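As a rough illustration, the restaurant filter and join described above can be sketched as follows. The sample records and the "Restaurants" category check are assumptions made for illustration; the real dataset is far larger.

```python
# Hypothetical sketch of selecting restaurants from the business entity
# and collecting their reviews from the review entity; the records are
# made up, field names follow the attribute lists in this section.
businesses = [
    {"business_id": "b1", "categories": ["Restaurants", "Mexican"]},
    {"business_id": "b2", "categories": ["Shopping"]},
]
reviews = [
    {"review_id": "r1", "business_id": "b1", "stars": 5, "text": "Great tacos!"},
    {"review_id": "r2", "business_id": "b2", "stars": 3, "text": "Okay store."},
]

# Keep only businesses tagged as restaurants, then pull their reviews.
restaurant_ids = {b["business_id"] for b in businesses
                  if "Restaurants" in b["categories"]}
restaurant_reviews = [r for r in reviews if r["business_id"] in restaurant_ids]
print(len(restaurant_reviews))  # -> 1
```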
Also, I converted the numeric star columns
into nominal ones by mapping the values 1-5
to their equivalent nominal values (one,
two, three, four & five).
The final attribute list for the project was:
1. business_id
2. stars: overall average stars
3. review_stars: stars for the particular
review
4. nominal_stars
5. nominal_review_stars
6. review_id
We focus on predicting the
nominal_review_stars from the model we
build.
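The numeric-to-nominal conversion above can be sketched as follows; a minimal sketch in which the dictionary mapping and the helper name are illustrative assumptions consistent with the text.

```python
# Map the 1-5 star values to their nominal equivalents, as described
# in this section; add_nominal_stars is a hypothetical helper name.
NOMINAL = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}

def add_nominal_stars(record):
    """Return a copy of a record with nominal star columns added."""
    out = dict(record)
    out["nominal_stars"] = NOMINAL[record["stars"]]
    out["nominal_review_stars"] = NOMINAL[record["review_stars"]]
    return out

review = {"business_id": "b1", "review_id": "r1",
          "stars": 4, "review_stars": 5}
print(add_nominal_stars(review)["nominal_review_stars"])  # -> five
```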
6 Baseline Performance
I performed a baseline analysis using
slightly modified LightSide settings: a rare
threshold of 25 and feature selection of the
top 1,000 features. The models discussed in
this section were trained and tested on the
cross-validation dataset.
I first used Naïve Bayes which is a
probabilistic learning method.
           Naïve Bayes   SVM      Logistic Regression
Accuracy   0.4168        0.5235   0.5313
Kappa      0.2209        0.3506   0.3583
Table 1: Baseline Performance
Then, I ran a Support Vector Machine (SVM),
another supervised learning method.
The last model I ran was Logistic Regression.
Among the three models, SVM and Logistic
Regression had comparable performances
with Logistic Regression being slightly
better.
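The experiments in this section were run in LightSide; as a rough equivalent, the three baselines can be sketched with scikit-learn. The vectorizer settings (max_features standing in for the top-1,000 feature selection) and the toy reviews below are assumptions, not the paper's data.

```python
# Hedged sklearn sketch of the three baseline models with default
# classifier settings, trained and scored on a tiny made-up corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

reviews = ["great food and friendly service", "terrible food, rude staff",
           "okay place, nothing special", "great service and great food",
           "terrible place, never again", "okay food, average prices"]
labels = ["five", "one", "three", "five", "one", "three"]

X = CountVectorizer(max_features=1000).fit_transform(reviews)
for clf in (MultinomialNB(), LinearSVC(), LogisticRegression()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.score(X, labels))
```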
I then used the cross-validation dataset as
my training set and my development
dataset as my test set using Logistic
Regression as my model.
I performed error analysis on these results,
which I discuss in the next section.
7 Data Exploration
I now used the development dataset for
testing the trained model and also to
perform the error analysis.
There were two important observations from
the error analysis of the development
dataset.
First, I identified features with a low
vertical absolute difference, high feature
weight and high frequency between different
classes. But this was not useful, since
adjacent classes such as four and five, one
and two, and two and three were too similar
and hard to distinguish. I then focused on
pairs such as five & three and one & three.
The error analysis yielded two interesting
results. One, a good number of the features
were adjectives. Another feature with
extremely high influence was the exclamation
mark, which captured users' excitement about
a restaurant, both positive and negative.
The other aspect I noticed was that the
distribution of weights among all the
features was dispersed. This made me
speculate whether selecting only the top K
features would affect the model's accuracy.
I tried three feature engineering options
with SVM and Logistic Regression. The first
option was to use word/POS pairs; the second
was to use stretchy patterns with POS
adjectives as a category. I have provided
the Kappa and Accuracy values for these
efforts below.
                      Baseline           Word/POS pairs     POS adjectives
                      Kappa    Accuracy  Kappa    Accuracy  Kappa    Accuracy
SVM                   0.3249   0.5341    0.3470   0.5211    0.3474   0.5213
Logistic Regression   0.3305   0.5399    0.3530   0.5275    0.3530   0.5275
Table 2: Feature Engineering Comparison
From the table, we can see a distinct
improvement in Kappa with word/POS pairs and
POS adjectives over the baseline.
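As an illustration of the word/POS pair and POS-adjective features, here is a minimal sketch that assumes the review has already been POS-tagged (the tagger itself, handled by LightSide in the paper, is outside this snippet; the tags follow the Penn Treebank convention, and the helper names are illustrative):

```python
# Build word/POS pair features and extract adjectives from a tagged
# review; the tagged tokens are hard-coded for illustration.
def word_pos_pairs(tagged):
    """Turn (word, tag) tokens into 'word/TAG' features."""
    return ["{}/{}".format(w, t) for w, t in tagged]

def adjectives(tagged):
    """Keep only adjectives (Penn Treebank JJ* tags)."""
    return [w for w, t in tagged if t.startswith("JJ")]

tagged = [("the", "DT"), ("amazing", "JJ"), ("tacos", "NNS"),
          ("were", "VBD"), ("delicious", "JJ")]
print(word_pos_pairs(tagged)[:2])  # -> ['the/DT', 'amazing/JJ']
print(adjectives(tagged))          # -> ['amazing', 'delicious']
```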
The third option was to try different
feature selection sizes: the top 50, 100,
250, 500 & 1000 features. I have tabulated
the results for both feature engineering
efforts below; this table contains Kappa
values only.
Top k features        50       100      250      500      1000
SVM                   0.2768   0.3068   0.3408   0.3591   0.3474
Logistic Regression   0.2828   0.3115   0.3450   0.3579   0.3530
Table 3: Feature Selection Performance Comparison
From the table, we can see that performance
improves as the number of features increases
up to 500, after which it starts to
decrease. I discuss these results in detail
in Section 9.
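The top-k selection sweep above can be sketched with scikit-learn's SelectKBest. The chi-squared scoring criterion and the toy corpus are assumptions; LightSide's own selection criterion may differ.

```python
# Hedged sketch of keeping only the top k features before training;
# k would be 500 in the paper, capped here by the tiny toy vocabulary.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

reviews = ["great food", "bad food", "great service", "bad service"]
labels = ["five", "one", "five", "one"]

X = CountVectorizer().fit_transform(reviews)
k = min(500, X.shape[1])  # fall back when the vocabulary is small
X_top = SelectKBest(chi2, k=k).fit_transform(X, labels)
print(X_top.shape)  # -> (4, 4)
```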
Next, we will look at how we can
optimize/tune the model better to improve
the performance and accuracy for Logistic
Regression.
8 Optimization
For tuning the performance of Logistic
Regression, there are three main options:
L2 Regularization, L1 Regularization & L2
Regularization (Dual).
Below is a table showing the performance
of the different options.
                           Accuracy   Kappa
L2 Regularization          0.5336     0.3579
L1 Regularization          0.5374     0.3625
L2 Regularization (Dual)   0.5336     0.3579
Table 4: Logistic Regression Tuning
We can see that L1 Regularization gives a
distinct performance improvement over the
other two options.
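In scikit-learn terms (the paper used LightSide's built-in options), the L1-regularized choice can be sketched as follows; the solver and the toy data are assumptions.

```python
# Hedged sketch of L1-regularized logistic regression; liblinear is
# one of the sklearn solvers that supports the L1 penalty.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

reviews = ["loved the food", "hated the food",
           "loved the service", "hated the service"]
labels = ["five", "one", "five", "one"]

X = CountVectorizer().fit_transform(reviews)
clf = LogisticRegression(penalty="l1", solver="liblinear")
clf.fit(X, labels)
print(clf.score(X, labels))
```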
9 Results
From the different feature engineering and
model optimization efforts, I have come
across some interesting findings.
First, POS pairs/adjectives are better
features than unigrams alone. The reasoning
behind this is that POS pairs capture parts
of speech rather than just the words
themselves. Also, adjectives (positive or
negative) have a strong influence on the
rating since they indicate the sentiment of
the user.
The second finding is that 500 is an optimal
number of top k features to select before
the model is trained. This makes sense, as
too many features lower the weights of some
important features, whereas too few remove
some important features altogether.
Logistic Regression and SVM have similar
performances but Logistic Regression
performs slightly better when the model is
tuned.
10 Future Work
In this paper, I have focused on identifying
the features which influence the rating and
the model which performs best at predicting
the rating for all users.
In the future, I would like to derive from
the reviews the sub-genres, other than food,
that users generally care most about. By
identifying these sub-genres and giving each
an individual rating, we can compute an
overall review rating that might be even
more accurate.
The other direction I am interested in is
understanding the different types of users
who provide ratings and the factors they use
to decide on a particular rating for a
restaurant.
Finally, I would like to extend the research
to other types of businesses as well.
11 References
[1] J. Huang, S. Rogers and E. Joo, "Improving
Restaurants by Extracting Subtopics from
Yelp Reviews," 2013.
[2] J. Linshi, "Personalizing Yelp Star Ratings: a
Semantic Topic Modeling Approach".
[3] C. Li and J. Zhang, "Prediction of Yelp Review
Star Rating using Sentiment Analysis".