We want to recommend restaurants that the user will like based on that user's existing reviews. We will do so by building a recommendation system that predicts whether a user will like a restaurant on a binary scale (good or bad).
There are a few scenarios that we're trying to capture when recommending a good restaurant:
We want to recommend restaurants that match scenarios 1 and 4 and filter out scenarios 2 and 3. We try to do this via the hypotheses below. The general idea is that we can capture the first portion of each scenario, "A restaurant that has [blank] rating", using linguistic tone. This is embodied in Hypothesis 2, where we define linguistic tone as the word count of the negative and positive word categories in the Hu and Liu (2004) word dictionary. Hypothesis 1 tries to capture the second part of each scenario, "the user feels [blank] about it", by using unsupervised machine learning methods trained on each user's specific syntax.
We can interpret the features from Hypothesis 2 as the linguistic tone of the reviews and the features from Hypothesis 1 as the word choice embedded in the review.
For more information see:
Note: The method presented below is novel in the sense that we do not rely on restaurant attributes (Price, Location, Type of Food, Outdoor Seating Availability, etc.), but only on the text of each user's reviews. Although the inputs are different, the classification methods are similar, and we should expect similar results if text is as good a predictor as restaurant attributes.
We create our recommendation using the following hypotheses:
Hypothesis 1: The user's word choice in their reviews is correlated with the kinds of restaurants that the user likes to visit and the types of food that the user likes to consume.
Hypothesis 2: The tone of a user's review is correlated with the kinds of restaurants that the user likes to visit and the types of food that the user likes to consume.
If the first hypothesis is true, then a user ought to like a restaurant whose reviews use many of the same word choices. Word choice here means how a word is placed within a document. For example, "The Thai food here is disgusting" has a starkly different word choice than "I love the amazing Thai food here". To capture this, we use LDA to create a matrix representation of each review. The number of topic loadings is selected through an iterative process in which we adjust it based on classification results.
If the second hypothesis is true, then a user ought to like a restaurant whose reviews have a tone similar to the one she uses to describe restaurants that she likes. We define "tonally similar" here to mean that the user's review, represented as a tf-idf matrix, has many of the same positive & negative word counts as the reviews of the new restaurant. Positive & negative words are defined using the Hu & Liu (2004) dictionary.
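As a concrete sketch of the tone features, the snippet below builds a tf-idf matrix restricted to a positive/negative vocabulary. The six-word list is a hypothetical stand-in for the Hu & Liu (2004) dictionary (the real lexicon contains several thousand words), and scikit-learn's `TfidfVectorizer` handles the weighting:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the Hu & Liu (2004) positive/negative word lists;
# the real dictionary is much larger.
hl_vocabulary = ["love", "amazing", "delicious", "disgusting", "terrible", "bland"]

reviews = [
    "I love the amazing Thai food here",
    "The Thai food here is disgusting and bland",
]

# Restricting the vocabulary means each review is represented only by
# weights on positive/negative words -- our proxy for tone.
vectorizer = TfidfVectorizer(vocabulary=hl_vocabulary)
tone_matrix = vectorizer.fit_transform(reviews)

print(tone_matrix.shape)  # (2, 6): one row per review, one column per lexicon word
```

Two reviews are "tonally similar" in this representation when their rows put weight on the same positive or negative columns.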
We train a variety of machine learning algorithms (detailed below) on the reviews of new restaurants using the features from the two hypotheses. Next, we use the trained algorithms to classify user reviews into these new restaurants. The end result is a set of tuples (User Review, User Rating, New Restaurant). From this set, we keep only user ratings of at least 4 stars, then take the 5 restaurants that appear most often in our classification results and recommend those 5 restaurants to the user.
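A minimal sketch of this train-then-classify step, using toy reviews, hypothetical restaurant IDs, and a logistic regression on tf-idf features as a stand-in for the algorithms detailed below:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training reviews of new restaurants (Y_s) and one user review (B)
train_texts = [
    "great thai curry and friendly service",
    "amazing pad thai",
    "cold soggy burger",
    "rude staff and a bad burger",
]
train_restaurants = ["thai_place", "thai_place", "burger_joint", "burger_joint"]
user_reviews = ["loved the thai curry here"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)
X_user = vec.transform(user_reviews)

# Train on the new restaurants' reviews, then classify the user's review
# into one of those restaurants.
clf = LogisticRegression().fit(X_train, train_restaurants)
predicted_restaurant = clf.predict(X_user)[0]
print(predicted_restaurant)
```

In the full pipeline this prediction, paired with the user's rating, becomes one (User Review, User Rating, New Restaurant) tuple.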
Let $B$ be the set of user reviews. Let $Y_{s}$ be a set of reviews in a given location, $s$.
The elements of $Y_{s}$ and $B$ are characterized by the tuple $(t, r, R)$ where:
$t = \text{text}, r = \text{rating}, R = \text{Restaurant ID}$
We want a mapping such that $f(B): B \rightarrow \text{Most similar R in } Y_{s}$
The goal of this algorithm sketch is to create a classification scheme using Hypothesis 1. We create a 25-topic loading using a combination of the top unique words and the top words in a given set of user reviews as defined by tf-idf. We choose 25 arbitrarily and will update the number of topic loadings as a function of how well our algorithms classify user reviews. Next, we use LDA to represent the reviews in $Y_{s}$ and $B$ as dense matrices. We then run classification algorithms on these matrices.
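The LDA step can be sketched as follows with scikit-learn; the toy corpus and the choice of 2 topics (rather than 25) are purely illustrative:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the reviews in B and Y_s
reviews = [
    "great thai curry and friendly service",
    "the thai curry was amazing",
    "terrible burger and slow service",
    "cold burger and rude staff",
]

counts = CountVectorizer().fit_transform(reviews)

# Each review becomes a dense vector of topic proportions -- the matrix
# representation we feed to the classification algorithms.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_matrix = lda.fit_transform(counts)

print(topic_matrix.shape)  # (4, 2): one topic distribution per review
```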
We propose the following algorithm:
Build:
For $B$, create the following lists:
Good_User_Reviews = [review for review in B if rating(review) >= 4]
Bad_User_Reviews = [review for review in B if rating(review) < 4]
Good_User_Words = [review.split(' ') for review in Good_User_Reviews]
Bad_User_Words = [review.split(' ') for review in Bad_User_Reviews]
Good_User_Words = set([word for review in Good_User_Words for word in review])
Bad_User_Words = set([word for review in Bad_User_Words for word in review])
Overlap_User_Words = list(Good_User_Words.intersection(Bad_User_Words))
For $Y_{s}$, create the following lists:
Good_Reviews = [review for review in Y_{s} if rating(review) >= 4]
Bad_Reviews = [review for review in Y_{s} if rating(review) < 4]
Good_Words = [review.split(' ') for review in Good_Reviews]
Bad_Words = [review.split(' ') for review in Bad_Reviews]
Good_Words = set([word for review in Good_Words for word in review])
Bad_Words = set([word for review in Bad_Words for word in review])
Overlap_Words = list(Good_Words.intersection(Bad_Words))
Remove all words in reviews of $B$ that are in the list Overlap_User_Words
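The overlap-word filter amounts to dropping, from each review in $B$, any word that appears in both the good and bad user reviews. A minimal sketch, with a hypothetical overlap set standing in for the Build step's output:

```python
# Hypothetical overlap set computed in the Build step above
overlap_user_words = {"the", "food", "place"}

user_reviews = ["the thai food was amazing", "the burger place was terrible"]

# Keep only words outside the overlap set
filtered = [
    " ".join(w for w in review.split(" ") if w not in overlap_user_words)
    for review in user_reviews
]
print(filtered)  # ['thai was amazing', 'burger was terrible']
```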
The goal of this algorithm sketch is to create a classification scheme using Hypothesis 2. We represent all the reviews in $B$ and $Y_{s}$ as tf-idf matrices, with the vocabulary defined using the H&L dictionary. Then we classify using the same machine learning algorithms as defined in Hypothesis 1.
More formally, we want a mapping such that $f(B): B \rightarrow \text{Most tonally similar R in } Y_{s}$
Build:
Create the following lists:
Good_User_Reviews = [review for review in B if rating(review) >= 4]
Bad_User_Reviews = [review for review in B if rating(review) < 4]
Good_Reviews = [review for review in Y_{s} if rating(review) >= 4]
Bad_Reviews = [review for review in Y_{s} if rating(review) < 4]
Create a bag of words using the H&L dictionary for all reviews in $B$ and $Y_{s}$
| | Term 1 | ... | Term N | Word Topic 1 | ... | Word Topic M |
|---|---|---|---|---|---|---|
| Review 1 | tf-idf weight | ... | tf-idf weight | word topic weight | ... | word topic weight |
| ... | ... | ... | ... | ... | ... | ... |
| Review J | tf-idf weight | ... | tf-idf weight | word topic weight | ... | word topic weight |
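A sketch of how the two feature blocks can be concatenated into the matrix above; the tf-idf block comes from the vectorizer, while the topic weights here are placeholder values standing in for the LDA output:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["amazing thai food", "disgusting cold burger"]

# Terms 1..N: tf-idf weights
tfidf_block = TfidfVectorizer().fit_transform(reviews)

# Word Topics 1..M: placeholder topic weights standing in for the LDA output
topic_block = np.array([[0.9, 0.1], [0.2, 0.8]])

# One row per review: tf-idf columns followed by topic columns
features = hstack([tfidf_block, topic_block])
print(features.shape)
```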
Train:
Classification:
Report:
Whether or not a user likes a recommendation is hard to measure because we don't find out until after we make the recommendation. In the design above, we'd be recommending a new restaurant to the user. To see if they actually liked it, we'd have to follow up afterwards and ask them how they felt. But this isn't feasible because we don't have a set of users we can survey.
That is, we know $y_{pred}$ but we don't know $y_{actual}$.
But we can get around this because sometimes a user's review rating is also that user's restaurant rating. For a given user, if she has only one review for a restaurant, then the rating of that review is also her rating for the restaurant.
We can use this to test our recommendation system. We propose the following test design:
Build:
Run:
For each review in $B_{test}$, create the tuple (User Review, Restaurant Rating, Restaurant ID) and replace the instance in $B_{test}$ with the tuple.
For each restaurant in $R_{train}$, find the total set of reviews from the Reviews database. Let this set be $Y$, where each element in $Y$ is a tuple (User Review, Restaurant Rating, Restaurant ID)
We run each of the algorithms above, using $Y$ as the training set and $B_{test}$ as the test set.
Where N is the number of recommended restaurants in $B_{result}$, $y_{i, pred}$ is the predicted restaurant rating, and $y_{i, actual}$ is the actual rating that the user gave to the restaurant. An RMSE score of 0 is a perfect score and means the recommendation system predicted every rating exactly. In this case, success means that the recommendation system was able to accurately predict how the user would feel about the restaurant on a binary scale (good or bad).
Note, this function is analogous to the root mean squared error loss function used in the 2013 Yelp RecSys challenge, with the difference being that $y_{i, pred}$ and $y_{i, actual}$ are discrete categorical variables $\in \{1, 2, 3, 4, 5\}$:
$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_{i, pred} - y_{i, actual})^{2}}$$

We want to create a list of recommendations according to the following schema:
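A small worked example of the RMSE computation, with toy predicted and actual ratings:

```python
import math

# Toy predicted vs. actual restaurant ratings for N = 4 recommendations
y_pred = [4, 5, 3, 4]
y_actual = [4, 4, 3, 5]

n = len(y_pred)
rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(y_pred, y_actual)) / n)
print(rmse)  # sqrt(2 / 4) ~= 0.707; 0 would mean every rating was predicted exactly
```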
| | Same Tone | Diff Tone |
|---|---|---|
| Same Word Choice | Yes | ML Result(Tone, Word Choice) |
| Diff Word Choice | ML Result(Tone, Word Choice) | No |
After running the above algorithms and choosing the algorithm that best classifies using our testing process, we'll have the following list of classification results:
ml_result = [Results from Best Performing ML Algorithm]
With each element in the list containing the following tuple:
(User Review, User Rating, Classified Restaurant Rating, Classified Restaurant)
We create the following lists from our ML results:
good_results = [tup for tup in ml_result if abs(tup[1] - tup[2]) <= 1 and tup[1] >= 4]
restaurants = [tup[3] for tup in good_results]
We run the following process to populate a top-5 recommendation list:
from collections import Counter
restaurant_counter = Counter(restaurants)
# most_common(5) simply returns fewer entries if there are fewer than 5
# distinct restaurants, so no exception handling is needed
recommendation_list = restaurant_counter.most_common(5)
The elements in recommendation_list are the most commonly classified restaurants, based on the user's reviews, in descending order. So the first element in recommendation_list is the top recommendation, the second is the second recommendation, and so on.
Well, if the RMSE above is low we can say with some confidence that our recommendation system is able to give out some good suggestions. We're going to operate under the assumption that people want to go to places that they'll like, not places that they'll likely hate.
So to adjust for this, we keep only classification results where the user's rating is positive (4 stars or higher). We also want to throw out the bad classifications, so we keep only results where the absolute difference between the classified restaurant rating and the user's rating is at most 1. The recommendations from this modification should only return restaurants that the user will probably like.
We can run the above test on a bunch of different user comments and we can generate our own by writing some sample Yelp reviews. We might even be able to test it out in DC using our own reviews of restaurants that we've been to in DC.
Also, we can try scraping other cities. We already have a script written for DC, and it might make more sense to use "live" data rather than what's available in the Yelp dataset.