Goal

We want to recommend restaurants that the user will like based off that users existing reviews. We will do so by building a recommendation system that predicts whether a user will like a restaurant on a binary scale (good or bad).

There are a few scenarios that we're trying to capture when recommending a good restaurant:

  1. A restaurant that has a terrible rating but for some reason the user loves it
  2. A restaurant that has a terrible rating and the user also hates it
  3. A restaurant that has a really good rating but for some reason the user doesn't really like it
  4. A restaurant that has a really good rating and the user loves it

We want to recommend restaurants that match scenarios 1 and 4 and filter out for scenarios 2 and 3. We try to do this via the below hypotheses. The general idea is that we can capture the first portion of each scenario, "A restaurant that has [blank] rating", using linguistic tone. This is embodied in Hypothesis 2, where we define linguistic tone to be the word count of the negative and positive word categories in the Hu and Liu (2004) word dictionary. Hypothesis 1 tries to capture the second part of each scenario, "the user feels [blank] about it", by using unsupervised machine learning methods trained on each users specific syntax.

We can interpret the features from Hypothesis 2 as the the linguistic tone of the reviews and the features from Hypothesis 1 as the word choice embedded in the review.

For more information see:

Note: The method presented below is novel in the sense that we do not rely on restaurant attributes (Price, Locaation, Type of Food, Outdoor Seating Availability, etc), but only on the text of each users review. Although the inputs are different, the classification methods are similiar and we should expect similiar results if text is as good of a predictor as restaurant attributes.

Hypotheses

We create our recommendation using the following hypotheses:

Hyptohesis 1: The user's word choice in his review are correlated with the kinds of restaurants that the user likes to visit and the types of food that the user likes to consume.

Hyptohesis 2: The tone of a user's review is correlated with the kinds of restaurants that the user likes to visit and the types of food that the user likes to consume.

If the first hypothesis is true, then a user ought to like a restaurant that has reviews that use many of the same word choices. Word choice here means how a word is placed within a document. For example, "The Thai food here is disgusting" has a starkly different word choice than "I love the amazing Thai food here". To capture this, we use LDA to create a matrix representation of each review. The number topic loadings are selected based on an iterative process in which we adjust the number of loadings based on classification results.

If the second hypothesis is true, then a user ought to like a restaurant that has reviews that have a similiar kind of tone that she uses to describe restaurants that she likes. We define "tonally similiar" here to mean that the user's review, represented as a tf-idf matrix, has many of the same positive & negative word counts as the reviews of the new restaurant. Positive & negative words are defined using the Hu & Liu (2004) Yelp Dictionary.

We train a variety of machine learning algorithms (detailed below) on the reviews of new restaurants using the features from the two hypotheses. Next, we use the trained algorithms to classify user reviews into these new restaurants. The end result is that we have a tuple (User Review, User Rating, New Restaurant). Amongst this set of tuples, we filter for only user ratings that have at least 4 stars then take the top 5 restaurants that appear most often in our classification results and recommend those 5 restaurants to the user.

Hypothesis 1 Algorithm Sketch:

Let B be the set of user reviews. Let $Y_{s}$ be a set of reviews in a given location, s.

The elements of $Y_{s}$ and $B$ are characterized by the tuple $(t, r, R)$ where:

$t = \text{text}, r = \text{rating}, R = \text{Restaurant ID}$

We want a mapping such that $f(B): B \rightarrow \text{Most similiar R in } Y_{s}$

The goal of this algorithm sketch is to create a classification scheme using Hypothesis 1. We create a 25-word topic loading using a combination of top unique words and the top words in a given set of user reviews as defined by tf-idf. We choose 25 arbtirarily and will update the number of topic loadings as a function of how well our algorithms are able to classify user reviews. Next, we use LDA to represent the reviews in $Y_{s}$ and $B$ as dense matrices. We run classifcation algorithms on these matrices.

We propose the following algorithm:

  1. Build:

    1. For B, create the following lists:

      Good_User_Reviews = [review for review in B if rating(review) >= 4]
      Bad_User_reviews = [review for review in B if rating(review) < 4]
      
      Good_User_Words = [review.split(' ') for review in Good_User_Reviews]
      Bad_User_Words = [review.split(' ') for review in Good_User_Reviews]
      Good_User_Words = set([word for review in Good_Words for word in review])
      Bad_User_Words = set([word for review in Bad_Words for word in review])
      
      Overlap_User_Words = list(Good_Words.intersection(Bad_Words))
      
    2. For Y_{s}, create the following lists:

      Good_Reviews = [review for review in Y_{s} if rating(x) >= 4]
      Bad_reviews = [review for review in Y_{s} if rating(x) < 4]
      
      Good_Words = [review.split(' ') for review in Good_User_Reviews]
      Bad_Words = [review.split(' ') for review in Good_User_Reviews]
      Good_Words = set([word for review in Good_Words for word in review])
      Bad_Words = set([word for review in Bad_Words for word in review])
      
      Overlap_Words = list(Good_Words.intersection(Bad_Words))
      
    3. Remove all words in reviews of $B$ that are in the list Overlap_User_Words

    4. Remove all words in reviews of $Y_{s}$ that are in the list Overlap_Words
    5. Represent $B$ and $Y_{s}$ as LDA matrices using a loading of 25 topics over the entire corpus of reviews

Hypothesis 2 Algorithm Sketch

The goal of this algorithm sketch is to create a classification scheme using Hypothesis 2. We represent all the reviews in $B$ and $Y_{s}$ as tf-idf matrices, with the vocabulary defined using the H&L dictionary. Then we classify using the same machine learning algorithms as defined in Hypothesis 1.

More formally, we want a mapping such that $f(B): B \rightarrow \text{Most tonally similiar R in } Y_{s}$

  1. Build:

    1. Create the following lists:

      Good_User_Reviews = [review for review in B if rating(review) >= 4]
      Bad_User_reviews = [review for review in B if rating(review) < 4]
      
      Good_Reviews = [review for review in Y_{s} if rating(x) >= 4]
      Bad_reviews = [review for review in Y_{s} if rating(x) < 4]
      
    2. Create a bag of words using the H&L dictionary for all reviews in $B$ and $Y_{s}$

    3. Run TF-IDF on reviews in $B$ using the entire corpus
    4. Run TF-IDF on reviews in $Y_{s}$ using the entire corpus

Training and Classification

  1. Matrix Representation
    1. For each review in $Y_{s}$ and $B$, we represent them in a feature space with the first N columns being the tf-idf representation (hypothesis 2) and the next M columns being the word topic weights from the LDA representation (hypothesis 1) so that we have a totat of M + N features. The following table is a sketch of the feature space for all J reviews, where $J = |Y_{s}| + |B|$:
Term 1 ... Term N Word Topic 1 ... Word Topic M
Review 1 tf-idf weight ... tf-idf weight word topic weight ... word topic weight
... ... ... ... ... ... ...
Review J tf-idf weight .. tf-idf weight word topic weight ... word topic weight
  1. Train:

    1. We train each of our algorithms on the reviews in $Y_{s}$
    2. Each training point will be characterized by the tuple (Review as Feature Matrix, Restaurant Rating, Restaurant ID)
    3. We use the following algorithms:
      1. Random Forest
      2. Bagged Decision Tree
      3. Linear SVM
      4. SVM
      5. Multinomial Logit Classification
  2. Classification:

    1. We are interested in classifying the set of user reviews, $B$, into the restaurants within $Y_{s}$
      1. Note: we can generate artificial sets of user reviews using combinations of existing user reviews
    2. Try classifying the user reviews using all the different algorithms, pick the one that performs best at classifying the review's rating (using the testing process below)
      1. Each element in the test set will have the form (User Review, Predicted Review Rating, Actual Review Rating, Predicted Restaurant) we choose the classifier that minimizes the RSME of the review rating classifications
  3. Report:

    1. Return the restaurant classification results from the classification algorithm that performs best as the following tuple:
      1. (User Review, Predicted Restaurant Rating, Actual Review Rating, Predicted Restaurant)

Testing

Whether or not a user likes a recommendation is hard to capture because we don't know if they like it or not until after we recommend it. In the above design, we'd be recommending a new restaurant to the user. To see if they actually like it, we'd have to follow up after we make the recommendation and ask them how they felt. But this isn't feasible because we don't have a set of people we can just ask how they felt.

That is, we know $y_{pred}$ but we don't know $y_{actual}$.

But we can get around this because sometimes a user review rating is also that user's restaurant rating. For a given user, if she only has one review for a restaurant then the rating for that review is also her rating for the restaurant.

We can use this to test our recommendation system. We propose the following test design:

Build:

  • As before, let $B$ be the total set of user reviews.
  • Let $R$ be the set of restaurants that the user has reviewed only once.
  • Take some percentage, $p$, of the set $R$ and take the subset of reviews from $B$ that correspond to these restaurants. Let this be the set of test restaurants $B_{test}$.
  • Set the remaining $1-p$ percentage of the set $R$ and call this the training set of restaurants $R_{train}$.
  • Note every restaurant in $R_{train}$ and $B_{test}$ has User's Restaurant Rating = User's Review Rating

Run:

  1. For each review in $B_{test}$, create the tuple (User Review, Restaurant Rating, Restaurant ID) and replace the instance in $B_{test}$ with the tuple.

  2. For each restaurant in $R_{train}$, find the total set of reviews from the Reviews database. Let this set be $Y$, where each element in $Y$ is a tuple (User Review, Restaurant Rating, Restaurant ID)

  3. We run each of the algorithms above, using $Y$ as the training set and $B_{test}$ as the test set.

  4. Step 3 results in a set $B_{result}$ where each element is characterized by (User Review, Actual Restaurant Rating, Predicted Restaurant Rating, Predicted Restaurant). Note the cardinality of $B_{result}$ is the same as that of $B_{test}$
  5. Set $y_{pred} =$ I(Predicted Restaurant) and $y_{actual} =$ I(Restaurant) where the indicator function I() is 1 if the user rated the restaurant at least a 4 and is 0 if the user rated the restaurant less than a 4
  6. The RMSE for the recommended restaurants is given by the following loss function:
$$RMSE = \sum_{i=1}^{N} \sqrt{\frac{(y_{i, pred} - y_{i, actual})^{2}}{N}}$$

Where N is the number of recommended restaurants in $B_{result}$. $y_{i, pred}$ is the predicted restaurant rating, $y_{i, actual}$ is the actual rating that the user gave to the restaurant. A RSME score of 0 is a perfect score and means that the recommendation system did really well. In this case, success means that the recommendation system was able to accurately predict how the user would feel about the restaurant on a binary scale (good or bad).

Note, this function is analagous to the mean squared error loss function used in the 2013 Yelp RecSys challenge with the difference being that $y_{i, pred}$ and $y_{i, actual}$ are discrete categorical variables $\in \{1, 2, 3, 4, 5\}$:

$$RMSE = \sum_{i=1}^{N} \sqrt{\frac{(y_{i, pred} - y_{i, actual})^{2}}{N}}$$

Final Recommendation

We want to create a list of recommendations according to the following schema:

Same Tone Diff Tone
Same Word Choice Yes ML Result(Tone, Word Choice)
Diff Word Choice ML Result(Tone , Word Choice) No

After running the above algorithms and choosing the algorithm that best classifies using our testing process, we'll have the following list of classification results:

ml_result = [Results from Best Performing ML Algorithm]

With each element in the list containing the following tuple:

(User Review, User Rating, Classified Restaurant Rating, Classified Restaurant)

We create the following lists from our ML results:

good_results = [tup or tup in ml_result if abs(tup[1] - tup[2]) <= 1 & tup[1] >= 4]
restaurants = [tup[3] for tup in good_results]

We run the following process to populate a top 5 recommendation list:

from collections import Counter
restaurant_counter = Counter(restaurants)
try:
    recommendation_list = restaurant_counter.most_common(5)
except:
    recommendation_list = restaurant_counter.most_common()

The elements in recommendation_list will list the most commonly classified restaurants, based on the users review, in descending order. So the first element in recommendation_list will be the top recommendation, the second will be the second recommendation, etc.

So what?

Well, if the RMSE above is low we can say with some confidence that our recommendation system is able to give out some good suggestions. We're going to operate under the assumption that people want to go to places that they'll like, not places that they'll likely hate.

So to adjust for this, we keep only classification results where the user's rating is positive (greater than a 4). We also want to throw out the bad classifications, so we only keep classification results where the absolute difference between the classified restaurant rating and the user's rating is less than or equal to 1. The recommendations from this modification should only return restaurants that the user will probably like.

We can run the above test on a bunch of different user comments and we can generate our own by writing some sample Yelp reviews. We might even be able to test it out in DC using our own reviews of restaurants that we've been to in DC.

Also, we can try scrapping for other cities. We already have a script written for DC, and it might make more sense to use "live" data rather than what's available in the Yelp dataset.