Data Ingestion

We gathered raw data from the Yelp Academic Dataset, distributed through the Yelp Dataset Challenge at the following link: https://www.yelp.com/dataset_challenge

The main JSON files are below:

  1. yelp_academic_dataset_business
  2. yelp_academic_dataset_review
  3. yelp_academic_dataset_user

We focus on the review and business JSON files, and construct our training dataset using the cleaning algorithm described below.

Further, we have written our own Python scripts to scrape Yelp reviews in the DC area. For each restaurant, we record its longitude, latitude, its rating, and all of its reviews.

Data Cleaning Algorithm

For each business entry:

  • Keep the entry if the business is a restaurant
  • Keep the entry if its location is in the state list
  • Output the list of valid businesses
  • Output a JSON file of all valid businesses and their attributes
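
A minimal sketch of the business pass, assuming the challenge-era line-delimited layout (one JSON object per line, with categories stored as a list) and a hypothetical VALID_STATES whitelist:

```python
import json

# Hypothetical state whitelist; substitute the states chosen for the study.
VALID_STATES = {"PA", "NC", "AZ"}

valid_businesses = []
with open("yelp_academic_dataset_business.json") as infile:
    for line in infile:  # one JSON object per line
        business = json.loads(line)
        # Keep only restaurants located in a whitelisted state.
        if ("Restaurants" in business.get("categories", [])
                and business.get("state") in VALID_STATES):
            valid_businesses.append(business)

valid_ids = {b["business_id"] for b in valid_businesses}

with open("valid_businesses.json", "w") as outfile:
    json.dump(valid_businesses, outfile)
```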

For each review entry:

  • Keep only reviews for businesses in the valid business list
  • Create a dictionary with each business as a key
  • For each key, store the list of unique customers who reviewed that business
  • Find the set of unique customers across all businesses
  • Output a JSON file of all filtered reviews and their attributes
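
A corresponding sketch of the review pass, reusing valid_ids from the business pass above:

```python
import json
from collections import defaultdict

filtered_reviews = []
customers_by_business = defaultdict(set)  # business_id -> unique reviewers

with open("yelp_academic_dataset_review.json") as infile:
    for line in infile:
        review = json.loads(line)
        if review["business_id"] in valid_ids:
            filtered_reviews.append(review)
            customers_by_business[review["business_id"]].add(review["user_id"])

# Unique customers across all valid businesses.
all_customers = set().union(*customers_by_business.values())

with open("filtered_reviews.json", "w") as outfile:
    json.dump(filtered_reviews, outfile)
```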

Data Storage

We set up MongoDB on an Amazon EC2 instance within Amazon Web Services. Further, we save all of our raw JSON files on Amazon S3 and use wget to pull the raw files into our EC2 instance. From there, we are able to use pymongo to import data on the fly onto our local machines.
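
A minimal pymongo sketch of this workflow; the EC2 hostname and the database and collection names are placeholders:

```python
import json
from pymongo import MongoClient

# Placeholder EC2 host; in practice this is the instance's public DNS name.
client = MongoClient("ec2-xx-xx-xx-xx.compute-1.amazonaws.com", 27017)
db = client["yelp"]

# Load the filtered reviews (fetched from S3 via wget) into MongoDB.
with open("filtered_reviews.json") as infile:
    db.reviews.insert_many(json.load(infile))

# From a local machine, pull a slice of the data on the fly.
top_reviews = list(db.reviews.find({"stars": {"$gte": 4}}).limit(100))
```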

Data Analysis

For a first pass, we use the Python modules Pandas, NumPy, and Matplotlib to compute descriptive statistics for our dataset. Specifically, we look at the following:

  1. For the overall cleaned reviews dataset, we graph a histogram of review ratings in order to get a general idea of how they are distributed.
  2. For the overall cleaned restaurant dataset, we graph a histogram of restaurant ratings in order to get a general idea of how they are distributed. Ex-ante, we expect the two distributions to closely follow each other (i.e., to have a low Kullback-Leibler divergence).
  3. For each state, we perform steps 1 & 2 in order to get a general sense of how the ratings distribution for both reviews and restaurants differs across states.
  4. The restaurant dataset contains longitude and latitude coordinates, and we are interested in the geographic concentration of restaurant ratings. We set longitude as the X axis and latitude as the Y axis, create a mapping from a set of 5 distinct colors to the rating set {1, 2, 3, 4, 5}, and plot each restaurant on this graph. The result is a "heat map" of restaurants for each state (see the plotting sketch after this list).
  5. For the overall cleaned reviews and cleaned restaurant datasets, we compute basic statistics such as:
    • Mean
    • Median
    • Mode
    • Lowest quartile
    • Highest quartile
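
A minimal sketch of steps 1, 2, 4, and 5 with Pandas and Matplotlib, assuming the cleaned files produced by the cleaning algorithm above and that both datasets store ratings in a stars column (as the raw Yelp files do):

```python
import pandas as pd
import matplotlib.pyplot as plt

reviews = pd.read_json("filtered_reviews.json")     # cleaned reviews
businesses = pd.read_json("valid_businesses.json")  # cleaned restaurants

# Steps 1 and 2: histograms of review and restaurant ratings.
fig, (ax1, ax2) = plt.subplots(1, 2)
reviews["stars"].hist(ax=ax1, bins=5)
ax1.set_title("Review ratings")
businesses["stars"].hist(ax=ax2, bins=5)
ax2.set_title("Restaurant ratings")

# Step 4: scatter "heat map" of restaurants, colored by rating
# (half-star restaurant ratings are rounded to the nearest star).
colors = {1: "red", 2: "orange", 3: "yellow", 4: "lightgreen", 5: "green"}
plt.figure()
plt.scatter(businesses["longitude"], businesses["latitude"],
            c=businesses["stars"].round().map(colors), s=10)
plt.xlabel("Longitude")
plt.ylabel("Latitude")

# Step 5: basic statistics (mean, median, quartiles, etc.).
print(reviews["stars"].describe())
print(businesses["stars"].describe())
plt.show()
```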

Data Modelling

The ultimate goal of this project is to determine what makes a restaurant "good" on Yelp. In the parlance of machine learning, we would like to develop a method for classifying reviews; then, using a restaurant's reviews, we classify the restaurant's rating. We train on a subset of the Yelp Academic dataset, identified using the Data Analysis process above, and use our DC-area restaurants and their associated reviews as the test set.

We will focus on the following machine learning/econometric models (a training sketch follows the list):

  1. Random Forest
  2. Bagged Decision Trees
  3. Multinomial Logistic Regression
  4. Support Vector Machine
  5. Linear Support Vector Machine
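
A minimal scikit-learn sketch of training these five models is below; the TF-IDF bag-of-words representation and the toy training data are illustrative assumptions, not a fixed design choice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC

# Placeholder training data: review texts and their 1-5 star ratings.
train_texts = ["great food and service", "terrible, never again",
               "decent but slow", "best meal of my life", "awful food"]
train_stars = [5, 1, 3, 5, 1]

# One reasonable text representation: TF-IDF bag of words.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100),
    "bagged_trees": BaggingClassifier(n_estimators=100),  # decision trees by default
    "multinomial_logit": LogisticRegression(multi_class="multinomial",
                                            solver="lbfgs"),
    "svm": SVC(),
    "linear_svm": LinearSVC(),
}
for model in models.values():
    model.fit(X_train, train_stars)
```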

Data Product

Our primary data product will be a dataframe, where each tuple in the frame is defined by the following:

(Review Text, Review Rating, Random Forest Prediction, Bagged Decision Tree Prediction, Multinomial Logistic Prediction, Support Vector Machine Prediction, Linear Support Vector Machine Prediction)

That is, our data product will primarily consist of a dataframe containing our test data attributes and the results from each of our classification algorithms.
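
Building on the training sketch above, the data product could be assembled roughly as follows; test_texts and test_stars stand in for the DC-area test reviews and their observed ratings:

```python
import pandas as pd

# Placeholder test data: DC-area review texts and observed ratings.
test_texts = ["amazing brunch spot", "never coming back"]
test_stars = [5, 1]

# One row per test review: its attributes plus each model's prediction.
X_test = vectorizer.transform(test_texts)
product = pd.DataFrame({"review_text": test_texts,
                        "review_rating": test_stars})
for name, model in models.items():
    product[name + "_prediction"] = model.predict(X_test)
print(product)
```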

Furthermore, depending on the classification algorithm used, the top features have interpretable results. For example, with a linear support vector machine, the absolute value of each feature's coefficient weight gives an indication of how important that feature was in classification relative to the other features. This arises because a linear SVM separates the classes with a hyperplane; the feature weights are the components of a vector orthogonal to that hyperplane.
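
For instance, continuing the sketch above, the most influential features per rating class can be read off the LinearSVC's coef_ attribute (get_feature_names_out assumes scikit-learn 1.0 or later):

```python
import numpy as np

# Each row of coef_ is a weight vector normal to one class's separating
# hyperplane; a large |weight| means the feature matters more for that class.
feature_names = np.array(vectorizer.get_feature_names_out())
linear_svm = models["linear_svm"]
for stars, weights in zip(linear_svm.classes_, linear_svm.coef_):
    top = np.argsort(np.abs(weights))[::-1][:10]
    print(stars, feature_names[top].tolist())
```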

Data Visualization

Depending on the interpretability of the top features, we will create visualizations for each classification algorithm. We will also present the "heat map" described above, as well as the ratings distributions for reviews and businesses. Finally, we will craft a visualization that summarizes the results of our data product.