We gathered raw data from the Yelp Dataset Challenge (the Yelp Academic dataset), available at the following link: https://www.yelp.com/dataset_challenge
The main JSON files are below:
We focus on the review and business JSON files. We constructed our training dataset using the following cleaning algorithm.
Further, we have written our own Python scripts to scrape Yelp reviews in the DC area. For each restaurant, we have its longitude, latitude, its rating, and all of its reviews.
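The Yelp Academic files are line-delimited JSON (one object per line). As an illustration of the kind of cleaning step described above, the sketch below loads such files and keeps only reviews attached to businesses tagged as restaurants. The field names (`business_id`, `categories`) follow the Yelp dataset schema; the inline sample data is a toy stand-in, not the real files.

```python
import json

def load_json_lines(path):
    """Yelp Academic files are line-delimited JSON: one object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def restaurant_reviews(businesses, reviews):
    """Keep only the reviews whose business is tagged as a restaurant."""
    restaurant_ids = {b["business_id"] for b in businesses
                      if "Restaurants" in (b.get("categories") or [])}
    return [r for r in reviews if r["business_id"] in restaurant_ids]

# Toy stand-ins for business.json / review.json records:
businesses = [
    {"business_id": "b1", "categories": ["Restaurants", "Pizza"], "stars": 4.0},
    {"business_id": "b2", "categories": ["Auto Repair"], "stars": 3.5},
]
reviews = [
    {"business_id": "b1", "stars": 5, "text": "Great slice."},
    {"business_id": "b2", "stars": 2, "text": "Slow service."},
]
print(restaurant_reviews(businesses, reviews))
```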
For a first pass, we use the Python modules Pandas, NumPy, and Matplotlib to compute descriptive statistics for our data set. Specifically, we look at the following descriptive statistics:
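As a minimal sketch of this first pass, assuming the cleaned reviews have been loaded into a Pandas DataFrame with a star rating and a review-length column (both column names are hypothetical):

```python
import pandas as pd

# Toy stand-in for the cleaned review frame.
df = pd.DataFrame({
    "stars": [5, 4, 4, 2, 1, 5, 3],
    "review_len": [120, 85, 90, 200, 340, 60, 150],
})

print(df.describe())                            # count, mean, std, min, quartiles, max
print(df["stars"].value_counts().sort_index())  # rating distribution
```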
The ultimate goal of this project is to find what makes a restaurant "good" on Yelp. In the parlance of machine learning, we would like to develop a method to classify individual reviews; then, using a restaurant's reviews, we would like to classify the restaurant's rating. We train on a subset of the Yelp Academic data set, identified using the above Data Analysis process, and use our DC-area restaurants and their associated reviews as the test set.
We will focus on the following machine learning/econometric models:
Our primary data product will be a dataframe, where each tuple in the frame is defined by the following:
(Review Text, Review Rating, Random Forest Prediction, Bagged Decision Tree Prediction, Multinomial Logistic Prediction, Support Vector Machine Prediction, Linear Support Vector Machine Prediction)
That is, our data product will primarily consist of a dataframe containing our test data attributes and the results from each of our classification algorithms.
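A sketch of how such a dataframe could be assembled with scikit-learn, using the five classifiers named in the tuple above. The toy training and test reviews, the TF-IDF featurization, and all hyperparameters here are illustrative assumptions, not the project's actual pipeline:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC

# Toy stand-ins for the cleaned Yelp training set and the DC-area test set.
train_text = ["great food", "terrible service", "loved it", "awful meal",
              "amazing place", "never again"]
train_y = [5, 1, 5, 1, 5, 1]
test_text = ["great place", "awful service"]
test_y = [5, 1]

vec = TfidfVectorizer()
X_train, X_test = vec.fit_transform(train_text), vec.transform(test_text)

# BaggingClassifier's default base estimator is a decision tree; with more
# than two rating classes, LogisticRegression handles the multinomial case.
models = {
    "Random Forest Prediction": RandomForestClassifier(n_estimators=50, random_state=0),
    "Bagged Decision Tree Prediction": BaggingClassifier(random_state=0),
    "Multinomial Logistic Prediction": LogisticRegression(max_iter=1000),
    "Support Vector Machine Prediction": SVC(),
    "Linear Support Vector Machine Prediction": LinearSVC(),
}

result = pd.DataFrame({"Review Text": test_text, "Review Rating": test_y})
for name, model in models.items():
    result[name] = model.fit(X_train, train_y).predict(X_test)
print(result)
```

Each row of `result` is one tuple of the form given above: the test review, its true rating, and one predicted rating per classifier.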
Furthermore, depending on the classification algorithm used, the top features can be interpreted directly. For example, with a linear support vector machine, the absolute value of each feature's coefficient weight gives an indication of how important that feature was in classification relative to the other features. This arises because a linear SVM separates the classes with a hyperplane; the feature weights are the elements of a vector orthogonal to that hyperplane.
Depending on the interpretability of the top features, we will create visualizations for each classification algorithm. We will also present the "heat" map above, as well as the ratings distributions for reviews and for businesses. Finally, we will craft a visualization that summarizes the results of our data product.
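A ratings-distribution plot of the kind mentioned above can be produced with Matplotlib; the star counts and output filename here are toy assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from collections import Counter

# Toy stand-in for the review star ratings.
review_stars = [5, 4, 4, 5, 3, 1, 2, 5, 4, 3]
counts = Counter(review_stars)

fig, ax = plt.subplots()
ax.bar(sorted(counts), [counts[s] for s in sorted(counts)])
ax.set_xlabel("Star rating")
ax.set_ylabel("Number of reviews")
ax.set_title("Distribution of review ratings")
fig.savefig("review_ratings.png")
```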