Yelp is an American multinational corporation headquartered in San Francisco, California. It develops, hosts and markets Yelp.com and the Yelp mobile app, which publish crowd-sourced reviews about local businesses, as well as the online reservation service Yelp Reservations and online food-delivery service Eat24. The company also trains small businesses in how to respond to reviews, hosts social events for reviewers, and provides data about businesses, including health inspection scores.
Yelp.com is a crowd-sourced local business review and social networking site. Its user community is primarily active in major metropolitan areas. The site has pages devoted to individual locations, such as restaurants or schools, where Yelp users can submit a review of their products or services using a one-to-five-star rating system. Businesses can also update contact information, hours and other basic listing information, or add special deals. In addition to writing reviews, users can react to reviews, plan events or discuss their personal lives. According to Sterling Market Intelligence, Yelp is "one of the most important sites on the Internet." As of Q2 2016, it had 168 million monthly unique visitors and 108 million reviews.
78 percent of businesses listed on the site have a rating of three stars or better, but some negative reviews are very personal or extreme. Many reviews are written in an entertaining or creative manner. Users can give a review a "thumbs-up" if it is "useful, funny or cool." Each day a "Review of the Day" is determined based on a vote by users.
The objective of this project is to unearth the "topics" being talked about in Yelp Reviews, understand their distribution and develop an understanding of Yelp Reviews that will serve as a foundation to tackle more sophisticated questions in the future, such as:
Cultural Trends: What makes a particular city different? What cuisines do Yelpers rave about in different countries? Do Americans tend to eat out late compared to those in Germany or the U.K.? In which countries are Yelpers sticklers for service quality? In international cities such as Montreal, are French speakers reviewing places differently than English speakers?
Inferring Categories: Are there any non-intuitive correlations between business categories, e.g., how many karaoke bars also offer Korean food, and vice versa? What businesses deserve their own subcategory (e.g., Szechuan or Hunan versus just "Chinese restaurants")?
Detecting Sarcasm in Reviews: Are Yelpers a sarcastic bunch?
Detecting Changepoints and Events: Detecting when things change suddenly (e.g., a business coming under new management or when a city starts going nuts over cronuts)
In [ ]:
### Link to requirements.txt on github
The data for the project was obtained from https://www.yelp.com/dataset_challenge/dataset.
import os
import pandas as pd

def convert_json_to_csv(json_file_dir, csv_file_dir):
    for filename in os.listdir(json_file_dir):
        if filename.endswith('.json'):
            try:
                pd.read_json(os.path.join(json_file_dir, filename), lines=True).to_csv(
                    os.path.join(csv_file_dir, filename.replace('.json', '.csv')),
                    encoding='utf-8', index=False)
            except ValueError:
                print(filename + ': error')
convert_json_to_csv('/home/amlanlimaye/yelp-dataset-challenge/data/raw/',
'/home/amlanlimaye/yelp-dataset-challenge/data/interim/original_csv/')
table_names = ['business', 'review', 'user', 'checkin', 'tip']
original_csv_filepath = '/home/amlanlimaye/yelp-dataset-challenge/data/interim/original_csv/'

for tbl_name in table_names:
    globals()[tbl_name] = pd.read_csv(original_csv_filepath + tbl_name + '.csv')
# Sample row in the attributes column:
# u"[BikeParking: True, BusinessAcceptsBitcoin: False, BusinessAcceptsCreditCards: True, BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}, DogsAllowed: False, RestaurantsPriceRange2: 2, WheelchairAccessible: True]"
# Function to clean an attributes string so that it becomes JSON-parseable:
# swap brackets for braces, lowercase the booleans, swap single quotes for
# double quotes, and use a regex to quote the bare top-level keys
import json
import re

def clean_business(attr_str):
    attr_str = attr_str.replace('[', '{').replace(']', '}')
    attr_str = attr_str.replace('True', 'true').replace('False', 'false')
    attr_str = attr_str.replace('\'', '"')
    for match in re.findall("([A-Za-z0-9]+)(?=:)", attr_str):
        attr_str = attr_str.replace(match, '"%s"' % match)
    return attr_str

business['attributes'] = business['attributes'].fillna('[]').map(clean_business)
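As a sanity check, the same cleaning steps can be applied to (a shortened version of) the sample row above in a self-contained sketch, with the replacements inlined so it runs on its own:

```python
import json
import re

# Shortened sample value from business['attributes'], as in the raw CSV
raw = ("[BikeParking: True, BusinessAcceptsBitcoin: False, "
       "BusinessParking: {'garage': False, 'lot': True}, "
       "RestaurantsPriceRange2: 2]")

s = raw.replace('[', '{').replace(']', '}')
s = s.replace('True', 'true').replace('False', 'false').replace("'", '"')
for match in re.findall("([A-Za-z0-9]+)(?=:)", s):
    s = s.replace(match, '"%s"' % match)  # quote the bare top-level keys

attrs = json.loads(s)
print(attrs['BusinessParking']['lot'])  # True
```

Note that the lookahead `(?=:)` only matches the unquoted top-level keys; the nested keys like `'garage'` are already quoted after the single-to-double-quote swap, so the regex leaves them alone.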
# Function to extract attributes from the json object and convert them into columns
def expand_features(row):
    try:
        extracted = json.loads(row['attributes'])
        for key, value in extracted.items():
            if not isinstance(value, dict):
                row['attribute_' + key] = value
            else:
                # Nested dicts (e.g., BusinessParking) become attribute_<key>_<subkey> columns
                for attr_key, attr_value in value.items():
                    row['attribute_' + key + '_' + attr_key] = attr_value
    except ValueError:
        print('could not decode: ' + str(row['attributes']))
    return row
business = business.apply(expand_features, axis=1)
business.columns
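The flattening rule can be seen in isolation with a small hypothetical helper (same logic as expand_features, minus the DataFrame plumbing):

```python
def flatten_attributes(attrs, prefix='attribute_'):
    # Non-dict values become attribute_<key>; nested dicts
    # (e.g., BusinessParking) become attribute_<key>_<subkey>.
    flat = {}
    for key, value in attrs.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[prefix + key + '_' + sub_key] = sub_value
        else:
            flat[prefix + key] = value
    return flat

flatten_attributes({'BikeParking': True, 'BusinessParking': {'lot': True}})
# {'attribute_BikeParking': True, 'attribute_BusinessParking_lot': True}
```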
# Cleaning 'categories' column
business['categories'] = business['categories'].fillna(' ')
business['categories'] = business['categories'].map(lambda x: x[1:-1].split(','))
# Cleaning 'hours' column
business['hours'] = business['hours'].fillna(' ')
business['hours'] = business['hours'].map(lambda x: x[1:-1].split(','))
# Cleaning 'neighborhood' column
business['neighborhood'] = business['neighborhood'].fillna(' ')
business['neighborhood'] = business['neighborhood'].map(lambda x: x[1:-1].split(','))
# Cleaning 'postal_code' column: drop the trailing '.0' left over from the
# codes being read in as floats
business['postal_code'] = business['postal_code'].astype(str).map(lambda x: x[:-2])
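The bracket-strip-and-split used for 'categories', 'hours', and 'neighborhood' is easy to check on a toy value; this standalone sketch also strips the whitespace that the raw split leaves behind:

```python
def split_listish(value):
    # "[Burgers, Fast Food]" -> ['Burgers', 'Fast Food']
    return [item.strip() for item in value[1:-1].split(',')]

split_listish('[Burgers, Fast Food]')
```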
In [27]:
business.head(2)
Out[27]:
In [40]:
review.head(2)
Out[40]:
In [30]:
review.text.head(2)
Out[30]:
In [18]:
review_all = pd.read_csv('../../data/interim/original_csv/review.csv')
# Number of reviews by date
# The sharp seasonal falls are Christmas Day and New Year's Day
# The sharp seasonal spikes are in summer, where people presumably have more free time
review.groupby('date').agg({'review_id': len}).reset_index().plot(x='date', y='review_id', figsize=(10,6))
Out[18]:
# Cleaning 'time' column
checkin['time'] = checkin['time'].map(lambda x: x[1:-1].split(','))
# Making columns aggregating checkins by day of week, then converting each
# day's list of 'Day-hour:count' strings to an {hour: count} dict so that
# the number of checkins can be looked up by hour
for day in ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']:
    col = day.lower() + '_list'
    checkin[col] = checkin['time'].map(
        lambda items, d=day: [item for item in items if d in item])
    checkin[col] = checkin[col].map(
        lambda items, d=day: {int(item.replace(' ', '').replace(d + '-', '').split(':')[0]):
                              int(item.replace(' ', '').replace(d + '-', '').split(':')[1])
                              for item in items})
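A quick standalone check of that parsing: each entry looks like 'Mon-17:3', meaning three check-ins in the 17:00 hour. The helper below is hypothetical (the notebook does the same transformation inline):

```python
def hour_counts(items, day):
    # ['Mon-17:3', ' Mon-20:1'] -> {17: 3, 20: 1}
    result = {}
    for item in items:
        hour, count = item.replace(' ', '').replace(day + '-', '').split(':')
        result[int(hour)] = int(count)
    return result

hour_counts(['Mon-17:3', ' Mon-20:1'], 'Mon')
```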
In [34]:
checkin.head(2)
Out[34]:
# Cleaning 'elite' column
user['elite'] = user['elite'].map(lambda x: x[1:-1].split(','))
# Cleaning 'friends' column
user['friends'] = user['friends'].map(lambda x: x[1:-1].split(','))
# Cleaning 'yelping since' column
user['yelping_since'] = pd.to_datetime(user['yelping_since'])
In [36]:
user.head(2)
Out[36]:
In [37]:
tip.head(2)
Out[37]:
In [38]:
tip.text.head(2)
Out[38]:
A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
LDA (Latent Dirichlet Allocation) is an example of a topic model that posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics.
LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you:
Decide on the number of words N the document will have (say, according to a Poisson distribution).
Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have two topics, food and cute animals, you might choose the document to consist of 1/3 food and 2/3 cute animals.
Generate each word in the document by:
- First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
- Then using the topic to generate the word itself (according to the topic's multinomial distribution). For instance, the food topic might output the word "broccoli" with 30% probability, "bananas" with 15% probability, and so on.
Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
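The generative story above can be sketched in a few lines of NumPy. The vocabulary and topic-word probabilities here are made up for illustration, and this simulates only the forward process, not the inference that LDA actually performs:

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ['broccoli', 'bananas', 'spinach', 'kitten', 'puppy', 'hamster']
# Made-up topic-word distributions: topic 0 = food, topic 1 = cute animals
topics = np.array([
    [0.30, 0.15, 0.55, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.40, 0.40, 0.20],
])

def generate_document(alpha=(1.0, 1.0), mean_length=8):
    n_words = rng.poisson(mean_length)       # 1. choose the document length
    theta = rng.dirichlet(alpha)             # 2. choose a topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)                       # 3a. pick a topic
        words.append(vocab[rng.choice(len(vocab), p=topics[z])])   # 3b. pick a word from it
    return theta, words

theta, doc = generate_document()
```

Running this repeatedly yields documents whose word mixes reflect their sampled topic mixtures, which is exactly the structure LDA tries to recover from real reviews.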
In [3]:
import pandas as pd
import numpy as np
import seaborn as sns # For prettier plots. Seaborn takes over pandas' default plotter
import nltk
import pyLDAvis
import pyLDAvis.sklearn
from gensim import models, matutils
from collections import defaultdict
from gensim import corpora
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
pyLDAvis.enable_notebook()
%matplotlib inline
In [4]:
review = pd.read_csv('../../data/interim/clean_US_cities/2016_review.csv')
review = review.fillna('')
tvec = TfidfVectorizer(stop_words='english', min_df=10, max_df=0.5, max_features=100,
norm='l2',
strip_accents='unicode'
)
review_dtm_tfidf = tvec.fit_transform(review['text'])
cvec = CountVectorizer(stop_words='english', min_df=10, max_df=0.5, max_features=100,
strip_accents='unicode')
review_dtm_cvec = cvec.fit_transform(review['text'])
print(review_dtm_tfidf.shape, review_dtm_cvec.shape)
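For intuition, this is what a count document-term matrix is, in miniature (pure Python; CountVectorizer additionally handles tokenization rules, stop words, accent stripping, and the min_df/max_df/max_features pruning used above):

```python
def count_dtm(docs):
    # One column per vocabulary term, one row of counts per document.
    vocab = sorted({token for doc in docs for token in doc.lower().split()})
    index = {token: j for j, token in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in doc.lower().split():
            row[index[token]] += 1
        rows.append(row)
    return vocab, rows

vocab, dtm = count_dtm(['great tacos', 'tacos tacos here'])
# vocab: ['great', 'here', 'tacos']; dtm: [[1, 0, 1], [0, 1, 2]]
```

The tf-idf variant additionally downweights terms that appear in many documents, which is why the two matrices above have the same shape but different values.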
In [12]:
# Fitting LDA models
# On cvec DTM
lda_cvec = LatentDirichletAllocation(n_topics=10, random_state=42)
lda_cvec.fit(review_dtm_cvec)
# On tfidf DTM
lda_tfidf = LatentDirichletAllocation(n_topics=10, random_state=42)
lda_tfidf.fit(review_dtm_tfidf)
Out[12]:
In [15]:
lda_viz_10_topics_cvec = pyLDAvis.sklearn.prepare(lda_cvec, review_dtm_cvec, cvec)
lda_viz_10_topics_cvec
# topic labels
Out[15]:
In [ ]:
topics_labels = {
1: "customer_feelings",
2: "customer_actions",
3: "restaurant_related",
4: "compliments",
5: "las_vegas_related",
6: "hotel_related",
7: "location_related",
8: "chicken_related",
9: "superlatives",
10: "ordering_pizza"
}
In [8]:
vocab = {v: k for k, v in cvec.vocabulary_.items()}
vocab
Out[8]:
In [9]:
lda_ = models.LdaModel(
    matutils.Sparse2Corpus(review_dtm_cvec, documents_columns=False),
    # or use the corpus object created with the dictionary in a later cell:
    # corpus,
    num_topics=10,
    passes=1,
    id2word=vocab
    # or use the gensim dictionary object:
    # id2word = dictionary
)
In [ ]:
stops = stopwords.words()
docs = pd.DataFrame(review_dtm_cvec.toarray(), columns=cvec.get_feature_names())
docs.sum()
bow = []
for document in review_dtm_cvec.toarray():
single_document = []
for token_id, token_count in enumerate(document):
if token_count > 0:
single_document.append((token_id, token_count))
bow.append(single_document)
# remove words that appear only once across the review texts
frequency = defaultdict(int)
for text in review['text']:
    for token in text.split():
        frequency[token] += 1
texts = [[token for token in text.split() if frequency[token] > 1 and token not in stops]
         for text in review['text']]
# Create gensim dictionary object
dictionary = corpora.Dictionary(texts)
# Create corpus matrix
corpus = [dictionary.doc2bow(text) for text in texts]
lda_.print_topics(num_topics=3, num_words=5)
lda_.get_document_topics(bow[0])
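The Dictionary/doc2bow pair can be mimicked in a few lines to show the bag-of-words format gensim expects. These are hypothetical helpers for illustration, not gensim's implementation:

```python
def build_dictionary(texts):
    # token -> integer id, analogous to corpora.Dictionary
    tokens = sorted({token for text in texts for token in text})
    return {token: i for i, token in enumerate(tokens)}

def doc2bow(text, dictionary):
    # list of tokens -> sorted [(token_id, count), ...]
    counts = {}
    for token in text:
        token_id = dictionary[token]
        counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())

d = build_dictionary([['good', 'food', 'good'], ['bad', 'service']])
doc2bow(['good', 'food', 'good'], d)
# [(1, 1), (2, 2)]  with d = {'bad': 0, 'food': 1, 'good': 2, 'service': 3}
```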
In [ ]:
doc_topics = [lda_.get_document_topics(doc) for doc in corpus]
topic_data = []
for document_id, topics in enumerate(doc_topics):
    for topic, probability in topics:
        topic_data.append({
            'document_id': document_id,
            'topic_id': topic,
            'topic': topics_labels[topic + 1],  # topics_labels is 1-indexed (pyLDAvis convention)
            'probability': probability
        })
topics_df = pd.DataFrame(topic_data[:5])
topics_df.pivot_table(values="probability", index=["document_id", "topic"]).T
The next step is to generate a matrix of topic probabilities for each review. I did start on this, but the first attempt crashed and the second was still running after five hours, so it clearly needs more work. The sheer size of the data (roughly 400K reviews) is a challenge, but I'm optimistic it can be done soon.
I was able to discover reasonably distinct abstract topics in Yelp reviews, understand their distribution, and develop an understanding of Yelp reviews that will serve as a foundation to tackle the more sophisticated and ambitious questions outlined at the start of this project: cultural trends, inferring categories, detecting sarcasm in reviews, and detecting changepoints and events.
Chen, E. (2011, August 28). What is a good explanation of Latent Dirichlet Allocation? Retrieved February 8, 2017, from https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation/answer/Edwin-Chen-1
Wikipedia. (2017, January 7). Topic model. Retrieved February 8, 2017, from https://en.wikipedia.org/wiki/Topic_model
Wikipedia. (2017, January 20). Latent Dirichlet allocation. Retrieved February 8, 2017, from https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
Wikipedia. (2004, October 20). Yelp. Retrieved February 8, 2017, from https://en.wikipedia.org/wiki/Yelp
Yelp. (2017, January 24). Yelp Dataset Challenge. Retrieved February 8, 2017, from https://www.yelp.com/dataset_challenge