Discovering Abstract Topics in Yelp Reviews - Technical Report

1. Background

Yelp is an American multinational corporation headquartered in San Francisco, California. It develops, hosts and markets Yelp.com and the Yelp mobile app, which publish crowd-sourced reviews about local businesses, as well as the online reservation service Yelp Reservations and online food-delivery service Eat24. The company also trains small businesses in how to respond to reviews, hosts social events for reviewers, and provides data about businesses, including health inspection scores.

Yelp.com is a crowd-sourced local business review and social networking site. Its user community is primarily active in major metropolitan areas.The site has pages devoted to individual locations, such as restaurants or schools, where Yelp users can submit a review on their products or services using a one to five star rating system.Businesses can also update contact information, hours and other basic listing information or add special deals. In addition to writing reviews, users can react to reviews, plan events or discuss their personal lives. According to Sterling Market Intelligence, Yelp is "one of the most important sites on the Internet." As of Q2 2016 it has 168 million monthly unique visitors and 108 million reviews.

78 percent of businesses listed on the site have a rating of three stars or better, but some negative reviews are very personal or extreme. Many reviews are written in an entertaining or creative manner. Users can give a review a "thumbs-up" if it is "useful, funny or cool." Each day a "Review of the Day" is determined based on a vote by users.

2. Problem Statement

The objective of this project is to unearth the "topics" being talked about in Yelp Reviews, understand their distribution and develop an understanding of Yelp Reviews that will serve as a foundation to tackle more sophisticated questions in the future, such as:

  • Cultural Trends: What makes a particular city different? What cuisines do Yelpers rave about in different countries? Do Americans tend to eat out late compared to those in Germany or the U.K.? In which countries are Yelpers sticklers for service quality? In international cities such as Montreal, are French speakers reviewing places differently than English speakers?

  • Inferring Categories: Are there any non-intuitive correlations between business categories e.g., how many karaoke bars also offer Korean food, and vice versa? What businesses deserve their own subcategory (i.e., Szechuan or Hunan versus just "Chinese restaurants")

  • Detecting Sarcasm in Reviews: Are Yelpers a sarcastic bunch?

  • Detecting Changepoints and Events: Detecting when things change suddenly (e.g., a business coming under new management or when a city starts going nuts over cronuts)

3. Data Collection and Cleaning


In [ ]:
### Link to requirements.txt on github

3.1 Data Dictionary

The data for the project was obtained from https://www.yelp.com/dataset_challenge/dataset.

  • 400K reviews and 100K tips by 120K users for 106K businesses
  • Cities(US): Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, Cleveland

  1. Businesses table:

    • "business_id":"encrypted business id"
    • "name":"business name"
    • "neighborhood":"hood name"
    • "address":"full address"
    • "city":"city"
    • "state":"state -- if applicable --"
    • "postal code":"postal code"
    • "latitude":latitude
    • "longitude":longitude
    • "stars":star rating, rounded to half-stars
    • "review_count":number of reviews
    • "is_open":0/1 (closed/open)
    • "attributes":["an array of strings: each array element is an attribute"]
    • "categories":["an array of strings of business categories"]
    • "hours":["an array of strings of business hours"]
    • "type": "business"

  2. Reviews table:

    • "review_id":"encrypted review id"
    • "user_id":"encrypted user id"
    • "business_id":"encrypted business id"
    • "stars":star rating, rounded to half-stars
    • "date":"date formatted like 2009-12-19"
    • "text":"review text"
    • "useful":number of useful votes received
    • "funny":number of funny votes received
    • "cool": number of cool review votes received
    • "type": "review"

  3. Users table:

    • "user_id":"encrypted user id"
    • "name":"first name"
    • "review_count":number of reviews
    • "yelping_since": date formatted like "2009-12-19"
    • "friends":["an array of encrypted ids of friends"]
    • "useful":"number of useful votes sent by the user"
    • "funny":"number of funny votes sent by the user"
    • "cool":"number of cool votes sent by the user"
    • "fans":"number of fans the user has"
    • "elite":["an array of years the user was elite"]
    • "average_stars":floating point average like 4.31
    • "compliment_hot":number of hot compliments received by the user
    • "compliment_more":number of more compliments received by the user
    • "compliment_profile": number of profile compliments received by the user
    • "compliment_cute": number of cute compliments received by the user
    • "compliment_list": number of list compliments received by the user
    • "compliment_note": number of note compliments received by the user
    • "compliment_plain": number of plain compliments received by the user
    • "compliment_cool": number of cool compliments received by the user
    • "compliment_funny": number of funny compliments received by the user
    • "compliment_writer": number of writer compliments received by the user
    • "compliment_photos": number of photo compliments received by the user
    • "type":"user"

  4. Checkins table:

    • "time":["an array of check ins with the format day-hour:number of check ins from hour to hour+1"]
    • "business_id":"encrypted business id"
    • "type":"checkin"

  5. Tips table:

    • "text":"text of the tip"
    • "date":"date formatted like 2009-12-19"
    • "likes":compliment count
    • "business_id":"encrypted business id"
    • "user_id":"encrypted user id"
    • "type":"tip"

3.2 Data Cleaning

3.2.1 Converted raw json files obtained from https://www.yelp.com/dataset_challenge/dataset into csv files.

def convert_json_to_csv(json_file_dir, csv_file_dir):

    for filename in os.listdir(json_file_dir):
        if filename.endswith('.json'):
            try:
                pd.read_json(os.path.join(json_file_dir, filename), lines=True).to_csv(os.path.join(csv_file_path, filename.replace('.json', '.csv')), encoding='utf-8', index=False)
            except:
                print filename + 'error\n'

convert_json_to_csv('/home/amlanlimaye/yelp-dataset-challenge/data/raw/', 
                    '/home/amlanlimaye/yelp-dataset-challenge/data/interim/original_csv/')

3.2.2 Reading all data tables

table_names = ['business', 'review', 'user', 'checkin', 'tip']
original_csv_filepath = '/home/amlanlimaye/yelp-dataset-challenge/data/interim/original_csv/'

for tbl_name in table_names:
    globals()[tbl_name] = pd.read_csv(original_csv_filepath + "{}".format(tbl_name) + '.csv')

3.2.3 Cleaning 'business' table

# Sample row in the attributes column:

# u"[BikeParking: True, BusinessAcceptsBitcoin: False, BusinessAcceptsCreditCards: True, BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}, DogsAllowed: False, RestaurantsPriceRange2: 2, WheelchairAccessible: True]"

# Function to clean the attributes column and use regex to make python understand that business['attributes'] is a json-type dict:

def clean_business(test_str):

    test_str = test_str.fillna('[]')
    test_str = test_str.map(lambda x: x.replace('[','{'))
    test_str = test_str.map(lambda x: x.replace(']','}'))
    test_str = test_str.map(lambda x: x.replace('True', 'true'))
    test_str = test_str.map(lambda x: x.replace('False', 'false'))
    test_str = test_str.map(lambda x: x.replace('\'', '"'))

    matches = re.findall("([A-Za-z0-9]+)(?=:)", test_str)

    if len(matches):
        for match in matches:
            test_str = test_str.replace(match, '"%s"' % match)

    return test_str

business['attributes'] = business['attributes'].map(regex_match)

# Function to extract attributes from json object and convert them into columns

def expand_features(row):
    try:
        extracted = json.loads(row['attributes'])
        for key, value in extracted.items():
            print key, type(value)
            if type(value) != dict:
                row["attribute_" + key] = value
            else:
                for attr_key, attr_value in value.items():
                    row["attribute_" + key + "_" + attr_key] = attr_value
    except:
        print "could not decode:", row['attributes']

    return row

business.apply(expand_features, axis=1).columns

# Cleaning 'categories' column

business['categories'] = business['categories'].fillna(' ')
business['categories'] = business['categories'].map(lambda x: x[1:-1].split(','))

# Cleaning 'hours' column

business['hours'] = business['hours'].fillna(' ')
business['hours'] = business['hours'].map(lambda x: x[1:-1].split(','))

# Cleaning 'neighborhoods' column

business['neighborhood'] = business['neighborhood'].fillna(' ')
business['neighborhood'] = business['neighborhood'].map(lambda x: x[1:-1].split(','))

# Cleaning 'postal_code' column

business['postal_code'] = business['postal_code'].map(lambda x: x[:-2])

In [27]:
business.head(2)


Out[27]:
address attributes business_id categories city hours is_open latitude longitude name neighborhood postal_code review_count stars state type
0 227 E Baseline Rd, Ste J2 {"BikeParking": true, "BusinessAcceptsBitcoin"... 0DI8Dt2PJp07XkVvIElIcQ [Tobacco Shops, Nightlife, Vape Shops, Shop... Tempe [Monday 11:0-21:0, Tuesday 11:0-21:0, Wednes... 0 33.378214 -111.936102 Innovative Vapors [] 85283.0 17 4.5 AZ business
1 495 S Grand Central Pkwy {"BusinessAcceptsBitcoin": false, "BusinessAcc... LTlCaCGZE14GuaUXUGbamg [Caterers, Grocery, Food, Event Planning & ... Las Vegas [Monday 0:0-0:0, Tuesday 0:0-0:0, Wednesday ... 1 36.192284 -115.159272 Cut and Taste [] 89106.0 9 5.0 NV business

3.2.4 Cleaning 'review' table

# Cleaning the 'date' column

review['date'] = pd.to_datetime(review['date'])

# Cleaning the 'useful' column

review['useful'] = review['useful'].fillna(0)
review['useful'] = review['useful'].map(int)

In [40]:
review.head(2)


Out[40]:
business_id cool date funny review_id stars text type useful user_id
0 2hkQ5S6L_D7yX3pWtPw2Ww 0 2016-05-23 0 ppVujROE-SfvlV4mqbokJQ 5.0 Honestly, if your looking for someone to do yo... review 0 tOplTv2njld2dzMe7qdTvw
1 2hkQ5S6L_D7yX3pWtPw2Ww 0 2016-03-30 0 lOI1deu5emPtr5JnlaFx_g 5.0 Ive had a bad experience with lashes b4 and wa... review 0 ww4tDDd-k2RgsR2Aun9LNQ

In [30]:
review.text.head(2)


Out[30]:
0    Honestly, if your looking for someone to do yo...
1    Ive had a bad experience with lashes b4 and wa...
Name: text, dtype: object

In [18]:
review_all = pd.read_csv('../../data/interim/original_csv/review.csv')

# Number of reviews by date
# The sharp seasonal falls are Chrismas Day and New Year's Day
# The sharp seasonal spikes are in summer, where people presumably have more free time

review.groupby('date').agg({'review_id': len}).reset_index().plot(x='date', y='review_id', figsize=(10,6))


/home/amlanlimaye/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (1,3) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7a810ac250>

3.2.5 Cleaning 'checkin' table

# Cleaning 'time' column

checkin['time'] = checkin['time'].map(lambda x: x[1:-1].split(','))

# Making columns aggregating checkins by day of week

checkin['mon_list'] = checkins['time'].map(lambda x: [list_item for list_item in x[1:-1].split(',') if 'Mon' in list_item])
checkin['tue_list'] = checkins['time'].map(lambda x: [list_item for list_item in x[1:-1].split(',') if 'Tue' in list_item])
checkin['wed_list'] = checkins['time'].map(lambda x: [list_item for list_item in x[1:-1].split(',') if 'Wed' in list_item])
checkin['thu_list'] = checkins['time'].map(lambda x: [list_item for list_item in x[1:-1].split(',') if 'Thu' in list_item])
checkin['fri_list'] = checkins['time'].map(lambda x: [list_item for list_item in x[1:-1].split(',') if 'Fri' in list_item])
checkin['sat_list'] = checkins['time'].map(lambda x: [list_item for list_item in x[1:-1].split(',') if 'Sat' in list_item])
checkin['sun_list'] = checkins['time'].map(lambda x: [list_item for list_item in x[1:-1].split(',') if 'Sun' in list_item])

# Converting day of week lists to dictionaries so that # of checkins can be looked up by hour

checkin['mon_list'] = checkin['mon_list'].map(lambda x: 
                                              {int(list_item.replace(' ', '').replace('Mon-', '').split(':')[0]):int(list_item.replace(' ', '').replace('Mon-', '').split(':')[1]) 
                                               for list_item in x})

checkin['tue_list'] = checkin['tue_list'].map(lambda x: 
                                              {int(list_item.replace(' ', '').replace('Tue-', '').split(':')[0]):int(list_item.replace(' ', '').replace('Tue-', '').split(':')[1]) 
                                               for list_item in x})

checkin['wed_list'] = checkin['wed_list'].map(lambda x: 
                                              {int(list_item.replace(' ', '').replace('Wed-', '').split(':')[0]):int(list_item.replace(' ', '').replace('Wed-', '').split(':')[1]) 
                                               for list_item in x})

checkin['thu_list'] = checkin['thu_list'].map(lambda x: 
                                              {int(list_item.replace(' ', '').replace('Thu-', '').split(':')[0]):int(list_item.replace(' ', '').replace('Thu-', '').split(':')[1]) 
                                               for list_item in x})

checkin['fri_list'] = checkin['fri_list'].map(lambda x: 
                                              {int(list_item.replace(' ', '').replace('Fri-', '').split(':')[0]):int(list_item.replace(' ', '').replace('Fri-', '').split(':')[1]) 
                                               for list_item in x})

checkin['sat_list'] = checkin['sat_list'].map(lambda x: 
                                              {int(list_item.replace(' ', '').replace('Sat-', '').split(':')[0]):int(list_item.replace(' ', '').replace('Sat-', '').split(':')[1]) 
                                               for list_item in x})

checkin['sun_list'] = checkin['sun_list'].map(lambda x: 
                                              {int(list_item.replace(' ', '').replace('Sun-', '').split(':')[0]):int(list_item.replace(' ', '').replace('Sun-', '').split(':')[1]) 
                                               for list_item in x})

In [34]:
checkin.head(2)


Out[34]:
business_id time type mon_list tue_list wed_list thu_list fri_list sat_list sun_list
0 7KPBkxAOEtb3QeIL9PEErg [Fri-0:2, Sat-0:1, Sun-0:1, Wed-0:2, Sat-1... checkin {11: 1, 12: 1, 18: 1, 19: 1, 20: 1, 23: 1} {4: 1, 12: 1, 13: 2, 15: 1, 16: 1, 18: 2, 20: ... {0: 2, 1: 1, 2: 1, 6: 1, 11: 2, 13: 2, 14: 1, ... {1: 1, 2: 1, 4: 1, 13: 1, 15: 1, 19: 1, 20: 1,... {0: 2, 3: 1, 10: 1, 14: 2, 15: 1, 16: 1, 18: 1... {0: 1, 1: 2, 2: 1, 10: 1, 12: 1, 13: 2, 14: 1,... {0: 1, 2: 2, 3: 3, 6: 1, 16: 1, 17: 1, 18: 1, ...
1 kREVIrSBbtqBhIYkTccQUg [Mon-13:1, Thu-13:1, Sat-16:1, Wed-17:1, S... checkin {13: 1} {} {17: 1} {20: 1, 13: 1} {} {16: 1, 21: 1} {19: 1}

3.2.6 Cleaning 'user' table

# Cleaning 'elite' column

user['elite'] = user['elite'].map(lambda x: x[1:-1].split(','))

# Cleaning 'friends' column

user['friends'] = user['friends'].map(lambda x: x[1:-1].split(','))

# Cleaning 'yelping since' column

user['yelping_since'] = pd.to_datetime(user['yelping_since'])

In [36]:
user.head(2)


Out[36]:
average_stars compliment_cool compliment_cute compliment_funny compliment_hot compliment_list compliment_more compliment_note compliment_photos compliment_plain ... elite fans friends funny name review_count type useful user_id yelping_since
0 3.59 4192 79 4192 3904 19 305 4705 1347 2617 ... [2017, 2015, 2016, 2014, 2011, 2013, 2012] 298 [iJg9ekPzF9lkMuvjKYX6uA, ctWAuzS04Xu0lke2Rop4... 12316 Rob 761 user 18456 EZmocAborM6z66rTzeZxzQ 2009-09-12
1 4.29 144 11 144 64 1 4 97 24 129 ... [None] 34 [r2UUCzGxqI6WPsiWPgqG2A, qewG3X2O4X6JKskxyyqF... 28 Vivian 80 user 117 myql3o3x22_ygECb8gVo7A 2009-06-27

2 rows × 23 columns

3.2.7 Cleaning 'tip' table

# Cleaning 'date' column

tip['date'] = pd.to_datetime(tip['date'])

In [37]:
tip.head(2)


Out[37]:
business_id date likes text type user_id
0 dAa0hB2yrnHzVmsCkN4YvQ 2014-06-20 0 Nice place. Great staff. A fixture in the tow... tip oaYhjqBbh18ZhU0bpyzSuw
1 dAa0hB2yrnHzVmsCkN4YvQ 2016-10-12 0 Happy hour 5-7 Monday - Friday tip ulQ8Nyj7jCUR8M83SUMoRQ

In [38]:
tip.text.head(2)


Out[38]:
0    Nice place. Great staff.  A fixture in the tow...
1                       Happy hour 5-7 Monday - Friday
Name: text, dtype: object

4. Model - Latent Dirichlet Allocation (LDA)

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.

LDA (Latent Dirichlet Allocation) is an example of a topic model that posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics.

LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you:

Decide on the number of words N the document will have (say, according to a Poisson distribution). Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have the two topics; food and cute animals, you might choose the document to consist of 1/3 food and 2/3 cute animals. Generate each word in the document by: ....First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability). ....Then using the topic to generate the word itself (according to the topic's multinomial distribution). For instance, the food topic might output the word "broccoli" with 30% probability, "bananas" with 15% probability, and so on.

Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.


In [3]:
import pandas as pd
import numpy as np
import seaborn as sns # For prettier plots. Seaborn takes over pandas' default plotter
import nltk
import pyLDAvis
import pyLDAvis.sklearn

from gensim import models, matutils
from collections import defaultdict
from gensim import corpora
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

pyLDAvis.enable_notebook()
%matplotlib inline


/home/amlanlimaye/anaconda2/lib/python2.7/site-packages/matplotlib/font_manager.py:280: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  'Matplotlib is building the font cache using fc-list. '

In [4]:
review = pd.read_csv('../../data/interim/clean_US_cities/2016_review.csv')
review = review.fillna('')

tvec = TfidfVectorizer(stop_words='english', min_df=10, max_df=0.5, max_features=100,
                       norm='l2', 
                       strip_accents='unicode'
                       )
review_dtm_tfidf = tvec.fit_transform(review['text'])

cvec = CountVectorizer(stop_words='english', min_df=10, max_df=0.5, max_features=100,
                       strip_accents='unicode')
review_dtm_cvec = cvec.fit_transform(review['text'])

print review_dtm_tfidf.shape, review_dtm_cvec.shape


(393275, 100) (393275, 100)

In [12]:
# Fitting LDA models

# On cvec DTM
lda_cvec = LatentDirichletAllocation(n_topics=10, random_state=42)
lda_cvec.fit(review_dtm_cvec)

# On tfidf DTM
lda_tfidf = LatentDirichletAllocation(n_topics=10, random_state=42)
lda_tfidf.fit(review_dtm_tfidf)


Out[12]:
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_jobs=1, n_topics=10, perp_tol=0.1, random_state=42,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [15]:
lda_viz_10_topics_cvec = pyLDAvis.sklearn.prepare(lda_cvec, review_dtm_cvec, cvec)
lda_viz_10_topics_cvec

# topic labels


Out[15]:

In [ ]:
topics_labels = {
   1: "customer_feelings",
   2: "customer_actions",
   3: "restaurant_related",
    4: "compliments",
    5: "las_vegas_related",
    6: "hotel_related",
    7: "location_related",
    8: "chicken_related",
    9: "superlatives",
    10: "ordering_pizza"
}

Generating topic probabilities for each review


In [8]:
vocab = {v: k for k, v in cvec.vocabulary_.iteritems()}
vocab


Out[8]:
{0: u'10',
 1: u'amazing',
 2: u'area',
 3: u'asked',
 4: u'awesome',
 5: u'bad',
 6: u'bar',
 7: u'best',
 8: u'better',
 9: u'called',
 10: u'came',
 11: u'car',
 12: u'check',
 13: u'chicken',
 14: u'clean',
 15: u'come',
 16: u'coming',
 17: u'customer',
 18: u'day',
 19: u'definitely',
 20: u'delicious',
 21: u'did',
 22: u'didn',
 23: u'different',
 24: u'don',
 25: u'drinks',
 26: u'eat',
 27: u'excellent',
 28: u'experience',
 29: u'feel',
 30: u'food',
 31: u'fresh',
 32: u'friendly',
 33: u'going',
 34: u'good',
 35: u'got',
 36: u'great',
 37: u'happy',
 38: u'home',
 39: u'hotel',
 40: u'just',
 41: u'know',
 42: u'las',
 43: u'like',
 44: u'little',
 45: u'll',
 46: u'location',
 47: u'long',
 48: u'looking',
 49: u'lot',
 50: u'love',
 51: u'make',
 52: u'menu',
 53: u'minutes',
 54: u'need',
 55: u'new',
 56: u'nice',
 57: u'night',
 58: u'order',
 59: u'ordered',
 60: u'people',
 61: u'pizza',
 62: u'place',
 63: u'pretty',
 64: u'price',
 65: u'quality',
 66: u'really',
 67: u'recommend',
 68: u'restaurant',
 69: u'right',
 70: u'room',
 71: u'said',
 72: u'sauce',
 73: u'say',
 74: u'server',
 75: u'service',
 76: u'small',
 77: u'staff',
 78: u'stars',
 79: u'strip',
 80: u'super',
 81: u'sure',
 82: u'table',
 83: u'think',
 84: u'time',
 85: u'times',
 86: u'told',
 87: u'took',
 88: u'tried',
 89: u'try',
 90: u've',
 91: u'vegas',
 92: u'wait',
 93: u'want',
 94: u'wanted',
 95: u'wasn',
 96: u'way',
 97: u'went',
 98: u'work',
 99: u'worth'}

In [9]:
lda_ = models.LdaModel(
    matutils.Sparse2Corpus(review_dtm_cvec, documents_columns=False),
    # or use the corpus object created with the dictionary in the previous frame!
    # corpus, 
    num_topics  =  10,
    passes      =  1,
    id2word     =  vocab
    # or use the gensim dictionary object!
    # id2word     =  dictionary
)

In [ ]:
stops = stopwords.words()

docs = pd.DataFrame(review_dtm_cvec.toarray(), columns=vectorizer.get_feature_names())
docs.sum()

bow = []

for document in review_dtm_cvec.toarray():
    
    single_document = []
    
    for token_id, token_count in enumerate(document):

        if token_count > 0:
            single_document.append((token_id, token_count))

    bow.append(single_document)

# remove words that appear only once
frequency = defaultdict(int)

for text in documents:
    for token in text.split():
        frequency[token] += 1

texts = [[token for token in text.split() if frequency[token] > 1 and token not in stops]
          for text in documents]

# Create gensim dictionary object
dictionary = corpora.Dictionary(texts)

# Create corpus matrix
corpus = [dictionary.doc2bow(text) for text in texts]

lda_.print_topics(num_topics=3, num_words=5)

lda_.get_document_topics(bow[0])

In [ ]:
doc_topics = [lda_.get_document_topics(doc) for doc in corpus]

topic_data = []

for document_id, topics in enumerate(doc_topics):
    
    document_topics = []
    
    for topic, probability in topics:
       
        topic_data.append({
            'document_id':  document_id,
            'topic_id':     topic,
            'topic':        topics_labels[topic],
            'probability':  probability
        })

topics_df = pd.DataFrame(topic_data.[:5])
topics_df.pivot_table(values="probability", index=["document_id", "topic"]).T

Conclusions and Next Steps

The next step is to generate a matrix of topic probabilities for each review, which I did actually work on but the first time I tried it, it crashed and the 2nd time it wasn't even done after 5 hours, so that definitely needs more work. The huge size of the data, 400K reviews is definitely a challenge, but I'm optimistic on being able to do that soon.

I was able to discover reasonably distinct abstract topics in Yelp Reviews, understand their distribution and develop an understanding of Yelp Reviews that will serve as a foundation to tackle more sophisticated and ambitious questions in the future, such as:

  • Cultural Trends: What makes a particular city different? What cuisines do Yelpers rave about in different countries? Do Americans tend to eat out late compared to those in Germany or the U.K.? In which countries are Yelpers sticklers for service quality? In international cities such as Montreal, are French speakers reviewing places differently than English speakers?

  • Inferring Categories: Are there any non-intuitive correlations between business categories e.g., how many karaoke bars also offer Korean food, and vice versa? What businesses deserve their own subcategory (i.e., Szechuan or Hunan versus just "Chinese restaurants")

  • Detecting Sarcasm in Reviews: Are Yelpers a sarcastic bunch?

  • Detecting Changepoints and Events: Detecting when things change suddenly (e.g., a business coming under new management or when a city starts going nuts over cronuts)

References

Chen, E. (2011, August 28). What is a good explanation of Latent Dirichlet Allocation? Retrieved February 8, 2017, from https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation/answer/Edwin-Chen-1

W. (2017, January 07). Topic Modeling. Retrieved February 08, 2017, from https://en.wikipedia.org/wiki/Topic_model

W. (2017, January 20). Latent Dirichlet Allocation. Retrieved February 08, 2017, from https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

W. (2004, October 20). Yelp. Retrieved February 08, 2017, from https://en.wikipedia.org/wiki/Yelp

Y. (2017, January 24). Yelp Dataset Challenge. Retrieved February 08, 2017, from https://www.yelp.com/dataset_challenge