Discovering Abstract Topics in Yelp Reviews - Amlan Limaye

Background

  • Yelp develops and markets Yelp.com and the Yelp app
  • Publishes crowd-sourced reviews about local businesses
  • Yelp Dataset Challenge: Discovering Abstract Topics in Yelp Reviews

Foundation to tackle more sophisticated questions in the future:

  • Cultural Trends: What makes a particular city different? What cuisines do Yelpers rave about in different countries? Do Americans tend to eat out late compared to those in Germany or the U.K.? In which countries are Yelpers sticklers for service quality? In international cities such as Montreal, are French speakers reviewing places differently than English speakers?

  • Inferring Categories: Are there any non-intuitive correlations between business categories, e.g., how many karaoke bars also offer Korean food, and vice versa? What businesses deserve their own subcategory (e.g., Szechuan or Hunan versus just "Chinese restaurants")?

  • Detecting Sarcasm in Reviews: Are Yelpers a sarcastic bunch?

  • Detecting Changepoints and Events: Detecting when things change suddenly (e.g., a business coming under new management or when a city starts going nuts over cronuts)



Approach:

  • 400K reviews and 100K tips by 120K users for 106K businesses
  • Weapon of choice: Latent Dirichlet Allocation (LDA)
  • LDA posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics.
  • Produces a ranked list of words deemed important for understanding a particular topic (a minimal sketch follows below).
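
For illustration, a minimal sketch of this workflow on a toy corpus (not the notebook's actual code; it assumes a recent scikit-learn, whose API names differ slightly from the Python 2-era version used here):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in for the 400K Yelp review texts
docs = [
    "great pizza and friendly delivery service",
    "the hotel room was clean and the staff was helpful",
    "best fried chicken strips on the las vegas strip",
    "terrible service, we waited an hour for cold pizza",
]

# Document-term matrix of raw word counts
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

# K = 2 topics for this toy corpus; the Yelp model below uses K = 10
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

# lda.components_ holds per-topic word weights; sorting each row gives the
# ranked word list for that topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_words = [feature_names[i] for i in weights.argsort()[::-1][:5]]
    print("Topic %d: %s" % (topic_idx + 1, ", ".join(top_words)))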

Steps:

  • Set the number of topics, K
  • Count or TF-IDF vectorizer? (sketched below)
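
Where those two choices enter the code, sketched under the same scikit-learn assumption (`reviews` is a hypothetical stand-in for the 400K review texts):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

K = 10  # number of topics; the ten fitted topics are labeled by hand below

# Raw counts are the natural input here, since LDA models word occurrence
# counts; TF-IDF weights can be tried, but they are no longer counts, so
# the probabilistic model fits them less cleanly.
count_vec = CountVectorizer(stop_words="english", max_features=5000)
tfidf_vec = TfidfVectorizer(stop_words="english", max_features=5000)

# `reviews` would be the list of review texts (hypothetical name):
# dtm = count_vec.fit_transform(reviews)
# lda = LatentDirichletAllocation(n_components=K, random_state=0).fit(dtm)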

In [ ]:
topics_labels = {
    1: "customer_feelings",
    2: "customer_actions",
    3: "restaurant_related",
    4: "compliments",
    5: "las_vegas_related",
    6: "hotel_related",
    7: "location_related",
    8: "chicken_related",
    9: "superlatives",
    10: "ordering_pizza"
}
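
As a usage sketch, these labels can be paired with each topic's top-ranked words, assuming the `vectorizer` and an `lda` model fitted with K = 10 as in the sketches above:

feature_names = vectorizer.get_feature_names_out()
for topic_num, label in sorted(topics_labels.items()):
    weights = lda.components_[topic_num - 1]   # topic labels are 1-indexed
    top_words = [feature_names[i] for i in weights.argsort()[::-1][:10]]
    print("%-20s %s" % (label, ", ".join(top_words)))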

Conclusions and Next Steps

Discovered reasonably distinct abstract topics in Yelp reviews, examined their distribution, and developed an understanding of the data that will serve as a foundation for tackling the more sophisticated and ambitious questions listed above.

The next step is to generate a matrix of topic probabilities for each review (optimistic about being able to do that soon); a sketch of what that might look like follows.
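
A sketch of that next step, continuing from the same hypothetical fitted objects: `lda.transform` returns a reviews-by-K matrix in which each row is that review's topic-probability distribution.

import pandas as pd

# Each row of doc_topic sums to 1 across the K topics
doc_topic = lda.transform(dtm)
doc_topic_df = pd.DataFrame(
    doc_topic,
    columns=[topics_labels[k] for k in sorted(topics_labels)])
print(doc_topic_df.head())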

References

Chen, E. (2011, August 28). What is a good explanation of Latent Dirichlet Allocation? Retrieved February 8, 2017, from https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation/answer/Edwin-Chen-1

Wikipedia. (2017, January 7). Topic model. Retrieved February 8, 2017, from https://en.wikipedia.org/wiki/Topic_model

Wikipedia. (2017, January 20). Latent Dirichlet allocation. Retrieved February 8, 2017, from https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Wikipedia. (2004, October 20). Yelp. Retrieved February 8, 2017, from https://en.wikipedia.org/wiki/Yelp

Yelp. (2017, January 24). Yelp Dataset Challenge. Retrieved February 8, 2017, from https://www.yelp.com/dataset_challenge