We are analyzing restaurant prices, ratings, and cuisines listed on Yelp, based on their distances from 70 Washington Square South. To do this, we scrape Yelp NYC using the search term “Restaurants near New York University” and analyze the first 50 pages of results.
In [137]:
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics
import datetime as dt # date tools, used to note current date
import requests
from bs4 import BeautifulSoup
%matplotlib inline
In [138]:
total_count = 500
restaurantnames = []
restaurantratings = []
restaurantnumberreviews = []
restaurantpriceranges = []
restaurantcuisines = []
restaurantdistances = []
# Loop through 50 pages of results (10 results per page, 500 results in total)
for page in range(0, total_count, 10):
    url = "https://www.yelp.co.uk/search?find_desc=Restaurants&start={}&find_near=new-york-university-new-york-18".format(page)
    yelp = requests.get(url)
    yelp_soup = BeautifulSoup(yelp.content, 'html.parser')
    restaurants = yelp_soup.find_all('li', class_="regular-search-result")
    for restaurant in restaurants:
        # Extract the restaurant name
        fullnametag = restaurant.find_all('a', class_='biz-name js-analytics-click')
        if fullnametag != []:
            restaurantnames += [fullnametag[0].find_all('span')[0].get_text()]
        else:
            restaurantnames += ["NaN"]
        # Extract the rating
        all_rating_tags = list(restaurant.find_all('div', class_="i-stars"))
        if all_rating_tags != []:
            rating = all_rating_tags[0]['title'][0:4]
            restaurantratings += [rating]
        else:
            restaurantratings += ["NaN"]
        # Extract the number of reviews
        all_review_tags = list(restaurant.find_all('span', class_="review-count"))
        if all_review_tags != []:
            restaurantnumberreviews += [all_review_tags[0].get_text().strip("\n ")]
        else:
            restaurantnumberreviews += ["NaN"]
        # Extract the price range
        all_price_range_tags = list(restaurant.find_all('span', class_="business-attribute price-range"))
        if all_price_range_tags != []:
            restaurantpriceranges += [all_price_range_tags[0].get_text()]
        else:
            restaurantpriceranges += ["NaN"]
        # Extract the cuisine (first listed category only, so the lists stay aligned)
        all_cuisine_tags = list(restaurant.find_all('span', class_="category-str-list"))
        if all_cuisine_tags != []:
            cuisine_type = list(all_cuisine_tags[0].find_all('a'))
            restaurantcuisines += [cuisine_type[0].get_text()]
        else:
            restaurantcuisines += ["NaN"]
        # Extract the distance from NYU (first matching tag only)
        distance = False
        all_distance_tags = list(restaurant.find_all('small'))
        for tag in all_distance_tags:
            if "Miles" in tag.get_text().strip("\n "):
                restaurantdistances += [tag.get_text().strip("\n ")]
                distance = True
                break
        if distance == False:
            restaurantdistances += ["NaN"]
In [140]:
# Iterate in reverse so that popping an item does not shift the indices still to be visited
for i in range(len(restaurantnames) - 1, -1, -1):
    if "NaN" in (restaurantnames[i], restaurantratings[i], restaurantnumberreviews[i],
                 restaurantpriceranges[i], restaurantcuisines[i], restaurantdistances[i]):
        restaurantnames.pop(i)
        restaurantratings.pop(i)
        restaurantnumberreviews.pop(i)
        restaurantpriceranges.pop(i)
        restaurantcuisines.pop(i)
        restaurantdistances.pop(i)
In [143]:
# Again iterate in reverse; split off the leading number rather than using
# strip(" reviews"), which removes a character set, not a suffix
for i in range(len(restaurantnames) - 1, -1, -1):
    number_reviews_int = int(restaurantnumberreviews[i].split()[0])
    if number_reviews_int < 50:
        restaurantnames.pop(i)
        restaurantratings.pop(i)
        restaurantnumberreviews.pop(i)
        restaurantpriceranges.pop(i)
        restaurantcuisines.pop(i)
        restaurantdistances.pop(i)
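The filtering here removes entries from six parallel lists in lockstep. An alternative, sketched below with hypothetical sample data standing in for the scraped lists, is to collect everything into a DataFrame first and filter in one pass with pandas:

```python
import pandas as pd

# Hypothetical aligned sample data standing in for the scraped lists
names = ["A", "B", "C"]
reviews = ["120 reviews", "NaN", "34 reviews"]

df = pd.DataFrame({"Name": names, "Number of Reviews": reviews})

# Drop rows with any missing field, then parse the review count
df = df.replace("NaN", pd.NA).dropna()
df["Number of Reviews"] = df["Number of Reviews"].str.split().str[0].astype(int)

# Keep only restaurants with at least 50 reviews
df = df[df["Number of Reviews"] >= 50]
print(df)  # only restaurant "A" survives both filters
```

This avoids index bookkeeping entirely, since each row's fields are dropped or kept together.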
We had formatting issues displaying the $ signs: in a Markdown cell, paired dollar signs are interpreted as math-mode delimiters, so we convert the symbols into words.
In [144]:
for i in range(len(restaurantpriceranges)):
    if restaurantpriceranges[i] == "$":
        restaurantpriceranges[i] = "under 10 dollars"
    elif restaurantpriceranges[i] == "$$":
        restaurantpriceranges[i] = "11-30 dollars"
    elif restaurantpriceranges[i] == "$$$":
        restaurantpriceranges[i] = "31-60 dollars"
    elif restaurantpriceranges[i] == "$$$$":
        restaurantpriceranges[i] = "above 61 dollars"
In [145]:
from IPython.display import display
df = pd.DataFrame({'Name': restaurantnames,
                   'Rating': restaurantratings,
                   'Number of Reviews': restaurantnumberreviews,
                   'Price Range': restaurantpriceranges,
                   'Cuisine': restaurantcuisines,
                   'Distance from NYU': restaurantdistances},
                  columns = ['Name', 'Rating', 'Number of Reviews', 'Price Range', 'Cuisine', 'Distance from NYU'])
display(df)
In [146]:
# plotly imports
from plotly.offline import iplot, iplot_mpl # plotting functions
import plotly.graph_objs as go # ditto
import plotly # just to print version and init notebook
import sys # system module
import numpy as np # foundation for Pandas
import seaborn as sns # fancy matplotlib graphics (seaborn.apionly was removed in seaborn 0.9)
from pandas_datareader import wb, data as web # World Bank data
plotly.offline.init_notebook_mode(connected=True)
Convert distances and price ranges into floats/integers so we can use them when plotting.
In [147]:
for i in range(len(restaurantdistances)):
    restaurantdistances[i] = float(restaurantdistances[i].strip(' Miles'))
In [148]:
for i in range(len(restaurantpriceranges)):
    if restaurantpriceranges[i] == "under 10 dollars":
        restaurantpriceranges[i] = 1
    elif restaurantpriceranges[i] == "11-30 dollars":
        restaurantpriceranges[i] = 2
    elif restaurantpriceranges[i] == "31-60 dollars":
        restaurantpriceranges[i] = 3
    elif restaurantpriceranges[i] == "above 61 dollars":
        restaurantpriceranges[i] = 4
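The chained elifs here can also be collapsed into a single dictionary lookup; a minimal sketch (the mapping mirrors the labels used earlier):

```python
# Map the verbal price brackets back to ordinal codes in one lookup
price_codes = {
    "under 10 dollars": 1,
    "11-30 dollars": 2,
    "31-60 dollars": 3,
    "above 61 dollars": 4,
}

labels = ["11-30 dollars", "under 10 dollars", "above 61 dollars"]
codes = [price_codes[label] for label in labels]
print(codes)  # [2, 1, 4]
```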
Create four separate arrays, each containing the distances of the restaurants in one price range.
In [149]:
restaurantpricerange1 = []
restaurantpricerange2 = []
restaurantpricerange3 = []
restaurantpricerange4 = []
for i in range(len(restaurantpriceranges)):
    if restaurantpriceranges[i] == 1:
        restaurantpricerange1 += [restaurantdistances[i]]
    elif restaurantpriceranges[i] == 2:
        restaurantpricerange2 += [restaurantdistances[i]]
    elif restaurantpriceranges[i] == 3:
        restaurantpricerange3 += [restaurantdistances[i]]
    elif restaurantpriceranges[i] == 4:
        restaurantpricerange4 += [restaurantdistances[i]]
In [150]:
from IPython.display import display
df = pd.DataFrame({'Price Range':restaurantpriceranges, 'Distance from NYU':restaurantdistances}, columns = ['Price Range','Distance from NYU'])
display(df.head())
In [151]:
ax = sns.swarmplot(x="Price Range", y="Distance from NYU", data=df)
fig_mpl = ax.get_figure()
iplot_mpl(fig_mpl)
Observe that the lowest and highest price ranges (under 10 and above 61 dollars, respectively) have the fewest restaurants near NYU. Most restaurants within the given radius charge between 11 and 30 dollars.
Note 1: plotly modified our price ranges of 1, 2, 3, 4 to 0, 1, 2, 3.
Note 2: We are using the word "dollars" instead of the $ sign because it is distorting the formatting of the text.
In the following four graphs, we create a separate graph per price range.
Note: plotly again mislabeled the x-axis, so ignore the actual numbers; just keep in mind that distance increases from left to right along the x-axis.
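One way to sidestep the mislabeled axes is to skip the matplotlib-to-plotly conversion and set the tick labels on the matplotlib axes directly. A minimal sketch, with hypothetical counts and bracket labels of our own:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; safe outside a notebook
import matplotlib.pyplot as plt

bracket_labels = ["under $10", "$11-30", "$31-60", "above $61"]
counts = [5, 30, 12, 3]  # hypothetical restaurant counts per bracket

fig, ax = plt.subplots()
ax.bar(range(len(counts)), counts)
ax.set_xticks(range(len(bracket_labels)))
ax.set_xticklabels(bracket_labels)  # explicit labels, so nothing gets renumbered
ax.set_xlabel("Price Range")
ax.set_ylabel("Number of restaurants")
```

Because the labels are set explicitly, the categories cannot be silently remapped to 0-3.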
In [152]:
trace = go.Histogram(
    x = restaurantpricerange1,
    histnorm = 'count',
    name = 'control',
    autobinx = False,
    xbins = dict(
        start = 0,
        end = 1.4,
        size = 0.1),
    marker = dict(
        color = 'blue',
    ),
    opacity = 0.75
)
data = [trace]
layout = go.Layout(
    title = 'Number of restaurants per distance bracket in price range of up to $10',
    xaxis = dict(title = 'Distance'),
    yaxis = dict(title = 'Number of restaurants'),
    bargap = 0.01,
    bargroupgap = 0.1
)
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'styled histogram')
In [153]:
trace = go.Histogram(
    x = restaurantpricerange3,
    histnorm = 'count',
    name = 'control',
    autobinx = False,
    xbins = dict(
        start = 0,
        end = 1.4,
        size = 0.1),
    marker = dict(
        color = 'green',
    ),
    opacity = 0.75
)
data = [trace]
layout = go.Layout(
    title = 'Number of restaurants per distance bracket in price range of $31-$60',
    xaxis = dict(title = 'Distance'),
    yaxis = dict(title = 'Number of restaurants'),
    bargap = 0.01,
    bargroupgap = 0.1
)
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'styled histogram')
In [154]:
trace = go.Histogram(
    x = restaurantpricerange2,
    histnorm = 'count',
    name = 'control',
    autobinx = False,
    xbins = dict(
        start = 0,
        end = 1.4,
        size = 0.1),
    marker = dict(
        color = 'red',
    ),
    opacity = 0.75
)
data = [trace]
layout = go.Layout(
    title = 'Number of restaurants per distance bracket in price range of $11-$30',
    xaxis = dict(title = 'Distance'),
    yaxis = dict(title = 'Number of restaurants'),
    bargap = 0.01,
    bargroupgap = 0.1
)
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'styled histogram')
In [155]:
trace = go.Histogram(
    x = restaurantpricerange4,
    histnorm = 'count',
    name = 'control',
    autobinx = False,
    xbins = dict(
        start = 0,
        end = 1.4,
        size = 0.1),
    marker = dict(
        color = "rgba(0,191,191,1.0)"
    ),
    opacity = 0.75
)
data = [trace]
layout = go.Layout(
    title = 'Number of restaurants per distance bracket in price range of $61 and above',
    xaxis = dict(title = 'Distance'),
    yaxis = dict(title = 'Number of restaurants'),
    bargap = 0.01,
    bargroupgap = 0.1,
)
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'styled histogram')
If we compare the bar graphs to the swarm plot, we see that they match the data distribution. The swarm plot is better for an overall comparison, while the individual bar graphs give a more magnified view for each price range.
Find all of the rating categories in our data and store them in an array.
In [156]:
restaurantrating_types = []
for rating in restaurantratings:
    if rating not in restaurantrating_types:
        restaurantrating_types += [rating]
print(restaurantrating_types)
In the cell above, we look for the types of restaurant ratings so that we can separate them and display them in relation to distance from NYU. It seems that within the first 500 results we scraped, Yelp's algorithm prioritises higher-rated restaurants. Next, we count the number of restaurants with each type of rating.
In [157]:
restaurantrating_numbers = [0]*len(restaurantrating_types)
for rating in range(len(restaurantratings)):
    for t in range(len(restaurantrating_types)):
        if restaurantratings[rating] == restaurantrating_types[t]:
            restaurantrating_numbers[t] += 1
print(restaurantrating_numbers)
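The manual unique-and-count pattern used here (and again for the distance breakdowns further down) can also be done in one step with `collections.Counter` from the standard library; a short sketch with made-up ratings:

```python
from collections import Counter

ratings = ["4.0", "3.5", "4.0", "4.5", "4.0", "3.5"]  # hypothetical sample
counts = Counter(ratings)

rating_types = list(counts.keys())      # unique ratings, in first-seen order
rating_numbers = list(counts.values())  # matching counts
print(rating_types, rating_numbers)  # ['4.0', '3.5', '4.5'] [3, 2, 1]
```

Counter preserves insertion order (Python 3.7+), so the two lists stay aligned just like the hand-rolled version.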
In [158]:
fig = {
    "data": [
        {
            "values": restaurantrating_numbers,
            "labels": restaurantrating_types,
            "name": "Rating type",
            "hoverinfo": "label+percent+name",
            "type": "pie"
        }
    ],
    "layout": {
        "title": "Ratings separated by type",
    }
}
iplot(fig)
Intuitively, it makes sense that most restaurants received a 4-star rating. A restaurant would have to be truly exceptional to receive a full 5-star rating, which only 4 out of our 500 restaurants have received.
Next, we create arrays for the 3 most popular ratings (3.5, 4.0, and 4.5), storing the distances of the restaurants with the respective rating.
In [159]:
distances_35 = []
distances_40 = []
distances_45 = []
# The trailing space in '3.5 ' etc. comes from slicing the first four
# characters of the star-rating title attribute (e.g. "3.5 star rating")
for i in range(len(restaurantratings)):
    if restaurantratings[i] == '3.5 ':
        distances_35 += [restaurantdistances[i]]
    elif restaurantratings[i] == '4.0 ':
        distances_40 += [restaurantdistances[i]]
    elif restaurantratings[i] == '4.5 ':
        distances_45 += [restaurantdistances[i]]
print(distances_35)
print(distances_40)
print(distances_45)
In [160]:
distances_35_types = []
for distance in distances_35:
    if distance not in distances_35_types:
        distances_35_types += [distance]
print(distances_35_types)
In [161]:
distances_35_numbers = [0]*len(distances_35_types)
for d in range(len(distances_35)):
    for t in range(len(distances_35_types)):
        if distances_35[d] == distances_35_types[t]:
            distances_35_numbers[t] += 1
print(distances_35_numbers)
In [162]:
distances_40_types = []
for distance in distances_40:
    if distance not in distances_40_types:
        distances_40_types += [distance]
print(distances_40_types)
In [163]:
distances_40_numbers = [0]*len(distances_40_types)
for d in range(len(distances_40)):
    for t in range(len(distances_40_types)):
        if distances_40[d] == distances_40_types[t]:
            distances_40_numbers[t] += 1
print(distances_40_numbers)
In [164]:
distances_45_types = []
for distance in distances_45:
    if distance not in distances_45_types:
        distances_45_types += [distance]
print(distances_45_types)
In [165]:
distances_45_numbers = [0]*len(distances_45_types)
for d in range(len(distances_45)):
    for t in range(len(distances_45_types)):
        if distances_45[d] == distances_45_types[t]:
            distances_45_numbers[t] += 1
print(distances_45_numbers)
In [166]:
fig = {
    "data": [
        {
            "values": distances_35_numbers,
            "labels": distances_35_types,
            "name": "Distance",
            "hoverinfo": "label+percent+name",
            "hole": 0.4,
            "domain": {"x": [0, .38]},
            "type": "pie"
        },
        {
            "values": distances_45_numbers,
            "labels": distances_45_types,
            "name": "Distance",
            "hoverinfo": "label+percent+name",
            "hole": 0.4,
            "domain": {"x": [.52, 0.9]},
            "type": "pie"
        }
    ],
    "layout": {
        "title": "Restaurants with different ratings separated by distance",
        "annotations": [
            {
                "text": "3.5 stars",
                "showarrow": False,
                "x": 0.16,
                "y": 0.5
            },
            {
                "text": "4.5 stars",
                "showarrow": False,
                "x": 0.74,
                "y": 0.5
            }
        ]
    }
}
iplot(fig)
In [167]:
fig = {
    "data": [
        {
            "values": distances_40_numbers,
            "labels": distances_40_types,
            "name": "Distance",
            "hoverinfo": "label+percent+name",
            "type": "pie",
            "hole": 0.4  # numeric, not the string "0.4"
        }
    ],
    "layout": {
        "title": "Restaurants with 4.0 rating separated by distance",
        "annotations": [
            {
                "text": "4.0 stars",
                "showarrow": False,
            }
        ]
    }
}
iplot(fig)
In displaying these donut charts, we chose to separate the 4.0-rated restaurants from the 3.5- and 4.5-rated ones, both because of spatial constraints in displaying the charts and because we wanted to give higher importance to the former (most restaurants received 4 stars).
Create an array containing all of the cuisine types in the data set.
In [168]:
restaurantcuisine_types = []
for cuisine in restaurantcuisines:
    if cuisine not in restaurantcuisine_types:
        restaurantcuisine_types += [cuisine]
print(restaurantcuisine_types)
Create an array containing number of restaurants of each cuisine type.
In [169]:
restaurantcuisine_numbers = [0]*len(restaurantcuisine_types)
for cuisine in range(len(restaurantcuisines)):
    for t in range(len(restaurantcuisine_types)):
        if restaurantcuisines[cuisine] == restaurantcuisine_types[t]:
            restaurantcuisine_numbers[t] += 1
print(restaurantcuisine_numbers)
Given that it was impossible to display such a high number of cuisine types in one graph, we merged all cuisine types represented by three or fewer restaurants into an "Other" category, which we display in a separate graph.
In [170]:
restaurantcuisine_numbers_2 = []
restaurantcuisine_types_2 = []
restaurantcuisine_numbers_other = []
restaurantcuisine_types_other = []
other = 0
for number in range(len(restaurantcuisine_numbers)):
    if restaurantcuisine_numbers[number] > 3:
        restaurantcuisine_numbers_2 += [restaurantcuisine_numbers[number]]
        restaurantcuisine_types_2 += [restaurantcuisine_types[number]]
    else:
        restaurantcuisine_types_other += [restaurantcuisine_types[number]]
        restaurantcuisine_numbers_other += [restaurantcuisine_numbers[number]]
        other += restaurantcuisine_numbers[number]
restaurantcuisine_numbers_2 += [other]
restaurantcuisine_types_2 += ["Other ("+str(other)+")"]
print(restaurantcuisine_numbers_2)
While working on the bar graph for the cuisines, we noticed that some of the cuisine strings were too long to display fully on the graph, so we manually abbreviated those names.
In [171]:
for cuisine in range(len(restaurantcuisine_types_2)):
    if restaurantcuisine_types_2[cuisine] == "Breakfast & Brunch":
        restaurantcuisine_types_2[cuisine] = "Breakfast"
    elif restaurantcuisine_types_2[cuisine] == "American (Traditional)":
        restaurantcuisine_types_2[cuisine] = "Amer. (Trad)"
    elif restaurantcuisine_types_2[cuisine] == "American (New)":
        restaurantcuisine_types_2[cuisine] = "Amer. (New)"
    elif restaurantcuisine_types_2[cuisine] == "Mediterranean":
        restaurantcuisine_types_2[cuisine] = "Mediter."
    elif restaurantcuisine_types_2[cuisine] == "Middle Eastern":
        restaurantcuisine_types_2[cuisine] = "M. Eastern"
    elif restaurantcuisine_types_2[cuisine] == "Tapas & Small Plates":
        restaurantcuisine_types_2[cuisine] = "Tapas"
    elif restaurantcuisine_types_2[cuisine] == "Latin American":
        restaurantcuisine_types_2[cuisine] = "Lat. Amer."
print(restaurantcuisine_types_2)
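The elif chains used here and in the next cell can also be written as a single dictionary of abbreviations, which is easier to extend; a sketch using a few of the same renamings:

```python
# Abbreviations mirroring some of the renamings applied in the cells above/below
abbreviations = {
    "Breakfast & Brunch": "Breakfast",
    "American (Traditional)": "Amer. (Trad)",
    "Takeaway & Fast Food": "Fast Food",
}

cuisines = ["Breakfast & Brunch", "Pizza", "Takeaway & Fast Food"]
# Fall back to the original name when no abbreviation is defined
short = [abbreviations.get(c, c) for c in cuisines]
print(short)  # ['Breakfast', 'Pizza', 'Fast Food']
```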
In [172]:
for cuisine in range(len(restaurantcuisine_types_other)):
    if restaurantcuisine_types_other[cuisine] == "Takeaway & Fast Food":
        restaurantcuisine_types_other[cuisine] = "Fast Food"
    elif restaurantcuisine_types_other[cuisine] == "Specialty Food":
        restaurantcuisine_types_other[cuisine] = "Specialty"
    elif restaurantcuisine_types_other[cuisine] == "Ice Cream & Frozen Yoghurt":
        restaurantcuisine_types_other[cuisine] = "Ice Cream"
    elif restaurantcuisine_types_other[cuisine] == "BBQ & Barbecue":
        restaurantcuisine_types_other[cuisine] = "BBQ"
    elif restaurantcuisine_types_other[cuisine] == "Modern European":
        restaurantcuisine_types_other[cuisine] = "European"
    elif restaurantcuisine_types_other[cuisine] == "Comfort Food":
        restaurantcuisine_types_other[cuisine] = "Comfort"
print(restaurantcuisine_types_other)
In [173]:
import plotly.graph_objs as go
data = [go.Bar(
    y = restaurantcuisine_numbers_2,
    x = restaurantcuisine_types_2,
    orientation = 'v',
)]
iplot(data, filename = 'vertical-bar')
In [174]:
import plotly.graph_objs as go
data = [go.Bar(
    y = restaurantcuisine_numbers_other,
    x = restaurantcuisine_types_other,
    orientation = 'v'
)]
iplot(data, filename = 'vertical-bar')
We originally tried using a CSV file provided by Yelp for its annual Dataset Challenge, but realized that although it contains hundreds of thousands of businesses, it includes only a handful of New York-based restaurants, as it is just a subset of Yelp's full data. We therefore decided to scrape the data directly from the website, which proved more enjoyable and better suited to the goals of this project.
We realized early on that we wouldn't be making any groundbreaking discoveries with this analysis; the purpose of the project was to familiarize ourselves with, and experiment with, the techniques learned in class. The primary purpose of the graphs is to display the scraped data more coherently, let us process it visually, and suggest possible correlations.