Final Project - NYelpU

Date: 4 May 2017

@authors: Paula Dozsa (pvd233), Ritu Muralidharan (rm3901)

We analyze restaurant prices, ratings, and cuisines as listed on Yelp, based on each restaurant's distance from 70 Washington Square South. To do this, we scrape data from Yelp NYC using the search term “Restaurants near New York University” and analyze the first 50 pages of results.

Preliminaries

Import needed packages.


In [137]:
import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics 
import datetime as dt           # date tools, used to note current date  
import requests
from bs4 import BeautifulSoup
%matplotlib inline

Scrape and clean 500 results' worth of data


In [138]:
total_count = 500
restaurantnames = []
restaurantratings = [] 
restaurantnumberreviews = []
restaurantpriceranges = []
restaurantcuisines = []
restaurantdistances = []

# Loop through 50 pages of results (10 results per page, total of 500 results)
for page in range(0, total_count, 10):
    url = "https://www.yelp.co.uk/search?find_desc=Restaurants&start={}&find_near=new-york-university-new-york-18".format(page)
    yelp = requests.get(url)
    yelp_soup = BeautifulSoup(yelp.content, 'html.parser')
    restaurants = yelp_soup.find_all('li',class_ = "regular-search-result")
    for restaurant in restaurants:
        # Extract all restaurant names
        fullnametag = restaurant.find_all('a', class_ = 'biz-name js-analytics-click')
        if fullnametag != []:
            restaurantnames += [fullnametag[0].find_all('span')[0].get_text()]
        else:
            restaurantnames += ["NaN"]
        # Extract all restaurant ratings
        all_rating_tags = list(restaurant.find_all('div', class_="i-stars"))
        if all_rating_tags != []:
            rating = all_rating_tags[0]['title'][0:4]
            restaurantratings += [rating]
        else:
            restaurantratings += ["NaN"]
        # Extract number of reviews for all restaurants
        all_review_tags = list(restaurant.find_all('span', class_="review-count"))
        if all_review_tags != []:
            restaurantnumberreviews += [all_review_tags[0].get_text().strip("\n ")]
        else:
            restaurantnumberreviews += ["NaN"]
        # Extract all price ranges
        all_price_range_tags = list(restaurant.find_all('span', class_="business-attribute price-range"))
        if all_price_range_tags != []:
            restaurantpriceranges += [all_price_range_tags[0].get_text()]
        else:
            restaurantpriceranges += ["NaN"]
        # Extract all cuisines
        all_cuisine_tags = list(restaurant.find_all('span', class_="category-str-list"))
        if all_cuisine_tags != []:
            for tag in all_cuisine_tags:
                cuisine_type = list(tag.find_all('a'))
                cuisine_type_name = cuisine_type[0].get_text()
                restaurantcuisines += [cuisine_type_name]
        else:
            restaurantcuisines += ["NaN"]
        # Extract all distances from NYU
        distance = False
        all_distance_tags = list(restaurant.find_all('small'))
        for tag in all_distance_tags:
            if ("Miles" in tag.get_text().strip("\n ")):
                restaurantdistances += [tag.get_text().strip("\n ")]
                distance = True
        if distance == False:
            restaurantdistances += ["NaN"]

Remove restaurants that have any missing data.

Note: popping items while looping forward over the indices shifts the remaining entries, so some rows get skipped on a single pass (which is why these cells originally had to be run a few times before all of the bad rows were gone). The two cells below therefore iterate over the indices in reverse.


In [140]:
# Iterate in reverse so that popping an element does not shift the indices we still have to check.
for i in range(len(restaurantnames) - 1, -1, -1):
    if "NaN" in (restaurantnames[i], restaurantratings[i], restaurantnumberreviews[i],
                 restaurantpriceranges[i], restaurantcuisines[i], restaurantdistances[i]):
        restaurantnames.pop(i)
        restaurantratings.pop(i)
        restaurantnumberreviews.pop(i)
        restaurantpriceranges.pop(i)
        restaurantcuisines.pop(i)
        restaurantdistances.pop(i)

Remove restaurants that have fewer than 50 reviews.


In [143]:
# Again iterate in reverse so that popping does not shift the remaining indices.
for i in range(len(restaurantnames) - 1, -1, -1):
    number_reviews_int = int(restaurantnumberreviews[i].split()[0])  # "123 reviews" -> 123
    if number_reviews_int < 50:
        restaurantnames.pop(i)
        restaurantratings.pop(i)
        restaurantnumberreviews.pop(i)
        restaurantpriceranges.pop(i)
        restaurantcuisines.pop(i)
        restaurantdistances.pop(i)
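As an alternative, both cleaning steps could be done with pandas after collecting the lists into a DataFrame. The sketch below is not the code we used (the rest of the notebook keeps working with the plain lists); it assumes the six raw lists produced by the scraping cell are of equal length.

import numpy as np

# Sketch: DataFrame-based cleaning equivalent to the two cells above.
clean = pd.DataFrame({'Name': restaurantnames, 'Rating': restaurantratings,
                      'Number of Reviews': restaurantnumberreviews,
                      'Price Range': restaurantpriceranges,
                      'Cuisine': restaurantcuisines,
                      'Distance from NYU': restaurantdistances})
clean = clean.replace("NaN", np.nan).dropna()       # drop rows with any missing field
review_counts = clean['Number of Reviews'].str.split().str[0].astype(int)
clean = clean[review_counts >= 50]                  # keep restaurants with 50+ reviews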

We had formatting issues when displaying the $ signs: a pair of dollar signs in a notebook Markdown cell is interpreted as LaTeX math delimiters, which distorts the text, so we convert the symbols into words.


In [144]:
for i in range(len(restaurantpriceranges)):
    if restaurantpriceranges[i] == "$":
        restaurantpriceranges[i] = "under 10 dollars"
    elif restaurantpriceranges[i] == "$$":
        restaurantpriceranges[i] = "11-30 dollars"
    elif restaurantpriceranges[i] == "$$$":
        restaurantpriceranges[i] = "31-60 dollars"
    elif restaurantpriceranges[i] == "$$$$":
        restaurantpriceranges[i] = "above 61 dollars"

Create a table displaying all extracted information


In [145]:
from IPython.display import display
df = pd.DataFrame({'Name': restaurantnames,
                   'Rating': restaurantratings,
                   'Number of Reviews': restaurantnumberreviews,
                   'Price Range': restaurantpriceranges,
                   'Cuisine': restaurantcuisines,
                   'Distance from NYU': restaurantdistances},
                  columns = ['Name', 'Rating', 'Number of Reviews',
                             'Price Range', 'Cuisine', 'Distance from NYU'])
display(df)


Name Rating Number of Reviews Price Range Cuisine Distance from NYU
0 Cuba 4.0 1198 reviews 11-30 dollars Cuban 0.09 Miles
1 Kopi Ramen 4.0 71 reviews 11-30 dollars Ramen 0.06 Miles
2 Banter 4.5 50 reviews 11-30 dollars Breakfast & Brunch 0.2 Miles
3 Amélie 4.5 1593 reviews 11-30 dollars French 0.2 Miles
4 White Oak Tavern 4.0 194 reviews 11-30 dollars Bars 0.1 Miles
5 La Lanterna di Vittorio 4.0 1117 reviews 11-30 dollars Italian 0.2 Miles
6 Carroll Place 4.0 554 reviews 11-30 dollars Gastro Pubs 0.1 Miles
7 Okinii 4.0 155 reviews 11-30 dollars Japanese 0.1 Miles
8 Saigon Shack 4.0 1699 reviews under 10 dollars Vietnamese 0.2 Miles
9 Pommes Frites 4.5 292 reviews under 10 dollars Belgian 0.2 Miles
10 MIGHTY Bowl 4.0 129 reviews under 10 dollars Asian Fusion 0.2 Miles
11 Babbo 4.0 2044 reviews above 61 dollars Italian 0.2 Miles
12 Jane 4.0 2594 reviews 11-30 dollars American (Traditional) 0.2 Miles
13 The Boil 4.0 264 reviews 11-30 dollars Cajun/Creole 0.2 Miles
14 Minetta Tavern 4.0 1779 reviews 31-60 dollars French 0.2 Miles
15 Ikinari Steak East Village 4.0 174 reviews 11-30 dollars Steakhouses 0.4 Miles
16 Loring Place 4.5 77 reviews 31-60 dollars American (New) 0.2 Miles
17 Bowllin’ 4.0 56 reviews 11-30 dollars Korean 0.1 Miles
18 The Malt House 4.0 319 reviews 11-30 dollars Gastro Pubs 0.1 Miles
19 Toloache 4.0 308 reviews 11-30 dollars Mexican 0.1 Miles
20 Lupa 4.0 1234 reviews 31-60 dollars Italian 0.2 Miles
21 MEW MEN 4.5 120 reviews 11-30 dollars Ramen 0.3 Miles
22 Jack’s Wife Freda 4.0 381 reviews 11-30 dollars Mediterranean 0.3 Miles
23 Favela Cubana 4.0 662 reviews 11-30 dollars Cuban 0.08 Miles
24 Mamoun’s Falafel 4.0 2038 reviews under 10 dollars Middle Eastern 0.2 Miles
25 by CHLOE 4.0 1159 reviews 11-30 dollars Vegan 0.2 Miles
26 The Folly 4.0 110 reviews 11-30 dollars Seafood 0.2 Miles
27 Blue Hill 4.5 799 reviews above 61 dollars American (New) 0.2 Miles
28 Manousheh 4.5 231 reviews under 10 dollars Lebanese 0.2 Miles
29 Vic’s 4.0 245 reviews 11-30 dollars Bars 0.3 Miles
... ... ... ... ... ... ...
376 Taqueria Diana 4.0 530 reviews under 10 dollars Mexican 0.5 Miles
377 Wafels & Dinges 4.5 1469 reviews under 10 dollars Desserts 1.3 Miles
378 Xi’an Famous Foods 4.0 76 reviews under 10 dollars Chinese 0.6 Miles
379 Smith and Mills 4.0 235 reviews 11-30 dollars Lounges 0.9 Miles
380 Han Dynasty 4.0 841 reviews 11-30 dollars Chinese 0.5 Miles
381 Ootoya Chelsea 4.0 1042 reviews 11-30 dollars Japanese 0.7 Miles
382 Tac N Roll 4.5 104 reviews under 10 dollars Tacos 0.6 Miles
383 Cafe Espanol 3.5 463 reviews 11-30 dollars Spanish 0.2 Miles
384 Streecha 4.5 93 reviews under 10 dollars Ukranian 0.4 Miles
385 Gelso & Grand 4.0 277 reviews 11-30 dollars Italian 0.7 Miles
386 Chefs Club 4.0 278 reviews 31-60 dollars Wine Bars 0.4 Miles
387 The Tang 4.0 102 reviews 11-30 dollars Noodles 0.6 Miles
388 Gottino 4.0 260 reviews 11-30 dollars Wine Bars 0.5 Miles
389 Peculier Pub 3.5 353 reviews under 10 dollars Pubs 0.1 Miles
390 The Mermaid Inn 4.0 1081 reviews 31-60 dollars Seafood 0.5 Miles
391 Rockin’ Raw 4.0 260 reviews 11-30 dollars Vegan 0.2 Miles
392 Hudson Clearwater 4.0 434 reviews 31-60 dollars American (New) 0.5 Miles
393 Emporio 4.0 579 reviews 11-30 dollars Italian 0.5 Miles
394 Benson’s NYC 4.5 213 reviews 11-30 dollars American (New) 0.8 Miles
395 Irvington 4.0 101 reviews 31-60 dollars American (New) 0.7 Miles
396 Kulu Desserts 4.0 87 reviews under 10 dollars Desserts 0.2 Miles
397 Ed’s Lobster Bar 4.0 780 reviews 31-60 dollars Seafood 0.5 Miles
398 Cooper’s Craft & Kitchen 4.0 242 reviews 11-30 dollars Gastro Pubs 0.9 Miles
399 Feast 4.0 332 reviews 31-60 dollars American (New) 0.5 Miles
400 Trattoria Pesce Pasta 4.0 172 reviews 11-30 dollars Italian 0.3 Miles
401 Divya’s Kitchen 5.0 59 reviews 11-30 dollars Vegan 0.6 Miles
402 Pepe Rosso To Go 4.0 311 reviews 11-30 dollars Italian 0.3 Miles
403 Two Hands Restaurant & Bar 4.0 129 reviews 11-30 dollars Bars 0.9 Miles
404 Ramen Lab 4.0 237 reviews 11-30 dollars Ramen 0.6 Miles
405 Dorado Tacos 4.0 478 reviews under 10 dollars Mexican 0.4 Miles

406 rows × 6 columns

Plotting graphs

Import needed packages


In [146]:
# plotly imports
from plotly.offline import iplot, iplot_mpl  # plotting functions
import plotly.graph_objs as go               # ditto
import plotly                                # just to print version and init notebook
import sys                             # system module
import numpy as np                     # foundation for Pandas
import seaborn.apionly as sns          # fancy matplotlib graphics (no styling)
from pandas_datareader import wb, data as web  # worldbank data

plotly.offline.init_notebook_mode(connected=True)


Convert distances and price ranges into floats/integers so we can use them when plotting


In [147]:
for i in range(len(restaurantdistances)):
    restaurantdistances[i] = float(restaurantdistances[i].strip(' Miles'))

In [148]:
for i in range(len(restaurantpriceranges)):
    if restaurantpriceranges[i] == "under 10 dollars":
        restaurantpriceranges[i] = 1
    elif restaurantpriceranges[i] == "11-30 dollars":
        restaurantpriceranges[i] = 2
    elif restaurantpriceranges[i] == "31-60 dollars":
        restaurantpriceranges[i] = 3
    elif restaurantpriceranges[i] == "above 61 dollars":
        restaurantpriceranges[i] = 4

Create four separate arrays, each containing the distances of the restaurants in one price range.


In [149]:
restaurantpricerange1 = []
restaurantpricerange2 = []
restaurantpricerange3 = []
restaurantpricerange4 = []
for i in range(len(restaurantpriceranges)):
    if restaurantpriceranges[i] == 1:
        restaurantpricerange1 += [restaurantdistances[i]]
    elif restaurantpriceranges[i] == 2:
        restaurantpricerange2 += [restaurantdistances[i]]
    elif restaurantpriceranges[i] == 3:
        restaurantpricerange3 += [restaurantdistances[i]]
    elif restaurantpriceranges[i] == 4:
        restaurantpricerange4 += [restaurantdistances[i]]
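The four lists could also be built in a single pass with a dictionary keyed by the numeric price range; a sketch of an equivalent approach:

from collections import defaultdict

# Sketch: group distances by price range (1-4) in one pass.
distances_by_price = defaultdict(list)
for price, distance in zip(restaurantpriceranges, restaurantdistances):
    distances_by_price[price].append(distance)
# distances_by_price[1] then corresponds to restaurantpricerange1 above, and so on.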

In [150]:
from IPython.display import display
df = pd.DataFrame({'Price Range':restaurantpriceranges, 'Distance from NYU':restaurantdistances}, columns = ['Price Range','Distance from NYU'])
display(df.head())


Price Range Distance from NYU
0 2 0.09
1 2 0.06
2 2 0.20
3 2 0.20
4 2 0.10

Correlation between restaurant prices and distances from NYU


In [151]:
ax = sns.swarmplot(x="Price Range", y="Distance from NYU", data=df)
fig_mpl = ax.get_figure()
iplot_mpl(fig_mpl)


Observe that the lowest and the highest price ranges (under 10 and above 61 dollars respectively) have the fewest restaurants near NYU. Most of the restaurants within the given radius have prices ranging between 11 and 30 dollars.

Note 1: seaborn places the four price-range categories at x-positions 0-3, and the plotly conversion shows those positions rather than our original labels of 1-4.

Note 2: We use the word "dollars" instead of the $ sign because the symbol distorts the text formatting (see the price-range conversion above).
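If the 0-3 tick labels are a problem, the matplotlib axis can be relabelled before display; a sketch (not what we did above), reusing the df from the previous cell:

# Sketch: redraw the swarm plot with descriptive x-tick labels
# (positions 0-3 are seaborn's categorical positions for price ranges 1-4).
ax = sns.swarmplot(x="Price Range", y="Distance from NYU", data=df)
ax.set_xticklabels(["under 10 dollars", "11-30 dollars", "31-60 dollars", "above 61 dollars"])
ax.set_ylabel("Distance from NYU (miles)")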

Another look at the correlation between restaurant prices and distances from NYU

In the following four graphs, we plot the distribution of distances separately for each price range.

Note: plotly again does not render the x-axis labels correctly, so ignore the exact tick values; the x-axis simply shows increasing distance from NYU, binned in 0.1-mile steps.
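The four cells below differ only in the data and the bar colour, so a small helper function could avoid the repetition. The sketch below is our own (the name make_distance_histogram is not used elsewhere); the cells that follow keep the original, explicit versions.

# Sketch: build one styled distance histogram for a given price range.
def make_distance_histogram(distances, title, color):
    trace = go.Histogram(x=distances, histnorm='count', autobinx=False,
                         xbins=dict(start=0, end=1.4, size=0.1),
                         marker=dict(color=color), opacity=0.75)
    layout = go.Layout(title=title,
                       xaxis=dict(title='Distance'),
                       yaxis=dict(title='Number of restaurants'),
                       bargap=0.01, bargroupgap=0.1)
    return go.Figure(data=[trace], layout=layout)

# Usage, e.g.:
# iplot(make_distance_histogram(restaurantpricerange1,
#       'Number of restaurants per distance bracket in price range of up to $10', 'blue'))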


In [152]:
trace = go.Histogram(
    x = restaurantpricerange1,
    histnorm = 'count',
    name = 'control',
    autobinx = False,
    xbins = dict(
        start = 0,
        end = 1.4,
        size = 0.1),
    marker = dict(
        color='blue',
    ),
    opacity = 0.75
    )
data = [trace]
layout = go.Layout(
    title = 'Number of restaurants per distance bracket in price range of up to $10',
    xaxis = dict(title = 'Distance'),
    yaxis = dict(title = 'Number of restaurants'),
    bargap = 0.01,
    bargroupgap = 0.1
    )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'styled histogram')



In [153]:
trace = go.Histogram(
    x = restaurantpricerange3,
    histnorm = 'count',
    name = 'control',
    autobinx = False,
    xbins = dict(
        start = 0,
        end = 1.4,
        size = 0.1),
    marker = dict(
        color='green',
    ),
    opacity = 0.75
    )
data = [trace]
layout = go.Layout(
    title = 'Number of restaurants per distance bracket in price range of $31-$60',
    xaxis = dict(title = 'Distance'),
    yaxis = dict(title = 'Number of restaurants'),
    bargap = 0.01,
    bargroupgap = 0.1
    )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'styled histogram')



In [154]:
trace = go.Histogram(
    x = restaurantpricerange2,
    histnorm = 'count',
    name = 'control',
    autobinx = False,
    xbins = dict(
        start = 0,
        end = 1.4,
        size = 0.1),
    marker = dict(
        color='red',
    ),
    opacity = 0.75
    )
data = [trace]
layout = go.Layout(
    title = 'Number of restaurants per distance bracket in price range of $11-$30',
    xaxis = dict(title = 'Distance'),
    yaxis = dict(title = 'Number of restaurants'),
    bargap = 0.01,
    bargroupgap = 0.1
    )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'styled histogram')



In [155]:
trace = go.Histogram(
    x = restaurantpricerange4,
    histnorm = 'count',
    name = 'control',
    autobinx = False,
    xbins = dict(
        start = 0,
        end = 1.4,
        size = 0.1),
    marker = dict(
        color="rgba(0,191,191,1.0)"
    ),
    opacity = 0.75
    )
data = [trace]
layout = go.Layout(
    title = 'Number of restaurants per distance bracket in price range of over $60',
    xaxis = dict(title = 'Distance'),
    yaxis = dict(title = 'Number of restaurants'),
    bargap = 0.01,
    bargroupgap = 0.1,
    )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'styled histogram')


Comparing the bar graphs to the swarm plot, the two views show the same distribution. The swarm plot is better for an overall comparison, while the individual bar graphs give a magnified view of each price range.

Correlation between restaurant ratings and distances from NYU

Find all of the rating categories in our data and store them in an array.


In [156]:
restaurantrating_types = []
for rating in restaurantratings:
    if rating not in restaurantrating_types:
        restaurantrating_types += [rating]
print(restaurantrating_types)


['4.0 ', '4.5 ', '3.5 ', '5.0 ', '3.0 ']

In the cell above, we look for the distinct restaurant ratings so that we can then separate them and display them in relation to distance from NYU. It seems that within the first 500 results we scraped, Yelp's algorithm prioritizes higher-rated restaurants. Next, we count the number of restaurants with each type of rating.


In [157]:
restaurantrating_numbers = [0]*len(restaurantrating_types)
for rating in range(len(restaurantratings)):
    for type in range(len(restaurantrating_types)):
        if restaurantratings[rating] == restaurantrating_types[type]:
            restaurantrating_numbers[type] += 1
print(restaurantrating_numbers)


[275, 86, 40, 4, 1]
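This unique-then-count pattern recurs several times below (ratings, distances, cuisines); collections.Counter would do it in one line, as in this sketch of an equivalent approach:

from collections import Counter

# Sketch: count how many restaurants carry each rating string.
rating_counts = Counter(restaurantratings)
print(rating_counts)   # e.g. Counter({'4.0 ': 275, '4.5 ': 86, ...})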

In [158]:
fig = {
    "data": [
        {
            "values": restaurantrating_numbers,
            "labels": restaurantrating_types,
            "name": "Rating type",
            "hoverinfo":"label+percent+name",
            "type": "pie"
        }
    ],
    "layout": {
        "title": "Ratings separated by type",
    }
}
iplot(fig)


Intuitively, it makes sense that most restaurants received a 4-star rating. A restaurant would have to be truly exceptional to receive a full 5-star rating, which only 4 of the 406 restaurants in our cleaned data set have received.

Next, we create arrays for the three most common ratings (3.5, 4.0, and 4.5), each storing the distances of the restaurants with that rating.


In [159]:
distances_35 = []
distances_40 = []
distances_45 = []
for i in range(len(restaurantratings)):
    if restaurantratings[i] == '3.5 ':
        distances_35 += [restaurantdistances[i]]
    elif restaurantratings[i] == '4.0 ':
        distances_40 += [restaurantdistances[i]]
    elif restaurantratings[i] == '4.5 ':
        distances_45 += [restaurantdistances[i]]
print(distances_35)
print(distances_40)
print(distances_45)


[0.3, 0.1, 0.2, 0.1, 0.06, 0.3, 0.2, 0.07, 0.3, 0.1, 0.2, 0.2, 0.4, 0.3, 0.3, 0.3, 0.07, 0.2, 0.2, 0.2, 0.3, 0.2, 0.4, 0.3, 0.2, 0.3, 0.2, 0.4, 0.1, 0.2, 0.06, 0.2, 0.3, 0.2, 0.5, 0.3, 0.6, 0.6, 0.2, 0.1]
[0.09, 0.06, 0.1, 0.2, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.1, 0.1, 0.1, 0.2, 0.3, 0.08, 0.2, 0.2, 0.2, 0.3, 0.1, 0.2, 0.09, 0.2, 0.3, 0.4, 0.3, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.3, 0.3, 0.2, 0.2, 0.4, 0.2, 0.5, 0.4, 0.4, 0.1, 0.1, 0.3, 0.4, 0.4, 0.5, 0.5, 0.2, 0.4, 0.3, 0.2, 0.4, 0.2, 0.2, 0.2, 0.3, 0.3, 0.5, 0.4, 0.3, 0.4, 1.0, 0.4, 0.3, 0.2, 0.5, 0.3, 0.5, 0.2, 0.4, 0.3, 0.06, 0.6, 0.2, 0.4, 0.3, 0.3, 0.3, 0.2, 0.6, 0.3, 0.2, 0.6, 0.3, 0.4, 0.2, 0.4, 0.3, 0.5, 0.1, 0.4, 0.4, 0.4, 0.2, 0.3, 0.5, 0.6, 0.5, 0.1, 0.6, 0.3, 0.5, 0.2, 0.5, 0.5, 0.4, 0.6, 0.5, 0.5, 0.3, 0.5, 0.7, 0.4, 0.6, 0.5, 0.3, 0.3, 0.5, 0.4, 0.1, 0.4, 0.6, 0.5, 0.3, 0.7, 0.7, 0.4, 0.3, 0.4, 0.6, 0.4, 0.3, 0.4, 0.2, 0.4, 0.3, 0.2, 0.5, 0.4, 0.2, 0.5, 0.7, 0.7, 0.3, 0.5, 0.4, 0.6, 0.2, 0.7, 0.6, 0.3, 0.3, 0.5, 0.3, 0.7, 0.4, 0.4, 0.2, 0.6, 0.5, 0.5, 0.5, 0.7, 0.5, 0.8, 0.8, 0.7, 0.5, 0.4, 0.5, 0.5, 0.4, 0.6, 0.4, 0.6, 0.5, 0.6, 0.7, 0.4, 0.2, 0.8, 0.7, 0.4, 0.4, 0.2, 0.3, 0.4, 0.2, 0.8, 0.7, 0.9, 0.4, 0.5, 0.5, 0.5, 0.5, 0.4, 0.3, 0.2, 0.5, 0.9, 0.4, 0.6, 0.9, 0.6, 0.2, 0.3, 0.3, 0.4, 0.5, 0.2, 0.3, 0.5, 0.5, 0.9, 0.6, 0.6, 0.5, 0.7, 0.5, 0.6, 0.8, 0.5, 0.4, 0.7, 0.4, 0.5, 0.3, 0.7, 0.4, 0.5, 1.0, 0.9, 0.7, 0.6, 0.7, 0.5, 0.2, 0.5, 0.5, 0.8, 0.5, 0.4, 0.7, 0.3, 0.5, 0.8, 0.5, 0.6, 0.9, 0.5, 0.7, 0.7, 0.4, 0.6, 0.5, 0.5, 0.2, 0.5, 0.5, 0.7, 0.2, 0.5, 0.9, 0.5, 0.3, 0.3, 0.9, 0.6, 0.4]
[0.2, 0.2, 0.2, 0.2, 0.3, 0.2, 0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.4, 0.08, 0.4, 0.5, 0.5, 0.4, 0.4, 0.3, 0.5, 0.5, 0.2, 0.4, 0.6, 1.2, 0.6, 0.6, 0.6, 0.6, 0.2, 0.7, 0.6, 0.8, 0.2, 0.4, 0.7, 0.6, 0.5, 0.7, 0.8, 0.4, 0.4, 0.3, 0.6, 0.3, 0.6, 0.4, 0.3, 0.4, 0.6, 0.7, 0.5, 0.4, 0.2, 0.7, 0.5, 0.3, 0.6, 0.9, 0.3, 0.2, 0.8, 0.8, 0.4, 0.2, 0.8, 0.5, 0.6, 0.9, 0.9, 0.3, 0.9, 1.0, 0.6, 0.7, 1.0, 0.5, 1.0, 0.4, 0.9, 0.2, 1.3, 0.6, 0.4, 0.8]

In [160]:
distances_35_types = []
for rating in distances_35:
    if rating not in distances_35_types:
        distances_35_types += [rating]
print(distances_35_types)


[0.3, 0.1, 0.2, 0.06, 0.07, 0.4, 0.5, 0.6]

In [161]:
distances_35_numbers = [0]*len(distances_35_types)
for distance in range(len(distances_35)):
    for type in range(len(distances_35_types)):
        if distances_35[distance] == distances_35_types[type]:
            distances_35_numbers[type] += 1
print(distances_35_numbers)


[11, 5, 14, 2, 2, 3, 1, 2]

In [162]:
distances_40_types = []
for rating in distances_40:
    if rating not in distances_40_types:
        distances_40_types += [rating]
print(distances_40_types)


[0.09, 0.06, 0.1, 0.2, 0.4, 0.3, 0.08, 0.5, 1.0, 0.6, 0.7, 0.8, 0.9]

In [163]:
distances_40_numbers = [0]*len(distances_40_types)
for distance in range(len(distances_40)):
    for type in range(len(distances_40_types)):
        if distances_40[distance] == distances_40_types[type]:
            distances_40_numbers[type] += 1
print(distances_40_numbers)


[2, 2, 13, 47, 49, 43, 1, 56, 2, 24, 21, 7, 8]

In [164]:
distances_45_types = []
for rating in distances_45:
    if rating not in distances_45_types:
        distances_45_types += [rating]
print(distances_45_types)


[0.2, 0.3, 0.1, 0.4, 0.08, 0.5, 0.6, 1.2, 0.7, 0.8, 0.9, 1.0, 1.3]

In [165]:
distances_45_numbers = [0]*len(distances_45_types)
for distance in range(len(distances_45)):
    for type in range(len(distances_45_types)):
        if distances_45[distance] == distances_45_types[type]:
            distances_45_numbers[type] += 1
print(distances_45_numbers)


[14, 10, 2, 14, 1, 9, 14, 1, 6, 6, 5, 3, 1]

In [166]:
fig = {
    "data": [
        {
            "values": distances_35_numbers,
            "labels": distances_35_types,
            "name": "Distance",
            "hoverinfo":"label+percent+name",
            "hole": 0.4,
            "domain": {"x": [0, .38]},
            "type": "pie"
        },
        {
            "values": distances_45_numbers,
            "labels": distances_45_types,
            "name": "Distance",
            "hoverinfo":"label+percent+name",
            "hole": 0.4,
            "domain": {"x": [.52, 0.9]},
            "type": "pie"
        }
    ],
    "layout": {
        "title": "Restaurants with different ratings separated by distance",
        "annotations": [
            {
                "text":"3.5 stars",
                "showarrow": False,
                "x": 0.16,
                "y": 0.5
            },
            {
                "text":"4.5 stars",
                "showarrow": False,
                "x": 0.74,
                "y": 0.5
            }
        ]
    }
}
iplot(fig)



In [167]:
fig = {
    "data": [
        {
            "values": distances_40_numbers,
            "labels": distances_40_types,
            "name": "Distance",
            "hoverinfo":"label+percent+name",
            "type": "pie",
            "hole": "0.4"
        }
    ],
    "layout": {
        "title": "Restaurants with 4.0 rating separated by distance",
        "annotations": [
            {
                "text":"4.0 stars",
                "showarrow": False,
            }]
    }
}
iplot(fig)


We display the 4.0-rated restaurants in a separate donut chart from the 3.5- and 4.5-rated ones, partly because of the space the charts take up and partly because we wanted to give more prominence to the 4.0 group (most restaurants received 4 stars).

Number of restaurants per cuisine

Create an array containing all of the cuisine types in the data set.


In [168]:
restaurantcuisine_types = []
for cuisine in restaurantcuisines:
    if cuisine not in restaurantcuisine_types:
        restaurantcuisine_types += [cuisine]
print(restaurantcuisine_types)


['Cuban', 'Ramen', 'Breakfast & Brunch', 'French', 'Bars', 'Italian', 'Gastro Pubs', 'Japanese', 'Vietnamese', 'Belgian', 'Asian Fusion', 'American (Traditional)', 'Cajun/Creole', 'Steakhouses', 'American (New)', 'Korean', 'Mexican', 'Mediterranean', 'Middle Eastern', 'Vegan', 'Seafood', 'Lebanese', 'Takeaway & Fast Food', 'Coffee & Tea Shops', 'Vegetarian', 'Specialty Food', 'Indian', 'Halal', 'Malaysian', 'Gluten Free', 'Burgers', 'Tapas & Small Plates', 'Cocktail Bars', 'Chinese', 'Sandwiches', 'Noodles', 'Desserts', 'Greek', 'Wine Bars', 'Poke', 'Dim Sum', 'Pizza', 'Tapas Bars', 'Caribbean', 'Spanish', 'Thai', 'Sushi', 'Filipino', 'Crepes', 'Jazz & Blues', 'Lounges', 'Latin American', 'Ice Cream & Frozen Yoghurt', 'Cheese Shops', 'Izakaya', 'Salad', 'Cambodian', 'Bookshops', 'Moroccan', 'Delis', 'Comfort Food', 'Markets', 'Cafes', 'Southern', 'Fondue', 'Shanghainese', 'BBQ & Barbecue', 'Modern European', 'Bakeries', 'Tacos', 'Ukranian', 'Pubs']

Create an array containing the number of restaurants of each cuisine type.


In [169]:
restaurantcuisine_numbers = [0]*len(restaurantcuisine_types)
for cuisine in range(len(restaurantcuisines)):
    for type in range(len(restaurantcuisine_types)):
        if restaurantcuisines[cuisine] == restaurantcuisine_types[type]:
            restaurantcuisine_numbers[type] += 1
print(restaurantcuisine_numbers)


[5, 8, 6, 18, 8, 56, 5, 30, 5, 1, 5, 15, 3, 1, 44, 5, 12, 11, 6, 5, 19, 1, 2, 5, 3, 2, 9, 1, 1, 1, 7, 4, 4, 8, 4, 2, 4, 4, 7, 1, 3, 12, 1, 2, 3, 6, 7, 1, 1, 1, 3, 4, 2, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Since it was impractical to display such a large number of cuisine types in one graph, we merged every cuisine type represented by three or fewer restaurants into an "Other" category, which we then display in a separate graph.


In [170]:
restaurantcuisine_numbers_2 = []
restaurantcuisine_types_2 = []
restaurantcuisine_numbers_other = []
restaurantcuisine_types_other = []
other = 0
for number in range(len(restaurantcuisine_numbers)) :
    if restaurantcuisine_numbers[number] > 3:
        restaurantcuisine_numbers_2 += [restaurantcuisine_numbers[number]]
        restaurantcuisine_types_2 += [restaurantcuisine_types[number]]
    else:
        restaurantcuisine_types_other += [restaurantcuisine_types[number]]
        restaurantcuisine_numbers_other += [restaurantcuisine_numbers[number]]
        other += restaurantcuisine_numbers[number]
restaurantcuisine_numbers_2 += [other]
restaurantcuisine_types_2 += ["Other ("+str(other)+")"]
print(restaurantcuisine_numbers_2)


[5, 8, 6, 18, 8, 56, 5, 30, 5, 5, 15, 44, 5, 12, 11, 6, 5, 19, 5, 9, 7, 4, 4, 8, 4, 4, 4, 7, 12, 6, 7, 4, 4, 54]
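A pandas-based sketch of the same merge (not the code used above), using value_counts; note that value_counts orders cuisines by count rather than by first appearance:

# Sketch: count cuisines and fold the rare ones (3 or fewer restaurants) into "Other".
cuisine_counts = pd.Series(restaurantcuisines).value_counts()
common = cuisine_counts[cuisine_counts > 3].copy()
other_total = cuisine_counts[cuisine_counts <= 3].sum()
common["Other ({})".format(other_total)] = other_total
print(common)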

While working on the bar graph for the cuisines, we noticed that some of the cuisine names were too long to display fully on the graph, so we manually abbreviated them.


In [171]:
for cuisine in range(len(restaurantcuisine_types_2)):
    if restaurantcuisine_types_2[cuisine] == "Breakfast & Brunch":
        restaurantcuisine_types_2[cuisine] = "Breakfast"
    elif restaurantcuisine_types_2[cuisine] == "American (Traditional)":
        restaurantcuisine_types_2[cuisine] = "Amer. (Trad)"
    elif restaurantcuisine_types_2[cuisine] == "American (New)":
        restaurantcuisine_types_2[cuisine] = "Amer. (New)"
    elif restaurantcuisine_types_2[cuisine] == "Mediterranean":
        restaurantcuisine_types_2[cuisine] = "Mediter."
    elif restaurantcuisine_types_2[cuisine] == "Middle Eastern":
        restaurantcuisine_types_2[cuisine] = "M. Eastern"
    elif restaurantcuisine_types_2[cuisine] == "Tapas & Small Plates":
        restaurantcuisine_types_2[cuisine] = "Tapas"
    elif restaurantcuisine_types_2[cuisine] == "Latin American":
        restaurantcuisine_types_2[cuisine] = "Lat. Amer."
print(restaurantcuisine_types_2)


['Cuban', 'Ramen', 'Breakfast', 'French', 'Bars', 'Italian', 'Gastro Pubs', 'Japanese', 'Vietnamese', 'Asian Fusion', 'Amer. (Trad)', 'Amer. (New)', 'Korean', 'Mexican', 'Mediter.', 'M. Eastern', 'Vegan', 'Seafood', 'Coffee & Tea Shops', 'Indian', 'Burgers', 'Tapas', 'Cocktail Bars', 'Chinese', 'Sandwiches', 'Desserts', 'Greek', 'Wine Bars', 'Pizza', 'Thai', 'Sushi', 'Lat. Amer.', 'Delis', 'Other (54)']

In [172]:
for cuisine in range(len(restaurantcuisine_types_other)):
    if restaurantcuisine_types_other[cuisine] == "Takeaway & Fast Food":
        restaurantcuisine_types_other[cuisine] = "Fast Food"
    elif restaurantcuisine_types_other[cuisine] == "Specialty Food":
        restaurantcuisine_types_other[cuisine] = "Specialty"
    elif restaurantcuisine_types_other[cuisine] == "Ice Cream & Frozen Yoghurt":
        restaurantcuisine_types_other[cuisine] = "Ice Cream"
    elif restaurantcuisine_types_other[cuisine] == "BBQ & Barbecue":
        restaurantcuisine_types_other[cuisine] = "BBQ"
    elif restaurantcuisine_types_other[cuisine] == "Modern European":
        restaurantcuisine_types_other[cuisine] = "European"
    elif restaurantcuisine_types_other[cuisine] == "Comfort Food":
        restaurantcuisine_types_other[cuisine] = "Comfort"
print(restaurantcuisine_types_other)


['Belgian', 'Cajun/Creole', 'Steakhouses', 'Lebanese', 'Fast Food', 'Vegetarian', 'Specialty', 'Halal', 'Malaysian', 'Gluten Free', 'Noodles', 'Poke', 'Dim Sum', 'Tapas Bars', 'Caribbean', 'Spanish', 'Filipino', 'Crepes', 'Jazz & Blues', 'Lounges', 'Ice Cream', 'Cheese Shops', 'Izakaya', 'Salad', 'Cambodian', 'Bookshops', 'Moroccan', 'Comfort', 'Markets', 'Cafes', 'Southern', 'Fondue', 'Shanghainese', 'BBQ', 'European', 'Bakeries', 'Tacos', 'Ukranian', 'Pubs']
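Both abbreviation cells follow the same pattern, so a single lookup table would also work; a sketch of an equivalent approach (the table collects the renamings used in the two cells above):

# Sketch: one abbreviation table applied to both lists of cuisine labels.
abbreviations = {"Breakfast & Brunch": "Breakfast", "American (Traditional)": "Amer. (Trad)",
                 "American (New)": "Amer. (New)", "Mediterranean": "Mediter.",
                 "Middle Eastern": "M. Eastern", "Tapas & Small Plates": "Tapas",
                 "Latin American": "Lat. Amer.", "Takeaway & Fast Food": "Fast Food",
                 "Specialty Food": "Specialty", "Ice Cream & Frozen Yoghurt": "Ice Cream",
                 "BBQ & Barbecue": "BBQ", "Modern European": "European",
                 "Comfort Food": "Comfort"}
restaurantcuisine_types_2 = [abbreviations.get(c, c) for c in restaurantcuisine_types_2]
restaurantcuisine_types_other = [abbreviations.get(c, c) for c in restaurantcuisine_types_other]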

Graph with all cuisines (cuisine types with three or fewer restaurants merged into "Other")


In [173]:
import plotly.graph_objs as go
data = [go.Bar(
            y=restaurantcuisine_numbers_2,
            x=restaurantcuisine_types_2,
            orientation = 'v',
)]

iplot(data, filename='horizontal-bar')


Graph of "Other" cuisines


In [174]:
import plotly.graph_objs as go
data = [go.Bar(
            y=restaurantcuisine_numbers_other,
            x=restaurantcuisine_types_other,
            orientation = 'v'
)]

iplot(data, filename='horizontal-bar')


Ending Remarks

Scraping data

We originally tried using a CSV file provided by Yelp for its annual Dataset Challenge, but then realized that this dataset, although it contains hundreds of thousands of businesses, includes only a handful of New York-based restaurants, as it is just a subset of Yelp's full data. We therefore decided to scrape the data directly from the website, which turned out to be more enjoyable and better suited to the goals of this project.

Creating graphs

We realized early on that we wouldn't be making any groundbreaking discoveries with this analysis; the purpose of the project was to familiarize ourselves and experiment with the techniques learned in class. The primary purpose of our graphs is to display the scraped data in a more coherent way, process it visually, and look for correlations.

Thank y(elp)ou!