We are analyzing restaurant prices, ratings, and cuisines listed on Yelp, based on their distances from 70 Washington Square South. To do this, we scrape Yelp NYC using the search term “Restaurants near New York University” and analyze the first 50 pages of results.
In [137]:
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics
import datetime as dt # date tools, used to note current date
import requests
from bs4 import BeautifulSoup
%matplotlib inline
In [138]:
total_count = 500
restaurantnames = []
restaurantratings = []
restaurantnumberreviews = []
restaurantpriceranges = []
restaurantcuisines = []
restaurantdistances = []
# Loop through 50 pages of results (10 results per page, 500 results in total)
for page in range(0, total_count, 10):
    url = "https://www.yelp.co.uk/search?find_desc=Restaurants&start={}&find_near=new-york-university-new-york-18".format(page)
    yelp = requests.get(url)
    yelp_soup = BeautifulSoup(yelp.content, 'html.parser')
    restaurants = yelp_soup.find_all('li', class_="regular-search-result")
    for restaurant in restaurants:
        # Extract the restaurant name
        fullnametag = restaurant.find_all('a', class_='biz-name js-analytics-click')
        if fullnametag != []:
            restaurantnames += [fullnametag[0].find_all('span')[0].get_text()]
        else:
            restaurantnames += ["NaN"]
        # Extract the rating
        all_rating_tags = list(restaurant.find_all('div', class_="i-stars"))
        if all_rating_tags != []:
            rating = all_rating_tags[0]['title'][0:4]
            restaurantratings += [rating]
        else:
            restaurantratings += ["NaN"]
        # Extract the number of reviews
        all_review_tags = list(restaurant.find_all('span', class_="review-count"))
        if all_review_tags != []:
            restaurantnumberreviews += [all_review_tags[0].get_text().strip("\n ")]
        else:
            restaurantnumberreviews += ["NaN"]
        # Extract the price range
        all_price_range_tags = list(restaurant.find_all('span', class_="business-attribute price-range"))
        if all_price_range_tags != []:
            restaurantpriceranges += [all_price_range_tags[0].get_text()]
        else:
            restaurantpriceranges += ["NaN"]
        # Extract the cuisine (first listed category only, so the lists stay aligned)
        all_cuisine_tags = list(restaurant.find_all('span', class_="category-str-list"))
        if all_cuisine_tags != []:
            cuisine_type = list(all_cuisine_tags[0].find_all('a'))
            restaurantcuisines += [cuisine_type[0].get_text()]
        else:
            restaurantcuisines += ["NaN"]
        # Extract the distance from NYU (first matching tag only)
        distance = False
        all_distance_tags = list(restaurant.find_all('small'))
        for tag in all_distance_tags:
            if "Miles" in tag.get_text().strip("\n "):
                restaurantdistances += [tag.get_text().strip("\n ")]
                distance = True
                break
        if distance == False:
            restaurantdistances += ["NaN"]
In [140]:
# Iterate in reverse so that popping an item does not shift the indices still to be visited
for i in range(len(restaurantnames) - 1, -1, -1):
    if "NaN" in (restaurantnames[i], restaurantratings[i], restaurantnumberreviews[i],
                 restaurantpriceranges[i], restaurantcuisines[i], restaurantdistances[i]):
        restaurantnames.pop(i)
        restaurantratings.pop(i)
        restaurantnumberreviews.pop(i)
        restaurantpriceranges.pop(i)
        restaurantcuisines.pop(i)
        restaurantdistances.pop(i)
In [143]:
# Again iterate in reverse; split off the leading number rather than using
# strip(" reviews"), which removes a character set, not a suffix
for i in range(len(restaurantnames) - 1, -1, -1):
    number_reviews_int = int(restaurantnumberreviews[i].split()[0])
    if number_reviews_int < 50:
        restaurantnames.pop(i)
        restaurantratings.pop(i)
        restaurantnumberreviews.pop(i)
        restaurantpriceranges.pop(i)
        restaurantcuisines.pop(i)
        restaurantdistances.pop(i)
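The filtering here removes entries from six parallel lists in lockstep. An alternative, sketched below with hypothetical sample data standing in for the scraped lists, is to collect everything into a DataFrame first and filter in one pass with pandas:

```python
import pandas as pd

# Hypothetical aligned sample data standing in for the scraped lists
names = ["A", "B", "C"]
reviews = ["120 reviews", "NaN", "34 reviews"]

df = pd.DataFrame({"Name": names, "Number of Reviews": reviews})

# Drop rows with any missing field, then parse the review count
df = df.replace("NaN", pd.NA).dropna()
df["Number of Reviews"] = df["Number of Reviews"].str.split().str[0].astype(int)

# Keep only restaurants with at least 50 reviews
df = df[df["Number of Reviews"] >= 50]
print(df)  # only restaurant "A" survives both filters
```

This avoids index bookkeeping entirely, since each row's fields are dropped or kept together.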
We had formatting issues displaying the $ signs: in a Markdown cell, paired dollar signs are interpreted as math-mode delimiters, so we convert the symbols into words.
In [144]:
for i in range(len(restaurantpriceranges)):
    if restaurantpriceranges[i] == "$":
        restaurantpriceranges[i] = "under 10 dollars"
    elif restaurantpriceranges[i] == "$$":
        restaurantpriceranges[i] = "11-30 dollars"
    elif restaurantpriceranges[i] == "$$$":
        restaurantpriceranges[i] = "31-60 dollars"
    elif restaurantpriceranges[i] == "$$$$":
        restaurantpriceranges[i] = "above 61 dollars"
In [145]:
from IPython.display import display
df = pd.DataFrame({'Name': restaurantnames,
                   'Rating': restaurantratings,
                   'Number of Reviews': restaurantnumberreviews,
                   'Price Range': restaurantpriceranges,
                   'Cuisine': restaurantcuisines,
                   'Distance from NYU': restaurantdistances},
                  columns = ['Name', 'Rating', 'Number of Reviews', 'Price Range', 'Cuisine', 'Distance from NYU'])
display(df)
In [146]:
# plotly imports
from plotly.offline import iplot, iplot_mpl # plotting functions
import plotly.graph_objs as go # ditto
import plotly # just to print version and init notebook
import sys # system module
import numpy as np # foundation for Pandas
import seaborn as sns # fancy matplotlib graphics (seaborn.apionly was removed in seaborn 0.9)
from pandas_datareader import wb, data as web # World Bank data
plotly.offline.init_notebook_mode(connected=True)
Convert distances and price ranges into floats/integers so we can use them when plotting.
In [147]:
for i in range(len(restaurantdistances)):
    restaurantdistances[i] = float(restaurantdistances[i].strip(' Miles'))
In [148]:
for i in range(len(restaurantpriceranges)):
    if restaurantpriceranges[i] == "under 10 dollars":
        restaurantpriceranges[i] = 1
    elif restaurantpriceranges[i] == "11-30 dollars":
        restaurantpriceranges[i] = 2
    elif restaurantpriceranges[i] == "31-60 dollars":
        restaurantpriceranges[i] = 3
    elif restaurantpriceranges[i] == "above 61 dollars":
        restaurantpriceranges[i] = 4
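The chained elifs here can also be collapsed into a single dictionary lookup; a minimal sketch (the mapping mirrors the labels used earlier):

```python
# Map the verbal price brackets back to ordinal codes in one lookup
price_codes = {
    "under 10 dollars": 1,
    "11-30 dollars": 2,
    "31-60 dollars": 3,
    "above 61 dollars": 4,
}

labels = ["11-30 dollars", "under 10 dollars", "above 61 dollars"]
codes = [price_codes[label] for label in labels]
print(codes)  # [2, 1, 4]
```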
Create four separate arrays, each containing the distances of the restaurants in one price range.
In [149]:
restaurantpricerange1 = []
restaurantpricerange2 = []
restaurantpricerange3 = []
restaurantpricerange4 = []
for i in range(len(restaurantpriceranges)):
    if restaurantpriceranges[i] == 1:
        restaurantpricerange1 += [restaurantdistances[i]]
    elif restaurantpriceranges[i] == 2:
        restaurantpricerange2 += [restaurantdistances[i]]
    elif restaurantpriceranges[i] == 3:
        restaurantpricerange3 += [restaurantdistances[i]]
    elif restaurantpriceranges[i] == 4:
        restaurantpricerange4 += [restaurantdistances[i]]
In [150]:
from IPython.display import display
df = pd.DataFrame({'Price Range':restaurantpriceranges, 'Distance from NYU':restaurantdistances}, columns = ['Price Range','Distance from NYU'])
display(df.head())
In [151]:
ax = sns.swarmplot(x="Price Range", y="Distance from NYU", data=df)
fig_mpl = ax.get_figure()
iplot_mpl(fig_mpl)
Observe that the lowest and highest price ranges (under 10 and above 61 dollars, respectively) have the fewest restaurants near NYU. Most restaurants within the given radius charge between 11 and 30 dollars.
Note 1: plotly modified our price ranges of 1, 2, 3, 4 to 0, 1, 2, 3.
Note 2: We are using the word "dollars" instead of the $ sign because it is distorting the formatting of the text.
In the following four graphs, we create a separate graph per price range.
Note: plotly again mislabeled the x-axis, so ignore the actual numbers; just keep in mind that distance increases from left to right along the x-axis.
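One way to sidestep the mislabeled axes is to skip the matplotlib-to-plotly conversion and set the tick labels on the matplotlib axes directly. A minimal sketch, with hypothetical counts and bracket labels of our own:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; safe outside a notebook
import matplotlib.pyplot as plt

bracket_labels = ["under $10", "$11-30", "$31-60", "above $61"]
counts = [5, 30, 12, 3]  # hypothetical restaurant counts per bracket

fig, ax = plt.subplots()
ax.bar(range(len(counts)), counts)
ax.set_xticks(range(len(bracket_labels)))
ax.set_xticklabels(bracket_labels)  # explicit labels, so nothing gets renumbered
ax.set_xlabel("Price Range")
ax.set_ylabel("Number of restaurants")
```

Because the labels are set explicitly, the categories cannot be silently remapped to 0-3.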
In [152]:
trace = go.Histogram(
    x = restaurantpricerange1,
    histnorm = 'count',
    name = 'control',
    autobinx = False,
    xbins = dict(
        start = 0,
        end = 1.4,
        size = 0.1),
    marker = dict(
        color = 'blue',
    ),
    opacity = 0.75
)
data = [trace]
layout = go.Layout(
    title = 'Number of restaurants per distance bracket in price range of up to $10',
    xaxis = dict(title = 'Distance'),
    yaxis = dict(title = 'Number of restaurants'),
    bargap = 0.01,
    bargroupgap = 0.1
)
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'styled histogram')
In [153]:
trace = go.Histogram(
    x = restaurantpricerange3,
    histnorm = 'count',
    name = 'control',
    autobinx = False,
    xbins = dict(
        start = 0,
        end = 1.4,
        size = 0.1),
    marker = dict(
        color = 'green',
    ),
    opacity = 0.75
)
data = [trace]
layout = go.Layout(
    title = 'Number of restaurants per distance bracket in price range of $31-$60',
    xaxis = dict(title = 'Distance'),
    yaxis = dict(title = 'Number of restaurants'),
    bargap = 0.01,
    bargroupgap = 0.1
)
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'styled histogram')
In [154]:
trace = go.Histogram(
    x = restaurantpricerange2,
    histnorm = 'count',
    name = 'control',
    autobinx = False,
    xbins = dict(
        start = 0,
        end = 1.4,
        size = 0.1),
    marker = dict(
        color = 'red',
    ),
    opacity = 0.75
)
data = [trace]
layout = go.Layout(
    title = 'Number of restaurants per distance bracket in price range of $11-$30',
    xaxis = dict(title = 'Distance'),
    yaxis = dict(title = 'Number of restaurants'),
    bargap = 0.01,
    bargroupgap = 0.1
)
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'styled histogram')
In [155]:
trace = go.Histogram(
    x = restaurantpricerange4,
    histnorm = 'count',
    name = 'control',
    autobinx = False,
    xbins = dict(
        start = 0,
        end = 1.4,
        size = 0.1),
    marker = dict(
        color = "rgba(0,191,191,1.0)"
    ),
    opacity = 0.75
)
data = [trace]
layout = go.Layout(
    title = 'Number of restaurants per distance bracket in price range of $61 and above',
    xaxis = dict(title = 'Distance'),
    yaxis = dict(title = 'Number of restaurants'),
    bargap = 0.01,
    bargroupgap = 0.1,
)
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'styled histogram')
If we compare the bar graphs to the swarm plot, we see that they match the data distribution. The swarm plot is better for an overall comparison, while the individual bar graphs give a more magnified view for each price range.
Find all of the rating categories in our data and store them in an array.
In [156]:
restaurantrating_types = []
for rating in restaurantratings:
    if rating not in restaurantrating_types:
        restaurantrating_types += [rating]
print(restaurantrating_types)
In the cell above, we look for the types of restaurant ratings so that we can separate them and display them in relation to distance from NYU. It seems that within the first 500 results we scraped, Yelp's algorithm prioritises higher-rated restaurants. Next, we count the number of restaurants with each type of rating.
In [157]:
restaurantrating_numbers = [0]*len(restaurantrating_types)
for rating in range(len(restaurantratings)):
    for t in range(len(restaurantrating_types)):
        if restaurantratings[rating] == restaurantrating_types[t]:
            restaurantrating_numbers[t] += 1
print(restaurantrating_numbers)
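The manual unique-and-count pattern used here (and again for the distance breakdowns further down) can also be done in one step with `collections.Counter` from the standard library; a short sketch with made-up ratings:

```python
from collections import Counter

ratings = ["4.0", "3.5", "4.0", "4.5", "4.0", "3.5"]  # hypothetical sample
counts = Counter(ratings)

rating_types = list(counts.keys())      # unique ratings, in first-seen order
rating_numbers = list(counts.values())  # matching counts
print(rating_types, rating_numbers)  # ['4.0', '3.5', '4.5'] [3, 2, 1]
```

Counter preserves insertion order (Python 3.7+), so the two lists stay aligned just like the hand-rolled version.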
In [158]:
fig = {
    "data": [
        {
            "values": restaurantrating_numbers,
            "labels": restaurantrating_types,
            "name": "Rating type",
            "hoverinfo": "label+percent+name",
            "type": "pie"
        }
    ],
    "layout": {
        "title": "Ratings separated by type",
    }
}
iplot(fig)
Intuitively, it makes sense that most restaurants received a 4-star rating. A restaurant would have to be truly exceptional to receive a full 5-star rating, which only 4 out of our 500 restaurants have received.
Next, we create arrays for the 3 most popular ratings (3.5, 4.0, and 4.5), storing the distances of the restaurants with the respective rating.
In [159]:
distances_35 = []
distances_40 = []
distances_45 = []
# The trailing space in '3.5 ' etc. comes from slicing the first four
# characters of the star-rating title attribute (e.g. "3.5 star rating")
for i in range(len(restaurantratings)):
    if restaurantratings[i] == '3.5 ':
        distances_35 += [restaurantdistances[i]]
    elif restaurantratings[i] == '4.0 ':
        distances_40 += [restaurantdistances[i]]
    elif restaurantratings[i] == '4.5 ':
        distances_45 += [restaurantdistances[i]]
print(distances_35)
print(distances_40)
print(distances_45)
In [160]:
distances_35_types = []
for distance in distances_35:
    if distance not in distances_35_types:
        distances_35_types += [distance]
print(distances_35_types)
In [161]:
distances_35_numbers = [0]*len(distances_35_types)
for d in range(len(distances_35)):
    for t in range(len(distances_35_types)):
        if distances_35[d] == distances_35_types[t]:
            distances_35_numbers[t] += 1
print(distances_35_numbers)
In [162]:
distances_40_types = []
for distance in distances_40:
    if distance not in distances_40_types:
        distances_40_types += [distance]
print(distances_40_types)
In [163]:
distances_40_numbers = [0]*len(distances_40_types)
for d in range(len(distances_40)):
    for t in range(len(distances_40_types)):
        if distances_40[d] == distances_40_types[t]:
            distances_40_numbers[t] += 1
print(distances_40_numbers)
In [164]:
distances_45_types = []
for distance in distances_45:
    if distance not in distances_45_types:
        distances_45_types += [distance]
print(distances_45_types)
In [165]:
distances_45_numbers = [0]*len(distances_45_types)
for d in range(len(distances_45)):
    for t in range(len(distances_45_types)):
        if distances_45[d] == distances_45_types[t]:
            distances_45_numbers[t] += 1
print(distances_45_numbers)
In [166]:
fig = {
    "data": [
        {
            "values": distances_35_numbers,
            "labels": distances_35_types,
            "name": "Distance",
            "hoverinfo": "label+percent+name",
            "hole": 0.4,
            "domain": {"x": [0, .38]},
            "type": "pie"
        },
        {
            "values": distances_45_numbers,
            "labels": distances_45_types,
            "name": "Distance",
            "hoverinfo": "label+percent+name",
            "hole": 0.4,
            "domain": {"x": [.52, 0.9]},
            "type": "pie"
        }
    ],
    "layout": {
        "title": "Restaurants with different ratings separated by distance",
        "annotations": [
            {
                "text": "3.5 stars",
                "showarrow": False,
                "x": 0.16,
                "y": 0.5
            },
            {
                "text": "4.5 stars",
                "showarrow": False,
                "x": 0.74,
                "y": 0.5
            }
        ]
    }
}
iplot(fig)
In [167]:
fig = {
    "data": [
        {
            "values": distances_40_numbers,
            "labels": distances_40_types,
            "name": "Distance",
            "hoverinfo": "label+percent+name",
            "type": "pie",
            "hole": 0.4  # numeric, not the string "0.4"
        }
    ],
    "layout": {
        "title": "Restaurants with 4.0 rating separated by distance",
        "annotations": [
            {
                "text": "4.0 stars",
                "showarrow": False,
            }
        ]
    }
}
iplot(fig)
In displaying these donut charts, we chose to separate the 4.0-rated restaurants from the 3.5- and 4.5-rated ones, both because of spatial constraints in displaying the charts and because we wanted to give higher importance to the former (most restaurants received 4 stars).
Create an array containing all of the cuisine types in the data set.
In [168]:
restaurantcuisine_types = []
for cuisine in restaurantcuisines:
    if cuisine not in restaurantcuisine_types:
        restaurantcuisine_types += [cuisine]
print(restaurantcuisine_types)
Create an array containing number of restaurants of each cuisine type.
In [169]:
restaurantcuisine_numbers = [0]*len(restaurantcuisine_types)
for cuisine in range(len(restaurantcuisines)):
    for t in range(len(restaurantcuisine_types)):
        if restaurantcuisines[cuisine] == restaurantcuisine_types[t]:
            restaurantcuisine_numbers[t] += 1
print(restaurantcuisine_numbers)
Given that it was impossible to display such a high number of cuisine types in one graph, we merged all cuisine types represented by three or fewer restaurants into an "Other" category, which we display in a separate graph.
In [170]:
restaurantcuisine_numbers_2 = []
restaurantcuisine_types_2 = []
restaurantcuisine_numbers_other = []
restaurantcuisine_types_other = []
other = 0
for number in range(len(restaurantcuisine_numbers)):
    if restaurantcuisine_numbers[number] > 3:
        restaurantcuisine_numbers_2 += [restaurantcuisine_numbers[number]]
        restaurantcuisine_types_2 += [restaurantcuisine_types[number]]
    else:
        restaurantcuisine_types_other += [restaurantcuisine_types[number]]
        restaurantcuisine_numbers_other += [restaurantcuisine_numbers[number]]
        other += restaurantcuisine_numbers[number]
restaurantcuisine_numbers_2 += [other]
restaurantcuisine_types_2 += ["Other ("+str(other)+")"]
print(restaurantcuisine_numbers_2)
While working on the bar graph for the cuisines, we noticed that some of the cuisine strings were too long to display fully on the graph, so we manually abbreviated those names.
In [171]:
for cuisine in range(len(restaurantcuisine_types_2)):
    if restaurantcuisine_types_2[cuisine] == "Breakfast & Brunch":
        restaurantcuisine_types_2[cuisine] = "Breakfast"
    elif restaurantcuisine_types_2[cuisine] == "American (Traditional)":
        restaurantcuisine_types_2[cuisine] = "Amer. (Trad)"
    elif restaurantcuisine_types_2[cuisine] == "American (New)":
        restaurantcuisine_types_2[cuisine] = "Amer. (New)"
    elif restaurantcuisine_types_2[cuisine] == "Mediterranean":
        restaurantcuisine_types_2[cuisine] = "Mediter."
    elif restaurantcuisine_types_2[cuisine] == "Middle Eastern":
        restaurantcuisine_types_2[cuisine] = "M. Eastern"
    elif restaurantcuisine_types_2[cuisine] == "Tapas & Small Plates":
        restaurantcuisine_types_2[cuisine] = "Tapas"
    elif restaurantcuisine_types_2[cuisine] == "Latin American":
        restaurantcuisine_types_2[cuisine] = "Lat. Amer."
print(restaurantcuisine_types_2)
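The elif chains used here and in the next cell can also be written as a single dictionary of abbreviations, which is easier to extend; a sketch using a few of the same renamings:

```python
# Abbreviations mirroring some of the renamings applied in the cells above/below
abbreviations = {
    "Breakfast & Brunch": "Breakfast",
    "American (Traditional)": "Amer. (Trad)",
    "Takeaway & Fast Food": "Fast Food",
}

cuisines = ["Breakfast & Brunch", "Pizza", "Takeaway & Fast Food"]
# Fall back to the original name when no abbreviation is defined
short = [abbreviations.get(c, c) for c in cuisines]
print(short)  # ['Breakfast', 'Pizza', 'Fast Food']
```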
In [172]:
for cuisine in range(len(restaurantcuisine_types_other)):
    if restaurantcuisine_types_other[cuisine] == "Takeaway & Fast Food":
        restaurantcuisine_types_other[cuisine] = "Fast Food"
    elif restaurantcuisine_types_other[cuisine] == "Specialty Food":
        restaurantcuisine_types_other[cuisine] = "Specialty"
    elif restaurantcuisine_types_other[cuisine] == "Ice Cream & Frozen Yoghurt":
        restaurantcuisine_types_other[cuisine] = "Ice Cream"
    elif restaurantcuisine_types_other[cuisine] == "BBQ & Barbecue":
        restaurantcuisine_types_other[cuisine] = "BBQ"
    elif restaurantcuisine_types_other[cuisine] == "Modern European":
        restaurantcuisine_types_other[cuisine] = "European"
    elif restaurantcuisine_types_other[cuisine] == "Comfort Food":
        restaurantcuisine_types_other[cuisine] = "Comfort"
print(restaurantcuisine_types_other)
In [173]:
import plotly.graph_objs as go
data = [go.Bar(
    y = restaurantcuisine_numbers_2,
    x = restaurantcuisine_types_2,
    orientation = 'v',
)]
iplot(data, filename = 'vertical-bar')
In [174]:
import plotly.graph_objs as go
data = [go.Bar(
    y = restaurantcuisine_numbers_other,
    x = restaurantcuisine_types_other,
    orientation = 'v'
)]
iplot(data, filename = 'vertical-bar')
We originally tried using a CSV file provided by Yelp for its annual Dataset Challenge, but realized that although it contains hundreds of thousands of businesses, it includes only a handful of New York-based restaurants, as it is just a subset of Yelp's full data. We therefore decided to scrape the data directly from the website, which proved more enjoyable and better suited to the goals of this project.
We realized early on that we wouldn't be making any groundbreaking discoveries with this analysis; the purpose of the project was to familiarize ourselves with, and experiment with, the techniques learned in class. The primary purpose of the graphs is to display the scraped data more coherently, let us process it visually, and suggest possible correlations.