Submission details: Submit a single zipped file, using the file name “HW1-{YOUR_LAST_NAME}-{YOUR_FIRST_NAME}.zip”. The zip file should contain this Ipython Notebook and two folders, named “Part1” and “Part2”, corresponding to the results from the two parts of this homework.
If you have collaborated with other students, write down their names in the text box on the T-square submission site. Each student must write his/her own code and answers.
In this homework, we will first collect Yelp data through its APIs (Part 1). Because of the limitations of these APIs, we will then scrape a few pages on Yelp to collect additional information that is unavailable through the APIs (Part 2).
For this part, you will use the Yelp API to find information on local restaurants. The goal is to find, among highly rated restaurants in Atlanta, the 5 with the most reviews (largest numbers of reviews) on Yelp.
Below we provide a function yelp_req, which you can use to make requests to the Yelp API. The function returns either a JSON object or an error message, containing the information returned by the Yelp API.
For example, when url is 'http://api.yelp.com/v2/search?term=food&location=San+Francisco', yelp_req(url) will return a JSON object from the Search API.
To use the function, first fill in the values of CONSUMER_KEY, CONSUMER_SECRET, TOKEN, and TOKEN_SECRET with the credentials you obtained in step 2.
In [1]:
import urllib2
import json
import oauth2

# Please assign the following values with the credentials found in your Yelp account
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
TOKEN = ''
TOKEN_SECRET = ''

def yelp_req(url):
    """ Pass in a url that follows the format of the Yelp API,
    and this function will return either a JSON object or error messages.
    """
    oauth_request = oauth2.Request('GET', url, {})
    oauth_request.update(
        {
            'oauth_nonce': oauth2.generate_nonce(),
            'oauth_timestamp': oauth2.generate_timestamp(),
            'oauth_token': TOKEN,
            'oauth_consumer_key': CONSUMER_KEY
        }
    )
    consumer = oauth2.Consumer(CONSUMER_KEY, CONSUMER_SECRET)
    token = oauth2.Token(TOKEN, TOKEN_SECRET)
    oauth_request.sign_request(oauth2.SignatureMethod_HMAC_SHA1(), consumer, token)
    signed_url = oauth_request.to_url()
    conn = urllib2.urlopen(signed_url, None)
    try:
        response = json.loads(conn.read())
    finally:
        conn.close()
    return response
Your task is to find the 40 highest-rated restaurants in Atlanta using the Search API. Use “restaurants” as the search term and “Atlanta, GA” as the location parameter. After forming the URL, feed it to the yelp_req function in the starter code to get the API response.
Hints:
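For instance, the two request URLs can be built by varying the offset parameter. This is only a sketch: the parameter names term, location, sort, limit, and offset follow the v2 Search API, and sort=2 is assumed to request results ordered by rating — verify both against the Search API documentation.

```python
# Sketch: build Yelp v2 Search API URLs for two pages of 20 results each.
# sort=2 is assumed to mean "highest rated"; limit=20 is assumed to be the
# v2 maximum page size.
def search_url(offset):
    return ('http://api.yelp.com/v2/search?term=restaurants'
            '&location=Atlanta%%2C+GA&sort=2&limit=20&offset=%d' % offset)

first_url = search_url(0)    # restaurants ranked 1~20
next_url = search_url(20)    # restaurants ranked 21~40
```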
a. (10 points) Save the body of the HTTP response (a JSON string) containing the restaurants ranked 1~20 in rating into a file “first20.json”.
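Note that yelp_req returns the parsed JSON object rather than the raw body string, so you need to serialize it back to JSON when saving. A minimal sketch, using a placeholder dict in place of the real API reply:

```python
import json

# Placeholder standing in for the object returned by yelp_req on the
# first search URL; replace it with the actual response.
response = {'businesses': [{'name': 'Example Cafe', 'review_count': 12}]}

# json.dump serializes the object back to a JSON string in the file.
with open('first20.json', 'w') as f:
    json.dump(response, f)
```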
In [ ]:
b. (10 points) Save the body of the HTTP response (a JSON string) containing the restaurants ranked 21~40 in rating into a file “next20.json”.
In [ ]:
(5 points) For each of the 40 highest rated restaurants you collected, get the number of reviews it has received. Create a text file named “40restaurants.txt”, and write in this file the restaurant names and the numbers of reviews, one line for each restaurant, higher ratings first, comma-delimited.
For example:
Aviva by Kameel,138
Purnima,43
......
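Lines in this format can be produced by pulling 'name' and 'review_count' from each entry of the responses' 'businesses' lists — a sketch with made-up data, assuming those field names per the v2 Search API:

```python
# Made-up stand-ins for the parsed contents of first20.json and next20.json.
first20 = {'businesses': [{'name': 'Aviva by Kameel', 'review_count': 138}]}
next20 = {'businesses': [{'name': 'Purnima', 'review_count': 43}]}

# One comma-delimited line per restaurant, first20 entries first.
lines = ['%s,%d' % (b['name'], b['review_count'])
         for b in first20['businesses'] + next20['businesses']]
# lines == ['Aviva by Kameel,138', 'Purnima,43']
```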
In [ ]:
(5 points) From the 40 restaurants you collected, get the 5 restaurants with the most reviews. Create a text file named “40restaurants_top_review_count.txt”, and write in this file the names of the 5 restaurants with the most reviews (in descending order of their numbers of reviews) as well as their numbers of reviews, one line for each restaurant, comma-delimited.
For example:
Antico Pizza,1622
Fox Bros. Bar-B-Q,1168
......
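Selecting the top 5 comes down to sorting by review count in descending order, e.g. with sorted and a key function — a sketch with made-up (name, count) pairs:

```python
# Made-up (name, review_count) pairs; keep the 5 with the most reviews.
pairs = [('A', 10), ('B', 300), ('C', 150), ('D', 80), ('E', 220), ('F', 5)]
top5 = sorted(pairs, key=lambda p: p[1], reverse=True)[:5]
# top5 == [('B', 300), ('E', 220), ('C', 150), ('D', 80), ('A', 10)]
```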
In [ ]:
In the last part, we collected the 40 highest rated restaurants in Atlanta. What if we’d like to know more? In this part, we will collect the 100 highest rated restaurants in Atlanta by extracting information directly from the search result pages on Yelp.
This is the first page of search results of Atlanta restaurants on Yelp, sorted in descending order of their ratings. Browse this page to get familiar with its structure and available information, and inspect the relevant elements that render the search results.
Get the 100 highest rated restaurants in Atlanta from the search results (page 1 ~ page 10). You can figure out the URLs for the 2~10 pages of search results from the buttons on the first page or from the pattern in the above URL.
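For example, assuming each result page shows 10 restaurants and the start parameter in the URL advances by 10 per page (check this against the actual page URLs), the 10 page URLs can be generated as:

```python
# Sketch: generate the URLs for result pages 1~10, assuming start counts
# results in steps of 10 (start=0, 10, ..., 90).
base = ('http://www.yelp.com/search?find_desc=restaurants'
        '&find_loc=Atlanta%2C+GA&sortby=rating&start=')
page_urls = [base + str(start) for start in range(0, 100, 10)]
```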
Below, we show an example that reads a page into a string and calls the preprocess_yelp_page function to preprocess the string before proceeding to BeautifulSoup. Feel free to modify this code or write your own, but do preprocess the page content for every web page you read. Otherwise, there might be issues when you try to find the HTML tag containing relevant information.
You can consider either downloading the 10 web pages once and saving them into files for debugging your BeautifulSoup code, or downloading the web pages and analyzing them on-the-fly.
In [2]:
import sys
import requests
from bs4 import BeautifulSoup

def preprocess_yelp_page(content):
    ''' Remove extra spaces between HTML tags. '''
    content = ''.join([line.strip() for line in content.split('\n')])
    return content

# Example code to illustrate the use of preprocess_yelp_page
url = 'http://www.yelp.com/search?find_desc=restaurants&find_loc=Atlanta%2C+GA&sortby=rating&start=0'
content = requests.get(url).text
content = preprocess_yelp_page(content)
# Now *content* is a string containing the first page of search results,
# ready for processing with BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')
a. (10 points) Create a text file named “10restaurants.txt”, and write in this file the 10 restaurant names on the first result page, one line for each restaurant, in the original order in the search results (higher ratings first).
Note that a search result page may contain advertised results on top of the actual search results. Do a sanity check to confirm that the number of restaurants in the submission file is 10, and figure out how to identify the advertised results and remove them from the list of the 10 highest-rated restaurants.
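The general pattern is to select the result elements and skip those carrying an advertisement marker. The class names below ('result', 'ad', 'name') are purely hypothetical stand-ins for whatever markers you find when inspecting the real page:

```python
from bs4 import BeautifulSoup

# Illustrative only: 'result', 'ad', and 'name' are hypothetical class
# names; replace them with the ones from the actual Yelp markup.
html = ('<ul>'
        '<li class="result ad"><span class="name">Sponsored Spot</span></li>'
        '<li class="result"><span class="name">Aviva by Kameel</span></li>'
        '</ul>')
soup = BeautifulSoup(html, 'html.parser')

# Keep only result items whose class list does not include the ad marker.
names = [li.find(class_='name').get_text()
         for li in soup.find_all('li', class_='result')
         if 'ad' not in li.get('class', [])]
# names == ['Aviva by Kameel']
```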
In [ ]:
b. (5 points) Create a text file named “100restaurants.txt”, and write in this file the 100 restaurant names appearing on page 1,2,...,10, one line for each restaurant, in the original order in the search results (higher ratings first).
In [ ]:
(10 points) Get the restaurant names and their numbers of reviews for each of the 100 restaurants you collected in Problem 2.1.
Create a text file named “100restaurants_review_count.txt”, and write in this file the 100 restaurants names and their numbers of reviews, one line for each restaurant, comma-delimited, in the original order in the search results (higher ratings first).
For example:
Aviva by Kameel,143
Canoe,689
......
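On the page, review counts typically appear inside text such as '689 reviews'; a small regex can extract the integer. A sketch — the exact wording depends on the page markup, so adjust the pattern to what you actually see:

```python
import re

def parse_review_count(text):
    # Pull the leading integer out of strings like '689 reviews';
    # return 0 if no count is found.
    m = re.search(r'(\d+)\s+review', text)
    return int(m.group(1)) if m else 0
```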
In [ ]:
(1 point) From the 100 restaurants you collected, get the 5 restaurants with the most reviews. Create a text file named “100restaurants_top_review_count.txt”, and write in this file the 5 restaurant names as well as their numbers of reviews (in descending order of their numbers of reviews), one line for each restaurant, comma-delimited.
For example:
Antico Pizza,1636
Flip Burger Boutique,1203
...
In [ ]:
(4 points) Compare the two files “40restaurants.txt” and “100restaurants.txt”. Also compare the two files “40restaurants_top_review_count.txt” and “100restaurants_top_review_count.txt”. Are there any differences between the API results and the search results scraped directly from Yelp? Explain your findings (within 100 words) below.