CSE 6040 Computing for Data Analysis, Fall 2015

Homework 1: Collecting Yelp Data Using Python

Due Sep 18, 2015 (Friday), 11:55 pm EDT

Submission details: Submit a single zipped file, using the file name “HW1-{YOUR_LAST_NAME}-{YOUR_FIRST_NAME}.zip”. The zip file should contain this IPython Notebook and two folders, named “Part1” and “Part2”, containing the results from the two parts of this homework.

If you have collaborated with other students, write down their names in the text box on the T-square submission site. Each student must write his/her own code and answers.

In this homework, we will first collect Yelp data through its APIs (Part 1). Because these APIs are limited, we will then scrape a few pages on Yelp to collect information that is unavailable through the APIs (Part 2).

Part 1 [30 points]: Collecting Yelp Data through Yelp API

For this part, you will use the Yelp API to find information about local restaurants. The goal is to find the 5 highly-rated restaurants in Atlanta with the most reviews (largest numbers of reviews) on Yelp.

Preparation:

  1. Go to http://www.yelp.com/developers and create an account. You can use your existing Yelp account or create a new account by providing your name, email address, and zip code.
  2. Go to http://www.yelp.com/developers/manage_api_keys to generate your app key/secret and a token, by providing a website URL (can be anything, for example a dummy URL or the course page http://cse6040.gatech.edu/fa15/) and describing the purpose of using the APIs (e.g. for homework). Copy the “Consumer Key”, “Consumer Secret”, “Token”, and “Token Secret” for your own records; we will use them to fetch Yelp data in this homework.
  3. Go to http://www.yelp.com/developers/documentation and learn how to build the URLs needed to use the Yelp Search API and Business API.
  4. In this homework, you will learn how to install a package in Python. We will use the tool “pip”, which should already be installed with Anaconda. If not, follow this link to install “pip”: https://pip.pypa.io/en/latest/installing.html
  5. For any package you want to install in Python, type “pip install {package_name}” in the command prompt. For this part, you have to install the oauth2 package.

Instructions:

Below we provide a function, yelp_req, which you will use to make requests to the Yelp API. The function returns the information sent back by the Yelp API, either as a JSON object or as error messages.

For example, when url is 'http://api.yelp.com/v2/search?term=food&location=San+Francisco', yelp_req(url) will return a JSON object from the Search API.

To use the function, first fill in the values of CONSUMER_KEY, CONSUMER_SECRET, TOKEN, and TOKEN_SECRET, using the credentials you obtained in step 2 of the preparation.


In [1]:
import urllib2
import json
import oauth2

# Fill in the following values with the credentials from your Yelp account
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
TOKEN = ''
TOKEN_SECRET = ''


def yelp_req(url):
    """ Pass in a url that follows the format of Yelp API,
        and this function will return either a JSON object or error messages.
    """
    
    oauth_request = oauth2.Request('GET', url, {})
    oauth_request.update(
        {
            'oauth_nonce': oauth2.generate_nonce(),
            'oauth_timestamp': oauth2.generate_timestamp(),
            'oauth_token': TOKEN,
            'oauth_consumer_key': CONSUMER_KEY
        }
    )
    consumer = oauth2.Consumer(CONSUMER_KEY, CONSUMER_SECRET)
    token = oauth2.Token(TOKEN, TOKEN_SECRET)
    oauth_request.sign_request(oauth2.SignatureMethod_HMAC_SHA1(), consumer, token)
    signed_url = oauth_request.to_url()

    conn = urllib2.urlopen(signed_url, None)
    try:
        response = json.loads(conn.read())
    finally:
        conn.close()

    return response

Problem 1.1

Your task is to find the 40 highest rated restaurants in Atlanta using the Search API. Use “restaurants” as the search term, and “Atlanta, GA” as the location parameter. After forming the URL, you need to feed it to the yelp_req function in the starter code to get the API response.

Hints:

  1. Look at the parameters limit, offset, and sort in Yelp’s API documentation to build the URL.
  2. Percent-encode the parameters using urllib.urlencode.
  3. For authentication, many additional parameters must be appended to the URL you formed. Try printing the full URL of the HTTP request to see which parameters are included.
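
As a rough illustration of hints 1 and 2, the sketch below assembles a Search API URL from a list of parameters. The helper name build_search_url is hypothetical, and the value sort=2 is an assumption about how the API selects "highest rated"; check both the parameter names and the sort value against the Yelp API documentation before relying on them. The import falls back so that the snippet runs under both Python 2 and Python 3.

```python
try:
    from urllib import urlencode        # Python 2, as used in this notebook
except ImportError:
    from urllib.parse import urlencode  # Python 3

SEARCH_ENDPOINT = 'http://api.yelp.com/v2/search'


def build_search_url(term, location, limit, offset, sort):
    """Return a Search API URL with percent-encoded query parameters.

    Hypothetical helper for illustration; not part of the starter code.
    """
    # A list of pairs keeps the parameter order stable in the output.
    params = [
        ('term', term),
        ('location', location),
        ('limit', limit),
        ('offset', offset),
        ('sort', sort),  # assumed: 2 requests "highest rated" ordering
    ]
    return SEARCH_ENDPOINT + '?' + urlencode(params)


url = build_search_url('restaurants', 'Atlanta, GA',
                       limit=20, offset=0, sort=2)
print(url)
```

The resulting string can then be passed to yelp_req, which adds the OAuth parameters mentioned in hint 3.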

a. (10 points) Save the body of the HTTP response (a JSON string) containing the restaurants ranked 1~20 in rating into a file “first20.json”.


In [ ]:

b. (10 points) Save the body of the HTTP response (a JSON string) containing the restaurants ranked 21~40 in rating into a file “next20.json”.


In [ ]:

Problem 1.2

(5 points) For each of the 40 highest rated restaurants you collected, get the number of reviews it has received. Create a text file named “40restaurants.txt”, and write in this file the restaurant names and the numbers of reviews, one line for each restaurant, higher ratings first, comma-delimited.

For example:

Aviva by Kameel,138
Purnima,43
......
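
One possible way to turn a response body into comma-delimited lines is sketched below. It assumes the response holds a "businesses" list whose entries have "name" and "review_count" fields; confirm this against a real response. The sample body, the helper names, and the output file name are made up for illustration.

```python
import io
import json
import os
import tempfile

# Hypothetical, hand-written response body; real responses carry many more fields.
sample_body = ('{"businesses": ['
               '{"name": "Restaurant A", "review_count": 12}, '
               '{"name": "Restaurant B", "review_count": 7}]}')


def to_lines(json_body):
    """Turn a response body into 'name,review_count' strings."""
    data = json.loads(json_body)
    return [u'{0},{1}'.format(b['name'], b['review_count'])
            for b in data['businesses']]


def write_lines(path, lines):
    """Write one entry per line, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(u'\n'.join(lines) + u'\n')


lines = to_lines(sample_body)
# Written to a temporary folder here so the sketch leaves no stray files behind.
out_path = os.path.join(tempfile.mkdtemp(), 'sample_restaurants.txt')
write_lines(out_path, lines)
print(lines)
```

Opening the file with an explicit UTF-8 encoding avoids errors when a restaurant name contains non-ASCII characters.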

In [ ]:

Problem 1.3

(5 points) From the 40 restaurants you collected, get the 5 restaurants with the most reviews. Create a text file named “40restaurants_top_review_count.txt”, and write in this file the names of the 5 restaurants with the most reviews (in descending order of their numbers of reviews) as well as their numbers of reviews, one line for each restaurant, comma-delimited.

For example:

Antico Pizza,1622
Fox Bros. Bar-B-Q,1168
......
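
Selecting the entries with the most reviews amounts to sorting by review count in descending order and keeping the first few. A minimal sketch follows, using made-up (name, review count) pairs in place of the collected data; the helper name top_by_reviews is hypothetical.

```python
# Hypothetical (name, review_count) pairs standing in for the collected restaurants.
restaurants = [
    ('Restaurant A', 120),
    ('Restaurant B', 980),
    ('Restaurant C', 45),
    ('Restaurant D', 310),
    ('Restaurant E', 77),
    ('Restaurant F', 560),
    ('Restaurant G', 15),
]


def top_by_reviews(pairs, n=5):
    """Return the n pairs with the largest review counts, largest first."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


top5 = top_by_reviews(restaurants)
for name, count in top5:
    print('{0},{1}'.format(name, count))
```

Because sorted is stable, restaurants with equal review counts keep their original relative order, so higher-rated entries still come first among ties.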

In [ ]:

Deliverables:

  1. Please remove your credential information (key/secret, tokens) from the notebook before submitting.
  2. In this part, 4 text files should be created (programmatically). Put all of them in the folder "Part1".

Part 2 [30 points]: Scraping Yelp Pages to Collect More Data

In the last part, we collected the 40 highest rated restaurants in Atlanta. What if we’d like to know more? In this part, we will collect the 100 highest rated restaurants in Atlanta by extracting information directly from the search result pages on Yelp.

Preparation:

  1. Install the BeautifulSoup package using the command “pip install beautifulsoup4”.
  2. Open the link: http://www.yelp.com/search?find_desc=restaurants&find_loc=Atlanta%2C+GA&sortby=rating&start=0

This is the first page of search results of Atlanta restaurants on Yelp, sorted in descending order of their ratings. Browse this page to get familiar with its structure and available information, and inspect the relevant elements that render the search results.

Problem 2.1

Get the 100 highest rated restaurants in Atlanta from the search results (page 1 ~ page 10). You can figure out the URLs for pages 2~10 of the search results from the buttons on the first page or from the pattern in the above URL.

Below, we show an example that reads a page into a string and calls the preprocess_yelp_page function to preprocess the string before passing it to BeautifulSoup. Feel free to modify this code or write your own, but do preprocess the page content for every web page you read. Otherwise, you may run into issues when you try to find the HTML tags containing the relevant information.

You can consider either downloading the 10 web pages once and saving them into files for debugging your BeautifulSoup code, or downloading the web pages and analyzing them on-the-fly.
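
The URL pattern can be sketched as follows, under the assumption that each result page lists 10 restaurants and that the start parameter is the index of the first result on the page (0, 10, ..., 90). Verify this against the page buttons; the helper name result_page_urls is hypothetical.

```python
BASE_URL = ('http://www.yelp.com/search?find_desc=restaurants'
            '&find_loc=Atlanta%2C+GA&sortby=rating&start={0}')


def result_page_urls(num_pages=10, per_page=10):
    """Build the URLs of the first num_pages search result pages."""
    # Assumed: 'start' advances by per_page from one page to the next.
    return [BASE_URL.format(page * per_page) for page in range(num_pages)]


urls = result_page_urls()
print(urls[0])
print(urls[-1])
```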


In [2]:
import sys
import requests
from bs4 import BeautifulSoup

def preprocess_yelp_page(content):
    ''' Remove extra spaces between HTML tags. '''
    content = ''.join([line.strip() for line in content.split('\n')])
    return content

# Example code to illustrate the use of preprocess_yelp_page
url = 'http://www.yelp.com/search?find_desc=restaurants&find_loc=Atlanta%2C+GA&sortby=rating&start=0'
content = requests.get(url).text
content = preprocess_yelp_page(content) # Now *content* is a string containing the first page of search results, ready for processing with BeautifulSoup
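soup = BeautifulSoup(content, 'html.parser')  # Name the parser explicitly so results do not depend on which parsers are installed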

a. (10 points) Create a text file named “10restaurants.txt”, and write in this file the 10 restaurant names on the first result page, one line for each restaurant, in the original order in the search results (higher ratings first).

Note that a search result page may contain advertised results on top of the actual search result. Please do a sanity check that confirms the number of restaurants in the submission file is 10, and figure out how to identify the advertised results and remove them from the list of 10 highest rated restaurants.
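
The sanity check could look roughly like the sketch below. It assumes you have already parsed each result into a (name, is_ad) pair; how to detect an advertised result from the page markup is left for you to work out, and the helper name drop_ads and the sample data are hypothetical.

```python
def drop_ads(results, expected=10):
    """Remove advertised results and confirm the expected count remains.

    results: list of (name, is_ad) pairs in page order.
    """
    organic = [name for name, is_ad in results if not is_ad]
    if len(organic) != expected:
        raise ValueError('expected {0} restaurants, found {1}'.format(
            expected, len(organic)))
    return organic


# Made-up page: one advertised result followed by 10 regular results.
page = [('Sponsored Place', True)]
page += [('Restaurant {0}'.format(i), False) for i in range(1, 11)]
names = drop_ads(page)
print(len(names))
```

Raising an error instead of silently truncating the list makes it obvious when the page structure differs from what the parsing code expects.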


In [ ]:

b. (5 points) Create a text file named “100restaurants.txt”, and write in this file the 100 restaurant names appearing on pages 1, 2, ..., 10, one line for each restaurant, in the original order in the search results (higher ratings first).


In [ ]:

Problem 2.2

(10 points) Get the restaurant names and their numbers of reviews for each of the 100 restaurants you collected in Problem 2.1.

Create a text file named “100restaurants_review_count.txt”, and write in this file the 100 restaurant names and their numbers of reviews, one line for each restaurant, comma-delimited, in the original order in the search results (higher ratings first).

For example:

Aviva by Kameel,143
Canoe,689
......

In [ ]:

Problem 2.3

(1 point) From the 100 restaurants you collected, get the 5 restaurants with the most reviews. Create a text file named “100restaurants_top_review_count.txt”, and write in this file the 5 restaurant names as well as their numbers of reviews, one line for each restaurant, comma-delimited.

For example:

Antico Pizza,1636
Flip Burger Boutique,1203
...

In [ ]:

Problem 2.4

(4 points) Compare the two files “40restaurants.txt” and “100restaurants.txt”. Also compare the two files “40restaurants_top_review_count.txt” and “100restaurants_top_review_count.txt”. Are there any differences between the API results and the search results scraped directly from Yelp? Explain your findings (within 100 words) below.

Deliverables:

In this part, 4 text files should be created (programmatically). Put all of them in the folder "Part2".