Prepare the connection

  1. Apply for OAuth keys on yelp.com
  2. Store your OAuth credentials in a file for safety (see the example file below)
  3. Read the credentials from the file when creating the client
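
For reference, yelp_oauth.json is just a JSON object whose keys match the keyword arguments of Oauth1Authenticator, so it can be unpacked directly with Oauth1Authenticator(**creds). The values below are placeholders; put your own keys there:

{
    "consumer_key": "your consumer key",
    "consumer_secret": "your consumer secret",
    "token": "your token",
    "token_secret": "your token secret"
}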

In [60]:
import json  # for reading the OAuth info and saving the results
import io  # for reading the credential file
from yelp.client import Client
from yelp.oauth1_authenticator import Oauth1Authenticator
from pprint import pprint  #to better understand the result format

with io.open('yelp_oauth.json') as cred:
    creds = json.load(cred)
    auth = Oauth1Authenticator(**creds)
    client = Client(auth)

Just to show the content of the authenticator:

auth = Oauth1Authenticator(
    consumer_key='your consumer key',
    consumer_secret='your consumer secret',
    token='your token',
    token_secret='your token secret'
)

client = Client(auth)

Prepare the terms we'd like to search for

The example is Boston, and Boston is too huge, so we search by zipcode to narrow the results down.

I just copied and pasted the zipcodes from a website and found that Boston zipcodes start with 0, so I saved them as a list of strings.


In [114]:
zipstr = '02108, 02109, 02110, 02111, 02113, 02114, 02115, 02116, 02118, 02119, 02120, 02121, 02122, 02124, 02125, 02126, 02127, 02128, 02129, 02130, 02131, 02132, 02134, 02135, 02136, 02151, 02152, 02163, 02199, 02203, 02210, 02215, 02467'
zips = zipstr.split(', ')
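
As a quick sanity check (a minimal snippet; the values in the comments assume the zipcode list above):

In [ ]:
print len(zips)   # 33 zipcodes
print zips[:3]    # ['02108', '02109', '02110'] -- strings, so the leading 0 is kept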

Then set up the search parameters:

For the details of search parameters, go to Search API


In [123]:
params = {
    'lang': 'en',
    'sort': 0  # sort mode: 0 = Best matched (default), 1 = Distance, 2 = Highest Rated
    # 'limit': 20   # limit can be 1 to 20 results per request
    # 'offset': 20  # offset into the result list; we will use this parameter later in the loop
}

Start to retrieve the data!

We begin with a test search for Boston and add the parameters we set up above:


In [126]:
response = client.search('Boston', **params)

Then see how many restaurants we get in the search:


In [127]:
print 'The number of restaurants in Boston on Yelp: {}'.format(response.total)


The number of restaurants in Boston on Yelp: 22476

Since there are 22476 restaurants in Boston on Yelp, but a single request returns at most 20 restaurants and a single search criterion returns at most 1000 results, we use zipcodes to narrow down the scope and collect all the results for Boston.

Let's take a look at the data.

The responses we get are objects, so we convert one business to a readable format here with vars() and print it with pprint:

To see all the response values and their definitions for a business, go to Search API-Business


In [128]:
b = vars(response.businesses[0])
b['location'] = vars(response.businesses[0].location)
b['location']['coordinate'] = vars(response.businesses[0].location['coordinate'])
pprint(b)


{'categories': [Category(name=u'Coffee & Tea', alias=u'coffee')],
 'deals': None,
 'display_phone': u'+1-617-227-0786',
 'distance': None,
 'eat24_url': None,
 'gift_certificates': None,
 'id': u'polcaris-coffee-boston',
 'image_url': u'https://s3-media3.fl.yelpcdn.com/bphoto/6LSOrfE4nfNVp3omuOfWfw/ms.jpg',
 'is_claimed': True,
 'is_closed': False,
 'location': {'address': [u'105 Salem St'],
              'city': u'Boston',
              'coordinate': {'latitude': 42.36401, 'longitude': -71.0555},
              'country_code': u'US',
              'cross_streets': u'Bartlett Pl & Cooper St',
              'display_address': [u'105 Salem St',
                                  u'North End',
                                  u'Boston, MA 02113'],
              'geo_accuracy': 8.0,
              'neighborhoods': [u'North End'],
              'postal_code': u'02113',
              'state_code': u'MA'},
 'menu_date_updated': 1441950074,
 'menu_provider': u'single_platform',
 'mobile_url': u'http://m.yelp.com/biz/polcaris-coffee-boston?utm_campaign=yelp_api&utm_medium=api_v2_search&utm_source=KA1JpBymy1oww4cX0G_WLg',
 'name': u"Polcari's Coffee",
 'phone': u'6172270786',
 'rating': 5.0,
 'rating_img_url': u'https://s3-media1.fl.yelpcdn.com/assets/2/www/img/f1def11e4e79/ico/stars/v1/stars_5.png',
 'rating_img_url_large': u'https://s3-media3.fl.yelpcdn.com/assets/2/www/img/22affc4e6c38/ico/stars/v1/stars_large_5.png',
 'rating_img_url_small': u'https://s3-media1.fl.yelpcdn.com/assets/2/www/img/c7623205d5cd/ico/stars/v1/stars_small_5.png',
 'reservation_url': None,
 'review_count': 138,
 'reviews': None,
 'snippet_image_url': u'http://s3-media1.fl.yelpcdn.com/photo/RqVPcDf5owMMaCZ5_9Vb-Q/ms.jpg',
 'snippet_text': u'An absolutely phenomenal store. As an avid cook and tea-lover, I had been searching for a place to buy spices and teas. After reading all of the wonderful...',
 'url': u'http://www.yelp.com/biz/polcaris-coffee-boston?utm_campaign=yelp_api&utm_medium=api_v2_search&utm_source=KA1JpBymy1oww4cX0G_WLg'}
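
The same three-line conversion shows up again in the collection loop below, so it could also be wrapped in a small helper (biz_to_dict is just a name introduced here for illustration; the loop below keeps the inline version):

In [ ]:
def biz_to_dict(biz):
    # Convert a Business object and its nested location/coordinate objects into plain dicts
    b = vars(biz)
    b['location'] = vars(biz.location)
    b['location']['coordinate'] = vars(biz.location['coordinate'])
    return b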

Start to crawl all the data we want

Since we can only fetch 20 results per request from Yelp, we need to use the offset parameter to get the rest. Also, the limit of results we can fetch per search is 1000, which is 1000 / 20 = 50 pages, so we loop at most 50 times per zipcode and break out early once a page returns fewer than 20 results.


In [110]:
results = []

In [ ]:
for zipcode in zips:
    for i in range(50):
        params['offset'] = i * 20  # page i starts at result i * 20 (offset is 0-based)
        response = client.search(zipcode, **params)
        bizs = response.businesses
        for biz in bizs:
            b = vars(biz)
            b['location'] = vars(biz.location)
            b['location']['coordinate'] = vars(biz.location['coordinate'])
            results.append(b)
        # break  # Uncomment this line to stop after the first page of each zipcode while testing.

        if len(bizs) < 20:
            # Fewer than 20 results means this was the last page for this zipcode
            break
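
Before saving, it can be handy to check how many businesses we actually collected (just a quick check; the count depends on when you run the crawl):

In [ ]:
print 'Collected {} businesses'.format(len(results))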

Save the data to a file

Finally, we save the data to a JSON file.


In [113]:
with open('my_boston_restaurants_yelp.json', 'wb') as f:
    results_json = json.dumps(results, indent=4, skipkeys=True, sort_keys=True)
    f.write(results_json)
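
To double-check the file, we can read it back (a minimal sketch; the printed count depends on what the crawl collected):

In [ ]:
with open('my_boston_restaurants_yelp.json') as f:
    saved = json.load(f)
print '{} businesses saved'.format(len(saved))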

We're done! Let's grab some snacks and play with the data!