Sushi Bars and Buffets: Measuring local culture with the Yelp API

Neal Caren - University of North Carolina, Chapel Hill mail web twitter scholar

Sociologists are often interested in how lived experiences vary by social class. This includes topics like variation in the chances of going to prison and life expectancies as well as the kinds of music people like or what type of sports their kids do.

For one project I'm working on, a team of sociologists and pediatricians are looking at variation in the local food environment as a possible explanation for the link between socioeconomic status and youth obesity. In the relatively poor county we studied, images of fast food, especially the "chicken 'n biscuits" provider Bojangles, dominated the roadside.

As a first look at a broader range of food environments, I decided to use the Yelp API to collect data on the mix of restaurants across the country. In this case, I'm not using any of the user provided review data that Yelp is known for, but rather using Yelp as a categorized directory of eating establishments. Since Yelp provides a wide variety of restaurant categories, it is possible to get more information than just the presence of fast food chains, but also look at how prominent buffets, sushi bars, or pizzerias are in different parts of the country.

Most of the APIs of interest to social scientists weren't designed for our use. They are primarily for web or mobile app developers who want to include the content on their pages. So while I might use the MapQuest API to look at how often intra-city trips involve highways, the target audience is business owners trying to help people get to their store. Similarly, scores of researchers have used data from Twitter APIs to study politics, but it was developed so that you could put a custom Twitter widget on your home page.

The good news is that since these services want you to use their data, the APIs are often well documented, especially for languages like Python that are popular in Silicon Valley. The bad news is that APIs don't always make available the parts of the service, like historical data, that are of most interest to researchers. The worst news is research using APIs frequently violates the providers Terms of Service, so it can be an ethical grey zone.

When you sign up as a developer to use an API, you usually agree to only use the API to facilitate other people using the service (e.g. customer's finding their way to your store) and that you won't store the data. API providers usually enforce this through rate limiting, meaning you can only access the service so many times per minute or per day. For example, you can only search status updates 180 times every 15 minutes according to Twitter guidelines. Yelp limits you to 10,000 calls per day. If you go over your limit, you won't be able to access the service for a bit. You will also get in trouble if you redistribute the data, so don't plan on doing that.

I began setting up my program by importing seven libraries. Four of these are part of any standard Python installation. Two others (numpy and requests) are used frequently and included in most Python bundles, like the free anacoda bundle.

The one package that you might need to install is oauth2. The Yelp API requires that requests be authenticated using the OAuth protocol.



In [2]:

    
from __future__ import division
import json
import csv
from collections import Counter

import numpy as np
import requests

import oauth2

Two of the major reasons that web services require API authentication is so that they know who you are and so they can make sure that you don't go over their rate limits. Since you shouldn't be giving your password to random people on the internet, API authentication works a little bit differently. Like many other places, in order to use the Yelp API you have to sign up as developer. After telling them a little bit about what you plan to do--feel free to be honest; they aren't going to deny you access if you put "research on food cultures" as the purpose--you will get a Consumer Key, Consumer Secret, Token, and Token Secret. Copy and paste them somewhere special.

Using the Yelp API goes something like this. First, you tell Yelp who you are and what you want. Assuming you are authorized to have this information, they respond with a URL where you can retrieve the data. The coding for this in practice is a little bit complicated, so there are often single use tools for accessing APIs, like Tweepy for Twitter.

There's no module to install for the Yelp API, but Yelp does provide some sample Python code. I've slightly modified the code below to show a sample search for restaurants near Chapel Hill, NC, sorted by distance. You can find more options in the search documentation. The API's search options include things like location and type of business, and allows you to sort either by distance or popularity.



In [3]:

    
consumer_key    = 'qDBPo9c_szHVrZwxzo-zDw'
consumer_secret = '4we8Jz9rq5J3j15Z5yCUqmgDJjM'
token           = 'jeRrhRey_k-emvC_VFLGrlVHrkR4P3UF'
token_secret    = 'n-7xHNCxxedmAMYZPQtnh1hd7lI'

consumer = oauth2.Consumer(consumer_key, consumer_secret)

category_filter = 'restaurants'
location = 'Chapel Hill, NC'
options =  'category_filter=%s&location=%s&sort=1' % (category_filter, location)
url = 'http://api.yelp.com/v2/search?' + options

oauth_request = oauth2.Request('GET', url, {})
oauth_request.update({'oauth_nonce'      : oauth2.generate_nonce(),
                      'oauth_timestamp'  : oauth2.generate_timestamp(),
                      'oauth_token'       : token,
                      'oauth_consumer_key': consumer_key})
    
token = oauth2.Token(token, token_secret)
oauth_request.sign_request(oauth2.SignatureMethod_HMAC_SHA1(), consumer, token)
signed_url = oauth_request.to_url()

print signed_url









    



http://api.yelp.com/v2/search?sort=1&oauth_body_hash=2jmj7l5rSw0yVb%2FvlWAYkK%2FYBwk%3D&oauth_nonce=76998400&oauth_timestamp=1385485666&oauth_consumer_key=qDBPo9c_szHVrZwxzo-zDw&oauth_signature_method=HMAC-SHA1&category_filter=restaurants&oauth_token=jeRrhRey_k-emvC_VFLGrlVHrkR4P3UF&location=Chapel+Hill%2C+NC&oauth_signature=ZaqLbf4VhvJ71cAxnoyFkKgKKD0%3D

The URL returned expires after a couple of seconds, so don't expect for the above link to work. The results are provided in the JSON file format, so I'm going to use the already imported requests module to download them.



In [26]:

    
resp = requests.get(url=signed_url)
chapel_hill_restaurants = resp.json()

chapel_hill_restaurants now stores a JSON object that can be treated quite similarly to a dictionary. Two of the entries are of interest, 'total' and 'businesses'. Total shows the maximum number of results that Yelp could return for that specific search. At most, Yelp will return 40 results, presumably to stop people from scraping their entire database.



In [37]:

    
print chapel_hill_restaurants['total']

The first page of results only returns a maximum of 20 results, so I would expect that is how items are in 'businesses'. To check what results were returned, I'll print out the rating and name of each of the restaurants, followed by the number of reviews.



In [51]:

    
for business in chapel_hill_restaurants['businesses']:
    print '%s - %s (%s)' % (business['rating'], business['name'], business['review_count'])









    



4.0 - Buns (173)
3.5 - Cosmic Cantina (61)
3.5 - R&R Grill (48)
4.0 - Mei Asian (22)
4.0 - Artisan Pizza Kitchen (36)
4.0 - Sutton's Drug Store (50)
4.0 - Bandido's Mexican Cafe (40)
3.5 - Ye Olde Waffle Shoppe (74)
3.5 - 35 Chinese Restaurant (50)
4.5 - Linda's Bar and Grill (58)
3.5 - Time-Out Restaurant (87)
4.5 - TRU Deli & Wine (15)
4.0 - Carolina Crossroads Restaurant (37)
4.0 - Cholanad (103)
4.5 - Mediterranean Deli (228)
4.0 - Sandwhich (150)
3.5 - 411 West (130)
4.0 - Lantern (162)
4.0 - Elaine's On Franklin (77)
4.0 - Talullas (80)

Our graduate students recently sponsored drinks at TRU, one of the highest rated places. It was good, but I suspect as the number of reviews gets higher, their rating will drop a bit.

I found out that 'ratings' held the Yelp rating by looking at the documentation, but the information can also be found by looking at the keys of a specific entry. In this case, I'll take advantage of the fact that information about the last restaurant is still stored in business.



In [42]:

    
business.keys()









    Out[42]:





[u'is_claimed',
 u'rating',
 u'mobile_url',
 u'rating_img_url',
 u'review_count',
 u'name',
 u'rating_img_url_small',
 u'url',
 u'is_closed',
 u'id',
 u'phone',
 u'snippet_text',
 u'image_url',
 u'categories',
 u'display_phone',
 u'rating_img_url_large',
 u'menu_provider',
 u'menu_date_updated',
 u'snippet_image_url',
 u'location']

For this project, I'm only interested in the kind of food that restaurants serve, which is stored in 'categories'



In [43]:

    
business['categories']









    Out[43]:





[[u'Middle Eastern', u'mideastern'],
 [u'Turkish', u'turkish'],
 [u'Mediterranean', u'mediterranean']]

Somewhat inconveniently, this is a list of lists, but since I only really care about the first item, I can extract that with some list comprehension:



In [47]:

    
categories = [category[0] for category in business['categories']]
categories









    Out[47]:





[u'Middle Eastern', u'Turkish', u'Mediterranean']

In case you were wondering, the 'u'' in front of the restaurant categories, and all the other strings, means that the text is stored as unicode, which contains more characters than the vanilla ASCII set. Storing text as unicode can be quite useful, but can causes some difficulties when functions expect ASCII, like when you use the csv writer to output an à without properly encoding the document.

Since I want to gather restaurant information for lots of different areas, I put a slightly modified version of the code above in a new function that I call get_yelp_businesses.



In [48]:

    
def get_yelp_businesses(location, category_filter = 'restaurants'):
    # from https://github.com/Yelp/yelp-api/tree/master/v2/python
    consumer_key    = 'qDBPo9c_szHVrZwxzo-zDw'
    consumer_secret = '4we8Jz9rq5J3j15Z5yCUqmgDJjM'
    token           = 'jeRrhRey_k-emvC_VFLGrlVHrkR4P3UF'
    token_secret    = 'n-7xHNCxxedmAMYZPQtnh1hd7lI'
    
    consumer = oauth2.Consumer(consumer_key, consumer_secret)
    
    url = 'http://api.yelp.com/v2/search?category_filter=%s&location=%s&sort=1' % (category_filter,location)
    
    oauth_request = oauth2.Request('GET', url, {})
    oauth_request.update({'oauth_nonce': oauth2.generate_nonce(),
                          'oauth_timestamp': oauth2.generate_timestamp(),
                          'oauth_token': token,
                          'oauth_consumer_key': consumer_key})
    
    token = oauth2.Token(token, token_secret)
    
    oauth_request.sign_request(oauth2.SignatureMethod_HMAC_SHA1(), consumer, token)
    
    signed_url = oauth_request.to_url()
    resp = requests.get(url=signed_url)
    return resp.json()

The function expects that the first thing you input will be a location. Taking advantage of both oath2's ability to clean up the text so that it is functional when put in a URL (e.g., escape spaces) and Yelp's savvy ability to parse locations, the value for location can be fairly wide (e.g., "Chapel Hill" or "90210"). You can also add a category of business to search for from the list of acceptable values. If you don't provide a value, category_filter = 'restaurants' provides a default value of 'restaurants'. This function returns the JSON formatted results. Note that this doesn't have any mechanism for handling errors, which will need to happen elsewhere.



In [50]:

    
beverly_hills_restaurants = get_yelp_businesses('90210')
for business in beverly_hills_restaurants['businesses']:
    print '%s - %s (%s)' % (business['rating'], business['name'], business['review_count'])









    



4.0 - Sprinklesmobile (35)
4.0 - The Lounge at L'Ermitage (16)
4.0 - The Polo Lounge (222)
4.0 - The Fountain Coffee Shop (40)
4.0 - Mastro's Penthouse (93)
4.5 - The Larder at Maple Drive (14)
4.5 - Harajuku Crepe (573)
4.0 - Nic's Beverly Hills (445)
4.0 - Il Tramezzino (256)
4.0 - Euro Caffe (52)
4.0 - Sayuri (60)
4.0 - Sfixio (108)
4.5 - Da Pasquale Restaurant (230)
4.5 - DOMA (42)
4.5 - Ferrarini (24)
4.0 - Walter's 2 Go (27)
4.5 - Mastro's Steakhouse (2303)
4.0 - La Dolce Vita (136)
4.0 - Scarpetta (515)
4.0 - Fleming's Prime Steakhouse & Wine Bar (157)

I hate to admit it, but it appears that Beverly Hills has nicer restaurants than Chapel Hill. Also, based on the number of reviews, more popular ones.

Like many sociologists, I have a comma separated text file sitting around of big counties (population 50,000+) sorted by educational attainment. It is possible to read this in as you would a normal text file, but python's csv module is better at parsing the data. While the object it produces is list-like, it isn't iterable, but you can convert it with list:



In [83]:

    
big_counties = csv.reader(open('county_college.csv', 'rU'))
big_counties = list(big_counties)
print big_counties[0]


#remove header row
big_counties = big_counties[1:]

#How many counties?
print len(big_counties)

#first county in the list
print big_counties[0]

#Last county in the list
print big_counties[-1]









    



['fips', 'name', 'state', 'college_degree', 'population_2012', 'education']
973
['06037', 'Los Angeles County', 'CA', '29.2', '1.0e+07', 'Middle']
['05035', 'Crittenden County', 'AR', '13.5', '50021', 'Low']

Since this is a quick and dirty analysis, I tossed the top row, which has the column labels, and just made a mental note of the order of the columns. A more serious analysis would probably use pandas, which has a better ability to import csv files directly for use in statistical analysis. The last column has a grouping variable that I had previously used, where the top quarter of counties in terms of % college graduates are coded as "High", the bottom quarter coded as "Low" and everything else coded as "Middle". The counties were already sorted by population size, but that doesn't really matter.

To get all the restaurant data, I loop through each county and store the Yelp results in a new dictionary. To make sure it's plugging along, I have it print out some information along the way.

Things might go wrong half way through, and I don't want to look up counties twice as that is slow, rude to Yelp, and could lead me to hitting my API limit. One solution is to check to see if you have the results stored locally first, and only call the API when the information is missing. So I make sure to only call the Yelp API when I don't already have the county in in my 'county_restaurants' dictionary.



In [85]:

    
county_restaurants = {}



In [86]:

    
for county in big_counties:
    full_name = '%s, %s' % (county[1],county[2])
    if full_name not in county_restaurants:
        county_restaurants[full_name] = get_yelp_businesses(full_name)
        print county_restaurants[full_name]['total'], full_name









    



3 Los Angeles County, CA
40 Cook County, IL
40 Harris County, TX
7 Maricopa County, AZ
23 San Diego County, CA
40 Orange County, CA
40 Miami, FL
40 Kings County, NY
40 Dallas County, TX
40 Queens County, NY
1 Riverside County, CA
0 San Bernardino County, CA
40 King County, WA
40 Clark County, NV
40 Tarrant County, TX
40 Santa Clara County, CA
40 Broward County, FL
40 Wayne County, MI
40 Bexar County, TX
40 New York County, NY
40 Alameda County, CA
40 Philadelphia County, PA
40 Middlesex County, MA
40 Suffolk County, NY
40 Sacramento County, CA
40 Bronx County, NY
40 Palm Beach County, FL
40 Nassau County, NY
40 Hillsborough County, FL
40 Cuyahoga County, OH
40 Allegheny County, PA
40 Oakland County, MI
40 Orange County, FL
40 Franklin County, OH
40 Hennepin County, MN
40 Fairfax County, VA
40 Travis County, TX
40 Contra Costa County, CA
40 Salt Lake County, UT
40 Montgomery County, MD
40 St. Louis County, MO
0 Pima County, AZ
40 Fulton County, GA
40 Honolulu County, HI
40 Mecklenburg County, NC
40 Westchester County, NY
40 Milwaukee County, WI
40 Wake County, NC
0 Fresno County, CA
40 Shelby County, TN
40 Fairfield County, CT
40 DuPage County, IL
40 Pinellas County, FL
40 Erie County, NY
40 Marion County, IN
40 Bergen County, NJ
40 Hartford County, CT
40 Prince George's County, MD
40 Duval County, FL
40 New Haven County, CT
1 Kern County, CA
40 Macomb County, MI
40 Gwinnett County, GA
40 Ventura County, CA
13 Collin County, TX
40 El Paso County, TX
40 San Francisco County, CA
40 Middlesex County, NJ
40 Baltimore County, MD
15 Pierce County, WA
40 Montgomery County, PA
0 Hidalgo County, TX
40 Worcester County, MA
40 Hamilton County, OH
40 Essex County, NJ
40 Multnomah County, OR
40 Essex County, MA
40 Jefferson County, KY
40 Monroe County, NY
40 Suffolk County, MA
14 Oklahoma County, OK
40 San Mateo County, CA
0 Snohomish County, WA
40 Cobb County, GA
40 Denton County, TX
40 DeKalb County, GA
40 San Joaquin County, CA
40 Lake County, IL
40 Will County, IL
40 Norfolk County, MA
40 Jackson County, MO
40 Bernalillo County, NM
40 Jefferson County, AL
40 Hudson County, NJ
40 Davidson County, TN
40 Lee County, FL
0 El Paso County, CO
40 Denver County, CO
40 District of Columbia, DC
40 Monmouth County, NJ
40 Providence County, RI
40 Fort Bend County, TX
40 Bucks County, PA
40 Baltimore city, MD
16 Polk County, FL
40 Kent County, MI
40 Tulsa County, OK
0 Arapahoe County, CO
40 Ocean County, NJ
40 Delaware County, PA
40 Johnson County, KS
40 Bristol County, MA
40 Anne Arundel County, MD
40 Washington County, OR
40 Brevard County, FL
40 New Castle County, DE
40 Jefferson County, CO
40 Union County, NJ
40 Summit County, OH
40 Utah County, UT
18 Montgomery County, OH
40 Douglas County, NE
40 Lancaster County, PA
40 Kane County, IL
24 Stanislaus County, CA
40 Ramsey County, MN
40 Camden County, NJ
40 Chester County, PA
40 Sedgwick County, KS
40 Dane County, WI
40 Passaic County, NJ
40 Guilford County, NC
40 Plymouth County, MA
40 Morris County, NJ
36 Volusia County, FL
40 Lake County, IN
40 Sonoma County, CA
40 Montgomery County, TX
40 Spokane County, WA
40 Richmond County, NY
40 Pasco County, FL
40 Greenville County, SC
40 Onondaga County, NY
40 Hampden County, MA
0 Adams County, CO
20 Williamson County, TX
6 Tulare County, CA
19 Burlington County, NJ
40 Virginia Beach city, VA
40 East Baton Rouge Parish, LA
40 Polk County, IA
40 Knox County, TN
40 Clark County, WA
25 Lucas County, OH
40 York County, PA
34 Jefferson Parish, LA
40 Santa Barbara County, CA
40 Seminole County, FL
40 Prince William County, VA
0 Washoe County, NV
1 Monterey County, CA
40 Solano County, CA
40 Genesee County, MI
15 Cameron County, TX
40 Mobile County, AL
40 Berks County, PA
11 Ada County, ID
40 Dakota County, MN
40 Hillsborough County, NH
40 Richland County, SC
40 Waukesha County, WI
40 Pulaski County, AR
0 Pinal County, AZ
2 Sarasota County, FL
1 Clackamas County, OR
23 Stark County, OH
40 Orange County, NY
40 Butler County, OH
40 Orleans Parish, LA
40 St. Charles County, MO
40 Mercer County, NJ
40 Charleston County, SC
40 Westmoreland County, PA
6 Placer County, CA
40 Allen County, IN
40 Forsyth County, NC
40 Lehigh County, PA
0 Lane County, OR
40 Washtenaw County, MI
40 Nueces County, TX
40 Hamilton County, TN
40 Madison County, AL
40 Loudoun County, VA
6 Anoka County, MN
40 Marion County, FL
40 Manatee County, FL
0 Collier County, FL
40 Somerset County, NJ
25 Brazoria County, TX
15 Cumberland County, NC
40 Chesterfield County, VA
40 Bell County, TX
36 Luzerne County, PA
5 Marion County, OR
40 St. Louis city, MO
40 Rockland County, NY
0 Davis County, UT
40 Henrico County, VA
1 Larimer County, CO
40 McHenry County, IL
40 Fayette County, KY
22 Albany County, NY
2 Boulder County, CO
13 Lake County, FL
9 Escambia County, FL
40 Lorain County, OH
40 Galveston County, TX
40 Howard County, MD
40 Northampton County, PA
4 Douglas County, CO
40 Rockingham County, NH
23 Dutchess County, NY
40 Lancaster County, NE
40 Winnebago County, IL
40 Gloucester County, NJ
40 Hamilton County, IN
40 Spartanburg County, SC
2 Osceola County, FL
29 Lubbock County, TX
30 Cumberland County, ME
0 St. Lucie County, FL
40 Leon County, FL
0 Horry County, SC
40 Ingham County, MI
40 Erie County, PA
12 Greene County, MO
40 Durham County, NC
40 Chatham County, GA
33 Atlantic County, NJ
1 San Luis Obispo County, CA
40 Rutherford County, TN
40 New London County, CT
40 Lexington County, SC
40 Dauphin County, PA
8 Ottawa County, MI
19 St. Clair County, IL
5 Madison County, IL
40 Santa Cruz County, CA
40 St. Joseph County, IN
40 Clayton County, GA
40 Cleveland County, OK
0 Weld County, CO
0 Merced County, CA
0 Webb County, TX
13 Thurston County, WA
40 Caddo Parish, LA
40 Marin County, CA
40 Kitsap County, WA
40 Kalamazoo County, MI
40 Brown County, WI
1 Jefferson County, TX
40 Alachua County, FL
40 Hinds County, MS
40 Harford County, MD
3 Yakima County, WA
40 Norfolk city, VA
40 Buncombe County, NC
40 Washington County, MN
40 Frederick County, MD
1 St. Tammany Parish, LA
40 McLennan County, TX
40 Cumberland County, PA
40 Weber County, UT
40 Mahoning County, OH
14 York County, SC
40 Oneida County, NY
9 Benton County, AR
1 Montgomery County, AL
40 Lake County, OH
40 Chesapeake city, VA
40 Clay County, MO
40 Lafayette Parish, LA
38 Saratoga County, NY
0 Butte County, CA
40 Cherokee County, GA
40 Arlington County, VA
9 Jefferson County, MO
40 Warren County, OH
40 Barnstable County, MA
40 Linn County, IA
15 Niagara County, NY
40 Smith County, TX
40 Lackawanna County, PA
1 Dona Ana County, NM
40 Yavapai County, AZ
3 Washington County, AR
40 Richmond city, VA
40 New Hanover County, NC
40 Henry County, GA
40 Washington County, PA
40 Union County, NC
40 Gaston County, NC
40 Trumbull County, OH
40 Jackson County, OR
5 Whatcom County, WA
0 Yolo County, CA
17 Sussex County, DE
1 Mohave County, AZ
40 Champaign County, IL
18 Richmond County, GA
40 St. Johns County, FL
26 Shelby County, AL
40 Brazos County, TX
3 St. Louis County, MN
2 Yuma County, AZ
40 Elkhart County, IN
40 Sangamon County, IL
0 Clermont County, OH
29 York County, ME
5 Tuscaloosa County, AL
40 Muscogee County, GA
40 Saginaw County, MI
40 Broome County, NY
17 Racine County, WI
40 Calcasieu Parish, LA
32 Clay County, FL
40 Harrison County, MS
40 Canyon County, ID
40 Williamson County, TN
4 Kanawha County, WV
40 Baldwin County, AL
1 Okaloosa County, FL
27 Berkeley County, SC
40 Anderson County, SC
0 Hawaii County, HI
40 Forsyth County, GA
31 Litchfield County, CT
22 Peoria County, IL
40 Hall County, GA
40 Butler County, PA
40 Cabarrus County, NC
40 Montgomery County, TN
4 Onslow County, NC
40 Livingston County, MI
6 Benton County, WA
5 Ulster County, NY
40 Delaware County, OH
40 Vanderburgh County, IN
40 Newport News city, VA
4 El Dorado County, CA
40 Shawnee County, KS
2 Outagamie County, WI
2 Shasta County, CA
40 Tippecanoe County, IN
24 Imperial County, CA
2 Minnehaha County, SD
40 Johnston County, NC
40 Medina County, OH
40 Hernando County, FL
40 Pitt County, NC
40 McLean County, IL
40 Bay County, FL
40 Beaver County, PA
20 Muskegon County, MI
18 Hays County, TX
32 Scott County, IA
40 Monroe County, PA
3 Winnebago County, WI
29 Boone County, MO
40 Beaufort County, SC
32 Kenosha County, WI
40 Kent County, DE
33 Licking County, OH
40 Carroll County, MD
40 DeSoto County, MS
10 Sumner County, TN
13 Sarpy County, NE
40 Porter County, IN
35 Middlesex County, CT
9 Kent County, RI
36 Greene County, OH
40 Davidson County, NC
8 Aiken County, SC
40 Iredell County, NC
9 Charlotte County, FL
0 Deschutes County, OR
40 Kenton County, KY
40 Portage County, OH
8 Pueblo County, CO
6 St. Clair County, MI
40 Rock County, WI
40 Jackson County, MI
40 Rensselaer County, NY
40 Hampshire County, MA
40 Wyandotte County, KS
1 Santa Rosa County, FL
7 Chittenden County, VT
31 Maui County, HI
9 Cumberland County, NJ
18 Sullivan County, TN
40 Bibb County, GA
0 Cass County, ND
0 Berrien County, MI
40 Ouachita Parish, LA
40 Centre County, PA
40 Schenectady County, NY
40 Catawba County, NC
9 Ellis County, TX
40 Alamance County, NC
5 Penobscot County, ME
40 Johnson County, TX
39 Madera County, CA
40 Yellowstone County, MT
3 Stearns County, MN
40 Tolland County, CT
0 Kings County, CA
40 Franklin County, PA
40 Monroe County, MI
13 Charles County, MD
23 Hendricks County, IN
40 Washington County, MD
2 Martin County, FL
0 Mesa County, CO
16 Fairfield County, OH
40 Rock Island County, IL
40 Sussex County, NJ
40 Lee County, AL
40 Olmsted County, MN
40 Schuylkill County, PA
18 Merrimack County, NH
0 Midland County, TX
0 Santa Fe County, NM
40 Alexandria city, VA
12 Houston County, GA
20 Rankin County, MS
1 Washington County, UT
21 Paulding County, GA
4 Ector County, TX
40 Johnson County, IN
9 Dorchester County, SC
40 Randolph County, NC
1 Kootenai County, ID
23 Cambria County, PA
40 Monroe County, IN
39 Indian River County, FL
40 Jackson County, MS
40 Guadalupe County, TX
40 Citrus County, FL
4 Napa County, CA
40 Rowan County, NC
3 Florence County, SC
40 Orange County, NC
12 Clark County, OH
40 Hampton city, VA
40 Johnson County, IA
0 Coconino County, AZ
30 Tazewell County, IL
40 Fayette County, PA
0 Sandoval County, NM
35 Robeson County, NC
40 Lebanon County, PA
12 Scott County, MN
19 Calhoun County, MI
0 Humboldt County, CA
0 Marathon County, WI
40 Stafford County, VA
40 Douglas County, GA
19 Chautauqua County, NY
8 Taylor County, TX
40 Washington County, WI
1 Rapides Parish, LA
0 Cochise County, AZ
8 Livingston Parish, LA
40 Black Hawk County, IA
21 Columbia County, GA
7 Wichita County, TX
40 Coweta County, GA
40 Madison County, IN
40 Berkshire County, MA
0 San Juan County, NM
40 Wood County, OH
0 Wright County, MN
16 Sebastian County, AR
40 Blair County, PA
40 Hunterdon County, NJ
40 Comanche County, OK
23 Washington County, RI
40 Spotsylvania County, VA
21 Washington County, TN
0 Randall County, TX
40 Wayne County, NC
22 Blount County, TN
40 Strafford County, NH
4 Tangipahoa Parish, LA
40 Boone County, KY
40 Richland County, OH
40 Gregg County, TX
20 Canadian County, OK
0 Potter County, TX
3 Bossier Parish, LA
17 Harnett County, NC
40 Grayson County, TX
0 Kennebec County, ME
11 Oswego County, NY
23 Morgan County, AL
40 Clarke County, GA
7 Jefferson County, NY
40 Parker County, TX
40 Pickens County, SC
40 Wilson County, TN
40 Faulkner County, AR
0 Linn County, OR
0 Skagit County, WA
3 Kendall County, IL
12 Windham County, CT
40 Delaware County, IN
32 Calhoun County, AL
40 Lycoming County, PA
40 Warren County, KY
40 La Crosse County, WI
15 Mercer County, PA
40 Cache County, UT
22 Jasper County, MO
21 Sheboygan County, WI
40 Wayne County, OH
40 Lowndes County, GA
27 Comal County, TX
40 Tom Green County, TX
40 Kankakee County, IL
7 LaSalle County, IL
40 Douglas County, KS
40 Ascension Parish, LA
2 Brunswick County, NC
0 St. Lawrence County, NY
9 Allegan County, MI
17 Clark County, IN
0 Terrebonne Parish, LA
40 Saline County, AR
40 Carroll County, GA
6 LaPorte County, IN
0 Missoula County, MT
40 Macon County, IL
5 St. Mary's County, MD
40 Ontario County, NY
40 Vigo County, IN
40 Henderson County, NC
40 Sumter County, SC
23 Eaton County, MI
40 Warren County, NJ
21 Androscoggin County, ME
40 Fayette County, GA
1 Douglas County, OR
40 Berkeley County, WV
0 Navajo County, AZ
0 Hardin County, KY
8 Bay County, MI
40 Kaufman County, TX
0 Bonneville County, ID
9 Columbiana County, OH
40 Allen County, OH
40 Craven County, NC
13 DeKalb County, IL
40 Etowah County, AL
0 Pennington County, SD
30 Houston County, AL
40 Whitfield County, GA
40 Miami County, OH
40 Walworth County, WI
40 Tompkins County, NY
0 Woodbury County, IA
40 Albemarle County, VA
0 Cowlitz County, WA
3 Fond du Lac County, WI
40 Cecil County, MD
16 Sumter County, FL
40 Newton County, GA
40 Adams County, PA
0 Franklin County, MO
40 Bradley County, TN
1 Eau Claire County, WI
40 Hanover County, VA
24 Bartow County, GA
40 Wicomico County, MD
30 Ashtabula County, OH
22 Cass County, MO
40 Monongalia County, WV
40 Yamhill County, OR
40 Craighead County, AR
40 Putnam County, NY
15 Steuben County, NY
40 Lenawee County, MI
40 Madison County, TN
15 Madison County, MS
26 Flagler County, FL
17 Nevada County, CA
20 Highlands County, FL
40 Daviess County, KY
25 Cleveland County, NC
40 Roanoke city, VA
8 Lafourche Parish, LA
40 Cabell County, WV
31 Garland County, AR
40 Portsmouth city, VA
40 Cape May County, NJ
40 Floyd County, GA
14 Nash County, NC
40 Montgomery County, VA
13 Dubuque County, IA
6 Sutter County, CA
27 Marshall County, AL
40 Dougherty County, GA
2 Laramie County, WY
9 Northumberland County, PA
15 Carver County, MN
16 Geauga County, OH
11 Bowie County, TX
16 Wayne County, NY
0 Pottawattamie County, IA
40 Roanoke County, VA
24 Rockingham County, NC
6 Gallatin County, MT
40 Lauderdale County, AL
40 Sevier County, TN
12 Tuscarawas County, OH
40 Platte County, MO
33 Grant County, WA
18 Flathead County, MT
2 Orangeburg County, SC
4 Story County, IA
10 Campbell County, KY
40 Burke County, NC
5 Moore County, NC
40 Lawrence County, PA
5 Buchanan County, MO
36 Calvert County, MD
2 Sherburne County, MN
40 Victoria County, TX
1 Grafton County, NH
40 Grand Traverse County, MI
40 Chemung County, NY
11 Dodge County, WI
27 Rogers County, OK
40 Indiana County, PA
19 Lapeer County, MI
38 Limestone County, AL
16 Crawford County, PA
40 Angelina County, TX
1 Mendocino County, CA
22 Hunt County, TX
40 Ozaukee County, WI
40 Wood County, WV
7 Benton County, OR
40 Muskingum County, OH
0 Franklin County, WA
40 Rockdale County, GA
0 Burleigh County, ND
2 St. Croix County, WI
17 St. Clair County, AL
40 Suffolk city, VA
40 Lee County, MS
40 Madison County, KY
29 Walton County, GA
10 Jefferson County, WI
7 Midland County, MI
3 Bannock County, ID
27 St. Landry Parish, LA
40 Rockwall County, TX
28 Orange County, TX
2 Josephine County, OR
40 Howard County, IN
40 Newport County, RI
30 Maury County, TN
40 Caldwell County, NC
40 Wilson County, NC
12 Talladega County, AL
1 Cascade County, MT
1 Clinton County, NY
7 Clearfield County, PA
24 Glynn County, GA
40 Vermilion County, IL
17 Manitowoc County, WI
7 Elmore County, AL
39 Cullman County, AL
40 Frederick County, VA
40 Lauderdale County, MS
30 Christian County, MO
2 Cayuga County, NY
30 Cattaraugus County, NY
32 Lincoln County, NC
7 Island County, WA
40 Bartholomew County, IN
29 Henderson County, TX
33 Lancaster County, SC
40 Raleigh County, WV
0 Natrona County, WY
1 Twin Falls County, ID
40 White County, AR
3 Scioto County, OH
40 Payne County, OK
9 Leavenworth County, KS
40 Kosciusko County, IN
2 Ross County, OH
7 Rockingham County, VA
0 Coryell County, TX
40 Lynchburg city, VA
1 Somerset County, PA
40 Cape Girardeau County, MO
17 Forrest County, MS
40 Cheshire County, NH
0 Umatilla County, OR
33 Sullivan County, NY
0 Valencia County, NM
1 Liberty County, TX
14 Erie County, OH
40 Cole County, MO
2 Polk County, OR
17 Clinton County, MI
27 Bullitt County, KY
40 Hancock County, OH
0 Lewis County, WA
2 Riley County, KS
10 Van Buren County, MI
40 Christian County, KY
40 Anderson County, TN
40 Floyd County, IN
0 Wagoner County, OK
40 Monroe County, FL
40 Bastrop County, TX
18 Jefferson County, AR
11 Nassau County, FL
40 Oconee County, SC
0 Wood County, WI
2 Allegany County, MD
3 Iberia Parish, LA
0 Chelan County, WA
0 Augusta County, VA
13 Surry County, NC
2 Putnam County, FL
40 Putnam County, TN
0 Apache County, AZ
0 McKinley County, NM
2 Yuba County, CA
40 Bulloch County, GA
23 Madison County, NY
4 Dallas County, IA
0 Clallam County, WA
21 Grays Harbor County, WA
40 Franklin County, MA
31 DeKalb County, AL
40 Hancock County, IN
0 Aroostook County, ME
33 Pottawatomie County, OK
14 Creek County, OK
40 Isabella County, MI
40 Muskogee County, OK
40 Portage County, WI
40 Barrow County, GA
10 Lonoke County, AR
40 Greenwood County, SC
30 Belmont County, OH
14 Bedford County, VA
1 Morgan County, IN
26 Grant County, IN
31 Wilkes County, NC
30 Shiawassee County, MI
40 Harrison County, WV
40 James City County, VA
27 Greene County, TN
4 Jones County, MS
40 Troup County, GA
40 Kauai County, HI
17 Armstrong County, PA
40 Walker County, TX
21 Jefferson County, OH
40 Wayne County, IN
32 Darlington County, SC
5 Walker County, GA
40 Columbia County, FL
0 Marquette County, MI
0 Carteret County, NC
0 Grand Forks County, ND
40 Harrison County, TX
20 Rutherford County, NC
3 Adams County, IL
24 Robertson County, TN
40 Columbia County, PA
28 Williamson County, IL
40 Fauquier County, VA
10 Lea County, NM
40 Marion County, OH
3 Laurens County, SC
30 Walker County, AL
23 York County, VA
0 Otero County, NM
40 Nacogdoches County, TX
25 Chatham County, NC
34 St. Francois County, MO
0 Klamath County, OR
23 Butler County, KS
40 Chaves County, NM
29 Salem County, NJ
11 San Patricio County, TX
40 McCracken County, KY
25 Warren County, NY
9 Liberty County, GA
2 Blue Earth County, MN
36 Catoosa County, GA
40 Carbon County, PA
0 Lewis and Clark County, MT
28 Rice County, MN
32 Livingston County, NY
0 Ward County, ND
0 Herkimer County, NY
1 Reno County, KS
40 Athens County, OH
29 Pike County, KY
17 Lake County, CA
11 Sampson County, NC
7 Ionia County, MI
40 Spalding County, GA
16 Pulaski County, KY
28 Tehama County, CA
6 Montcalm County, MI
2 Washington County, NY
0 Chippewa County, WI
14 Crow Wing County, MN
14 Pittsylvania County, VA
1 Bradford County, PA
2 Pope County, AR
40 Hamblen County, TN
1 Sauk County, WI
12 Coos County, OR
23 Mercer County, WV
22 Columbia County, NY
2 Kershaw County, SC
0 Lawrence County, OH
6 Crawford County, AR
16 Acadia Parish, LA
5 Otsego County, NY
28 Tipton County, TN
0 Starr County, TX
13 Franklin County, NC
40 Washington County, OH
40 Oldham County, KY
40 Garfield County, OK
40 Rutland County, VT
9 Mason County, WA
3 St. Joseph County, MI
9 Knox County, OH
40 Stanly County, NC
31 Jackson County, GA
26 Sandusky County, OH
5 Warrick County, IN
18 Granville County, NC
29 Wise County, TX
0 Hall County, NE
40 Belknap County, NH
32 Georgetown County, SC
0 Clay County, MN
18 Jackson County, IL
9 Duplin County, NC
33 Genesee County, NY
0 Tooele County, UT
40 Lee County, NC
40 Lowndes County, MS
9 Washington County, VT
28 Laurel County, KY
0 Walla Walla County, WA
8 Huron County, OH
33 Lenoir County, NC
1 Newton County, MO
3 Barry County, MI
24 Boone County, IN
3 Haywood County, NC
1 Vermilion Parish, LA
40 Broomfield County, CO
40 Anderson County, TX
1 Whiteside County, IL
8 Blount County, AL
0 Russell County, AL
5 Lamar County, MS
27 Columbus County, NC
8 Walton County, FL
11 Oxford County, ME
7 Carter County, TN
1 Otter Tail County, MN
27 Cumberland County, TN
2 Garfield County, CO
7 Pike County, PA
1 San Benito County, CA
23 Marion County, WV
8 Hawkins County, TN
5 Columbia County, WI
21 Putnam County, WV
11 Franklin County, VA
6 Pickaway County, OH
32 Windsor County, VT
21 Seneca County, OH
40 Saline County, KS
4 Edgecombe County, NC
36 Gordon County, GA
26 Cherokee County, SC
0 Autauga County, AL
1 Maverick County, TX
3 Pearl River County, MS
40 Washington County, VA
4 Hardin County, TX
4 Campbell County, VA
2 Fulton County, NY
40 Carson City, NV
1 Tuscola County, MI
36 Hancock County, ME
40 Jefferson County, WV
1 Colbert County, AL
0 Eddy County, NM
5 Johnson County, MO
14 Venango County, PA
9 Pender County, NC
23 Rusk County, TX
0 Tuolumne County, CA
0 Halifax County, NC
14 Boone County, IL
0 Vernon Parish, LA
0 St. Mary Parish, LA
40 Coles County, IL
24 Roane County, TN
13 Chisago County, MN
9 Lincoln County, MO
3 Effingham County, GA
40 Pulaski County, MO
25 Coffee County, TN
0 Gila County, AZ
27 Grady County, OK
32 Jackson County, AL
29 Henry County, VA
10 Ashland County, OH
3 Taney County, MO
17 Ogle County, IL
9 St. Martin Parish, LA
2 Union County, OH
0 St. Charles Parish, LA
19 Darke County, OH
5 Van Zandt County, TX
39 McMinn County, TN
5 La Plata County, CO
2 Knox County, IL
15 Cass County, MI
32 Jefferson County, TN
2 Waupaca County, WI
40 Hood County, TX
5 Wayne County, PA
0 Somerset County, ME
4 Eagle County, CO
40 Watauga County, NC
0 Franklin County, NY
40 Washington County, OK
16 Winona County, MN
4 Worcester County, MD
3 Camden County, GA
16 Lyon County, NV
5 Coffee County, AL
0 Elko County, NV
5 Cherokee County, TX
3 Grant County, WI
40 Harrisonburg city, VA
9 Hoke County, NC
11 Tioga County, NY
18 Dale County, AL
31 Dickson County, TN
39 Grundy County, IL
0 Box Elder County, UT
4 Henry County, IL
1 Crittenden County, AR

This takes about 15 minutes to run through. Although it's running, it looks like we aren't getting all the data we wanted. In particular, some very large counties, like Pima, AZ are showing up as having no restaurants. Presumably the area has some nice dining options, but they likely only show up in Yelp when you select "Tucson" as the region, rather than "Pima County". Additionally, as noted in the guide for Developers, the API will only return the top 40 results. Each API call returns at most 20 businesses, and since the get_yelp_businesses function only returns one page, I don't have the full 40. I'm not sure that either of these bias the overall preliminary explanatory findings, so I think it's okay to press forward, although I would want to go back and fix some of the names. As a test of this, I already edited "Miami-Dade County" to "Miami" which you might have noticed in the county list.

Once all the county data is collected, a simple first pass at an analysis would be to count up how prominent different types of restaurants are in each type of county. This is slightly complicated by the fact that the number of counties differs between each of the education groups, the number of restaurants varies by county, and the number of categories varies by restaurant. One way to accurately summarize the data would be to look at the proportion of restaurants in a given county-type that are associated with each category. That is, what percentage of restaurants are described as pizza parlors or sushi bars?

First, I set up a counter to store how many restaurants are in each type of county:



In [122]:

    
restaurant_count = {'High'   : 0,
                    'Low'    : 0,
                    'Middle' : 0}

I filled in the keys and starting values to avoid errors the first time I encounter a county type. I didn't have to preset the value of zero. One other way would be to set up 'restaurant_count' as an empty dictionary, and try to increment the counter every time a restaurant is observed, and if there is an error set the value to 1.



In [128]:

    
restaurant_count = {}

try:
    # if it already has a value, increment the value by one
    restaurant_count['High'] +=1
except:
    # If this is the first time we are seeing it, start at one
    restaurant_count['High'] = 1

This is more flexible--it doesn't rely on the fixed categories of 'High', 'Low and 'Middle'.

Another Pythonic way of doing things is to use a defaultdict. This is similar to a standard dictionary, except that you can set a default value or object type for new keys.



In [2]:

    
from collections import defaultdict

restaurant_count = defaultdict(int)

Now 'restaurant_count' expects integers. By default, an unknown key is assumed to have a value of zero:



In [4]:

    
print restaurant_count['Super Fancy']

This way we don't have to worry about catching errors and have the flexibility if we want to shift to a different categorization scheme. In addition to numbers, default_dicts can be set up with other useful data types, such as list or dictionaries, or we can set up a couple of them inside another dictionary, which is what I wanted to do to keep track of the count of different restaurant categories by county type.



In [ ]:

    
restaraunt_types = {'High'   : defaultdict(int),
                    'Low'    : defaultdict(int),
                    'Middle' : defaultdict(int)}

I fill up these counters by looping through the list of counties and looking up the list of restaurants from the 'county_restaurants' dictionary I created earlier. For each restaurant, I increment the restaurant counter for that county type. I also increment the appropriate counter for each category of food that they are listed as serving.



In [133]:

    
for county in big_counties:
    full_name = '%s, %s' % (county[1],county[2])
    ses = county[-1]
    restaurants = county_restaurants[full_name]['businesses']
    for restaurant in restaurants:        
        #increment the restaurant counter
        restaurant_count[ses] += 1
        
        #Remove duplicate categories and the 'restaurant category'
        categories = restaurant.get('categories',[])
        categories = [c[0] for c in categories if c!= ['Restaurants','restaurants']]

        #increment the ses-category counter
        for category in categories:
            restaraunt_types[ses][category] = restaraunt_types[ses][category] + 1



In [137]:

    
print restaurant_count









    



defaultdict(<type 'int'>, {'High': 4174, 'Middle': 6984, 'Low': 2989})

Because of limitations in the API, I only gathered data on a maximum of 20 restaurants per county, so 'restaurant_count' probably understates the true difference in the number of variation in restaurants across each education level. That said, I have no strong reason to suspect that there is any bias in the types of restaurants listed.

Finally, to print out the results, I listed the top 25 restaurant categories for each county type using the values storied in 'restaurant_count' to compute the denominator.



In [236]:

    
for ses in restaurant_count:    
    print '%s county SES (n=%s)' % (ses, restaurant_count[ses])
    
    #loop over each category, sorted by frequency
    for category in sorted(restaraunt_types[ses], key = restaraunt_types[ses].get, reverse=True )[:25]:
        
        #Compute the % of overall restaurants that are of a type
        frequency  = restaraunt_types[ses][category]
        percent = frequency/restaurant_count[ses] * 100
        
        #Print out the results, including a leading zero so the columns look pretty.
        print '%04.1f%%  %s' % (percent , category)
    
    #Blank line in between each category
    print









    



High county SES (n=4174)
12.8%  Pizza
10.5%  American (New)
09.2%  American (Traditional)
08.4%  Sandwiches
07.7%  Mexican
07.7%  Italian
05.3%  Delis
05.1%  Chinese
04.6%  Burgers
04.3%  Breakfast & Brunch
04.1%  Fast Food
03.9%  Seafood
03.7%  Barbeque
03.2%  Cafes
02.9%  Sushi Bars
02.9%  Japanese
02.7%  Steakhouses
02.6%  Bars
02.2%  Coffee & Tea
02.0%  Thai
01.9%  Bakeries
01.7%  Diners
01.7%  Mediterranean
01.4%  Pubs
01.4%  Asian Fusion

Middle county SES (n=6984)
13.0%  Pizza
11.3%  American (Traditional)
08.8%  Mexican
08.5%  American (New)
07.1%  Sandwiches
06.1%  Italian
05.8%  Chinese
05.7%  Fast Food
04.8%  Burgers
04.5%  Barbeque
03.9%  Breakfast & Brunch
03.4%  Seafood
03.3%  Delis
03.1%  Steakhouses
02.7%  Cafes
02.6%  Japanese
02.5%  Diners
02.3%  Bars
01.9%  Sushi Bars
01.8%  Chicken Wings
01.6%  Thai
01.4%  Coffee & Tea
01.2%  Sports Bars
01.1%  Ice Cream & Frozen Yogurt
01.1%  Buffets

Low county SES (n=2989)
12.0%  Pizza
11.2%  Mexican
10.7%  American (Traditional)
08.4%  Fast Food
07.0%  American (New)
05.8%  Burgers
05.4%  Chinese
05.4%  Sandwiches
05.0%  Barbeque
04.3%  Italian
03.3%  Steakhouses
02.7%  Breakfast & Brunch
02.6%  Cafes
02.6%  Seafood
02.0%  Delis
01.8%  Japanese
01.7%  Bars
01.7%  Buffets
01.6%  Diners
01.6%  Chicken Wings
01.3%  Southern
00.9%  Caterers
00.9%  Sushi Bars
00.9%  Hot Dogs
00.8%  Thai

Everybody likes pizza. Assuming the Yelp data is representative and this method has something approaching plausibility, about one in eight restaurants in the US is a pizza place, and this is relatively constant across different types of counties. After that, there is a fair amount of variation. I need to do more research on what constitutes 'American (New)' cuisine, but it is more popular in high SES counties than in low SES counties, which favor 'American (Traditional)'.

Even more stratified is 'Sushi Bars' which constitute roughly 3% of restaurants in high SES areas, 2% in middle, and 1% in low SES counties. In contrast, approximately 2% of restaurants in low SES counties are 'Buffets', while they aren't even in the top 25 for high SES counties. The exact percent is:



In [192]:

    
print restaraunt_types['High']['Buffets'] / restaurant_count['High'] * 100









    



0.622903689506

Fast food is stratified as well and found approximately twice as often in low SES counties as in high SES counties:



In [197]:

    
high_ff = restaraunt_types['High']['Fast Food'] / restaurant_count['High'] * 100
low_ff  = restaraunt_types['Low']['Fast Food'] / restaurant_count['Low'] * 100
print low_ff/high_ff









    



2.05793171453

This code could be generalized to loop over a couple of popular food categories:



In [219]:

    
category_list = ['Mexican', u'Chinese', 'American (New)', 'American (Traditional)', 
                 'Breakfast & Brunch', 'Delis', 'Pizza', 'Sandwiches', 'Fast Food', 
                 'Burgers', 'Italian', 'Barbeque']

ratio = {}
for category in category_list:
    high_ff = restaraunt_types['High'][category] / restaurant_count['High'] * 100
    low_ff  = restaraunt_types['Low'][category] / restaurant_count['Low'] * 100
    ratio[category] = high_ff/low_ff
    
for category in sorted(ratio, key=ratio.get):
    print '%0.2f  %s' % (ratio[category], category)









    



0.49  Fast Food
0.69  Mexican
0.75  Barbeque
0.81  Burgers
0.87  American (Traditional)
0.94  Chinese
1.07  Pizza
1.50  American (New)
1.56  Sandwiches
1.58  Breakfast & Brunch
1.77  Italian
2.69  Delis

So the low SES eating environment has an overrepresentation of barbeque, Mexican and fast food places, while delis, Italian food, and breakfast & brunch places are more commonly found in high SES counties.

If I were doing this for a peer-reviewed manuscript, I would probably use a sample of census tracts rather than counties in order to get a more realistic picture of local food environments. I would also want to spend some time understanding how restaurants get categorized and what each category means.

For publication, I would also use statistical models appropriate for count variables and treat education as continuous rather than binning it. Python has pretty impressive capabilities for statistical models, so I put together some quick code to model the count of the number of barbeque joints as a function of the proportion of the adult population with a college degree using the statsmodels negative binomial regression function.



In [220]:

    
#Module for doing statistical analysis
import statsmodels.api as sm    

def county_category(businesses, category):
    #count how many businesses in a county are of a certain type
    category_count = 0
    for restaurant in businesses:            
        categories = restaurant.get('categories',[])
        categories = [c[0] for c in categories]
        if category in categories:
            category_count += 1
    return category_count

#dependent variable
Y = [county_category( county_restaurants['%s, %s' % (c[1],c[2])]['businesses'],"Barbeque") 
     for c in big_counties]

#Explantory variable
education_level = [float(c[-3]) for c in big_counties]



In [221]:

    
#negative binomial regression model
mod_nbin = sm.NegativeBinomial(Y, education_level)
res_nbin = mod_nbin.fit(disp=False)
print res_nbin.summary()









    



                     NegativeBinomial Regression Results                      
==============================================================================
Dep. Variable:                      y   No. Observations:                  973
Model:               NegativeBinomial   Df Residuals:                      972
Method:                           MLE   Df Model:                            0
Date:                Tue, 19 Nov 2013   Pseudo R-squ.:               -0.006523
Time:                        15:38:39   Log-Likelihood:                -1057.1
converged:                       True   LL-Null:                       -1050.2
                                        LLR p-value:                     1.000
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -0.0161      0.002     -9.268      0.000        -0.019    -0.013
alpha          0.4821      0.113      4.257      0.000         0.260     0.704
==============================================================================

Not surprisingly, college education (x1) is negatively correlated with the count of barbeque restaurants. This relationship could be spurious, however, as county population size is negatively associated with both high levels of education and high levels of barbeque.



In [222]:

    
population = [float(c[-2]) for c in big_counties]
population = [np.log(p) for p in population]

#combine
X = zip(education_level, population)
print X[:5]
print ''
#rerun regression model
mod_nbin = sm.NegativeBinomial(Y, X)
res_nbin = mod_nbin.fit(disp=False)
print res_nbin.summary()









    



[(29.2, 16.11809565095832), (33.7, 15.464169183551656), (27.9, 15.274125580663791), (29.1, 15.176487111099874), (34.2, 14.978661367769956)]

                     NegativeBinomial Regression Results                      
==============================================================================
Dep. Variable:                      y   No. Observations:                  973
Model:               NegativeBinomial   Df Residuals:                      971
Method:                           MLE   Df Model:                            1
Date:                Tue, 19 Nov 2013   Pseudo R-squ.:              -0.0007531
Time:                        15:40:19   Log-Likelihood:                -1051.0
converged:                       True   LL-Null:                       -1050.2
                                        LLR p-value:                     1.000
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.0011      0.005      0.218      0.827        -0.009     0.011
x2            -0.0404      0.012     -3.501      0.000        -0.063    -0.018
alpha          0.4429      0.110      4.020      0.000         0.227     0.659
==============================================================================

Once population (x2) is in the model, education level (x1) is no longer significant. So the number of barbeque restaurants in town is largely a function of county size, but, because of the relationship between county size and education level, they are disproportionately found in low SES counties.

As before, we can revise the code to loop over more types of restaurants:



In [234]:

    
types = set(restaraunt_types['High'])

coefficients= {}

for type in types:
    Y = [county_category( county_restaurants['%s, %s' % (c[1],c[2])]['businesses'],type) for c in big_counties]
    if np.mean(Y) > .05:
        mod_nbin = sm.NegativeBinomial(Y, X)
        res_nbin = mod_nbin.fit(disp=False)
        if np.absolute(res_nbin.tvalues[0]) > 1.96:
            coefficients[type] = res_nbin.params[0]
        
    
for category in sorted(coefficients, key=coefficients.get, reverse=True):
    print '%.02f %s' % (coefficients[category], category)









    



0.11 Food Trucks
0.08 Vietnamese
0.08 French
0.08 Food Stands
0.07 Wine Bars
0.07 Vegetarian
0.07 Middle Eastern
0.06 Mediterranean
0.06 Grocery
0.06 Indian
0.06 Greek
0.06 Latin American
0.06 Desserts
0.05 Asian Fusion
0.05 Bakeries
0.05 Sushi Bars
0.05 Lounges
0.04 Thai
0.04 Delis
0.04 Coffee & Tea
0.04 Pubs
0.03 Beer, Wine & Spirits
0.03 Japanese
0.03 Hot Dogs
0.03 Salad
0.03 Sports Bars
0.02 American (New)
0.02 Breakfast & Brunch
0.02 Cafes
0.02 Italian
0.02 Sandwiches
0.02 Seafood
0.02 Caterers
0.02 Bars
0.02 Ice Cream & Frozen Yogurt
0.01 Pizza
-0.02 Buffets

This time, I only printed out the coefficient for SES when there was a statistically significant relationship. Controlling for population size, food trucks are the most distinctively high SES food category. This specific relationship may have something to do with the operationalization of SES, as college towns, which often have high education levels but not the highest income levels, are rife with food trucks.

Only one food category was negatively associated with education level, after controlling for population size. I think this is because the outcome variable wasn't about the relative proportion in a county, but the absolute count, and high SES areas are more likely to have restaurants of all types (except buffets).



In [1]:

    
#ignore code below. Imports style sheet for this page.

from IPython.core.display import HTML
def css_styling():
    styles = open("custom.css", "r").read()
    return HTML(styles)
css_styling()









    Out[1]: