For excerises 1-4, use the Yelp business json file. For exercises 5-6, use the Yelp review json file.

Exercise 1: Create a new column that contains only the zipcode.


In [2]:
import pandas as pd
import numpy as np
# read json, lines=True because each line is a single json
business = pd.read_json('Yelp/yelp_academic_dataset_business.json', lines=True)
print(business.shape)
business.head()


(85901, 15)
Out[2]:
attributes business_id categories city full_address hours latitude longitude name neighborhoods open review_count stars state type
0 {'Take-out': True, 'Drive-Thru': False, 'Good ... 5UmKMjUEUNdYWqANhGckJw [Fast Food, Restaurants] Dravosburg 4734 Lebanon Church Rd\nDravosburg, PA 15034 {'Friday': {'close': '21:00', 'open': '11:00'}... 40.354327 -79.900706 Mr Hoagie [] True 7 3.5 PA business
1 {'Happy Hour': True, 'Accepts Credit Cards': T... UsFtqoBl7naz8AVUBZMjQQ [Nightlife] Dravosburg 202 McClure St\nDravosburg, PA 15034 {} 40.350553 -79.886814 Clancy's Pub [] True 5 3.0 PA business
2 {'Good for Kids': True} cE27W9VPgO88Qxe4ol6y_g [Active Life, Mini Golf, Golf] Bethel Park 1530 Hamilton Rd\nBethel Park, PA 15234 {} 40.354115 -80.014660 Cool Springs Golf Center [] False 5 2.5 PA business
3 {'Alcohol': 'full_bar', 'Noise Level': 'averag... mVHrayjG3uZ_RLHkLj-AMg [Bars, American (New), Nightlife, Lounges, Res... Braddock 414 Hawkins Ave\nBraddock, PA 15104 {'Tuesday': {'close': '19:00', 'open': '10:00'... 40.408830 -79.866211 Emil's Lounge [] True 26 4.5 PA business
4 {'Parking': {'garage': False, 'street': False,... mYSpR_SLPgUVymYOvTQd_Q [Active Life, Golf] Braddock 1000 Clubhouse Dr\nBraddock, PA 15104 {'Sunday': {'close': '15:00', 'open': '10:00'}... 40.403405 -79.855782 Grand View Golf Club [] True 3 5.0 PA business

In [9]:
# extract five consecutive digits after a white space at the end of the address
business['zipcode'] = business['full_address'].str.extract('(\s\d{5}$)', expand=True)
business[['zipcode', 'full_address']].tail()


Out[9]:
zipcode full_address
85896 89120 3899 East Sunset Rd\nSte 105\nLas Vegas, NV 89120
85897 85338 525 N Estrella Pkwy\nSte 100\nGoodyear, AZ 85338
85898 85020 9215 North 7th St\nPhoenix, AZ 85020
85899 NaN 16 Waterloo Place\nOld Town\nEdinburgh EH1 3EG
85900 28208 Charlotte Douglas International Airport Termin...

Exercise 2: The table contains a column called 'categories' and each entry in this column is populated by a list. We are interested in those businesses that restaurants. Create a new column 'Restaurant_type' that contains a description of the restaurant based on the other elements of 'categories.

That is, if we have '[Sushi Bars, Japanese, Restaurants]' in categories the 'Restaurant_type will be '{'SushiBars': 1, 'Japanese': 1, 'Mexican': 0, ...}'


In [10]:
# create a set containing all the possible categories
cat = set([])

for row in business['categories']:
    if 'Restaurants' in row:
        cat = cat.union(set(row))

cat = cat.difference(set(['Restaurants']))

In [11]:
# function that creates a dictionary based on the previous categories and sets the values to 1 if the category is presenti in the argument
def cat_dict(row):
    if 'Restaurants' in row:
        d = dict.fromkeys(cat, 0)
        for e in row:
            if e != 'Restaurants':
                d[e] = 1
        return d
    else:
        return np.NaN

In [12]:
business['Restaurant_type'] = business['categories'].apply(cat_dict)

In [13]:
business[['Restaurant_type', 'categories']].head()


Out[13]:
Restaurant_type categories
0 {'Taiwanese': 0, 'Burgers': 0, 'Scottish': 0, ... [Fast Food, Restaurants]
1 NaN [Nightlife]
2 NaN [Active Life, Mini Golf, Golf]
3 {'Taiwanese': 0, 'Burgers': 0, 'Scottish': 0, ... [Bars, American (New), Nightlife, Lounges, Res...
4 NaN [Active Life, Golf]

Exercise 3: Lets clean the 'attributes' column. The entries in this column are dictionaries. We need to do two things:

1) Turn all the True or False in the dictionary to 1s and 0s.

2) There are some entries within dictionaries that are dictionaries themselves, lets turn the whole entry into just one dictionary, for example if we have

'{'Accepts Credit Cards': True, 'Alcohol': 'none','Ambience': {'casual': False,'classy': False}}'

then turn it into

'{'Accepts Credit Cards':1, 'Alcohol_none': 1, 'Ambience_casual': 0, 'Ambience_classy': 0}'.

There might be other entries like {'Price Range': 1} where the values are numerical so we might want to change that into {'Price_Range_1': 1}.

The reason we modify categorical variables like this is that machine learning algorithms cannot interpret textual data like "True" and "False". They need numerical inputs such as 1 and 0.


In [14]:
# if the value is a boolean replaces the spaces in the key with underscores and returns 0-1 instead of True-False
# else replaces the spaces in the key with underscores, adds the value to the key and returns 1
def replace_entry(key, value):
    if type(value) is bool:
        if value:
            return key.replace(' ', '_'), 1
        else:
            return key.replace(' ', '_'), 0
    else:
        return key.replace(' ', '_') + '_' + str(value).replace(' ', '_'), 1

# returns the dictionary after applying the function above, navigates through the dictionary inside the values
def clean_dict(dct):
    
    out_dct = {}

    for key in dct:
        value = dct[key]
        if type(value) is dict:
            for subkey in value:
                new_key, new_value = replace_entry(subkey, value[subkey])
                out_dct[key.replace(' ', '_') + '_' + new_key] = new_value
        else:
            new_key, new_value = replace_entry(key, value)
            out_dct[new_key] = new_value

    return out_dct

business['attributes'] = business['attributes'].apply(clean_dict)

In [15]:
business[['attributes']].head(10)


Out[15]:
attributes
0 {'Take-out': 1, 'Drive-Thru': 0, 'Good_For_des...
1 {'Happy_Hour': 1, 'Accepts_Credit_Cards': 1, '...
2 {'Good_for_Kids': 1}
3 {'Alcohol_full_bar': 1, 'Noise_Level_average':...
4 {'Parking_garage': 0, 'Parking_street': 0, 'Pa...
5 {'Alcohol_full_bar': 1, 'Noise_Level_loud': 1,...
6 {'Accepts_Credit_Cards': 1}
7 {'Good_for_Kids': 1}
8 {}
9 {}

Exercise 4: Create a new column for every day of the week and fill it with the amount of hours the business is open that day.

Your approach should handle businesses that stay open late like bars and nightclubs.


In [16]:
# dct = {'Friday': {'close': '21:00', 'open': '11:00'},
#  'Monday': {'close': '21:00', 'open': '11:00'},
#  'Thursday': {'close': '21:00', 'open': '11:00'},
#  'Tuesday': {'close': '21:00', 'open': '11:00'},
#  'Wednesday': {'close': '21:00', 'open': '11:00'}}

dct = {'Friday': {'close': '02:00', 'open': '11:00'},
 'Monday': {'close': '02:00', 'open': '11:00'},
 'Saturday': {'close': '02:00', 'open': '12:00'},
 'Sunday': {'close': '22:00', 'open': '12:00'},
 'Thursday': {'close': '02:00', 'open': '11:00'},
 'Tuesday': {'close': '02:00', 'open': '11:00'},
 'Wednesday': {'close': '02:00', 'open': '11:00'}}

day_map = {'Friday': '2017-08-04',
 'Monday': '2017-07-31',
 'Saturday': '2017-08-05',
 'Sunday': '2017-08-06',
 'Thursday': '2017-08-03',
 'Tuesday': '2017-08-01',
 'Wednesday': '2017-08-02'}

In [17]:
def daily_hours(dct):
    # initialize output
    hours = dict.fromkeys(range(0, 7), 0)
    for day in dct:
        # set start datetime, end datetime and end of day datetime
        start = pd.to_datetime(day_map[day] + ' ' + dct[day]['open'])
        end = pd.to_datetime(day_map[day] + ' ' + dct[day]['close'])
        end_of_day = pd.to_datetime(day_map[day] + ' 00:00:00') + pd.Timedelta(1, unit='D')
        # if closing hours is in the next day add one to the end and add to output the two days
        if dct[day]['close'] < dct[day]['open']:
            end = end + pd.Timedelta(1, unit='D')
            diff = end_of_day - start
            hours[start.dayofweek] += int(np.floor(diff.seconds / 3600))
            diff = end - end_of_day
            hours[end_of_day.dayofweek] += int(np.floor(diff.seconds / 3600))
        # else add only the current day
        else:
            diff = end - start
            hours[start.dayofweek] += int(np.floor(diff.seconds / 3600))
    return hours[0], hours[1], hours[2], hours[3], hours[4], hours[5], hours[6]

business['Monday'], business['Tuesday'], business['Wednesday'], business['Thursday'], business['Friday'], business['Saturday'], business['Sunday'] = zip(*business['hours'].map(daily_hours))

#daily_hours(dct)

In [8]:
# solution code:
FMT = '%H:%M'

def Time(row, date):
    try:
        closing = pd.datetime.strptime(row['hours'][date]['close'], FMT)
        opening = pd.datetime.strptime(row['hours'][date]['open'], FMT)
        time = closing - opening
        
        if time < pd.Timedelta(days=0):
            time += pd.Timedelta(days=1)
        
    except:
        time = pd.Timedelta(days=0)
    
    return time

business.apply(lambda x: Time(x, 'Monday'), axis=1).head(10)


Out[8]:
0   10:00:00
1   00:00:00
2   00:00:00
3   00:00:00
4   00:00:00
5   15:00:00
6   10:30:00
7   00:00:00
8   00:00:00
9   00:00:00
dtype: timedelta64[ns]

In [18]:
business[['hours', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']].head(10)


Out[18]:
hours Monday Tuesday Wednesday Thursday Friday Saturday Sunday
0 {'Friday': {'close': '21:00', 'open': '11:00'}... 10 10 10 10 10 0 0
1 {} 0 0 0 0 0 0 0
2 {} 0 0 0 0 0 0 0
3 {'Tuesday': {'close': '19:00', 'open': '10:00'... 0 9 9 9 10 6 0
4 {'Sunday': {'close': '15:00', 'open': '10:00'}... 0 0 9 9 9 9 5
5 {'Monday': {'close': '02:00', 'open': '11:00'}... 13 15 15 15 15 14 12
6 {'Monday': {'close': '18:00', 'open': '07:30'}... 10 10 10 10 10 8 0
7 {} 0 0 0 0 0 0 0
8 {} 0 0 0 0 0 0 0
9 {} 0 0 0 0 0 0 0

Exercise 5: Create a table with the average review for a business.

You will need to pull in a new json file and merge DataFrames for the next 2 exercises.


In [19]:
review = pd.read_json('Yelp/yelp_academic_dataset_review.json', lines=True)
print(review.shape)
review.head()


(2685066, 8)
Out[19]:
business_id date review_id stars text type user_id votes
0 5UmKMjUEUNdYWqANhGckJw 2012-08-01 Ya85v4eqdd6k9Od8HbQjyA 4 Mr Hoagie is an institution. Walking in, it do... review PUFPaY9KxDAcGqfsorJp3Q {'funny': 0, 'useful': 0, 'cool': 0}
1 5UmKMjUEUNdYWqANhGckJw 2014-02-13 KPvLNJ21_4wbYNctrOwWdQ 5 Excellent food. Superb customer service. I mis... review Iu6AxdBYGR4A0wspR9BYHA {'funny': 0, 'useful': 0, 'cool': 0}
2 5UmKMjUEUNdYWqANhGckJw 2015-10-31 fFSoGV46Yxuwbr3fHNuZig 5 Yes this place is a little out dated and not o... review auESFwWvW42h6alXgFxAXQ {'funny': 1, 'useful': 1, 'cool': 0}
3 5UmKMjUEUNdYWqANhGckJw 2015-12-26 pVMIt0a_QsKtuDfWVfSk2A 3 PROS: Italian hoagie was delicious. Friendly ... review qiczib2fO_1VBG8IoCGvVg {'funny': 0, 'useful': 0, 'cool': 0}
4 5UmKMjUEUNdYWqANhGckJw 2016-04-08 AEyiQ_Y44isJmNbMTyoMKQ 2 First the only reason this place could possibl... review qEE5EvV-f-s7yHC0Z4ydJQ {'funny': 0, 'useful': 1, 'cool': 0}

In [20]:
average_review = review[['business_id', 'stars']].groupby('business_id').mean()

In [21]:
average_review.head()


Out[21]:
stars
business_id
--0ZoBTQWQra1FxD4rBWmg 2.000000
--1emggGHgoG6ipd_RMb-g 3.333333
--4Pe8BZ6gj57VFL5mUE8g 2.750000
--5jkZ3-nUPZxUvtcbr8Uw 4.580357
--7PRjnsjMA6uhPK8mW13Q 2.666667

Exercise 6: Create a new table that only contains restaurants with the following schema:

Business_Name | Restaurant_type | Friday hours | Saturday hours | Attributes | Zipcode | Average Rating


In [27]:
df = pd.merge(business[['business_id', 'name' , 'Restaurant_type', 'Friday', 'Saturday', 'attributes', 'zipcode']], average_review, left_on='business_id', right_index=True, how='inner')
df.drop('business_id', axis=1, inplace=True)
names = ['Business_Name', 'Restaurant_type', 'Friday hours', 'Saturday hours', 'Attributes', 'Zipcode', 'Average Rating']
df.columns = names
#remove restaurants
df = df[df['Restaurant_type'] == df['Restaurant_type']]
df.head()


Out[27]:
Business_Name Restaurant_type Friday hours Saturday hours Attributes Zipcode Average Rating
0 Mr Hoagie {'Taiwanese': 0, 'Burgers': 0, 'Scottish': 0, ... 10 0 {'Take-out': 1, 'Drive-Thru': 0, 'Good_For_des... 15034 3.428571
3 Emil's Lounge {'Taiwanese': 0, 'Burgers': 0, 'Scottish': 0, ... 10 6 {'Alcohol_full_bar': 1, 'Noise_Level_average':... 15104 4.680000
5 Alexion's Bar & Grill {'Taiwanese': 0, 'Burgers': 0, 'Scottish': 0, ... 15 14 {'Alcohol_full_bar': 1, 'Noise_Level_loud': 1,... 15106 3.894737
12 Kings Family Restaurant {'Taiwanese': 0, 'Burgers': 1, 'Scottish': 0, ... 18 18 {'Take-out': 1, 'Drive-Thru': 0, 'Good_For_des... 15106 3.250000
17 Rocky's Lounge {'Taiwanese': 0, 'Burgers': 0, 'Scottish': 0, ... 12 12 {'Alcohol_full_bar': 1, 'Noise_Level_average':... 15106 3.800000