Initial Import and Cleaning

The Yelp business information ships as one large JSON file, so we first preprocess it into a smaller file that is faster to load and analyze.

  • We remove businesses that are not restaurants, and any permanently closed businesses.
  • We merge the city names Montréal and Montreal into a single unaccented spelling, Montreal.
  • We drop the address, neighborhood, and postal code information, along with the type column, for now. These may be useful later, especially for larger cities.
  • We also drop the latitude and longitude, along with the hours and attributes. We may use these in the future as well.
  • The trimmed data is saved to a new CSV file for analysis.

In [ ]:
import pandas as pd

# load the raw data file (one JSON object per line)
df = pd.read_json('yelp_academic_dataset_business.json', lines=True)

# remove permanently closed
df = df[df['is_open'] == 1]
df = df.drop(['is_open'], axis=1)

# remove non-restaurants (plain substring match on the stringified categories)
df = df[df['categories'].apply(str).str.contains("Restaurants", regex=False)]

# combine Montreal cities
df['city'] = df['city'].replace(u'Montr\xe9al', 'Montreal')

# drop unnecessary columns for now
df = df.drop(['address', 'neighborhood', 'postal_code', 'type'], axis=1)

# strip latitude, longitude, hours, attributes for now too
df = df.drop(['latitude', 'longitude', 'hours', 'attributes'], axis=1)

# save cleaned data
df.to_csv('cleaned.csv', index=False)
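
If the raw file is too large to load comfortably in one call, the same rules could be applied one record at a time. The following is a minimal sketch using only the standard library, not the code used in this notebook. The helper names clean_record and clean_lines and the sample records are hypothetical and made up for illustration. The field names are taken from the cell above.

```python
import json

# Columns dropped in the cell above; names follow the notebook's code.
DROPPED = {"is_open", "address", "neighborhood", "postal_code", "type",
           "latitude", "longitude", "hours", "attributes"}


def clean_record(record):
    """Return a trimmed copy of one business record, or None if it is filtered out."""
    # Skip businesses that are not marked as open.
    if record.get("is_open") != 1:
        return None
    # Skip non-restaurants; str() copes with list or string categories.
    if "Restaurants" not in str(record.get("categories")):
        return None
    cleaned = {k: v for k, v in record.items() if k not in DROPPED}
    # Merge the accented spelling into the unaccented one.
    if cleaned.get("city") == "Montr\u00e9al":
        cleaned["city"] = "Montreal"
    return cleaned


def clean_lines(lines):
    """Yield cleaned records from an iterable of JSON-lines strings."""
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        cleaned = clean_record(json.loads(line))
        if cleaned is not None:
            yield cleaned


# Hypothetical sample records, for illustration only.
sample = [
    json.dumps({"name": "A", "city": "Montr\u00e9al", "is_open": 1,
                "categories": ["Restaurants", "Pizza"], "latitude": 45.5}),
    json.dumps({"name": "B", "city": "Toronto", "is_open": 0,
                "categories": ["Restaurants"]}),
    json.dumps({"name": "C", "city": "Toronto", "is_open": 1,
                "categories": ["Shopping"]}),
]
result = list(clean_lines(sample))
print(result)
```

In practice, clean_lines could be fed an open file handle on the raw JSON file, since iterating over a text file yields one line at a time. Only the first sample record survives here: the second is closed and the third is not a restaurant.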