Yelp Phoenix Dataset Analysis

We are going to analyze the businesses contained in the Yelp Phoenix Dataset to see if we can find relationships between any of the variables and the rating of each business.

Data Preprocessing

The first thing we are going to do is preprocess the data and leave it ready for the analysis. This preprocessing is divided into two steps: Extraction and Transformation.

Extraction

Since we have all the data in the same JSON file, extraction is a rather trivial step that just consists of loading the data file. We can see how it is done in the next three lines of code:


In [ ]:
import json
business_file_path = 'yelp_academic_dataset_business.json'
records = [json.loads(line) for line in open(business_file_path)]
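
The loading step can also be written with a context manager so that the file handle is closed promptly. The following is only a sketch: the helper name load_json_lines is hypothetical (it is not part of the original notebook), and it assumes the file holds one JSON object per line, as the business file does.

```python
import json


def load_json_lines(file_path):
    """Read a file with one JSON object per line into a list of dicts."""
    records = []
    # The context manager closes the file even if parsing fails midway
    with open(file_path) as json_file:
        for line in json_file:
            line = line.strip()
            if line:  # skip blank lines, which json.loads would reject
                records.append(json.loads(line))
    return records
```

Calling load_json_lines(business_file_path) should produce the same list of dictionaries as the list comprehension above.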

Transformation

In order to transform our data and leave it ready for the analysis, we create some auxiliary functions that will help us in this task.


In [2]:
def drop_fields(fields, dictionary_list):
    """
    Removes the specified fields from every dictionary in the list records

    :rtype : void
    :param fields: a list of strings, which contains the fields that are
    going to be removed from every dictionary in the list records
    :param dictionary_list: a list of dictionaries
    """
    for record in dictionary_list:
        for field in fields:
            del (record[field])

def add_transpose_list_column(field, dictionary_list):
    """
    Takes a list of dictionaries and adds to every dictionary a new field
    for each value contained in the specified field among all the
    dictionaries in the field, leaving 1 for the values that are present in
    the dictionary and 0 for the values that are not. It can be seen as
    transposing the dictionary matrix.

    :param field: the field which is going to be transposed
    :param dictionary_list: a list of dictionaries
    :return: the modified list of dictionaries
    """
    values_set = set()
    for dictionary in dictionary_list:
        values_set |= set(dictionary[field])

    for dictionary in dictionary_list:
        for value in values_set:
            if value in dictionary[field]:
                dictionary[value] = 1
            else:
                dictionary[value] = 0

    return dictionary_list

def add_transpose_single_column(field, dictionary_list):
    """
    Takes a list of dictionaries and adds to every dictionary a new field
    for each distinct value that the specified field takes among all the
    dictionaries in the list, leaving 1 for the value that the dictionary
    has in that field and 0 for all the others. Unlike
    add_transpose_list_column, the field holds a single value, not a list.

    :param field: the field which is going to be transposed
    :param dictionary_list: a list of dictionaries
    :return: the modified list of dictionaries
    """

    values_set = set()
    for dictionary in dictionary_list:
        values_set.add(dictionary[field])

    for dictionary in dictionary_list:
        for value in values_set:
            # Compare by equality: with strings, `in` is a substring test
            if value == dictionary[field]:
                dictionary[value] = 1
            else:
                dictionary[value] = 0

    return dictionary_list

def drop_unwanted_fields(dictionary_list):
    """
    Drops fields that are not useful for data analysis in the business
    data set

    :rtype : void
    :param dictionary_list: the list of dictionaries containing the data
    """
    unwanted_fields = [
        'attributes',
        'business_id',
        'categories',
        'city',
        'full_address',
        'hours',
        'name',
        'neighborhoods',
        'open',
        'state',
        'type'
    ]

    drop_fields(unwanted_fields, dictionary_list)
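
To make the effect of add_transpose_list_column more concrete, here is a small self-contained sketch of the same idea applied to made-up records. The helper name one_hot_list_field and the toy data are illustrative assumptions and do not come from the data set.

```python
def one_hot_list_field(field, records):
    """Add a 0/1 column per distinct value found in a list-valued field."""
    # Collect every value that appears in the field across all records
    all_values = set()
    for record in records:
        all_values.update(record[field])

    # Mark each value as present (1) or absent (0) in every record
    for record in records:
        present = set(record[field])
        for value in all_values:
            record[value] = 1 if value in present else 0
    return records


# Made-up businesses, for illustration only
toy_records = [
    {'categories': ['Pizza', 'Italian']},
    {'categories': ['Bars']},
]
one_hot_list_field('categories', toy_records)
```

After the call, the first toy record carries Pizza = 1, Italian = 1 and Bars = 0, while the second one carries Bars = 1 and 0 for the other two.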

Finally, with these auxiliary functions we create another one that is in charge of loading the data and transforming it. Since we are going to perform a linear regression to analyze the data, all of the values must be numeric. But there are cases in which we cannot translate a qualitative value into a quantitative one, as is the case with the city of the business.

In this case we simply transpose the matrix and add each possible city as a column of its own (this is usually called one-hot encoding). Then, if the business belongs to that city, we put a 1 in that cell; if it doesn't, we put a 0.
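
The encoding described here can be sketched on a few made-up records. The helper name one_hot_single_field and the toy cities are assumptions for illustration. Note that for string-valued fields the `in` operator performs a substring test, so the sketch compares by equality to keep 'Sun City' from matching 'Sun City West'.

```python
def one_hot_single_field(field, records):
    """Add a 0/1 column per distinct value of a single-valued field."""
    all_values = {record[field] for record in records}
    for record in records:
        for value in all_values:
            # Equality, not `in`: 'Sun City' in 'Sun City West' is True
            record[value] = 1 if record[field] == value else 0
    return records


# Made-up businesses, for illustration only
toy_records = [
    {'city': 'Sun City'},
    {'city': 'Sun City West'},
    {'city': 'Phoenix'},
]
one_hot_single_field('city', toy_records)
```

Each toy record ends up with exactly one city column set to 1, which is the property the regression relies on.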

As you can see, with the help of our auxiliary functions, extracting and transforming the data is very straightforward.


In [3]:
def load_file(file_path):
    """
    Loads the Yelp Phoenix Academic Data Set file for business data, and
    transforms it so it can be analyzed

    :type file_path: str
    :param file_path: the path for the file that contains the businesses
    data
    :return: a list of dictionaries with the preprocessed data
    """
    records = [json.loads(line) for line in open(file_path)]
    records = add_transpose_list_column('categories', records)
    records = add_transpose_single_column('city', records)
    drop_unwanted_fields(records)

    return records

Our records now have the shape of a numeric matrix in which most of the values are binary, due to the inclusion of a column for each category and each city. It's a very wide matrix with a total of 663 columns. The first record looks like this:


In [4]:
records = load_file(business_file_path)
print(records[0])
len(records[0])


{u'Roofing': 0, u'Truck Rental': 0, u'Drugstores': 0, u'Dry Cleaning & Laundry': 0, u'Buffets': 0, u'Cheese Shops': 0, u'Boating': 0, u'Child Care & Day Care': 0, u'Endodontists': 0, u'Creperies': 0, u'Pretzels': 0, u'Comedy Clubs': 0, u'Architects': 0, u'Juice Bars & Smoothies': 0, u'Pet Boarding/Pet Sitting': 0, u'Grocery': 0, u'Laveen': 0, u'Hair Removal': 0, u'Dog Parks': 0, u'Contractors': 0, u'Stadiums & Arenas': 1, u'Community Service/Non-Profit': 0, u'Bowling': 0, u'Restaurants': 0, u'DJs': 0, u'Playgrounds': 0, u'Hiking': 0, u'Tolleson': 0, u'Fabric Stores': 0, u'Bankruptcy Law': 0, u'Outdoor Gear': 0, u'Delis': 0, u'Anesthesiologists': 0, u'Tanning': 0, u'Halal': 0, u'Cantonese': 0, u'Herbs & Spices': 0, u'Wittmann': 0, u'Tobacco Shops': 0, u'Chicken Wings': 0, u'Laser Eye Surgery/Lasik': 0, u'Diagnostic Services': 0, u'Gold Buyers': 0, u'Antiques': 0, u'Goldfield': 0, u'Food Stands': 0, u'Orthodontists': 0, u'Shopping Centers': 0, u'Mountain Biking': 0, u'Nail Salons': 0, u'Elementary Schools': 0, u'Automotive': 0, u'Interior Design': 0, u'Cosmetic Surgeons': 0, u'Travel Services': 0, u'Lounges': 0, u'Persian/Iranian': 0, u'Shoe Repair': 0, u'Animal Shelters': 0, u'Cajun/Creole': 0, u"Men's Hair Salons": 0, u'Hobby Shops': 0, u'Buckeye': 0, u'Gyms': 0, u'Scandinavian': 0, u'Greek': 0, u'Surprise Crossing': 0, u'Zoos': 0, u'Hardware Stores': 0, u'Art Galleries': 0, u'Black Canyon City': 0, u'Watch Repair': 0, u'Limos': 0, u'Chinese': 0, u'Home Decor': 0, u'Food Trucks': 0, u'Health & Medical': 0, u'Bikes': 0, u'El Mirage': 0, u'Yoga': 0, u'Ophthalmologists': 0, u'Hookah Bars': 0, u'Middle Eastern': 0, u'Tonopah': 0, u'Guadalupe': 0, u'Scottsdale': 0, u'Brazilian': 0, u'Building Supplies': 0, u'Landscape Architects': 0, u'Recreation Centers': 0, u'Nutritionists': 0, u'Divorce & Family Law': 0, u'Handyman': 0, u'Costumes': 0, u'Queen Creek': 0, u'Vocational & Technical School': 0, u'Diners': 0, u'Sports Bars': 0, u'Russian': 0, u'Auto Parts & Supplies': 0, 
u'Seafood Markets': 0, u'Public Relations': 0, u'Vegetarian': 0, u'Piercing': 0, u'Kosher': 0, u'Climbing': 0, u'Pita': 0, u'Painters': 0, u'Couriers & Delivery Services': 0, u'German': 0, u'Real Estate Law': 0, u'Taiwanese': 0, u'longitude': -112.0923293, u'Mass Media': 0, u'Basque': 0, u'Home Cleaning': 0, u'Pilates': 0, u'Hypnosis/Hypnotherapy': 0, u'Vietnamese': 0, u'Personal Injury Law': 0, u'Officiants': 0, u'Smog Check Stations': 0, u'Home Services': 0, u'Eyewear & Opticians': 0, u'Sun City West': 0, u'Pizza': 0, u'Phoenix': 1, u'Marketing': 0, u'Electricians': 0, u'Campgrounds': 0, u'Gelato': 0, u'Office Cleaning': 0, u'Blow Dry/Out Services': 0, u'Tortilla Flat': 0, u'Food Court': 0, u'Furniture Reupholstery': 0, u'Music Venues': 0, u'Archery': 0, u'RV Parks': 0, u'Web Design': 0, u'Seafood': 0, u'Wickenburg': 0, u'British': 0, u'Adult Education': 0, u'Discount Store': 0, u'Clowns': 0, u'Professional Sports Teams': 0, u'Screen Printing': 0, u'Stanfield': 0, u'Bail Bondsmen': 0, u'Boat Charters': 0, u'Japanese': 0, u'Books, Mags, Music & Video': 0, u'Hot Dogs': 0, u'Beer, Wine & Spirits': 0, u'Festivals': 0, u'Podiatrists': 0, u'Sandwiches': 0, u'Shaved Ice': 0, u"Children's Clothing": 0, u'Dance Studios': 0, u'Rio Verde': 0, u'Chiropractors': 0, u'Shades & Blinds': 0, u'Dim Sum': 0, u'Fitness & Instruction': 0, u'Summer Camps': 0, u'Office Equipment': 0, u'Ear Nose & Throat': 0, u'Pediatric Dentists': 0, u'Adult Entertainment': 0, u'latitude': 33.6385727, u"Men's Clothing": 0, u'Music & DVDs': 0, u'Diagnostic Imaging': 0, u'Florence': 0, u'Gold Canyon': 0, u'Nightlife': 0, u'Cocktail Bars': 0, u'Pet Training': 0, u'Dive Bars': 0, u'Formal Wear': 0, u'Mortgage Brokers': 0, u'Baby Gear & Furniture': 0, u'Television Stations': 0, u'Hats': 0, u'Weight Loss Centers': 0, u'Accessories': 0, u'Watches': 0, u'Active Life': 1, u'Disc Golf': 0, u'Psychiatrists': 0, u'Hospitals': 0, u'Window Washing': 0, u'Bubble Tea': 0, u'Maricopa': 0, u'Tonto Basin': 0, u'Beauty & 
Spas': 0, u'Fountain Hills': 0, u'Private Tutors': 0, u'Acupuncture': 0, u'Damage Restoration': 0, u'Hair Salons': 0, u'Car Wash': 0, u'Lawyers': 0, u'Skydiving': 0, u'Coffee & Tea': 0, u'Optometrists': 0, u'Cheesesteaks': 0, u'Lakes': 0, u'Personal Shopping': 0, u'Estate Planning Law': 0, u'Tapas Bars': 0, u'Paintball': 0, u'Szechuan': 0, u'Sun City Anthem': 0, u'Public Transportation': 0, u'Midwives': 0, u'Hearing Aid Providers': 0, u'Avondale': 0, u'Amateur Sports Teams': 0, u'Body Shops': 0, u'Cuban': 0, u'Dog Walkers': 0, u'Dance Schools': 0, u'Auction Houses': 0, u'CSA': 0, u'Museums': 0, u'Burgers': 0, u'Orthopedists': 0, u'Lingerie': 0, u'Private Investigation': 0, u'Solar Installation': 0, u'Employment Agencies': 0, u'Business Law': 0, u'General Litigation': 0, u'Uniforms': 0, u'Pool Halls': 0, u'Recycling Center': 0, u'Rafting/Kayaking': 0, u'Keys & Locksmiths': 0, u'Real Estate': 0, u'Hospice': 0, u'Diving': 0, u'Professional Services': 0, u'Scuba Diving': 0, u'Performing Arts': 0, u'Cupcakes': 0, u'Sports Medicine': 0, u'RV Dealers': 0, u'Swimwear': 0, u'Trainers': 0, u'Indian': 0, u'Golf Equipment': 0, u'Gilbert': 0, u'Barbeque': 0, u'Massage': 0, u'Nurseries & Gardening': 0, u'Transportation': 0, u'Home Theatre Installation': 0, u'Retirement Homes': 0, u'Desert Ridge': 0, u'Tax Services': 0, u'Mesa': 0, u'Afghan': 0, u'Boot Camps': 0, u'Adult': 0, u'Cosmetology Schools': 0, u'Party & Event Planning': 0, u'Auto Detailing': 0, u'Art Schools': 0, u'Printing Services': 0, u'Skate Parks': 0, u'Bookstores': 0, u'Plumbing': 0, u'Charleston': 0, u'Taxis': 0, u'Bartenders': 0, u'Naturopathic/Holistic': 0, u'Cave Creek': 0, u'Auto Repair': 0, u'Laser Tag': 0, u'Peruvian': 0, u'Home Inspectors': 0, u'Waddell': 0, u'Rolfing': 0, u'Martial Arts': 0, u'Television Service Providers': 0, u'Knitting Supplies': 0, u'Aquariums': 0, u'Hot Tub & Pool': 0, u'Soccer': 0, u'Piano Bars': 0, u'Furniture Stores': 0, u'Food': 0, u'Heating & Air Conditioning/HVAC': 0, u'Apache 
Junction': 0, u'Photographers': 0, u'Maternity Wear': 0, u'Tree Services': 0, u'Boat Repair': 0, u'Life Coach': 0, u'Print Media': 0, u'Moroccan': 0, u'Venues & Event Spaces': 0, u'Ethnic Food': 0, u'Jazz & Blues': 0, u'Massage Schools': 0, u'Computers': 0, u'Local Services': 0, u'Tutoring Centers': 0, u'Reflexology': 0, u'Advertising': 0, u'Neurologist': 0, u'Shopping': 0, u'Allergists': 0, u'Pest Control': 0, u'Comfort Food': 0, u'Special Education': 0, u'Chocolatiers & Shops': 0, u'Medical Spas': 0, u'Art Supplies': 0, u'RV Rental': 0, u'Gymnastics': 0, u'Shared Office Spaces': 0, u'Cafeteria': 0, u'Soul Food': 0, u'IT Services & Computer Repair': 0, u'Anthem': 0, u'Gluten-Free': 0, u'Arts & Entertainment': 1, u'Home Window Tinting': 0, u'Morristown': 0, u'Saguaro Lake': 0, u'Sushi Bars': 0, u'Tea Rooms': 0, u'French': 0, u'Sun Lakes': 0, u'Newspapers & Magazines': 0, u'Permanent Makeup': 0, u'Cosmetics & Beauty Supply': 0, u'Jewelry Repair': 0, u'Hair Extensions': 0, u'Session Photography': 0, u'Educational Services': 0, u'Movers': 0, u'Barbers': 0, u'Food Delivery Services': 0, u'Garage Door Services': 0, u'Videographers': 0, u'Sporting Goods': 0, u'Amusement Parks': 0, u'Door Sales/Installation': 0, u'Spray Tanning': 0, u'Specialty Schools': 0, u'Fertility': 0, u'Italian': 0, u'Personal Assistants': 0, u'Mediterranean': 0, u'Department Stores': 0, u'Motorcycle Dealers': 0, u'Self Storage': 0, u'Medical Centers': 0, u'Education': 0, u'Tex-Mex': 0, u'Internet Cafes': 0, u'Personal Chefs': 0, u'Caribbean': 0, u'Dance Clubs': 0, u'American (New)': 0, u'Investing': 0, u'Caterers': 0, u'Airport Shuttles': 0, u'Breakfast & Brunch': 0, u'Radio Stations': 0, u'Oil Change Stations': 0, u'Surprise': 0, u'Airports': 0, u'Do-It-Yourself Food': 0, u'Comic Books': 0, u'Junk Removal & Hauling': 0, u'Gas & Service Stations': 0, u'Indonesian': 0, u'review_count': 29, u'Hawaiian': 0, u'Utilities': 0, u'Argentine': 0, u'Screen Printing/T-Shirt Printing': 0, u'Thai': 0, u'Fish & 
Chips': 0, u'Yuma': 0, u'Skin Care': 0, u'Ahwatukee': 0, u'Burmese': 0, u'Farmers Market': 0, u'African': 0, u'Outlet Stores': 0, u'Cinema': 0, u'Asian Fusion': 0, u'Mattresses': 0, u'Arts & Crafts': 0, u'Tours': 0, u'Fences & Gates': 0, u'Gun/Rifle Ranges': 0, u'Pool Cleaners': 0, u'Makeup Artists': 0, u'Commercial Real Estate': 0, u'Turkish': 0, u'Laser Hair Removal': 0, u'Live/Raw Food': 0, u'Pet Groomers': 0, u'Gay Bars': 0, u'Preschools': 0, u'Home Organization': 0, u'Internet Service Providers': 0, u'Guns & Ammo': 0, u'Southern': 0, u'Sewing & Alterations': 0, u'Registration Services': 0, u'Pediatricians': 0, u'Yelp Events': 0, u'Doctors': 0, u'Flowers': 0, u'Cooking Schools': 0, u'chandler': 0, u'Used, Vintage & Consignment': 0, u'Fort McDowell': 0, u'Physical Therapy': 0, u'Leisure Centers': 0, u'Mobile Phone Repair': 0, u'Post Offices': 0, u'Casinos': 0, u'Champagne Bars': 0, u'Pawn Shops': 0, u'Party Bus Rentals': 0, u'Cards & Stationery': 0, u'Irrigation': 0, u'Candy Stores': 0, u'Vinyl Records': 0, u'Botanical Gardens': 0, u'Tires': 0, u'Latin American': 0, u'Property Management': 0, u'Mongolian': 0, u'Wine Bars': 0, u'Hotels': 0, u'Fondue': 0, u'Shipping Centers': 0, u'Donuts': 0, u'Bagels': 0, u'Tennis': 0, u'Social Clubs': 0, u'Libraries': 0, u'Towing': 0, u'Bridal': 0, u'Vacation Rentals': 0, u'Chandler': 0, u'Youngtown': 0, u'Gila Bend': 0, u'Salad': 0, u'Resorts': 0, u'Massage Therapy': 0, u'Accountants': 0, u'Financial Advising': 0, u'Career Counseling': 0, u'Local Flavor': 0, u'Home Health Care': 0, u'Hotels & Travel': 0, u'Lebanese': 0, u'Swimming Lessons/Schools': 0, u'Tai Chi': 0, u'Luggage': 0, u'Flowers & Gifts': 0, u'Soup': 0, u'Car Rental': 0, u'Hair Stylists': 0, u'Glendale': 0, u'Wedding Planning': 0, u'Appliances': 0, u'Counseling & Mental Health': 0, u'Masonry/Concrete': 0, u'Tempe': 0, u'Apartments': 0, u'American (Traditional)': 0, u'Barre Classes': 0, u'San Tan Valley': 0, u'North Pinal': 0, u'Mobile Phones': 0, u'Motorcycle 
Repair': 0, u'Rugs': 0, u'Notaries': 0, u'Street Vendors': 0, u'Tapas/Small Plates': 0, u'Boxing': 0, u'Palm Springs': 0, u'Wholesale Stores': 0, u'Pets': 0, u'Churches': 0, u'Wigs': 0, u'Higley': 0, u'Skating Rinks': 0, u'Sun City': 0, u'Karaoke': 0, u'Middle Schools & High Schools': 0, u'Party Supplies': 0, u'Musical Instruments & Teachers': 0, u'Filipino': 0, u'Toy Stores': 0, u'Shoe Stores': 0, u'Dermatologists': 0, u'Immigration Law': 0, u'Leather Goods': 0, u'Steakhouses': 0, u'Fashion': 0, u'Thrift Stores': 0, u'Recording & Rehearsal Studios': 0, u'Litchfield Park': 0, u'Public Services & Government': 0, u'Desserts': 0, u'Pheonix': 0, u'Polish': 0, u'Event Planning & Services': 0, u'Obstetricians & Gynecologists': 0, u'Jewelry': 0, u'Home & Garden': 0, u'Pharmacy': 0, u'Fruits & Veggies': 0, u'Matchmakers': 0, u'Sports Wear': 0, u'Carpeting': 0, u'Banks & Credit Unions': 0, u'Goodyear': 0, u'Auto Glass Services': 0, u'Laboratory Testing': 0, u'Beer Hall': 0, u'Dentists': 0, u'Peoria': 0, u'Day Spas': 0, u'Family Practice': 0, u'North Scottsdale': 0, u'Carpet Installation': 0, u'Arcadia': 0, u'Go Karts': 0, u'Bike Rentals': 0, u'Urologists': 0, u'Gift Shops': 0, u'Cafes': 0, u"Women's Clothing": 0, u'Glendale Az': 0, u'Convenience Stores': 0, u'Plus Size Fashion': 0, u'Wineries': 0, u'Lighting Fixtures & Equipment': 0, u'Kitchen & Bath': 0, u'Hindu Temples': 0, u'Wheel & Rim Repair': 0, u'Ice Cream & Frozen Yogurt': 0, u'Eyelash Service': 0, u'Bars': 0, u'Payroll Services': 0, u'Litchfield Park ': 0, u'Periodontists': 0, u'New River': 0, u'Veterinarians': 0, u'Health Markets': 0, u'Event Photography': 0, u'Cosmetic Dentists': 0, u'Appliances & Repair': 0, u'Tattoo': 0, u'Electronics Repair': 0, u'Video/Film Production': 0, u'Landmarks & Historical Buildings': 0, u'Coolidge': 0, u'Phoenix ': 0, u'Carpet Cleaning': 0, u'Boat Dealers': 0, u'Ethiopian': 0, u'Courthouses': 0, u'Race Tracks': 0, u'Verde Valley': 0, u'Real Estate Services': 0, u'Pet Services': 0, 
u'Korean': 0, u'Pakistani': 0, u'Tattoo Removal': 0, u'Magicians': 0, u'Pet Stores': 0, u'Check Cashing/Pay-day Loans': 0, u'Security Systems': 0, u'Brasseries': 0, u'Cultural Center': 0, u'Horse Racing': 1, u'Vegan': 0, u'Pubs': 0, u'Landscaping': 0, u'Mini Golf': 0, u'Framing': 0, u'stars': 4.0, u'Casa Grande': 0, u'Gastropubs': 0, u'Car Stereo Installation': 0, u'Bakeries': 0, u'Hot Air Balloons': 0, u'Religious Organizations': 0, u'General Dentistry': 0, u'Swimming Pools': 0, u'Cardiologists': 0, u'Breweries': 0, u'Funeral Services & Cemeteries': 0, u'Graphic Design': 0, u'Modern European': 0, u'Photography Stores & Services': 0, u'Butcher': 0, u'Party Equipment Rentals': 0, u'Urgent Care': 0, u'Psychics & Astrologers': 0, u'Irish': 0, u'Golf': 0, u'Colleges & Universities': 0, u'Rehabilitation Center': 0, u'Cannabis Clinics': 0, u'Real Estate Agents': 0, u'Horseback Riding': 0, u'Parking': 0, u'Flight Instruction': 0, u'Videos & Video Game Rental': 0, u'Windshield Installation & Repair': 0, u'Insurance': 0, u'Flooring': 0, u'Meat Shops': 0, u'Opera & Ballet': 0, u'Airlines': 0, u'Parks': 0, u'Phoenix Sky Harbor Center': 0, u'Fast Food': 0, u'Electronics': 0, u'Colombian': 0, u'Bike Repair/Maintenance': 0, u'Medical Supplies': 0, u'Sports Clubs': 0, u'Departments of Motor Vehicles': 0, u'Peopria': 0, u'Carefree': 0, u'Gardeners': 0, u'Driving Schools': 0, u'Home Staging': 0, u'Financial Services': 0, u'Bed & Breakfast': 0, u'Florists': 0, u'Oral Surgeons': 0, u'Spanish': 0, u'Paradise Valley': 0, u'Windows Installation': 0, u'Internal Medicine': 0, u'Mexican': 0, u'Car Dealers': 0, u'Ticket Sales': 0, u'Talent Agencies': 0, u'Traditional Chinese Medicine': 0, u'Cambodian': 0, u'Specialty Food': 0, u'Police Departments': 0, u'Arcades': 0}
Out[4]:
663

Data Analysis

Now that we have our data in the way we want it, it's time to analyze it. For this, we are going to start with a simple linear regression using the number of reviews ('review_count') as the independent variable and the rating ('stars') as the dependent variable.

Before starting with the linear regression, we are going to plot the data to see if there seems to be a correlation between the two variables at a glance.


In [5]:
import matplotlib.pyplot as plt

records = load_file(business_file_path)
x = [record['review_count'] for record in records]
y = [record['stars'] for record in records]

plt.scatter(x, y)


Out[5]:
<matplotlib.collections.PathCollection at 0x938aab0>
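
A scatter plot gives only a visual impression, so it can help to complement it with a number. A minimal sketch, assuming NumPy is available, is to compute the Pearson correlation coefficient with np.corrcoef. The values below are illustrative and are not taken from the Yelp data; with the real data, the x and y lists built above would be passed instead.

```python
import numpy as np

# Illustrative values only, not taken from the Yelp data
review_counts = np.array([3, 10, 25, 60, 120], dtype=float)
stars = np.array([3.0, 3.5, 3.5, 4.0, 4.5])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson coefficient between the two variables
pearson_r = np.corrcoef(review_counts, stars)[0, 1]
print('Pearson r: %.3f' % pearson_r)
```

A coefficient close to 0 would suggest that a straight line explains little of the variation in the ratings, even if the plot appears to show a trend.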

Linear Regression

At first sight there seems to be a tendency: the more reviews a business has, the higher its rating. We are going to perform a linear regression in order to find the slope of the line that best describes the data.


In [ ]:
import numpy as np
from sklearn import linear_model

records = load_file(business_file_path)
ratings = np.array([record['stars'] for record in records])
data = np.array([[record['review_count']] for record in records])

# Use the first 80% of the records for training and the rest for testing
num_training_records = int(len(ratings) * 0.8)
training_data = data[:num_training_records]
testing_data = data[num_training_records:]
training_ratings = ratings[:num_training_records]
testing_ratings = ratings[num_training_records:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(training_data, training_ratings)

# The coefficients
slope = regr.coef_[0]
intercept = regr.intercept_
print('Slope: \n', slope)
print('Intercept: \n', intercept)
# The root mean square error on the testing data
testing_errors = regr.predict(testing_data) - testing_ratings
print("RMSE: %.2f" % np.sqrt(np.mean(testing_errors ** 2)))

plt.scatter(testing_data, testing_ratings, color='black')
plt.plot(testing_data, regr.predict(testing_data), color='blue',
        linewidth=3)

plt.xticks(())
plt.yticks(())

We found that with a simple linear regression the root mean square error on the testing data is 0.91.
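
For reference, the root mean square error is the square root of the average squared difference between the predicted and the actual ratings. The helper below is a small sketch of that definition with a made-up example; the name rmse and the sample values are assumptions for illustration.

```python
import numpy as np


def rmse(predictions, targets):
    """Root mean square error between two equally sized sequences."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    errors = predictions - targets
    return np.sqrt(np.mean(errors ** 2))


# Errors are 0, -1 and 2, so the mean squared error is 5/3
example_rmse = rmse([3.0, 4.0, 5.0], [3.0, 5.0, 3.0])
```

Since the ratings range from 1 to 5 stars, an RMSE of 0.91 means that a typical prediction is off by almost one star.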

Multiple Linear Regression

Now we are going to perform multiple linear regression using more variables to see if we can improve the accuracy of the predictions. We will define a function to do this task:


In [7]:
from sklearn.model_selection import KFold

def multiple_lineal_regression():
    records = load_file(business_file_path)
    ratings = np.array([record['stars'] for record in records])
    drop_fields(['stars'], records)
    # Use a fixed key order so every row has its columns in the same position
    fields = sorted(records[0].keys())
    data = np.array(
        [[record[field] for field in fields] for record in records])

    # Fit on the whole data set to measure the training error
    model = linear_model.LinearRegression(fit_intercept=True)
    model.fit(data, ratings)
    e = model.predict(data) - ratings
    rmse_train = np.sqrt(np.dot(e, e) / len(ratings))

    # 10-fold cross-validation: add up the squared errors of the held-out folds
    kf = KFold(n_splits=10)
    err = 0
    for train, test in kf.split(data):
        model.fit(data[train], ratings[train])
        e = model.predict(data[test]) - ratings[test]
        err += np.dot(e, e)

    rmse_10cv = np.sqrt(err / len(data))
    print('RMSE on training: {}'.format(rmse_train))
    print('RMSE on 10-fold CV: {}'.format(rmse_10cv))
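
The cross-validation step in the function above splits the records into 10 folds, trains on 9 of them and tests on the remaining one, so that every record is used for testing exactly once. The following sketch shows how such a partition of indices can be built with plain NumPy. The name k_fold_indices is hypothetical, the sketch assumes at least two folds, and it takes consecutive blocks without shuffling, which to my knowledge is also the default behavior of scikit-learn's KFold.

```python
import numpy as np


def k_fold_indices(num_records, num_folds):
    """Yield (train, test) index arrays; each record is tested exactly once."""
    indices = np.arange(num_records)
    # array_split tolerates num_records not being divisible by num_folds
    folds = np.array_split(indices, num_folds)
    for i in range(num_folds):
        test = folds[i]
        # Everything outside the current fold is used for training
        train = np.concatenate(folds[:i] + folds[i + 1:])
        yield train, test


# Ten records in five folds: each test fold holds two consecutive indices
splits = list(k_fold_indices(10, 5))
```

Because the folds are consecutive blocks, any ordering present in the data file carries over into the folds; shuffling the indices first is a common variation when that is a concern.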

But before we use this function we are going to modify the load_file function to see how the RMSE changes when we add more variables.


In [8]:
def load_file(file_path):
    """
    Loads the Yelp Phoenix Academic Data Set file for business data, and
    transforms it so it can be analyzed

    :type file_path: str
    :param file_path: the path for the file that contains the businesses
    data
    :return: a list of dictionaries with the preprocessed data
    """
    records = [json.loads(line) for line in open(file_path)]
    records = add_transpose_list_column('categories', records)
    records = add_transpose_single_column('city', records)
    drop_unwanted_fields(records)

    return records

As we can see above, we are also adding the categories and cities as matrix columns along with the initial review_count, longitude and latitude fields. The stars field is our dependent variable.

Now we are ready to execute our multiple linear regression function to see how much the accuracy improves when we include several variables.


In [10]:
multiple_lineal_regression()


RMSE on training: 0.789763756989
RMSE on 10-fold CV: 0.829488852978

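We can see that the RMSE has decreased from 0.91 with the simple linear regression to about 0.83 (0.829488852978) with the multiple linear regression under 10-fold cross-validation.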