Loading the Data

There are three reasonable ways to work with the Yelp dataset, which is about 2 GB uncompressed.

Distributed Solution

This is the way of working with the data that Yelp recommends. It has the following advantages and disadvantages:

Advantages

  • Yelp includes starter code for doing this.
  • Spinning up the instances takes tens of minutes, but the actual processing takes less than 10 minutes.
  • Costs about $1-2.

Disadvantages

  • The starter code is written in traditional MapReduce.
  • Can't use scikit-learn or pandas.

We will use distributed computing later in this series to showcase the power of Spark.

Use a sample of the data for exploration, then use out-of-core methods for the entire dataset

Another option is to store the data in an HDFStore or SQL database and examine only a sample of it on your laptop. Once you have a good idea of how the data behaves, you can use something like Vowpal Wabbit or certain scikit-learn functions to analyze the entire dataset.

Advantages

  • Can do everything on your laptop, so there is no need to pay extra money for a VPS.

Disadvantages

  • Using only out-of-core processes is limiting.
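To make the out-of-core option concrete, here is a minimal sketch of the streaming style scikit-learn supports via `partial_fit`. The function name and the `(texts, labels)` batch format are my own illustration, not from the Yelp starter code or from scikit-learn itself:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless (no vocabulary to fit),
# so it can transform any number of batches independently.
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier()

def train_in_batches(batches, classes=(0, 1)):
    # batches yields (texts, labels) chunks that each fit in memory;
    # partial_fit updates the model without ever holding the full dataset.
    for texts, labels in batches:
        clf.partial_fit(vectorizer.transform(texts), labels,
                        classes=list(classes))
    return clf
```

The trade-off named above still applies: only estimators that implement `partial_fit` (SGD-based linear models, some naive Bayes variants, MiniBatchKMeans) can be used this way.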

Use a single VPS

Amazon offers a VPS with 244 GB of RAM that can be rented for as little as 27 cents an hour. See my guide on how to get set up with EC2.

Loading the Data

Understanding the format of the file


In [17]:
import ujson

In [2]:
?json_normalize

In [1]:
from pandas.io.json import json_normalize

In [ ]:
def read_reviews(location):
    # The file is newline-delimited JSON: one record per line
    with open(location) as f:
        return json_normalize([ujson.loads(line) for line in f])

In [ ]:
df = read_reviews(DATA_DIR + 'yelp_academic_dataset_review.json')

In [13]:
os.listdir(DATA_DIR)


Out[13]:
['yelp_academic_dataset_checkin.json',
 'yelp_academic_dataset_user.json',
 'yelp_academic_dataset_business.json',
 'yelp_academic_dataset_tip.json',
 'Dataset_Challenge_Academic_Dataset_Agreement.pdf',
 'yelp_academic_dataset_review.json',
 'Yelp_Dataset_Challenge_Terms_round_5.pdf']

In [15]:
df = pd.read_json(DATA_DIR + 'yelp_academic_dataset_review.json')


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-cfc40edbbb69> in <module>()
----> 1 df = pd.read_json(DATA_DIR + 'yelp_academic_dataset_review.json')

/home/ubuntu/anaconda3/lib/python3.4/site-packages/pandas/io/json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit)
    197         obj = FrameParser(json, orient, dtype, convert_axes, convert_dates,
    198                           keep_default_dates, numpy, precise_float,
--> 199                           date_unit).parse()
    200 
    201     if typ == 'series' or obj is None:

/home/ubuntu/anaconda3/lib/python3.4/site-packages/pandas/io/json.py in parse(self)
    265 
    266         else:
--> 267             self._parse_no_numpy()
    268 
    269         if self.obj is None:

/home/ubuntu/anaconda3/lib/python3.4/site-packages/pandas/io/json.py in _parse_no_numpy(self)
    482         if orient == "columns":
    483             self.obj = DataFrame(
--> 484                 loads(json, precise_float=self.precise_float), dtype=None)
    485         elif orient == "split":
    486             decoded = dict((str(k), v)

ValueError: Expected object or value
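The `ValueError` occurs because the Yelp files are newline-delimited JSON: each line is a complete, independent JSON object, so the file as a whole is not a single valid JSON document, and this version of `pd.read_json` rejects it. A minimal workaround is to parse the lines individually (the function name here is my own; I use the stdlib `json` so the snippet is self-contained, though the notebook's `ujson` is a faster drop-in):

```python
import json
import pandas as pd

def read_ndjson(path):
    # Parse one JSON record per line, then build a DataFrame from
    # the resulting stream of dicts.
    with open(path) as f:
        return pd.DataFrame(json.loads(line) for line in f)
```

Newer pandas versions handle this format directly with `pd.read_json(path, lines=True)`.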

Setup


In [3]:
%pylab inline
from __future__ import print_function
import rpy2
%load_ext rpy2.ipython
import scipy as sp
import numpy as np
import pandas as pd
pd.options.display.mpl_style = 'default'
import os
import sklearn


Populating the interactive namespace from numpy and matplotlib

In [12]:
DATA_DIR = r'/home/ubuntu/data/yelp_dataset_challenge_academic_dataset'
