There are three reasonable ways to deal with the Yelp dataset, which is about 2GB uncompressed. Each approach has its own advantages and disadvantages:

1. Use distributed computing. This is the way of working with the data that Yelp recommends, and we will use it later in this series to showcase the power of Spark.
2. Store the data in an HDFStore or a SQL database and examine only a sample of it on your laptop. Once you have a good idea of how the data behaves, you can use something like Vowpal Wabbit, or the out-of-core estimators in scikit-learn, to analyze the entire dataset. A rough sketch of this approach follows this list.
3. Rent a machine that holds the whole dataset in memory. Amazon offers EC2 instances with 244GB of RAM that can be rented for as little as 27 cents an hour. See my guide on getting started with EC2.
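As a rough illustration of the second approach, the sketch below streams the review file in batches and updates a scikit-learn model with partial_fit, so the full dataset never has to sit in memory at once. The feature choice (hashing the review text) and the target (the star rating) are assumptions made purely for illustration, iter_reviews is a hypothetical helper, and DATA_DIR is defined further down in the notebook.

In [ ]:
from itertools import islice
import ujson
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDRegressor

def iter_reviews(location, batch_size=10000):
    # The review file is newline-delimited JSON: one review object per line.
    with open(location) as f:
        while True:
            batch = [ujson.loads(line) for line in islice(f, batch_size)]
            if not batch:
                break
            yield batch

# Out-of-core pass over the whole file: hash the review text, regress on the star rating.
vectorizer = HashingVectorizer(n_features=2 ** 18)
model = SGDRegressor()
for batch in iter_reviews(DATA_DIR + 'yelp_academic_dataset_review.json'):
    X = vectorizer.transform(review['text'] for review in batch)
    y = [review['stars'] for review in batch]
    model.partial_fit(X, y)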
In [17]:
import ujson
In [2]:
?json_normalize
In [1]:
from pandas.io.json import json_normalize
In [ ]:
def read_reviews(location):
    # Each line of the file is a separate JSON record; parse them all, then flatten into a DataFrame
    with open(location) as f:
        return json_normalize([ujson.loads(line) for line in f])
In [ ]:
df = read_reviews(DATA_DIR + 'yelp_academic_dataset_review.json')
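Reading the file this way pulls every review into memory at once. For exploration on a laptop (the second approach above) it can be enough to parse only the first chunk of reviews. read_reviews_sample below is a hypothetical variant written for that purpose; the cut-off of 100,000 reviews is arbitrary.

In [ ]:
from itertools import islice

def read_reviews_sample(location, n=100000):
    # Parse only the first n reviews -- enough to get a feel for the data on a laptop.
    with open(location) as f:
        return json_normalize([ujson.loads(line) for line in islice(f, n)])

sample = read_reviews_sample(DATA_DIR + 'yelp_academic_dataset_review.json')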
In [13]:
os.listdir(DATA_DIR)
Out[13]:
In [15]:
df = pd.read_json(DATA_DIR + 'yelp_academic_dataset_review.json')
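If that call chokes, it is because the review file is newline-delimited JSON (one object per line) rather than a single JSON document. Recent versions of pandas (0.19 and later, if I remember right) can read that format directly with lines=True; older versions need something like read_reviews above.

In [ ]:
# newline-delimited JSON; requires a pandas version that supports lines=True
df = pd.read_json(DATA_DIR + 'yelp_academic_dataset_review.json', lines=True)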
In [3]:
%pylab inline
from __future__ import print_function
import rpy2
%load_ext rpy2.ipython
import scipy as sp
import numpy as np
import pandas as pd
pd.options.display.mpl_style = 'default'
import os
import sklearn
In [12]:
# trailing slash matters: file names are appended to DATA_DIR with plain string concatenation
DATA_DIR = r'/home/ubuntu/data/yelp_dataset_challenge_academic_dataset/'