We use the Amazon product data available at http://jmcauley.ucsd.edu/data/amazon/, with permission from the author. The complete dataset contains product reviews and metadata from Amazon, including 143.7 million reviews spanning May 1996 - July 2014. It includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). A sample review record:
{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
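As a minimal sketch of how a record like the one above might be turned into analysis-ready fields, the snippet below derives a helpfulness ratio and the review year. It assumes `helpful` holds `[helpful votes, total votes]` and that `reviewTime` always follows the `MM DD, YYYY` pattern seen in the sample; the `summarize` helper is hypothetical and not part of the dataset tooling.

```python
import json
from datetime import datetime

# Abbreviated copy of the sample record shown above.
sample = ('{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
          '"helpful": [2, 3], "overall": 5.0, '
          '"unixReviewTime": 1252800000, "reviewTime": "09 13, 2009"}')

def summarize(record):
    """Hypothetical helper: pick out a few derived fields from one review."""
    # Assumption: helpful = [helpful votes, total votes].
    helpful_votes, total_votes = record["helpful"]
    ratio = helpful_votes / float(total_votes) if total_votes else None
    # Assumption: reviewTime is formatted like "09 13, 2009".
    review_date = datetime.strptime(record["reviewTime"], "%m %d, %Y").date()
    return {
        "asin": record["asin"],
        "rating": record["overall"],
        "helpful_ratio": ratio,
        "year": review_date.year,
    }

summary = summarize(json.loads(sample))
```

Reviews with no votes get a ratio of `None` rather than zero, so they can be told apart from reviews that were voted unhelpful.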
We use the Yelp dataset available from http://www.yelp.com/dataset_challenge/. Each review record in the dataset has the following form:
{
  'type': 'review',
  'business_id': (the identifier of the reviewed business),
  'user_id': (the identifier of the authoring user),
  'stars': (star rating, integer 1-5),
  'text': (review text),
  'date': (date, formatted like '2011-04-19'),
  'votes': {
    'useful': (count of useful votes),
    'funny': (count of funny votes),
    'cool': (count of cool votes)
  }
}
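Because `votes` is nested, it can be convenient to flatten each review into a single-level dictionary before loading it into a table. The following is only a sketch: the record values are invented for illustration, and `flatten_review` is a hypothetical helper, not something shipped with the Yelp dataset.

```python
import json

# Hypothetical record: the field names follow the schema above,
# but every value here is invented for illustration.
line = ('{"type": "review", "business_id": "b1", "user_id": "u1", '
        '"stars": 4, "text": "Good service.", "date": "2011-04-19", '
        '"votes": {"useful": 2, "funny": 0, "cool": 1}}')

def flatten_review(record):
    """Hypothetical helper: lift the nested vote counts to top-level keys."""
    flat = {key: value for key, value in record.items() if key != "votes"}
    for kind, n in record.get("votes", {}).items():
        flat["votes_" + kind] = n
    return flat

flat = flatten_review(json.loads(line))
```

A list of such flattened dictionaries can be passed directly to `pd.DataFrame` to get one column per field.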
Category: Cell Phones and Accessories
Total Number of Reviews:
In [1]:
!gzip -dc reviews_Cell_Phones_and_Accessories.json.gz | sed -n '$='
In [2]:
import pandas as pd
import ast, gzip
import simplejson as json
In [3]:
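def parse(path):
    # Count the reviews in the gzipped file at `path` that were written in 2013.
    count = 0
    with gzip.open(path, 'r') as f:
        for l in f:
            # Parse the line as a Python literal (one review per line).
            rev = ast.literal_eval(l.strip())
            # reviewTime looks like "09 13, 2009"; the year follows the comma.
            year = int(rev['reviewTime'].split(',')[-1].strip())
            if year == 2013:
                count += 1
    return count

path = '/home/ankesh/masters-thesis/data/reviews_Cell_Phones_and_Accessories.json.gz'
count = parse(path)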
In [4]:
print(count)
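The cell above counts a single hard-coded year. A possible generalization, sketched below, tallies every year in one pass with `collections.Counter`, which also yields the total number of reviews. The function name `reviews_per_year` and the tiny demo file are assumptions made for illustration; the demo records are invented and only mimic the `reviewTime` format of the real data.

```python
import ast
import gzip
import os
import tempfile
from collections import Counter

def reviews_per_year(path):
    """Return a Counter mapping review year -> number of reviews."""
    counts = Counter()
    with gzip.open(path, 'rb') as f:
        for raw in f:
            # Decode explicitly so the same code handles bytes on Python 3.
            rev = ast.literal_eval(raw.decode('utf-8').strip())
            # reviewTime looks like "09 13, 2009"; the year follows the comma.
            year = int(rev['reviewTime'].split(',')[-1])
            counts[year] += 1
    return counts

# Build a small gzipped demo file (invented records, one dict literal per line).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_reviews.json.gz')
rows = [
    {'asin': 'X1', 'reviewTime': '09 13, 2009'},
    {'asin': 'X2', 'reviewTime': '01 2, 2013'},
    {'asin': 'X3', 'reviewTime': '12 30, 2013'},
]
with gzip.open(demo_path, 'wb') as f:
    for row in rows:
        f.write((repr(row) + '\n').encode('utf-8'))

counts = reviews_per_year(demo_path)
total = sum(counts.values())
```

With the real file, `counts[2013]` would correspond to the value computed by `parse`, and `total` to the line count reported in the first cell.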