The following script extracts the more helpful reviews from the Swiss reviews and saves them locally. It also saves a list of the ASIN identifiers of the extracted reviews.

The list of ASIN identifiers will later be used to find the average review rating for the respective products.


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import yaml

Load the Swiss reviews


In [91]:
with open("data/swiss-reviews.txt", 'r') as fp:
    swiss_rev = fp.readlines()

In [92]:
len(swiss_rev)


Out[92]:
304739

In [93]:
swiss_rev[2]


Out[93]:
'{"reviewerID": "A0009686KROLKEH2EHF4", "asin": "B000I7GST2", "reviewerName": "shahrizal abdullah", "helpful": [1, 1], "reviewText": "i like it so much, i hope Kuhn Rikon is available in my place in the future keep it up the good quality and company image thank you..", "overall": 5.0, "summary": "Very stylish and practical to use", "unixReviewTime": 1372723200, "reviewTime": "07 2, 2013"}\n'

The filter_helpful function keeps only the reviews that received at least 5 votes in the helpfulness field (the second element of helpful). This yields a subset of around 23,000 reviews. A smaller subset of around 10,000 reviews was also obtained by keeping only reviews with at least 10 votes. The smaller subset contains better-quality reviews, at the cost of reduced size.

1) Extract the helpful reviews


In [94]:
def filter_helpful(line):
    l = line.rstrip('\n')
    # safe_load: bare yaml.load requires an explicit Loader in recent PyYAML
    l = yaml.safe_load(l)
    if 'helpful' in l:
        # Keep the review if the second element of 'helpful' is at least 5
        return l['helpful'][1] >= 5
    print("Review does not have helpful score key: " + line)
    return False
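
The 10-vote subset mentioned above can be produced with the same logic by making the threshold a parameter. The following is a minimal sketch rather than the code used for this notebook: `is_helpful` and `min_votes` are hypothetical names, the sample review is shortened, parsing with `json` assumes every line is valid JSON like the sample shown earlier, and the second element of `helpful` is assumed to be the total number of votes.

```python
import json

def is_helpful(line, min_votes=5):
    """Return True if the review received at least `min_votes` helpfulness votes."""
    review = json.loads(line)
    votes = review.get("helpful")
    # Assumption: helpful == [helpful_votes, total_votes]
    if not votes or len(votes) < 2:
        return False
    return votes[1] >= min_votes

# Shortened, illustrative review line
sample = '{"asin": "B000I7GST2", "helpful": [4, 7], "overall": 5.0}\n'
print(is_helpful(sample))                # 7 votes against a threshold of 5
print(is_helpful(sample, min_votes=10))  # 7 votes against a threshold of 10
```

Calling `is_helpful(line, min_votes=10)` inside the loop would then yield the smaller subset without duplicating the function.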

Apply filter_helpful to each Swiss product review


In [95]:
def get_helpful(data):
    res = []
    # i counts the lines scanned so far, starting at 1
    for i, line in enumerate(data, start=1):
        if filter_helpful(line):
            res.append(line)
            # Progress report: helpful reviews kept / lines scanned
            if len(res) % 1000 == 0:
                print("Count " + str(len(res)) + " / " + str(i))
    return res

In [96]:
swiss_reviews_helpful = get_helpful(swiss_rev)


Count 1000 / 13319
Count 2000 / 25440
Count 3000 / 38733
Count 4000 / 50934
Count 5000 / 63854
Count 6000 / 77771
Count 7000 / 90390
Count 8000 / 103006
Count 9000 / 116094
Count 10000 / 129210
Count 11000 / 141829
Count 12000 / 154550
Count 13000 / 167209
Count 14000 / 179988
Count 15000 / 192203
Count 16000 / 204764
Count 17000 / 218146
Count 18000 / 231084
Count 19000 / 243987
Count 20000 / 256821
Count 21000 / 268986
Count 22000 / 281442
Count 23000 / 295562

In [97]:
len(swiss_reviews_helpful)


Out[97]:
23755

Save the subset of helpful Swiss product reviews


In [99]:
# The with-block closes the file even if a write fails
with open('data/swiss-reviews-helpful-correct-bigger.txt', 'w') as write_file:
    for item in swiss_reviews_helpful:
        write_file.write(item)

2) Extract the ASINs of the products that the helpful reviews correspond to


In [2]:
with open('data/swiss-reviews-helpful-correct-bigger.txt', 'r') as fp:
    swiss_reviews_helpful = fp.readlines()

The following function extracts the 'asin' from each helpful review. Repeated ASINs are of no consequence, as the list is only meant to be used for checking.


In [3]:
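def filter_asin(line):
    l = line.rstrip('\n')
    # safe_load: bare yaml.load requires an explicit Loader in recent PyYAML
    l = yaml.safe_load(l)
    # Return the asin, or an empty string if the key is missing
    return l.get('asin', '')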

In [4]:
helpful_asins = []
counter = 1
for item in swiss_reviews_helpful:
    if(counter%500 == 0):
        print(counter)
    counter += 1
    x = filter_asin(item)
    if(len(x) > 0):
        helpful_asins.append(x)


500
1000
1500
2000
2500
3000
3500
4000
4500
5000
5500
6000
6500
7000
7500
8000
8500
9000
9500
10000
10500
11000
11500
12000
12500
13000
13500
14000
14500
15000
15500
16000
16500
17000
17500
18000
18500
19000
19500
20000
20500
21000
21500
22000
22500
23000
23500
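
If a duplicate-free collection is preferred for membership checks, the list can be collapsed into a set. This is a minimal sketch with made-up identifiers; `helpful_asins_example` and `unique_asins` are hypothetical names and are not used elsewhere in this notebook.

```python
# Made-up identifiers standing in for the contents of helpful_asins
helpful_asins_example = ["B000I7GST2", "B00004XSC9", "B000I7GST2"]

# A set drops repetitions and supports fast membership tests
unique_asins = set(helpful_asins_example)

print(len(helpful_asins_example), len(unique_asins))
print("B000I7GST2" in unique_asins)
```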

Save the list of ASINs.


In [104]:
import pickle

with open('data/helpful_asins_bigger.pickle', 'wb') as fp:
    pickle.dump(helpful_asins, fp)
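
As a sketch of how the saved list might be used later, the snippet below averages the 'overall' rating per product for reviews whose 'asin' appears in the list. It is an illustration under stated assumptions, not the follow-up script itself: `average_rating_by_asin` is a hypothetical helper, the reviews are made up, and parsing with `json` assumes each line is valid JSON.

```python
import json
from collections import defaultdict

def average_rating_by_asin(lines, asins):
    """Map each asin in `asins` to the mean 'overall' rating of its reviews."""
    wanted = set(asins)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        review = json.loads(line)
        asin = review.get("asin")
        # Skip reviews for other products or without a rating
        if asin in wanted and "overall" in review:
            totals[asin] += review["overall"]
            counts[asin] += 1
    return {asin: totals[asin] / counts[asin] for asin in counts}

# Made-up reviews with hypothetical identifiers
reviews = [
    '{"asin": "A1", "overall": 5.0}',
    '{"asin": "A1", "overall": 3.0}',
    '{"asin": "B2", "overall": 4.0}',
]
print(average_rating_by_asin(reviews, ["A1"]))
```

In practice the list would first be restored with `pickle.load` from the file written above, and `lines` would come from the full review file.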