The following script extracts the more helpful reviews from the Swiss reviews and saves them locally. It also saves a list of the ASIN identifiers of the extracted reviews.

The list of ASIN identifiers will later be used to find the average review rating for the respective products.



In [1]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import yaml

Load the Swiss reviews



In [91]:

    
with open("data/swiss-reviews.txt", 'r') as fp:
    swiss_rev = fp.readlines()



In [92]:

    
len(swiss_rev)









    Out[92]:





304739



In [93]:

    
swiss_rev[2]









    Out[93]:





'{"reviewerID": "A0009686KROLKEH2EHF4", "asin": "B000I7GST2", "reviewerName": "shahrizal abdullah", "helpful": [1, 1], "reviewText": "i like it so much, i hope Kuhn Rikon is available in my place in the future keep it up the good quality and company image thank you..", "overall": 5.0, "summary": "Very stylish and practical to use", "unixReviewTime": 1372723200, "reviewTime": "07 2, 2013"}\n'

The filter_helpful function keeps only the reviews that have at least 5 flags/votes in the helpfulness field (the second entry of 'helpful'). This yields a subset of 23755 reviews. A smaller subset of around 10000 reviews was also obtained by keeping only reviews with at least 10 flags/votes. The smaller subset contains better-quality reviews, at the cost of a reduced size.

1) Extract the helpful reviews



In [94]:

    
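def filter_helpful(line):
    # Each line holds one review. safe_load is used because yaml.load
    # without an explicit Loader fails on recent PyYAML versions.
    l = yaml.safe_load(line.rstrip('\n'))
    if 'helpful' in l:
        # Keep the review only if the second 'helpful' entry is at least 5.
        return l['helpful'][1] >= 5
    print("Review does not have helpful score key: " + line)
    return False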
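
As a side note, each line of the file looks like a plain JSON object (see the sample review above), so the standard-library json module could parse it as well. The sketch below is only an illustrative variant, not the function used in this notebook: has_min_votes and its min_votes parameter are hypothetical names, introduced to show how the thresholds of 5 and 10 votes could share a single function. The sample line is a shortened copy of the review shown earlier.

```python
import json


def has_min_votes(line, min_votes=5):
    """Return True if the review's vote count reaches min_votes.

    Hypothetical helper: assumes `line` is one JSON object and that the
    second entry of "helpful" is the vote count, as in filter_helpful.
    """
    review = json.loads(line)
    helpful = review.get("helpful")
    if helpful is None:
        # Reviews without a helpfulness field are dropped.
        return False
    return helpful[1] >= min_votes


# Shortened version of the sample review printed earlier.
sample = '{"asin": "B000I7GST2", "helpful": [1, 1], "overall": 5.0}\n'
print(has_min_votes(sample))               # a single vote, threshold 5
print(has_min_votes(sample, min_votes=1))  # a single vote, threshold 1
```

Passing min_votes=10 would correspond to the smaller subset described above.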

Apply filter_helpful to each Swiss product review



In [95]:

    
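def get_helpful(data):
    res = []
    # i counts all reviews seen so far; len(res) counts the helpful ones kept.
    for i, line in enumerate(data, start=1):
        if filter_helpful(line):
            res.append(line)
            # Report progress after every 1000 helpful reviews.
            if len(res) % 1000 == 0:
                print("Count " + str(len(res)) + " / " + str(i))
    return res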



In [96]:

    
swiss_reviews_helpful = get_helpful(swiss_rev)









    



Count 1000 / 13319
Count 2000 / 25440
Count 3000 / 38733
Count 4000 / 50934
Count 5000 / 63854
Count 6000 / 77771
Count 7000 / 90390
Count 8000 / 103006
Count 9000 / 116094
Count 10000 / 129210
Count 11000 / 141829
Count 12000 / 154550
Count 13000 / 167209
Count 14000 / 179988
Count 15000 / 192203
Count 16000 / 204764
Count 17000 / 218146
Count 18000 / 231084
Count 19000 / 243987
Count 20000 / 256821
Count 21000 / 268986
Count 22000 / 281442
Count 23000 / 295562



In [97]:

    
len(swiss_reviews_helpful)









    Out[97]:





23755

Save the subset of helpful Swiss product reviews



In [99]:

    
# The context manager closes the file even if writing fails.
with open('data/swiss-reviews-helpful-correct-bigger.txt', 'w') as write_file:
    write_file.writelines(swiss_reviews_helpful)
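
For larger inputs, filtering and saving could also be done in a single pass, without holding every line in memory. The following is a minimal sketch under that assumption; copy_matching_lines is a hypothetical helper that is not part of this notebook, and it accepts any predicate on a line, such as filter_helpful. The demonstration uses temporary files and a toy predicate so that it runs on its own.

```python
import os
import tempfile


def copy_matching_lines(src_path, dst_path, keep):
    """Copy the lines for which keep(line) is true; return how many were written.

    Hypothetical helper, shown only to illustrate a streaming variant.
    """
    written = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:  # reads one line at a time
            if keep(line):
                dst.write(line)
                written += 1
    return written


# Small self-contained demonstration on temporary files.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "in.txt")
dst_path = os.path.join(tmp_dir, "out.txt")
with open(src_path, "w") as fp:
    fp.write("keep me\nskip me\nkeep me too\n")

n_written = copy_matching_lines(src_path, dst_path, lambda s: s.startswith("keep"))
print(n_written)
```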

2) Extract the ASINs of the products that the helpful reviews correspond to



In [2]:

    
with open('data/swiss-reviews-helpful-correct-bigger.txt', 'r') as fp:
    swiss_reviews_helpful = fp.readlines()

The following function simply extracts the 'asin' from each helpful review. Repeated ASINs are of no consequence, as the list is only meant to serve as a check.



In [3]:

    
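def filter_asin(line):
    # safe_load is used because yaml.load without an explicit Loader
    # fails on recent PyYAML versions.
    l = yaml.safe_load(line.rstrip('\n'))
    # Return an empty string when the review has no 'asin' key.
    return l.get('asin', '')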



In [4]:

    
helpful_asins = []
counter = 1
for item in swiss_reviews_helpful:
    if(counter%500 == 0):
        print(counter)
    counter += 1
    x = filter_asin(item)
    if(len(x) > 0):
        helpful_asins.append(x)

Save the list of ASINs.



In [104]:

    
import pickle

with open('data/helpful_asins_bigger.pickle', 'wb') as fp:
    pickle.dump(helpful_asins, fp)
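
When the list is read back later, the repeated ASINs can be collapsed into a set for quick membership checks. This is a minimal sketch, assuming the pickle holds a plain list of strings; the temporary file path and the second sample ASIN are made up for illustration, and only B000I7GST2 comes from the sample review shown earlier.

```python
import os
import pickle
import tempfile

# Illustrative data: the second ASIN is invented, the repeat is deliberate.
sample_asins = ["B000I7GST2", "B000000001", "B000I7GST2"]

# Write to a temporary location instead of the notebook's data/ folder.
path = os.path.join(tempfile.mkdtemp(), "asins.pickle")
with open(path, "wb") as fp:
    pickle.dump(sample_asins, fp)

# Read the list back and drop the repetitions.
with open(path, "rb") as fp:
    loaded = pickle.load(fp)

unique_asins = set(loaded)
print(len(loaded), len(unique_asins))
print("B000I7GST2" in unique_asins)
```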