The following script extracts the more helpful reviews from the Swiss reviews and saves them locally. It also saves a list of the ASIN identifiers of the extracted reviews.
This list of ASIN identifiers is used later to find the average review rating for the respective products.
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import yaml
Load the Swiss reviews
In [91]:
with open("data/swiss-reviews.txt", 'r') as fp:
    swiss_rev = fp.readlines()
In [92]:
len(swiss_rev)
Out[92]:
In [93]:
swiss_rev[2]
Out[93]:
The filter_helpful function keeps only the reviews that have at least 5 flags/votes in the helpfulness field (the second entry of the helpful field). This yields a subset of around 23,000 reviews. A smaller subset of around 10,000 reviews was also obtained by keeping only reviews with at least 10 flags/votes. The smaller subset contains better-quality reviews, at the cost of a reduced size.
In [94]:
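def filter_helpful(line):
    """Return True if the review has at least 5 votes in its 'helpful' field."""
    l = line.rstrip('\n')
    # safe_load only builds plain Python objects; calling yaml.load(l)
    # without a Loader argument fails on PyYAML 6 and later.
    l = yaml.safe_load(l)
    if 'helpful' in l:
        return l['helpful'][1] >= 5
    print("Review does not have helpful score key: " + line)
    return False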
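The threshold of 5 is hard-coded in the function above, while the text also mentions a subset built with a threshold of 10. A minimal sketch of a parametrized check follows. It assumes the review line has already been parsed into a dict, and the names has_enough_votes and min_votes are hypothetical, not part of the original notebook.

```python
def has_enough_votes(review, min_votes=5):
    """Return True if the second entry of review['helpful'] is at least min_votes."""
    helpful = review.get('helpful')
    if helpful is None or len(helpful) < 2:
        # Missing or malformed helpfulness field: treat the review as not helpful.
        return False
    return helpful[1] >= min_votes


# Illustrative record only, not taken from the dataset.
sample = {'asin': 'X1', 'helpful': [4, 7]}
print(has_enough_votes(sample))                # True
print(has_enough_votes(sample, min_votes=10))  # False
```

With such a parameter, the bigger and the smaller subset could be produced by the same function instead of editing the constant by hand.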
Apply filter_helpful to each Swiss product review
In [95]:
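def get_helpful(data):
    res = []
    counter = 1  # number of helpful reviews found so far, plus one
    i = 0        # number of reviews inspected so far
    for line in data:
        i += 1
        if(filter_helpful(line)):
            # Progress report every 1000 helpful reviews
            if(counter % 1000 == 0):
                print("Count "+str(counter)+" / "+str(i))
            counter += 1
            res.append(line)
    return res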
In [96]:
swiss_reviews_helpful = get_helpful(swiss_rev)
In [97]:
len(swiss_reviews_helpful)
Out[97]:
Save the subset of helpful Swiss product reviews
In [99]:
# The with-block closes the file even if a write fails
with open('data/swiss-reviews-helpful-correct-bigger.txt', 'w') as write_file:
    for item in swiss_reviews_helpful:
        write_file.write(item)
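The cells above read the whole input file into memory, filter it, and then write the result. For larger files, the same load, filter, and save steps could be streamed line by line. The sketch below is one possible way to do that; the function name filter_file and its keep parameter are hypothetical, and the demo uses a throwaway temporary file rather than the real data.

```python
import os
import tempfile


def filter_file(src_path, dst_path, keep):
    """Copy the lines for which keep(line) is true; return how many were kept."""
    kept = 0
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        for line in src:  # iterates lazily, one line at a time
            if keep(line):
                dst.write(line)
                kept += 1
    return kept


# Demo on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.txt')
    dst = os.path.join(tmp, 'out.txt')
    with open(src, 'w') as fp:
        fp.write('keep a\nskip b\nkeep c\n')
    n_kept = filter_file(src, dst, lambda line: line.startswith('keep'))
    with open(dst, 'r') as fp:
        demo_output = fp.read()

print(n_kept)  # 2
```

In the notebook, filter_helpful could be passed as the keep argument, assuming it is defined as in the cell above.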
In [2]:
with open('data/swiss-reviews-helpful-correct-bigger.txt', 'r') as fp:
    swiss_reviews_helpful = fp.readlines()
The following function extracts the 'asin' from each helpful review. Repeated ASINs are of no consequence, as the list is only meant to be used for lookups.
In [3]:
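def filter_asin(line):
    """Return the review's 'asin', or an empty string if the key is missing."""
    l = line.rstrip('\n')
    # safe_load only builds plain Python objects; calling yaml.load(l)
    # without a Loader argument fails on PyYAML 6 and later.
    l = yaml.safe_load(l)
    return l.get('asin', '')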
In [4]:
helpful_asins = []
counter = 1
for item in swiss_reviews_helpful:
    # Progress report every 500 reviews
    if(counter%500 == 0):
        print(counter)
    counter += 1
    x = filter_asin(item)
    # Skip reviews without an 'asin' key
    if(len(x) > 0):
        helpful_asins.append(x)
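Repeated ASINs do no harm here, but if a compact list is preferred, the duplicates could be dropped while keeping the first-seen order. A minimal sketch, where unique_in_order is a hypothetical helper and the sample identifiers are made up:

```python
def unique_in_order(items):
    """Drop repeated entries while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# Made-up identifiers, for illustration only.
asins_demo = ['B001', 'B002', 'B001', 'B003', 'B002']
print(unique_in_order(asins_demo))  # ['B001', 'B002', 'B003']
```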
Save the list of ASINs.
In [104]:
import pickle
with open('data/helpful_asins_bigger.pickle', 'wb') as fp:
    pickle.dump(helpful_asins, fp)
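When the list is needed again later, it can be read back with pickle.load. The sketch below shows the round trip on a temporary file with made-up identifiers; converting the list to a set is an assumption about how the lookup might be done, not something the original notebook does.

```python
import os
import pickle
import tempfile

# Made-up identifiers, for illustration only.
asins_demo = ['B001', 'B002', 'B001']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'asins_demo.pickle')
    with open(path, 'wb') as fp:
        pickle.dump(asins_demo, fp)
    with open(path, 'rb') as fp:
        restored = pickle.load(fp)

# A set gives constant-time membership tests and ignores repetitions.
asin_set = set(restored)
print('B001' in asin_set)  # True
```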