We found these statistics about the Amazon review dataset at http://minimaxir.com/2017/01/amazon-spark/ (credit: Max Woolf). Building on his analysis, we repeat the same computations using only the reviews of Swiss products.
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import datetime
from ggplot import *
plt.style.use('seaborn-whitegrid')
plt.style.use('seaborn-notebook')
#['grayscale', 'fivethirtyeight', 'seaborn-deep', 'bmh', 'seaborn-poster', 'seaborn-ticks', 'seaborn-dark', 'seaborn-darkgrid', 'seaborn-whitegrid', 'seaborn-white', 'seaborn-bright', 'seaborn-paper', 'seaborn-pastel', 'seaborn-colorblind', 'seaborn-notebook', 'seaborn-dark-palette', 'dark_background', 'ggplot', 'seaborn-muted', 'seaborn-talk', 'classic']
We start by looking at the product data without the reviews.
Load the product data into a DataFrame.
In [2]:
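def parse(path):
    # Each line holds one Python-literal dict; eval() is only safe on trusted files.
    with open(path, 'r') as f:
        for l in f:
            yield eval(l)

def getDF(path):
    # Collect the parsed records keyed by row number, then build the DataFrame.
    df = {}
    for i, d in enumerate(parse(path)):
        df[i] = d
    return pd.DataFrame.from_dict(df, orient='index')
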
df_products = getDF('data/swiss_products.json')
df_products = df_products.sort_values(['brand', 'price'])
df_products[df_products['brand'] == 'victorinox']
Out[2]:
Our data contains false positives. We found no way of filtering them out besides a quick manual check.
In [3]:
df_products.brand.value_counts()
Out[3]:
We decided to remove the obvious ones.
In [4]:
false_positives = ['samsonite', 'mazda', 'lacoste', 'pelikan', 'bell']
df_products = df_products[~df_products.brand.isin(false_positives)]
Let's take a look at the data. We start by detecting NA values.
In [5]:
for column_name in df_products.columns:
    print("NA values for " + column_name + ": " + str(df_products[column_name].isnull().sum()))
We fill the NA values as follows: an empty value for salesrank, related and description, and the average price for price.
In [6]:
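# Assign the filled columns back instead of calling fillna(inplace=True) on a column selection.
df_products['salesrank'] = df_products['salesrank'].fillna(value='{}')
df_products['related'] = df_products['related'].fillna(value='{}')
df_products['description'] = df_products['description'].fillna(value='')
# Missing prices are replaced by the mean of the known prices.
av_price = df_products['price'].mean()
df_products['price'] = df_products['price'].fillna(value=av_price)
av_price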
Out[6]:
The average price of the Swiss products is $79.73.
Let's look at the most expensive product on Amazon.
In [7]:
df_products[df_products.price == max(df_products['price'])].iloc[0].asin
from IPython.display import Image
Image(filename='data/expensivewatch.jpg')
Out[7]:
Let's look at some statistics about the products
In [8]:
df_products.price.describe()
Out[8]:
Which brand occurs the most in our dataset?
In [9]:
df_products.brand.value_counts()[:20].plot(kind='bar', figsize=(10,5), title="Number of occurrences of brands")
Out[9]:
Which category occurs the most?
In [10]:
# Flatten one level of nesting: a list of lists becomes a single list.
flatten = lambda l: [item for sublist in l for item in sublist]
categories_list = []
for categories in df_products.categories:
    for categorie in flatten(categories):
        categories_list.append(categorie)
pd.Series(categories_list).value_counts()[:20].plot(kind='bar', figsize=(10,5), title="Number of occurrences of categories")
Out[10]:
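The same nested-list counting can be sketched on toy data with the standard library only. The sample `toy_categories` below is made up for illustration; it merely mimics the list-of-lists shape that the `categories` column appears to have.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample: each product carries a list of category paths,
# and each path is itself a list of category names.
toy_categories = [
    [["Clothing", "Watches"], ["Clothing", "Men"]],
    [["Sports", "Knives"]],
    [["Clothing", "Watches"]],
]

# Flatten two levels: products -> category paths -> category names.
flat = chain.from_iterable(chain.from_iterable(toy_categories))
category_counts = Counter(flat)

# Show the two most frequent category names.
print(category_counts.most_common(2))
```

Note that a category is counted once per path it appears in, so a product listed under two paths that share a parent category contributes twice to that parent.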
In [11]:
df = pd.read_csv("data/amazon_ratings.csv")
df.head(10)
Out[11]:
Reduce each category string to its first category.
In [12]:
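def get_category(x):
    # Strip Python 2 unicode prefixes (u'...') so the string can be evaluated.
    while 'u\'' in x:
        x = x.replace('u\'','\'')
    # Return the first entry of the first category list.
    for y in eval(x):
        for first in y:
            return first

df['category'] = df['category'].apply(get_category)
df.head(10)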
Out[12]:
Number each review by its position in the user's review history (the n-th review written by that user).
In [13]:
df = df.sort_values(['user_id', 'timestamp'])
df[10:20]
Out[13]:
In [14]:
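df['nth_user'] = 1
user_id = ''
counter = 1
nth_col = df.columns.get_loc('nth_user')
for i in range(0, df.shape[0]):
    # Restart the counter whenever a new user begins (rows are sorted by user).
    if df.iloc[i].user_id != user_id:
        counter = 1
        user_id = df.iloc[i].user_id
    else:
        counter += 1
    # Assign by position: after sorting, the index labels no longer match the row order.
    df.iloc[i, nth_col] = counter
df[10:20]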
Out[14]:
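As a possible alternative to the explicit loop, the n-th review number can be computed in one vectorized step with `groupby(...).cumcount()`. The snippet below is a minimal sketch on made-up data; only the column names `user_id`, `timestamp` and `nth_user` are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "user_id":   ["b", "a", "a", "b", "a"],
    "timestamp": [30, 20, 10, 5, 40],
})

# Sort so that each user's reviews are in chronological order.
toy = toy.sort_values(["user_id", "timestamp"]).reset_index(drop=True)

# cumcount() numbers the rows within each group starting at 0, so add 1.
toy["nth_user"] = toy.groupby("user_id").cumcount() + 1
print(toy)
```

The same pattern should apply to products by grouping on `item_id` instead of `user_id`.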
Number each review by its position in the product's review history (the n-th review received by that product).
In [15]:
df = df.sort_values(['item_id', 'timestamp']).reset_index(drop=True)
df.head()
Out[15]:
In [16]:
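df['nth_product'] = 1
item_id = ''
counter = 1
nth_col = df.columns.get_loc('nth_product')
for i in range(0, df.shape[0]):
    # Restart the counter whenever a new product begins (rows are sorted by product).
    if df.iloc[i].item_id != item_id:
        counter = 1
        item_id = df.iloc[i].item_id
    else:
        counter += 1
    # Assign by position so the result does not depend on the index labels.
    df.iloc[i, nth_col] = counter
df.head()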
Out[16]:
Convert the Unix timestamp into a human-readable date.
In [17]:
def get_time(x):
    # Interpret x as seconds since the epoch, in the machine's local time zone.
    return datetime.datetime.fromtimestamp(
        int(x)
    ).strftime('%Y-%m-%d %H:%M:%S')

df['time'] = df['timestamp'].apply(get_time)
df.head(10)
Out[17]:
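The conversion can also be sketched with `pd.to_datetime`, which handles the whole column at once. This is an alternative, not what the notebook does: it interprets the timestamps as UTC, whereas `datetime.fromtimestamp` uses the local time zone, so dates near midnight may differ. The timestamps below are made up.

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds.
toy = pd.DataFrame({"timestamp": [0, 946684800, 1388534400]})

# Vectorized conversion; the resulting datetimes are in UTC.
toy["time"] = pd.to_datetime(toy["timestamp"], unit="s")

# With a real datetime column, the year is available without string matching.
toy["year"] = toy["time"].dt.year
print(toy)
```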
How many times is each rating given?
In [18]:
# .ix is no longer available in recent pandas; .iloc selects the first column by position.
df_ratings = df.groupby(df.rating).count().iloc[:,0]
df_ratings
Out[18]:
In [19]:
df_ratings.plot.bar()
Out[19]:
Do ratings evolve over time?
In [20]:
def percentage_rating(year):
    # Keep only the reviews whose date string contains the given year.
    df_year = df[df.time.str.contains(year)]
    counts_year = df_year.groupby(['rating']).count().iloc[:,0]
    print(df_year.shape)
    # Normalise the counts so they sum to 1.
    counts_year = counts_year/sum(counts_year)
    ret = pd.DataFrame(counts_year)
    ret.columns = [year]
    return ret.transpose()

percentage_rating('1999')
Out[20]:
In [21]:
ax = pd.concat([percentage_rating('1999'), percentage_rating('2014')]).transpose().plot.bar()
ax.set_title("Percentage of ratings given per year", fontsize=18)
ax.set_xlabel("Rating given", fontsize=16)
ax.set_ylabel("Percentage of total reviews", fontsize=16)
ax.legend(loc=2,prop={'size':14})
Out[21]:
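The per-year shares computed above amount to normalised value counts. A minimal sketch, assuming a made-up handful of ratings for a single year:

```python
import pandas as pd

# Hypothetical ratings given in one year.
ratings_one_year = pd.Series([5.0, 5.0, 4.0, 1.0])

# normalize=True divides each count by the total, so the shares sum to 1.
shares = ratings_one_year.value_counts(normalize=True).sort_index()
print(shares)
```

Ratings that never occur in a year are simply absent from the result, which is worth keeping in mind when two years are placed side by side.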
We now make plots to compare the Swiss products with Max Woolf's analysis, as mentioned earlier.
In [22]:
df_user_review_counts = df.groupby(df.user_id)
counts = df_user_review_counts.count().iloc[:,0].value_counts().sort_index()
counts
Out[22]:
In [23]:
num_reviews = []
prop = []
s = 0
total = sum(counts)
for i in range(0, counts.shape[0]):
    num_reviews.append(counts.index[i])
    # Running share of users who wrote at most this many reviews.
    s += counts.iloc[i] / total
    prop.append(s)
df_counts = pd.DataFrame(
    {'num_reviews': num_reviews,
     'prop': prop
    })
df_counts.head()
Out[23]:
This graph shows the cumulative distribution of the number of reviews a user gives on Swiss products.
In [24]:
ax = df_counts.plot(x='num_reviews', y='prop', figsize=(10,7),
color="#2980b9", legend=None)
ax.set_title("Cumulative Proportion of # Amazon Reviews Given by User", fontsize=18)
ax.set_xlabel("# Reviews Given By User", fontsize=14)
ax.set_ylabel("Cumulative Proportion of All Amazon Reviewers", fontsize=14)
ax.set_ylim((0,1.1))
ax.yaxis.set_ticks(np.arange(0, 1.1, 0.25))
ax.set_xlim((0,50))
Out[24]:
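The cumulative proportion plotted here can be sketched in a few lines with `cumsum`. The review counts below are invented; the point is only to show how the running share is built.

```python
import pandas as pd

# Hypothetical: number of reviews written by each of five users.
reviews_per_user = pd.Series([1, 1, 1, 2, 5])

# How many users wrote exactly k reviews, ordered by k.
counts = reviews_per_user.value_counts().sort_index()

# Running share of users who wrote at most k reviews.
cumulative = counts.cumsum() / counts.sum()
print(cumulative)
```

In this toy example 60% of the users wrote a single review, and the curve reaches 1 at the largest review count.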
In [25]:
df_user_review_counts = df.groupby(df.item_id)
counts = df_user_review_counts.count().iloc[:,0].value_counts().sort_index()
counts
Out[25]:
In [26]:
num_reviews = []
prop = []
s = 0
total = sum(counts)
for i in range(0, counts.shape[0]):
    num_reviews.append(counts.index[i])
    # Running share of products that received at most this many reviews.
    s += counts.iloc[i] / total
    prop.append(s)
df_counts = pd.DataFrame(
    {'num_reviews': num_reviews,
     'prop': prop
    })
df_counts.head()
Out[26]:
This graph shows the cumulative distribution of the number of reviews a Swiss product receives.
In [27]:
ax = df_counts.plot(x='num_reviews', y='prop', figsize=(10,7),
color="#27ae60", legend=None)
ax.set_title("Cumulative Proportion of # Reviews Given For Product", fontsize=18)
ax.set_xlabel("# Reviews Given For Product", fontsize=14)
ax.set_ylabel("Cumulative Proportion of All Amazon Products", fontsize=14)
ax.set_ylim((0,1.1))
ax.yaxis.set_ticks(np.arange(0, 1.1, 0.25))
ax.set_xlim((0,50))
Out[27]:
In [28]:
df_user_review_counts = df.groupby(df.category)
counts = df_user_review_counts.count().iloc[:,0]
categories = pd.DataFrame({'avg_rating': df_user_review_counts.agg({'rating': 'mean'})['rating'],
                           'count': counts})
categories
Out[28]:
This graph shows the average rating by product category.
In [29]:
# [((categories.index == "baby products") | (categories.index == "office & school supplies"))]
cats = categories['avg_rating'].sort_values(ascending=True)
ax = cats.plot.barh(width=1.0, color = "#e67e22", alpha=0.9, figsize=(6, 16))
ax.set_title("Average Rating Score Given For Amazon Reviews, by Product Category", fontsize = 16)
ax.set_xlabel("Avg. Rating For Reviews Given in Category", fontsize = 14)
ax.set_ylabel("Category", fontsize = 14)
for i in range(0, cats.shape[0]):
    # Select by position; the Series is indexed by category name.
    height = cats.iloc[i]
    ax.text(height-1/3, i-1/4,
            '%.2f' % height,
            ha='center', va='bottom', color = 'white', fontweight='bold', size=16)
In [30]:
counts = df.groupby(df.user_id).count().iloc[:,0]
avg_rating = df.groupby(df.user_id).agg({'rating': 'mean'})['rating']
users = pd.DataFrame({'avg_rating': avg_rating,
'count_reviews': counts})
users = users[users.count_reviews > 4]
users['avg_rating'] = round(users['avg_rating'],1)
users.head()
Out[30]:
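The per-user mean and review count can also be obtained in a single aggregation. The following is a minimal sketch on invented data; the column names `mean` and `count` come from the aggregation itself and differ from the `avg_rating` and `count_reviews` names used in the notebook.

```python
import pandas as pd

# Hypothetical ratings: user "a" wrote five reviews, user "b" only two.
toy = pd.DataFrame({
    "user_id": ["a"] * 5 + ["b"] * 2,
    "rating":  [5, 4, 5, 3, 4, 1, 2],
})

# One pass over the groups yields both statistics.
per_user = toy.groupby("user_id")["rating"].agg(["mean", "count"])

# Keep only users with at least five reviews, as in the analysis here.
frequent = per_user[per_user["count"] >= 5]
print(frequent)
```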
In [31]:
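counts = users['avg_rating'].value_counts()
# Name the index and the count column explicitly so that reset_index() yields
# the columns 'index' and 'avg_rating' regardless of the pandas version.
counts = pd.DataFrame(counts.rename('avg_rating').rename_axis('index')).sort_index()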
counts2 = counts.reset_index()
counts2.head()
Out[31]:
In [32]:
av = sum(counts2['index']*counts2['avg_rating'])/ sum(counts2['avg_rating'])
av
Out[32]:
This graph illustrates the distribution of the average score a user gives (provided that the user reviewed at least 5 products in our dataset).
In [33]:
ax = counts.plot.bar(figsize=(8,5), color ='#2980b9', legend=None)
ax.axvline(counts2[counts2['index']==round(av,1)].index, color='k', linestyle='--')
ax.set_title("Distribution of Average Ratings by User, for Amazon Products")
ax.set_xlabel("Average Rating for Amazon Products Given By User (5 Ratings Minimum)")
ax.set_ylabel("Count of Users")
ax.set_xticks([0,7,16,26,36])
ax.set_xticklabels(['1','2','3','4','5'])
Out[33]:
In [34]:
counts = df.groupby(df.item_id).count().iloc[:,0]
avg_rating = df.groupby(df.item_id).agg({'rating': 'mean'})['rating']
products = pd.DataFrame({'avg_rating': avg_rating,
'count_reviews': counts})
products = products[products.count_reviews > 4]
products['avg_rating'] = round(products['avg_rating'],1)
products.head()
Out[34]:
In [35]:
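counts = products['avg_rating'].value_counts()
# Name the index and the count column explicitly so that reset_index() yields
# the columns 'index' and 'avg_rating' regardless of the pandas version.
counts = pd.DataFrame(counts.rename('avg_rating').rename_axis('index')).sort_index()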
counts2 = counts.reset_index()
counts2.head()
Out[35]:
In [36]:
av = sum(counts2['index']*counts2['avg_rating'])/ sum(counts2['avg_rating'])
av
Out[36]:
This graph illustrates the average score a product has (provided that it has at least 5 reviews).
In [37]:
ax = counts.plot.bar(figsize=(10,6), color ='#27ae60', legend=None, width = 1)
ax.axvline(counts2[counts2['index']==round(av,1)].index, color='k', linestyle='--')
ax.set_title("Distribution of Overall Ratings on Amazon Products", fontsize = 18)
ax.set_xlabel("Overall Amazon Product Ratings (5 Ratings Minimum)", fontsize = 14)
ax.set_ylabel("Count of Products", fontsize = 14)
ax.set_xticks([0,9,19,29,39])
ax.set_xticklabels(['1.1','2','3','4','5'])
Out[37]:
In [38]:
df_breakdown_users =df[df.nth_user <= 50].groupby(['nth_user', 'rating']).count()
df_breakdown_users = df_breakdown_users.reset_index(level=1)
df_breakdown_users.head()
Out[38]:
In [39]:
# For each n (1..50), compute the share of each star rating among the n-th reviews.
df2 = pd.DataFrame(0.0, index=range(1, 51),
                   columns=['rating_' + str(j) for j in range(1, 6)])
for i in range(1, 51):
    rows = df_breakdown_users[df_breakdown_users.index == i]
    s = rows['user_id'].sum()  # total number of n-th reviews
    for j in range(1, 6):
        result = rows[rows.rating == j*1.0]
        if not result.empty:
            # .loc replaces set_value, which is no longer available in recent pandas.
            df2.loc[i, 'rating_'+str(j)] = result['user_id'].iloc[0] / s
df2 = df2[["rating_5","rating_4","rating_3","rating_2","rating_1"]]
df2.head()
Out[39]:
This graph shows how the ratings a person gives evolve over time on average.
In [40]:
ax = df2.plot.bar(stacked=True, color = ['#003a09', '#049104', '#d36743', '#dd310f', '#870202'], width = 0.9)
ax.set_xticks([10,20,30,40,50])
ax.set_xticklabels(['10','20','30','40','50'])
ax.set_title("Breakdown of Amazon Ratings Given by Users, by n-th Rating Given\n\n")
ax.set_xlabel("n-th Rating Given by User")
ax.set_ylabel("Proportion")
ax.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3,
ncol=5, mode="expand", borderaxespad=0.)
ax.set_ylim((0,1))
Out[40]:
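A breakdown of this kind can also be sketched with `pd.crosstab`, which builds the table of proportions directly. This is offered as an alternative under the assumption that only the share of each rating per n is needed; the data below are made up, and the column labels are the raw rating values rather than `rating_1` to `rating_5`.

```python
import pandas as pd

# Hypothetical reviews: four first reviews and two second reviews.
toy = pd.DataFrame({
    "nth_user": [1, 1, 1, 1, 2, 2],
    "rating":   [5.0, 5.0, 4.0, 1.0, 5.0, 3.0],
})

# normalize="index" divides each row by its total, so every row sums to 1.
breakdown = pd.crosstab(toy["nth_user"], toy["rating"], normalize="index")
print(breakdown)
```

Calling `breakdown.plot.bar(stacked=True)` on such a table would give a stacked chart similar in spirit to the one above.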
In [41]:
df_breakdown_users =df[df.nth_product <= 50].groupby(['nth_product', 'rating']).count()
df_breakdown_users = df_breakdown_users.reset_index(level=1)
df_breakdown_users.head()
Out[41]:
In [42]:
# For each n (1..50), compute the share of each star rating among the n-th reviews of a product.
df2 = pd.DataFrame(0.0, index=range(1, 51),
                   columns=['rating_' + str(j) for j in range(1, 6)])
for i in range(1, 51):
    rows = df_breakdown_users[df_breakdown_users.index == i]
    s = rows['item_id'].sum()  # total number of n-th reviews
    for j in range(1, 6):
        result = rows[rows.rating == j*1.0]
        if not result.empty:
            # .loc replaces set_value, which is no longer available in recent pandas.
            df2.loc[i, 'rating_'+str(j)] = result['item_id'].iloc[0] / s
df2 = df2[["rating_5","rating_4","rating_3","rating_2","rating_1"]]
df2.head()
Out[42]:
This graph shows how the ratings a product receives evolve over time on average.
In [43]:
ax = df2.plot.bar(stacked=True, color = ['#003a09', '#049104', '#d36743', '#dd310f', '#870202'], width = 0.9)
ax.set_xticks([10,20,30,40,50])
ax.set_xticklabels(['10','20','30','40','50'])
ax.set_title("Breakdown of Ratings for Amazon Products by n-th Item Rating Given\n\n")
ax.set_xlabel("n-th Review for Item Given")
ax.set_ylabel("Proportion")
ax.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3,
ncol=5, mode="expand", borderaxespad=0.)
ax.set_ylim((0,1))
Out[43]:
When comparing the Swiss reviews with the entire dataset, we can observe the following: