For this exercise, you will analyze a dataset from Amazon. The data format and a sample entry are shown on the next page.

A. (Suggested duration: 90 mins) With the given data for 548,552 products, perform exploratory analysis and make suggestions for further analysis on the following aspects.
b. Is it possible to reduce the number of categories drastically (say, to 10% of the existing categories) while sacrificing relatively few category entries (say, close to 10%)?
Yes, it should be possible to reduce the number of categories drastically while discarding relatively few category entries. One approach is to measure text similarity between category names and merge near-duplicates into a smaller set, for example with cosine similarity over TF-IDF vectors (as in Gensim), or by training a classifier such as Naive Bayes or an SVM on the category text; NLTK can help with tokenization and normalization of the names.
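As a minimal sketch of the merging idea above (using scikit-learn's TF-IDF and cosine similarity rather than the libraries named, and with made-up category names, not ones from the dataset): each category is greedily assigned to the first earlier category it resembles above a similarity threshold, so near-duplicate names collapse into one representative.

```python
# Sketch: merge near-duplicate category names by TF-IDF cosine similarity.
# The category strings and the 0.5 threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

categories = [
    "Books|Science|Physics",
    "Books|Science|Physics|Quantum",
    "Books|Literature|Fiction",
    "Music|Jazz",
]

# Character n-grams cope well with the pipe-delimited hierarchy strings.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))
X = vec.fit_transform(categories)
sim = cosine_similarity(X)

# Greedy merge: map each category to the first earlier one it resembles.
threshold = 0.5
representative = {}
for i, name in enumerate(categories):
    for j in range(i):
        if sim[i, j] >= threshold:
            representative[name] = categories[j]
            break
    else:
        representative[name] = name

for name, rep in representative.items():
    print(name, "->", rep)
```

Counting how many entries end up mapped to a different name than their own would give the fraction of category entries "sacrificed" by the reduction.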
In [23]:
%matplotlib inline
import pandas as pd
import json
from pandas.io.json import json_normalize
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(color_codes=True)
from pylab import plot, show, text
import datetime
import matplotlib.dates as mdates
import pylab
In [88]:
# Parse amazon-meta.txt: extract each product's id, ASIN, title, group,
# similar-count, categories count, review count and average rating,
# writing one CSV line per product.
i = 0
with open('amazon_users.txt', 'w', encoding="utf8") as file:
    with open('amazon-meta.txt', 'r', encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line[:3] == 'Id:':
                id = int(line[4:])
                i += 1
            if line[:5] == 'ASIN:':
                asin = line[6:]
            if line[:6] == 'title:':
                # Strip quotes and commas so the title is safe in CSV
                title = line[7:].replace('"', '').replace(',', '')
            if line[:6] == 'group:':
                group = line[7:]
            if line[:8] == 'similar:':
                similar = int(line[9:11])
            if line[:11] == 'categories:':
                categories = line[12:]
            if line[:8] == 'reviews:':
                # 'reviews:' is the last field per product, so write here
                reviews = int(line[15:18])
                avg_rating = float(line[-3:].replace(':', ''))
                strwrite = (str(id) + ',' + asin + ',"' + title + '",' + group
                            + ',' + str(similar) + ',' + categories + ','
                            + str(reviews) + ',' + str(avg_rating))
                file.write(strwrite + '\n')
# The with-blocks close both files automatically
Out[88]:
In [86]:
# Display the category lines of the first 10 Book products
i, is_categories, is_group = 0, False, False
with open('amazon-meta.txt', 'r', encoding="utf8") as f:
    for line in f:
        line = line.strip()
        if i == 10:
            break
        if line[:6] == 'group:':
            is_group = line[7:] == 'Book'
            j = 0
            is_categories = False
            i += 1
        if line[:11] == 'categories:' and is_group:
            num_categories = int(line[12:])
            is_categories = True
        if is_categories and is_group and j <= num_categories:
            j += 1
            print(line)
Out[86]:
In [41]:
# Read the extracted file into a dataframe
df = pd.read_csv('amazon_users.txt', header=None,
                 names=['id', 'asin', 'title', 'group', 'similar',
                        'categories', 'reviews', 'avg_rating'])
In [42]:
df.describe()
Out[42]:
In [43]:
# which item has the most reviews?
df[df['reviews'] == df['reviews'].max()]
Out[43]:
In [49]:
# how many items have no reviews?
print(len(df[df.reviews == 0]))
In [53]:
# select items with reviews
df_avg_rating = df[df.reviews > 0]
df_avg_rating.avg_rating.describe()
Out[53]:
In [58]:
# what is the average rating of items per group?
# (CE = consumer electronics)
df_avg_rating.groupby(['group']).avg_rating.mean().sort_values(ascending=False)
Out[58]: