DATA SCIENCE INTERVIEW CHALLENGE

For this exercise, you will analyze a dataset from Amazon. The data format and a sample entry are shown on the next page.

A. (Suggested duration: 90 mins) With the given data for 548552 products, perform exploratory analysis and make suggestions for further analysis on the following aspects.

  1. Trustworthiness of ratings
    Ratings are susceptible to manipulation, bias etc. What can you say (quantitatively speaking) about the ratings in this dataset?
    The ratings use a Likert-style scale (1-5), in which a numerical (quantitative) value is assigned to a qualitative choice. About 140K of the 542K products have zero reviews, so those need to be cleaned or removed before averaging the ratings. After removing them, the average rating is 4.3, which is interpreted as "Good to Very Good", compared to 3.2 ("Neutral") for the uncleaned data.
    Software and Baby products have the highest average rating at 4.5 ("Very Good"), while Video Games has the lowest average rating at 2.5 ("Average").
  2. Category bloat
    Consider the product group named 'Books'. Each product in this group is associated with categories. Naturally, with categorization, there are tradeoffs between how broad or specific the categories must be. For this dataset, quantify the following:
    a. Is there redundancy in the categorization? How can it be identified/removed?
    Yes. Some categories are near-duplicates, as in the example below, where Preaching and Sermons could be combined.
    |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
    |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]

    b. Is it possible to reduce the number of categories drastically (say to 10% of existing categories) by sacrificing relatively few category entries (say close to 10%)?
    Yes, it is possible to reduce the number of categories while giving up relatively few category entries. We could use a machine learning classifier such as Naive Bayes or SVM, or text-processing libraries such as NLTK or GenSim (which uses cosine similarity), to group similar words in the category names and combine them into fewer categories.
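
One way to put a number on this trade-off, as a rough sketch: count how often each category occurs across products, sort by frequency, and find how many of the most frequent categories are needed to cover about 90% of all category entries. The helper name `categories_needed` and the toy data are hypothetical. The real input would be the category entries parsed from `amazon-meta.txt`.

```python
from collections import Counter

def categories_needed(entries, coverage=0.90):
    """Return (kept, total): how many of the most frequent categories are
    needed to cover at least `coverage` of all category entries."""
    counts = Counter(entries)
    total_entries = sum(counts.values())
    covered = 0
    for kept, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total_entries >= coverage:
            return kept, len(counts)
    return len(counts), len(counts)

# Toy data (made up for illustration): one dominant category and a long tail.
entries = ["General"] * 95 + [f"rare-{k}" for k in range(5)]
kept, total = categories_needed(entries)
print(f"{kept} of {total} categories cover at least 90% of entries")
```

If a small fraction of categories covers about 90% of entries in the real data, the drastic reduction asked about in (b) is feasible. If the distribution is flat, it is not.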


In [23]:
%matplotlib inline
import pandas as pd
import json
from pandas.io.json import json_normalize
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(color_codes=True)
from pylab import plot, show, text
import datetime
import matplotlib.dates as mdates
import pylab

In [88]:
# Extract every product's fields and average rating, then save them as a CSV file.
# The fixed-width slices below follow the line layout of amazon-meta.txt.
i = 0  # running count of products seen
with open('amazon_users.txt', 'w', encoding="utf8") as file:
    with open('amazon-meta.txt', 'r', encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line[:3] == 'Id:':
                id = int(line[4:])
                i += 1
            if line[:5] == 'ASIN:':
                asin = line[6:]
            if line[:6] == 'title:':
                # drop quotes and commas so the title cannot break the CSV row
                title = line[7:].replace('"','').replace(',','')
            if line[:6] == 'group:':
                group = line[7:]
            if line[:8] == 'similar:':
                # number of similar products (first field after the label)
                similar = int(line[9:11])
            if line[:11] == 'categories:':
                categories = (line[12:])
            if line[:8] == 'reviews:':
                # NOTE: this slice reads at most two digits of the review total,
                # so totals of 100 or more appear to be truncated (compare the
                # maximum of 99 reviews in the summary statistics).
                reviews = int(line[15:18])
                avg_rating = float(line[-3:].replace(':',''))
                # the reviews line is the last field of a product, so write the row here
                strwrite = str(id) + ',' + asin + ',"' + title + '",' + group + ',' +  str(similar) + ',' +  categories + ',' +  str(reviews) + ',' +  str(avg_rating)
                file.write(strwrite + '\n')
# Both files are closed automatically when the with blocks exit.

In [86]:
# Display the category paths of the Book products among the first few products.
# (The output file is opened, but nothing is written to it yet.)

i, j, num_categories = 0, 0, 0
is_categories, is_group = False, False
with open('amazon_categories.txt', 'w', encoding="utf8") as file:
    with open('amazon-meta.txt', 'r', encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if i == 10:
                break
            if line[:6] == 'group:':
                # group line of a new product: reset the per-product state
                is_group = line[7:] == 'Book'
                j = 0
                is_categories = False
                i += 1

            if line[:11] == 'categories:' and is_group:
                num_categories = int(line[12:])
                is_categories = True

            # print the "categories: N" line followed by its N category paths
            if (is_categories and is_group and j <= num_categories):
                j += 1
                print (line)


categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Witchcraft[12486]
categories: 1
|Books[283155]|Subjects[1000]|Home & Garden[48]|Crafts & Hobbies[5126]|General[5144]
categories: 5
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Reference[172810]|Commentaries[12155]|New Testament[12159]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Christian Living[12333]|Discipleship[12335]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Bibles[12059]|Translations[764432]|Life Application[572080]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Bible & Other Sacred Texts[12056]|Bible[764430]|New Testament[572082]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Bibles[12059]|Study Guides, History & Reference[764438]|General[572094]
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Worship & Devotion[12465]|Prayerbooks[12470]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Christian Living[12333]|Business[297488]
categories: 5
|Books[283155]|Subjects[1000]|Arts & Photography[1]|Photography[2020]|Photo Essays[2082]
|Books[283155]|Subjects[1000]|History[9]|Americas[4808]|United States[4853]|General[4870]
|Books[283155]|Subjects[1000]|History[9]|Jewish[4992]|General[4993]
|Books[283155]|Subjects[1000]|Nonfiction[53]|Social Sciences[11232]|Sociology[11288]|Urban[11296]
|[172282]|Categories[493964]|Camera & Photo[502394]|Photography Books[733540]|Photo Essays[733676]
categories: 4
|Books[283155]|Subjects[1000]|Gay & Lesbian[301889]|Nonfiction[10703]|General[10716]
|Books[283155]|Subjects[1000]|Nonfiction[53]|Crime & Criminals[11003]|Criminology[11005]
|Books[283155]|Subjects[1000]|Nonfiction[53]|Politics[11079]|General[11083]
|Books[283155]|Subjects[1000]|Nonfiction[53]|Politics[11079]|U.S.[11117]
categories: 1
|Books[283155]|Subjects[1000]|Cooking, Food & Wine[6]|Baking[4196]|Bread[4197]
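
The printed paths suggest one simple redundancy check: the same category name appearing under more than one numeric ID (for example "Photo Essays" and "General" above). The sketch below assumes each node has the form `Name[ID]` and flags names with several IDs. The function name `repeated_names` is hypothetical, and a repeated name is only a candidate for merging, not proof of redundancy.

```python
import re
from collections import defaultdict

# One path node: an optional name followed by a numeric ID in brackets.
NODE = re.compile(r"([^|\[\]]*)\[(\d+)\]")

def repeated_names(paths):
    """Map each category name to the set of distinct IDs it appears with,
    keeping only names that occur with more than one ID."""
    ids_by_name = defaultdict(set)
    for path in paths:
        for name, node_id in NODE.findall(path):
            if name:  # skip unnamed nodes such as "[172282]"
                ids_by_name[name].add(int(node_id))
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}

# Four of the paths printed above.
paths = [
    "|Books[283155]|Subjects[1000]|Arts & Photography[1]|Photography[2020]|Photo Essays[2082]",
    "|[172282]|Categories[493964]|Camera & Photo[502394]|Photography Books[733540]|Photo Essays[733676]",
    "|Books[283155]|Subjects[1000]|History[9]|Jewish[4992]|General[4993]",
    "|Books[283155]|Subjects[1000]|Nonfiction[53]|Politics[11079]|General[11083]",
]
print(repeated_names(paths))
```

Names such as "General" repeat because they are generic leaves under different parents, so the parent path should be inspected before merging anything.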

In [41]:
# read file and load to dataframe
df = pd.read_csv('amazon_users.txt', header=None, 
                 names=['id', 'asin', 'title', 'group', 'similar', 'categories', 'reviews', 'avg_rating'])

In [42]:
df.describe()


Out[42]:
                  id        similar     categories        reviews     avg_rating
count  542684.000000  542684.000000  542684.000000  542684.000000  542684.000000
mean   274414.212208       3.296071       4.624605       7.498229       3.209534
std    158454.479276       2.287289       4.450647      13.759132       1.996296
min         1.000000       0.000000       0.000000       0.000000       0.000000
25%    137161.750000       0.000000       2.000000       0.000000       0.000000
50%    274427.500000       5.000000       4.000000       2.000000       4.000000
75%    411674.250000       5.000000       6.000000       8.000000       5.000000
max    548551.000000       5.000000     116.000000      99.000000       5.000000

In [43]:
# which item has the most categories?
df[df['categories'] == 116]


Out[43]:
            id        asin                           title group  similar  categories  reviews  avg_rating
113838  115078  9626341408  The History of Classical Music  Book        5         116        2         4.5

In [49]:
# how many items have no reviews?
print(len(df[df.reviews == 0]))


139949

In [53]:
# select items with reviews
df_avg_rating = df[df.reviews > 0]
df_avg_rating.avg_rating.describe()


Out[53]:
count    402735.000000
mean          4.324836
std           0.739279
min           1.000000
25%           4.000000
50%           4.500000
75%           5.000000
max           5.000000
Name: avg_rating, dtype: float64

In [58]:
# what is the average rating of items per group?
# CE is consumer electronics
df_avg_rating.groupby(['group']).avg_rating.mean().sort_values(ascending=False)


Out[58]:
group
Software        4.500000
Baby Product    4.500000
Music           4.482065
Toy             4.357143
Book            4.315994
Video           4.164579
Sports          4.000000
DVD             3.940051
CE              3.500000
Video Games     2.500000
Name: avg_rating, dtype: float64
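
Group means of exactly 4.5 or 2.5 may rest on very few products, so it helps to read each mean next to its group size. The sketch below uses only the standard library and made-up rows. The function name `mean_and_count` is hypothetical.

```python
from collections import defaultdict

def mean_and_count(rows):
    """rows: iterable of (group, avg_rating) pairs for reviewed products.
    Returns {group: (mean rating, number of products)}."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of ratings, count]
    for group, rating in rows:
        totals[group][0] += rating
        totals[group][1] += 1
    return {g: (s / n, n) for g, (s, n) in totals.items()}

# Toy rows (made up): a larger group next to a two-product group.
rows = [("Book", 4.0), ("Book", 4.5), ("Book", 5.0), ("Book", 3.5),
        ("Software", 4.0), ("Software", 5.0)]
for group, (mean, n) in sorted(mean_and_count(rows).items()):
    print(f"{group:10s} mean={mean:.2f} n={n}")
```

In the notebook itself, the equivalent would be `df_avg_rating.groupby('group').avg_rating.agg(['mean', 'count'])`.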

PART B

B. (Suggested duration: 30 mins) Give the number crunching a rest! Just think about these problems.

  1. Algorithm thinking
    How would you build the product categorization from scratch, using similar/co-purchased information?
    I would collect all the words in a product's categorization, together with the categorizations of its similar products. Then, using a classifier such as Naive Bayes or a Support Vector Machine (SVM), with NLTK for the text processing, I would classify the products based on the categories shared between a given product and its similar products. For example, ProductA is categorized as cat1, cat2, cat3, while its similar product ProductA1 is categorized as cat1, cat2.1, cat5. Then, based on a similarity coefficient, ProductA is classified as (0.6 cat1, 0.15 cat2, 0.15 cat2.1, 0.05 cat3, 0.05 cat5).
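
The blending idea in the ProductA example could be sketched as a weighted vote: a product's own categories share one part of the weight, and the categories of its similar products share the rest. This is an illustrative assumption, not the exact similarity coefficient used above, so it does not reproduce those numbers. The function name `blended_categories` and the `own_weight` parameter are hypothetical.

```python
from collections import Counter

def blended_categories(own, similar, own_weight=0.5):
    """Blend a product's own categories with those of its similar products.
    own: list of category names; similar: list of category lists.
    Returns {category: weight}, with weights summing to 1."""
    scores = Counter()
    for cat in own:
        scores[cat] += own_weight / len(own)
    neighbour_cats = [c for cats in similar for c in cats]
    for cat in neighbour_cats:
        scores[cat] += (1 - own_weight) / len(neighbour_cats)
    total = sum(scores.values())
    # Normalize so the weights still sum to 1 when one side is empty.
    return {cat: w / total for cat, w in scores.items()}

weights = blended_categories(
    own=["cat1", "cat2", "cat3"],
    similar=[["cat1", "cat2.1", "cat5"]],
)
for cat, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{cat}: {w:.2f}")
```

A category shared by the product and its neighbours (cat1 here) gets the largest weight, which is the behaviour the example describes.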



  2. Product thinking
    Now, put on your 'product thinking' hat.
    a. Is it a good idea to show users the categorization hierarchy for items?
    Yes. Showing users the categorization hierarchy for each item helps them find more items in the same category as the one they are looking at. It also increases user engagement on the website, and thus the probability that the user will purchase an item.
    b. Is it a good idea to show users similar/co-purchased items?
    Yes. It is a quick way to showcase other products to the user. It tends to increase sales, since customers feel that the selection is personalized to their needs and are encouraged to check out or buy more.
    c. Is it a good idea to show users reviews and ratings for items?
    Yes. Customers who have limited information about an item will base their decisions on other customers who bought and used the product. It is also important to display the average rating AND HOW MANY REVIEWS. A five-star rating from 3 reviewers may be less reliable than a 4.5 rating from 100 reviewers.
    d. For each of the above, why? How will you establish the same?
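
The point about a five-star rating from 3 reviewers versus a 4.5 rating from 100 reviewers can be made concrete with a shrunken (Bayesian-style) average, which pulls ratings with few reviews toward a prior mean. This is a minimal sketch under assumptions: the prior mean of 4.3 is the cleaned dataset average reported earlier, and the prior weight of 10 pseudo-reviews is an arbitrary illustrative choice.

```python
def bayesian_average(avg_rating, n_reviews, prior_mean=4.3, prior_weight=10):
    """Shrink a product's average rating toward a prior mean.
    prior_weight acts like a number of pseudo-reviews at prior_mean
    (an assumed value, to be tuned on real data)."""
    return (prior_weight * prior_mean + n_reviews * avg_rating) / (prior_weight + n_reviews)

few = bayesian_average(5.0, 3)     # five stars from 3 reviewers
many = bayesian_average(4.5, 100)  # 4.5 stars from 100 reviewers
print(round(few, 2), round(many, 2))
```

With these assumed settings, the product with 100 reviews ends up ranked slightly above the five-star product with 3 reviews.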

End of Report