For this exercise, you will analyze a dataset from Amazon. The data format and a sample entry are shown on the next page.

A. (Suggested duration: 90 mins) With the given data for 548,552 products, perform exploratory analysis and make suggestions for further analysis on the following aspects.
b. Is it possible to reduce the number of categories drastically (say, to 10% of the existing categories) while sacrificing relatively few category entries (say, close to 10%)?
Yes, it should be possible to reduce the number of categories drastically while discarding relatively few category entries. One approach is to measure text similarity between category names and merge near-duplicates into a smaller set, for example with cosine similarity over TF-IDF vectors (as in Gensim), or by training a classifier such as Naive Bayes or an SVM on the category text; NLTK can help with tokenization and normalization of the names.
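As a minimal sketch of the merging idea above (using scikit-learn's TF-IDF and cosine similarity rather than the libraries named, and with made-up category names, not ones from the dataset): each category is greedily assigned to the first earlier category it resembles above a similarity threshold, so near-duplicate names collapse into one representative.

```python
# Sketch: merge near-duplicate category names by TF-IDF cosine similarity.
# The category strings and the 0.5 threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

categories = [
    "Books|Science|Physics",
    "Books|Science|Physics|Quantum",
    "Books|Literature|Fiction",
    "Music|Jazz",
]

# Character n-grams cope well with the pipe-delimited hierarchy strings.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))
X = vec.fit_transform(categories)
sim = cosine_similarity(X)

# Greedy merge: map each category to the first earlier one it resembles.
threshold = 0.5
representative = {}
for i, name in enumerate(categories):
    for j in range(i):
        if sim[i, j] >= threshold:
            representative[name] = categories[j]
            break
    else:
        representative[name] = name

for name, rep in representative.items():
    print(name, "->", rep)
```

Counting how many entries end up mapped to a different name than their own would give the fraction of category entries "sacrificed" by the reduction.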
In [23]:
%matplotlib inline
import pandas as pd
import json
from pandas.io.json import json_normalize
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(color_codes=True)
from pylab import plot, show, text
import datetime
import matplotlib.dates as mdates
import pylab
In [88]:
# Parse amazon-meta.txt: extract each product's id, ASIN, title, group,
# similar-count, categories count, review count and average rating,
# writing one CSV line per product.
i = 0
with open('amazon_users.txt', 'w', encoding="utf8") as file:
    with open('amazon-meta.txt', 'r', encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line[:3] == 'Id:':
                id = int(line[4:])
                i += 1
            if line[:5] == 'ASIN:':
                asin = line[6:]
            if line[:6] == 'title:':
                # Strip quotes and commas so the title is safe in CSV
                title = line[7:].replace('"', '').replace(',', '')
            if line[:6] == 'group:':
                group = line[7:]
            if line[:8] == 'similar:':
                similar = int(line[9:11])
            if line[:11] == 'categories:':
                categories = line[12:]
            if line[:8] == 'reviews:':
                # 'reviews:' is the last field per product, so write here
                reviews = int(line[15:18])
                avg_rating = float(line[-3:].replace(':', ''))
                strwrite = (str(id) + ',' + asin + ',"' + title + '",' + group
                            + ',' + str(similar) + ',' + categories + ','
                            + str(reviews) + ',' + str(avg_rating))
                file.write(strwrite + '\n')
# The with-blocks close both files automatically
Out[88]:
In [86]:
# Display the category lines of the first 10 Book products
i, is_categories, is_group = 0, False, False
with open('amazon-meta.txt', 'r', encoding="utf8") as f:
    for line in f:
        line = line.strip()
        if i == 10:
            break
        if line[:6] == 'group:':
            is_group = line[7:] == 'Book'
            j = 0
            is_categories = False
            i += 1
        if line[:11] == 'categories:' and is_group:
            num_categories = int(line[12:])
            is_categories = True
        if is_categories and is_group and j <= num_categories:
            j += 1
            print(line)
Out[86]:
In [41]:
# Read the extracted file into a dataframe
df = pd.read_csv('amazon_users.txt', header=None,
                 names=['id', 'asin', 'title', 'group', 'similar',
                        'categories', 'reviews', 'avg_rating'])
In [42]:
df.describe()
Out[42]:
In [43]:
# which item has the most reviews?
df[df['reviews'] == df['reviews'].max()]
Out[43]:
In [49]:
# how many items have no reviews?
print(len(df[df.reviews == 0]))
In [53]:
# select items with reviews
df_avg_rating = df[df.reviews > 0]
df_avg_rating.avg_rating.describe()
Out[53]:
In [58]:
# what is the average rating of items per group?
# (CE = consumer electronics)
df_avg_rating.groupby(['group']).avg_rating.mean().sort_values(ascending=False)
Out[58]: