The ratio (total #reviews)/(#different products), i.e. the average number of reviews per product. Ideally we would like several reviews per product, so this should be high.
Similar products should be plausible alternatives for buyers: a hair lotion for dry hair would not replace one for greasy hair, while different shovels could reasonably substitute for one another. Unfortunately this is hard to extract automatically, so it has to be judged manually.
Skewness of the rating distribution towards 1 and 5. Our questions are easier to answer when there are many 5's and 1's, so this skewness should be high.
The dataset should load quickly. To keep feedback loops short, at least in the beginning, we should pick small datasets. We can still consider the large ones, as long as we work with only a random sample of them.
Existing bibliography. Datasets that have already been used by others are preferred, since we can reuse and extend their benchmarks, exploratory analyses, and notebook kernels.
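As a quick sketch of the first and third criteria, both metrics can be computed from a plain list of (product, rating) pairs with standard-library tools alone; the toy data below is invented purely for illustration:

```python
from collections import Counter

# Toy review data, invented for illustration: (product_id, rating) pairs.
reviews = [('A', 5), ('A', 1), ('A', 5), ('B', 4), ('B', 5), ('C', 1)]

# Criterion 1: average number of reviews per product (higher is better).
reviews_per_product = len(reviews) / len({product for product, _ in reviews})
print(reviews_per_product)  # 6 reviews over 3 products -> 2.0

# Criterion 3: share of each rating; skew towards 1 and 5 is desirable.
rating_counts = Counter(rating for _, rating in reviews)
percentages = {rating: count / len(reviews)
               for rating, count in rating_counts.items()}
print(percentages[5])  # 3 of the 6 reviews are 5-star -> 0.5
```

The Spark cells below compute the same quantities at scale over the review DataFrames.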
In [1]:
def average_review_number_per_product(reviews_df, reviews_count):
    distinct_products = reviews_df.select('asin').distinct().count()
    return reviews_count / float(distinct_products)
In [2]:
def average_reviews_per_reviewer(reviews_df, reviews_count):
    distinct_reviewers = reviews_df.select('reviewerID').distinct().count()
    return reviews_count / float(distinct_reviewers)
In [3]:
def percentages_per_rating(reviews_df, reviews_count):
    # Map each Row to an explicit (rating, count) tuple instead of relying on
    # the ordering of Row.asDict().values(), which put the fields in schema
    # order and silently swapped the two values in the unpacking below.
    rating_counts = (reviews_df
                     .groupBy('overall')
                     .count()
                     .rdd
                     .map(lambda row: (row['overall'], row['count']))
                     .collect())
    return [(str(int(rating)), rating_count / float(reviews_count))
            for rating, rating_count in rating_counts]
In [4]:
import re
import numpy as np

def evaluate_metrics(reviews_df, filename):
    # Extract a human-readable dataset name, e.g.
    # 'reviews_Musical_Instruments_5.json.gz' -> 'Musical Instruments'.
    # Note the raw string and the '$' anchor; the original pattern ended in
    # 'gz*', which matched 'g' followed by any number of 'z's.
    name = (re
            .search(r'^reviews_(.+)_5\.json\.gz$', filename)
            .group(1)
            .replace('_', ' '))
    print(name)
    reviews_count = reviews_df.count()
    return dict(
        [('dataset_name', name),
         ('number_of_reviews', reviews_count),
         ('reviews_per_product', average_review_number_per_product(reviews_df, reviews_count)),
         ('reviews_per_reviewer', average_reviews_per_reviewer(reviews_df, reviews_count))]
        + percentages_per_rating(reviews_df, reviews_count))
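The filename-to-name extraction can be checked in isolation; the filename below is a made-up example following the 5-core naming scheme:

```python
import re

# Hypothetical filename following the 'reviews_<Category>_5.json.gz' convention.
filename = 'reviews_Musical_Instruments_5.json.gz'
name = (re
        .search(r'^reviews_(.+)_5\.json\.gz$', filename)
        .group(1)
        .replace('_', ' '))
print(name)  # -> Musical Instruments
```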
In [5]:
import os
import pandas as pd

def extract_metrics_from_directory(data_directory):
    # `spark` is the active SparkSession provided by the notebook environment.
    return (pd
            .DataFrame
            .from_dict(
                [evaluate_metrics(
                     spark.read.json(os.path.join(data_directory, filename)),
                     filename)
                 for filename in sorted(os.listdir(data_directory))])
            .set_index('dataset_name'))

metrics = extract_metrics_from_directory('./data/raw_data')
metrics.to_csv('./metadata/initial-data-evaluation-metrics.csv')
In [9]:
metrics.sort_values(['number_of_reviews'], ascending=False)
Out[9]: