The ratio (total #reviews)/(#different products), i.e. the average number of reviews per product. Ideally we would like several reviews per product, so this should be high.
Similar products should be plausible alternatives for buyers: a hair lotion for dry hair would not replace one for greasy hair, while different shovels could reasonably substitute for one another. Unfortunately this is hard to extract automatically, so it has to be judged manually.
Skewness of the rating distribution towards 1 and 5. Our questions are easier to answer when there are many 5's and 1's, so this skewness should be high.
The dataset should load quickly. To keep feedback loops short, at least in the beginning, we should pick small datasets. We can still consider the large ones, as long as we work with only a random sample of them.
Existing bibliography. Datasets that have already been used by others are preferred, since we can reuse and extend their benchmarks, exploratory analyses, and notebook kernels.
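As a quick sketch of the first and third criteria, both metrics can be computed from a plain list of (product, rating) pairs with standard-library tools alone; the toy data below is invented purely for illustration:

```python
from collections import Counter

# Toy review data, invented for illustration: (product_id, rating) pairs.
reviews = [('A', 5), ('A', 1), ('A', 5), ('B', 4), ('B', 5), ('C', 1)]

# Criterion 1: average number of reviews per product (higher is better).
reviews_per_product = len(reviews) / len({product for product, _ in reviews})
print(reviews_per_product)  # 6 reviews over 3 products -> 2.0

# Criterion 3: share of each rating; skew towards 1 and 5 is desirable.
rating_counts = Counter(rating for _, rating in reviews)
percentages = {rating: count / len(reviews)
               for rating, count in rating_counts.items()}
print(percentages[5])  # 3 of the 6 reviews are 5-star -> 0.5
```

The Spark cells below compute the same quantities at scale over the review DataFrames.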
In [1]:
def average_review_number_per_product(reviews_df, reviews_count):
    distinct_products = reviews_df.select('asin').distinct().count()
    return reviews_count / float(distinct_products)
In [2]:
def average_reviews_per_reviewer(reviews_df, reviews_count):
    distinct_reviewers = reviews_df.select('reviewerID').distinct().count()
    return reviews_count / float(distinct_reviewers)
In [3]:
def percentages_per_rating(reviews_df, reviews_count):
    # Map each Row to an explicit (rating, count) tuple instead of relying on
    # the ordering of Row.asDict().values(), which put the fields in schema
    # order and silently swapped the two values in the unpacking below.
    rating_counts = (reviews_df
                     .groupBy('overall')
                     .count()
                     .rdd
                     .map(lambda row: (row['overall'], row['count']))
                     .collect())
    return [(str(int(rating)), rating_count / float(reviews_count))
            for rating, rating_count in rating_counts]
In [4]:
import re
import numpy as np

def evaluate_metrics(reviews_df, filename):
    # Extract a human-readable dataset name, e.g.
    # 'reviews_Musical_Instruments_5.json.gz' -> 'Musical Instruments'.
    # Note the raw string and the '$' anchor; the original pattern ended in
    # 'gz*', which matched 'g' followed by any number of 'z's.
    name = (re
            .search(r'^reviews_(.+)_5\.json\.gz$', filename)
            .group(1)
            .replace('_', ' '))
    print(name)
    reviews_count = reviews_df.count()
    return dict(
        [('dataset_name', name),
         ('number_of_reviews', reviews_count),
         ('reviews_per_product', average_review_number_per_product(reviews_df, reviews_count)),
         ('reviews_per_reviewer', average_reviews_per_reviewer(reviews_df, reviews_count))]
        + percentages_per_rating(reviews_df, reviews_count))
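The filename-to-name extraction can be checked in isolation; the filename below is a made-up example following the 5-core naming scheme:

```python
import re

# Hypothetical filename following the 'reviews_<Category>_5.json.gz' convention.
filename = 'reviews_Musical_Instruments_5.json.gz'
name = (re
        .search(r'^reviews_(.+)_5\.json\.gz$', filename)
        .group(1)
        .replace('_', ' '))
print(name)  # -> Musical Instruments
```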
In [5]:
import os
import pandas as pd

def extract_metrics_from_directory(data_directory):
    # `spark` is the active SparkSession provided by the notebook environment.
    return (pd
            .DataFrame
            .from_dict(
                [evaluate_metrics(
                     spark.read.json(os.path.join(data_directory, filename)),
                     filename)
                 for filename in sorted(os.listdir(data_directory))])
            .set_index('dataset_name'))

metrics = extract_metrics_from_directory('./data/raw_data')
metrics.to_csv('./metadata/initial-data-evaluation-metrics.csv')
In [9]:
metrics.sort_values(['number_of_reviews'], ascending=False)
Out[9]: