Importing the large datasets into a PostgreSQL server and computing their metrics

The larger datasets do not fit in the memory of a local machine, so an alternative is to import them into a PostgreSQL table and query them from there. With the right indices the queries become fast enough. After the import we can extract some basic statistics using SQL and also export smaller portions of the data that can be handled by Spark or pandas on a local machine.
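
To make the last step concrete, here is a minimal sketch of such an export, assuming the books table created further below and a hypothetical sample file (TABLESAMPLE requires PostgreSQL 9.5 or newer):

# a sketch, not part of the pipeline: export a ~1% random sample of one
# table to a hypothetical CSV file that pandas or Spark can handle locally
!psql -U mariosk -d amazon_reviews -c "\copy (select * from books tablesample bernoulli (1)) to './data/books-sample.csv' with (format csv, header true);"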

Helper functions


In [1]:
import timeit
def stopwatch(function):
    start_time = timeit.default_timer()
    result = function()
    print('Elapsed time: %i sec' % int(timeit.default_timer() - start_time))
    return result
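
For example:

# time an arbitrary zero-argument callable; the result is passed through
stopwatch(lambda: sum(range(10 ** 7)))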

Unzipping the data and converting it to CSV format

Unfortunately PostgreSQL's COPY does not support newline-delimited JSON files, so we first convert the datasets to CSV with the command-line tool json2csv. Before the conversion we replace every pipe character with '?', since '|' will serve as the CSV delimiter, and do the same with literal \u0000 escape sequences, which PostgreSQL text columns cannot hold.

WARNING: The following commands will run for a while, especially the json2csv conversion. Expect roughly 1 minute per GB of unzipped data.


In [196]:
start_time = timeit.default_timer()

!ls ./data/large-datasets/*.gz | grep -Po '.*(?=.gz)' | xargs -I {} gunzip {}.gz

print('Elapsed time: %i sec' % int(timeit.default_timer() - start_time))


Elapsed time: 178 sec

In [2]:
start_time = timeit.default_timer()

!ls ./data/large-datasets/*.json | xargs sed -i 's/|/?/g;s/\u0000/?/g'

print('Elapsed time: %i sec' % int(timeit.default_timer() - start_time))


Elapsed time: 275 sec

In [3]:
start_time = timeit.default_timer()

!ls ./data/large-datasets/*.json | grep -Po '.*(?=.json)' | xargs -I {} json2csv -p -d '|' -k asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime -i {}.json -o {}.csv
!rm ./data/large-datasets/*.json

print('Elapsed time: %i sec' % int(timeit.default_timer() - start_time))


2017/08/28 00:16:16 ERROR Decoding JSON at line 8789077: invalid character '?' in string escape code
{"reviewerID": "AWGWD8R8PLWH3", "asin": "B00JL1H75A", "reviewerName": "Kim", "helpful": [0, 0], "reviewText": "Love this series! I even like Elijah. He has been through a lot and it sounds like his father is some kind of crazy tyrant or despot. I'm thinking along the lines of Saddom Hussein. He really sounds evil. I like Elijah and Natalie together and really hope she can turn him around. I do think there is a chance for them. I just got the 4 th book and now I have to decide if I can wait til the box set comes out. Great Book, and it is hot. He is a little finish and takes over Natalie's life and is very controlling. But as their relationship sort of grows, you get little glimps\?s of something deeper and better for them. You should read this book. It is really good.", "overall": 5.0, "summary": "Wow", "unixReviewTime": 1403827200, "reviewTime": "06 27, 2014"}
Elapsed time: 735 sec

Importing the data into PostgreSQL

To import the data into PostgreSQL we create a table of the appropriate shape for each dataset and import from the CSV files generated above.

Some preparation to run PostgreSQL transactions and queries in Python


In [4]:
import psycopg2 as pg
import pandas as pd

db_conf = { 
    'user': 'mariosk',
    'database': 'amazon_reviews'
}

connection_factory = lambda: pg.connect(user=db_conf['user'], database=db_conf['database'])

def transaction(*statements):
    # run all given statements in a single transaction
    connection = None
    try:
        connection = connection_factory()
        cursor = connection.cursor()
        for statement in statements:
            cursor.execute(statement)
        connection.commit()
        cursor.close()
    except pg.DatabaseError as error:
        print(error)
    finally:
        if connection is not None:
            connection.close()

def query(statement):
    # run a read-only query and return the result as a pandas DataFrame
    connection = None
    try:
        connection = connection_factory()
        cursor = connection.cursor()
        cursor.execute(statement)

        header = [ description[0] for description in cursor.description ]
        rows = cursor.fetchall()

        cursor.close()
        return pd.DataFrame.from_records(rows, columns=header)
    except (Exception, pg.DatabaseError) as error:
        print(error)
        return None
    finally:
        if connection is not None:
            connection.close()
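
A quick smoke test of the helpers, assuming the connection settings above are valid:

# should print a one-row DataFrame if the connection works
print(query('select 1 as ok'))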

Creating tables with indices for the large datasets


In [5]:
import os
import re

table_names = [ re.search('reviews_(.*)_5.csv', filename).group(1) 
    for filename 
    in sorted(os.listdir('./data/large-datasets'))
    if not filename.endswith('json') ]

In [6]:
def create_table(table_name):
    transaction(
        'create table %s (asin text, helpful text, overall double precision, reviewText text, reviewTime text, reviewerID text, reviewerName text, summary text, unixReviewTime int);' % table_name,
        'create index {0}_asin ON {0} (asin);'.format(table_name),
        'create index {0}_overall ON {0} (overall);'.format(table_name),
        'create index {0}_reviewerID ON {0} (reviewerID);'.format(table_name),
        'create index {0}_unixReviewTime ON {0} (unixReviewTime);'.format(table_name))

for table_name in table_names:
    create_table(table_name)
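
Note that PostgreSQL folds unquoted identifiers to lowercase, so the tables and columns are actually created as books, reviewtext, reviewerid and so on; the unquoted queries below fold the same way, so nothing breaks. The created shape can be verified through the catalog (a sketch, assuming the Books dataset is among the table_names):

# list the columns of one freshly created table; note the lowercase names
query("""
    select column_name, data_type
    from information_schema.columns
    where table_name = 'books'
""")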

Importing the datasets into PostgreSQL

WARNING: The following command will take a long time to complete. Estimate roughly 1 minute per GB of CSV data.


In [7]:
start_time = timeit.default_timer()

!ls ./data/large-datasets | grep -Po '(?<=reviews_).*(?=_5.csv)' | xargs -I {} psql -U mariosk -d amazon_reviews -c "\copy {} from './data/large-datasets/reviews_{}_5.csv' with (format csv, delimiter '|', header true);"

print('Elapsed time: %i sec' % int(timeit.default_timer() - start_time))


COPY 8898040
COPY 1097592
COPY 1689188
COPY 1697533
Elapsed time: 871 sec
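
The COPY output above reports how many rows went into each table; a quick cross-check with the query helper defined earlier:

# the printed counts should match the COPY output above
for table_name in table_names:
    print(table_name, query('select count(*) from %s' % table_name).iloc[0, 0])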

Querying the metrics


In [67]:
def average_reviews_per_product(table_name):
    return (query('''
        with distinct_products as (select count(distinct asin) as products from {0}),
             reviews_count as (select cast(count(*) as double precision) as reviews from {0})
        select reviews / products as reviews_per_product
        from distinct_products cross join reviews_count
    '''.format(table_name))
    .rename(index={0: table_name.replace('_', ' ')}))

In [68]:
def average_reviews_per_reviewer(table_name):
    return (query('''
        with distinct_reviewers as (select count(distinct reviewerID) as reviewers from {0}),
             reviews_count as (select cast(count(*) as double precision) as reviews from {0})
        select reviews / reviewers as reviews_per_reviewer
        from distinct_reviewers cross join reviews_count
    '''.format(table_name))
    .rename(index={ 0: table_name.replace('_', ' ')}))

In [69]:
def percentages_per_rating(table_name):
    # count the reviews per star rating and divide by the total number of
    # reviews, then pivot the ratings into columns so that each dataset
    # ends up as a single labelled row
    return (query('''
            with rating_counts as (select overall, count(overall) as rating_count from {0} group by overall),
                 reviews_count as (select cast(count(*) as double precision) as reviews from {0})
            select cast(overall as int) as dataset_name, rating_count / reviews as row
            from rating_counts cross join reviews_count
        '''.format(table_name))
        .set_index('dataset_name')
        .sort_index()
        .transpose()
        .rename(index={'row': table_name.replace('_', ' ')}))

In [75]:
def number_of_reviews(table_name):
    return (query('''
        select count(*) as number_of_reviews from {0}
    '''.format(table_name))
    .rename(index={ 0: table_name.replace('_', ' ') }))

In [76]:
def all_metrics(table_name):
    print(table_name)
    
    return pd.concat(
        [ f(table_name) 
            for f
            in [ percentages_per_rating, number_of_reviews, average_reviews_per_product, average_reviews_per_reviewer ]], 
        axis=1)

In [78]:
metrics = stopwatch(lambda: pd.concat([ all_metrics(table) for table in table_names ]))


Books
CDs_and_Vinyl
Electronics
Movies_and_TV
Elapsed time: 146 sec

In [79]:
metrics.index.name = 'dataset_name'
metrics.to_csv('./metadata/large-datasets-evaluation-metrics.csv')
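
The exported file can be read back later with the dataset names restored as the index:

# round-trip check for the exported metrics file
pd.read_csv('./metadata/large-datasets-evaluation-metrics.csv', index_col='dataset_name')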

In [80]:
metrics


Out[80]:
                      1         2         3         4         5  number_of_reviews  reviews_per_product  reviews_per_reviewer
dataset_name
Books          0.036394  0.046652  0.107348  0.249841  0.559765            8898040            24.180639             14.739956
CDs and Vinyl  0.042088  0.042430  0.092770  0.224424  0.598288            1097592            17.031982             14.584390
Electronics    0.064365  0.048626  0.084216  0.205448  0.597344            1689188            26.812082              8.779427
Movies and TV  0.061394  0.060329  0.118585  0.225618  0.534074            1697533            33.915388             13.694200

The columns 1 to 5 hold the fraction of reviews with each star rating.