The larger data sets cannot be loaded into the memory of a local machine, so an alternative is to import them into a PostgreSQL table and query them from there. With the right indices the queries become fast enough. After the import we can extract some basic statistics with SQL and also export smaller portions of the data that can be handled by Spark or pandas on a local machine.
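To illustrate the last step, here is a minimal sketch of pulling a manageable slice of one of the tables created below into pandas. The table name video_games is only a placeholder, and the connection settings mirror the ones used later in this notebook:

import pandas as pd
import psycopg2 as pg

# Read a limited, server-side-filtered slice instead of the whole table.
connection = pg.connect(user='mariosk', database='amazon_reviews')
sample = pd.read_sql('select asin, overall, unixReviewTime from video_games limit 100000', connection)
connection.close()
sample.head()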
In [1]:
import timeit
def stopwatch(function):
    # Run the given zero-argument callable, print the wall-clock time, and return its result.
    start_time = timeit.default_timer()
    result = function()
    print('Elapsed time: %i sec' % int(timeit.default_timer() - start_time))
    return result
Unfortunately PostgreSQL does not support importing record (newline-delimited) JSON files directly, so we first need to convert the data sets to CSV. We use the command-line tool json2csv for this.
WARNING: The following commands will run for a while, especially the json2csv conversion. You can expect approximately 1 minute per GB of unzipped data.
In [196]:
start_time = timeit.default_timer()
!ls ./data/large-datasets/*.gz | grep -Po '.*(?=.gz)' | xargs -I {} gunzip {}.gz
print('Elapsed time: %i sec' % int(timeit.default_timer() - start_time))
In [2]:
start_time = timeit.default_timer()
!ls ./data/large-datasets/*.json | xargs sed -i 's/|/?/g;s/\u0000/?/g'
print('Elapsed time: %i sec' % int(timeit.default_timer() - start_time))
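The sed call above replaces two things that would otherwise break the import: literal '|' characters, which would collide with the '|' CSV delimiter chosen below, and NUL characters, which PostgreSQL cannot store in text columns. For reference, a rough pure-Python equivalent (it loads each file into memory, so for multi-GB inputs the streaming sed version above is preferable):

from pathlib import Path

# Replace the future CSV delimiter and NULs (both the JSON escape sequence and raw bytes) with '?'.
for path in Path('./data/large-datasets').glob('*.json'):
    text = path.read_text(encoding='utf-8', errors='replace')
    text = text.replace('|', '?').replace('\\u0000', '?').replace('\x00', '?')
    path.write_text(text, encoding='utf-8')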
In [3]:
start_time = timeit.default_timer()
!ls ./data/large-datasets/*.json | grep -Po '.*(?=.json)' | xargs -I {} json2csv -p -d '|' -k asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime -i {}.json -o {}.csv
!rm ./data/large-datasets/*.json
print('Elapsed time: %i sec' % int(timeit.default_timer() - start_time))
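If json2csv is not installed, roughly the same conversion can be sketched with the standard library; this streams one record per line and writes the same columns with the same '|' delimiter, but it is not guaranteed to reproduce the tool's exact output format:

import csv
import json
from pathlib import Path

COLUMNS = ['asin', 'helpful', 'overall', 'reviewText', 'reviewTime',
           'reviewerID', 'reviewerName', 'summary', 'unixReviewTime']

def records_to_csv(json_path, csv_path):
    # Each input line is one JSON record; emit it as one '|'-delimited CSV row.
    with open(json_path, encoding='utf-8') as source, \
         open(csv_path, 'w', newline='', encoding='utf-8') as target:
        writer = csv.DictWriter(target, fieldnames=COLUMNS, delimiter='|', extrasaction='ignore')
        writer.writeheader()
        for line in source:
            writer.writerow(json.loads(line))

for path in Path('./data/large-datasets').glob('*.json'):
    records_to_csv(path, path.with_suffix('.csv'))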
In [4]:
import psycopg2 as pg
import pandas as pd
db_conf = {
    'user': 'mariosk',
    'database': 'amazon_reviews'
}
connection_factory = lambda: pg.connect(user=db_conf['user'], database=db_conf['database'])
def transaction(*statements):
    # Execute the given statements in one transaction; if any of them fails the commit
    # is skipped and the connection is closed, discarding the changes.
    connection = None
    try:
        connection = connection_factory()
        cursor = connection.cursor()
        for statement in statements:
            cursor.execute(statement)
        connection.commit()
        cursor.close()
    except pg.DatabaseError as error:
        print(error)
    finally:
        if connection is not None:
            connection.close()
def query(statement):
    # Run a read-only statement and return the result as a pandas DataFrame.
    connection = None
    try:
        connection = connection_factory()
        cursor = connection.cursor()
        cursor.execute(statement)
        header = [ description[0] for description in cursor.description ]
        rows = cursor.fetchall()
        cursor.close()
        return pd.DataFrame.from_records(rows, columns=header)
    except (Exception, pg.DatabaseError) as error:
        print(error)
        return None
    finally:
        if connection is not None:
            connection.close()
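Both helpers open a fresh connection per call, which keeps them simple at the cost of some connection overhead. A quick smoke test of the two (the table name is throwaway and purely illustrative):

# transaction() is for DDL/DML, query() returns a pandas DataFrame.
transaction('create table if not exists smoke_test (id int);',
            'drop table smoke_test;')
query('select 1 as ok')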
In [5]:
import os
import re

table_names = [ re.search('reviews_(.*)_5.csv', filename).group(1)
                for filename
                in sorted(os.listdir('./data/large-datasets'))
                if not filename.endswith('json') ]
In [6]:
def create_table(table_name):
    # One table per data set, with indices on the columns used for filtering and grouping.
    transaction(
        'create table %s (asin text, helpful text, overall double precision, reviewText text, reviewTime text, reviewerID text, reviewerName text, summary text, unixReviewTime int);' % table_name,
        'create index {0}_asin ON {0} (asin);'.format(table_name),
        'create index {0}_overall ON {0} (overall);'.format(table_name),
        'create index {0}_reviewerID ON {0} (reviewerID);'.format(table_name),
        'create index {0}_unixReviewTime ON {0} (unixReviewTime);'.format(table_name))

for table_name in table_names:
    create_table(table_name)
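As an optional sanity check, the system view pg_indexes can confirm that the tables and indices exist (the identifier names come back lower-cased because they are unquoted above):

# List every index created in the default schema.
query("""
    select tablename, indexname
    from pg_indexes
    where schemaname = 'public'
    order by tablename, indexname
""")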
In [7]:
start_time = timeit.default_timer()
!ls ./data/large-datasets | grep -Po '(?<=reviews_).*(?=_5.csv)' | xargs -I {} psql -U mariosk -d amazon_reviews -c "\copy {} from './data/large-datasets/reviews_{}_5.csv' with (format csv, delimiter '|', header true);"
print('Elapsed time: %i sec' % int(timeit.default_timer() - start_time))
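Another optional sanity check after the import is to confirm that every table actually received rows:

# One row count per table; reuses the query() helper and table_names from above.
pd.concat([
    query('select count(*) as row_count from {0}'.format(table_name))
        .rename(index={0: table_name})
    for table_name in table_names
])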
In [67]:
def average_reviews_per_product(table_name):
    return (query('''
        with distinct_products as (select count(distinct asin) as products from {0}),
             reviews_count as (select cast(count(*) as double precision) as reviews from {0})
        select reviews / products as reviews_per_product
        from distinct_products cross join reviews_count
        '''.format(table_name))
        .rename(index={0: table_name.replace('_', ' ')}))
In [68]:
def average_reviews_per_reviewer(table_name):
    return (query('''
        with distinct_reviewers as (select count(distinct reviewerID) as reviewers from {0}),
             reviews_count as (select cast(count(*) as double precision) as reviews from {0})
        select reviews / reviewers as reviews_per_reviewer
        from distinct_reviewers cross join reviews_count
        '''.format(table_name))
        .rename(index={ 0: table_name.replace('_', ' ')}))
In [69]:
def percentages_per_rating(table_name):
    return (query('''
        with rating_counts as (select overall, count(overall) as rating_count from {0} group by overall),
             reviews_count as (select cast(count(*) as double precision) as reviews from {0})
        select cast(overall as int) as dataset_name, rating_count / reviews as row
        from rating_counts cross join reviews_count
        '''.format(table_name))
        .set_index('dataset_name')
        .sort_index()
        .transpose()
        .rename(index={'row': table_name.replace('_', ' ')}))
In [75]:
def number_of_reviews(table_name):
    return (query('''
        select count(*) as number_of_reviews from {0}
        '''.format(table_name))
        .rename(index={ 0: table_name.replace('_', ' ') }))
In [76]:
def all_metrics(table_name):
    # Print the table name as a progress indicator, then collect all metrics side by side.
    print(table_name)
    return pd.concat(
        [ f(table_name)
          for f
          in [ percentages_per_rating, number_of_reviews, average_reviews_per_product, average_reviews_per_reviewer ]],
        axis=1)
In [78]:
metrics = stopwatch(lambda: pd.concat([ all_metrics(table) for table in table_names ]))
In [79]:
metrics.index.name = 'dataset_name'
metrics.to_csv('./metadata/large-datasets-evaluation-metrics.csv')
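The exported CSV can later be reloaded without touching the database, for example in another notebook on a local machine:

# 'dataset_name' was set as the index name above before writing the file.
pd.read_csv('./metadata/large-datasets-evaluation-metrics.csv', index_col='dataset_name')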
In [80]:
metrics
Out[80]: