Data Cleaning

Notebook for testing the cleaning process for the Amazon review dataset.


Setup


In [1]:
import pandas as pd
import numpy as np
import amzn_reviews_cleaner_funcs as amzn
from pyspark.sql import SparkSession

%load_ext autoreload
%autoreload 2


Load Data


In [2]:
# create a Spark session from the existing SparkContext (`sc` is provided by the PySpark kernel)
spark = SparkSession(sc)

In [3]:
# get dataframe
# specify S3 as the source with the s3a:// scheme
df = spark.read.json("s3a://amazon-review-data/reviews_Musical_Instruments_5.json.gz")
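
Reading with the s3a:// scheme assumes the hadoop-aws connector is on the classpath and that AWS credentials are available. A minimal, hedged sketch of one common way to supply them at runtime (the key names are standard Hadoop s3a settings; the values are placeholders):

# assumption: credentials are set on the underlying Hadoop configuration
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
hadoop_conf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")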


Test helper module

Add TF-IDF vectors


In [4]:
df_tfidf, vocab = amzn.add_tfidf(df)

df_tfidf.select("idf_vector").show(3)


+--------------------+
|          idf_vector|
+--------------------+
|(21502,[0,2,8,10,...|
|(21502,[0,2,4,10,...|
|(21502,[0,7,13,17...|
+--------------------+
only showing top 3 rows
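
For reference, a minimal sketch of what a helper like add_tfidf could look like with Spark ML. The real implementation in amzn_reviews_cleaner_funcs appears to run the whole clean/tokenize pipeline internally (see the schema below); this sketch covers only the TF-IDF step and assumes a "tokens" column of string arrays already exists:

from pyspark.ml.feature import CountVectorizer, IDF

def add_tfidf_sketch(df_tokens):
    # term frequencies over the corpus vocabulary
    cv_model = CountVectorizer(inputCol="tokens", outputCol="tf_vector").fit(df_tokens)
    df_tf = cv_model.transform(df_tokens)

    # inverse document frequency weighting
    df_idf = IDF(inputCol="tf_vector", outputCol="idf_vector").fit(df_tf).transform(df_tf)

    # the vocabulary lets us map vector indices back to terms later
    return df_idf, cv_model.vocabulary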

Test extracting top n features


In [5]:
df_features = amzn.add_top_features(df_tfidf, vocab)

In [27]:
df_features.select("top_features").show(3)


+--------------------+
|        top_features|
+--------------------+
|[supposed, record...|
|[nose, candy, car...|
|[pops, allowing, ...|
+--------------------+
only showing top 3 rows
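
A hedged sketch of how add_top_features might map the highest-weighted idf_vector entries back to vocabulary terms (the row analysis below suggests the real helper stores the result as a single string; this sketch returns a list for clarity):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def add_top_features_sketch(df_idf, vocab, n=10):
    def top_terms(v):
        # v is a SparseVector; indices and values are parallel arrays
        order = sorted(range(len(v.values)), key=lambda i: v.values[i], reverse=True)[:n]
        return [vocab[int(v.indices[i])] for i in order]

    top_udf = udf(top_terms, ArrayType(StringType()))
    return df_idf.withColumn("top_features", top_udf("idf_vector"))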

Test clean_reviewText()


In [10]:
df_clean = amzn.clean_reviewText(df)
df_clean.select("cleanText").show(3)


+--------------------+
|           cleanText|
+--------------------+
|Not much to write...|
|The product does ...|
|The primary job o...|
+--------------------+
only showing top 3 rows
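
One plausible implementation of clean_reviewText, assuming the goal is to strip non-letter characters and collapse whitespace before tokenizing (the actual helper may clean differently):

from pyspark.sql.functions import regexp_replace, trim

def clean_reviewText_sketch(df):
    # drop everything except letters and whitespace, then collapse runs of whitespace
    cleaned = regexp_replace("reviewText", r"[^a-zA-Z\s]", "")
    cleaned = regexp_replace(cleaned, r"\s+", " ")
    return df.withColumn("cleanText", trim(cleaned))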

Test removal of empty tokens


In [15]:
# clean
df_clean = amzn.clean_reviewText(df)

# tokenize
df_raw_tokens = amzn.tokenize(df_clean)

In [7]:
df_raw_tokens.select("raw_tokens").show(3)


+--------------------+
|          raw_tokens|
+--------------------+
|[not, much, to, w...|
|[the, product, do...|
|[the, primary, jo...|
+--------------------+
only showing top 3 rows
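
A hedged sketch of the tokenize step plus the empty-token removal being tested here, assuming Spark ML's Tokenizer produces raw_tokens and a small UDF filters out empty strings:

from pyspark.ml.feature import Tokenizer
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def tokenize_sketch(df_clean):
    # lowercase and split cleanText on whitespace
    return Tokenizer(inputCol="cleanText", outputCol="raw_tokens").transform(df_clean)

# drop empty tokens left behind by the cleaning step
drop_empty = udf(lambda toks: [t for t in toks if t], ArrayType(StringType()))
df_tokens = df_raw_tokens.withColumn("tokens", drop_empty("raw_tokens"))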


In [33]:
df_tfidf.printSchema()


root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)
 |-- cleanText: string (nullable = true)
 |-- raw_tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- tf_vector: vector (nullable = true)
 |-- idf_vector: vector (nullable = true)


Analyze a single row


In [11]:
df_features.show(1)


+----------+-------+-------+--------------------+-----------+--------------+--------------------+-------+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|      asin|helpful|overall|          reviewText| reviewTime|    reviewerID|        reviewerName|summary|unixReviewTime|           cleanText|          raw_tokens|              tokens|           tf_vector|          idf_vector|        top_features|
+----------+-------+-------+--------------------+-----------+--------------+--------------------+-------+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|1384719342| [0, 0]|    5.0|Not much to write...|02 28, 2014|A2IBPI20UZIR0U|cassandra tu "Yea...|   good|    1393545600|Not much to write...|[not, much, to, w...|[much, write, , e...|(21502,[0,2,8,10,...|(21502,[0,2,8,10,...|[supposed, record...|
+----------+-------+-------+--------------------+-----------+--------------+--------------------+-------+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 1 row


In [12]:
test_row = df_features.first()

In [13]:
test_row["top_features"]


Out[13]:
u'[supposed, recordings, honestly, crisp, despite, prices, write, filters, lowest, pricing]'
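
Note that top_features comes back as a single string rather than an array. If downstream code needs the individual terms, one quick (hedged) way to recover them from the '[a, b, c]' format shown above:

# split the stringified list back into individual terms
top_terms = [t.strip() for t in test_row["top_features"].strip("[]").split(",")]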
