Data Cleaning

Notebook for testing the cleaning process for the Amazon review dataset.


Setup


In [1]:
import pandas as pd
import numpy as np
import amzn_reviews_cleaner_funcs as amzn
from pyspark.sql import SparkSession

%load_ext autoreload
%autoreload 2


Load Data


In [2]:
# create a Spark session from the existing SparkContext (`sc` is provided by the PySpark kernel)
spark = SparkSession(sc)

In [3]:
# get dataframe
# specify S3 as the source with the s3a:// scheme
df = spark.read.json("s3a://amazon-review-data/reviews_Musical_Instruments_5.json.gz")
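
Reading with the s3a:// scheme assumes the hadoop-aws connector is on the classpath and that AWS credentials are available. A minimal, hedged sketch of one common way to supply them at runtime (the key names are standard Hadoop s3a settings; the values are placeholders):

# assumption: credentials are set on the underlying Hadoop configuration
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
hadoop_conf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")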


Test helper module

Add TF-IDF vectors


In [4]:
df_tfidf, vocab = amzn.add_tfidf(df)

df_tfidf.select("idf_vector").show(3)


+--------------------+
|          idf_vector|
+--------------------+
|(21502,[0,2,8,10,...|
|(21502,[0,2,4,10,...|
|(21502,[0,7,13,17...|
+--------------------+
only showing top 3 rows
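
For reference, a minimal sketch of what a helper like add_tfidf could look like with Spark ML. The real implementation in amzn_reviews_cleaner_funcs appears to run the whole clean/tokenize pipeline internally (see the schema below); this sketch covers only the TF-IDF step and assumes a "tokens" column of string arrays already exists:

from pyspark.ml.feature import CountVectorizer, IDF

def add_tfidf_sketch(df_tokens):
    # term frequencies over the corpus vocabulary
    cv_model = CountVectorizer(inputCol="tokens", outputCol="tf_vector").fit(df_tokens)
    df_tf = cv_model.transform(df_tokens)

    # inverse document frequency weighting
    df_idf = IDF(inputCol="tf_vector", outputCol="idf_vector").fit(df_tf).transform(df_tf)

    # the vocabulary lets us map vector indices back to terms later
    return df_idf, cv_model.vocabulary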

Test extracting top n features


In [5]:
df_features = amzn.add_top_features(df_tfidf, vocab)

In [27]:
df_features.select("top_features").show(3)


+--------------------+
|        top_features|
+--------------------+
|[supposed, record...|
|[nose, candy, car...|
|[pops, allowing, ...|
+--------------------+
only showing top 3 rows
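
A hedged sketch of how add_top_features might map the highest-weighted idf_vector entries back to vocabulary terms (the row analysis below suggests the real helper stores the result as a single string; this sketch returns a list for clarity):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def add_top_features_sketch(df_idf, vocab, n=10):
    def top_terms(v):
        # v is a SparseVector; indices and values are parallel arrays
        order = sorted(range(len(v.values)), key=lambda i: v.values[i], reverse=True)[:n]
        return [vocab[int(v.indices[i])] for i in order]

    top_udf = udf(top_terms, ArrayType(StringType()))
    return df_idf.withColumn("top_features", top_udf("idf_vector"))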

Test clean_reviewText()


In [10]:
df_clean = amzn.clean_reviewText(df)
df_clean.select("cleanText").show(3)


+--------------------+
|           cleanText|
+--------------------+
|Not much to write...|
|The product does ...|
|The primary job o...|
+--------------------+
only showing top 3 rows
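
One plausible implementation of clean_reviewText, assuming the goal is to strip non-letter characters and collapse whitespace before tokenizing (the actual helper may clean differently):

from pyspark.sql.functions import regexp_replace, trim

def clean_reviewText_sketch(df):
    # drop everything except letters and whitespace, then collapse runs of whitespace
    cleaned = regexp_replace("reviewText", r"[^a-zA-Z\s]", "")
    cleaned = regexp_replace(cleaned, r"\s+", " ")
    return df.withColumn("cleanText", trim(cleaned))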

Test removal of empty tokens


In [15]:
# clean
df_clean = amzn.clean_reviewText(df)

# tokenize
df_raw_tokens = amzn.tokenize(df_clean)

In [7]:
df_raw_tokens.select("raw_tokens").show(3)


+--------------------+
|          raw_tokens|
+--------------------+
|[not, much, to, w...|
|[the, product, do...|
|[the, primary, jo...|
+--------------------+
only showing top 3 rows
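
A hedged sketch of the tokenize step plus the empty-token removal being tested here, assuming Spark ML's Tokenizer produces raw_tokens and a small UDF filters out empty strings:

from pyspark.ml.feature import Tokenizer
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def tokenize_sketch(df_clean):
    # lowercase and split cleanText on whitespace
    return Tokenizer(inputCol="cleanText", outputCol="raw_tokens").transform(df_clean)

# drop empty tokens left behind by the cleaning step
drop_empty = udf(lambda toks: [t for t in toks if t], ArrayType(StringType()))
df_tokens = df_raw_tokens.withColumn("tokens", drop_empty("raw_tokens"))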


In [33]:
df_tfidf.printSchema()


root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)
 |-- cleanText: string (nullable = true)
 |-- raw_tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- tf_vector: vector (nullable = true)
 |-- idf_vector: vector (nullable = true)


Analyze a single row


In [11]:
df_features.show(1)


+----------+-------+-------+--------------------+-----------+--------------+--------------------+-------+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|      asin|helpful|overall|          reviewText| reviewTime|    reviewerID|        reviewerName|summary|unixReviewTime|           cleanText|          raw_tokens|              tokens|           tf_vector|          idf_vector|        top_features|
+----------+-------+-------+--------------------+-----------+--------------+--------------------+-------+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|1384719342| [0, 0]|    5.0|Not much to write...|02 28, 2014|A2IBPI20UZIR0U|cassandra tu "Yea...|   good|    1393545600|Not much to write...|[not, much, to, w...|[much, write, , e...|(21502,[0,2,8,10,...|(21502,[0,2,8,10,...|[supposed, record...|
+----------+-------+-------+--------------------+-----------+--------------+--------------------+-------+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 1 row


In [12]:
test_row = df_features.first()

In [13]:
test_row["top_features"]


Out[13]:
u'[supposed, recordings, honestly, crisp, despite, prices, write, filters, lowest, pricing]'
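
Note that top_features comes back as a single string rather than an array. If downstream code needs the individual terms, one quick (hedged) way to recover them from the '[a, b, c]' format shown above:

# split the stringified list back into individual terms
top_terms = [t.strip() for t in test_row["top_features"].strip("[]").split(",")]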
