Data Cleaning

Notebook for developing the cleaning process for our dataset.


Setup


In [1]:
import pyspark
import json
import pandas as pd
import numpy as np
import amzn_reviews_cleaner_funcs as amzn
from pyspark.sql import SparkSession

%load_ext autoreload
%autoreload 2


Load Data


In [2]:
# create spark session (builder pattern avoids relying on a pre-existing SparkContext `sc`)
spark = SparkSession.builder.getOrCreate()

In [3]:
# get dataframe
# specify S3 as the source with the s3a:// scheme
df = spark.read.json("s3a://amazon-review-data/reviews_Musical_Instruments_5.json.gz")
df.show(3)


+----------+--------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
|      asin| helpful|overall|          reviewText| reviewTime|    reviewerID|        reviewerName|             summary|unixReviewTime|
+----------+--------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
|1384719342|  [0, 0]|    5.0|Not much to write...|02 28, 2014|A2IBPI20UZIR0U|cassandra tu "Yea...|                good|    1393545600|
|1384719342|[13, 14]|    5.0|The product does ...|03 16, 2013|A14VAT5EAX3D9S|                Jake|                Jake|    1363392000|
|1384719342|  [1, 1]|    5.0|The primary job o...|08 28, 2013|A195EZSQDW3E21|Rick Bennette "Ri...|It Does The Job Well|    1377648000|
+----------+--------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
only showing top 3 rows

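The helpful column comes in as an array and reviewTime as a string, so it is worth confirming what Spark inferred before picking columns; printSchema is a standard DataFrame method:

In [ ]:
# optional: inspect the schema Spark inferred from the JSON
df.printSchema()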

Clean Data


In [27]:
df_text = df.select("asin", "reviewerID", "overall", "reviewText")
df_text.show(3)


+----------+--------------+-------+--------------------+
|      asin|    reviewerID|overall|          reviewText|
+----------+--------------+-------+--------------------+
|1384719342|A2IBPI20UZIR0U|    5.0|Not much to write...|
|1384719342|A14VAT5EAX3D9S|    5.0|The product does ...|
|1384719342|A195EZSQDW3E21|    5.0|The primary job o...|
+----------+--------------+-------+--------------------+
only showing top 3 rows

Import MLlib classes


In [24]:
from pyspark.ml.feature import Tokenizer, CountVectorizer, StopWordsRemover, NGram, IDF
from nltk.corpus import stopwords  # requires nltk.download("stopwords") on first use

Tokenize docs


In [28]:
tokenizer = Tokenizer(inputCol="reviewText", outputCol="raw_tokens")
df_raw_tokens = tokenizer.transform(df_text)

df_raw_tokens.show(3)


+----------+--------------+-------+--------------------+--------------------+
|      asin|    reviewerID|overall|          reviewText|          raw_tokens|
+----------+--------------+-------+--------------------+--------------------+
|1384719342|A2IBPI20UZIR0U|    5.0|Not much to write...|[not, much, to, w...|
|1384719342|A14VAT5EAX3D9S|    5.0|The product does ...|[the, product, do...|
|1384719342|A195EZSQDW3E21|    5.0|The primary job o...|[the, primary, jo...|
+----------+--------------+-------+--------------------+--------------------+
only showing top 3 rows

Remove stop words


In [29]:
remover = StopWordsRemover(inputCol="raw_tokens", outputCol="tokens", stopWords=stopwords.words("english"))
df_tokens = remover.transform(df_raw_tokens)

df_tokens.show(3)


+----------+--------------+-------+--------------------+--------------------+--------------------+
|      asin|    reviewerID|overall|          reviewText|          raw_tokens|              tokens|
+----------+--------------+-------+--------------------+--------------------+--------------------+
|1384719342|A2IBPI20UZIR0U|    5.0|Not much to write...|[not, much, to, w...|[much, write, her...|
|1384719342|A14VAT5EAX3D9S|    5.0|The product does ...|[the, product, do...|[product, exactly...|
|1384719342|A195EZSQDW3E21|    5.0|The primary job o...|[the, primary, jo...|[primary, job, de...|
+----------+--------------+-------+--------------------+--------------------+--------------------+
only showing top 3 rows

Create TF vectors


In [30]:
cv = CountVectorizer(inputCol="tokens", outputCol="tf_vectors")
tf_model = cv.fit(df_tokens)
df_tf = tf_model.transform(df_tokens)

df_tf.show(3)


+----------+--------------+-------+--------------------+--------------------+--------------------+--------------------+
|      asin|    reviewerID|overall|          reviewText|          raw_tokens|              tokens|          tf_vectors|
+----------+--------------+-------+--------------------+--------------------+--------------------+--------------------+
|1384719342|A2IBPI20UZIR0U|    5.0|Not much to write...|[not, much, to, w...|[much, write, her...|(51989,[3,4,14,18...|
|1384719342|A14VAT5EAX3D9S|    5.0|The product does ...|[the, product, do...|[product, exactly...|(51989,[2,3,14,20...|
|1384719342|A195EZSQDW3E21|    5.0|The primary job o...|[the, primary, jo...|[primary, job, de...|(51989,[10,13,24,...|
+----------+--------------+-------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows
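The leading 51989 in each sparse vector is the fitted vocabulary size. If a nearly 52,000-term vocabulary is too large downstream, CountVectorizer's vocabSize and minDF parameters can cap it; a minimal sketch (the cutoff values are illustrative, not tuned):

In [ ]:
# optional: cap the vocabulary and drop terms appearing in fewer than 5 reviews
# (vocabSize / minDF values are illustrative)
cv_small = CountVectorizer(inputCol="tokens", outputCol="tf_vectors",
                           vocabSize=20000, minDF=5.0)
tf_model_small = cv_small.fit(df_tokens)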

Get vocabulary


In [21]:
vocab = tf_model.vocabulary

vocab[:10]


Out[21]:
[u'',
 u'guitar',
 u'like',
 u'one',
 u"it's",
 u'use',
 u'good',
 u'great',
 u'sound',
 u'get']
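The top of the vocabulary shows two artifacts of whitespace-only tokenization: an empty-string token and punctuation glued to words (u"it's" here, u'it,' and u'to.' further down). Tokenizer just lowercases and splits on whitespace; RegexTokenizer (also in pyspark.ml.feature) can split on non-word characters instead. A sketch, with an illustrative pattern that would need tuning for apostrophes:

In [ ]:
from pyspark.ml.feature import RegexTokenizer

# split on runs of non-word characters so punctuation is dropped
# (pattern is illustrative; note "it's" becomes ["it", "s"] under \W+)
regex_tokenizer = RegexTokenizer(inputCol="reviewText", outputCol="raw_tokens",
                                 pattern="\\W+")
df_raw_tokens_alt = regex_tokenizer.transform(df_text)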

Create IDF vectors


In [31]:
idf = IDF(inputCol="tf_vectors", outputCol="tfidf_vectors")
idf_model = idf.fit(df_tf)
df_idf = idf_model.transform(df_tf)

df_idf.select("asin", "tf_vectors", "tfidf_vectors").show(3)


+----------+--------------------+--------------------+
|      asin|          tf_vectors|       tfidf_vectors|
+----------+--------------------+--------------------+
|1384719342|(51989,[3,4,14,18...|(51989,[3,4,14,18...|
|1384719342|(51989,[2,3,14,20...|(51989,[2,3,14,20...|
|1384719342|(51989,[10,13,24,...|(51989,[10,13,24,...|
+----------+--------------------+--------------------+
only showing top 3 rows
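For reference, MLlib's IDF uses the smoothed formula idf(t) = log((N + 1) / (df(t) + 1)), where N is the number of documents and df(t) is the number containing term t; the tfidf_vectors are the tf_vectors rescaled term by term, so near-ubiquitous tokens are pushed toward zero weight.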

Map most important elements from a product's tfidf_vector to the corresponding terms


In [32]:
test_row = df_idf.first()

In [35]:
test_row["tf_vectors"]


Out[35]:
SparseVector(51989, {3: 1.0, 4: 1.0, 14: 1.0, 18: 2.0, 36: 1.0, 41: 1.0, 101: 1.0, 146: 1.0, 246: 1.0, 250: 1.0, 531: 1.0, 540: 2.0, 710: 1.0, 1329: 1.0, 1352: 1.0, 1387: 1.0, 1467: 1.0, 1776: 1.0, 1781: 1.0, 1907: 1.0, 2543: 1.0, 2562: 2.0, 4627: 1.0, 11514: 1.0})
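A SparseVector prints as (size, {index: value}): 51989 is the vocabulary size, and each index is a position in tf_model.vocabulary, so index 3 here corresponds to vocab[3] (u'one' in the listing above).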

In [48]:
test_tf_vect = test_row["tf_vectors"]

In [50]:
# map each nonzero index in the tf vector back to its vocabulary term
row_terms = [vocab[i] for i in test_tf_vect.indices]

In [51]:
row_terms


Out[51]:
[u'one',
 u"it's",
 u'well',
 u'much',
 u'buy',
 u'work',
 u'it,',
 u'might',
 u'amazon',
 u'exactly',
 u'supposed',
 u'pop',
 u'to.',
 u'honestly',
 u'sounds.',
 u'recordings',
 u'despite',
 u'here,',
 u'write',
 u'prices',
 u'lowest',
 u'filters',
 u'crisp.',
 u'pricing,']

In [53]:
test_row["reviewText"]


Out[53]:
u"Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,"
