Data Cleaning

Notebook for developing the cleaning process for our dataset.


Setup


In [1]:
import pyspark
import json
import pandas as pd
import numpy as np
import amzn_reviews_cleaner_funcs as amzn
from pyspark.sql import SparkSession

%load_ext autoreload
%autoreload 2


Load Data


In [2]:
# create spark session (builder pattern avoids relying on a pre-existing SparkContext `sc`)
spark = SparkSession.builder.getOrCreate()

In [3]:
# get dataframe
# specify S3 as the source with the s3a:// scheme
df = spark.read.json("s3a://amazon-review-data/reviews_Musical_Instruments_5.json.gz")
df.show(3)


+----------+--------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
|      asin| helpful|overall|          reviewText| reviewTime|    reviewerID|        reviewerName|             summary|unixReviewTime|
+----------+--------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
|1384719342|  [0, 0]|    5.0|Not much to write...|02 28, 2014|A2IBPI20UZIR0U|cassandra tu "Yea...|                good|    1393545600|
|1384719342|[13, 14]|    5.0|The product does ...|03 16, 2013|A14VAT5EAX3D9S|                Jake|                Jake|    1363392000|
|1384719342|  [1, 1]|    5.0|The primary job o...|08 28, 2013|A195EZSQDW3E21|Rick Bennette "Ri...|It Does The Job Well|    1377648000|
+----------+--------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
only showing top 3 rows

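The helpful column comes in as an array and reviewTime as a string, so it is worth confirming what Spark inferred before picking columns; printSchema is a standard DataFrame method:

In [ ]:
# optional: inspect the schema Spark inferred from the JSON
df.printSchema()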

Clean Data


In [27]:
df_text = df.select("asin", "reviewerID", "overall", "reviewText")
df_text.show(3)


+----------+--------------+-------+--------------------+
|      asin|    reviewerID|overall|          reviewText|
+----------+--------------+-------+--------------------+
|1384719342|A2IBPI20UZIR0U|    5.0|Not much to write...|
|1384719342|A14VAT5EAX3D9S|    5.0|The product does ...|
|1384719342|A195EZSQDW3E21|    5.0|The primary job o...|
+----------+--------------+-------+--------------------+
only showing top 3 rows

Import MLlib classes


In [24]:
from pyspark.ml.feature import Tokenizer, CountVectorizer, StopWordsRemover, NGram, IDF
from nltk.corpus import stopwords  # requires nltk.download("stopwords") on first use

Tokenize docs


In [28]:
tokenizer = Tokenizer(inputCol="reviewText", outputCol="raw_tokens")
df_raw_tokens = tokenizer.transform(df_text)

df_raw_tokens.show(3)


+----------+--------------+-------+--------------------+--------------------+
|      asin|    reviewerID|overall|          reviewText|          raw_tokens|
+----------+--------------+-------+--------------------+--------------------+
|1384719342|A2IBPI20UZIR0U|    5.0|Not much to write...|[not, much, to, w...|
|1384719342|A14VAT5EAX3D9S|    5.0|The product does ...|[the, product, do...|
|1384719342|A195EZSQDW3E21|    5.0|The primary job o...|[the, primary, jo...|
+----------+--------------+-------+--------------------+--------------------+
only showing top 3 rows

Remove stop words


In [29]:
remover = StopWordsRemover(inputCol="raw_tokens", outputCol="tokens", stopWords=stopwords.words("english"))
df_tokens = remover.transform(df_raw_tokens)

df_tokens.show(3)


+----------+--------------+-------+--------------------+--------------------+--------------------+
|      asin|    reviewerID|overall|          reviewText|          raw_tokens|              tokens|
+----------+--------------+-------+--------------------+--------------------+--------------------+
|1384719342|A2IBPI20UZIR0U|    5.0|Not much to write...|[not, much, to, w...|[much, write, her...|
|1384719342|A14VAT5EAX3D9S|    5.0|The product does ...|[the, product, do...|[product, exactly...|
|1384719342|A195EZSQDW3E21|    5.0|The primary job o...|[the, primary, jo...|[primary, job, de...|
+----------+--------------+-------+--------------------+--------------------+--------------------+
only showing top 3 rows

Create TF vectors


In [30]:
cv = CountVectorizer(inputCol="tokens", outputCol="tf_vectors")
tf_model = cv.fit(df_tokens)
df_tf = tf_model.transform(df_tokens)

df_tf.show(3)


+----------+--------------+-------+--------------------+--------------------+--------------------+--------------------+
|      asin|    reviewerID|overall|          reviewText|          raw_tokens|              tokens|          tf_vectors|
+----------+--------------+-------+--------------------+--------------------+--------------------+--------------------+
|1384719342|A2IBPI20UZIR0U|    5.0|Not much to write...|[not, much, to, w...|[much, write, her...|(51989,[3,4,14,18...|
|1384719342|A14VAT5EAX3D9S|    5.0|The product does ...|[the, product, do...|[product, exactly...|(51989,[2,3,14,20...|
|1384719342|A195EZSQDW3E21|    5.0|The primary job o...|[the, primary, jo...|[primary, job, de...|(51989,[10,13,24,...|
+----------+--------------+-------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows
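The leading 51989 in each sparse vector is the fitted vocabulary size. If a nearly 52,000-term vocabulary is too large downstream, CountVectorizer's vocabSize and minDF parameters can cap it; a minimal sketch (the cutoff values are illustrative, not tuned):

In [ ]:
# optional: cap the vocabulary and drop terms appearing in fewer than 5 reviews
# (vocabSize / minDF values are illustrative)
cv_small = CountVectorizer(inputCol="tokens", outputCol="tf_vectors",
                           vocabSize=20000, minDF=5.0)
tf_model_small = cv_small.fit(df_tokens)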

Get vocabulary


In [21]:
vocab = tf_model.vocabulary

vocab[:10]


Out[21]:
[u'',
 u'guitar',
 u'like',
 u'one',
 u"it's",
 u'use',
 u'good',
 u'great',
 u'sound',
 u'get']
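The top of the vocabulary shows two artifacts of whitespace-only tokenization: an empty-string token and punctuation glued to words (u"it's" here, u'it,' and u'to.' further down). Tokenizer just lowercases and splits on whitespace; RegexTokenizer (also in pyspark.ml.feature) can split on non-word characters instead. A sketch, with an illustrative pattern that would need tuning for apostrophes:

In [ ]:
from pyspark.ml.feature import RegexTokenizer

# split on runs of non-word characters so punctuation is dropped
# (pattern is illustrative; note "it's" becomes ["it", "s"] under \W+)
regex_tokenizer = RegexTokenizer(inputCol="reviewText", outputCol="raw_tokens",
                                 pattern="\\W+")
df_raw_tokens_alt = regex_tokenizer.transform(df_text)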

Create IDF vectors


In [31]:
idf = IDF(inputCol="tf_vectors", outputCol="tfidf_vectors")
idf_model = idf.fit(df_tf)
df_idf = idf_model.transform(df_tf)

df_idf.select("asin", "tf_vectors", "tfidf_vectors").show(3)


+----------+--------------------+--------------------+
|      asin|          tf_vectors|       tfidf_vectors|
+----------+--------------------+--------------------+
|1384719342|(51989,[3,4,14,18...|(51989,[3,4,14,18...|
|1384719342|(51989,[2,3,14,20...|(51989,[2,3,14,20...|
|1384719342|(51989,[10,13,24,...|(51989,[10,13,24,...|
+----------+--------------------+--------------------+
only showing top 3 rows
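For reference, MLlib's IDF uses the smoothed formula idf(t) = log((N + 1) / (df(t) + 1)), where N is the number of documents and df(t) is the number containing term t; the tfidf_vectors are the tf_vectors rescaled term by term, so near-ubiquitous tokens are pushed toward zero weight.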

Map most important elements from a product's tfidf_vector to the corresponding terms


In [32]:
test_row = df_idf.first()

In [35]:
test_row["tf_vectors"]


Out[35]:
SparseVector(51989, {3: 1.0, 4: 1.0, 14: 1.0, 18: 2.0, 36: 1.0, 41: 1.0, 101: 1.0, 146: 1.0, 246: 1.0, 250: 1.0, 531: 1.0, 540: 2.0, 710: 1.0, 1329: 1.0, 1352: 1.0, 1387: 1.0, 1467: 1.0, 1776: 1.0, 1781: 1.0, 1907: 1.0, 2543: 1.0, 2562: 2.0, 4627: 1.0, 11514: 1.0})
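A SparseVector prints as (size, {index: value}): 51989 is the vocabulary size, and each index is a position in tf_model.vocabulary, so index 3 here corresponds to vocab[3] (u'one' in the listing above).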

In [48]:
test_tf_vect = test_row["tf_vectors"]

In [50]:
# map each nonzero index in the tf vector back to its vocabulary term
row_terms = [vocab[i] for i in test_tf_vect.indices]

In [51]:
row_terms


Out[51]:
[u'one',
 u"it's",
 u'well',
 u'much',
 u'buy',
 u'work',
 u'it,',
 u'might',
 u'amazon',
 u'exactly',
 u'supposed',
 u'pop',
 u'to.',
 u'honestly',
 u'sounds.',
 u'recordings',
 u'despite',
 u'here,',
 u'write',
 u'prices',
 u'lowest',
 u'filters',
 u'crisp.',
 u'pricing,']

In [53]:
test_row["reviewText"]


Out[53]:
u"Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,"
