Summarize the reviews

The idea of this solution is to provide a new feature that reduces the customer's need to read through many reviews in order to evaluate a product. To achieve that, we attempt to extract the most predictive words or sentences from the review summaries, using the star ratings as labels, and present them in a compact format (e.g. a wordcloud).

Implementation steps of a proof of concept

  • Extract the summaries and split them into words
  • Keep only the reviews with ratings 1 and 2 (labeled as 0) and 5 (labeled as 1)
  • Generate tf-idf vector features from the words
  • Train a binary logistic regression model which predicts the rankings from the vector features
  • Using this model, evaluate each word by generating the features for it as if it were a whole summary
  • Order the words by the probability the model assigns to the '0' or '1' category
  • Select the words with highest probability to be '1' as the positive ones
  • Select the words with highest probability to be '0' as the negative ones
  • Pick a random set of products and print the top 10 words with highest probabilities (max of positive and negative) on a wordcloud

Loading and preparing the data


In [1]:
# Load the raw reviews and fill in missing reviewer names
all_reviews = (spark
    .read
    .json('./data/raw_data/reviews_Baby_5.json.gz')
    .na
    .fill({ 'reviewerName': 'Unknown' }))
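
Before any cleaning it helps to verify that the expected fields are present. A quick inspection sketch, assuming the standard Amazon reviews JSON layout (which includes asin, overall, reviewerName and summary among other fields):

all_reviews.printSchema()
all_reviews.select('asin', 'overall', 'summary').show(3, truncate=False)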

In [2]:
from pyspark.sql.functions import col, udf, trim
from pyspark.sql.types import IntegerType
import re

# Strip every character that is not a letter or whitespace
remove_punctuation = udf(lambda line: re.sub(r'[^A-Za-z\s]', '', line))
# Ratings 1 and 2 become the negative class (0), rating 5 the positive class (1)
make_binary = udf(lambda rating: 0 if rating in [1, 2] else 1, IntegerType())

reviews = (all_reviews
    .filter(col('overall').isin([1, 2, 5]))
    .withColumn('label', make_binary(col('overall')))
    .select(col('label').cast('int'), remove_punctuation('summary').alias('summary'))
    .filter(trim(col('summary')) != ''))
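
To verify that the punctuation was stripped and the labels were mapped correctly, a few prepared rows can be printed; this is just a quick look, not part of the original flow:

reviews.show(5, truncate=False)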

Splitting the data and balancing class skew


In [3]:
# 80% / 20% train-test split with a fixed seed for reproducibility
train, test = reviews.randomSplit([.8, .2], seed=5436)

In [4]:
def multiply_dataset(dataset, n):
    # Union the dataset with itself n times (used to oversample the minority class)
    return dataset if n <= 1 else dataset.union(multiply_dataset(dataset, n - 1))
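
For instance, multiplying a throwaway two-row DataFrame by 3 unions it with itself three times, so the row count triples; a tiny illustration, not part of the original notebook:

small = spark.createDataFrame([(1,), (2,)], ['value'])
multiply_dataset(small, 3).count()  # returns 6: the two rows repeated three times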

In [5]:
# Oversample the minority (negative) class so the training set is roughly balanced
reviews_good = train.filter('label == 1')
reviews_bad = train.filter('label == 0')

reviews_bad_multiplied = multiply_dataset(reviews_bad, reviews_good.count() // reviews_bad.count())

train_reviews = reviews_bad_multiplied.union(reviews_good)
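
To confirm that the oversampling roughly balanced the two classes, the label counts of the new training set can be inspected (a verification step, not part of the original flow):

train_reviews.groupBy('label').count().show()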

Benchmark: always predict the majority class


In [10]:
# Accuracy of a baseline that always predicts 5 stars (label 1) on the balanced training set
accuracy = reviews_good.count() / float(train_reviews.count())
print('Always predicting 5 stars accuracy: {0}'.format(accuracy))


Always predicting 5 stars accuracy: 0.523964341898

Learning pipeline


In [14]:
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StopWordsRemover
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol='summary', outputCol='words')

pipeline = Pipeline(stages=[
    tokenizer, 
    StopWordsRemover(inputCol='words', outputCol='filtered_words'),
    HashingTF(inputCol='filtered_words', outputCol='rawFeatures', numFeatures=120000),
    IDF(inputCol='rawFeatures', outputCol='features'),
    LogisticRegression(regParam=.3, elasticNetParam=.01)
])
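
To make the feature side of the pipeline concrete, HashingTF can be applied on its own to a hand-made tokenized row. This is only an illustrative sketch; the resulting indices depend on how each term is hashed into the 120000 buckets:

sample = spark.createDataFrame([(['great', 'product'],)], ['filtered_words'])

(HashingTF(inputCol='filtered_words', outputCol='rawFeatures', numFeatures=120000)
    .transform(sample)
    .show(truncate=False))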

Testing the model


In [11]:
model = pipeline.fit(train_reviews)

In [12]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

prediction = model.transform(test)
BinaryClassificationEvaluator().evaluate(prediction)


Out[12]:
0.9641531568696533
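
Note that BinaryClassificationEvaluator uses the area under the ROC curve as its default metric, so the 0.96 above is not directly comparable to the 0.52 accuracy benchmark. A rough accuracy figure on the test set can be computed directly; this is an extra check, not part of the original notebook:

correct = prediction.filter(col('label') == col('prediction')).count()
print('Test set accuracy: {0}'.format(correct / float(prediction.count())))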

Using the model to extract the most predictive words


In [15]:
from pyspark.sql.functions import explode
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

# Treat every individual word as if it were a whole summary
words = (tokenizer
    .transform(reviews)
    .select(explode(col('words')).alias('summary')))

# Score each word with the trained model
predictors = (model
    .transform(words)
    .select(col('summary').alias('word'), 'probability'))

# The probability vector holds P(label=0) at index 0 and P(label=1) at index 1
first = udf(lambda x: x[0].item(), FloatType())
second = udf(lambda x: x[1].item(), FloatType())

predictive_words = (predictors
   .select(
       'word', 
       second(col('probability')).alias('positive'), 
       first(col('probability')).alias('negative'))
   .groupBy('word')
   .agg(
       F.max('positive').alias('positive'),
       F.max('negative').alias('negative')))

positive_predictive_words = (predictive_words
    .select(col('word').alias('positive_word'), col('positive').alias('pos_prob'))
    .sort('pos_prob', ascending=False))

negative_predictive_words = (predictive_words
    .select(col('word').alias('negative_word'), col('negative').alias('neg_prob'))
    .sort('neg_prob', ascending=False))

In [16]:
import pandas as pd

pd.concat([
    positive_predictive_words.toPandas().head(n=20),
    negative_predictive_words.toPandas().head(n=20) ],
    axis=1)


Out[16]:
positive_word pos_prob negative_word neg_prob
0 lifesaver 0.717243 worst 0.671203
1 perfect 0.715261 disappointed 0.663169
2 toxic 0.715261 not 0.660606
3 biteproof 0.715261 disappointing 0.656997
4 awesome 0.715167 meh 0.652190
5 excellent 0.711688 poor 0.648915
6 wonderful 0.707410 useless 0.646298
7 chairbut 0.707410 returned 0.644646
8 loves 0.704475 hate 0.643379
9 five 0.700906 awful 0.639849
10 amazing 0.699648 terrible 0.639652
11 best 0.697946 disappointment 0.639614
12 exactly 0.697775 ok 0.635477
13 must 0.697551 hates 0.635062
14 boogin 0.697551 eh 0.634872
15 finally 0.695884 okay 0.634070
16 handy 0.695835 pointless 0.633207
17 great 0.695623 garbage 0.631947
18 sooner 0.695381 worthless 0.631146
19 saver 0.695207 hated 0.630003

Summarizing a single product - picking the best and the worst


In [17]:
# Refit the pipeline on all labeled reviews for the final product summaries
full_model = pipeline.fit(reviews)

In [18]:
# Keep only products that have more than 25 reviews
highly_reviewed_products = (all_reviews
    .groupBy('asin')
    .agg(F.count('asin').alias('count'), F.avg('overall').alias('avg_rating'))
    .filter('count > 25'))

In [19]:
# row[0] is the product id (asin) of the highest- and lowest-rated product
best_product = highly_reviewed_products.sort('avg_rating', ascending=False).take(1)[0][0]

worst_product = highly_reviewed_products.sort('avg_rating').take(1)[0][0]

In [20]:
def most_contributing_summaries(product, total_reviews, ranking_model):
    # Keep only the reviews of the given product
    product_reviews = total_reviews.filter(col('asin') == product).select('summary', 'overall')
    
    # Score every summary and keep its probability of being positive
    summary_ranks = (ranking_model
        .transform(product_reviews)
        .select(
            'summary', 
            second(col('probability')).alias('pos_prob')))
    
    # The 10 most and 10 least positive summaries, mapped to their probabilities
    pos_summaries = { row[0]: row[1] for row in summary_ranks.sort('pos_prob', ascending=False).take(10) }
    neg_summaries = { row[0]: row[1] for row in summary_ranks.sort('pos_prob').take(10) }
    
    return pos_summaries, neg_summaries
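
As a standalone usage example (separate from the wordcloud plotting below), the function can be called directly to list the most positive summaries of a product; best_product, all_reviews and full_model are the objects defined above:

pos_summaries, neg_summaries = most_contributing_summaries(best_product, all_reviews, full_model)

for summary, prob in sorted(pos_summaries.items(), key=lambda kv: -kv[1]):
    print('{0:.3f}  {1}'.format(prob, summary))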

In [23]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def present_product(product, total_reviews, ranking_model):
    pos_summaries, neg_summaries = most_contributing_summaries(product, total_reviews, ranking_model)
    
    # fit_words expects a mapping of text -> weight, so whole summaries are sized by their probability
    pos_wordcloud = WordCloud(background_color='white', max_words=20).fit_words(pos_summaries)
    neg_wordcloud = WordCloud(background_color='white', max_words=20).fit_words(neg_summaries)
    
    fig = plt.figure(figsize=(15, 15))
    
    ax = fig.add_subplot(1,2,1)
    ax.set_title('Positive summaries')
    ax.imshow(pos_wordcloud, interpolation='bilinear')
    ax.axis('off')
    
    ax = fig.add_subplot(1,2,2)
    ax.set_title('Negative summaries')
    ax.imshow(neg_wordcloud, interpolation='bilinear')
    ax.axis('off')
    
    plt.show()

In [24]:
present_product(best_product, all_reviews, full_model)



In [25]:
present_product(worst_product, all_reviews, full_model)