Analyzing unstructured text in product review data

It's common for companies to have useful data hidden in large volumes of text:

  • online reviews
  • social media posts and tweets
  • interactions with customers, such as emails and call center transcripts

For example, shoppers often find it challenging to decide between products with the same star rating. When this happens, they sift through the raw text of reviews to understand the strengths and weaknesses of each option.

In this notebook we automate the task of determining product strengths and weaknesses from review text by:

  1. splitting Amazon review text into sentences and applying a sentiment analysis model
  2. tagging sentences that mention aspects of interest
  3. extracting adjectives from raw text and comparing their use in positive and negative reviews
  4. summarizing the use of adjectives for tagged sentences

GraphLab Create includes feature engineering objects that leverage spaCy, a high-performance NLP package. Here we use it to extract parts of speech and to parse reviews into sentences.
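As a rough illustration of what the sentence-splitting step involves, a naive regex-based splitter in plain Python might look like the sketch below. This is purely hypothetical: the notebook's `split_by_sentence` delegates to spaCy, whose statistical boundary detection handles abbreviations and edge cases far more robustly.

```python
import re

def naive_split_by_sentence(text):
    """Split text on sentence-ending punctuation followed by whitespace.

    A crude stand-in for spaCy's sentence boundary detection; it will
    mishandle abbreviations such as "Dr." or "e.g.".
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

naive_split_by_sentence("Great monitor. Battery life is poor! Would I buy again?")
# → ['Great monitor.', 'Battery life is poor!', 'Would I buy again?']
```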


In [1]:
import graphlab as gl

In [2]:
from graphlab.toolkits.text_analytics import trim_rare_words, split_by_sentence, \
    extract_part_of_speech, stopwords, PartOfSpeech

def nlp_pipeline(reviews, title, aspects):
    """Tag and score the review sentences for a single product.

    Returns one row per (sentence, aspect) match, annotated with
    extracted adjectives and a predicted sentiment score.
    """
    print(title)

    print('1. Get reviews for this product')
    reviews = reviews.filter_by(title, 'name')

    print('2. Splitting reviews into sentences')
    reviews['sentences'] = split_by_sentence(reviews['review'])
    sentences = reviews.stack('sentences', 'sentence').dropna()

    print('3. Tagging relevant reviews')
    # Nearest-neighbor autotagger: match each sentence against the aspect list
    tags = gl.SFrame({'tag': aspects})
    tagger_model = gl.data_matching.autotagger.create(tags, verbose=False)
    tagged = tagger_model.tag(sentences, query_name='sentence',
                              similarity_threshold=.3, verbose=False)\
                         .join(sentences, on='sentence')

    print('4. Extracting adjectives')
    tagged['cleaned'] = trim_rare_words(tagged['sentence'], stopwords=list(stopwords()))
    tagged['adjectives'] = extract_part_of_speech(tagged['cleaned'], [PartOfSpeech.ADJ])

    print('5. Predicting sentence-level sentiment')
    model = gl.sentiment_analysis.create(tagged, features=['review'])
    tagged['sentiment'] = model.predict(tagged)
    return tagged

In [3]:
reviews = gl.SFrame('amazon_baby.gl')


2016-04-14 09:42:44,477 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.9 started. Logging: /tmp/graphlab_server_1460652163.log
This commercial license of GraphLab Create is assigned to engr@turi.com.

In [4]:
reviews


Out[4]:
review rating name
This book is amazing! I
bought it and read it to ...
5.0 Stop Pacifier Sucking
without tears with ...
disappointed that the
book and puppet were so ...
2.0 Stop Pacifier Sucking
without tears with ...
I have this nook and I
love it. But this ...
5.0 A Tale of Baby's Days
with Peter Rabbit ...
Perfect for new parents.
We were able to keep ...
5.0 Baby Tracker® - Daily
Childcare Journal, ...
We have used this product
since our daughter was ...
5.0 Baby Tracker® - Daily
Childcare Journal, ...
Every new mom needs this
journal. The 24 hour ...
5.0 Baby Tracker® - Daily
Childcare Journal, ...
This is great for basics,
but I wish the space to ...
4.0 Baby Tracker® - Daily
Childcare Journal, ...
I love this journal and
our nanny uses it ...
4.0 Baby Tracker® - Daily
Childcare Journal, ...
My 3 month old son spend
half of his days with my ...
5.0 Baby Tracker® - Daily
Childcare Journal, ...
I wanted to love this,
but it was pretty ...
3.0 Baby Tracker® - Daily
Childcare Journal, ...
[228421 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [5]:
from helper_util import *

Focus on chosen aspects of baby monitors


In [6]:
aspects = ['audio', 'price', 'signal', 'range', 'battery life']

In [7]:
reviews = search(reviews, 'monitor')


Tokenizing...
TF-IDF transform...
Creating inverted index...
Creating query expansion model...
Saving data for querying...
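`search` is defined in `helper_util`, whose source isn't shown; its log output suggests it builds a TF-IDF inverted index with query expansion. A deliberately simplified, hypothetical keyword filter conveys the basic effect of restricting the reviews to monitor products:

```python
def keyword_filter(reviews, keyword):
    """Keep rows whose product name or review text mentions the keyword.

    Here `reviews` is a plain list of dicts with 'name' and 'review' keys;
    the real helper instead ranks matches via TF-IDF and query expansion.
    """
    kw = keyword.lower()
    return [r for r in reviews
            if kw in r['name'].lower() or kw in r['review'].lower()]

rows = [{'name': 'Graco ultraclear baby monitor', 'review': 'Works well.'},
        {'name': 'Baby Tracker - Daily Journal', 'review': 'Love it.'}]
keyword_filter(rows, 'monitor')  # keeps only the first row
```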

In [8]:
reviews


Out[8]:
review rating name
This baby monitor has
been working for us for ...
4.0 Baby Monitor - Direct
Link Privacy Monitor ...
My sister recommended we
get this monitor because ...
5.0 Graco ultraclear baby
monitor ...
These monitors are
absolutly wonderful. ...
5.0 Graco ultraclear baby
monitor ...
I have been using this
monitor for 10 months ...
5.0 Graco ultraclear baby
monitor ...
After trying and
returning 3 different ...
5.0 Graco ultraclear baby
monitor ...
This monitor has been
great! It's always very ...
5.0 Graco ultraclear baby
monitor ...
I have used this monitor
for three years with my ...
3.0 Graco ultraclear baby
monitor ...
Amazing monitor, I can
hear every little sound ...
5.0 Graco ultraclear baby
monitor ...
We were giving this
monitor as a gift since ...
2.0 Graco ultraclear baby
monitor ...
1 is too many stars for
the product! ...
1.0 Fisher-price Super-
sensitive Nursey Monitor ...
[7358 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Process reviews for the most common product


In [9]:
item_a = 'Infant Optics DXR-5 2.4 GHz Digital Video Baby Monitor with Night Vision'
reviews_a = nlp_pipeline(reviews, item_a, aspects)


Infant Optics DXR-5 2.4 GHz Digital Video Baby Monitor with Night Vision
1. Get reviews for this product
2. Splitting reviews into sentences
3. Tagging relevant reviews
4. Extracting adjectives
5. Predicting sentence-level sentiment

In [10]:
reviews_a


Out[10]:
sentence_id sentence tag score review rating
2 It killed our wifi signal
then lost it's pairing ...
signal 1.0 It killed our wifi signal
then lost it's pairing ...
1.0
4 Both audio and video
monitor supersedes the ...
audio 0.5 I love this video
monitor. Both audio and ...
5.0
4 Both audio and video
monitor supersedes the ...
price 0.5 I love this video
monitor. Both audio and ...
5.0
5 The VOX poewr saving
helps prevent quick ...
battery life 0.454545454545 I love this video
monitor. Both audio and ...
5.0
11 Definitely worth the
price. ...
price 0.666666666667 This is such a great
camera. It doesn't pivot ...
5.0
17 I purchased it based on
reviews and the appea ...
price 0.666666666667 I had bought this monitor
back in july when my ...
1.0
18 First and foremost the
battery life SUCKS. ...
battery life 1.0 I had bought this monitor
back in july when my ...
1.0
35 It's good and inexpensive
enough that we've dec ...
range 0.333333333333 I tried this after having
the Lenox monitor. The ...
4.0
36 It's been hard to find a
video system that does ...
price 0.666666666667 I tried this after having
the Lenox monitor. The ...
4.0
40 This is a great camera
monitor for the price. ...
price 0.666666666667 This is a great camera
monitor for the price. ...
4.0
name cleaned adjectives sentiment
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
wifi signal lost it's
camera. ...
[] 0.726841681132
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
audio video monitor
supersedes quality ...
[audio, previous] 0.95069270188
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
audio video monitor
supersedes quality ...
[audio, previous] 0.954092282107
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
vox saving quick battery
(which doesn't wanted ...
[vox, quick, which] 0.830247002167
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
worth price. [worth] 0.999794625299
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
purchased based reviews
price. ...
[] 0.177435087275
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
battery life [] 0.0887905806345
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
it's good inexpensive
we've decided angelcare ...
[good, inexpensive,
sound, total] ...
0.999940613283
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
it's hard find video
system price, work us. ...
[hard, find] 0.999923476009
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
great camera monitor
price. ...
[great] 0.999951337173
[787 rows x 10 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Comparing to another product


In [11]:
dropdown = get_dropdown(reviews)
display(dropdown)

In [12]:
item_b = dropdown.value
reviews_b = nlp_pipeline(reviews, item_b, aspects)
counts, sentiment, adjectives = get_comparisons(reviews_a, reviews_b, item_a, item_b, aspects)


VTech Communications Safe & Sound Digital Audio Monitor
1. Get reviews for this product
2. Splitting reviews into sentences
3. Tagging relevant reviews
4. Extracting adjectives
5. Predicting sentence-level sentiment

Comparing the number of sentences that mention each aspect


In [13]:
counts


Out[13]:
tag            Infant Optics DXR-5 2.4 GHz Digital Video Baby ...   VTech Communications Safe & Sound Digital A ...
signal         107                                                  15
battery life   180                                                  93
range          144                                                  68
audio          105                                                  27
price          251                                                  69
[5 rows x 3 columns]
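`get_comparisons` is another `helper_util` function whose source isn't shown. Counting the tagged sentences per aspect amounts to a group-by over the `tag` column; a plain-Python sketch (with a hypothetical `count_tags` helper) is:

```python
from collections import Counter

def count_tags(tagged_rows):
    """Count how many tagged sentences mention each aspect.

    `tagged_rows` stands in for one product's tagged output: a list of
    dicts with a 'tag' key, one row per (sentence, aspect) match.
    """
    return Counter(row['tag'] for row in tagged_rows)

rows = [{'tag': 'price'}, {'tag': 'price'}, {'tag': 'signal'}]
count_tags(rows)  # Counter({'price': 2, 'signal': 1})
```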

Comparing the sentence-level sentiment for each aspect of each product


In [14]:
sentiment


Out[14]:
tag            Infant Optics DXR-5 2.4 GHz Digital Video Baby ...   VTech Communications Safe & Sound Digital A ...
signal         0.674221231596                                       0.695281129091
battery life   0.761379749571                                       0.831840808502
range          0.862246778561                                       0.840649624394
audio          0.826972950724                                       0.92381971402
price          0.885832746604                                       0.914282565406
[5 rows x 3 columns]
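The per-aspect sentiment shown above is plausibly an average of the sentence-level scores within each tag. A hedged plain-Python sketch of that aggregation (again standing in for the unshown helper):

```python
from collections import defaultdict

def mean_sentiment_by_tag(tagged_rows):
    """Average the predicted sentiment of sentences grouped by aspect tag."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in tagged_rows:
        sums[row['tag']] += row['sentiment']
        counts[row['tag']] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}

rows = [{'tag': 'price', 'sentiment': 0.9},
        {'tag': 'price', 'sentiment': 0.7},
        {'tag': 'signal', 'sentiment': 0.2}]
mean_sentiment_by_tag(rows)  # {'price': 0.8, 'signal': 0.2}
```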

Comparing the use of adjectives for each aspect


In [ ]:
adjectives

Investigating good and bad sentences


In [ ]:
good, bad = get_extreme_sentences(reviews_a)

Print the positive sentences for the first item, with adjectives and aspects highlighted.


In [ ]:
print_sentences(good['highlighted'])

Print the negative sentences for the first item, with adjectives and aspects highlighted.


In [ ]:
print_sentences(bad['highlighted'])
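The `highlighted` column and `print_sentences` come from `helper_util`, whose source isn't shown. A hypothetical stand-in that marks aspect and adjective mentions in plain text could simply wrap each match in `**...**`:

```python
import re

def highlight(sentence, terms):
    """Wrap each whole-word occurrence of a term in **...**, case-insensitively."""
    for term in terms:
        sentence = re.sub(r'(?i)\b(%s)\b' % re.escape(term),
                          r'**\1**', sentence)
    return sentence

highlight('Definitely worth the price.', ['price', 'worth'])
# → 'Definitely **worth** the **price**.'
```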

Deployment


In [17]:
service = gl.deploy.predictive_service.load("s3://gl-demo-usw2/predictive_service/demolab/ps-1.8.5")


2016-04-14 10:33:45,329 [WARNING] graphlab.deploy.predictive_service, 384: Overwriting existing Predictive Service "demolab-ps-one-eight-five" in local session.

In [18]:
service.get_predictive_objects_status()


---------------------------------------------------------------------------
ConnectionError                           Traceback (most recent call last)
<ipython-input-18-1e75c08f9093> in <module>()
----> 1 service.get_predictive_objects_status()

/home/chris/miniconda2/lib/python2.7/site-packages/graphlab/deploy/_predictive_service/_predictive_service.pyc in get_predictive_objects_status(self)
    702         This method will be deprecated., Use `get_status('endpoint')` instead
    703         '''
--> 704         all_status = self._get_endpoints_status()
    705         if not all_status:
    706             return

/home/chris/miniconda2/lib/python2.7/site-packages/graphlab/deploy/_predictive_service/_predictive_service.pyc in _get_endpoints_status(self)
    734             result = self._environment.get_status_from_nodes_directly()
    735         else:
--> 736             result = self._environment.get_status()
    737         if len(result) == 0:
    738             _logger.info('No nodes in the Predictive Service')

/home/chris/miniconda2/lib/python2.7/site-packages/graphlab/deploy/_predictive_service/_predictive_service_environment.pyc in get_status(self, _show_errors)
    611         endpoint = self.load_balancer_dns_name
    612         url = '%s%s/manage/status' % (schema, endpoint)
--> 613         response = self._post(url)
    614         info = json.loads(response.text)
    615         return info.values()

/home/chris/miniconda2/lib/python2.7/site-packages/graphlab/deploy/_predictive_service/_predictive_service_environment.pyc in _post(self, url, data, admin_key)
    265         headers = {'content-type': 'application/json'}
    266         response = post(url=url, data=json.dumps(data), headers=headers,
--> 267                 verify=self._should_verify_certificate, timeout=60, auth=('admin_key', auth_admin_key))
    268         if not response:
    269             raise RuntimeError("Request failed. Status code: %s" % response.status_code)

/home/chris/miniconda2/lib/python2.7/site-packages/requests/api.pyc in post(url, data, json, **kwargs)
    105     """
    106 
--> 107     return request('post', url, data=data, json=json, **kwargs)
    108 
    109 

/home/chris/miniconda2/lib/python2.7/site-packages/requests/api.pyc in request(method, url, **kwargs)
     51     # cases, and look like a memory leak in others.
     52     with sessions.Session() as session:
---> 53         return session.request(method=method, url=url, **kwargs)
     54 
     55 

/home/chris/miniconda2/lib/python2.7/site-packages/requests/sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    466         }
    467         send_kwargs.update(settings)
--> 468         resp = self.send(prep, **send_kwargs)
    469 
    470         return resp

/home/chris/miniconda2/lib/python2.7/site-packages/requests/sessions.pyc in send(self, request, **kwargs)
    574 
    575         # Send the request
--> 576         r = adapter.send(request, **kwargs)
    577 
    578         # Total elapsed time of the request (approximately)

/home/chris/miniconda2/lib/python2.7/site-packages/requests/adapters.pyc in send(self, request, stream, timeout, verify, cert, proxies)
    435                 raise RetryError(e, request=request)
    436 
--> 437             raise ConnectionError(e, request=request)
    438 
    439         except ClosedPoolError as e:

ConnectionError: HTTPSConnectionPool(host='demolab-ps-one-eight-five-1964664870.us-west-2.elb.amazonaws.com', port=443): Max retries exceeded with url: /manage/status (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7ff8ec0b8210>: Failed to establish a new connection: [Errno 111] Connection refused',))

In [ ]:
def word_count(text):
    sa = gl.SArray([text])
    sa = gl.text_analytics.count_words(sa)
    return sa[0]
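`gl.text_analytics.count_words` returns a dictionary of token counts per document. For reference, an approximate standard-library equivalent looks like the sketch below; it is not the deployed endpoint's code, and GraphLab's tokenizer differs in detail.

```python
from collections import Counter
import re

def word_count_plain(text):
    """Lowercase the text, keep alphabetic tokens (with apostrophes),
    and return a bag-of-words dictionary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return dict(Counter(tokens))

word_count_plain("It's a beautiful day in the neighborhood. Beautiful day!")
# 'beautiful' and 'day' each appear twice
```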

In [ ]:
service.update('chris_bow', word_count)

In [ ]:
service.apply_changes()

In [ ]:
service.query('chris_bow', text=["It's a beautiful day in the neighborhood. Beautiful day for a neighbor."])