Analyzing unstructured text in product review data

It's common for companies to have useful data hidden in large volumes of text:

  • online reviews
  • social media posts and tweets
  • interactions with customers, such as emails and call center transcripts

For example, shoppers often find it challenging to decide between products with the same star rating. When this happens, they sift through the raw text of reviews to understand the strengths and weaknesses of each option.

In this notebook we automate the task of determining product strengths and weaknesses from review text by:

  1. splitting Amazon review text into sentences and applying a sentiment analysis model
  2. tagging sentences that mention aspects of interest
  3. extracting adjectives from raw text and comparing their use in positive and negative reviews
  4. summarizing the use of adjectives for tagged sentences

GraphLab Create includes feature engineering objects that leverage spaCy, a high-performance NLP package. Here we use it to extract parts of speech and to parse reviews into sentences.
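As a rough illustration of what the sentence-splitting step involves, a naive regex-based splitter in plain Python might look like the sketch below. This is purely hypothetical: the notebook's `split_by_sentence` delegates to spaCy, whose statistical boundary detection handles abbreviations and edge cases far more robustly.

```python
import re

def naive_split_by_sentence(text):
    """Split text on sentence-ending punctuation followed by whitespace.

    A crude stand-in for spaCy's sentence boundary detection; it will
    mishandle abbreviations such as "Dr." or "e.g.".
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

naive_split_by_sentence("Great monitor. Battery life is poor! Would I buy again?")
# → ['Great monitor.', 'Battery life is poor!', 'Would I buy again?']
```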


In [1]:
import graphlab as gl

In [2]:
from graphlab.toolkits.text_analytics import trim_rare_words, split_by_sentence, \
    extract_part_of_speech, stopwords, PartOfSpeech

def nlp_pipeline(reviews, title, aspects):
    """Tag and score the review sentences for a single product.

    Returns one row per (sentence, aspect) match, annotated with
    extracted adjectives and a predicted sentiment score.
    """
    print(title)

    print('1. Get reviews for this product')
    reviews = reviews.filter_by(title, 'name')

    print('2. Splitting reviews into sentences')
    reviews['sentences'] = split_by_sentence(reviews['review'])
    sentences = reviews.stack('sentences', 'sentence').dropna()

    print('3. Tagging relevant reviews')
    # Nearest-neighbor autotagger: match each sentence against the aspect list
    tags = gl.SFrame({'tag': aspects})
    tagger_model = gl.data_matching.autotagger.create(tags, verbose=False)
    tagged = tagger_model.tag(sentences, query_name='sentence',
                              similarity_threshold=.3, verbose=False)\
                         .join(sentences, on='sentence')

    print('4. Extracting adjectives')
    tagged['cleaned'] = trim_rare_words(tagged['sentence'], stopwords=list(stopwords()))
    tagged['adjectives'] = extract_part_of_speech(tagged['cleaned'], [PartOfSpeech.ADJ])

    print('5. Predicting sentence-level sentiment')
    model = gl.sentiment_analysis.create(tagged, features=['review'])
    tagged['sentiment'] = model.predict(tagged)
    return tagged

In [3]:
reviews = gl.SFrame('amazon_baby.gl')


2016-04-14 09:42:44,477 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.9 started. Logging: /tmp/graphlab_server_1460652163.log
This commercial license of GraphLab Create is assigned to engr@turi.com.

In [4]:
reviews


Out[4]:
review rating name
This book is amazing! I
bought it and read it to ...
5.0 Stop Pacifier Sucking
without tears with ...
disappointed that the
book and puppet were so ...
2.0 Stop Pacifier Sucking
without tears with ...
I have this nook and I
love it. But this ...
5.0 A Tale of Baby's Days
with Peter Rabbit ...
Perfect for new parents.
We were able to keep ...
5.0 Baby Tracker® - Daily
Childcare Journal, ...
We have used this product
since our daughter was ...
5.0 Baby Tracker® - Daily
Childcare Journal, ...
Every new mom needs this
journal. The 24 hour ...
5.0 Baby Tracker® - Daily
Childcare Journal, ...
This is great for basics,
but I wish the space to ...
4.0 Baby Tracker® - Daily
Childcare Journal, ...
I love this journal and
our nanny uses it ...
4.0 Baby Tracker® - Daily
Childcare Journal, ...
My 3 month old son spend
half of his days with my ...
5.0 Baby Tracker® - Daily
Childcare Journal, ...
I wanted to love this,
but it was pretty ...
3.0 Baby Tracker® - Daily
Childcare Journal, ...
[228421 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [5]:
from helper_util import *

Focus on chosen aspects of baby monitors


In [6]:
aspects = ['audio', 'price', 'signal', 'range', 'battery life']

In [7]:
reviews = search(reviews, 'monitor')


Tokenizing...
TF-IDF transform...
Creating inverted index...
Creating query expansion model...
Saving data for querying...
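`search` is defined in `helper_util`, whose source isn't shown; its log output suggests it builds a TF-IDF inverted index with query expansion. A deliberately simplified, hypothetical keyword filter conveys the basic effect of restricting the reviews to monitor products:

```python
def keyword_filter(reviews, keyword):
    """Keep rows whose product name or review text mentions the keyword.

    Here `reviews` is a plain list of dicts with 'name' and 'review' keys;
    the real helper instead ranks matches via TF-IDF and query expansion.
    """
    kw = keyword.lower()
    return [r for r in reviews
            if kw in r['name'].lower() or kw in r['review'].lower()]

rows = [{'name': 'Graco ultraclear baby monitor', 'review': 'Works well.'},
        {'name': 'Baby Tracker - Daily Journal', 'review': 'Love it.'}]
keyword_filter(rows, 'monitor')  # keeps only the first row
```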

In [8]:
reviews


Out[8]:
review rating name
This baby monitor has
been working for us for ...
4.0 Baby Monitor - Direct
Link Privacy Monitor ...
My sister recommended we
get this monitor because ...
5.0 Graco ultraclear baby
monitor ...
These monitors are
absolutly wonderful. ...
5.0 Graco ultraclear baby
monitor ...
I have been using this
monitor for 10 months ...
5.0 Graco ultraclear baby
monitor ...
After trying and
returning 3 different ...
5.0 Graco ultraclear baby
monitor ...
This monitor has been
great! It's always very ...
5.0 Graco ultraclear baby
monitor ...
I have used this monitor
for three years with my ...
3.0 Graco ultraclear baby
monitor ...
Amazing monitor, I can
hear every little sound ...
5.0 Graco ultraclear baby
monitor ...
We were giving this
monitor as a gift since ...
2.0 Graco ultraclear baby
monitor ...
1 is too many stars for
the product! ...
1.0 Fisher-price Super-
sensitive Nursey Monitor ...
[7358 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Process reviews for the most common product


In [9]:
item_a = 'Infant Optics DXR-5 2.4 GHz Digital Video Baby Monitor with Night Vision'
reviews_a = nlp_pipeline(reviews, item_a, aspects)


Infant Optics DXR-5 2.4 GHz Digital Video Baby Monitor with Night Vision
1. Get reviews for this product
2. Splitting reviews into sentences
3. Tagging relevant reviews
4. Extracting adjectives
5. Predicting sentence-level sentiment

In [10]:
reviews_a


Out[10]:
sentence_id sentence tag score review rating
2 It killed our wifi signal
then lost it's pairing ...
signal 1.0 It killed our wifi signal
then lost it's pairing ...
1.0
4 Both audio and video
monitor supersedes the ...
audio 0.5 I love this video
monitor. Both audio and ...
5.0
4 Both audio and video
monitor supersedes the ...
price 0.5 I love this video
monitor. Both audio and ...
5.0
5 The VOX poewr saving
helps prevent quick ...
battery life 0.454545454545 I love this video
monitor. Both audio and ...
5.0
11 Definitely worth the
price. ...
price 0.666666666667 This is such a great
camera. It doesn't pivot ...
5.0
17 I purchased it based on
reviews and the appea ...
price 0.666666666667 I had bought this monitor
back in july when my ...
1.0
18 First and foremost the
battery life SUCKS. ...
battery life 1.0 I had bought this monitor
back in july when my ...
1.0
35 It's good and inexpensive
enough that we've dec ...
range 0.333333333333 I tried this after having
the Lenox monitor. The ...
4.0
36 It's been hard to find a
video system that does ...
price 0.666666666667 I tried this after having
the Lenox monitor. The ...
4.0
40 This is a great camera
monitor for the price. ...
price 0.666666666667 This is a great camera
monitor for the price. ...
4.0
name cleaned adjectives sentiment
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
wifi signal lost it's
camera. ...
[] 0.726841681132
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
audio video monitor
supersedes quality ...
[audio, previous] 0.95069270188
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
audio video monitor
supersedes quality ...
[audio, previous] 0.954092282107
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
vox saving quick battery
(which doesn't wanted ...
[vox, quick, which] 0.830247002167
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
worth price. [worth] 0.999794625299
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
purchased based reviews
price. ...
[] 0.177435087275
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
battery life [] 0.0887905806345
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
it's good inexpensive
we've decided angelcare ...
[good, inexpensive,
sound, total] ...
0.999940613283
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
it's hard find video
system price, work us. ...
[hard, find] 0.999923476009
Infant Optics DXR-5 2.4
GHz Digital Video Baby ...
great camera monitor
price. ...
[great] 0.999951337173
[787 rows x 10 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Comparing to another product


In [11]:
dropdown = get_dropdown(reviews)
display(dropdown)

In [12]:
item_b = dropdown.value
reviews_b = nlp_pipeline(reviews, item_b, aspects)
counts, sentiment, adjectives = get_comparisons(reviews_a, reviews_b, item_a, item_b, aspects)


VTech Communications Safe & Sound Digital Audio Monitor
1. Get reviews for this product
2. Splitting reviews into sentences
3. Tagging relevant reviews
4. Extracting adjectives
5. Predicting sentence-level sentiment

Comparing the number of sentences that mention each aspect


In [13]:
counts


Out[13]:
tag            Infant Optics DXR-5 2.4 GHz Digital Video Baby ...   VTech Communications Safe & Sound Digital A ...
signal         107                                                  15
battery life   180                                                  93
range          144                                                  68
audio          105                                                  27
price          251                                                  69
[5 rows x 3 columns]
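`get_comparisons` is another `helper_util` function whose source isn't shown. Counting the tagged sentences per aspect amounts to a group-by over the `tag` column; a plain-Python sketch (with a hypothetical `count_tags` helper) is:

```python
from collections import Counter

def count_tags(tagged_rows):
    """Count how many tagged sentences mention each aspect.

    `tagged_rows` stands in for one product's tagged output: a list of
    dicts with a 'tag' key, one row per (sentence, aspect) match.
    """
    return Counter(row['tag'] for row in tagged_rows)

rows = [{'tag': 'price'}, {'tag': 'price'}, {'tag': 'signal'}]
count_tags(rows)  # Counter({'price': 2, 'signal': 1})
```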

Comparing the sentence-level sentiment for each aspect of each product


In [14]:
sentiment


Out[14]:
tag            Infant Optics DXR-5 2.4 GHz Digital Video Baby ...   VTech Communications Safe & Sound Digital A ...
signal         0.674221231596                                       0.695281129091
battery life   0.761379749571                                       0.831840808502
range          0.862246778561                                       0.840649624394
audio          0.826972950724                                       0.92381971402
price          0.885832746604                                       0.914282565406
[5 rows x 3 columns]
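The per-aspect sentiment shown above is plausibly an average of the sentence-level scores within each tag. A hedged plain-Python sketch of that aggregation (again standing in for the unshown helper):

```python
from collections import defaultdict

def mean_sentiment_by_tag(tagged_rows):
    """Average the predicted sentiment of sentences grouped by aspect tag."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in tagged_rows:
        sums[row['tag']] += row['sentiment']
        counts[row['tag']] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}

rows = [{'tag': 'price', 'sentiment': 0.9},
        {'tag': 'price', 'sentiment': 0.7},
        {'tag': 'signal', 'sentiment': 0.2}]
mean_sentiment_by_tag(rows)  # {'price': 0.8, 'signal': 0.2}
```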

Comparing the use of adjectives for each aspect


In [ ]:
adjectives

Investigating good and bad sentences


In [ ]:
good, bad = get_extreme_sentences(reviews_a)

Print the positive sentences for the first item, with adjectives and aspects highlighted.


In [ ]:
print_sentences(good['highlighted'])

Print the negative sentences for the first item, with adjectives and aspects highlighted.


In [ ]:
print_sentences(bad['highlighted'])
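The `highlighted` column and `print_sentences` come from `helper_util`, whose source isn't shown. A hypothetical stand-in that marks aspect and adjective mentions in plain text could simply wrap each match in `**...**`:

```python
import re

def highlight(sentence, terms):
    """Wrap each whole-word occurrence of a term in **...**, case-insensitively."""
    for term in terms:
        sentence = re.sub(r'(?i)\b(%s)\b' % re.escape(term),
                          r'**\1**', sentence)
    return sentence

highlight('Definitely worth the price.', ['price', 'worth'])
# → 'Definitely **worth** the **price**.'
```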

Deployment


In [17]:
service = gl.deploy.predictive_service.load("s3://gl-demo-usw2/predictive_service/demolab/ps-1.8.5")


2016-04-14 10:33:45,329 [WARNING] graphlab.deploy.predictive_service, 384: Overwriting existing Predictive Service "demolab-ps-one-eight-five" in local session.

In [18]:
service.get_predictive_objects_status()


---------------------------------------------------------------------------
ConnectionError                           Traceback (most recent call last)
<ipython-input-18-1e75c08f9093> in <module>()
----> 1 service.get_predictive_objects_status()

/home/chris/miniconda2/lib/python2.7/site-packages/graphlab/deploy/_predictive_service/_predictive_service.pyc in get_predictive_objects_status(self)
    702         This method will be deprecated., Use `get_status('endpoint')` instead
    703         '''
--> 704         all_status = self._get_endpoints_status()
    705         if not all_status:
    706             return

/home/chris/miniconda2/lib/python2.7/site-packages/graphlab/deploy/_predictive_service/_predictive_service.pyc in _get_endpoints_status(self)
    734             result = self._environment.get_status_from_nodes_directly()
    735         else:
--> 736             result = self._environment.get_status()
    737         if len(result) == 0:
    738             _logger.info('No nodes in the Predictive Service')

/home/chris/miniconda2/lib/python2.7/site-packages/graphlab/deploy/_predictive_service/_predictive_service_environment.pyc in get_status(self, _show_errors)
    611         endpoint = self.load_balancer_dns_name
    612         url = '%s%s/manage/status' % (schema, endpoint)
--> 613         response = self._post(url)
    614         info = json.loads(response.text)
    615         return info.values()

/home/chris/miniconda2/lib/python2.7/site-packages/graphlab/deploy/_predictive_service/_predictive_service_environment.pyc in _post(self, url, data, admin_key)
    265         headers = {'content-type': 'application/json'}
    266         response = post(url=url, data=json.dumps(data), headers=headers,
--> 267                 verify=self._should_verify_certificate, timeout=60, auth=('admin_key', auth_admin_key))
    268         if not response:
    269             raise RuntimeError("Request failed. Status code: %s" % response.status_code)

/home/chris/miniconda2/lib/python2.7/site-packages/requests/api.pyc in post(url, data, json, **kwargs)
    105     """
    106 
--> 107     return request('post', url, data=data, json=json, **kwargs)
    108 
    109 

/home/chris/miniconda2/lib/python2.7/site-packages/requests/api.pyc in request(method, url, **kwargs)
     51     # cases, and look like a memory leak in others.
     52     with sessions.Session() as session:
---> 53         return session.request(method=method, url=url, **kwargs)
     54 
     55 

/home/chris/miniconda2/lib/python2.7/site-packages/requests/sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    466         }
    467         send_kwargs.update(settings)
--> 468         resp = self.send(prep, **send_kwargs)
    469 
    470         return resp

/home/chris/miniconda2/lib/python2.7/site-packages/requests/sessions.pyc in send(self, request, **kwargs)
    574 
    575         # Send the request
--> 576         r = adapter.send(request, **kwargs)
    577 
    578         # Total elapsed time of the request (approximately)

/home/chris/miniconda2/lib/python2.7/site-packages/requests/adapters.pyc in send(self, request, stream, timeout, verify, cert, proxies)
    435                 raise RetryError(e, request=request)
    436 
--> 437             raise ConnectionError(e, request=request)
    438 
    439         except ClosedPoolError as e:

ConnectionError: HTTPSConnectionPool(host='demolab-ps-one-eight-five-1964664870.us-west-2.elb.amazonaws.com', port=443): Max retries exceeded with url: /manage/status (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7ff8ec0b8210>: Failed to establish a new connection: [Errno 111] Connection refused',))

In [ ]:
def word_count(text):
    sa = gl.SArray([text])
    sa = gl.text_analytics.count_words(sa)
    return sa[0]
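`gl.text_analytics.count_words` returns a dictionary of token counts per document. For reference, an approximate standard-library equivalent looks like the sketch below; it is not the deployed endpoint's code, and GraphLab's tokenizer differs in detail.

```python
from collections import Counter
import re

def word_count_plain(text):
    """Lowercase the text, keep alphabetic tokens (with apostrophes),
    and return a bag-of-words dictionary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return dict(Counter(tokens))

word_count_plain("It's a beautiful day in the neighborhood. Beautiful day!")
# 'beautiful' and 'day' each appear twice
```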

In [ ]:
service.update('chris_bow', word_count)

In [ ]:
service.apply_changes()

In [ ]:
service.query('chris_bow', text=["It's a beautiful day in the neighborhood. Beautiful day for a neighbor."])