Excerpts Extraction


In [1]:
# For monitoring duration of pandas processes
from tqdm import tqdm, tqdm_pandas

# To avoid RuntimeError: Set changed size during iteration
tqdm.monitor_interval = 0

# Register `pandas.progress_apply` and `pandas.Series.map_apply` with `tqdm`
# (can use `tqdm_gui`, `tqdm_notebook`, optional kwargs, etc.)
tqdm.pandas(desc="Progress:")

# Now you can use `progress_apply` instead of `apply`
# and `progress_map` instead of `map`
# can also groupby:
# df.groupby(0).progress_apply(lambda x: x**2)

In [2]:
import pandas as pd
import numpy as np
import nltk

In [3]:
import plotly 
import plotly.plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
print(cf.__version__)
# Configure cufflings 
cf.set_config_file(offline=False, world_readable=True, theme='pearl')


0.12.1

In [4]:
reviews_and_ratings_df = pd.read_pickle('../data/interim/001_pre_processed_reviews+and_ratings.p')
reviews_and_ratings_df.head()


Out[4]:
reviewerID asin reviewText overall
0 A2XQ5LZHTD4AFT 000100039X A timeless classic. It is a very demanding an... 5.0
1 AF7CSSGV93RXN 000100039X I first read The Prophet by Kahlil Gibran over... 5.0
2 A1NPNGWBVD9AK3 000100039X This is one of the first (literary) books I re... 5.0
3 A3IS4WGMFR4X65 000100039X The Prophet is Kahlil Gibran's best known work... 5.0
4 AWLFVCT9128JV 000100039X Gibran Khalil Gibran was born in 1883 in what ... 5.0

In [5]:
reviews_vs_feature_opinion_pairs = pd.read_pickle("../data/interim/006_pairs_per_review.p")

In [6]:
reviews_vs_feature_opinion_pairs.head()


Out[6]:
userId asin reviewText imp_nns num_of_imp_nouns pairs num_of_pairs
0 A2XQ5LZHTD4AFT 000100039X [(timeless, NN), ( classic, JJ), ( demanding, ... [kneads, profits, preachers, territory, exile,... 26 [(birth, prophets), (book, flows)] 2
2 A1NPNGWBVD9AK3 000100039X [(one, CD), ( first, NNP), ( literary, JJ), ( ... [kneads, profits, preachers, territory, exile,... 26 [(relevant, catechism), (within, prophets), (t... 4
4 AWLFVCT9128JV 000100039X [(gibran, NN), ( khalil, NNP), ( gibran, NNP),... [kneads, profits, preachers, territory, exile,... 26 [(forty-eight, almustafa)] 1
5 AFY0BT42DDYZV 000100039X [(days, NNS), ( kahlil, VBP), ( gibrans, NNS),... [kneads, profits, preachers, territory, exile,... 26 [(souls, profits), (wordofmouth, twentysix), (... 3
13 A2ZZHMT58ZMVCZ 000100039X [(prophet, NN), ( waited, VBD), ( twelve, CD),... [kneads, profits, preachers, territory, exile,... 26 [(bear, departs), (others, pillars), (similar,... 4

In [7]:
df00 = reviews_vs_feature_opinion_pairs[['userId','asin','pairs']]
df00.columns = ['reviewerID','asin','pairs']
df00.head()


Out[7]:
reviewerID asin pairs
0 A2XQ5LZHTD4AFT 000100039X [(birth, prophets), (book, flows)]
2 A1NPNGWBVD9AK3 000100039X [(relevant, catechism), (within, prophets), (t...
4 AWLFVCT9128JV 000100039X [(forty-eight, almustafa)]
5 AFY0BT42DDYZV 000100039X [(souls, profits), (wordofmouth, twentysix), (...
13 A2ZZHMT58ZMVCZ 000100039X [(bear, departs), (others, pillars), (similar,...

In [8]:
df01 = df00.merge(reviews_and_ratings_df, left_on=['reviewerID','asin'], right_on=['reviewerID','asin'], how='inner')
df01[0:31]


Out[8]:
reviewerID asin pairs reviewText overall
0 A2XQ5LZHTD4AFT 000100039X [(birth, prophets), (book, flows)] A timeless classic. It is a very demanding an... 5.0
1 A1NPNGWBVD9AK3 000100039X [(relevant, catechism), (within, prophets), (t... This is one of the first (literary) books I re... 5.0
2 AWLFVCT9128JV 000100039X [(forty-eight, almustafa)] Gibran Khalil Gibran was born in 1883 in what ... 5.0
3 AFY0BT42DDYZV 000100039X [(souls, profits), (wordofmouth, twentysix), (... These days, Kahlil Gibran's "The Prophet" ofte... 5.0
4 A2ZZHMT58ZMVCZ 000100039X [(bear, departs), (others, pillars), (similar,... A prophet has waited twelve years in a coastal... 5.0
5 ADIDQRLLR4KBQ 000100039X [(beautiful, metaphors), (live, prophets)] Being an Atheist, it may seem strange to some ... 5.0
6 A281NPSIMI1C2R 000100039X [(pain, waves), (separate, almustafa)] I am alive like you, and I am standing beside ... 5.0
7 A2R64CR74I98K3 000100039X [(religious, texts)] This is a very usefull book that can be used a... 5.0
8 AF4QKY2R2TD3U 000100039X [(rich, metaphors)] "Say not, 'I have found the truth,' but rather... 5.0
9 A3SMT15X2QVUR8 000100039X [(orphalese, metaphor)] The Prophet Almustafa waits in the city of Orp... 5.0
10 A2INDDW3XYFFV1 000100039X [(home, prophets)] Khalil Gibran's The Prophet is a truly awe ins... 5.0
11 A1CSL3TFTFOTWH 0002051850 [(independent, periods), (story, progresses), ... I found this book, which takes place during th... 2.0
12 A313LJLZT8646J 0002051850 [(consistent, dire), (nine hundred and thirty-... For Whom The Bell Tolls by Ernest HemingwayBoo... 5.0
13 AHCVWPLA1O4X8 0002051850 [(american, spain), (fact, thee)] This is one of the greatest modernist novels t... 5.0
14 A1K1JW1C5CUSUZ 0002051850 [(political, fascism), (extensive, flashbacks)... Hemingway's magnificent novel has something fo... 5.0
15 A33R4E8T9KVLOM 0002051850 [(read, spain), (indepth, reflects), (sadistic... Robert Jordan is one of the most exciting, int... 5.0
16 A3IKBHODOTYYHM 0002051850 [(sold, spain)] In this novel the sum is more consequential th... 4.0
17 A1PN3R8DXRQ1C3 0002051850 [(western, spain), (many, intellectuals), (des... The Spanish Civil War was surely the most brut... 2.0
18 A1RECBDKHVOJMW 0002051850 [(european, spain), (red, spain), (conservativ... "For Whom The Bell Tolls" has long been my fav... 5.0
19 A3SI6F1RGCTAOH 0002051850 [(war, shines), (sex, declarations), (novel, c... The last time I read Hemingway's novels was so... 4.0
20 A3QZCA4LTTVGAD 0002051850 [(republican, guerrilla), (various, focuses), ... Set during the Spanish Civil War, Ernest Hemin... 5.0
21 A1MC81HLJ6Z9ZQ 0002051850 [(horrible, affair), (pull, coltish), (content... Just about anything Hemingway ever wrote was p... 5.0
22 A8IPQ1Q1O7YX5 0002051850 [(enemy, guerrilla)] I don't think I have ever taken so long to rea... 4.0
23 A3Q9K57FARA2WQ 0002051850 [(american, spain)] What more is there to say about this masterpie... 5.0
24 A3KRRXPFEAO6V 0002051850 [(fascist, threatens), (latter, partners), (co... Ernest Hemingway - For Whom The Bell TollsFor ... 5.0
25 A1RLYOPK16YXC1 0002051850 [(american, spain), (missions, guerrilla), (pr... FOR WHOM THE BELL TOLLS takes place in the spa... 5.0
26 AMTADN8VCK6J2 0002051850 [(story, mountains)] This novel is considered by most to be one of ... 4.0
27 A29SHFBU5O9BWO 0002051850 [(literary, greatness), (suitable, achievement... Perhaps the bell tolled for Ernest and his bid... 4.0
28 A2EQ74Y24BHHIF 0002113570 [(like, michener), (could, homo), (common, anc... Jane Goodall is a unique undividual whose work... 5.0
29 A2KUKUSSSAYAKH 0002117088 [(hip, surgery), (hip, goodnight), (come, clau... We adopted "Renoir, My Father" as be... 5.0
30 A280GY5UVUS2QH 000215725X [(thought, fraser), (one, fraser), (nineteenth... William Dalrymple is a historian and brings co... 5.0

Break reviews to their composing sentences


In [9]:
from nltk.tokenize import sent_tokenize
df01['reviewText'] = df01['reviewText'].progress_apply(lambda review: sent_tokenize(review))
df01.head()


Progress:: 100%|██████████| 249871/249871 [02:07<00:00, 1957.10it/s]
Out[9]:
reviewerID asin pairs reviewText overall
0 A2XQ5LZHTD4AFT 000100039X [(birth, prophets), (book, flows)] [A timeless classic., It is a very demanding a... 5.0
1 A1NPNGWBVD9AK3 000100039X [(relevant, catechism), (within, prophets), (t... [This is one of the first (literary) books I r... 5.0
2 AWLFVCT9128JV 000100039X [(forty-eight, almustafa)] [Gibran Khalil Gibran was born in 1883 in what... 5.0
3 AFY0BT42DDYZV 000100039X [(souls, profits), (wordofmouth, twentysix), (... [These days, Kahlil Gibran's "The Prophet" oft... 5.0
4 A2ZZHMT58ZMVCZ 000100039X [(bear, departs), (others, pillars), (similar,... [A prophet has waited twelve years in a coasta... 5.0

After identifying the distinct sentences, next we need to apply the same normalisation process we employed at the beggining of this project, but this time on each sentence rather than on reviews.


In [10]:
# Word Tokenize
import re
import string
import inflect
from nltk.corpus import wordnet
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import regexp_tokenize
tokenizer=RegexpTokenizer('[\'\w\-]+',gaps=False)

# Convert to Lowercase
def convert_to_lowercase(sentence):

    for i in range(len(sentence)):
        sentence[i] = sentence[i].lower()
    return sentence

# Eliminate Punctuation
def eliminate_punctuation(sentence, regex):
    new_sentence = []
    for token in sentence:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_sentence.append(new_token)
    return new_sentence

r1 = re.compile("([a-zA-Z]+)([0-9]+)")
r2 = re.compile("([0-9]+)([a-zA-Z]+)")
r3 = re.compile("([a-zA-Z]+)([0-9]+)([a-zA-Z]+)")
r4 = re.compile("([0-9]+)([a-zA-Z]+)([0-9]+)")

# Split words into numbers
def split_words_and_nums(sentence):
    new_sentence = []
    for token in sentence:
        firstRegexIsTrue = r1.match(token)
        secondRegexIsTrue = r2.match(token)
        thirdRegexIsTrue = r3.match(token)
        fourthRegexIsTrue = r4.match(token)
    
        if(firstRegexIsTrue):
            new_sentence.append(firstRegexIsTrue.group(0))
            new_sentence.append(firstRegexIsTrue.group(1))
        elif(firstRegexIsTrue):
            new_sentence.append(secondRegexIsTrue.group(0))
            new_sentence.append(secondRegexIsTrue.group(1))
        elif(thirdRegexIsTrue):
            new_sentence.append(thirdRegexIsTrue.group(0))
            new_sentence.append(thirdRegexIsTrue.group(1))
            new_sentence.append(thirdRegexIsTrue.group(2))
        elif(fourthRegexIsTrue):
            new_sentence.append(fourthRegexIsTrue.group(0))
            new_sentence.append(fourthRegexIsTrue.group(1))
            new_sentence.append(fourthRegexIsTrue.group(2))
        else:
            new_sentence.append(token)
    return new_sentence

## Convert Numbers to Words
def numStringToWord(sentence, p):        
    for i in range(len(sentence)):
        if(sentence[i].isdigit()):
            if(len(sentence[i])<10):
                sentence[i] = p.number_to_words(sentence[i])
    return sentence

# Replace negatives with antonyms 
class AntonymReplacer(object):
    def replace(self, token, pos=None):
        antonyms = set()
        for syn in wordnet.synsets(token, pos=pos):
            for lemma in syn.lemmas():
                for antonym in lemma.antonyms():
                    antonyms.add(antonym.name())
        if len(antonyms) == 1:
            return antonyms.pop()
        else:
            return None
        
    def replace_negations(self, sentence):
        i, l = 0, len(sentence)
        tokens = []
        while i<l:
            token = sentence[i]
            if token == 'not' and i+1 <l:
                ant = self.replace(sentence[i+1])
                if ant:
                    tokens.append(ant)
                    i += 2
                    continue
            tokens.append(token)
            i += 1

        return tokens

In [11]:
replacer = AntonymReplacer()
regex=re.compile('[%s]' % re.escape(string.punctuation))
p = inflect.engine()
def normalise_and_tokenize_sentences(review):
    new_review = []
    for sentence in review:
        step_0 = tokenizer.tokenize(sentence)
        step_1 = convert_to_lowercase(step_0)
        step_2 = eliminate_punctuation(step_1, regex)
        step_3 = split_words_and_nums(step_2)
        step_4 = numStringToWord(step_3, p)
        step_5 = replacer.replace_negations(step_4)
        new_review.append(step_5)
    
    return new_review

In [12]:
df2 = df01.assign(norm_sentences = df01['reviewText'].progress_apply(lambda reviewText:normalise_and_tokenize_sentences(reviewText)))
df2.head()


Progress:: 100%|██████████| 249871/249871 [04:22<00:00, 950.77it/s] 
Out[12]:
reviewerID asin pairs reviewText overall norm_sentences
0 A2XQ5LZHTD4AFT 000100039X [(birth, prophets), (book, flows)] [A timeless classic., It is a very demanding a... 5.0 [[a, timeless, classic], [it, is, a, very, dem...
1 A1NPNGWBVD9AK3 000100039X [(relevant, catechism), (within, prophets), (t... [This is one of the first (literary) books I r... 5.0 [[this, is, one, of, the, first, literary, boo...
2 AWLFVCT9128JV 000100039X [(forty-eight, almustafa)] [Gibran Khalil Gibran was born in 1883 in what... 5.0 [[gibran, khalil, gibran, was, born, in, one t...
3 AFY0BT42DDYZV 000100039X [(souls, profits), (wordofmouth, twentysix), (... [These days, Kahlil Gibran's "The Prophet" oft... 5.0 [[these, days, kahlil, gibrans, the, prophet, ...
4 A2ZZHMT58ZMVCZ 000100039X [(bear, departs), (others, pillars), (similar,... [A prophet has waited twelve years in a coasta... 5.0 [[a, prophet, has, waited, twelve, years, in, ...

In [13]:
df2.to_pickle('../data/interim/007_pre_processed_dataset_for_excerpts_extraction.p')

Begin Excerpt Extraction


In [14]:
matrix_m01 = df2.as_matrix()

In [15]:
matrix_m02 = np.append(matrix_m01,np.zeros([len(matrix_m01),1]),1)
sample = pd.DataFrame(matrix_m02[0:10])
sample


Out[15]:
0 1 2 3 4 5 6
0 A2XQ5LZHTD4AFT 000100039X [(birth, prophets), (book, flows)] [A timeless classic., It is a very demanding a... 5 [[a, timeless, classic], [it, is, a, very, dem... 0
1 A1NPNGWBVD9AK3 000100039X [(relevant, catechism), (within, prophets), (t... [This is one of the first (literary) books I r... 5 [[this, is, one, of, the, first, literary, boo... 0
2 AWLFVCT9128JV 000100039X [(forty-eight, almustafa)] [Gibran Khalil Gibran was born in 1883 in what... 5 [[gibran, khalil, gibran, was, born, in, one t... 0
3 AFY0BT42DDYZV 000100039X [(souls, profits), (wordofmouth, twentysix), (... [These days, Kahlil Gibran's "The Prophet" oft... 5 [[these, days, kahlil, gibrans, the, prophet, ... 0
4 A2ZZHMT58ZMVCZ 000100039X [(bear, departs), (others, pillars), (similar,... [A prophet has waited twelve years in a coasta... 5 [[a, prophet, has, waited, twelve, years, in, ... 0
5 ADIDQRLLR4KBQ 000100039X [(beautiful, metaphors), (live, prophets)] [Being an Atheist, it may seem strange to some... 5 [[being, an, atheist, it, may, seem, strange, ... 0
6 A281NPSIMI1C2R 000100039X [(pain, waves), (separate, almustafa)] [I am alive like you, and I am standing beside... 5 [[i, am, alive, like, you, and, i, am, standin... 0
7 A2R64CR74I98K3 000100039X [(religious, texts)] [This is a very usefull book that can be used ... 5 [[this, is, a, very, usefull, book, that, can,... 0
8 AF4QKY2R2TD3U 000100039X [(rich, metaphors)] ["Say not, 'I have found the truth,' but rathe... 5 [[say, not, i, have, found, the, truth, but, r... 0
9 A3SMT15X2QVUR8 000100039X [(orphalese, metaphor)] [The Prophet Almustafa waits in the city of Or... 5 [[the, prophet, almustafa, waits, in, the, cit... 0

In [16]:
def identify_excerpt_index_for(review_sentences, pair):
    index = None
    for i in range(len(review_sentences)):
        sentence = review_sentences[i]
        if pair[0] in sentence:
            if pair[1] in sentence:
                index = i
                break
    return index

In [17]:
from tqdm import tqdm

with tqdm(total=len(matrix_m02)) as pbar:
    for i in range(len(matrix_m02)):
        excerpt_indices = []
        actual_sentences = matrix_m02[i][3]
        review_sentences = matrix_m02[i][5]
        pairs = matrix_m02[i][2]
        
        for pair in pairs:
            index_of_sentence_with_pair = identify_excerpt_index_for(review_sentences,pair)
            
            if index_of_sentence_with_pair is not None and index_of_sentence_with_pair not in excerpt_indices:
                excerpt_indices.append(index_of_sentence_with_pair)
    
        excerpts = []
        for index in excerpt_indices:
            excerpts.append(actual_sentences[index])
    
        matrix_m02[i][6] = excerpts
    
        pbar.update(1)


100%|██████████| 249871/249871 [00:06<00:00, 37508.66it/s]

In [18]:
df20 = pd.DataFrame(matrix_m02)
df20.columns = ['reviewerID','asin','pairs','reviewText','overall','norm_sentences','excerpts']
df20.head()


Out[18]:
reviewerID asin pairs reviewText overall norm_sentences excerpts
0 A2XQ5LZHTD4AFT 000100039X [(birth, prophets), (book, flows)] [A timeless classic., It is a very demanding a... 5 [[a, timeless, classic], [it, is, a, very, dem... [There is much that hints at his birth place, ...
1 A1NPNGWBVD9AK3 000100039X [(relevant, catechism), (within, prophets), (t... [This is one of the first (literary) books I r... 5 [[this, is, one, of, the, first, literary, boo... [I believe that was my first taste of spiritua...
2 AWLFVCT9128JV 000100039X [(forty-eight, almustafa)] [Gibran Khalil Gibran was born in 1883 in what... 5 [[gibran, khalil, gibran, was, born, in, one t... [He died of cancer in a New York hospital at t...
3 AFY0BT42DDYZV 000100039X [(souls, profits), (wordofmouth, twentysix), (... [These days, Kahlil Gibran's "The Prophet" oft... 5 [[these, days, kahlil, gibrans, the, prophet, ... [There is no political, religious, or commerci...
4 A2ZZHMT58ZMVCZ 000100039X [(bear, departs), (others, pillars), (similar,... [A prophet has waited twelve years in a coasta... 5 [[a, prophet, has, waited, twelve, years, in, ... [A local seeress who knows him best asks him t...

In [19]:
df30 = df20[['reviewerID','asin','overall','excerpts']]
df30.head()


Out[19]:
reviewerID asin overall excerpts
0 A2XQ5LZHTD4AFT 000100039X 5 [There is much that hints at his birth place, ...
1 A1NPNGWBVD9AK3 000100039X 5 [I believe that was my first taste of spiritua...
2 AWLFVCT9128JV 000100039X 5 [He died of cancer in a New York hospital at t...
3 AFY0BT42DDYZV 000100039X 5 [There is no political, religious, or commerci...
4 A2ZZHMT58ZMVCZ 000100039X 5 [A local seeress who knows him best asks him t...

In [20]:
len(df30)


Out[20]:
249871

In [21]:
df31 = df30[df30['excerpts'].map(lambda excerpts: len(excerpts)) > 0]
len(df31)


Out[21]:
231936

In [25]:
231936/249871


Out[25]:
0.9282229630489333

In [26]:
249871 - 231936


Out[26]:
17935

Get Polarity of Excerpts


In [22]:
import numpy as np
from textblob import TextBlob

def get_overal_polarity(excerpts):
    text = ''.join(excerpts)
    blob = TextBlob(text)
    
    polarity = []
    for sentence in blob.sentences:
        polarity.append(sentence.sentiment.polarity)

    return np.mean(polarity)

In [23]:
df40 = df31.assign(polarity = df31['excerpts'].progress_apply(lambda excerpts:get_overal_polarity(excerpts)))
df40.head()


Progress:: 100%|██████████| 231936/231936 [03:40<00:00, 1052.48it/s]
Out[23]:
reviewerID asin overall excerpts polarity
0 A2XQ5LZHTD4AFT 000100039X 5 [There is much that hints at his birth place, ... 0.332292
1 A1NPNGWBVD9AK3 000100039X 5 [I believe that was my first taste of spiritua... 0.425000
2 AWLFVCT9128JV 000100039X 5 [He died of cancer in a New York hospital at t... 0.133182
3 AFY0BT42DDYZV 000100039X 5 [There is no political, religious, or commerci... 0.155729
4 A2ZZHMT58ZMVCZ 000100039X 5 [A local seeress who knows him best asks him t... 0.096580

In [24]:
df40.to_pickle('../data/interim/007_excerpts_with_polarity.p')

Produce Summaries


In [27]:
def merge_list(summariesList):
    summary = []
    for excerpt in summariesList:
        summary = summary + excerpt
    return summary

In [31]:
df_book_summaries = pd.DataFrame(df40.groupby(['asin'])['excerpts'].progress_apply(list)).reset_index()
df_book_summaries.head()


Progress:: 100%|█████████▉| 48693/48694 [00:01<00:00, 28126.36it/s]
Out[31]:
asin excerpts
0 000100039X [[There is much that hints at his birth place,...
1 0002051850 [[However, as the story progresses, Hemingway'...
2 0002113570 [[That an English woman scientist would journe...
3 0002117088 [[We adopted &quot;Renoir, My Father&quot; as ...
4 000215725X [[William and Olivia stayed in the Fraser resi...

In [32]:
df_book_summaries['excerpts'] = df_book_summaries['excerpts'].progress_apply(lambda summariesList: merge_list(summariesList))
df_book_summaries.head()


Progress:: 100%|██████████| 48693/48693 [00:00<00:00, 388636.69it/s]
Out[32]:
asin excerpts
0 000100039X [There is much that hints at his birth place, ...
1 0002051850 [However, as the story progresses, Hemingway's...
2 0002113570 [That an English woman scientist would journey...
3 0002117088 [We adopted &quot;Renoir, My Father&quot; as b...
4 000215725X [William and Olivia stayed in the Fraser resid...

Evaluation


In [37]:
df40["overall"] = pd.to_numeric(df40["overall"], errors='coerce')
df40["polarity"] = pd.to_numeric(df40["polarity"], errors='coerce')
df40.head()


Out[37]:
reviewerID asin overall excerpts polarity
0 A2XQ5LZHTD4AFT 000100039X 5.0 [There is much that hints at his birth place, ... 0.332292
1 A1NPNGWBVD9AK3 000100039X 5.0 [I believe that was my first taste of spiritua... 0.425000
2 AWLFVCT9128JV 000100039X 5.0 [He died of cancer in a New York hospital at t... 0.133182
3 AFY0BT42DDYZV 000100039X 5.0 [There is no political, religious, or commerci... 0.155729
4 A2ZZHMT58ZMVCZ 000100039X 5.0 [A local seeress who knows him best asks him t... 0.096580

In [62]:
mean_rating_vs_polarity_per_book = pd.DataFrame(df40.groupby(['asin'])[["overall","polarity"]].mean()).reset_index()
mean_rating_vs_polarity_per_book.head()


Out[62]:
asin overall polarity
0 000100039X 5.000000 0.217668
1 0002051850 4.357143 0.094471
2 0002113570 5.000000 0.142857
3 0002117088 5.000000 0.237500
4 000215725X 4.666667 0.190030

In [83]:
### Normalise polarity values to match 
def normalise(polarity):
    
    positeiv_polarity = polarity + 1
    normalised_polarity = (4 * positeiv_polarity)/2
    
    return normalised_polarity

mean_rating_vs_polarity_per_book = mean_rating_vs_polarity_per_book.assign(norm_polarity = mean_rating_vs_polarity_per_book['polarity'].progress_apply(lambda polarity:normalise(polarity)))
mean_rating_vs_polarity_per_book.head()


Progress:: 100%|██████████| 48693/48693 [00:00<00:00, 1190302.22it/s]
Out[83]:
asin overall polarity norm_polarity
0 000100039X 5.000000 0.217668 2.435336
1 0002051850 4.357143 0.094471 2.188942
2 0002113570 5.000000 0.142857 2.285714
3 0002117088 5.000000 0.237500 2.475000
4 000215725X 4.666667 0.190030 2.380060

In [84]:
mean_rating_vs_polarity_per_book = mean_rating_vs_polarity_per_book.assign(norm_overall = mean_rating_vs_polarity_per_book['overall'].progress_apply(lambda overall:overall - 1))
mean_rating_vs_polarity_per_book.head()


Progress:: 100%|██████████| 48693/48693 [00:00<00:00, 1401930.56it/s]
Out[84]:
asin overall polarity norm_polarity norm_overall
0 000100039X 5.000000 0.217668 2.435336 4.000000
1 0002051850 4.357143 0.094471 2.188942 3.357143
2 0002113570 5.000000 0.142857 2.285714 4.000000
3 0002117088 5.000000 0.237500 2.475000 4.000000
4 000215725X 4.666667 0.190030 2.380060 3.666667

In [85]:
import itertools
import numpy as np

x_ratings = np.asarray(list(itertools.chain(*mean_rating_vs_polarity_per_book.as_matrix(columns=mean_rating_vs_polarity_per_book.columns[4:5]))))
y_polarity = np.asarray(list(itertools.chain(*mean_rating_vs_polarity_per_book.as_matrix(columns=mean_rating_vs_polarity_per_book.columns[3:4]))))

In [97]:
import plotly.plotly as py
import plotly.graph_objs as go

trace1 = go.Scatter(x=x_ratings, y=y_polarity, 
                    mode='markers',
                    name='ROC curve (area = %0.2f)' % roc_auc[2]
                   )

trace2 = go.Scatter(x=[0, 4], y=[0, 4], 
                    mode='markers', 
                    line=dict(color='red', width=lw, dash='dash'),
                    showlegend=False)

layout = go.Layout(title='Receiver Operating Characteristic Function',
                   xaxis=dict(title='False Positive Rate'),
                   yaxis=dict(title='True Positive Rate'))

fig = go.Figure(data=[trace1, trace2], layout=layout)

In [96]:
py.iplot(fig)


/Users/falehalrashidi/anaconda3/lib/python3.6/site-packages/plotly/plotly/plotly.py:224: UserWarning:

Woah there! Look at all those points! Due to browser limitations, the Plotly SVG drawing functions have a hard time graphing more than 500k data points for line charts, or 40k points for other types of charts. Here are some suggestions:
(1) Use the `plotly.graph_objs.Scattergl` trace object to generate a WebGl graph.
(2) Trying using the image API to return an image instead of a graph URL
(3) Use matplotlib
(4) See if you can create your visualization with fewer data points

If the visualization you're using aggregates points (e.g., box plot, histogram, etc.) you can disregard this warning.

The draw time for this plot will be slow for clients without much RAM.
/Users/falehalrashidi/anaconda3/lib/python3.6/site-packages/plotly/api/v1/clientresp.py:40: UserWarning:

Estimated Draw Time Slow

Out[96]:

In [99]:
# Create a trace
trace = go.Scatter(
    x = x_ratings,
    y = y_polarity,
    mode = 'markers'
)

layout = go.Layout(title='Correlation between Polarity and Rating',
                   xaxis=dict(title='Ratings'),
                   yaxis=dict(title='Polarity'))

fig = go.Figure(data=[trace], layout=layout)

In [100]:
# Plot and embed in ipython notebook!
py.iplot(fig)


/Users/falehalrashidi/anaconda3/lib/python3.6/site-packages/plotly/plotly/plotly.py:224: UserWarning:

Woah there! Look at all those points! Due to browser limitations, the Plotly SVG drawing functions have a hard time graphing more than 500k data points for line charts, or 40k points for other types of charts. Here are some suggestions:
(1) Use the `plotly.graph_objs.Scattergl` trace object to generate a WebGl graph.
(2) Trying using the image API to return an image instead of a graph URL
(3) Use matplotlib
(4) See if you can create your visualization with fewer data points

If the visualization you're using aggregates points (e.g., box plot, histogram, etc.) you can disregard this warning.

The draw time for this plot will be slow for clients without much RAM.
/Users/falehalrashidi/anaconda3/lib/python3.6/site-packages/plotly/api/v1/clientresp.py:40: UserWarning:

Estimated Draw Time Slow

Out[100]:

In [101]:
mean_rating_vs_polarity_per_book['norm_overall'].corr(mean_rating_vs_polarity_per_book['norm_polarity'])


Out[101]:
0.1662440592733814

In [127]:
df_book_summaries.to_csv("../data/processed/007_book_summaries.csv", sep="\t")
df_book_summaries.to_pickle("../data/processed/007_book_summaries.p")

Some Example Summaries


In [112]:
df_book_summaries['asin'][0:1][0]


Out[112]:
'000100039X'


In [111]:
print(df_book_summaries['excerpts'][0:1][0])


['There is much that hints at his birth place, Lebanon where many of the old prophets walked the Earth and where this book project first germinated most likely.Probably becuase it was written in English originally, the writing flows, it is pleasant to read, and the charcoal drawings of the author decorating the pages is a plus.', 'I believe that was my first taste of spirituality and seemed  at the time more relevant than what I was being force-fed by nuns in  catechism class.', "True wisdom comes from within.The prophet's teaching on love is  particularly relevant to me at this stage of my life:&quot;For even as  love crowns you so shall he crucify you.", 'Even as he ascends to your height and caresses your  tenderest branches that quiver in the sun, So shall he descend to your  roots and shake them in their clinging to the earth.', 'He died of cancer in a New York hospital at the very young age of 48.The Prophet is a story about Almustafa (The Prophet) who after living 12 years in Orphalese is about to depart aboard ship to return to his home.', 'There is no political, religious, or commercial enterprise attached to his name bent on winning souls and/or profits.', 'They are written with the aim of being accessible and immediate to the reader and rely mostly on clear metaphors and vivid imagery.Copies of "The Prophet" are not hard to come by.', "A local seeress who knows him best asks him to share his wisdom so that it will endure for generations to come.So, he reveals his wisdom on love, birth, marriage, children, pain, talking, pleasure, death any so much more.It is a profound work, and here is his advice on marriage so you may judge for yourself:You were born together, and together you shall be forevermore.You shall be together when white wings of death scatter your days.Aye, you shall be together even in the silent memory of God.But let there be spaces in your togetherness,And let the winds of the heavens dance between you.Love one another but make not a bond of love:Let it rather be a moving sea between the shores of your souls.Fill each other's cup but drink not from one cup.Give one another of your bread but eat not from the same loaf.Sing and dance together and be joyous, but let each one of you be alone,Even as the strings of a lute are alone though they quiver with the same music.Give your hearts, but not into each other's keeping.For only the hand of Life can contain your hearts.And stand together, yet not too near together:For the pillars of the temple stand apart,And the oak tree and the cypress grow not in each other's shadow.Its not a little similar to theTao Te Ching: A New English Version (Perennial Classics)where a border guard recognises Lao Tzu, and asks him to share his wisdom as he goes into exile.", 'Decadence is not suggested, but the basic purpose of Gibran\'s legacy is to tell us that life is short and must be lived without regrets.It is a book that includes such beautiful metaphors and velvetty language that you are always sucked into reading "just one more section."', "It states what any prophecy should, and allows the religious aspects of the beliefs to take the backseat to the love of life and aspirations.Buy, read, and live by The Prophet's words.", 'This is a very usefull book that can be used as a secondary source for your primary religious texts.', 'Gibran is justly famous for rich metaphors that brilliantly highlight the pursuit of Truth and Goodness amidst all the darkness and light of human nature.This is a book to read alone or with a partner, to give and receive, to go back to again and again.Note: the excerpts available in Amazon\'s "Search This Book" facility unfortunately do not do "The Prophet" justice, since only the book\'s introduction is included, whereas the wisdom does not begin unfolding until a bit later.']

In [119]:
df_book_summaries['asin'][1:2][1]


Out[119]:
'0002051850'


In [121]:
print(df_book_summaries['excerpts'][1:2][1])


['However, as the story progresses, Hemingway\'s usage of the King James-style "Thee" and "Thou" to indicate that a more formal Spanish dialect is being used becomes distractingly gimmicky and wore quite thin by the end of the book.', "What little is left is a cause whose means and ends don't seem to differ from the alternative, and an appeal to virtues of loyalty to the band, or one's responsibility to follow his duty.", "The trouble is, these appeals are made among characters who Jordan - as Hemingway's voice - often considers untrustworthy, repugnant and treacherous.", 'Another consistent theme found in Hemingway is courage under fire or dire circumstances, whether it is in the bull ring, behind enemy lines, or hunting man-eaters in the green hills of Africa.', 'The chief protagonist is an American named Robert Jordon who has been tasked to blow up a bridge behind enemy lines in the Spanish mountains.', 'Some say that Maria represents Spain and her gang rape represent the despoilage of Spain by the Fascists.Robert and Maria fall in love at first sight.', 'Many Spanish words and phrases were translated literally word for word which gave a sense of the Spanish but sounds archaic and stilted to our English hearing ears.', 'For example, the Spanish characters in the novel referred to each other as thee and thou.', 'The traditional second person singular in English is "thou/thee/thy".', "Like Crane's The Red Badge of Courage, this book deals with the psychological and human ramifications of war.The hero is Robert Jordan, an American idealist fighting in Spain on the side of the `Republic,' or Communist, party.", 'The quote from John Donne "never send to know for whom the bell tolls"  looms ominously throughout the novel, and the battle of the bridge and the final, chilling moment of truth for Robert Jordan drives home the harrowing fact that is the reality of war: "it tolls for thee."', "Hemingway's magnificent novel has something for everyone:  an action tale, an anti-war protest, a love story, subtle ironies, a magnificent short story within the novel, political criticism of communism and fascism, a philosophy of life, and beautiful descriptions of life that leave you gasping.", 'But with the action packed into that time and extensive use of flashbacks, it becomes a tapestry of all humankind.', 'After you start to notice the individual threads in the tapestry, be sure to step back and see the whole.', 'For the remarkably balanced and connected artistry of the themes and directions in the story is what makes this book great.If you are disturbed by descriptions of violence, brutality, and inhumanity, you will not enjoy this book.Robert Jordan is an American who has joined the republican side of the Spanish civil war.', 'Now, he is transformed into a demolitions expert who can blow up trains and bridges.', "The reader is given a fascinating in-depth look into his psyche, where he reflects on his fellow soldiers, plans for warmaking and the justifications and rationalizations along the war trail.Pablo's another character who is burned into your consciousness long after you've finished reading.", "There is an incredible flashback story detailing Pablo's orchestration of the sadistic torture and humiliation of a group of wealthy fascist sympathizers.", "The vivid description involved in this passage is nothing short of extraordinary.Of course the true democrats, fighting for the Republic against the well financed and better armed fascist military, eventually lose the war, but having that foreknowledge does nothing to detract from the cliff-hanger like feelings brought about by the various battles and journeys the rebel crew embarks on.This is the best, and most accessible of all of Hemingway's works.", "He doesn't appear to be completely sold on this cause in Spain, and while he seemingly never leaves this area physically, mentally he challenges the essence of being there time and time again so he can reassure himself that this is the just thing to do, and for the right cause.This is not an easy read in the sense that there is much more to what is going on than meets the eye.", '(The Western democracies - who might have prevented Spain from going fascist - followed a pusillanimous "hands off" policy which only emboldened the insurgents and their supporters.)', 'Into this vortex came many writers and intellectuals.', 'From this self-inflicted literary ambush there is no escape for Hemingway: you either need excellent descriptive prose or superb psychological insight to carve a good story from such crooked timber, for, after all, what else is left to describe in such a situation save inner musings and the outer landscape?The prose is the next problem.', '174]"So a woman like that Pilar practically pushed this girl into your sleeping bag and what happens?', 'And of course, "For Whom The Bell Tolls," set against the brutal violence of the Spanish Civil War, is probably the definitive work of fiction about this pivotal period in European, and world history.Generalissimo Francisco Franco\'s fascist troops invaded Spain in July 1936 in order to overthrow the newly established Republic headed by the Popular Front, (composed of liberal democrats, socialists, anarchists, trade unionists, communists and secularists.', ")The country was basically divided into Red Spain - the Republicans, and Black Spain, represented by the landed elite, committed to a feudal system and Franco's cause, Fascists, the urban bourgeoisie, the Roman Catholic Church, and other conservative sectors.", 'Those who fought with the Abraham Lincoln Brigade, from 1937 through 1938, believed the defense of the Republic represented the last hope of stopping the spread of international fascism.', 'Most of the volunteers were not political, but idealists who were determined to "make Madrid the tomb of fascism."', "Hemingway's protagonist Robert Jordan, an American professor of Spanish from Missoula, Montana, was one such volunteer.Robert Jordan, an explosives expert, has been ordered to make contact with a small band of partisan fighters in the Guadarrama Mountains of fascist controlled southern Spain.", "He undergoes several changes during the 3 days and 3 nights in which the story takes place.Pilar is Pablo's woman, an extremely strong and savvy person, she is steeped in gypsy lore and superstition, and is probably the novel's most colorful character.", 'When Robert Jordon joins them, Pilar takes the leadership position over from Pablo, whom she no longer trusts, but still loves.', "Pilar, relates various war stories, and anecdotes, which reflect the cruelty and inhumanity of civil war.Mar&iacute;a's life was shattered by the outbreak of the war.", 'Since her mother was not a Republican, but a devout Catholic, she shouted, "Viva my husband, the town\'s mayor," before she died, rather than the more typical, "Viva La Republica!"', 'Hemingway worked as a correspondent in Spain during the Civil War, as a reporter for the North American Newspaper Alliance (NANA).', 'When it comes to men at war, the book shines.A technique that I found interesting was the way that Hemingway created the absent character of Kashkin.', 'The sex, the declarations of love, the intimacy, it all seems hollow.In every other place in the novel there is complexity, nuance.', 'And in a novel that creates such a real portrait of war and moral ambiguity; complexity in loyalty, politics, allegiance, nationality, and idealism, to offer the reader such an ordinary, pop-song rendition of love nearly justifies skipping every section where one sees the words "little rabbit.', '"Hemingway attempts to integrate language into the story by employing the occasional Spanish word along with an antiquated sort of English, full of thou and  thee.', 'The whole novel, except for some flashbacks and reminiscences of various characters, covers just a few days.Although the novel focuses on a small number of characters in a fairly compressed time period, Hemingway attains a real epic feel with this book.', 'It offers a compelling perspective on war from the viewpoint of guerrilla forces, rather than conventional forces (interested readers might want to check out Mao Tse-Tung\'s "On Guerrilla Warfare" for some theoretical and historical perspective).', 'Other significant issues include loyalty, leadership, communications, military hardware, the impact of weather and terrain, and the connection between guerrilla and conventional forces.', 'The novel follows his experiences with a band of guerrilla fighters as he undertakes a mission to blow up a strategic bridge.', 'Thus the book should interest not just lovers of literature, but also serious military professionals and students of the history of warfare.Hemingway offers a grim and graphic look at the brutality of 20th century warfare.', "In Hemingway's world storytelling is as essential a human activity as eating, fighting, and lovemaking.Hemingway's writing appeals to all the senses as he creates some vivid scenes.", 'For intriguing companion texts that also deal with the Spanish Civil War, I recommend "Spain\'s Cause Was Mine: A Memoir of an American Medic in the Spanish Civil War," by Hank Rubin, and "The Confessions of Senora Francesca Navarro and Other Stories," by Natalie L. M. Petesch.', 'Once the fiercest of the Republicans, he is now well-fed and content in his mountain hideaway, has a dozen or so horses that make him rich, and knows that the actions contemplated by Roberto will bring an end to his safety.', 'She describes in memorable detail her love affair with a matador in Valencia, and how she drank cold beer with the sweat dripping off the glass while he napped in the room behind her.', 'Along the way Jordan will learn the revolting pasts of several of the guerrillas, fall in love with one of them, and spend quite some time meditating on "truths" he was once sure he knew.No sooner has Jordan met with the guerrillas than he discovers that one of them, fearful of being hunted down by the fascist forces, stands against him and threatens to take the entire group away.', 'Even as they curse and spit at the fascists several of these guerrillas communicate, through their stories, arguments against the futility and cruelty of the war they have willfully taken up.', 'The answers are so well bound with the narrative that often one hardly notices a metaphysical discussion has occurred, but those who give the text a second look will find a philosophical subtext as gripping as the plot line.As for the story,  Hemingway creates out of his assortment of characters a narrative of breathtaking beauty.', 'Jordan, an American, is in Spain fighting on the side of the Republicans in 1937 during the Spanish Civil War.', 'He is a Spanish teacher from Montana who loves Spain, and is fighting, carrying out explosives missions, against the Fascists, who have a vast war machine.At the beginning of the novel, Robert Jordan is teamed up with a band of guerrilla fighters in the mountains near a bridge he must blow as part of a Republican offensive.', 'Other members of the band include Pablo, a formerly great fighter, we are told, who has now &quot;gone bad.&quot; He cares primarily for his horses.', "The story is brutal and demonstrates the atrocities committed by the Republicans in the war as they bludgeon the town's Fascists to save bullets.", "She is the &quot;love interest.&quot;I love Hemingway's voice, and this novel continues to demonstrate his ability, with that spare, journalistic style, to narrate loneliness like no one else.", 'The seemingly simplistic style evokes a real pathos, and is especially suited to writing of war and the human spiritual conflicts such situations impose upon its participants.', 'I laughed allowed at the absudities, but was struck by the dire consequences of these ridiculous desicions and actions.', "These situations show the war machine's indifference to individual human life and the ridiculous scenarios that arise from various leaders' individual conceits and worries.I think that the book's time frame of only three days makes a strong point about war and the people one serves with.", 'For the reader, the band in the mountains are basically the only people Robert Jordan knows (though there are brief flashbacks).', 'Perhaps the bell tolled for Ernest and his bid for literary greatness with the passing of this book.', 'To the extent that Hemingway wanted to reach the apex of truth in storytelling, and to find a suitable language to express it, this book is a great achievement.', 'Hemingway chooses Spanish modulation of English words to power his narrative - from the start, the reader senses the honor, strength, and spirit every sentence spoken carries with it.Ernest is not just translating from Spanish to wow us; the reader feels that when he wrote this book, Spanish was the only language that could express his happiness, his sadness, his pleasure and suffering.', "Maybe Spanish contained the words and meanings Hemingway and those involved in the war for Spanish liberty sought desperately every night by the campfire - words of fear and love.To me, this book was Hemingway's most significant attempt at articulating his life philosophy.", "You might argue that El Sordo's last stand is followed by Hemingway's personal literary last stand - against fascism, fear, and life's various illusions."]

In [122]:
df_book_summaries['asin'][2:3][2]


Out[122]:
'0002113570'


In [123]:
df_book_summaries['excerpts'][2:3][2]


Out[123]:
['That an English woman scientist would journey to Tanzania to engage in this type of research is unusual and certainly puts her at "the top of her class".She follows the lives and behavior patterns of her subjects until her research sounds like a Michener novel with its generational emphasis and timelines of family heritage.',
 'The squabbles and fighting behavior could be that of any large Homo Sapien family.',
 'Jane Goodall deserves every accolade she gets for bringing us a lens through which to observe another geneological line of a species that has developed from our common ancestors.Her work suggests that we should rethink our medical research toward more humane treatment of these animals whose behavior is  too similar to ours to ignore.']

In [124]:
df_book_summaries['asin'][3:4][3]


Out[124]:
'0002117088'


In [126]:
df_book_summaries['excerpts'][3:4][3]


Out[126]:
["We adopted &quot;Renoir, My Father&quot; as bedside reading while my wife was recovering from hip surgery, and (aside, perhaps, from &quot;Goodnight, Moon,&quot;) I can't imagine better therapy.",
 'None of the rough edges have been smoothed off which, come to think of it, is just as Claude would have wanted: Jean speaks with his own voice.',
 'There is even an index of sorts (I assume from the original translator) but it is patchy and incomplete.',
 "That last is a shortcoming, but forgivable in light of the book's other virtues."]

In [132]:
df_book_summaries['asin'][6:7][6]


Out[132]:
'000222383X'


In [131]:
df_book_summaries['excerpts'][6:7][6]


Out[131]:
["The Patrick O'Brian naval series of books are an acquired taste.",
 'While the books appear on the outside to be simple naval adventure tales, they are really deep studies in character development of a British naval officer and his best friend/ship surgeon/intelligence operative.The Mauritius Command is one of the best books in the series.',
 "As is usually the case, despite great achievements in the past, Jack is shackled and insufficiently rewarded by his superiors in the admiralty, and his supposed connections, through his father in the Parliament, are of little help.O'Brian seems to assume a good bit of nautical knowledge by the reader, and this landlubber sometimes got a little lost in the naval warfare scenes.",
 "The most engaging aspects of the novel seemed to me the differences in character, and the seething one-upsmanship among the various ship captains under Jack's overall command including Captains Pym, Clonfert and Corbett.",
 "The problem was, just when the author whets your appetite for some great internal conflict or drama between the brutal Corbett and the popular Clonfert, Corbett is sent from the area.Moreover, the final battle scenes are almost thrown together in summary form, as if the culmination of the mission did not really concern O'Brian as much as the hassles of getting there, and so there was a bit of a letdown at the end.",
 'For readers unfamiliar with these books, they describe the experiences of a Royal Navy officer and his close friend and traveling companion, a naval surgeon.',
 "Aubrey's navy is an organization reflecting its society; an order based on deference, rigid hierarchy, primitive notions of honor, favoritism, and very, very corrupt.",
 "In some ways, it was a ruthless meritocracy whose structure and success anticipates the great expansion of government power and capacity seen in the rest of the 19th century.O'Brian is also the great writer about male friendship.",
 'He has not one but two major protagonists.',
 "This is quite difficult and I'm not aware of any other writer who has been able to accomplish such sustained development of two major protagonists for such a prolonged period."]

In [120]:
## END_OF_FILE

In [ ]: