This notebook is used for identifying and grouping domain synonyms in the book reviews of each book. The approach relies on creating an NLTK-based context per book, which is used to compare only nouns (via WordNet synsets and Wu-Palmer similarity). Nouns that appear to have a highly similar context are grouped together under the same name (either the one's or the other's).
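As a quick illustration of the kind of comparison used throughout this notebook (a minimal sketch with invented example words, assuming the WordNet corpus has been downloaded; the 0.9 threshold matches the one applied later):
from nltk.corpus import wordnet as wn
# Wu-Palmer similarity between the first noun senses of two candidate synonyms
novel = wn.synsets("novel", pos=wn.NOUN)[0]
book = wn.synsets("book", pos=wn.NOUN)[0]
print(novel.wup_similarity(book))  # a score above 0.9 would group the two nouns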
In [1]:
# For monitoring duration of pandas processes
from tqdm import tqdm
# To avoid RuntimeError: Set changed size during iteration
tqdm.monitor_interval = 0
# Register `pandas.progress_apply` and `pandas.Series.map_apply` with `tqdm`
# (can use `tqdm_gui`, `tqdm_notebook`, optional kwargs, etc.)
tqdm.pandas(desc="Progress:")
# Now you can use `progress_apply` instead of `apply`
# and `progress_map` instead of `map`
# can also groupby:
# df.groupby(0).progress_apply(lambda x: x**2)
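As a toy sketch (not part of the pipeline) of what the registration above enables:
import pandas as pd  # imported in the next cell; repeated here so the sketch is self-contained
toy = pd.Series(range(1000))
toy.progress_apply(lambda x: x ** 2)  # same result as .apply, but with a progress bar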
In [2]:
import pandas as pd
df0 = pd.read_csv("../data/interim/002_keyed_nouns.csv", sep="\t", low_memory=False)
df0.head()
Out[2]:
Convert reviewText back from its serialized string form to a list of strings.
In [3]:
def convert_text_to_list(review):
    # Strip list brackets, quotes and tabs, then split on commas
    return review.replace("[", "").replace("]", "").replace("'", "").replace("\t", "").split(",")

# Convert the "reviewText" field back to a list
df0['reviewText'] = df0['reviewText'].astype(str)
df0['reviewText'] = df0['reviewText'].progress_apply(convert_text_to_list)
df0['reviewText'].head()
Out[3]:
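For reference, a minimal example of what the conversion produces (the sample string is invented):
sample = "['story', 'plot', 'character']"
convert_text_to_list(sample)
# -> ['story', ' plot', ' character'] -- the leading spaces survive the split,
# which is why words are cleaned with .replace(" ", "") further down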
Split the unique key into userId and asin.
In [4]:
df1 = pd.DataFrame(df0.uniqueKey.str.split('##', n=1).tolist(), columns=['userId', 'asin'])
df1.head()
Out[4]:
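A toy sketch of the split on invented keys (assuming the userId##asin layout used above):
pd.Series(['AEXAMPLEUSER##0001', 'BEXAMPLEUSER##0002']).str.split('##', n=1).tolist()
# -> [['AEXAMPLEUSER', '0001'], ['BEXAMPLEUSER', '0002']]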
In [5]:
df_reviewText = pd.DataFrame(df0['reviewText'])
df_reviewText.head()
Out[5]:
Create a new dataframe with userId, asin and reviewText.
In [6]:
df_new = pd.concat([df1, df_reviewText], axis=1)
In [7]:
df_new.head()
Out[7]:
Drop userId and group the reviews of the same book by asin.
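The groupby-to-list pattern used in the next cells, sketched on invented data:
toy = pd.DataFrame({'asin': ['a', 'a', 'b'],
                    'reviewText': [['plot'], ['story'], ['hero']]})
toy.groupby('asin')['reviewText'].apply(list)
# a -> [['plot'], ['story']]
# b -> [['hero']]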
In [8]:
df_books = df_new.drop(columns=['userId'])
In [9]:
df_books_bigReviews = df_books.groupby(['asin'])['reviewText'].progress_apply(list)
In [10]:
df_books_bigReviews_df = pd.DataFrame(df_books_bigReviews).reset_index()
df_books_bigReviews_df.head()
Out[10]:
In [11]:
def merge_list(reviewsList):
    # Flatten the per-book list of reviews into a single list of unique words
    new_list = []
    for review in reviewsList:
        new_list = new_list + review
    return list(set(new_list))
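A minimal example of the merge on invented input; note that set() drops duplicates and makes the output order arbitrary:
merge_list([['plot', 'story'], ['story', 'ending']])
# -> e.g. ['ending', 'plot', 'story'], in no guaranteed order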
In [12]:
df_books_bigReviews_single_list_df = df_books_bigReviews.progress_apply(merge_list)
df_books_bigReviews_single_list_df.head()
Out[12]:
In [13]:
df_books_vs_bigreviews = pd.DataFrame(df_books_bigReviews_single_list_df).reset_index()
df_books_vs_bigreviews.head()
Out[13]:
In [14]:
df2 = df_books_vs_bigreviews
In [15]:
len(df2.reviewText[0])
Out[15]:
In [16]:
from nltk.corpus import wordnet as wn
from itertools import product

def get_synonyms_dict(bigReview, theta):
    synonyms = {}
    for i in range(len(bigReview)):
        wordx = bigReview[i]
        for j in range(i, len(bigReview)):
            wordy = bigReview[j]
            # don't compare with the same word
            if wordx == wordy:
                continue
            sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)
            maxscore = 0.0
            for k, l in product(sem1, sem2):
                score = k.wup_similarity(l)  # Wu-Palmer Similarity
                if score is not None and score > maxscore:
                    maxscore = score
            if maxscore > theta and wordy not in synonyms:
                synonyms[wordx] = wordy
    return synonyms
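A small sketch of the function on invented input. Exact scores depend on the installed WordNet version, but 'car' and 'automobile' share a synset, so their Wu-Palmer score is 1.0:
get_synonyms_dict(['car', 'automobile', 'dog'], 0.9)
# -> {'car': 'automobile'} ('dog' scores well below 0.9 against the others)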
From this point onwards the computational cost increases dramatically, so I reduce the dataset to keep only 1000 of the 59324 books.
In [119]:
# Get Synonym Dicts per Book Reviews
df3 = df2[0:1000].assign(synDict = df2['reviewText'][0:1000].progress_apply(lambda big_review: get_synonyms_dict(big_review, 0.9)))
df3.head()
Out[119]:
In [132]:
df4 = df3.drop(columns=['reviewText'])
df4.head()
Out[132]:
In [133]:
df5 = pd.merge(df_new[0:1000], df4, how='inner', on='asin')
df5.head()
Out[133]:
In [134]:
matrix_m01 = df5.to_numpy()  # DataFrame.as_matrix() was removed in recent pandas versions
In [135]:
# Replace each word with its synonym-group representative, if one exists
for i in range(len(matrix_m01)):
    new_list = []
    for word in matrix_m01[i][2]:
        clean_word = word.replace(" ", "")
        if clean_word in matrix_m01[i][3]:
            new_list.append(matrix_m01[i][3][clean_word])
        else:
            new_list.append(clean_word)
    matrix_m01[i][2] = new_list
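The same replacement can be written more compactly with dict.get, sketched here on invented data:
syn = {'novel': 'book'}
[syn.get(w.replace(" ", ""), w.replace(" ", "")) for w in ['novel', ' plot']]
# -> ['book', 'plot']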
In [137]:
df_final = pd.DataFrame(matrix_m01)
df_final.head()
Out[137]:
In [139]:
df_final.columns = ['userId','asin', 'reviewText', 'synDict']
df_final.head()
Out[139]:
In [140]:
df_final = df_final.drop(columns=['synDict'])
In [141]:
df_final.head()
Out[141]:
In [142]:
df_final.to_csv("../data/interim/004_synonyms_grouped_1k.csv", sep='\t', header=True, index=False)
In [143]:
df_final.to_pickle("../data/interim/004_synonyms_grouped_1k.p")
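To reload the results later, note that read_csv brings reviewText back in its serialized string form, while the pickle preserves the actual lists:
df_check = pd.read_pickle("../data/interim/004_synonyms_grouped_1k.p")
df_check.head()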
In [129]:
# END OF FILE