Pos-Tagging & Feature Extraction

Following normalisation, we can now proceed to the process of pos-tagging and feature extraction. Let's start with pos-tagging.

POS-tagging

Part-of-speech tagging is one of the most important text analysis tasks used to classify words into their part-of-speech and label them according the tagset which is a collection of tags used for the pos tagging. Part-of-speech tagging also known as word classes or lexical categories.

The nltk library provides its own pre-trained POS-tagger. Let's see how it is used.



In [1]:

    
import pandas as pd
df0 = pd.read_csv("../data/interim/001_normalised_keyed_reviews.csv", sep="\t", low_memory=False)
df0.head()









    Out[1]:







  
    
      
      uniqueKey
      reviewText
    
  
  
    
      0
      A2XQ5LZHTD4AFT##000100039X
      ['timeless', 'classic', 'demanding', 'assuming...
    
    
      1
      AF7CSSGV93RXN##000100039X
      ['first', 'read', 'prophet', 'kahlil', 'gibran...
    
    
      2
      A1NPNGWBVD9AK3##000100039X
      ['one', 'first', 'literary', 'books', 'recall'...
    
    
      3
      A3IS4WGMFR4X65##000100039X
      ['prophet', 'kahlil', 'gibrans', 'best', 'know...
    
    
      4
      AWLFVCT9128JV##000100039X
      ['gibran', 'khalil', 'gibran', 'born', 'one th...



In [2]:

    
# For monitoring duration of pandas processes
from tqdm import tqdm, tqdm_pandas

# To avoid RuntimeError: Set changed size during iteration
tqdm.monitor_interval = 0

# Register `pandas.progress_apply` and `pandas.Series.map_apply` with `tqdm`
# (can use `tqdm_gui`, `tqdm_notebook`, optional kwargs, etc.)
tqdm.pandas(desc="Progress:")

# Now you can use `progress_apply` instead of `apply`
# and `progress_map` instead of `map`
# can also groupby:
# df.groupby(0).progress_apply(lambda x: x**2)



In [3]:

    
def convert_text_to_list(review):
    return review.replace("[","").replace("]","").replace("'","").replace("\t","").split(",")



In [4]:

    
# Convert "reviewText" field to back to list
df0['reviewText'] = df0['reviewText'].astype(str)
df0['reviewText'] = df0['reviewText'].progress_apply(lambda text: convert_text_to_list(text));
df0['reviewText'].head()









    



Progress:: 100%|██████████| 582711/582711 [00:17<00:00, 33284.66it/s]






    Out[4]:





0    [timeless,  classic,  demanding,  assuming,  t...
1    [first,  read,  prophet,  kahlil,  gibran,  th...
2    [one,  first,  literary,  books,  recall,  rea...
3    [prophet,  kahlil,  gibrans,  best,  known,  w...
4    [gibran,  khalil,  gibran,  born,  one thousan...
Name: reviewText, dtype: object



In [5]:

    
df0['reviewText'][12]









    Out[5]:





['father',
 ' huge',
 ' book',
 ' collection',
 ' remember',
 ' around',
 ' twelve',
 ' found',
 ' book',
 ' amidst',
 ' sea',
 ' books',
 ' thought',
 ' pretty',
 ' small',
 ' read',
 ' one',
 ' sit',
 ' changed',
 ' life',
 ' talks',
 ' life',
 ' love',
 ' friendship',
 ' death',
 ' etc',
 ' answers',
 ' spend',
 ' entire',
 ' life',
 ' searching',
 ' book',
 ' tells',
 ' things',
 ' already',
 ' know',
 ' life',
 ' somehow',
 ' keep',
 ' back',
 ' heads',
 ' dont_have',
 ' right',
 ' thing',
 ' ignoring',
 ' make',
 ' us',
 ' human',
 ' default',
 ' no_place',
 ' go',
 ' book',
 ' doesnt_follow',
 ' remember',
 ' college',
 ' teacher',
 ' despised',
 ' much',
 ' class',
 ' talking',
 ' favorite',
 ' book',
 ' came',
 ' turn',
 ' said',
 ' prophet',
 ' kahlil',
 ' gibran',
 ' got',
 ' excited',
 ' said',
 ' thats',
 ' favorite',
 ' book',
 ' like',
 ' bible',
 ' remember',
 ' thinking',
 ' something',
 ' important',
 ' common',
 ' someone',
 ' despise',
 ' guess',
 ' thats',
 ' thing',
 ' life',
 ' never_forget']



In [6]:

    
import nltk
nltk.__version__









    Out[6]:





'3.2.5'



In [7]:

    
# Split negs
def split_neg(review):
    new_review = []
    for token in review:
        if '_' in token:
            split_words = token.split("_")
            new_review.append(split_words[0])
            new_review.append(split_words[1])
        else:
            new_review.append(token)
    return new_review



In [8]:

    
df0["reviewText"] = df0["reviewText"].progress_apply(lambda review: split_neg(review))
df0["reviewText"].head()









    



Progress:: 100%|██████████| 582711/582711 [00:15<00:00, 37933.46it/s]






    Out[8]:





0    [timeless,  classic,  demanding,  assuming,  t...
1    [first,  read,  prophet,  kahlil,  gibran,  th...
2    [one,  first,  literary,  books,  recall,  rea...
3    [prophet,  kahlil,  gibrans,  best,  known,  w...
4    [gibran,  khalil,  gibran,  born,  one thousan...
Name: reviewText, dtype: object



In [9]:

    
### Remove Stop Words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(review):
    return [token for token in review if not token in stop_words]



In [10]:

    
df0["reviewText"] = df0["reviewText"].progress_apply(lambda review: remove_stopwords(review))
df0["reviewText"].head()









    



Progress:: 100%|██████████| 582711/582711 [00:12<00:00, 47896.49it/s]






    Out[10]:





0    [timeless,  classic,  demanding,  assuming,  t...
1    [first,  read,  prophet,  kahlil,  gibran,  th...
2    [one,  first,  literary,  books,  recall,  rea...
3    [prophet,  kahlil,  gibrans,  best,  known,  w...
4    [gibran,  khalil,  gibran,  born,  one thousan...
Name: reviewText, dtype: object

Follow this link for more info on the tagger: https://nlp.stanford.edu/software/tagger.shtml#History



In [11]:

    
# from nltk.tag import StanfordPOSTagger
# from nltk import word_tokenize

# # import os
# # os.getcwd()

# # Add the jar and model via their path (instead of setting environment variables):
# jar = '../models/stanford-postagger-full-2017-06-09/stanford-postagger.jar'
# model = '../models/stanford-postagger-full-2017-06-09/models/english-left3words-distsim.tagger'

# pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')



In [12]:

    
# def pos_tag(review):
#     if(len(review)>0):
#         return pos_tagger.tag(review)



In [13]:

    
## Example
# text = pos_tagger.tag(word_tokenize("What's the airspeed of an unladen swallow ?"))
# print(text)



In [14]:

    
# tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: pos_tag(review)))
# tagged_df.head()

Unfortunatelly, this tagger, though much better and accurate, takes a lot of time. In order to process the above data set it would need close to 3 days of running.



In [15]:

    
# from textblob import TextBlob

# def blob_pos_tagger(review):
#     blob = TextBlob(" ".join(review))
#     return blob.tags

# blob_tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: blob_pos_tagger(review)))
# blob_tagged_df.head()



In [16]:

    
nltk_tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: nltk.pos_tag(review)))
nltk_tagged_df.head()









    



Progress:: 100%|██████████| 582711/582711 [1:08:29<00:00, 141.81it/s]






    Out[16]:







  
    
      
      reviewText
    
  
  
    
      0
      [(timeless, NN), ( classic, JJ), ( demanding, ...
    
    
      1
      [(first, RB), ( read, JJ), ( prophet, NNP), ( ...
    
    
      2
      [(one, CD), ( first, NNP), ( literary, JJ), ( ...
    
    
      3
      [(prophet, NN), ( kahlil, NN), ( gibrans, VBZ)...
    
    
      4
      [(gibran, NN), ( khalil, NNP), ( gibran, NNP),...

Thankfully, nltk provides documentation for each tag, which can be queried using the tag, e.g., nltk.help.upenn_tagset(‘RB’), or a regular expression. nltk also provides batch pos-tagging method for document pos-tagging:



In [17]:

    
nltk_tagged_df['reviewText'][8]









    Out[17]:





[('prophet', 'NN'),
 (' dispenses', 'NNS'),
 (' ultimate', 'VBP'),
 (' wisdom', 'NN'),
 (' loved', 'VBN'),
 (' ones', 'NNS'),
 (' bids', 'NNS'),
 (' fare', 'VBP'),
 (' well', 'NNP'),
 (' khalil', 'NNP'),
 (' gibran', 'NNP'),
 (' defines', 'VBZ'),
 (' never', 'RB'),
 (' words', 'NNS'),
 (' define', 'VBP'),
 (' appropriately', 'RB'),
 (' good', 'JJ'),
 (' sense', 'JJ'),
 (' define', 'NN'),
 (' discovered', 'VBN'),
 (' book', 'NNP'),
 (' back', 'NNP'),
 (' took', 'NNP'),
 (' long', 'NNP'),
 (' time', 'NNP'),
 (' read', 'JJ'),
 (' since', 'NN'),
 (' refused', 'VBD'),
 (' rush', 'JJ'),
 (' read', 'JJ'),
 (' lesson', 'NN'),
 (' time', 'NN'),
 (' understanding', 'VBG'),
 (' best', 'JJS'),
 (' ability', 'NN'),
 (' found', 'NN'),
 (' way', 'NNP'),
 (' life', 'NNP'),
 (' words', 'NNS'),
 (' could', 'MD'),
 (' read', 'VB'),
 (' everyday', 'JJ'),
 (' time', 'JJ'),
 (' words', 'NNS'),
 (' would', 'MD'),
 (' dispense', 'VB'),
 (' new', 'NNP'),
 (' lesson', 'NNP'),
 (' like', 'NNP'),
 (' never', 'NNP'),
 ('ending', 'VBG'),
 (' treasure', 'NN')]

The list of all possible tags appears below:

Tag	Description
CC	Coordinating conjunction
CD	Cardinal number
DT	Determiner
EX	ExistentialĘthere
FW	Foreign word
IN	Preposition or subordinating conjunction
JJ	Adjective
JJR	Adjective, comparative
JJS	Adjective, superlative
LS	List item marker
MD	Modal
NN	Noun, singular or mass
NNS	Noun, plural
NNP	Proper noun, singular
NNPS	Proper noun, plural
PDT	Predeterminer
POS	Possessive ending
PRP	Personal pronoun
PRP*	Possessive pronoun
RB	Adverb
RBR	Adverb, comparative
RBS	Adverb, superlative
RP	Particle
SYM	Symbol
TO	to
UH	Interjection
VB	Verb, base form
VBD	Verb, past tense
VBG	Verb, gerund or present participle
VBN	Verb, past participle
VBP	Verb, non-3rd person singular present
VBZ	Verb, 3rd person singular present
WDT	Wh-determiner
WP	Wh-pronoun
WP*	Possessive wh-pronoun
WRB	Wh-adverb

Notice: where you see * replace with $.



In [18]:

    
## Join with Original Key and Persist Locally to avoid RE-processing
uniqueKey_series_df = df0[['uniqueKey']]
uniqueKey_series_df.head()









    Out[18]:







  
    
      
      uniqueKey
    
  
  
    
      0
      A2XQ5LZHTD4AFT##000100039X
    
    
      1
      AF7CSSGV93RXN##000100039X
    
    
      2
      A1NPNGWBVD9AK3##000100039X
    
    
      3
      A3IS4WGMFR4X65##000100039X
    
    
      4
      AWLFVCT9128JV##000100039X



In [19]:

    
pos_tagged_keyed_reviews = pd.concat([uniqueKey_series_df, nltk_tagged_df], axis=1);
pos_tagged_keyed_reviews.head()









    Out[19]:







  
    
      
      uniqueKey
      reviewText
    
  
  
    
      0
      A2XQ5LZHTD4AFT##000100039X
      [(timeless, NN), ( classic, JJ), ( demanding, ...
    
    
      1
      AF7CSSGV93RXN##000100039X
      [(first, RB), ( read, JJ), ( prophet, NNP), ( ...
    
    
      2
      A1NPNGWBVD9AK3##000100039X
      [(one, CD), ( first, NNP), ( literary, JJ), ( ...
    
    
      3
      A3IS4WGMFR4X65##000100039X
      [(prophet, NN), ( kahlil, NN), ( gibrans, VBZ)...
    
    
      4
      AWLFVCT9128JV##000100039X
      [(gibran, NN), ( khalil, NNP), ( gibran, NNP),...



In [20]:

    
pos_tagged_keyed_reviews.to_csv("../data/interim/002_pos_tagged_keyed_reviews.csv", sep='\t', header=True, index=False);



In [21]:

    
# Save a dictionary into a pickle file.
pos_tagged_keyed_reviews.to_pickle("../data/interim/002_pos_tagged_keyed_reviews.p")

Nouns

Nouns generally refer to people, places, things, or concepts, e.g.: woman, Scotland, book, intelligence. Nouns can appear after determiners and adjectives, and can be the subject or object of the verb.

The simplified noun tags are N for common nouns like book, and NP for proper nouns like Scotland.



In [22]:

    
def noun_collector(word_tag_list):
    if(len(word_tag_list)>0):
        return [word for (word, tag) in word_tag_list if tag in {'NN', 'NNS', 'NNP', 'NNPS'}]



In [23]:

    
nouns_df = pd.DataFrame(nltk_tagged_df['reviewText'].progress_apply(lambda review: noun_collector(review)))
nouns_df.head()









    



Progress:: 100%|██████████| 582711/582711 [00:21<00:00, 27257.39it/s] 






    Out[23]:







  
    
      
      reviewText
    
  
  
    
      0
      [timeless,  gibran,  backs,  content,  means, ...
    
    
      1
      [ prophet,  kahlil,  gibran,  thirty,  years, ...
    
    
      2
      [ first,  books,  recall,  collection,  gibran...
    
    
      3
      [prophet,  kahlil,  work,  world,  million,  c...
    
    
      4
      [gibran,  khalil,  gibran,  born,  one thousan...



In [24]:

    
keyed_nouns_df = pd.concat([uniqueKey_series_df, nouns_df], axis=1);
keyed_nouns_df.head()









    Out[24]:







  
    
      
      uniqueKey
      reviewText
    
  
  
    
      0
      A2XQ5LZHTD4AFT##000100039X
      [timeless,  gibran,  backs,  content,  means, ...
    
    
      1
      AF7CSSGV93RXN##000100039X
      [ prophet,  kahlil,  gibran,  thirty,  years, ...
    
    
      2
      A1NPNGWBVD9AK3##000100039X
      [ first,  books,  recall,  collection,  gibran...
    
    
      3
      A3IS4WGMFR4X65##000100039X
      [prophet,  kahlil,  work,  world,  million,  c...
    
    
      4
      AWLFVCT9128JV##000100039X
      [gibran,  khalil,  gibran,  born,  one thousan...



In [25]:

    
keyed_nouns_df.to_csv("../data/interim/002_keyed_nouns.csv", sep='\t', header=True, index=False);



In [26]:

    
# Save a dictionary into a pickle file.
keyed_nouns_df.to_pickle("../data/interim/002_keyed_nouns.p")

Adjectives



In [27]:

    
def adjective_collector(word_tag_list):
    if(len(word_tag_list)>0):
        return [word for (word, tag) in word_tag_list if tag in {'NN', 'JJR', 'JJS'}]



In [28]:

    
adjectives_df = pd.DataFrame(nltk_tagged_df['reviewText'].progress_apply(lambda review: noun_collector(review)))
nouns_df.head()









    



Progress:: 100%|██████████| 582711/582711 [00:11<00:00, 52023.77it/s]






    Out[28]:







  
    
      
      reviewText
    
  
  
    
      0
      [timeless,  gibran,  backs,  content,  means, ...
    
    
      1
      [ prophet,  kahlil,  gibran,  thirty,  years, ...
    
    
      2
      [ first,  books,  recall,  collection,  gibran...
    
    
      3
      [prophet,  kahlil,  work,  world,  million,  c...
    
    
      4
      [gibran,  khalil,  gibran,  born,  one thousan...



In [29]:

    
keyed_adjectives_df = pd.concat([uniqueKey_series_df, adjectives_df], axis=1);
keyed_adjectives_df.head()









    Out[29]:







  
    
      
      uniqueKey
      reviewText
    
  
  
    
      0
      A2XQ5LZHTD4AFT##000100039X
      [timeless,  gibran,  backs,  content,  means, ...
    
    
      1
      AF7CSSGV93RXN##000100039X
      [ prophet,  kahlil,  gibran,  thirty,  years, ...
    
    
      2
      A1NPNGWBVD9AK3##000100039X
      [ first,  books,  recall,  collection,  gibran...
    
    
      3
      A3IS4WGMFR4X65##000100039X
      [prophet,  kahlil,  work,  world,  million,  c...
    
    
      4
      AWLFVCT9128JV##000100039X
      [gibran,  khalil,  gibran,  born,  one thousan...



In [30]:

    
keyed_adjectives_df.to_csv("../data/interim/002_adjectives_nouns.csv", sep='\t', header=True, index=False);



In [31]:

    
# Save a dictionary into a pickle file.
keyed_adjectives_df.to_pickle("../data/interim/002_keyed_adjectives.p")



In [ ]:

    
# END OF FILE



In [ ]:

	uniqueKey	reviewText
0	A2XQ5LZHTD4AFT##000100039X	['timeless', 'classic', 'demanding', 'assuming...
1	AF7CSSGV93RXN##000100039X	['first', 'read', 'prophet', 'kahlil', 'gibran...
2	A1NPNGWBVD9AK3##000100039X	['one', 'first', 'literary', 'books', 'recall'...
3	A3IS4WGMFR4X65##000100039X	['prophet', 'kahlil', 'gibrans', 'best', 'know...
4	AWLFVCT9128JV##000100039X	['gibran', 'khalil', 'gibran', 'born', 'one th...

	reviewText
0	[(timeless, NN), ( classic, JJ), ( demanding, ...
1	[(first, RB), ( read, JJ), ( prophet, NNP), ( ...
2	[(one, CD), ( first, NNP), ( literary, JJ), ( ...
3	[(prophet, NN), ( kahlil, NN), ( gibrans, VBZ)...
4	[(gibran, NN), ( khalil, NNP), ( gibran, NNP),...

	reviewText
0	[timeless, gibran, backs, content, means, ...
1	[ prophet, kahlil, gibran, thirty, years, ...
2	[ first, books, recall, collection, gibran...
3	[prophet, kahlil, work, world, million, c...
4	[gibran, khalil, gibran, born, one thousan...