POS-Tagging & Feature Extraction

With normalisation complete, we can proceed to POS-tagging and feature extraction. Let's start with POS-tagging.

POS-tagging

Part-of-speech (POS) tagging is one of the most important text-analysis tasks: it classifies each word into its part of speech and labels it according to a tagset, the collection of tags used for the tagging task. Parts of speech are also known as word classes or lexical categories.

The nltk library provides its own pre-trained POS-tagger. Let's see how it is used.
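
As a quick, self-contained illustration of the built-in tagger (a sketch; it assumes the punkt and averaged_perceptron_tagger resources have been fetched via nltk.download):

In [ ]:
import nltk
# One-off resource downloads, if not already present:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
nltk.pos_tag(nltk.word_tokenize("The prophet is a timeless classic"))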


In [1]:
import pandas as pd
df0 = pd.read_csv("../../data/interim/001_normalised_keyed_reviews.csv", sep="\t", low_memory=False)
df0.head()


Out[1]:
uniqueKey reviewText
0 A2XQ5LZHTD4AFT##000100039X ['timeless', 'classic', 'demanding', 'assuming...
1 AF7CSSGV93RXN##000100039X ['first', 'read', 'prophet', 'kahlil', 'gibran...
2 A1NPNGWBVD9AK3##000100039X ['one', 'first', 'literary', 'books', 'recall'...
3 A3IS4WGMFR4X65##000100039X ['prophet', 'kahlil', 'gibrans', 'best', 'know...
4 AWLFVCT9128JV##000100039X ['gibran', 'khalil', 'gibran', 'born', 'one th...

In [2]:
# For monitoring duration of pandas processes
from tqdm import tqdm, tqdm_pandas

# To avoid RuntimeError: Set changed size during iteration
tqdm.monitor_interval = 0

# Register `pandas.progress_apply` and `pandas.Series.map_apply` with `tqdm`
# (can use `tqdm_gui`, `tqdm_notebook`, optional kwargs, etc.)
tqdm.pandas(desc="Progress:")

# Now you can use `progress_apply` instead of `apply`
# and `progress_map` instead of `map`
# can also groupby:
# df.groupby(0).progress_apply(lambda x: x**2)

In [3]:
def convert_text_to_list(review):
    # Drop the list brackets and quotes, split on commas, and strip the
    # stray leading space each token inherits from the ", " separator
    return [token.strip() for token in
            review.replace("[", "").replace("]", "").replace("'", "").split(",")]
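
A more robust alternative for round-tripping lists through CSV (a sketch; it assumes every stored value is a valid Python list literal) is ast.literal_eval from the standard library:

In [ ]:
# import ast
# df0['reviewText'] = df0['reviewText'].progress_apply(ast.literal_eval)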

In [4]:
# Convert the "reviewText" field back to a list
df0['reviewText'] = df0['reviewText'].astype(str)
df0['reviewText'] = df0['reviewText'].progress_apply(convert_text_to_list)
df0['reviewText'].head()


Progress:: 100%|██████████| 582711/582711 [00:17<00:00, 33818.55it/s]
Out[4]:
0    [timeless, classic, demanding, assuming, t...
1    [first, read, prophet, kahlil, gibran, th...
2    [one, first, literary, books, recall, rea...
3    [prophet, kahlil, gibrans, best, known, w...
4    [gibran, khalil, gibran, born, one thousan...
Name: reviewText, dtype: object

In [5]:
df0['reviewText'][12]


Out[5]:
['father',
 'huge',
 'book',
 'collection',
 'remember',
 'around',
 'twelve',
 'found',
 'book',
 'amidst',
 'sea',
 'books',
 'thought',
 'pretty',
 'small',
 'read',
 'one',
 'sit',
 'changed',
 'life',
 'talks',
 'life',
 'love',
 'friendship',
 'death',
 'etc',
 'answers',
 'spend',
 'entire',
 'life',
 'searching',
 'book',
 'tells',
 'things',
 'already',
 'know',
 'life',
 'somehow',
 'keep',
 'back',
 'heads',
 'dont_have',
 'right',
 'thing',
 'ignoring',
 'make',
 'us',
 'human',
 'default',
 'no_place',
 'go',
 'book',
 'doesnt_follow',
 'remember',
 'college',
 'teacher',
 'despised',
 'much',
 'class',
 'talking',
 'favorite',
 'book',
 'came',
 'turn',
 'said',
 'prophet',
 'kahlil',
 'gibran',
 'got',
 'excited',
 'said',
 'thats',
 'favorite',
 'book',
 'like',
 'bible',
 'remember',
 'thinking',
 'something',
 'important',
 'common',
 'someone',
 'despise',
 'guess',
 'thats',
 'thing',
 'life',
 'never_forget']

In [6]:
import nltk
nltk.__version__


Out[6]:
'3.2.4'

In [7]:
# Split tokens that were joined with an underscore during normalisation
# (e.g. 'dont_have' -> 'dont', 'have')
def split_neg(review):
    new_review = []
    for token in review:
        if '_' in token:
            # extend() also copes with tokens holding several underscores
            new_review.extend(token.split("_"))
        else:
            new_review.append(token)
    return new_review
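
A quick sanity check on a made-up token list:

In [ ]:
split_neg(['dont_have', 'book', 'no_place'])  # -> ['dont', 'have', 'book', 'no', 'place']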

In [8]:
df0["reviewText"] = df0["reviewText"].progress_apply(lambda review: split_neg(review))
df0["reviewText"].head()


Progress:: 100%|██████████| 582711/582711 [00:14<00:00, 40001.85it/s]
Out[8]:
0    [timeless, classic, demanding, assuming, t...
1    [first, read, prophet, kahlil, gibran, th...
2    [one, first, literary, books, recall, rea...
3    [prophet, kahlil, gibrans, best, known, w...
4    [gibran, khalil, gibran, born, one thousan...
Name: reviewText, dtype: object

In [9]:
### Remove Stop Words
from nltk.corpus import stopwords
# Requires the stopwords corpus: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(review):
    return [token for token in review if token not in stop_words]
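
Again, a quick illustrative check:

In [ ]:
remove_stopwords(['the', 'prophet', 'is', 'a', 'timeless', 'classic'])  # -> ['prophet', 'timeless', 'classic']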

In [10]:
df0["reviewText"] = df0["reviewText"].progress_apply(lambda review: remove_stopwords(review))
df0["reviewText"].head()


Progress:: 100%|██████████| 582711/582711 [00:12<00:00, 48007.55it/s]
Out[10]:
0    [timeless, classic, demanding, assuming, t...
1    [first, read, prophet, kahlil, gibran, th...
2    [one, first, literary, books, recall, rea...
3    [prophet, kahlil, gibrans, best, known, w...
4    [gibran, khalil, gibran, born, one thousan...
Name: reviewText, dtype: object

Unfortunately, the Stanford tagger used below, though more accurate than nltk's default, is very slow: tagging the dataset review by review would take days of running (the progress bar in In [15] estimates over 150 hours).

Follow this link for more info on the tagger: https://nlp.stanford.edu/software/tagger.shtml#History
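
One mitigation worth noting (a sketch only, not what this notebook does): each call to the tagger's tag method launches a fresh Java process, as the traceback below shows, so passing many reviews at once to tag_sents amortises that start-up cost. The batch size of 1000 is an arbitrary, illustrative choice.

In [ ]:
# Sketch: batch reviews through tag_sents() so that a single Java process
# tags many reviews at a time (1000 per batch is an illustrative choice)
# tagged = []
# for start in range(0, len(df0), 1000):
#     batch = df0['reviewText'].iloc[start:start + 1000].tolist()
#     tagged.extend(pos_tagger.tag_sents(batch))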


In [12]:
from nltk.tag import StanfordPOSTagger
from nltk import word_tokenize

# import os
# os.getcwd()

# Add the jar and model via their path (instead of setting environment variables):
jar = '../../models/stanford-postagger-full-2017-06-09/stanford-postagger.jar'
model = '../../models/stanford-postagger-full-2017-06-09/models/english-left3words-distsim.tagger'

pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')

In [13]:
def pos_tag(review):
    # Guard against empty reviews; return an empty list rather than None
    if len(review) > 0:
        return pos_tagger.tag(review)
    return []

In [14]:
# Example
text = pos_tagger.tag(word_tokenize("What's the airspeed of an unladen swallow ?"))
print(text)


[('What', 'WP'), ("'s", 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]

In [15]:
tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: pos_tag(review)))
tagged_df.head()


Progress::   0%|          | 72/582711 [01:09<158:51:58,  1.02it/s]
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-15-013aa4904155> in <module>()
----> 1 tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: pos_tag(review)))
      2 tagged_df.head()

~/anaconda3/lib/python3.6/site-packages/tqdm/_tqdm.py in inner(df, func, *args, **kwargs)
    628                 # Apply the provided function (in *args and **kwargs)
    629                 # on the df using our wrapper (which provides bar updating)
--> 630                 result = getattr(df, df_function)(wrapper, *args, **kwargs)
    631 
    632                 # Close bar and return pandas calculation result

~/anaconda3/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2549             else:
   2550                 values = self.asobject
-> 2551                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2552 
   2553         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

~/anaconda3/lib/python3.6/site-packages/tqdm/_tqdm.py in wrapper(*args, **kwargs)
    624                 def wrapper(*args, **kwargs):
    625                     t.update()
--> 626                     return func(*args, **kwargs)
    627 
    628                 # Apply the provided function (in *args and **kwargs)

<ipython-input-15-013aa4904155> in <lambda>(review)
----> 1 tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: pos_tag(review)))
      2 tagged_df.head()

<ipython-input-13-a7a040836b8b> in pos_tag(review)
      1 def pos_tag(review):
      2     if(len(review)>0):
----> 3         return pos_tagger.tag(review)

~/anaconda3/lib/python3.6/site-packages/nltk/tag/stanford.py in tag(self, tokens)
     74     def tag(self, tokens):
     75         # This function should return list of tuple rather than list of list
---> 76         return sum(self.tag_sents([tokens]), [])
     77 
     78     def tag_sents(self, sentences):

~/anaconda3/lib/python3.6/site-packages/nltk/tag/stanford.py in tag_sents(self, sentences)
     97         # Run the tagger and get the output
     98         stanpos_output, _stderr = java(cmd, classpath=self._stanford_jar,
---> 99                                        stdout=PIPE, stderr=PIPE)
    100         stanpos_output = stanpos_output.decode(encoding)
    101 

~/anaconda3/lib/python3.6/site-packages/nltk/__init__.py in java(cmd, classpath, stdin, stdout, stderr, blocking)
    129     p = subprocess.Popen(cmd, stdin=stdin, stdout=stdout, stderr=stderr)
    130     if not blocking: return p
--> 131     (stdout, stderr) = p.communicate()
    132 
    133     # Check the return code.

~/anaconda3/lib/python3.6/subprocess.py in communicate(self, input, timeout)
    836 
    837             try:
--> 838                 stdout, stderr = self._communicate(input, endtime, timeout)
    839             finally:
    840                 self._communication_started = True

~/anaconda3/lib/python3.6/subprocess.py in _communicate(self, input, endtime, orig_timeout)
   1501                         raise TimeoutExpired(self.args, orig_timeout)
   1502 
-> 1503                     ready = selector.select(timeout)
   1504                     self._check_timeout(endtime, orig_timeout)
   1505 

~/anaconda3/lib/python3.6/selectors.py in select(self, timeout)
    374             ready = []
    375             try:
--> 376                 fd_event_list = self._poll.poll(timeout)
    377             except InterruptedError:
    378                 return ready

KeyboardInterrupt: 

In [ ]:
# The Stanford tagger being impractically slow, fall back to nltk's built-in
# tagger (requires the averaged_perceptron_tagger resource):
tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(nltk.pos_tag))
tagged_df.head()

Thankfully, nltk provides documentation for each tag, which can be queried by tag name, e.g. nltk.help.upenn_tagset('RB'), or by a regular expression, e.g. nltk.help.upenn_tagset('NN.*'). It also offers a batch method, nltk.pos_tag_sents, for pos-tagging several documents at once.
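
For example (an illustrative query; the tagsets resource must be downloaded once):

In [ ]:
# nltk.download('tagsets')  # one-off download of the tag documentation
nltk.help.upenn_tagset('RB')

Here is one of the tagged reviews: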


In [ ]:
tagged_df['reviewText'][8]

The list of all possible tags appears below:

Tag Description
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb


In [ ]:
## Join with the original key and persist locally to avoid re-processing
uniqueKey_series_df = df0[['uniqueKey']]
uniqueKey_series_df.head()

In [ ]:
pos_tagged_keyed_reviews = pd.concat([uniqueKey_series_df, tagged_df], axis=1);
pos_tagged_keyed_reviews.head()

In [ ]:
pos_tagged_keyed_reviews.to_csv("../../data/interim/002_pos_tagged_keyed_reviews.csv", sep='\t', header=True, index=False);

Nouns

Nouns generally refer to people, places, things, or concepts, e.g.: woman, Scotland, book, intelligence. Nouns can appear after determiners and adjectives, and can be the subject or object of the verb.

The simplified noun tags are N for common nouns like book, and NP for proper nouns like Scotland; in the Penn Treebank tagset used here these correspond to NN/NNS and NNP/NNPS respectively.
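
Note that nltk 3 replaced the older simplified tagset with the universal tagset, in which all nouns map to NOUN; the mapped tags can be requested directly (an illustrative call; it assumes the universal_tagset resource has been downloaded):

In [ ]:
# nltk.download('universal_tagset')  # one-off mapping download
nltk.pos_tag(['Scotland', 'is', 'beautiful'], tagset='universal')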


In [ ]:
def noun_collector(word_tag_list):
    # Keep only tokens tagged as nouns (common/proper, singular/plural)
    if len(word_tag_list) > 0:
        return [word for (word, tag) in word_tag_list if tag in {'NN', 'NNS', 'NNP', 'NNPS'}]
    return []
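
A quick check on the tagged example sentence from In [14], where only 'airspeed' was tagged as a noun:

In [ ]:
noun_collector(text)  # expected: ['airspeed']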

In [ ]:
nouns_df = pd.DataFrame(tagged_df['reviewText'].progress_apply(noun_collector))
nouns_df.head()

In [ ]:
keyed_nouns_df = pd.concat([uniqueKey_series_df, nouns_df], axis=1);
keyed_nouns_df.head()

In [ ]:
keyed_nouns_df.to_csv("../../data/interim/002_keyed_nouns_stanford.csv", sep='\t', header=True, index=False);

In [ ]:
## END_OF_FILE
