POS-Tagging & Feature Extraction

With normalisation complete, we can proceed to POS-tagging and feature extraction. Let's start with POS-tagging.

POS-tagging

Part-of-speech (POS) tagging is one of the most important text-analysis tasks: it classifies each word into its part of speech and labels it according to a tagset, the collection of tags used for the tagging task. Parts of speech are also known as word classes or lexical categories.

The nltk library provides its own pre-trained POS-tagger. Let's see how it is used.
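
As a quick, self-contained illustration of the built-in tagger (a sketch; it assumes the punkt and averaged_perceptron_tagger resources have been fetched via nltk.download):

In [ ]:
import nltk
# One-off resource downloads, if not already present:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
nltk.pos_tag(nltk.word_tokenize("The prophet is a timeless classic"))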


In [1]:
import pandas as pd
df0 = pd.read_csv("../../data/interim/001_normalised_keyed_reviews.csv", sep="\t", low_memory=False)
df0.head()


Out[1]:
uniqueKey reviewText
0 A2XQ5LZHTD4AFT##000100039X ['timeless', 'classic', 'demanding', 'assuming...
1 AF7CSSGV93RXN##000100039X ['first', 'read', 'prophet', 'kahlil', 'gibran...
2 A1NPNGWBVD9AK3##000100039X ['one', 'first', 'literary', 'books', 'recall'...
3 A3IS4WGMFR4X65##000100039X ['prophet', 'kahlil', 'gibrans', 'best', 'know...
4 AWLFVCT9128JV##000100039X ['gibran', 'khalil', 'gibran', 'born', 'one th...

In [2]:
# For monitoring duration of pandas processes
from tqdm import tqdm, tqdm_pandas

# To avoid RuntimeError: Set changed size during iteration
tqdm.monitor_interval = 0

# Register `pandas.progress_apply` and `pandas.Series.map_apply` with `tqdm`
# (can use `tqdm_gui`, `tqdm_notebook`, optional kwargs, etc.)
tqdm.pandas(desc="Progress:")

# Now you can use `progress_apply` instead of `apply`
# and `progress_map` instead of `map`
# can also groupby:
# df.groupby(0).progress_apply(lambda x: x**2)

In [3]:
def convert_text_to_list(review):
    # Drop the list brackets and quotes, split on commas, and strip the
    # stray leading space each token inherits from the ", " separator
    return [token.strip() for token in
            review.replace("[", "").replace("]", "").replace("'", "").split(",")]
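
A more robust alternative for round-tripping lists through CSV (a sketch; it assumes every stored value is a valid Python list literal) is ast.literal_eval from the standard library:

In [ ]:
# import ast
# df0['reviewText'] = df0['reviewText'].progress_apply(ast.literal_eval)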

In [4]:
# Convert the "reviewText" field back to a list
df0['reviewText'] = df0['reviewText'].astype(str)
df0['reviewText'] = df0['reviewText'].progress_apply(convert_text_to_list)
df0['reviewText'].head()


Progress:: 100%|██████████| 582711/582711 [00:17<00:00, 33818.55it/s]
Out[4]:
0    [timeless, classic, demanding, assuming, t...
1    [first, read, prophet, kahlil, gibran, th...
2    [one, first, literary, books, recall, rea...
3    [prophet, kahlil, gibrans, best, known, w...
4    [gibran, khalil, gibran, born, one thousan...
Name: reviewText, dtype: object

In [5]:
df0['reviewText'][12]


Out[5]:
['father',
 'huge',
 'book',
 'collection',
 'remember',
 'around',
 'twelve',
 'found',
 'book',
 'amidst',
 'sea',
 'books',
 'thought',
 'pretty',
 'small',
 'read',
 'one',
 'sit',
 'changed',
 'life',
 'talks',
 'life',
 'love',
 'friendship',
 'death',
 'etc',
 'answers',
 'spend',
 'entire',
 'life',
 'searching',
 'book',
 'tells',
 'things',
 'already',
 'know',
 'life',
 'somehow',
 'keep',
 'back',
 'heads',
 'dont_have',
 'right',
 'thing',
 'ignoring',
 'make',
 'us',
 'human',
 'default',
 'no_place',
 'go',
 'book',
 'doesnt_follow',
 'remember',
 'college',
 'teacher',
 'despised',
 'much',
 'class',
 'talking',
 'favorite',
 'book',
 'came',
 'turn',
 'said',
 'prophet',
 'kahlil',
 'gibran',
 'got',
 'excited',
 'said',
 'thats',
 'favorite',
 'book',
 'like',
 'bible',
 'remember',
 'thinking',
 'something',
 'important',
 'common',
 'someone',
 'despise',
 'guess',
 'thats',
 'thing',
 'life',
 'never_forget']

In [6]:
import nltk
nltk.__version__


Out[6]:
'3.2.4'

In [7]:
# Split tokens that were joined with an underscore during normalisation
# (e.g. 'dont_have' -> 'dont', 'have')
def split_neg(review):
    new_review = []
    for token in review:
        if '_' in token:
            # extend() also copes with tokens holding several underscores
            new_review.extend(token.split("_"))
        else:
            new_review.append(token)
    return new_review
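
A quick sanity check on a made-up token list:

In [ ]:
split_neg(['dont_have', 'book', 'no_place'])  # -> ['dont', 'have', 'book', 'no', 'place']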

In [8]:
df0["reviewText"] = df0["reviewText"].progress_apply(lambda review: split_neg(review))
df0["reviewText"].head()


Progress:: 100%|██████████| 582711/582711 [00:14<00:00, 40001.85it/s]
Out[8]:
0    [timeless, classic, demanding, assuming, t...
1    [first, read, prophet, kahlil, gibran, th...
2    [one, first, literary, books, recall, rea...
3    [prophet, kahlil, gibrans, best, known, w...
4    [gibran, khalil, gibran, born, one thousan...
Name: reviewText, dtype: object

In [9]:
### Remove Stop Words
from nltk.corpus import stopwords
# Requires the stopwords corpus: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(review):
    return [token for token in review if token not in stop_words]
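
Again, a quick illustrative check:

In [ ]:
remove_stopwords(['the', 'prophet', 'is', 'a', 'timeless', 'classic'])  # -> ['prophet', 'timeless', 'classic']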

In [10]:
df0["reviewText"] = df0["reviewText"].progress_apply(lambda review: remove_stopwords(review))
df0["reviewText"].head()


Progress:: 100%|██████████| 582711/582711 [00:12<00:00, 48007.55it/s]
Out[10]:
0    [timeless, classic, demanding, assuming, t...
1    [first, read, prophet, kahlil, gibran, th...
2    [one, first, literary, books, recall, rea...
3    [prophet, kahlil, gibrans, best, known, w...
4    [gibran, khalil, gibran, born, one thousan...
Name: reviewText, dtype: object

Unfortunately, the Stanford tagger used below, though more accurate than nltk's default, is very slow: tagging the dataset review by review would take days of running (the progress bar in In [15] estimates over 150 hours).

Follow this link for more info on the tagger: https://nlp.stanford.edu/software/tagger.shtml#History
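
One mitigation worth noting (a sketch only, not what this notebook does): each call to the tagger's tag method launches a fresh Java process, as the traceback below shows, so passing many reviews at once to tag_sents amortises that start-up cost. The batch size of 1000 is an arbitrary, illustrative choice.

In [ ]:
# Sketch: batch reviews through tag_sents() so that a single Java process
# tags many reviews at a time (1000 per batch is an illustrative choice)
# tagged = []
# for start in range(0, len(df0), 1000):
#     batch = df0['reviewText'].iloc[start:start + 1000].tolist()
#     tagged.extend(pos_tagger.tag_sents(batch))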


In [12]:
from nltk.tag import StanfordPOSTagger
from nltk import word_tokenize

# import os
# os.getcwd()

# Add the jar and model via their path (instead of setting environment variables):
jar = '../../models/stanford-postagger-full-2017-06-09/stanford-postagger.jar'
model = '../../models/stanford-postagger-full-2017-06-09/models/english-left3words-distsim.tagger'

pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')

In [13]:
def pos_tag(review):
    # Guard against empty reviews; return an empty list rather than None
    if len(review) > 0:
        return pos_tagger.tag(review)
    return []

In [14]:
# Example
text = pos_tagger.tag(word_tokenize("What's the airspeed of an unladen swallow ?"))
print(text)


[('What', 'WP'), ("'s", 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]

In [15]:
tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: pos_tag(review)))
tagged_df.head()


Progress::   0%|          | 72/582711 [01:09<158:51:58,  1.02it/s]
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-15-013aa4904155> in <module>()
----> 1 tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: pos_tag(review)))
      2 tagged_df.head()

~/anaconda3/lib/python3.6/site-packages/tqdm/_tqdm.py in inner(df, func, *args, **kwargs)
    628                 # Apply the provided function (in *args and **kwargs)
    629                 # on the df using our wrapper (which provides bar updating)
--> 630                 result = getattr(df, df_function)(wrapper, *args, **kwargs)
    631 
    632                 # Close bar and return pandas calculation result

~/anaconda3/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2549             else:
   2550                 values = self.asobject
-> 2551                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2552 
   2553         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

~/anaconda3/lib/python3.6/site-packages/tqdm/_tqdm.py in wrapper(*args, **kwargs)
    624                 def wrapper(*args, **kwargs):
    625                     t.update()
--> 626                     return func(*args, **kwargs)
    627 
    628                 # Apply the provided function (in *args and **kwargs)

<ipython-input-15-013aa4904155> in <lambda>(review)
----> 1 tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: pos_tag(review)))
      2 tagged_df.head()

<ipython-input-13-a7a040836b8b> in pos_tag(review)
      1 def pos_tag(review):
      2     if(len(review)>0):
----> 3         return pos_tagger.tag(review)

~/anaconda3/lib/python3.6/site-packages/nltk/tag/stanford.py in tag(self, tokens)
     74     def tag(self, tokens):
     75         # This function should return list of tuple rather than list of list
---> 76         return sum(self.tag_sents([tokens]), [])
     77 
     78     def tag_sents(self, sentences):

~/anaconda3/lib/python3.6/site-packages/nltk/tag/stanford.py in tag_sents(self, sentences)
     97         # Run the tagger and get the output
     98         stanpos_output, _stderr = java(cmd, classpath=self._stanford_jar,
---> 99                                        stdout=PIPE, stderr=PIPE)
    100         stanpos_output = stanpos_output.decode(encoding)
    101 

~/anaconda3/lib/python3.6/site-packages/nltk/__init__.py in java(cmd, classpath, stdin, stdout, stderr, blocking)
    129     p = subprocess.Popen(cmd, stdin=stdin, stdout=stdout, stderr=stderr)
    130     if not blocking: return p
--> 131     (stdout, stderr) = p.communicate()
    132 
    133     # Check the return code.

~/anaconda3/lib/python3.6/subprocess.py in communicate(self, input, timeout)
    836 
    837             try:
--> 838                 stdout, stderr = self._communicate(input, endtime, timeout)
    839             finally:
    840                 self._communication_started = True

~/anaconda3/lib/python3.6/subprocess.py in _communicate(self, input, endtime, orig_timeout)
   1501                         raise TimeoutExpired(self.args, orig_timeout)
   1502 
-> 1503                     ready = selector.select(timeout)
   1504                     self._check_timeout(endtime, orig_timeout)
   1505 

~/anaconda3/lib/python3.6/selectors.py in select(self, timeout)
    374             ready = []
    375             try:
--> 376                 fd_event_list = self._poll.poll(timeout)
    377             except InterruptedError:
    378                 return ready

KeyboardInterrupt: 

In [ ]:
# The Stanford tagger being impractically slow, fall back to nltk's built-in
# tagger (requires the averaged_perceptron_tagger resource):
tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(nltk.pos_tag))
tagged_df.head()

Thankfully, nltk provides documentation for each tag, which can be queried by tag name, e.g. nltk.help.upenn_tagset('RB'), or by a regular expression, e.g. nltk.help.upenn_tagset('NN.*'). It also offers a batch method, nltk.pos_tag_sents, for pos-tagging several documents at once.
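
For example (an illustrative query; the tagsets resource must be downloaded once):

In [ ]:
# nltk.download('tagsets')  # one-off download of the tag documentation
nltk.help.upenn_tagset('RB')

Here is one of the tagged reviews: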


In [ ]:
tagged_df['reviewText'][8]

The list of all possible tags appears below:

Tag Description
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb


In [ ]:
## Join with the original key and persist locally to avoid re-processing
uniqueKey_series_df = df0[['uniqueKey']]
uniqueKey_series_df.head()

In [ ]:
pos_tagged_keyed_reviews = pd.concat([uniqueKey_series_df, tagged_df], axis=1);
pos_tagged_keyed_reviews.head()

In [ ]:
pos_tagged_keyed_reviews.to_csv("../../data/interim/002_pos_tagged_keyed_reviews.csv", sep='\t', header=True, index=False);

Nouns

Nouns generally refer to people, places, things, or concepts, e.g.: woman, Scotland, book, intelligence. Nouns can appear after determiners and adjectives, and can be the subject or object of the verb.

The simplified noun tags are N for common nouns like book, and NP for proper nouns like Scotland; in the Penn Treebank tagset used here these correspond to NN/NNS and NNP/NNPS respectively.
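
Note that nltk 3 replaced the older simplified tagset with the universal tagset, in which all nouns map to NOUN; the mapped tags can be requested directly (an illustrative call; it assumes the universal_tagset resource has been downloaded):

In [ ]:
# nltk.download('universal_tagset')  # one-off mapping download
nltk.pos_tag(['Scotland', 'is', 'beautiful'], tagset='universal')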


In [ ]:
def noun_collector(word_tag_list):
    # Keep only tokens tagged as nouns (common/proper, singular/plural)
    if len(word_tag_list) > 0:
        return [word for (word, tag) in word_tag_list if tag in {'NN', 'NNS', 'NNP', 'NNPS'}]
    return []
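
A quick check on the tagged example sentence from In [14], where only 'airspeed' was tagged as a noun:

In [ ]:
noun_collector(text)  # expected: ['airspeed']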

In [ ]:
nouns_df = pd.DataFrame(tagged_df['reviewText'].progress_apply(noun_collector))
nouns_df.head()

In [ ]:
keyed_nouns_df = pd.concat([uniqueKey_series_df, nouns_df], axis=1);
keyed_nouns_df.head()

In [ ]:
keyed_nouns_df.to_csv("../../data/interim/002_keyed_nouns_stanford.csv", sep='\t', header=True, index=False);

In [ ]:
## END_OF_FILE
