Pos-Tagging & Feature Extraction

Following normalisation, we can now proceed to the process of pos-tagging and feature extraction. Let's start with pos-tagging.

POS-tagging

Part-of-speech tagging is one of the most important text analysis tasks used to classify words into their part-of-speech and label them according the tagset which is a collection of tags used for the pos tagging. Part-of-speech tagging also known as word classes or lexical categories.

The nltk library provides its own pre-trained POS-tagger. Let's see how it is used.


In [1]:
import pandas as pd
df0 = pd.read_csv("../data/interim/001_normalised_keyed_reviews.csv", sep="\t", low_memory=False)
df0.head()


Out[1]:
uniqueKey reviewText
0 A2XQ5LZHTD4AFT##000100039X ['timeless', 'classic', 'demanding', 'assuming...
1 AF7CSSGV93RXN##000100039X ['first', 'read', 'prophet', 'kahlil', 'gibran...
2 A1NPNGWBVD9AK3##000100039X ['one', 'first', 'literary', 'books', 'recall'...
3 A3IS4WGMFR4X65##000100039X ['prophet', 'kahlil', 'gibrans', 'best', 'know...
4 AWLFVCT9128JV##000100039X ['gibran', 'khalil', 'gibran', 'born', 'one th...

In [2]:
# For monitoring duration of pandas processes
from tqdm import tqdm, tqdm_pandas

# To avoid RuntimeError: Set changed size during iteration
tqdm.monitor_interval = 0

# Register `pandas.progress_apply` and `pandas.Series.map_apply` with `tqdm`
# (can use `tqdm_gui`, `tqdm_notebook`, optional kwargs, etc.)
tqdm.pandas(desc="Progress:")

# Now you can use `progress_apply` instead of `apply`
# and `progress_map` instead of `map`
# can also groupby:
# df.groupby(0).progress_apply(lambda x: x**2)

In [3]:
def convert_text_to_list(review):
    return review.replace("[","").replace("]","").replace("'","").replace("\t","").split(",")

In [4]:
# Convert "reviewText" field to back to list
df0['reviewText'] = df0['reviewText'].astype(str)
df0['reviewText'] = df0['reviewText'].progress_apply(lambda text: convert_text_to_list(text));
df0['reviewText'].head()


Progress:: 100%|██████████| 582711/582711 [00:17<00:00, 33284.66it/s]
Out[4]:
0    [timeless,  classic,  demanding,  assuming,  t...
1    [first,  read,  prophet,  kahlil,  gibran,  th...
2    [one,  first,  literary,  books,  recall,  rea...
3    [prophet,  kahlil,  gibrans,  best,  known,  w...
4    [gibran,  khalil,  gibran,  born,  one thousan...
Name: reviewText, dtype: object

In [5]:
df0['reviewText'][12]


Out[5]:
['father',
 ' huge',
 ' book',
 ' collection',
 ' remember',
 ' around',
 ' twelve',
 ' found',
 ' book',
 ' amidst',
 ' sea',
 ' books',
 ' thought',
 ' pretty',
 ' small',
 ' read',
 ' one',
 ' sit',
 ' changed',
 ' life',
 ' talks',
 ' life',
 ' love',
 ' friendship',
 ' death',
 ' etc',
 ' answers',
 ' spend',
 ' entire',
 ' life',
 ' searching',
 ' book',
 ' tells',
 ' things',
 ' already',
 ' know',
 ' life',
 ' somehow',
 ' keep',
 ' back',
 ' heads',
 ' dont_have',
 ' right',
 ' thing',
 ' ignoring',
 ' make',
 ' us',
 ' human',
 ' default',
 ' no_place',
 ' go',
 ' book',
 ' doesnt_follow',
 ' remember',
 ' college',
 ' teacher',
 ' despised',
 ' much',
 ' class',
 ' talking',
 ' favorite',
 ' book',
 ' came',
 ' turn',
 ' said',
 ' prophet',
 ' kahlil',
 ' gibran',
 ' got',
 ' excited',
 ' said',
 ' thats',
 ' favorite',
 ' book',
 ' like',
 ' bible',
 ' remember',
 ' thinking',
 ' something',
 ' important',
 ' common',
 ' someone',
 ' despise',
 ' guess',
 ' thats',
 ' thing',
 ' life',
 ' never_forget']

In [6]:
import nltk
nltk.__version__


Out[6]:
'3.2.5'

In [7]:
# Split negs
def split_neg(review):
    new_review = []
    for token in review:
        if '_' in token:
            split_words = token.split("_")
            new_review.append(split_words[0])
            new_review.append(split_words[1])
        else:
            new_review.append(token)
    return new_review

In [8]:
df0["reviewText"] = df0["reviewText"].progress_apply(lambda review: split_neg(review))
df0["reviewText"].head()


Progress:: 100%|██████████| 582711/582711 [00:15<00:00, 37933.46it/s]
Out[8]:
0    [timeless,  classic,  demanding,  assuming,  t...
1    [first,  read,  prophet,  kahlil,  gibran,  th...
2    [one,  first,  literary,  books,  recall,  rea...
3    [prophet,  kahlil,  gibrans,  best,  known,  w...
4    [gibran,  khalil,  gibran,  born,  one thousan...
Name: reviewText, dtype: object

In [9]:
### Remove Stop Words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(review):
    return [token for token in review if not token in stop_words]

In [10]:
df0["reviewText"] = df0["reviewText"].progress_apply(lambda review: remove_stopwords(review))
df0["reviewText"].head()


Progress:: 100%|██████████| 582711/582711 [00:12<00:00, 47896.49it/s]
Out[10]:
0    [timeless,  classic,  demanding,  assuming,  t...
1    [first,  read,  prophet,  kahlil,  gibran,  th...
2    [one,  first,  literary,  books,  recall,  rea...
3    [prophet,  kahlil,  gibrans,  best,  known,  w...
4    [gibran,  khalil,  gibran,  born,  one thousan...
Name: reviewText, dtype: object

Follow this link for more info on the tagger: https://nlp.stanford.edu/software/tagger.shtml#History


In [11]:
# from nltk.tag import StanfordPOSTagger
# from nltk import word_tokenize

# # import os
# # os.getcwd()

# # Add the jar and model via their path (instead of setting environment variables):
# jar = '../models/stanford-postagger-full-2017-06-09/stanford-postagger.jar'
# model = '../models/stanford-postagger-full-2017-06-09/models/english-left3words-distsim.tagger'

# pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')

In [12]:
# def pos_tag(review):
#     if(len(review)>0):
#         return pos_tagger.tag(review)

In [13]:
## Example
# text = pos_tagger.tag(word_tokenize("What's the airspeed of an unladen swallow ?"))
# print(text)

In [14]:
# tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: pos_tag(review)))
# tagged_df.head()

Unfortunatelly, this tagger, though much better and accurate, takes a lot of time. In order to process the above data set it would need close to 3 days of running.


In [15]:
# from textblob import TextBlob

# def blob_pos_tagger(review):
#     blob = TextBlob(" ".join(review))
#     return blob.tags

# blob_tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: blob_pos_tagger(review)))
# blob_tagged_df.head()

In [16]:
nltk_tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: nltk.pos_tag(review)))
nltk_tagged_df.head()


Progress:: 100%|██████████| 582711/582711 [1:08:29<00:00, 141.81it/s]
Out[16]:
reviewText
0 [(timeless, NN), ( classic, JJ), ( demanding, ...
1 [(first, RB), ( read, JJ), ( prophet, NNP), ( ...
2 [(one, CD), ( first, NNP), ( literary, JJ), ( ...
3 [(prophet, NN), ( kahlil, NN), ( gibrans, VBZ)...
4 [(gibran, NN), ( khalil, NNP), ( gibran, NNP),...

Thankfully, nltk provides documentation for each tag, which can be queried using the tag, e.g., nltk.help.upenn_tagset(‘RB’), or a regular expression. nltk also provides batch pos-tagging method for document pos-tagging:


In [17]:
nltk_tagged_df['reviewText'][8]


Out[17]:
[('prophet', 'NN'),
 (' dispenses', 'NNS'),
 (' ultimate', 'VBP'),
 (' wisdom', 'NN'),
 (' loved', 'VBN'),
 (' ones', 'NNS'),
 (' bids', 'NNS'),
 (' fare', 'VBP'),
 (' well', 'NNP'),
 (' khalil', 'NNP'),
 (' gibran', 'NNP'),
 (' defines', 'VBZ'),
 (' never', 'RB'),
 (' words', 'NNS'),
 (' define', 'VBP'),
 (' appropriately', 'RB'),
 (' good', 'JJ'),
 (' sense', 'JJ'),
 (' define', 'NN'),
 (' discovered', 'VBN'),
 (' book', 'NNP'),
 (' back', 'NNP'),
 (' took', 'NNP'),
 (' long', 'NNP'),
 (' time', 'NNP'),
 (' read', 'JJ'),
 (' since', 'NN'),
 (' refused', 'VBD'),
 (' rush', 'JJ'),
 (' read', 'JJ'),
 (' lesson', 'NN'),
 (' time', 'NN'),
 (' understanding', 'VBG'),
 (' best', 'JJS'),
 (' ability', 'NN'),
 (' found', 'NN'),
 (' way', 'NNP'),
 (' life', 'NNP'),
 (' words', 'NNS'),
 (' could', 'MD'),
 (' read', 'VB'),
 (' everyday', 'JJ'),
 (' time', 'JJ'),
 (' words', 'NNS'),
 (' would', 'MD'),
 (' dispense', 'VB'),
 (' new', 'NNP'),
 (' lesson', 'NNP'),
 (' like', 'NNP'),
 (' never', 'NNP'),
 ('ending', 'VBG'),
 (' treasure', 'NN')]

The list of all possible tags appears below:

Tag Description
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX ExistentialĘthere
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP* Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP* Possessive wh-pronoun
WRB Wh-adverb

Notice: where you see * replace with $.


In [18]:
## Join with Original Key and Persist Locally to avoid RE-processing
uniqueKey_series_df = df0[['uniqueKey']]
uniqueKey_series_df.head()


Out[18]:
uniqueKey
0 A2XQ5LZHTD4AFT##000100039X
1 AF7CSSGV93RXN##000100039X
2 A1NPNGWBVD9AK3##000100039X
3 A3IS4WGMFR4X65##000100039X
4 AWLFVCT9128JV##000100039X

In [19]:
pos_tagged_keyed_reviews = pd.concat([uniqueKey_series_df, nltk_tagged_df], axis=1);
pos_tagged_keyed_reviews.head()


Out[19]:
uniqueKey reviewText
0 A2XQ5LZHTD4AFT##000100039X [(timeless, NN), ( classic, JJ), ( demanding, ...
1 AF7CSSGV93RXN##000100039X [(first, RB), ( read, JJ), ( prophet, NNP), ( ...
2 A1NPNGWBVD9AK3##000100039X [(one, CD), ( first, NNP), ( literary, JJ), ( ...
3 A3IS4WGMFR4X65##000100039X [(prophet, NN), ( kahlil, NN), ( gibrans, VBZ)...
4 AWLFVCT9128JV##000100039X [(gibran, NN), ( khalil, NNP), ( gibran, NNP),...

In [20]:
pos_tagged_keyed_reviews.to_csv("../data/interim/002_pos_tagged_keyed_reviews.csv", sep='\t', header=True, index=False);

In [21]:
# Save a dictionary into a pickle file.
pos_tagged_keyed_reviews.to_pickle("../data/interim/002_pos_tagged_keyed_reviews.p")

Nouns

Nouns generally refer to people, places, things, or concepts, e.g.: woman, Scotland, book, intelligence. Nouns can appear after determiners and adjectives, and can be the subject or object of the verb.

The simplified noun tags are N for common nouns like book, and NP for proper nouns like Scotland.


In [22]:
def noun_collector(word_tag_list):
    if(len(word_tag_list)>0):
        return [word for (word, tag) in word_tag_list if tag in {'NN', 'NNS', 'NNP', 'NNPS'}]

In [23]:
nouns_df = pd.DataFrame(nltk_tagged_df['reviewText'].progress_apply(lambda review: noun_collector(review)))
nouns_df.head()


Progress:: 100%|██████████| 582711/582711 [00:21<00:00, 27257.39it/s] 
Out[23]:
reviewText
0 [timeless, gibran, backs, content, means, ...
1 [ prophet, kahlil, gibran, thirty, years, ...
2 [ first, books, recall, collection, gibran...
3 [prophet, kahlil, work, world, million, c...
4 [gibran, khalil, gibran, born, one thousan...

In [24]:
keyed_nouns_df = pd.concat([uniqueKey_series_df, nouns_df], axis=1);
keyed_nouns_df.head()


Out[24]:
uniqueKey reviewText
0 A2XQ5LZHTD4AFT##000100039X [timeless, gibran, backs, content, means, ...
1 AF7CSSGV93RXN##000100039X [ prophet, kahlil, gibran, thirty, years, ...
2 A1NPNGWBVD9AK3##000100039X [ first, books, recall, collection, gibran...
3 A3IS4WGMFR4X65##000100039X [prophet, kahlil, work, world, million, c...
4 AWLFVCT9128JV##000100039X [gibran, khalil, gibran, born, one thousan...

In [25]:
keyed_nouns_df.to_csv("../data/interim/002_keyed_nouns.csv", sep='\t', header=True, index=False);

In [26]:
# Save a dictionary into a pickle file.
keyed_nouns_df.to_pickle("../data/interim/002_keyed_nouns.p")

Adjectives


In [27]:
def adjective_collector(word_tag_list):
    if(len(word_tag_list)>0):
        return [word for (word, tag) in word_tag_list if tag in {'NN', 'JJR', 'JJS'}]

In [28]:
adjectives_df = pd.DataFrame(nltk_tagged_df['reviewText'].progress_apply(lambda review: noun_collector(review)))
nouns_df.head()


Progress:: 100%|██████████| 582711/582711 [00:11<00:00, 52023.77it/s]
Out[28]:
reviewText
0 [timeless, gibran, backs, content, means, ...
1 [ prophet, kahlil, gibran, thirty, years, ...
2 [ first, books, recall, collection, gibran...
3 [prophet, kahlil, work, world, million, c...
4 [gibran, khalil, gibran, born, one thousan...

In [29]:
keyed_adjectives_df = pd.concat([uniqueKey_series_df, adjectives_df], axis=1);
keyed_adjectives_df.head()


Out[29]:
uniqueKey reviewText
0 A2XQ5LZHTD4AFT##000100039X [timeless, gibran, backs, content, means, ...
1 AF7CSSGV93RXN##000100039X [ prophet, kahlil, gibran, thirty, years, ...
2 A1NPNGWBVD9AK3##000100039X [ first, books, recall, collection, gibran...
3 A3IS4WGMFR4X65##000100039X [prophet, kahlil, work, world, million, c...
4 AWLFVCT9128JV##000100039X [gibran, khalil, gibran, born, one thousan...

In [30]:
keyed_adjectives_df.to_csv("../data/interim/002_adjectives_nouns.csv", sep='\t', header=True, index=False);

In [31]:
# Save a dictionary into a pickle file.
keyed_adjectives_df.to_pickle("../data/interim/002_keyed_adjectives.p")

In [ ]:
# END OF FILE

In [ ]: