Part-of-speech (POS) tagging is one of the most important text-analysis tasks: it classifies each word by its part of speech (parts of speech are also known as word classes or lexical categories) and labels it with a tag drawn from a tagset, the fixed collection of tags the tagger uses.
The nltk library provides its own pre-trained POS tagger. Let's see how it is used.
In [1]:
import pandas as pd
df0 = pd.read_csv("../data/interim/001_normalised_keyed_reviews.csv", sep="\t", low_memory=False)
df0.head()
Out[1]:
In [2]:
# For monitoring duration of pandas processes
from tqdm import tqdm
# To avoid RuntimeError: Set changed size during iteration
tqdm.monitor_interval = 0
# Register `pandas.progress_apply` and `pandas.Series.map_apply` with `tqdm`
# (can use `tqdm_gui`, `tqdm_notebook`, optional kwargs, etc.)
tqdm.pandas(desc="Progress:")
# Now you can use `progress_apply` instead of `apply`
# and `progress_map` instead of `map`
# can also groupby:
# df.groupby(0).progress_apply(lambda x: x**2)
In [3]:
def convert_text_to_list(review):
    return review.replace("[", "").replace("]", "").replace("'", "").replace("\t", "").split(",")
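The string-replacement approach above is fragile: it breaks on tokens that themselves contain commas, brackets, or quotes. Since the CSV cells hold stringified Python lists, a sketch using the standard library's `ast.literal_eval` (with the original splitting as a fallback) parses them more robustly:

```python
import ast

def convert_text_to_list_safe(review):
    """Parse a stringified Python list back into a list of tokens.

    Falls back to plain string splitting if the cell is not a
    valid Python list literal.
    """
    try:
        return ast.literal_eval(review)
    except (ValueError, SyntaxError):
        return review.replace("[", "").replace("]", "").replace("'", "").split(",")

# Example: a cell as it might appear in the CSV
print(convert_text_to_list_safe("['great', 'sound', 'quality']"))
```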
In [4]:
# Convert "reviewText" field to back to list
df0['reviewText'] = df0['reviewText'].astype(str)
df0['reviewText'] = df0['reviewText'].progress_apply(convert_text_to_list)
df0['reviewText'].head()
Out[4]:
In [5]:
df0['reviewText'][12]
Out[5]:
In [6]:
import nltk
nltk.__version__
Out[6]:
In [7]:
# Split tokens that were joined with "_" during negation handling
def split_neg(review):
    new_review = []
    for token in review:
        if '_' in token:
            # extend() keeps every part, not just the first two
            new_review.extend(token.split("_"))
        else:
            new_review.append(token)
    return new_review
In [8]:
df0["reviewText"] = df0["reviewText"].progress_apply(lambda review: split_neg(review))
df0["reviewText"].head()
Out[8]:
In [9]:
### Remove Stop Words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(review):
    return [token for token in review if token not in stop_words]
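One caveat worth noting: NLTK's English stop list includes negators such as "not" and "no", so this step silently discards the very tokens the split-negation step above reintroduced. A sketch of stop-word removal that whitelists negators (the stop-word set here is a small illustrative subset; in the notebook `stop_words` comes from NLTK):

```python
# Illustrative stop-word subset; stands in for stopwords.words('english')
stop_words = {"the", "a", "is", "not", "no", "nor", "this"}
negators = {"not", "no", "nor"}

def remove_stopwords_keep_negation(review):
    # Keep a token if it is a negator, even when it is a stop word
    return [tok for tok in review if tok in negators or tok not in stop_words]

print(remove_stopwords_keep_negation(["this", "is", "not", "a", "good", "phone"]))
# → ['not', 'good', 'phone']
```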
In [10]:
df0["reviewText"] = df0["reviewText"].progress_apply(lambda review: remove_stopwords(review))
df0["reviewText"].head()
Out[10]:
Follow this link for more info on the tagger: https://nlp.stanford.edu/software/tagger.shtml#History
In [11]:
# from nltk.tag import StanfordPOSTagger
# from nltk import word_tokenize
# # import os
# # os.getcwd()
# # Add the jar and model via their path (instead of setting environment variables):
# jar = '../models/stanford-postagger-full-2017-06-09/stanford-postagger.jar'
# model = '../models/stanford-postagger-full-2017-06-09/models/english-left3words-distsim.tagger'
# pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')
In [12]:
# def pos_tag(review):
# if(len(review)>0):
# return pos_tagger.tag(review)
In [13]:
## Example
# text = pos_tagger.tag(word_tokenize("What's the airspeed of an unladen swallow ?"))
# print(text)
In [14]:
# tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: pos_tag(review)))
# tagged_df.head()
Unfortunately, this tagger, though noticeably more accurate, is far slower: processing the dataset above would take close to 3 days of running.
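Before committing to a slow tagger, it can be worth extrapolating total runtime from a small sample. A sketch (the `slow_tag` stand-in below is hypothetical; the real call would be `pos_tagger.tag(review)`):

```python
import time

def estimate_total_seconds(tag_fn, sample_docs, total_docs):
    """Time tag_fn on a small sample and extrapolate linearly."""
    start = time.perf_counter()
    for doc in sample_docs:
        tag_fn(doc)
    elapsed = time.perf_counter() - start
    return elapsed / len(sample_docs) * total_docs

# Illustrative stand-in for a slow tagger
slow_tag = lambda doc: [(tok, "NN") for tok in doc]

est = estimate_total_seconds(slow_tag, [["a", "b", "c"]] * 10, total_docs=100_000)
print(f"estimated total runtime: {est:.2f}s")
```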
In [15]:
# from textblob import TextBlob
# def blob_pos_tagger(review):
# blob = TextBlob(" ".join(review))
# return blob.tags
# blob_tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: blob_pos_tagger(review)))
# blob_tagged_df.head()
In [16]:
nltk_tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: nltk.pos_tag(review)))
nltk_tagged_df.head()
Out[16]:
Thankfully, nltk provides documentation for each tag, which can be queried by tag name, e.g. nltk.help.upenn_tagset('RB'), or by a regular expression. nltk also provides a batch POS-tagging method (nltk.pos_tag_sents) for tagging many documents at once:
In [17]:
nltk_tagged_df['reviewText'][8]
Out[17]:
The list of all possible tags appears below:
Tag | Description |
---|---|
CC | Coordinating conjunction |
CD | Cardinal number |
DT | Determiner |
EX | Existential there |
FW | Foreign word |
IN | Preposition or subordinating conjunction |
JJ | Adjective |
JJR | Adjective, comparative |
JJS | Adjective, superlative |
LS | List item marker |
MD | Modal |
NN | Noun, singular or mass |
NNS | Noun, plural |
NNP | Proper noun, singular |
NNPS | Proper noun, plural |
PDT | Predeterminer |
POS | Possessive ending |
PRP | Personal pronoun |
PRP$ | Possessive pronoun |
RB | Adverb |
RBR | Adverb, comparative |
RBS | Adverb, superlative |
RP | Particle |
SYM | Symbol |
TO | to |
UH | Interjection |
VB | Verb, base form |
VBD | Verb, past tense |
VBG | Verb, gerund or present participle |
VBN | Verb, past participle |
VBP | Verb, non-3rd person singular present |
VBZ | Verb, 3rd person singular present |
WDT | Wh-determiner |
WP | Wh-pronoun |
WP$ | Possessive wh-pronoun |
WRB | Wh-adverb |
In [18]:
## Join with Original Key and Persist Locally to avoid RE-processing
uniqueKey_series_df = df0[['uniqueKey']]
uniqueKey_series_df.head()
Out[18]:
In [19]:
pos_tagged_keyed_reviews = pd.concat([uniqueKey_series_df, nltk_tagged_df], axis=1);
pos_tagged_keyed_reviews.head()
Out[19]:
In [20]:
pos_tagged_keyed_reviews.to_csv("../data/interim/002_pos_tagged_keyed_reviews.csv", sep='\t', header=True, index=False);
In [21]:
# Save a dictionary into a pickle file.
pos_tagged_keyed_reviews.to_pickle("../data/interim/002_pos_tagged_keyed_reviews.p")
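Pickle preserves Python objects inside DataFrame cells (here, lists of (word, tag) tuples), whereas CSV stringifies them, which is why `convert_text_to_list` was needed earlier. A small round-trip sketch (the path is a temporary file, not one of the notebook's artifacts):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"uniqueKey": ["k1"],
                   "reviewText": [[("great", "JJ"), ("sound", "NN")]]})

path = os.path.join(tempfile.gettempdir(), "roundtrip_demo.p")
df.to_pickle(path)
restored = pd.read_pickle(path)

# The tuples survive intact; a CSV round trip would return strings
print(type(restored["reviewText"][0][0]))
```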
Nouns generally refer to people, places, things, or concepts, e.g.: woman, Scotland, book, intelligence. Nouns can appear after determiners and adjectives, and can be the subject or object of the verb.
The simplified noun tags are N for common nouns like book, and NP for proper nouns like Scotland.
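Since `nltk.pos_tag` returns fine-grained Penn Treebank tags, a sketch of collapsing them into the simplified classes mentioned above (an illustrative mapping, not NLTK's own simplification):

```python
def simplify_tag(tag):
    """Map Penn Treebank noun tags to simplified N / NP classes."""
    if tag.startswith("NNP"):
        return "NP"   # proper noun: NNP, NNPS
    if tag.startswith("NN"):
        return "N"    # common noun: NN, NNS
    return tag        # everything else unchanged

tagged = [("Scotland", "NNP"), ("book", "NN"), ("reads", "VBZ")]
print([(w, simplify_tag(t)) for w, t in tagged])
# → [('Scotland', 'NP'), ('book', 'N'), ('reads', 'VBZ')]
```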
In [22]:
def noun_collector(word_tag_list):
    if len(word_tag_list) > 0:
        return [word for (word, tag) in word_tag_list if tag in {'NN', 'NNS', 'NNP', 'NNPS'}]
    return []  # empty review: return an empty list rather than None
In [23]:
nouns_df = pd.DataFrame(nltk_tagged_df['reviewText'].progress_apply(lambda review: noun_collector(review)))
nouns_df.head()
Out[23]:
In [24]:
keyed_nouns_df = pd.concat([uniqueKey_series_df, nouns_df], axis=1);
keyed_nouns_df.head()
Out[24]:
In [25]:
keyed_nouns_df.to_csv("../data/interim/002_keyed_nouns.csv", sep='\t', header=True, index=False);
In [26]:
# Save a dictionary into a pickle file.
keyed_nouns_df.to_pickle("../data/interim/002_keyed_nouns.p")
In [27]:
def adjective_collector(word_tag_list):
    if len(word_tag_list) > 0:
        # JJ (not NN) is the base adjective tag
        return [word for (word, tag) in word_tag_list if tag in {'JJ', 'JJR', 'JJS'}]
    return []  # empty review: return an empty list rather than None
In [28]:
adjectives_df = pd.DataFrame(nltk_tagged_df['reviewText'].progress_apply(lambda review: adjective_collector(review)))
adjectives_df.head()
Out[28]:
In [29]:
keyed_adjectives_df = pd.concat([uniqueKey_series_df, adjectives_df], axis=1);
keyed_adjectives_df.head()
Out[29]:
In [30]:
keyed_adjectives_df.to_csv("../data/interim/002_keyed_adjectives.csv", sep='\t', header=True, index=False);
In [31]:
# Save a dictionary into a pickle file.
keyed_adjectives_df.to_pickle("../data/interim/002_keyed_adjectives.p")
In [ ]:
# END OF FILE