Purpose: To experiment with Python's Natural Language Toolkit (NLTK).
NLTK is a leading platform for building Python programs to work with human language data.
In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from collections import Counter
In [2]:
bloboftext = """
This little piggy went to market,
This little piggy stayed home,
This little piggy had roast beef,
This little piggy had none,
And this little piggy went wee wee wee all the way home.
"""
In [3]:
## Tokenization
bagofwords = nltk.word_tokenize(bloboftext.lower())
print(len(bagofwords))
In [4]:
## Stop word removal
stop = stopwords.words('english')
bagofwords = [i for i in bagofwords if i not in stop]
print(len(bagofwords))
Stemming reduces a word to its root.
Lemmatisation determines a word's lemma, or canonical form.
English Stemmers and Lemmatizers
For stemming English words with NLTK, you can choose between the PorterStemmer and the LancasterStemmer. The Porter Stemming Algorithm is the oldest stemming algorithm supported in NLTK, originally published in 1980. The Lancaster Stemming Algorithm is much newer, published in 1990, and can be more aggressive than the Porter algorithm. This notebook uses the SnowballStemmer, whose English stemmer (sometimes called Porter2) is an improved version of the original Porter algorithm.
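As a quick illustration of the difference between the two, here is a minimal sketch, separate from this notebook's pipeline; the word list is arbitrary:

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
for word in ['maximum', 'running', 'happiness']:
    ## Lancaster tends to trim more aggressively than Porter
    print('%s -> porter: %s, lancaster: %s' % (word, porter.stem(word), lancaster.stem(word)))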
The WordNet Lemmatizer uses the WordNet database to look up lemmas. Lemmas differ from stems in that a lemma is a canonical form of the word, while a stem may not be a real word.
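A minimal sketch of the lemma/stem contrast, assuming the WordNet data has been fetched with nltk.download('wordnet'):

from nltk.stem import WordNetLemmatizer, SnowballStemmer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer('english')
for word in ['leaves', 'studies']:
    ## The lemma is a real dictionary word; the stem need not be
    print('%s -> lemma: %s, stem: %s' % (word, lemmatizer.lemmatize(word), stemmer.stem(word)))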
In [5]:
snowball_stemmer = SnowballStemmer("english")
## Which words were stemmed?
_original = set(bagofwords)
_stemmed = {snowball_stemmer.stem(i) for i in bagofwords}
print('BEFORE:\t%s' % ', '.join('"%s"' % x for x in _original - _stemmed))
print(' AFTER:\t%s' % ', '.join('"%s"' % x for x in _stemmed - _original))
del _original, _stemmed
## Proceed with stemming
bagofwords = [snowball_stemmer.stem(i) for i in bagofwords]
Tag reference for the output below: NN = noun, singular or mass; VBD = verb, past tense.
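Note that the cell below tags each token in isolation, which discards sentence context. As a rough sketch of what that sacrifices (assuming the tagger data has been fetched with nltk.download('averaged_perceptron_tagger'); the sentence is illustrative):

tokens = nltk.word_tokenize('this little piggy went to market')
## Tagging the whole sentence lets the tagger use surrounding words...
print(nltk.pos_tag(tokens))
## ...while tagging tokens one at a time, as in the cell below, does not
print([nltk.pos_tag([tok])[0] for tok in tokens])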
In [6]:
for token, count in Counter(bagofwords).most_common():
    print('%d\t%s\t%s' % (count, nltk.pos_tag([token])[0][1], token))
In [7]:
record = {}
## Aggregate token counts by part-of-speech tag
for token, count in Counter(bagofwords).most_common():
    postag = nltk.pos_tag([token])[0][1]
    if postag in record:
        record[postag] += count
    else:
        record[postag] = count
recordpd = pd.DataFrame.from_dict(record, orient='index', columns=['count'])
N = recordpd['count'].sum()
recordpd['percent'] = recordpd['count'] / N * 100
recordpd
Out[7]:
(DataFrame of POS-tag counts and percentages: one row per tag, with 'count' and 'percent' columns)