Purpose: To experiment with Python's Natural Language Toolkit.

NLTK is a leading platform for building Python programs to work with human language data.


In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from collections import Counter

Input


In [2]:
bloboftext = """
This little piggy went to market,
This little piggy stayed home,
This little piggy had roast beef,
This little piggy had none,
And this little piggy went wee wee wee all the way home.
"""

Workflow

  • Tokenization to break text into units, e.g. words, phrases, or symbols
  • Stop word removal to get rid of common words
    • e.g. this, a, is

In [3]:
## Tokenization 
bagofwords = nltk.word_tokenize(bloboftext.lower())
print(len(bagofwords))


39
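
For reference, a quick peek at the first few tokens shows that word_tokenize keeps punctuation as separate tokens (a sketch; output not reproduced here):

print(bagofwords[:10])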

In [4]:
## Stop word removal
stop = stopwords.words('english')
bagofwords = [i for i in bagofwords if i not in stop]
print(len(bagofwords))


28
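
Note that the stop list removes only common English words; punctuation tokens such as "," and "." survive, which is why they appear in the counts below. To also drop them, a minimal sketch (using a separate variable so the later cells are unaffected):

import string

## Drop tokens that are bare punctuation marks
bagofwords_nopunct = [w for w in bagofwords if w not in string.punctuation]
print(len(bagofwords_nopunct))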

About stemmers and lemmatisation

  • Stemming to reduce a word to its root

    • e.g. having => hav
  • Lemmatisation to determine a word's lemma/canonical form

    • e.g. having => have

English Stemmers and Lemmatizers

For stemming English words with NLTK, you can choose between the PorterStemmer and the LancasterStemmer. The Porter stemming algorithm is the oldest stemming algorithm supported in NLTK, originally published in 1980. The Lancaster stemming algorithm is much newer, published in 1990, and can be more aggressive than the Porter algorithm. The SnowballStemmer used below is Martin Porter's later refinement of his original algorithm (often called Porter2).

The WordNet Lemmatizer uses the WordNet Database to look up lemmas. Lemmas differ from stems in that a lemma is a canonical form of the word, while a stem may not be a real word.
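
As a quick illustration (a sketch; the lemmatizer assumes the 'wordnet' corpus has been downloaded via nltk.download('wordnet')), the two stemmers and the WordNet lemmatizer can be compared on a few verbs:

from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

porter     = PorterStemmer()
lancaster  = LancasterStemmer()
lemmatizer = WordNetLemmatizer()  ## needs the 'wordnet' corpus

for word in ['having', 'stayed', 'went']:
    ## pos='v' tells the lemmatizer to treat each word as a verb
    print('%-8s porter=%-8s lancaster=%-8s lemma=%s' % (
        word, porter.stem(word), lancaster.stem(word),
        lemmatizer.lemmatize(word, pos='v')))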


In [5]:
snowball_stemmer = SnowballStemmer("english")

## Which words were stemmed?
_original = set(bagofwords) 
_stemmed  = set([snowball_stemmer.stem(i) for i in bagofwords])

print('BEFORE:\t%s' % ', '.join('"%s"' % x for x in _original - _stemmed))
print(' AFTER:\t%s' % ', '.join('"%s"' % x for x in _stemmed - _original))

del _original, _stemmed

## Proceed with stemming
bagofwords = [snowball_stemmer.stem(i) for i in bagofwords]


BEFORE:	"little", "piggy", "stayed"
 AFTER:	"piggi", "littl", "stay"

Count & POS tag of each stemmed/non-stop word


In [6]:
for token, count in Counter(bagofwords).most_common():
    print('%d\t%s\t%s' % (count, nltk.pos_tag([token])[0][1], token))


5	NN	piggi
5	NN	littl
4	,	,
3	NN	wee
2	NN	home
2	VBD	went
1	NN	none
1	NN	beef
1	NN	stay
1	NN	way
1	NN	roast
1	.	.
1	NN	market
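
Note that nltk.pos_tag is designed for full token sequences; tagging stemmed tokens one at a time, as above, discards sentence context and runs the tagger on non-words like "piggi". A sketch of the alternative, tagging the raw tokens in a single call before stemming:

## Context-aware tagging of the original (unstemmed) tokens
tagged = nltk.pos_tag(nltk.word_tokenize(bloboftext.lower()))
print(tagged[:8])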

Proportion of POS tags


In [7]:
record = {}
for token, count in Counter(bagofwords).most_common():
    postag = nltk.pos_tag([token])[0][1]

    if postag in record:
        record[postag] += count
    else:
        record[postag] = count

recordpd = pd.DataFrame.from_dict([record]).T
recordpd.columns = ['count']
N = sum(recordpd['count'])
recordpd['percent'] = recordpd['count']/N*100
recordpd


Out[7]:
count percent
, 4 14.285714
. 1 3.571429
NN 21 75.000000
VBD 2 7.142857
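
The same table can be built more directly with a pandas Series, a sketch assuming the bagofwords list from the cells above:

## Tag each stemmed token, then let pandas do the counting
tags = [nltk.pos_tag([t])[0][1] for t in bagofwords]
recordpd = pd.Series(tags).value_counts().to_frame('count')
recordpd['percent'] = recordpd['count'] / recordpd['count'].sum() * 100
recordpd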