Purpose: To experiment with Python's Natural Language Toolkit.

NLTK is a leading platform for building Python programs to work with human language data.


In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from collections import Counter

Input


In [2]:
bloboftext = """
This little piggy went to market,
This little piggy stayed home,
This little piggy had roast beef,
This little piggy had none,
And this little piggy went wee wee wee all the way home.
"""

Workflow

  • Tokenization to break text into units, e.g. words, phrases, or symbols
  • Stop word removal to get rid of common words
    • e.g. this, a, is

In [3]:
## Tokenization 
bagofwords = nltk.word_tokenize(bloboftext.lower())
print(len(bagofwords))


39
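
For reference, a quick peek at the first few tokens shows that word_tokenize keeps punctuation as separate tokens (a sketch; output not reproduced here):

print(bagofwords[:10])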

In [4]:
## Stop word removal
stop = stopwords.words('english')
bagofwords = [i for i in bagofwords if i not in stop]
print(len(bagofwords))


28
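
Note that the stop list removes only common English words; punctuation tokens such as "," and "." survive, which is why they appear in the counts below. To also drop them, a minimal sketch (using a separate variable so the later cells are unaffected):

import string

## Drop tokens that are bare punctuation marks
bagofwords_nopunct = [w for w in bagofwords if w not in string.punctuation]
print(len(bagofwords_nopunct))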

About stemmers and lemmatisation

  • Stemming to reduce a word to its root

    • e.g. having => hav
  • Lemmatisation to determine a word's lemma/canonical form

    • e.g. having => have

English Stemmers and Lemmatizers

For stemming English words with NLTK, you can choose between the PorterStemmer and the LancasterStemmer. The Porter stemming algorithm is the oldest stemming algorithm supported in NLTK, originally published in 1980. The Lancaster stemming algorithm is much newer, published in 1990, and can be more aggressive than the Porter algorithm. The SnowballStemmer used below is Martin Porter's later refinement of his original algorithm (often called Porter2).

The WordNet Lemmatizer uses the WordNet Database to look up lemmas. Lemmas differ from stems in that a lemma is a canonical form of the word, while a stem may not be a real word.
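
As a quick illustration (a sketch; the lemmatizer assumes the 'wordnet' corpus has been downloaded via nltk.download('wordnet')), the two stemmers and the WordNet lemmatizer can be compared on a few verbs:

from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

porter     = PorterStemmer()
lancaster  = LancasterStemmer()
lemmatizer = WordNetLemmatizer()  ## needs the 'wordnet' corpus

for word in ['having', 'stayed', 'went']:
    ## pos='v' tells the lemmatizer to treat each word as a verb
    print('%-8s porter=%-8s lancaster=%-8s lemma=%s' % (
        word, porter.stem(word), lancaster.stem(word),
        lemmatizer.lemmatize(word, pos='v')))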


In [5]:
snowball_stemmer = SnowballStemmer("english")

## Which words were stemmed?
_original = set(bagofwords) 
_stemmed  = set([snowball_stemmer.stem(i) for i in bagofwords])

print('BEFORE:\t%s' % ', '.join('"%s"' % x for x in _original - _stemmed))
print(' AFTER:\t%s' % ', '.join('"%s"' % x for x in _stemmed - _original))

del _original, _stemmed

## Proceed with stemming
bagofwords = [snowball_stemmer.stem(i) for i in bagofwords]


BEFORE:	"little", "piggy", "stayed"
 AFTER:	"piggi", "littl", "stay"

Count & POS tag of each stemmed/non-stop word


In [6]:
for token, count in Counter(bagofwords).most_common():
    print('%d\t%s\t%s' % (count, nltk.pos_tag([token])[0][1], token))


5	NN	piggi
5	NN	littl
4	,	,
3	NN	wee
2	NN	home
2	VBD	went
1	NN	none
1	NN	beef
1	NN	stay
1	NN	way
1	NN	roast
1	.	.
1	NN	market
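
Note that nltk.pos_tag is designed for full token sequences; tagging stemmed tokens one at a time, as above, discards sentence context and runs the tagger on non-words like "piggi". A sketch of the alternative, tagging the raw tokens in a single call before stemming:

## Context-aware tagging of the original (unstemmed) tokens
tagged = nltk.pos_tag(nltk.word_tokenize(bloboftext.lower()))
print(tagged[:8])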

Proportion of POS tags


In [7]:
record = {}
for token, count in Counter(bagofwords).most_common():
    postag = nltk.pos_tag([token])[0][1]

    if postag in record:
        record[postag] += count
    else:
        record[postag] = count

recordpd = pd.DataFrame.from_dict([record]).T
recordpd.columns = ['count']
N = sum(recordpd['count'])
recordpd['percent'] = recordpd['count']/N*100
recordpd


Out[7]:
count percent
, 4 14.285714
. 1 3.571429
NN 21 75.000000
VBD 2 7.142857
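
The same table can be built more directly with a pandas Series, a sketch assuming the bagofwords list from the cells above:

## Tag each stemmed token, then let pandas do the counting
tags = [nltk.pos_tag([t])[0][1] for t in bagofwords]
recordpd = pd.Series(tags).value_counts().to_frame('count')
recordpd['percent'] = recordpd['count'] / recordpd['count'].sum() * 100
recordpd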