Quick Reference Guide to NLTK

In progress...

*By [Diego Marinho de Oliveira](mailto:dmztheone@gmail.com)
Last Update: 08-18-2015*

This notebook demonstrates simple, quick, and useful examples of what we can do with NLTK. It can also be used as a reference guide. It is not intended to explore the NLTK package exhaustively. Many examples were taken directly from the NLTK Book, written by Steven Bird, Ewan Klein, and Edward Loper and distributed under the terms of the Creative Commons Attribution Noncommercial No-Derivative-Works 3.0 US License.

Import Libraries


In [1]:
import nltk
from __future__ import division
import matplotlib as mpl
from matplotlib import pyplot as plt
from nltk.book import *
from nltk.corpus import brown
from nltk.corpus import udhr
from nltk.corpus import wordnet as wn
from numpy import arange
import networkx as nx
%matplotlib inline


Ex.1. Similar Tokens


In [6]:
text1.similar("monstrous")


imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate

Ex.2. Common Context


In [5]:
text2.common_contexts(["monstrous", "very"])


a_pretty is_pretty a_lucky am_glad be_glad

Ex.3. Dispersion Plot


In [10]:
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])


Ex.4. Calculate Text Diversity


In [11]:
len(set(text4)) / len(text4)  # distinct tokens divided by total tokens


Out[11]:
0.06692970116993173
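
The ratio above can be wrapped in a small helper so it works on any token list, not only the `nltk.book` texts. This is a minimal sketch: the function name `lexical_diversity` and the sample sentence are illustrative assumptions, and the empty-input guard is an added convenience.

```python
def lexical_diversity(tokens):
    """Return distinct tokens divided by total tokens (0.0 for empty input)."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    # float() keeps the division exact even without `from __future__ import division`
    return len(set(tokens)) / float(len(tokens))

# Hypothetical sample text, whitespace-tokenized for illustration only
sample = "the cat sat on the mat and the dog sat too".split()
diversity = lexical_diversity(sample)
print(round(diversity, 3))
```

A value close to 1 means almost every token is distinct. A low value, like the one for `text4`, means the vocabulary is reused heavily.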

Ex.5. Top Most Common Tokens


In [20]:
nltk.FreqDist(text1).most_common(5)


Out[20]:
[(u',', 18713), (u'the', 13721), (u'.', 6862), (u'of', 6536), (u'and', 6024)]

Ex.6. Plot Cumulative Distribution of Tokens


In [22]:
nltk.FreqDist(text1).plot(50, cumulative=True)


Ex.7. Filter for Long Tokens


In [27]:
[w for w in text1 if len(w) > 15][:5]


Out[27]:
[u'CIRCUMNAVIGATION',
 u'uncomfortableness',
 u'cannibalistically',
 u'circumnavigations',
 u'superstitiousness']

Ex.8. Collocations


In [34]:
text4.collocations()


United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties

Ex.9. Conditional Frequency Distribution


In [7]:
cfd = nltk.ConditionalFreqDist(
            (genre, word)
            for genre in brown.categories()
            for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)


                 can could  may might must will 
           news   93   86   66   38   50  389 
       religion   82   59   78   12   54   71 
        hobbies  268   58  131   22   83  264 
science_fiction   16   49    4   12    8   16 
        romance   74  193   11   51   45   43 
          humor   16   30    8    8    9   13 
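
To see what a conditional frequency distribution stores, the idea can be sketched with the standard library alone: one counter per condition, filled from (condition, sample) pairs. The helper `conditional_freq` and the toy corpus are assumptions made for illustration. This is not how `nltk.ConditionalFreqDist` is implemented.

```python
from collections import Counter, defaultdict

def conditional_freq(pairs):
    """Group (condition, sample) pairs into one Counter per condition."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table

# Hypothetical two-genre corpus standing in for the Brown categories
toy_corpus = {
    'news': 'the council will meet and may vote but can not delay'.split(),
    'romance': 'she could not stay and he could not go'.split(),
}
pairs = ((genre, word) for genre, words in toy_corpus.items() for word in words)
toy_cfd = conditional_freq(pairs)

print(toy_cfd['romance']['could'])
print(toy_cfd['news']['will'])
```

Looking up a sample that never occurred under a condition returns a count of 0 rather than raising an error, which is convenient when building tables like the one above.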

Ex.10. Plot 1 Conditional Frequency Distribution


In [2]:
cfd = nltk.ConditionalFreqDist(
        (target, fileid)
            for fileid in inaugural.fileids()
            for w in inaugural.words(fileid)
            for target in ['america', 'citizen']
            if w.lower().startswith(target))
cfd.plot()


Ex.11. Plot 2 Conditional Frequency Distribution


In [3]:
languages = ['Chickasaw', 'English', 'German_Deutsch',
             'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']

cfd = nltk.ConditionalFreqDist(
              (lang, len(word))
              for lang in languages
              for word in udhr.words(lang + '-Latin1'))

cfd.plot(cumulative=True)


Ex.12. Plot 3 Conditional Frequency Distribution


In [4]:
names = nltk.corpus.names
names.fileids()  # ['female.txt', 'male.txt']
male_names = names.words('male.txt')
female_names = names.words('female.txt')
# Names that appear in both the male and the female list
ambiguous_names = [w for w in male_names if w in female_names]

cfd = nltk.ConditionalFreqDist(
              (fileid, name[-1])
              for fileid in names.fileids()
              for name in names.words(fileid))
cfd.plot()


Ex.13. Generate Random Text with Bigram


In [5]:
sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven',
        'and', 'the', 'earth', '.']
list(nltk.bigrams(sent))


Out[5]:
[('In', 'the'),
 ('the', 'beginning'),
 ('beginning', 'God'),
 ('God', 'created'),
 ('created', 'the'),
 ('the', 'heaven'),
 ('heaven', 'and'),
 ('and', 'the'),
 ('the', 'earth'),
 ('earth', '.')]

In [6]:
def generate_model(cfdist, word, num=15):
    result = ''
    for i in range(num):
        result += word + ' '
        word = cfdist[word].max()
        
    print(result.strip())

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)

cfd['living']
# FreqDist({'creature': 7, 'thing': 4, 'substance': 2, ',': 1, '.': 1, 'soul': 1})
generate_model(cfd, 'living')


living creature that he said , and the land of the land of the land
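
The repeated "the land of" in the output comes from always picking the single most likely next word: once the chain revisits a word, it follows the same path again. The sketch below reproduces this with the standard library on a tiny hand-made token list. The helper names `build_bigram_table` and `generate_greedy` are illustrative assumptions, not NLTK functions.

```python
from collections import Counter, defaultdict

def build_bigram_table(tokens):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        table[first][second] += 1
    return table

def generate_greedy(table, word, num=8):
    """Emit up to `num` words, always following the most frequent successor."""
    out = []
    for _ in range(num):
        out.append(word)
        followers = table.get(word)
        if not followers:
            break  # no known successor for this word
        word = followers.most_common(1)[0][0]
    return ' '.join(out)

# Hypothetical token list, chosen so that 'the' is most often followed by 'land'
tokens = 'the land of the land of the sea'.split()
table = build_bigram_table(tokens)
generated = generate_greedy(table, 'the', num=7)
print(generated)
```

Sampling the next word in proportion to its count, instead of always taking the maximum, is one common way to avoid such loops.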

Ex.14. Plot Bar Chart by Frequency Word Distribution


In [10]:
colors = 'rgbcmyk' # red, green, blue, cyan, magenta, yellow, black

def bar_chart(categories, words, counts):
    "Plot a bar chart showing counts for each word by category"
    ind = arange(len(words))
    width = 1 / (len(categories) + 1)
    bar_groups = []
    for c in range(len(categories)):
        bars = plt.bar(ind+c*width, counts[categories[c]], width,
                         color=colors[c % len(colors)])
        bar_groups.append(bars)
    plt.xticks(ind+width, words)
    plt.legend([b[0] for b in bar_groups], categories, loc='upper left')
    plt.ylabel('Frequency')
    plt.title('Frequency of Six Modal Verbs by Genre')
    plt.show()
    
genres = ['news', 'religion', 'hobbies', 'government', 'adventure']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfdist = nltk.ConditionalFreqDist(
                 (genre, word)
                 for genre in genres
                 for word in nltk.corpus.brown.words(categories=genre)
                 if word in modals)
   
counts = {}
for genre in genres:
    counts[genre] = [cfdist[genre][word] for word in modals]
    
bar_chart(genres, modals, counts)


Ex.15. Words as a Graph


In [12]:
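def traverse(graph, start, node):
    # In NLTK 3, Synset.name is a method, so it must be called
    graph.depth[node.name()] = node.shortest_path_distance(start)
    for child in node.hyponyms():
        graph.add_edge(node.name(), child.name())
        traverse(graph, start, child)

def hyponym_graph(start):
    "Build an undirected graph of all hyponyms reachable from start"
    G = nx.Graph()
    G.depth = {}
    traverse(G, start, start)
    return G

def graph_draw(graph):
    # draw_graphviz needs networkx 1.x with pygraphviz installed;
    # networkx 2.0 and later no longer provide it
    nx.draw_graphviz(graph,
         node_size = [16 * graph.degree(n) for n in graph],
         node_color = [graph.depth[n] for n in graph],
         with_labels = False)
    plt.show()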

dog = wn.synset('dog.n.01')
graph = hyponym_graph(dog)
graph_draw(graph)