Introduction to PMFs

Copyright 2015 Allen Downey

License: Creative Commons Attribution 4.0 International


In [1]:
# this future import makes this code mostly compatible with Python 2 and 3
from __future__ import print_function, division

import random

import numpy as np
import matplotlib.pyplot as plt

from itertools import product, tee
from collections import Counter, defaultdict

# izip and urlopen moved between Python 2 and 3
try:
    from itertools import izip
    from urllib2 import urlopen
except ImportError:
    izip = zip
    from urllib.request import urlopen

%matplotlib inline

# less awful colors from http://colorbrewer2.org/
LILAC = '#998ec3'
ORANGE = '#f1a340'

A histogram is a map from each possible value to the number of times it appears. A map can be a mathematical function or, as in the examples below, a Python data structure that provides the ability to look up a value and get its count.

Counter is a data structure provided by Python's collections module; I am defining a new data structure, called a Hist, that has all the features of a Counter, plus a few more.


In [2]:
class Hist(Counter):
    
    def __add__(self, other):
        """Returns the Hist of the sum of elements from self and other."""
        return Hist(x + y for x, y in product(self.elements(), other.elements()))
    
    def choice(self):
        """Chooses a random element, weighted by count."""
        return random.choice(list(self.elements()))
    
    def plot(self, **options):
        """Plots the Hist as a bar chart."""
        plt.bar(*zip(*self.items()), **options)
        plt.xlabel('values')
        plt.ylabel('counts')
    
    def ranks(self):
        """Returns ranks (starting from 1) and counts in descending order."""
        return izip(*enumerate(sorted(self.values(), reverse=True), 1))

As an example, I'll make a Hist of the letters in my name:


In [3]:
hist = Hist('allen')
hist


Out[3]:
Hist({'l': 2, 'a': 1, 'e': 1, 'n': 1})

We can look up a letter and get the corresponding count:


In [4]:
hist['l']


Out[4]:
2

Or loop through all the letters and print their counts:


In [5]:
for letter in hist:
    print(letter, hist[letter])


a 1
e 1
l 2
n 1

Counter provides most_common, which returns a list of (element, count) pairs, sorted by count in descending order:


In [6]:
hist.most_common()


Out[6]:
[('l', 2), ('a', 1), ('e', 1), ('n', 1)]

Here they are in a more readable form:


In [7]:
for letter, count in hist.most_common():
    print(letter, count)


l 2
a 1
e 1
n 1

I defined choice, which returns a random element from the Hist, with probability proportional to its count. On average, 'l' should appear twice as often as each of the other letters.


In [103]:
for i in range(10):
    print(hist.choice())


l
l
l
e
e
l
l
l
a
a
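
As a quick sanity check (a sketch, not from the original notebook), we can draw many samples and confirm that the ratio comes out near 2:1:


In [ ]:
# count 6000 random draws; 'l' should appear about 2000 times
# and each of the other letters about 1000 times
samples = Hist(hist.choice() for _ in range(6000))
samples.most_common()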

One (perhaps surprising) thing you can use Hists for: checking whether two words are anagrams of each other. If two words are anagrams, they have the same Hist:


In [9]:
def is_anagram(word1, word2):
    return Hist(word1) == Hist(word2)

Here's a simple test:


In [10]:
is_anagram('allen', 'nella')


Out[10]:
True

And my favorite anagram pair:


In [11]:
is_anagram('tachymetric', 'mccarthyite')


Out[11]:
True

And here's a false one, just to make sure:


In [12]:
is_anagram('abcd', 'abccd')


Out[12]:
False
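
The same idea extends to finding all the anagram groups in a word list. Here's a sketch (not from the original notebook); since a Hist is a dict and therefore not hashable, it keys each group on the word's sorted letters, which is equivalent to comparing Hists:


In [ ]:
def anagram_groups(words):
    """Groups words that are anagrams of each other."""
    groups = defaultdict(list)
    for word in words:
        groups[''.join(sorted(word))].append(word)
    return [group for group in groups.values() if len(group) > 1]

anagram_groups(['allen', 'nella', 'pots', 'stop', 'spot', 'abcd'])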

So far the elements in the Hists have been letters (actually strings), but in statistics it is more common to work with numerical elements. Here's a Hist that represents the possible outcomes of a six-sided die:


In [13]:
d6 = Hist([1, 2, 3, 4, 5, 6])
d6


Out[13]:
Hist({1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1})

Hist provides a plot function:


In [14]:
d6.plot(color=LILAC)


elements returns an iterator over the elements, repeating each one according to its count:


In [15]:
d6.elements()


Out[15]:
<itertools.chain at 0x7fcd43810410>

Which is easier to see if you convert to a list:


In [16]:
list(d6.elements())


Out[16]:
[1, 2, 3, 4, 5, 6]

The product of two iterators is an iterator that enumerates all pairs:


In [17]:
product(d6.elements(), d6.elements())


Out[17]:
<itertools.product at 0x7fcd438b34b0>

Here are the elements of the product:


In [18]:
list(product(d6.elements(), d6.elements()))


Out[18]:
[(1, 1),
 (1, 2),
 (1, 3),
 (1, 4),
 (1, 5),
 (1, 6),
 (2, 1),
 (2, 2),
 (2, 3),
 (2, 4),
 (2, 5),
 (2, 6),
 (3, 1),
 (3, 2),
 (3, 3),
 (3, 4),
 (3, 5),
 (3, 6),
 (4, 1),
 (4, 2),
 (4, 3),
 (4, 4),
 (4, 5),
 (4, 6),
 (5, 1),
 (5, 2),
 (5, 3),
 (5, 4),
 (5, 5),
 (5, 6),
 (6, 1),
 (6, 2),
 (6, 3),
 (6, 4),
 (6, 5),
 (6, 6)]

Now we can compute the sum of all pairs:


In [19]:
list(x + y for x, y in product(d6.elements(), d6.elements()))


Out[19]:
[2,
 3,
 4,
 5,
 6,
 7,
 3,
 4,
 5,
 6,
 7,
 8,
 4,
 5,
 6,
 7,
 8,
 9,
 5,
 6,
 7,
 8,
 9,
 10,
 6,
 7,
 8,
 9,
 10,
 11,
 7,
 8,
 9,
 10,
 11,
 12]

And finally make a Hist of the sums:


In [20]:
Hist(x + y for x, y in product(d6.elements(), d6.elements()))


Out[20]:
Hist({7: 6, 6: 5, 8: 5, 5: 4, 9: 4, 4: 3, 10: 3, 3: 2, 11: 2, 2: 1, 12: 1})

But all of that is provided by __add__, which we can call using the + operator:


In [21]:
twice = d6 + d6
twice


Out[21]:
Hist({7: 6, 6: 5, 8: 5, 5: 4, 9: 4, 4: 3, 10: 3, 3: 2, 11: 2, 2: 1, 12: 1})

Now we can plot the histogram of outcomes from rolling two dice:


In [22]:
twice.plot(color=ORANGE)


Or three dice:


In [23]:
thrice = twice + d6
thrice


Out[23]:
Hist({10: 27, 11: 27, 9: 25, 12: 25, 8: 21, 13: 21, 7: 15, 14: 15, 6: 10, 15: 10, 5: 6, 16: 6, 4: 3, 17: 3, 3: 1, 18: 1})

Notice that this is looking more and more like a bell curve:


In [24]:
thrice.plot(color=LILAC)


As the number of dice increases, the result converges to a normal distribution, also known as a Gaussian distribution.
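
To see the convergence, we can add more dice. Here's a sketch (not from the original notebook) that sums six of them; note that each + enumerates every combination of elements, so the cost grows like $6^n$:


In [ ]:
# the Hist of the sum of six dice; the total count is 6**6 = 46656
six_dice = d6 + d6 + d6 + d6 + d6 + d6
six_dice.plot(color=ORANGE)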

Are first babies more likely to be late?

This is one of the first topics I wrote about in my blog, and still the most popular, with more than 100,000 page views:

http://allendowney.blogspot.com/2011/02/are-first-babies-more-likely-to-be-late.html

I used data from the National Survey of Family Growth (NSFG):


In [25]:
import thinkstats2

dct_file = '2002FemPreg.dct'
dat_file = '2002FemPreg.dat.gz'

dct = thinkstats2.ReadStataDct(dct_file)
preg = dct.ReadFixedWidth(dat_file, compression='gzip')

preg


Out[25]:
[DataFrame display abridged: 13593 rows × 243 columns, including caseid, pregordr, howpreg_n, howpreg_p, moscurrp, nowprgdk, pregend1, pregend2, nbrnaliv, multbrth, ..., poverty_i, laborfor_i, religion_i, metro_i, basewgt, adj_mod_basewgt, finalwgt, secu_p, sest, cmintvw]

The variable outcome encodes the outcome of the pregnancy. Outcome 1 is a live birth.


In [26]:
preg.outcome.value_counts().sort_index()


Out[26]:
1    9148
2    1862
3     120
4    1921
5     190
6     352
dtype: int64

pregordr encodes pregnancy order: 1 for first pregnancies, 2 for second pregnancies, and so on.


In [27]:
preg.pregordr.value_counts().sort_index()


Out[27]:
1     5033
2     3766
3     2334
4     1224
5      613
6      308
7      158
8       78
9       38
10      17
11       8
12       5
13       3
14       3
15       1
16       1
17       1
18       1
19       1
dtype: int64

I selected live births, then split into first babies and others.


In [28]:
live = preg[preg.outcome == 1]
firsts = live[live.birthord == 1]
others = live[live.birthord != 1]
len(firsts), len(others)


Out[28]:
(4413, 4735)

The mean pregnancy lengths are slightly different:


In [29]:
firsts.prglngth.mean(), others.prglngth.mean()


Out[29]:
(38.600951733514613, 38.522914466737063)

The difference is 0.078 weeks:


In [30]:
diff = firsts.prglngth.mean() - others.prglngth.mean()
diff


Out[30]:
0.078037266777549519

That's about 13 hours. Note: the best units to report are often not the units you computed.


In [31]:
diff * 7 * 24


Out[31]:
13.110260818628319

Let's see if we can visualize the difference in the histograms:


In [32]:
first_hist = Hist(firsts.prglngth)
other_hist = Hist(others.prglngth)

I used some plotting options to put two bar charts side-by-side:


In [33]:
def plot_distributions(dist1, dist2):
    dist1.plot(width=-0.45, align='edge', color=LILAC, label='firsts')
    dist2.plot(width=0.45, align='edge', color=ORANGE, label='others')
    plt.xlim(33.5, 43.5)
    plt.legend()

Here are the two histograms:


In [34]:
plot_distributions(first_hist, other_hist)


Remember that the vertical axis is counts. In this case, we are comparing counts with different totals, which might be misleading.

An alternative is to compute a probability mass function (PMF), which divides the counts by the totals, yielding a map from each element to its probability.

The probabilities are "normalized" to add up to 1.


In [35]:
class Pmf(Hist):
    
    def normalize(self):
        """Divides each count by the total, so the probabilities add up to 1."""
        total = sum(self.values())
        for element in self:
            self[element] /= total
        return self
    
    def plot_cumulative(self, **options):
        """Plots the running total of the sorted probabilities."""
        xs, ps = zip(*sorted(self.items()))
        cs = np.cumsum(ps, dtype=float)
        cs /= cs[-1]
        plt.plot(xs, cs, **options)

Now we can compare PMFs fairly.


In [36]:
first_pmf = Pmf(firsts.prglngth).normalize()
other_pmf = Pmf(others.prglngth).normalize()

Using PMFs, we see that some of the difference at 39 weeks was an artifact of the different sample sizes:


In [37]:
plot_distributions(first_pmf, other_pmf)


Even so, it is not easy to compare PMFs. One more alternative is the cumulative distribution function (CDF), which shows, for each $t$, the total probability up to and including $t$.
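
As a warm-up (a sketch, not from the original notebook), the cumulative curve for a fair die rises in equal steps, reaching 1 at the largest value:


In [ ]:
# the CDF of a single die: the probability of a value up to k is k/6
die_pmf = Pmf(d6).normalize()
die_pmf.plot_cumulative(color=LILAC)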


In [38]:
first_pmf.plot_cumulative(linewidth=4, color=LILAC, label='firsts')
other_pmf.plot_cumulative(linewidth=4, color=ORANGE, label='others')
plt.xlim(23.5, 44.5)
plt.legend(loc='upper left')


Out[38]:
<matplotlib.legend.Legend at 0x7fcd33e5c6d0>

The CDFs are similar up to week 38. After that, first babies are more likely to be born late.

Note: don't be afraid of thick lines. Differences that are only visible with thin lines are unlikely to be real.

One other thought: cumulative curves are often a good option for visualizing noisy series. For example, the graphic below works pretty well despite some questionable aesthetic choices.


In [46]:
from IPython.display import Image
Image(filename='cumulative_snowfall.png')


Out[46]:
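
In the same spirit, here's a sketch (not from the original notebook) of why cumulative curves tame noise; the running total of a noisy daily series is much smoother than the series itself:


In [ ]:
# hypothetical noisy daily series and its cumulative total
values = np.random.normal(10, 5, size=365)
plt.plot(np.cumsum(values), color=LILAC)
plt.xlabel('day')
plt.ylabel('cumulative total')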

Word Frequencies

Next topic: let's look at histograms of words, bigrams and trigrams.

The following function reads lines from a file and splits them into words:


In [82]:
def iterate_words(filename):
    """Reads lines from a file and splits them into words."""
    with open(filename) as f:
        for line in f:
            for word in line.split():
                yield word

Here's an example using a book from Project Gutenberg. wc is a histogram of word counts:


In [85]:
# FAIRY TALES
# By The Brothers Grimm
# http://www.gutenberg.org/cache/epub/2591/pg2591.txt
wc = Hist(iterate_words('pg2591.txt'))

Here are the 20 most common words:


In [41]:
wc.most_common(20)


Out[41]:
[('the', 6507),
 ('and', 5250),
 ('to', 2707),
 ('a', 1932),
 ('he', 1817),
 ('of', 1450),
 ('was', 1337),
 ('in', 1080),
 ('she', 1049),
 ('that', 1021),
 ('his', 1014),
 ('you', 941),
 ('it', 881),
 ('her', 880),
 ('had', 827),
 ('I', 755),
 ('they', 751),
 ('for', 721),
 ('with', 720),
 ('as', 718)]

Word frequencies in natural languages follow a predictable pattern called Zipf's law (which is an instance of Stigler's law, which is also an instance of Stigler's law).

We can see the pattern by lining up the words in descending order of frequency and plotting their ranks (1st, 2nd, 3rd, ...) versus their counts (6507, 5250, 2707, ...):


In [42]:
ranks, counts = wc.ranks()
plt.plot(ranks, counts, linewidth=10, color=ORANGE)
plt.xlabel('rank')
plt.ylabel('count')


Out[42]:
<matplotlib.text.Text at 0x7fcd43690c90>

Huh. Maybe that's not so clear after all. The problem is that the counts drop off very quickly. If we use the highest count to scale the figure, most of the other counts are indistinguishable from zero.

Also, there are more than 10,000 words, but most of them appear only a few times, so we are wasting most of the space in the figure in a regime where nothing is happening.

This kind of thing happens a lot. A common way to deal with it is to compute the log of the quantities or to plot them on a log scale:


In [43]:
ranks, counts = wc.ranks()
plt.plot(ranks, counts, linewidth=4, color=ORANGE)
plt.xlabel('rank')
plt.ylabel('count')
plt.xscale('log')
plt.yscale('log')


This (approximately) straight line is characteristic of Zipf's law.
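
To quantify the slope, we can fit a line in log-log space. This is a sketch (not from the original notebook); Zipf's law predicts an exponent near 1:


In [ ]:
# fit log(count) as a linear function of log(rank)
ranks, counts = wc.ranks()
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print('estimated Zipf exponent:', -slope)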

n-grams

On to the next topic: bigrams and trigrams.


In [44]:
def pairwise(iterator):
    """Iterates through a sequence in overlapping pairs.
    
    If the sequence is 1, 2, 3, 4, the result is (1, 2), (2, 3), (3, 4).
    """
    a, b = tee(iterator)
    next(b, None)
    return izip(a, b)

bigrams is the histogram of word pairs:


In [90]:
bigrams = Hist(pairwise(iterate_words('pg2591.txt')))

And here are the 20 most common:


In [48]:
bigrams.most_common(20)


Out[48]:
[(('to', 'the'), 444),
 (('in', 'the'), 399),
 (('of', 'the'), 369),
 (('and', 'the'), 349),
 (('into', 'the'), 294),
 (('said', 'the'), 251),
 (('on', 'the'), 199),
 (('and', 'when'), 168),
 (('he', 'was'), 164),
 (('he', 'had'), 164),
 (('to', 'be'), 163),
 (('it', 'was'), 152),
 (('Then', 'the'), 151),
 (('I', 'will'), 149),
 (('that', 'he'), 143),
 (('at', 'the'), 142),
 (('came', 'to'), 138),
 (('and', 'he'), 135),
 (('she', 'was'), 129),
 (('all', 'the'), 125)]

Similarly, we can iterate the trigrams:


In [49]:
def triplewise(iterator):
    """Iterates through a sequence in overlapping triples."""
    a, b, c = tee(iterator, 3)
    next(b, None)
    next(c, None)
    next(c, None)
    return izip(a, b, c)

And make a histogram:


In [93]:
trigrams = Hist(triplewise(iterate_words('pg2591.txt')))

# Uncomment this line to run the analysis with Elvis Presley lyrics
#trigrams = Hist(triplewise(iterate_words('lyrics-elvis-presley.txt')))

Here are the 20 most common:


In [94]:
trigrams.most_common(20)


Out[94]:
[(('came', 'to', 'the'), 65),
 (('and', 'when', 'he'), 50),
 (('out', 'of', 'the'), 50),
 (('said', 'to', 'the'), 34),
 (('he', 'came', 'to'), 33),
 (('and', 'when', 'she'), 33),
 (('went', 'into', 'the'), 32),
 (('went', 'to', 'the'), 31),
 (('and', 'said', 'to'), 31),
 (('came', 'to', 'a'), 30),
 (('one', 'of', 'the'), 30),
 (('and', 'as', 'he'), 29),
 (('they', 'came', 'to'), 29),
 (('he', 'did', 'not'), 28),
 (('there', 'was', 'a'), 28),
 (('that', 'he', 'had'), 28),
 (('and', 'I', 'will'), 27),
 (('that', 'it', 'was'), 25),
 (('and', 'at', 'last'), 24),
 (('and', 'when', 'the'), 24)]

And now for a little fun. I'll make a dictionary that maps from each word pair to a Hist of the words that can follow.


In [95]:
# map each word pair (a, b) to a Hist of the words c that follow it
d = defaultdict(Hist)
for a, b, c in trigrams:
    d[a, b][c] += trigrams[a, b, c]

Now we can look up a pair and see what might come next:


In [96]:
d['the', 'blood']


Out[96]:
Hist({'ran': 2, 'of': 2, 'on': 1, 'streamed': 1, 'came,': 1, 'ran.': 1, 'that': 1, 'fell': 1, 'might': 1})

Here are the most common words that follow "into the":


In [97]:
d['into', 'the'].most_common(10)


Out[97]:
[('forest', 15),
 ('forest,', 13),
 ('garden', 9),
 ('kitchen,', 8),
 ('cellar', 8),
 ('room,', 7),
 ('wide', 7),
 ('water,', 7),
 ('wood', 6),
 ('kitchen', 6)]

Here are the words that follow "said the":


In [98]:
d['said', 'the'].most_common(10)


Out[98]:
[('old', 13),
 ('man,', 12),
 ('little', 10),
 ('fisherman,', 8),
 ('father,', 7),
 ('ass,', 6),
 ('tailor,', 5),
 ('wife,', 5),
 ('fish;', 5),
 ('other;', 5)]

Hist provides choice, which chooses a random word with probability proportional to count:


In [99]:
d['said', 'the'].choice()


Out[99]:
'sparrow,'

Given a prefix, we can choose a random suffix:


In [100]:
prefix = 'said', 'the'
suffix = d[prefix].choice()
suffix


Out[100]:
'fisherman,'

Then we can shift the words and compute the next prefix:


In [101]:
prefix = prefix[1], suffix
prefix


Out[101]:
('the', 'fisherman,')

Repeating this process, we can generate random new text that has the same correlation structure between words as the original:


In [102]:
for i in range(100):
    suffix = d[prefix].choice()
    print(suffix, end=' ')
    prefix = prefix[1], suffix


'the fish cannot make you very happy. You must be a sin no good can come.' And his father some trouble!' When anything had to go to the eldest wanted to climb like me, you nasty frog? My golden ball in his stomach knocked against each other as if they have bruised and wounded those poor trees; they will let the thing be till tomorrow; the finger and thrust into his pockets whatever could be seen approaching, and this time her coachman and everything about her, and brought him what he had been saved by the terms of the lake 

With a prefix of two words, we typically get text that flirts with sensibility.
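
As a final sketch (not from the original notebook), we can package the loop as a function that generates any number of words from a given starting pair:


In [ ]:
def generate_text(d, prefix, n=100):
    """Generates n random words, starting from the given two-word prefix.
    
    Note: like the loop above, this fails if it reaches a prefix
    that never appears in the text.
    """
    words = []
    for _ in range(n):
        suffix = d[prefix].choice()
        words.append(suffix)
        prefix = prefix[1], suffix
    return ' '.join(words)

print(generate_text(d, ('said', 'the')))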

