```
In [1]:
```# this line makes the code compatible with Python 2 and 3
from __future__ import print_function, division
# this line makes Jupyter show figures in the notebook
%matplotlib inline

A histogram is a map from each possible value to the number of times it appears. A map can be a mathematical function or, as in the examples below, a Python data structure that lets you look up a value and get its count.

`Counter` is a data structure provided by Python; I am defining a new data structure, called a `Hist`, that has all the features of a `Counter`, plus a few more that I define.
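As a quick standalone illustration of the map idea, Python's built-in `Counter` already behaves this way:

```python
from collections import Counter

# A Counter maps each value to the number of times it appears.
letters = Counter('banana')
print(letters['a'])   # 3 -- 'a' appears three times
print(letters['z'])   # 0 -- missing values have count 0
```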

```
In [2]:
```import random
import matplotlib.pyplot as plt
from collections import Counter
from itertools import product

class Hist(Counter):
    def __add__(self, other):
        """Returns the Hist of the sum of elements from self and other."""
        return Hist(x + y for x, y in product(self.elements(), other.elements()))

    def choice(self):
        """Chooses a random element, weighted by count."""
        return random.choice(list(self.elements()))

    def plot(self, **options):
        """Plots the Hist as a bar chart."""
        plt.bar(*zip(*self.items()), **options)
        plt.xlabel('Values')
        plt.ylabel('Counts')

    def ranks(self):
        """Returns ranks (1st, 2nd, 3rd, ...) and counts as parallel sequences."""
        return zip(*enumerate(sorted(self.values(), reverse=True), start=1))

As an example, I'll make a Hist of the letters in my name:

```
In [3]:
```hist = Hist('allen')
hist

```
Out[3]:
```

We can look up a letter and get the corresponding count:

```
In [4]:
```hist['l']

```
Out[4]:
```

Or loop through all the letters and print their counts:

```
In [5]:
```for letter in hist:
    print(letter, hist[letter])

```
```

`Counter` provides `most_common`, which makes a list of (element, count) pairs:

```
In [6]:
```hist.most_common()

```
Out[6]:
```

Here they are in a more readable form:

```
In [7]:
```for letter, count in hist.most_common():
    print(letter, count)

```
```

`Hist` also provides `choice`, which returns a random element from the `Hist`. On average, 'l' should appear twice as often as the other letters.
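Over many draws, the frequencies should settle near the counts. Here's a standalone check using `Counter` and `random.choice` directly, which is exactly what `Hist.choice` wraps:

```python
import random
from collections import Counter

random.seed(1)
counts = Counter('allen')

# Draw 10,000 elements; 'l' (count 2) should appear about
# twice as often as 'a', 'e', or 'n' (count 1 each).
samples = Counter(random.choice(list(counts.elements()))
                  for _ in range(10000))
print(samples['l'] / 10000)  # should be near 2/5
```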

```
In [8]:
```for i in range(10):
    print(hist.choice())

```
```

Histograms make it easy to check whether two words are anagrams:

```
In [9]:
```def is_anagram(word1, word2):
    return Hist(word1) == Hist(word2)

Here's a simple test:

```
In [10]:
```is_anagram('allen', 'nella')

```
Out[10]:
```

And my favorite anagram pair:

```
In [11]:
```is_anagram('tachymetric', 'mccarthyite')

```
Out[11]:
```

And here's a false one, just to make sure:

```
In [12]:
```is_anagram('abcd', 'abccd')

```
Out[12]:
```

To demonstrate `__add__`, I'll make a `Hist` that represents a six-sided die:

```
In [13]:
```d6 = Hist([1,2,3,4,5,6])
d6

```
Out[13]:
```

`Hist` provides a `plot` method:

```
In [14]:
```import seaborn as sns
COLORS = sns.color_palette()
d6.plot(color=COLORS[3])

```
```

`elements` returns an iterator:

```
In [15]:
```d6.elements()

```
Out[15]:
```

Which is easier to see if you convert to a list:

```
In [16]:
```list(d6.elements())

```
Out[16]:
```

The product of two iterators is an iterator that enumerates all pairs:

```
In [17]:
```from itertools import product
product(d6.elements(), d6.elements())

```
Out[17]:
```

Here are the elements of the product:

```
In [18]:
```list(product(d6.elements(), d6.elements()))

```
Out[18]:
```

Now we can compute the sum of all pairs:

```
In [19]:
```list(x + y for x, y in product(d6.elements(), d6.elements()))

```
Out[19]:
```

And finally make a Hist of the sums:

```
In [20]:
```Hist(x + y for x, y in product(d6.elements(), d6.elements()))

```
Out[20]:
```

But all of that is provided by `__add__`, which we can call using the `+` operator:

```
In [21]:
```twice = d6 + d6
twice

```
Out[21]:
```
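As a sanity check with a plain `Counter` (independent of `Hist`), the sum of two fair dice should have six ways to make 7, the most common outcome:

```python
from collections import Counter
from itertools import product

sums = Counter(x + y for x, y in product(range(1, 7), repeat=2))
print(sums[7])   # 6 ways: 1+6, 2+5, 3+4, 4+3, 5+2, 6+1
print(sums[2])   # 1 way: 1+1
```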

Now we can plot the histogram of outcomes from rolling two dice:

```
In [22]:
```twice.plot(color=COLORS[2])

```
```

Or three dice:

```
In [23]:
```thrice = twice + d6
thrice

```
Out[23]:
```

Notice that this is looking more and more like a bell curve:

```
In [24]:
```thrice.plot(color=COLORS[1])

```
```

This is one of the first topics I wrote about in my blog, and still the most popular, with more than 100,000 page views:

http://allendowney.blogspot.com/2011/02/are-first-babies-more-likely-to-be-late.html

I used data from the National Survey of Family Growth (NSFG):

```
In [25]:
```import thinkstats2
dct_file = '2002FemPreg.dct'
dat_file = '2002FemPreg.dat.gz'
dct = thinkstats2.ReadStataDct(dct_file)
preg = dct.ReadFixedWidth(dat_file, compression='gzip')
preg.head()

```
Out[25]:
```

The variable `outcome` encodes the outcome of the pregnancy. Outcome 1 is a live birth.

```
In [26]:
```preg.outcome.value_counts().sort_index()

```
Out[26]:
```

`pregordr` is 1 for first pregnancies, 2 for second pregnancies, and so on.

```
In [27]:
```preg.pregordr.value_counts().sort_index()

```
Out[27]:
```

I selected live births, then split into first babies and others.

```
In [28]:
```live = preg[preg.outcome == 1]
firsts = live[live.birthord == 1]
others = live[live.birthord != 1]
len(firsts), len(others)

```
Out[28]:
```

The mean pregnancy lengths are slightly different:

```
In [29]:
```firsts.prglngth.mean(), others.prglngth.mean()

```
Out[29]:
```

The difference is 0.078 weeks:

```
In [30]:
```diff = firsts.prglngth.mean() - others.prglngth.mean()
diff

```
Out[30]:
```

That's about 13 hours. Note: the best units to report are often not the units you computed.

```
In [31]:
```diff * 7 * 24

```
Out[31]:
```

Let's see if we can visualize the difference in the histograms:

```
In [32]:
```first_hist = Hist(firsts.prglngth)
other_hist = Hist(others.prglngth)

I used some plotting options to put two bar charts side-by-side:

```
In [33]:
```def plot_distributions(dist1, dist2):
    dist1.plot(width=-0.45, align='edge', color=COLORS[3], label='firsts')
    dist2.plot(width=0.45, align='edge', color=COLORS[4], label='others')
    plt.xlim(33.5, 43.5)
    plt.legend()

Here are the two histograms:

```
In [34]:
```plot_distributions(first_hist, other_hist)

```
```

Remember that the vertical axis is counts. In this case, we are comparing counts with different totals, which might be misleading.

An alternative is to compute a probability mass function (PMF), which divides the counts by the totals, yielding a map from each element to its probability.

The probabilities are "normalized" to add up to 1.
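Normalizing is just dividing by the total. A minimal standalone sketch with a plain dict (the counts are made-up numbers for illustration):

```python
counts = {39: 480, 40: 110, 41: 60}               # illustrative counts
total = sum(counts.values())
pmf = {x: n / total for x, n in counts.items()}   # map value -> probability
print(sum(pmf.values()))                          # probabilities add up to 1
```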

```
In [35]:
```import numpy as np

class Pmf(Hist):
    def normalize(self):
        """Divides each count by the total so the values add up to 1."""
        total = sum(self.values())
        for element in self:
            self[element] /= total
        return self

    def plot_cumulative(self, **options):
        """Plots the running sum of the probabilities (the CDF)."""
        xs, ps = zip(*sorted(self.items()))
        cs = np.cumsum(ps, dtype=float)
        cs /= cs[-1]
        plt.plot(xs, cs, **options)

Now we can compare PMFs fairly.

```
In [36]:
```first_pmf = Pmf(firsts.prglngth).normalize()
other_pmf = Pmf(others.prglngth).normalize()

```
In [37]:
```plot_distributions(first_pmf, other_pmf)

```
```

```
In [38]:
```first_pmf.plot_cumulative(linewidth=4, color=COLORS[3], label='firsts')
other_pmf.plot_cumulative(linewidth=4, color=COLORS[4], label='others')
plt.xlim(23.5, 44.5)
plt.legend(loc='upper left')

```
Out[38]:
```

The CDFs are similar up to week 38. After that, first babies are more likely to be born late.
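`plot_cumulative` builds the CDF as the running sum of the PMF with `np.cumsum`; here's the idea in isolation (made-up probabilities):

```python
import numpy as np

ps = np.array([0.1, 0.2, 0.4, 0.3])  # a small PMF (illustrative)
cs = np.cumsum(ps)                   # running totals: the CDF
print(cs[-1])                        # the last value is 1 (all the probability)
```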

```
In [39]:
```def iterate_words(filename):
    """Reads lines from a file and splits them into words."""
    for line in open(filename):
        for word in line.split():
            yield word.strip()

Here's an example using a book from Project Gutenberg. `wc` is a histogram of word counts:

```
In [40]:
```# FAIRY TALES
# By The Brothers Grimm
# http://www.gutenberg.org/cache/epub/2591/pg2591.txt
wc = Hist(iterate_words('pg2591.txt'))

Here are the 20 most common words:

```
In [41]:
```wc.most_common(20)

```
Out[41]:
```

Word frequencies in natural languages follow a predictable pattern called Zipf's law (which is an instance of Stigler's law, which is also an instance of Stigler's law).

We can see the pattern by lining up the words in descending order of frequency and plotting their ranks (1st, 2nd, 3rd, ...) versus counts (6507, 5250, 2707):

```
In [42]:
```ranks, counts = wc.ranks()
plt.plot(ranks, counts, linewidth=10, color=COLORS[5])
plt.xlabel('rank')
plt.ylabel('count')

```
Out[42]:
```

Huh. Maybe that's not so clear after all. The problem is that the counts drop off very quickly. If we use the highest count to scale the figure, most of the other counts are indistinguishable from zero.

Also, there are more than 10,000 words, but most of them appear only a few times, so we are wasting most of the space in the figure in a regime where nothing is happening.

This kind of thing happens a lot. A common way to deal with it is to compute the log of the quantities or to plot them on a log scale:

```
In [43]:
```ranks, counts = wc.ranks()
plt.plot(ranks, counts, linewidth=4, color=COLORS[5])
plt.xlabel('rank')
plt.ylabel('count')
plt.xscale('log')
plt.yscale('log')

```
```
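On a log-log plot, Zipf's law predicts a roughly straight line with slope near -1 (count proportional to 1/rank). A standalone sketch that fits the slope on synthetic Zipf-distributed counts (the word-count data itself isn't bundled here):

```python
import numpy as np

# Synthetic Zipfian counts: count proportional to 1/rank
ranks = np.arange(1, 1001)
counts = 100000 / ranks

# Fit a line to log(count) vs log(rank)
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print(round(slope, 3))  # -1.0
```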

```
In [44]:
```from itertools import tee

def pairwise(iterator):
    """Iterates through a sequence in overlapping pairs.

    If the sequence is 1, 2, 3, 4, the result is (1, 2), (2, 3), (3, 4).
    """
    a, b = tee(iterator)
    next(b, None)
    return zip(a, b)

`bigrams` is the histogram of word pairs:

```
In [45]:
```bigrams = Hist(pairwise(iterate_words('pg2591.txt')))

And here are the 20 most common:

```
In [46]:
```bigrams.most_common(20)

```
Out[46]:
```

Similarly, we can iterate the trigrams:

```
In [47]:
```def triplewise(iterator):
    """Iterates through a sequence in overlapping triples."""
    a, b, c = tee(iterator, 3)
    next(b, None)
    next(c, None)
    next(c, None)
    return zip(a, b, c)

And make a histogram:

```
In [48]:
```trigrams = Hist(triplewise(iterate_words('pg2591.txt')))
# Uncomment this line to run the analysis with Elvis Presley lyrics
#trigrams = Hist(triplewise(iterate_words('lyrics-elvis-presley.txt')))

Here are the 20 most common:

```
In [49]:
```trigrams.most_common(20)

```
Out[49]:
```

To generate random text, I'll map each pair of words to a histogram of the words that can follow it:

```
In [50]:
```from collections import defaultdict

d = defaultdict(Hist)
for a, b, c in trigrams:
    d[a, b][c] += trigrams[a, b, c]

Now we can look up a pair and see what might come next:

```
In [51]:
```d['the', 'blood']

```
Out[51]:
```

Here are the most common words that follow "into the":

```
In [52]:
```d['into', 'the'].most_common(10)

```
Out[52]:
```

Here are the words that follow "said the":

```
In [53]:
```d['said', 'the'].most_common(10)

```
Out[53]:
```

`Hist` provides `choice`, which chooses a random word with probability proportional to count:

```
In [54]:
```d['said', 'the'].choice()

```
Out[54]:
```

Given a prefix, we can choose a random suffix:

```
In [55]:
```prefix = 'said', 'the'
suffix = d[prefix].choice()
suffix

```
Out[55]:
```

Then we can shift the words and compute the next prefix:

```
In [56]:
```prefix = prefix[1], suffix
prefix

```
Out[56]:
```

```
In [57]:
```for i in range(100):
    suffix = d[prefix].choice()
    print(suffix, end=' ')
    prefix = prefix[1], suffix

```
```

With a prefix of two words, we typically get text that flirts with sensibility.
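The whole loop can be packaged as a small generator. Here's a self-contained sketch using plain `Counter`s (the corpus and the function name `random_text` are illustrative, not from the notebook):

```python
import random
from collections import Counter, defaultdict

def random_text(words, n=20, seed=1):
    """Generates n words using a Markov chain over two-word prefixes."""
    random.seed(seed)
    # Map each two-word prefix to a Counter of the words that follow it.
    suffixes = defaultdict(Counter)
    for a, b, c in zip(words, words[1:], words[2:]):
        suffixes[a, b][c] += 1

    prefix = random.choice(list(suffixes))
    out = list(prefix)
    for _ in range(n - 2):
        options = suffixes.get(prefix)
        if not options:                          # dead end: start over
            prefix = random.choice(list(suffixes))
            options = suffixes[prefix]
        word = random.choice(list(options.elements()))
        out.append(word)
        prefix = prefix[1], word
    return ' '.join(out)

corpus = 'the cat sat on the mat and the cat ran off the mat'.split()
print(random_text(corpus, n=10))
```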
