Examples and Exercises from Think Stats, 2nd Edition

http://thinkstats2.com

Copyright 2016 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT


In [1]:
from __future__ import print_function, division

%matplotlib inline

import numpy as np

import nsfg
import first

Given a list of values, there are several ways to count the frequency of each value.


In [2]:
t = [1, 2, 2, 3, 5]

You can use a Python dictionary:


In [3]:
hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1
    
hist


Out[3]:
{1: 1, 2: 2, 3: 1, 5: 1}

You can use a Counter (which is a dictionary with additional methods):


In [4]:
from collections import Counter
counter = Counter(t)
counter


Out[4]:
Counter({1: 1, 2: 2, 3: 1, 5: 1})

Or you can use the Hist object provided by thinkstats2:


In [5]:
import thinkstats2
hist = thinkstats2.Hist([1, 2, 2, 3, 5])
hist


Out[5]:
Hist({1: 1, 2: 2, 3: 1, 5: 1})

Hist provides Freq, which looks up the frequency of a value.


In [6]:
hist.Freq(2)


Out[6]:
2

You can also use the bracket operator, which does the same thing.


In [7]:
hist[2]


Out[7]:
2

If the value does not appear, it has frequency 0.


In [8]:
hist[4]


Out[8]:
0

The Values method returns the values:


In [9]:
hist.Values()


Out[9]:
[1, 2, 3, 5]

So you can iterate the values and their frequencies like this:


In [10]:
for val in sorted(hist.Values()):
    print(val, hist[val])


1 1
2 2
3 1
5 1

Or you can use the Items method:


In [11]:
for val, freq in hist.Items():
     print(val, freq)


1 1
2 2
3 1
5 1

thinkplot is a wrapper for matplotlib that provides functions that work with the objects in thinkstats2.

For example Hist plots the values and their frequencies as a bar graph.

Config takes parameters that label the x and y axes, among other things.


In [12]:
import thinkplot
thinkplot.Hist(hist)
thinkplot.Config(xlabel='value', ylabel='frequency')


As an example, I'll replicate some of the figures from the book.

First, I'll load the data from the pregnancy file and select the records for live births.


In [13]:
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]

Here's the histogram of birth weights in pounds. Notice that Hist works with anything iterable, including a Pandas Series. The label attribute appears in the legend when you plot the Hist.


In [14]:
hist = thinkstats2.Hist(live.birthwgt_lb, label='birthwgt_lb')
thinkplot.Hist(hist)
thinkplot.Config(xlabel='Birth weight (pounds)', ylabel='Count')


Before plotting the ages, I'll apply floor to round down:


In [15]:
ages = np.floor(live.agepreg)

In [16]:
hist = thinkstats2.Hist(ages, label='agepreg')
thinkplot.Hist(hist)
thinkplot.Config(xlabel='years', ylabel='Count')


As an exercise, plot the histogram of pregnancy lengths (column prglngth).


In [17]:
# Solution goes here
length = live.prglngth

In [18]:
hist = thinkstats2.Hist(length, label = 'preglngth')
thinkplot.Hist(hist)
thinkplot.Config(xlabel='Weeks', ylabel='Count')


Hist provides smallest, which select the lowest values and their frequencies.


In [19]:
for weeks, freq in hist.Smallest(10):
    print(weeks, freq)


0 1
4 1
9 1
13 1
17 2
18 1
19 1
20 1
21 2
22 7

Use Largest to display the longest pregnancy lengths.


In [20]:
# Solution goes here
for weeks, freq in hist.Largest(10):
    print(weeks, freq)


50 2
48 7
47 1
46 1
45 10
44 46
43 148
42 328
41 587
40 1116

From live births, we can selection first babies and others using birthord, then compute histograms of pregnancy length for the two groups.


In [21]:
firsts = live[live.birthord == 1]
others = live[live.birthord != 1]

first_hist = thinkstats2.Hist(firsts.prglngth, label='first')
other_hist = thinkstats2.Hist(others.prglngth, label='other')

We can use width and align to plot two histograms side-by-side.


In [22]:
width = 0.45
thinkplot.PrePlot(2)
thinkplot.Hist(first_hist, align='right', width=width)
thinkplot.Hist(other_hist, align='left', width=width)
thinkplot.Config(xlabel='weeks', ylabel='Count', xlim=[27, 46])


Series provides methods to compute summary statistics:


In [23]:
mean = live.prglngth.mean()
var = live.prglngth.var()
std = live.prglngth.std()

Here are the mean and standard deviation:


In [24]:
mean, std


Out[24]:
(38.56055968517709, 2.702343810070593)

As an exercise, confirm that std is the square root of var:


In [25]:
# Solution goes here
std == np.sqrt(var)


Out[25]:
True

Here's are the mean pregnancy lengths for first babies and others:


In [26]:
firsts.prglngth.mean(), others.prglngth.mean()


Out[26]:
(38.60095173351461, 38.52291446673706)

And here's the difference (in weeks):


In [27]:
firsts.prglngth.mean() - others.prglngth.mean()


Out[27]:
0.07803726677754952

This functon computes the Cohen effect size, which is the difference in means expressed in number of standard deviations:


In [28]:
def CohenEffectSize(group1, group2):
    """Computes Cohen's effect size for two groups.
    
    group1: Series or DataFrame
    group2: Series or DataFrame
    
    returns: float if the arguments are Series;
             Series if the arguments are DataFrames
    """
    diff = group1.mean() - group2.mean()

    var1 = group1.var()
    var2 = group2.var()
    n1, n2 = len(group1), len(group2)

    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    d = diff / np.sqrt(pooled_var)
    return d

Compute the Cohen effect size for the difference in pregnancy length for first babies and others.


In [29]:
# Solution goes here

CohenEffectSize(firsts.prglngth, others.prglngth)


Out[29]:
0.028879044654449883

Exercises

Using the variable totalwgt_lb, investigate whether first babies are lighter or heavier than others.

Compute Cohen’s effect size to quantify the difference between the groups. How does it compare to the difference in pregnancy length?


In [30]:
# Solution goes here

firsts_hist=thinkstats2.Hist(np.round(firsts.totalwgt_lb), label='firsts')
others_hist=thinkstats2.Hist(np.round(others.totalwgt_lb), label='others')

print('mean')
print('firsts:', firsts.totalwgt_lb.mean())
print('others:', others.totalwgt_lb.mean())
print('')
print('stdev')
print('firsts:', firsts.totalwgt_lb.mean())
print('others:', others.totalwgt_lb.mean())
print('')
print('median')
print('firsts:', firsts.totalwgt_lb.median())
print('others:', others.totalwgt_lb.median())

width=0.45
thinkplot.PrePlot(2)
thinkplot.Hist(firsts_hist, align='left', width=width)
thinkplot.Hist(others_hist, align='right', width=width)
thinkplot.Config(xlabel='pounds', ylabel='Count', xlim=(0, 15))


mean
firsts: 7.20109443044
others: 7.32585561497

stdev
firsts: 7.20109443044
others: 7.32585561497

median
firsts: 7.3125
others: 7.375

In [58]:
# Solution goes here
CohenEffectSize(firsts.totalwgt_lb, others.totalwgt_lb)


Out[58]:
-0.088672927072602006

For the next few exercises, we'll load the respondent file:


In [31]:
resp = nsfg.ReadFemResp()

Make a histogram of totincr the total income for the respondent's family. To interpret the codes see the codebook.


In [34]:
# Solution goes here
inc_hist = thinkstats2.Hist(resp.totincr)
thinkplot.Hist(inc_hist)


Make a histogram of age_r, the respondent's age at the time of interview.


In [35]:
# Solution goes here
age_hist = thinkstats2.Hist(resp.age_r)
thinkplot.Hist(age_hist)


Make a histogram of numfmhh, the number of people in the respondent's household.


In [36]:
# Solution goes here
fmhh_hist = thinkstats2.Hist(resp.numfmhh)
thinkplot.Hist(fmhh_hist)


Make a histogram of parity, the number of children borne by the respondent. How would you describe this distribution?


In [37]:
# Solution goes here
parity_hist = thinkstats2.Hist(resp.parity)
thinkplot.Hist(parity_hist)


This data is left-skewed with a long tail. Women are about as likely to have 1 as 2 children, though parity drops off significantly after 2 children.

Use Hist.Largest to find the largest values of parity.


In [39]:
# Solution goes here
parity_hist.Largest(10)


Out[39]:
[(22, 1),
 (16, 1),
 (10, 3),
 (9, 2),
 (8, 8),
 (7, 15),
 (6, 29),
 (5, 95),
 (4, 309),
 (3, 828)]

Let's investigate whether people with higher income have higher parity. Keep in mind that in this study, we are observing different people at different times during their lives, so this data is not the best choice for answering this question. But for now let's take it at face value.

Use totincr to select the respondents with the highest income (level 14). Plot the histogram of parity for just the high income respondents.


In [44]:
# Solution goes here
hinc = resp[resp['totincr'] == 14]
other = resp[resp['totincr'] < 14]
hinc_hist = thinkstats2.Hist(hinc.parity)
thinkplot.Hist(hinc_hist)


Find the largest parities for high income respondents.


In [45]:
# Solution goes here
hinc_hist.Largest(5)


Out[45]:
[(8, 1), (7, 1), (5, 5), (4, 19), (3, 123)]

Compare the mean parity for high income respondents and others.


In [49]:
# Solution goes here
print('mean parity, high income:', hinc.parity.mean())
print('mean parity, other:', other.parity.mean())


mean parity, high income: 1.07586206897
mean parity, other: 1.24957581367

Compute the Cohen effect size for this difference. How does it compare with the difference in pregnancy length for first babies and others?


In [50]:
# Solution goes here
CohenEffectSize(hinc.parity, other.parity)


Out[50]:
-0.12511855314660611

The Cohen Effect Size for the difference in parity between mothers with high income and mothers with low income is much larger than the Cohen Effect Size for the difference in pregnancy length for first babies and others. It is also negative, suggesting that, if anything, mothers with high incomes tend to have fewer children.