Examples and Exercises from Think Stats, 2nd Edition

http://thinkstats2.com

Copyright 2016 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT


In [1]:
from __future__ import print_function, division

%matplotlib inline

import numpy as np

import nsfg
import first
import thinkstats2
import thinkplot

Again, I'll load the NSFG pregnancy file and select live births:


In [2]:
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]

Here's the histogram of birth weights:


In [3]:
hist = thinkstats2.Hist(live.birthwgt_lb, label='birthwgt_lb')
thinkplot.Hist(hist)
thinkplot.Config(xlabel='Birth weight (pounds)', ylabel='Count')


To normalize the distribution, we could divide through by the total count:


In [4]:
n = hist.Total()
pmf = hist.Copy()
for x, freq in hist.Items():
    pmf[x] = freq / n

The result is a Probability Mass Function (PMF), which maps each value to its probability.


In [5]:
thinkplot.Hist(pmf)
thinkplot.Config(xlabel='Birth weight (pounds)', ylabel='PMF')
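
Since we divided through by the total count, the probabilities should add up to 1. We can check with Total, which Hist provides as well as Pmf:


In [ ]:
# Sanity check: the probabilities should sum to 1 (up to floating-point error)
print(pmf.Total())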


More directly, we can create a Pmf object.


In [6]:
pmf = thinkstats2.Pmf([1, 2, 2, 3, 5])
pmf


Out[6]:
Pmf({1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2})

Pmf provides Prob, which looks up a value and returns its probability:


In [7]:
pmf.Prob(2)


Out[7]:
0.4

The bracket operator does the same thing.


In [8]:
pmf[2]


Out[8]:
0.4

The Incr method adds the given amount to the probability associated with a value.


In [9]:
pmf.Incr(2, 0.2)
pmf[2]


Out[9]:
0.6000000000000001

The Mult method multiplies the probability associated with a value by the given factor.


In [10]:
pmf.Mult(2, 0.5)
pmf[2]


Out[10]:
0.30000000000000004

Total returns the total probability (which is no longer 1, because we modified some of the probabilities).


In [11]:
pmf.Total()


Out[11]:
0.8999999999999999

Normalize divides through by the total probability, making it 1 again.


In [12]:
pmf.Normalize()
pmf.Total()


Out[12]:
1.0

Here's the PMF of pregnancy length for live births.


In [13]:
pmf = thinkstats2.Pmf(live.prglngth, label='prglngth')

Here's what it looks like plotted with Hist, which makes a bar graph.


In [14]:
thinkplot.Hist(pmf)
thinkplot.Config(xlabel='Pregnancy length (weeks)', ylabel='Pmf')


Here's what it looks like plotted with Pmf, which makes a step function.


In [15]:
thinkplot.Pmf(pmf)
thinkplot.Config(xlabel='Pregnancy length (weeks)', ylabel='Pmf')


We can use MakeFrames to get DataFrames for all live births, first babies, and others.


In [16]:
live, firsts, others = first.MakeFrames()

Here are the distributions of pregnancy length.


In [17]:
first_pmf = thinkstats2.Pmf(firsts.prglngth, label='firsts')
other_pmf = thinkstats2.Pmf(others.prglngth, label='others')

And here's the code that replicates one of the figures in the chapter.


In [18]:
width = 0.45
axis = [27, 46, 0, 0.6]
thinkplot.PrePlot(2, cols=2)
thinkplot.Hist(first_pmf, align='right', width=width)
thinkplot.Hist(other_pmf, align='left', width=width)
thinkplot.Config(xlabel='Pregnancy length (weeks)', ylabel='PMF', axis=axis)

thinkplot.PrePlot(2)
thinkplot.SubPlot(2)
thinkplot.Pmfs([first_pmf, other_pmf])
thinkplot.Config(xlabel='Pregnancy length (weeks)', axis=axis)


Here's the code that generates a plot of the difference in probability (in percentage points) between first babies and others, for each week of pregnancy (showing only pregnancies considered "full term").


In [19]:
weeks = range(35, 46)
diffs = []
for week in weeks:
    p1 = first_pmf.Prob(week)
    p2 = other_pmf.Prob(week)
    diff = 100 * (p1 - p2)
    diffs.append(diff)

thinkplot.Bar(weeks, diffs)
thinkplot.Config(xlabel='Pregnancy length (weeks)', ylabel='Difference (percentage points)')


Biasing and unbiasing PMFs

Here's the example in the book showing operations we can perform with Pmf objects.

Suppose we have the following distribution of class sizes.


In [20]:
d = {7: 8, 12: 8, 17: 14, 22: 4,
     27: 6, 32: 12, 37: 8, 42: 3, 47: 2}

pmf = thinkstats2.Pmf(d, label='actual')

This function computes the biased PMF we would get if we surveyed students and asked about the size of the classes they are in. A class with x students is x times more likely to turn up in such a survey, so we multiply each probability by x and renormalize.


In [21]:
def BiasPmf(pmf, label):
    new_pmf = pmf.Copy(label=label)

    for x, p in pmf.Items():
        new_pmf.Mult(x, x)
        
    new_pmf.Normalize()
    return new_pmf

The following graph shows the difference between the actual and observed distributions.


In [22]:
biased_pmf = BiasPmf(pmf, label='observed')
thinkplot.PrePlot(2)
thinkplot.Pmfs([pmf, biased_pmf])
thinkplot.Config(xlabel='Class size', ylabel='PMF')


The observed mean is substantially higher than the actual mean.


In [23]:
print('Actual mean', pmf.Mean())
print('Observed mean', biased_pmf.Mean())


Actual mean 23.6923076923
Observed mean 29.1233766234

If we were only able to collect the biased sample, we could "unbias" it by applying the inverse operation: dividing each probability by x and renormalizing.


In [24]:
def UnbiasPmf(pmf, label=None):
    new_pmf = pmf.Copy(label=label)

    for x, p in pmf.Items():
        new_pmf[x] *= 1/x
        
    new_pmf.Normalize()
    return new_pmf

We can unbias the biased PMF:


In [25]:
unbiased = UnbiasPmf(biased_pmf, label='unbiased')
print('Unbiased mean', unbiased.Mean())


Unbiased mean 23.6923076923

And plot the two distributions to confirm they are the same.


In [26]:
thinkplot.PrePlot(2)
thinkplot.Pmfs([pmf, unbiased])
thinkplot.Config(xlabel='Class size', ylabel='PMF')


Pandas indexing

Here's an example of a small DataFrame.


In [27]:
import numpy as np
import pandas
array = np.random.randn(4, 2)
df = pandas.DataFrame(array)
df


Out[27]:
          0         1
0  0.044426 -0.390323
1 -1.310203 -0.830353
2 -0.433393 -0.584162
3 -0.009223 -0.729959

We can specify column names when we create the DataFrame:


In [28]:
columns = ['A', 'B']
df = pandas.DataFrame(array, columns=columns)
df


Out[28]:
          A         B
0  0.044426 -0.390323
1 -1.310203 -0.830353
2 -0.433393 -0.584162
3 -0.009223 -0.729959

We can also specify an index that contains labels for the rows.


In [29]:
index = ['a', 'b', 'c', 'd']
df = pandas.DataFrame(array, columns=columns, index=index)
df


Out[29]:
          A         B
a  0.044426 -0.390323
b -1.310203 -0.830353
c -0.433393 -0.584162
d -0.009223 -0.729959

Normal indexing selects columns.


In [30]:
df['A']


Out[30]:
a    0.044426
b   -1.310203
c   -0.433393
d   -0.009223
Name: A, dtype: float64

We can use the loc attribute to select rows.


In [31]:
df.loc['a']


Out[31]:
A    0.044426
B   -0.390323
Name: a, dtype: float64

If you don't want to use the row labels and prefer to access the rows using integer indices, you can use the iloc attribute:


In [32]:
df.iloc[0]


Out[32]:
A    0.044426
B   -0.390323
Name: a, dtype: float64

loc can also take a list of labels.


In [33]:
indices = ['a', 'c']
df.loc[indices]


Out[33]:
          A         B
a  0.044426 -0.390323
c -0.433393 -0.584162

If you provide a slice of labels, DataFrame uses it to select rows.


In [34]:
df['a':'c']


Out[34]:
          A         B
a  0.044426 -0.390323
b -1.310203 -0.830353
c -0.433393 -0.584162

If you provide a slice of integers, DataFrame selects rows by integer index.


In [35]:
df[0:2]


Out[35]:
          A         B
a  0.044426 -0.390323
b -1.310203 -0.830353

But notice that label slicing includes the last element of the slice, while integer slicing does not.

In general, I recommend giving labels to the rows and names to the columns, and using them consistently.
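
For example, loc slices rows by label and includes both endpoints, while iloc slices by position and excludes the right endpoint, which makes the intent explicit:


In [ ]:
# Label-based slicing includes the endpoint 'c'
print(df.loc['a':'c'])

# Position-based slicing excludes row 2
print(df.iloc[0:2])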

Exercises

Exercise: Something like the class size paradox appears if you survey children and ask how many children are in their family. Families with many children are more likely to appear in your sample, and families with no children have no chance to be in the sample.

Use the NSFG respondent variable numkdhh to construct the actual distribution for the number of children under 18 in the respondents' households.

Now compute the biased distribution we would see if we surveyed the children and asked them how many children under 18 (including themselves) are in their household.

Plot the actual and biased distributions, and compute their means.


In [36]:
# Create original PMF

resp = nsfg.ReadFemResp()
numkdhh_pmf = thinkstats2.Pmf(resp['numkdhh'], label='actual')
numkdhh_pmf


Out[36]:
Pmf({0: 0.46617820227659301, 1: 0.21405207379301322, 2: 0.19625801386889966, 3: 0.087138558157791451, 4: 0.025644380478869556, 5: 0.010728771424833181})

In [37]:
# Create copy and confirm values

pmf = numkdhh_pmf.Copy()
print(pmf)
print(pmf.Total())
print('mean', pmf.Mean())


Pmf({0: 0.46617820227659301, 1: 0.21405207379301322, 2: 0.19625801386889966, 3: 0.087138558157791451, 4: 0.025644380478869556, 5: 0.010728771424833181})
1.0
mean 1.02420515504

In [38]:
# Weight the PMF by the number of children who would report each value

def BiasPmf(pmf, label):
    child_pmf = pmf.Copy(label=label)
    for x, p in pmf.Items():
        child_pmf.Mult(x, x)
    child_pmf.Normalize()
    return child_pmf

child_pmf = BiasPmf(pmf, 'childs_view')
print(child_pmf)
print(child_pmf.Total())
print('mean', child_pmf.Mean())


Pmf({0: 0.0, 1: 0.20899335717935616, 2: 0.38323965252938175, 3: 0.25523760858456823, 4: 0.10015329586101177, 5: 0.052376085845682166})
1.0
mean 2.40367910066

In [39]:
# Plot

thinkplot.PrePlot(2)
thinkplot.Pmfs([pmf, child_pmf])
thinkplot.Show(xlabel='Number of children', ylabel='PMF')



In [40]:
# True mean

print('True mean', pmf.Mean())


True mean 1.02420515504

In [41]:
# Mean based on the children's responses

print('Child view mean', child_pmf.Mean())


Child view mean 2.40367910066

Exercise: Write functions called PmfMean and PmfVar that take a Pmf object and compute the mean and variance. To test these methods, check that they are consistent with the methods Mean and Var provided by Pmf.
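
As a reminder, for a PMF the mean and variance are

$$\bar{x} = \sum_i p_i x_i$$

$$S^2 = \sum_i p_i (x_i - \bar{x})^2$$

where the $x_i$ are the values and $p_i$ is the probability of $x_i$.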


In [42]:
def PmfMean(pmf):
    """Compute the mean of a Pmf: the sum of x * p over all values."""
    mean = 0
    for x, p in pmf.Items():
        mean += x * p
    return mean

PmfMean(child_pmf)


Out[42]:
2.4036791006642821

In [45]:
def PmfVar(pmf):
    """Compute the variance of a Pmf: the sum of p * (x - mean)**2."""
    pmf_mean = PmfMean(pmf)
    variance = 0
    for x, p in pmf.Items():
        variance += p * (x - pmf_mean) ** 2
    return variance

PmfVar(child_pmf)


Out[45]:
1.1732721055059874

In [51]:
# Exact equality holds here because both implementations perform the same
# floating-point operations in the same order.
print('Check Mean =', PmfMean(child_pmf) == thinkstats2.Pmf.Mean(child_pmf))
print('Check Variance =', PmfVar(child_pmf) == thinkstats2.Pmf.Var(child_pmf))


Check Mean = True
Check Variance = True

Exercise: I started this book with the question, "Are first babies more likely to be late?" To address it, I computed the difference in means between groups of babies, but I ignored the possibility that there might be a difference between first babies and others for the same woman.

To address this version of the question, select respondents who have at least two live births and compute pairwise differences. Does this formulation of the question yield a different result?

Hint: use nsfg.MakePregMap, which maps each respondent's caseid to a list of indices into the pregnancy DataFrame:


In [52]:
live, firsts, others = first.MakeFrames()

In [64]:
# Peek at the pregnancy lengths of the first two records
preg.iloc[0:2].prglngth


Out[64]:
0    39
1    39
Name: prglngth, dtype: int64

In [84]:
# Map each respondent's caseid to the indices of her live births
preg_map = nsfg.MakePregMap(live)

In [78]:
hist = thinkstats2.Hist()

# For each woman with at least two live births, compute the difference
# in pregnancy length between her second and first pregnancies.
for caseid, births in preg_map.items():
    if len(births) >= 2:
        pair = preg.loc[births[0:2]].prglngth
        diff = pair.iloc[1] - pair.iloc[0]
        hist[diff] += 1

In [79]:
thinkplot.Hist(hist)
thinkplot.Config(xlabel='Difference in pregnancy length (weeks)', ylabel='Count')



In [81]:
pmf = thinkstats2.Pmf(hist)
PmfMean(pmf)


Out[81]:
-0.056367432150313368

The mean pairwise difference is close to zero, so this formulation of the question does not yield a substantially different result.

Exercise: In most foot races, everyone starts at the same time. If you are a fast runner, you usually pass a lot of people at the beginning of the race, but after a few miles everyone around you is going at the same speed. When I ran a long-distance (209 miles) relay race for the first time, I noticed an odd phenomenon: when I overtook another runner, I was usually much faster, and when another runner overtook me, he was usually much faster.

At first I thought that the distribution of speeds might be bimodal; that is, there were many slow runners and many fast runners, but few at my speed.

Then I realized that I was the victim of a bias similar to the effect of class size. The race was unusual in two ways: it used a staggered start, so teams started at different times; also, many teams included runners at different levels of ability.

As a result, runners were spread out along the course with little relationship between speed and location. When I joined the race, the runners near me were (pretty much) a random sample of the runners in the race.

So where does the bias come from? During my time on the course, the chance of overtaking a runner, or being overtaken, is proportional to the difference in our speeds. I am more likely to catch a slow runner, and more likely to be caught by a fast runner. But runners at the same speed are unlikely to see each other.
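
In other words, if the observer runs at speed $v_{obs}$, the observed distribution is the actual distribution weighted by relative speed and then renormalized:

$$PMF_{observed}(v) \propto |v - v_{obs}| \cdot PMF_{actual}(v)$$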

Write a function called ObservedPmf that takes a Pmf representing the actual distribution of runners’ speeds, and the speed of a running observer, and returns a new Pmf representing the distribution of runners’ speeds as seen by the observer.

To test your function, you can use relay.py, which reads the results from the James Joyce Ramble 10K in Dedham MA and converts the pace of each runner to mph.

Compute the distribution of speeds you would observe if you ran a relay race at 7 mph with this group of runners.


In [82]:
import relay

results = relay.ReadResults()
speeds = relay.GetSpeeds(results)
speeds = relay.BinData(speeds, 3, 12, 100)

In [83]:
pmf = thinkstats2.Pmf(speeds, 'actual speeds')
thinkplot.Pmf(pmf)
thinkplot.Config(xlabel='Speed (mph)', ylabel='PMF')



In [86]:
def ObservedPmf(pmf, speed, label):
    """Return the distribution of speeds as seen by an observer at `speed`.

    The chance of overtaking or being overtaken by another runner is
    proportional to the absolute difference in speeds.
    """
    observed_pmf = pmf.Copy(label=label)
    for value in observed_pmf.Values():
        diff = abs(speed - value)
        observed_pmf[value] *= diff
    observed_pmf.Normalize()
    return observed_pmf

In [97]:
observed = ObservedPmf(pmf, 7, 'observed speeds')
thinkplot.Hist(observed)
thinkplot.Config(xlabel='Speed (mph)', ylabel='PMF')



In [ ]: