Chapter 1: Statistical thinking for programmers

  • probability: study of random events
  • statistics: using data samples to support claims about populations
  • computation: tool for quantitative analysis

Do first babies arrive late?

Anecdotal evidence problems:

  • small n of observations
  • selection bias: people join the discussion precisely because their babies were born late
  • confirmation bias: more likely to cite evidence that supports what you already believe
  • inaccuracy

Steps:

  • data collection
  • descriptive stats
  • exploratory data analysis: look for patterns, differences, features. Check for inconsistencies, identify limitations.
  • hypothesis testing: is effect real?
  • estimation: use data from a sample to estimate characteristics of the general population.

NSFG

Cross-sectional study. Target population: people in the US aged 15-44. The respondents form a cohort; the sample is meant to be representative, with some groups deliberately oversampled.


In [4]:
# Had to convert a few print statements to get it to work on Python 3
# http://www.thinkstats.com/survey.py
%run "vendor/survey.py"


Number of respondents 7643
Number of pregnancies 13593

Let's see if I can do this with Pandas as well.

I first Googled how to parse binary files, but it turns out it's not really a binary file -- it's a fixed-width file. I did peek at the survey code a bit. To me, it's more intuitive to write functions that create dicts than to use classes, and then import the result into Pandas. Probably not the most efficient approach, but it seems to work.

I'm still a bit confused about how to organize my variables, functions, imports etc. in IPython. In a script, imports should go at the top, but in a notebook you want to be able to see the progression of your thinking without rewriting the whole thing to reflect the final vision.


In [223]:
fields = [
    ('caseid', 1, 12, int),
    ('nbrnaliv', 22, 22, int),
    ('babysex', 56, 56, int),
    ('birthwgt_lb', 57, 58, int),
    ('birthwgt_oz', 59, 60, int),
    ('prglength', 275, 276, int),
    ('outcome', 277, 277, int),
    ('birthord', 278, 279, int),
    ('agepreg', 284, 287, int),
    ('finalwgt', 423, 440, float),
]

def parse(fields, line):
    """Parse one fixed-width record into a dict, skipping blank/invalid fields."""
    res = []
    for (field, start, end, cast) in fields:
        try:
            # the field spec uses 1-indexed, inclusive column positions
            chunk = line[start-1:end]
            chunkproc = cast(chunk)
            res.append( (field, chunkproc) )
        except ValueError:
            # blank or non-numeric field: drop it (it becomes NaN in Pandas)
            pass
    return dict(res)
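
Before unleashing this on the full file, a quick sanity check on a synthetic line helps confirm the 1-indexed, inclusive slicing. A small sketch with made-up values (not real NSFG data), not an actual notebook cell:

# caseid occupies columns 1-12 and outcome column 277 in the field spec above
sample_line = "1".rjust(12) + " " * (276 - 12) + "1" + " " * (440 - 277)
parse([('caseid', 1, 12, int), ('outcome', 277, 277, int)], sample_line)
# -> {'caseid': 1, 'outcome': 1}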

In [225]:
from pandas import DataFrame
import pandas as pd
import gzip

pregfile = "/Users/Stian/src/math-with-ipython/think-stats/vendor/2002FemPreg.dat.gz"

preg = []
for line in gzip.open(pregfile, 'rt'):  # 'rt' so lines are str rather than bytes on Python 3
    preg.append(parse(fields, line))

Instinctively, I don't like assigning an empty [] container just to fill it up in a for loop... I know this is less Pythonic, but I kind of enjoy it:


In [226]:
from toolz import partial

preg2 = list(map(partial(parse, fields), gzip.open(pregfile, 'rt')))

preg2 == preg


Out[226]:
True

Silly me, I just realized I should use a list comprehension -- nice and clean and Pythonic, without any stray container variables.


In [227]:
preg3 = [parse(fields, line) for line in gzip.open(pregfile, 'rt')]
preg3 == preg


Out[227]:
True

In [178]:
df = DataFrame(preg)
df.head()


Out[178]:
   agepreg  babysex  birthord  birthwgt_lb  birthwgt_oz  caseid      finalwgt  nbrnaliv  outcome  prglength
0     3316        1         1            8           13       1   6448.271112         1        1         39
1     3925        2         2            7           14       1   6448.271112         1        1         39
2     1433        1         1            9            2       2  12999.542264         3        1         39
3     1783        2         2            7            0       2  12999.542264         1        1         39
4     1833        2         3            6            3       2  12999.542264         1        1         39

5 rows × 10 columns

There's also a bit of recoding happening: dividing the mother's age by 100, and computing the total weight in ounces. (I'll convert it to SI too, since I'm more comfortable with that; see the sketch after the cell below.) This is all very easy with Pandas, and I think it's clearer to do it after importing, rather than with some magic in the import function.

I struggled a bit with the advanced query and assignment. Goal was to simplify this:

if (rec.birthwgt_lb != 'NA' and rec.birthwgt_lb < 20 and
    rec.birthwgt_oz != 'NA' and rec.birthwgt_oz <= 16):
    rec.totalwgt_oz = rec.birthwgt_lb * 16 + rec.birthwgt_oz
else:
    rec.totalwgt_oz = 'NA'

This:

df.query('birthwgt_lb < 20 & birthwgt_oz <= 16')

looks great and parsimonious, but it returns a copy of the slice, so you can't assign through it. The final version works well too, once I figured out that I had to include the target column inside the .loc[...] indexer, rather than chaining df[bool][col] = ...


In [110]:
df.agepreg = df.agepreg/100
df.loc[(df.birthwgt_lb < 20) & (df.birthwgt_oz <= 16), 'totalwgt_oz'] = \
    df.birthwgt_lb*16 + df.birthwgt_oz
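
For the SI conversion mentioned above, a minimal sketch assuming the totalwgt_oz column created in the previous cell (the totalwgt_kg name and the constant are my own; 1 oz ≈ 0.0283495 kg):

# Hypothetical follow-up: total birth weight in kilograms
OZ_TO_KG = 0.0283495
df['totalwgt_kg'] = df.totalwgt_oz * OZ_TO_KG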

Exercise 1.3.2

Supposed to write a loop to count the number of live births; let's use Pandas instead. Initially I counted babysex, but that only covers the live births, since the other outcomes have NaN there. (value_counts() is nice, like table() in R. Groupby also works; an earlier Stack Overflow entry seemed to indicate that it's a lot slower, but in the timings below the difference is modest, with value_counts() still a bit faster.)


In [133]:
%timeit df.outcome.value_counts()
# also possible:
%timeit df.groupby('outcome').size()


1000 loops, best of 3: 720 µs per loop
1000 loops, best of 3: 973 µs per loop
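
For the count itself, a boolean mask is a one-liner too (outcome code 1 is a live birth, as used in the partitioning below); a sketch, not timed here:

# Count live births directly by summing a boolean mask
(df.outcome == 1).sum()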

Exercise 1.3.3

Partition live births into two groups, one for first babies and one for all others. (I might be able to do this more elegantly, using np.where and groupby or something, but it's not worth it for just these two groups, I think -- it would be nice if query worked like a groupby, giving both the result and its negation; see the mask-based sketch after the cells below.)

This also works, but is much less clear:

# grouping on a boolean Series gives group keys True (first babies) and False (others)
birthgroups = df[df.outcome==1].groupby(df.birthord==1).groups
firstbirth = df.loc[birthgroups[True]]
secondplusbirth = df.loc[birthgroups[False]]

In [168]:
df1 = df.query('outcome==1 & birthord==1')
dfnot1 = df.query('outcome==1 & birthord != 1')
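
A hedged alternative that yields both halves from a single mask (same columns as above; df1_alt and dfnot1_alt are illustrative names, not used later):

# Partition live births by a boolean mask and its negation
live = df[df.outcome == 1]
first_mask = live.birthord == 1
df1_alt, dfnot1_alt = live[first_mask], live[~first_mask]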

In [172]:
import numpy as np

print(np.average(df1.prglength), np.average(dfnot1.prglength))

# Let's do it manually as well
def mean_(x):
    return sum(x) / len(x)
print(mean_(df1.prglength), mean_(dfnot1.prglength))

%timeit np.average(df1.prglength)
%timeit np.mean(df1.prglength)
%timeit mean_(df1.prglength)


38.6009517335 38.5229144667
38.6009517335 38.5229144667
10000 loops, best of 3: 62.9 µs per loop
10000 loops, best of 3: 117 µs per loop
10000 loops, best of 3: 135 µs per loop

According to the book, first babies arrive about 13 hours later on average. Let's check.


In [176]:
# pregnancy length is in weeks; convert the difference in means to hours
(mean_(df1.prglength) - mean_(dfnot1.prglength)) * 7 * 24


Out[176]:
13.110260818628319

Conclusion

That was the end of chapter one. Pretty basic stuff so far, but I'm glad I worked my way through it -- it gave me a chance to become more familiar with the IPython interface, and it was fun looking at the author's code and reimplementing it in my own style. The indexing and assigning gave me the most trouble; some of those restrictions seem to stem from the underlying NumPy ndarray returning a copy instead of a view. It was fascinating to see the example code the author provided -- very different from anything I would write. I really enjoy using Pandas, but it's valuable to see different approaches. (And Pandas wasn't around when he began writing the book.)

I will pickle the Pandas DataFrames and continue working with them in the next chapter.


In [295]:
import pickle
with open("/Users/Stian/src/math-with-ipython/think-stats/preg-pandas.pickle", "wb") as f:
    pickle.dump([df,df1,dfnot1], f)
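
Loading them back in the next chapter's notebook is the mirror image (same path as above); a minimal sketch:

import pickle

with open("/Users/Stian/src/math-with-ipython/think-stats/preg-pandas.pickle", "rb") as f:
    df, df1, dfnot1 = pickle.load(f)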