Cross-sectional study. Target population: US citizens, aged 15-44. Respondents form a cohort. Meant to be representative, though some groups were deliberately oversampled.
In [4]:
# Had to convert a few print statements to get it to work on Python 3
# http://www.thinkstats.com/survey.py
%run "vendor/survey.py"
I first Googled parsing binary files, but it turns out it's not really a binary file, but a fixed-width file. I did peek at the survey code a bit. To me, it's more intuitive to write functions that create dicts, rather than classes, and then import the result into Pandas. Probably not the most efficient approach, but it seems to work.
Still a bit confused about how to organize my vars, functions, imports, etc. in IPython. In a script, imports should be at the top, but in IPython you want to be able to see the progression of thinking, without rewriting the whole thing to reflect the final vision.
In [223]:
fields = [
    ('caseid', 1, 12, int),
    ('nbrnaliv', 22, 22, int),
    ('babysex', 56, 56, int),
    ('birthwgt_lb', 57, 58, int),
    ('birthwgt_oz', 59, 60, int),
    ('prglength', 275, 276, int),
    ('outcome', 277, 277, int),
    ('birthord', 278, 279, int),
    ('agepreg', 284, 287, int),
    ('finalwgt', 423, 440, float),
]
def parse(fields, line):
    res = []
    for (field, start, end, cast) in fields:
        try:
            chunk = line[start-1:end]
            chunkproc = cast(chunk)
            res.append((field, chunkproc))
        except ValueError:
            pass
    return dict(res)
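To convince myself that the 1-based, inclusive field positions are handled right, here's the same parse logic exercised on a made-up fixed-width line (`demo_fields` and the sample line are synthetic, not the real NSFG layout):

```python
# Same logic as the cell above, reproduced so this sketch is self-contained.
def parse(fields, line):
    res = []
    for (field, start, end, cast) in fields:
        try:
            chunk = line[start-1:end]          # positions are 1-based, inclusive
            res.append((field, cast(chunk)))
        except ValueError:
            pass                               # blank/'NA' chunks are dropped
    return dict(res)

# Hypothetical field spec: (name, start, end, cast)
demo_fields = [
    ('caseid', 1, 4, int),
    ('weight', 5, 9, float),
    ('flag', 10, 10, int),
]

# 'flag' (position 10) is blank, so it is silently dropped.
print(parse(demo_fields, "0042 12.5 "))
```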
In [225]:
from pandas import DataFrame
import pandas as pd
import gzip
pregfile = "/Users/Stian/src/math-with-ipython/think-stats/vendor/2002FemPreg.dat.gz"
preg = []
for line in gzip.open(pregfile, "rt"):  # text mode, so int()/float() get str, not bytes
    preg.append(parse(fields, line))
Instinctively, I don't like assigning the empty [] container, just to fill it up in a loop... I know this is less Pythonic, but I kind of enjoy it:
In [226]:
from toolz import partial
preg2 = list(map(partial(parse, fields), gzip.open(pregfile, "rt")))
preg2 == preg
Out[226]:
Silly me, I just realized I should use a list comprehension -- nice and clean and Pythonic, without any stray container variables.
In [227]:
preg3 = [parse(fields, line) for line in gzip.open(pregfile, "rt")]
preg3 == preg
Out[227]:
In [178]:
df = DataFrame(preg)
df.head()
Out[178]:
There's also a bit of recoding happening: dividing the mother's age by 100, and converting weight to ounces. (I'll convert it to SI too, since I'm more comfortable with that.) This is all very easy with Pandas -- and I think it's clearer to do this after importing, rather than with some magic in the import function.
I struggled a bit with the advanced query and assignment. The goal was to simplify this:
```
if (rec.birthwgt_lb != 'NA' and rec.birthwgt_lb < 20 and
        rec.birthwgt_oz != 'NA' and rec.birthwgt_oz <= 16):
    rec.totalwgt_oz = rec.birthwgt_lb * 16 + rec.birthwgt_oz
else:
    rec.totalwgt_oz = 'NA'
```
This:
```
df.query('birthwgt_lb < 20 & birthwgt_oz <= 16')
```
looks great and parsimonious, but returns a copied slice, and does not allow assignment. The final version works well too, once I figured out that I had to include the column to be assigned to inside the [], rather than using df[bool][col] = ...
In [110]:
df.agepreg = df.agepreg/100
df.loc[(df.birthwgt_lb < 20) & (df.birthwgt_oz <= 16), 'totalwgt_oz'] = \
    df.birthwgt_lb * 16 + df.birthwgt_oz
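The SI conversion mentioned above can be sketched on a tiny stand-in frame (the column name `totalwgt_g` and the sample rows are my own invention; 1 oz = 28.3495 g):

```python
import pandas as pd

# Tiny stand-in for the NSFG frame; 99 plays the role of a coded missing value.
df = pd.DataFrame({'birthwgt_lb': [7, 8, 99],
                   'birthwgt_oz': [8, 2, 0]})

# Same .loc pattern as above: assign only where the mask holds.
valid = (df.birthwgt_lb < 20) & (df.birthwgt_oz <= 16)
df.loc[valid, 'totalwgt_oz'] = df.birthwgt_lb * 16 + df.birthwgt_oz

# Invalid rows were never assigned, so they stay NaN through the conversion.
df['totalwgt_g'] = df.totalwgt_oz * 28.3495
print(df)
```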
Supposed to write a loop to count the number of live births; let's use Pandas instead. Initially I counted babysex, but that only gave me the live births, since the other categories have NaN. (value_counts() is nice, like table() in R. Groupby also works; an earlier SO entry seemed to indicate that it's a lot slower, but currently it seems to be slightly faster.)
In [133]:
df.outcome.value_counts()
# also possible:
# df.groupby('outcome').size()
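A quick check that the two approaches agree, on a made-up stand-in for df.outcome (value_counts sorts by count, groupby by value, so I sort both before comparing):

```python
import pandas as pd

s = pd.Series([1, 1, 2, 1, 4, 2])   # stand-in for df.outcome

vc = s.value_counts()               # ordered by count, descending
gb = s.groupby(s).size()            # ordered by value

# After sorting on the index, the counts line up.
print(vc.sort_index().tolist(), gb.sort_index().tolist())
```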
Partition live births into two groups, one for first babies, and one for others. (Might be able to do this more elegantly, by using np.where and groupby or something, but not worth it for just these two groups I think -- would be nice if query worked like a groupby, giving both the result and its negation).
This also works, but is much less clear:
birthgroups = df[df.outcome==1].groupby(df.birthord==1).groups
firstbirth = df.loc[birthgroups[0]]
secondplusbirth = df.loc[birthgroups[1]]
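The "result and its negation" idea can be had in one pass by grouping on the boolean itself; a sketch on made-up data (column names follow the real frame, rows are invented):

```python
import pandas as pd

# Stand-in for the pregnancy frame: outcome 1 == live birth.
df = pd.DataFrame({'outcome':  [1, 1, 1, 2, 1],
                   'birthord': [1, 2, 1, None, 3]})

live = df[df.outcome == 1]

# Grouping on the boolean yields both partitions at once: keys True and False.
parts = dict(list(live.groupby(live.birthord == 1)))
first, others = parts[True], parts[False]
print(len(first), len(others))
```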
In [168]:
df1 = df.query('outcome==1 & birthord==1')
dfnot1 = df.query('outcome==1 & birthord != 1')
In [172]:
import numpy as np
print(np.average(df1.prglength), np.average(dfnot1.prglength))
# Let's do it manually as well
def mean_(x):
    return sum(x) / len(x)
print(mean_(df1.prglength), mean_(dfnot1.prglength))
%timeit np.average(df1.prglength)
%timeit np.mean(df1.prglength)
%timeit mean_(df1.prglength)
According to the book, first babies are born 13 hours later. Let's check.
In [176]:
(mean_(df1.prglength) - mean_(dfnot1.prglength)) * 7 * 24
Out[176]:
That was the end of chapter one. Pretty basic stuff so far, but I'm glad I worked my way through it -- it gave me a chance to become more familiar with the IPython interface, and it was fun looking at the author's code and reimplementing it in Python. The indexing and assigning gave me the most trouble -- some of these restrictions seem to stem from the underlying Numpy NDArray deciding to create a copy instead of a view. Fascinating to see the example code the author provided -- very different from anything I would write. I really enjoy using Pandas, but it's valuable to see different approaches. (And Pandas wasn't around when he began writing his book.)
I will pickle the Pandas DataFrames, and continue working with them in the next chapter.
In [295]:
import pickle
with open("/Users/Stian/src/math-with-ipython/think-stats/preg-pandas.pickle", "wb") as f:
    pickle.dump([df, df1, dfnot1], f)
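Reloading in the next chapter should be the mirror image of the dump above; a round-trip sketch using a temporary file in place of the real path (the frames here are dummies):

```python
import os
import pickle
import tempfile
import pandas as pd

# Dummy frames standing in for df, df1, dfnot1.
frames = [pd.DataFrame({'a': [1, 2]}), pd.DataFrame({'b': [3.0]})]

path = os.path.join(tempfile.mkdtemp(), "preg-pandas.pickle")
with open(path, "wb") as f:
    pickle.dump(frames, f)

# The reload is the same open()/pickle.load() pair in reverse.
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored[0].equals(frames[0]))
```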