In [1]:
from __future__ import print_function, division
import numpy as np
import thinkstats2
The NSFG data is in a fixed-width format, documented in a Stata dictionary file. ReadFemPreg
reads the dictionary and then reads the data into a Pandas DataFrame.
In [2]:
def ReadFemPreg(dct_file='2002FemPreg.dct',
                dat_file='2002FemPreg.dat.gz'):
    """Reads the NSFG pregnancy data.

    dct_file: string file name
    dat_file: string file name

    returns: DataFrame
    """
    dct = thinkstats2.ReadStataDct(dct_file)
    preg = dct.ReadFixedWidth(dat_file, compression='gzip')
    return preg
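To make the idea of a fixed-width file concrete, here is a minimal sketch that parses a few made-up rows with pandas.read_fwf. It does not show how thinkstats2 works internally; the column positions, the names, and the three data rows are all assumptions chosen for illustration, whereas the real positions come from the Stata dictionary.

```python
import io
import pandas as pd

# Hypothetical layout: the real column positions come from the .dct file.
# Each pair is a half-open [start, end) range of character offsets.
colspecs = [(0, 5), (5, 7), (7, 11)]
names = ['caseid', 'pregordr', 'agepreg']

# Made-up rows; every field sits at a fixed character position.
raw = io.StringIO(
    "00001 13316\n"
    "00001 23925\n"
    "00002 11433\n"
)

toy = pd.read_fwf(raw, colspecs=colspecs, names=names)
print(toy)
```

Because the fields are located by position rather than by a delimiter, a dictionary file that lists each variable's start and width is all a reader needs to recover the columns.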
After loading a DataFrame, I check the shape.
In [3]:
preg = ReadFemPreg()
print(preg.shape)
And take a look at the first few rows.
In [4]:
preg.head()
Out[4]:
Then I validate the variables I am likely to need. The encoding of agepreg is non-obvious.
In [5]:
preg.agepreg
Out[5]:
Which is why you have to read the codebook, which shows that agepreg is stored in centiyears (hundredths of a year).
We can convert to years, a more obvious representation, easily enough:
In [6]:
preg.agepreg /= 100
preg.agepreg.mean()
Out[6]:
For live births, birthweight is coded as two integers, birthwgt_lb and birthwgt_oz. We can use describe to summarize variables.
In [7]:
preg.birthwgt_lb.describe()
Out[7]:
Most of that looks reasonable, but the maximum is 99 lbs! Let's look at the distribution of values:
In [8]:
preg.birthwgt_lb.value_counts().sort_index()
Out[8]:
Consulting the codebook, we see that 97, 98, and 99 are sentinel values indicating "not ascertained", "refused", and "don't know" (that is, the respondent did not know).
Also, the 51-pound baby is undoubtedly an error. We can replace unrealistic values with NaN.
In [9]:
preg.loc[preg.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan
And if we don't care about the different sentinel values, we can replace them all with NaN.
In [10]:
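na_vals = [97, 98, 99]
# assign the result back to the column; replace(..., inplace=True) on a column
# accessed as an attribute may act on a temporary copy in recent pandas versions
preg['birthwgt_lb'] = preg.birthwgt_lb.replace(na_vals, np.nan)
preg['birthwgt_oz'] = preg.birthwgt_oz.replace(na_vals, np.nan)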
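The effect of sentinel codes on summary statistics is easy to see on a small example. The sketch below uses a made-up Series of pound values (not NSFG data) that includes the three sentinel codes, and compares the maximum and mean before and after replacing them with NaN.

```python
import numpy as np
import pandas as pd

# Made-up pounds column: four plausible weights plus the codes 97, 98, 99.
lb = pd.Series([7, 6, 8, 99, 7, 97, 98])

# With the sentinels left in, they dominate the maximum and inflate the mean.
print(lb.max(), lb.mean())

# After replacement, pandas skips NaN when computing summary statistics.
cleaned = lb.replace([97, 98, 99], np.nan)
print(cleaned.max(), cleaned.mean())
```

Replacing the codes with NaN, rather than dropping the rows, keeps the other variables for those pregnancies available for later analysis.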
Next, it will be convenient to recode birthwgt_lb and birthwgt_oz with a single floating-point value.
In [11]:
preg['totalwgt_lb'] = preg.birthwgt_lb + preg.birthwgt_oz / 16.0
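As a quick check of the arithmetic, the sketch below applies the same formula to a made-up frame. The values are illustrative only; the point is that 8 ounces becomes half a pound, and that a missing value in either column makes the total missing too.

```python
import numpy as np
import pandas as pd

# Made-up rows: a complete record, one missing ounces, one missing pounds.
toy = pd.DataFrame({
    'birthwgt_lb': [7.0, 6.0, np.nan],
    'birthwgt_oz': [8.0, np.nan, 4.0],
})

# Same recode as above: 16 ounces to the pound.
toy['totalwgt_lb'] = toy.birthwgt_lb + toy.birthwgt_oz / 16.0
print(toy)
```

NaN propagates through the addition, so a record with only one of the two parts known does not silently produce a misleading total.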
After testing these cleanings and recodings, we can encapsulate them in a function. As we work with additional variables, we might add more lines to this function.
In [12]:
def CleanFemPreg(preg):
    """Recodes variables from the pregnancy frame.

    preg: DataFrame
    """
    # mother's age is encoded in centiyears; convert to years
    preg.agepreg /= 100.0

    # birthwgt_lb contains at least one bogus value (51 lbs)
    # replace with NaN
    preg.loc[preg.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan

    # replace 'not ascertained', 'refused', 'don't know' with NaN;
    # assign back to each column, because inplace=True on a column accessed
    # as an attribute may modify a temporary copy in recent pandas versions
    na_vals = [97, 98, 99]
    preg['birthwgt_lb'] = preg.birthwgt_lb.replace(na_vals, np.nan)
    preg['birthwgt_oz'] = preg.birthwgt_oz.replace(na_vals, np.nan)
    preg['hpagelb'] = preg.hpagelb.replace(na_vals, np.nan)

    preg['babysex'] = preg.babysex.replace([7, 9], np.nan)
    preg['nbrnaliv'] = preg.nbrnaliv.replace([9], np.nan)

    # birthweight is stored in two columns, lbs and oz.
    # convert to a single column in lb
    # NOTE: creating a new column requires dictionary syntax,
    # not attribute assignment (like preg.totalwgt_lb)
    preg['totalwgt_lb'] = preg.birthwgt_lb + preg.birthwgt_oz / 16.0

    # due to a bug in ReadStataDct, the last variable gets clipped;
    # so for now set it to NaN
    preg.cmintvw = np.nan
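The note about dictionary syntax is worth seeing in action. In this sketch, which uses a made-up two-row frame with hypothetical column names lb, oz, and total, attribute assignment to a name that is not yet a column sets an ordinary Python attribute and leaves the columns unchanged, while bracket assignment creates the column.

```python
import warnings
import pandas as pd

toy = pd.DataFrame({'lb': [7.0, 6.0], 'oz': [8.0, 4.0]})

with warnings.catch_warnings():
    # pandas warns about this pattern; silence it for the demonstration
    warnings.simplefilter('ignore')
    # 'total' is not an existing column, so this only sets an attribute
    toy.total = toy.lb + toy.oz / 16.0

created_by_attribute = 'total' in toy.columns
print(created_by_attribute)

# Bracket (dictionary-style) assignment creates the column
toy['total'] = toy.lb + toy.oz / 16.0
created_by_key = 'total' in toy.columns
print(created_by_key)
```

Attribute assignment does work for a column that already exists, which is why the function above can set preg.cmintvw directly.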
The NSFG codebook includes summaries for many variables, which we can use to make sure the data is uncorrupted and that we are interpreting it correctly.
In [13]:
preg.pregordr.value_counts().sort_index()
Out[13]:
The distribution of pregordr is consistent with the summary in the codebook.
After running a few checks like this, I document them using assert statements.
In [14]:
assert len(preg) == 13593
assert preg.caseid[13592] == 12571
assert preg.pregordr.value_counts()[1] == 5033
assert preg.nbrnaliv.value_counts()[1] == 8981
assert preg.babysex.value_counts()[1] == 4641
assert preg.birthwgt_lb.value_counts()[7] == 3049
assert preg.birthwgt_oz.value_counts()[0] == 1037
assert preg.prglngth.value_counts()[39] == 4744
assert preg.outcome.value_counts()[1] == 9148
assert preg.birthord.value_counts()[1] == 4413
assert preg.agepreg.value_counts()[22.75] == 100
assert preg.totalwgt_lb.value_counts()[7.5] == 302
weights = preg.finalwgt.value_counts()
key = max(weights.keys())
assert weights[key] == 6
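When the list of checks grows, one option is to drive them from a table of expected counts so that a failure reports which value disagreed. The helper below, check_counts, is a hypothetical name introduced here for illustration and is not part of thinkstats2 or nsfg.py; the frame is made up.

```python
import pandas as pd


def check_counts(series, expected):
    """Asserts that value counts match a dict of {value: count} pairs.

    series: pandas Series
    expected: dict mapping a value to the number of rows expected to hold it
    """
    counts = series.value_counts()
    for value, count in expected.items():
        actual = counts.get(value, 0)   # 0 if the value never appears
        assert actual == count, (
            '%s: expected %d rows with value %r, found %d'
            % (series.name, count, value, actual))


# Made-up data standing in for a codebook summary
toy = pd.DataFrame({'outcome': [1, 1, 2, 1, 4]})
check_counts(toy.outcome, {1: 3, 2: 1, 4: 1})
```

A call such as check_counts(preg.outcome, {1: 9148}) would then express the same check as the corresponding assert above, with a more informative message on failure.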
And once I have this code working in a notebook, I wrap it up in a module so I can import it from other notebooks and scripts. The code from this notebook is in nsfg.py.