Import and Validation

Copyright 2015 Allen Downey

License: Creative Commons Attribution 4.0 International


In [1]:
from __future__ import print_function, division

import numpy as np
import thinkstats2

The NSFG data is in a fixed-width format, documented in a Stata dictionary file. ReadFemPreg reads the dictionary and uses it to read the data into a pandas DataFrame.


In [2]:
def ReadFemPreg(dct_file='2002FemPreg.dct',
                dat_file='2002FemPreg.dat.gz'):
    """Reads the NSFG pregnancy data.

    dct_file: string file name
    dat_file: string file name

    returns: DataFrame
    """
    dct = thinkstats2.ReadStataDct(dct_file)
    preg = dct.ReadFixedWidth(dat_file, compression='gzip')
    return preg
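
To make "fixed-width" concrete, here is a minimal sketch of reading such a file with pandas.read_fwf. The column positions and the three sample records are made up for illustration; in the real data the positions come from the Stata dictionary, and this is not necessarily how thinkstats2 does it internally.

```python
import io
import pandas as pd

# Hypothetical layout: characters 0-4 hold caseid, characters 5-6 hold pregordr.
# These positions are assumptions, not the real NSFG layout.
raw = "    1 1\n    1 2\n    2 1\n"

colspecs = [(0, 5), (5, 7)]      # half-open [start, end) character ranges
names = ['caseid', 'pregordr']

# header=None because the file has no header row; names supplies the labels
toy = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, names=names, header=None)
print(toy)
```

Each field is identified only by its character positions, which is why the dictionary file is needed to interpret the data.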

After loading a DataFrame, I check the shape.


In [3]:
preg = ReadFemPreg()
print(preg.shape)


(13593, 243)

And take a look at the first few rows.


In [4]:
preg.head()


Out[4]:
caseid pregordr howpreg_n howpreg_p moscurrp nowprgdk pregend1 pregend2 nbrnaliv multbrth ... poverty_i laborfor_i religion_i metro_i basewgt adj_mod_basewgt finalwgt secu_p sest cmintvw
0 1 1 NaN NaN NaN NaN 6 NaN 1 NaN ... 0 0 0 0 3410.389399 3869.349602 6448.271112 2 9 1231
1 1 2 NaN NaN NaN NaN 6 NaN 1 NaN ... 0 0 0 0 3410.389399 3869.349602 6448.271112 2 9 1231
2 2 1 NaN NaN NaN NaN 5 NaN 3 5 ... 0 0 0 0 7226.301740 8567.549110 12999.542264 2 12 1231
3 2 2 NaN NaN NaN NaN 6 NaN 1 NaN ... 0 0 0 0 7226.301740 8567.549110 12999.542264 2 12 1231
4 2 3 NaN NaN NaN NaN 6 NaN 1 NaN ... 0 0 0 0 7226.301740 8567.549110 12999.542264 2 12 1231

5 rows × 243 columns

Then I validate the variables I am likely to need. The encoding of agepreg is non-obvious.


In [5]:
preg.agepreg


Out[5]:
0     3316
1     3925
2     1433
3     1783
4     1833
5     2700
6     2883
7     3016
8     2808
9     3233
10    2575
11    2300
12    2458
13    2983
14    2750
...
13578    2400
13579    2591
13580    2825
13581    3066
13582    3325
13583    2366
13584    2691
13585    2141
13586    2241
13587    2341
13588    1791
13589    1850
13590    1975
13591    2158
13592    2158
Name: agepreg, Length: 13593, dtype: float64

These values are clearly not ages in years, which is why you have to read the codebook:

http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611935

agepreg is recorded in hundredths of a year, so dividing by 100 converts it to years:


In [6]:
preg.agepreg /= 100
preg.agepreg.mean()


Out[6]:
24.688151197039499

For live births, birthweight is coded as two integers, birthwgt_lb and birthwgt_oz. We can use describe to summarize variables.


In [7]:
preg.birthwgt_lb.describe()


Out[7]:
count    9144.000000
mean        7.431321
std         7.522723
min         0.000000
25%         6.000000
50%         7.000000
75%         8.000000
max        99.000000
Name: birthwgt_lb, dtype: float64

Most of that looks reasonable, but the maximum is 99 lbs! Let's look at the distribution of values:


In [8]:
preg.birthwgt_lb.value_counts().sort_index()


Out[8]:
0        8
1       40
2       53
3       98
4      229
5      697
6     2223
7     3049
8     1889
9      623
10     132
11      26
12      10
13       3
14       3
15       1
51       1
97       1
98       1
99      57
dtype: int64

Consulting the codebook, we see that 97, 98, and 99 are sentinel values indicating "not ascertained", "refused", and "don't know" (that is, the respondent did not know).

Also, the 51 pound baby is undoubtedly an error. We can replace unrealistic values, here anything over 20 lbs, with NaN.


In [9]:
preg.loc[preg.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan

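The 20 lb threshold also catches the sentinel values in birthwgt_lb, but birthwgt_oz uses the same codes. If we don't care about the difference between the sentinel values, we can replace them all with NaN.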


In [10]:
na_vals = [97, 98, 99]
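# assign the result back; inplace=True on a column selected from the frame
# does not reliably update the frame in recent versions of pandas
preg['birthwgt_lb'] = preg.birthwgt_lb.replace(na_vals, np.nan)
preg['birthwgt_oz'] = preg.birthwgt_oz.replace(na_vals, np.nan)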
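
To see why the sentinel values matter, here is a small sketch with made-up numbers (not NSFG data) that compares the mean before and after replacing them. The Series `lb` is a toy stand-in for birthwgt_lb.

```python
import numpy as np
import pandas as pd

# Toy data: three plausible weights plus two sentinel codes
lb = pd.Series([7, 8, 6, 99, 98])

# Before cleaning, the sentinels are treated as real weights
raw_mean = lb.mean()

# Replace the sentinel codes with NaN and keep the result
cleaned = lb.replace([97, 98, 99], np.nan)

# mean skips NaN by default, so only the three real weights count
clean_mean = cleaned.mean()

print(raw_mean, clean_mean)
```

Two bad values out of five are enough to pull the mean far away from any plausible birth weight, which is the same effect we saw in the standard deviation reported by describe.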

Next, it will be convenient to recode birthwgt_lb and birthwgt_oz with a single floating-point value.


In [11]:
preg['totalwgt_lb'] = preg.birthwgt_lb + preg.birthwgt_oz / 16.0
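
As a quick check of the arithmetic, here is the same recoding applied to a toy DataFrame with made-up values. It also shows that a missing value in either column makes the total missing.

```python
import numpy as np
import pandas as pd

# Toy rows: 7 lb 8 oz, 8 lb 0 oz, missing pounds, missing ounces
toy = pd.DataFrame({
    'birthwgt_lb': [7.0, 8.0, np.nan, 6.0],
    'birthwgt_oz': [8.0, 0.0, 4.0, np.nan],
})

# 16 ounces per pound; NaN in either operand propagates to the sum
toy['totalwgt_lb'] = toy.birthwgt_lb + toy.birthwgt_oz / 16.0
print(toy)
```

So 7 lb 8 oz becomes 7.5 lb, and any row where pounds or ounces is unknown ends up with an unknown total, which is usually what we want.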

After testing these cleanings and recodings, we can encapsulate them in a function. As we work with additional variables, we might add more lines to this function.


In [12]:
def CleanFemPreg(preg):
    """Recodes variables from the pregnancy frame.

    preg: DataFrame
    """
    # mother's age is encoded in centiyears; convert to years
    preg.agepreg /= 100.0

    # birthwgt_lb contains at least one bogus value (51 lbs)
    # replace with NaN
    preg.loc[preg.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan
    
    # replace 'not ascertained', 'refused', 'don't know' with NaN;
    # assign the result back, because inplace=True on a column selected
    # from the frame does not reliably update the frame in recent pandas
    na_vals = [97, 98, 99]
    preg['birthwgt_lb'] = preg.birthwgt_lb.replace(na_vals, np.nan)
    preg['birthwgt_oz'] = preg.birthwgt_oz.replace(na_vals, np.nan)
    preg['hpagelb'] = preg.hpagelb.replace(na_vals, np.nan)

    preg['babysex'] = preg.babysex.replace([7, 9], np.nan)
    preg['nbrnaliv'] = preg.nbrnaliv.replace([9], np.nan)

    # birthweight is stored in two columns, lbs and oz.
    # convert to a single column in lb
    # NOTE: creating a new column requires dictionary syntax,
    # not attribute assignment (like preg.totalwgt_lb)
    preg['totalwgt_lb'] = preg.birthwgt_lb + preg.birthwgt_oz / 16.0    

    # due to a bug in ReadStataDct, the last variable gets clipped;
    # so for now set it to NaN
    preg.cmintvw = np.nan

The NSFG codebook includes summaries for many variables, which we can use to make sure the data is uncorrupted and that we are interpreting it correctly.


In [13]:
preg.pregordr.value_counts().sort_index()


Out[13]:
1     5033
2     3766
3     2334
4     1224
5      613
6      308
7      158
8       78
9       38
10      17
11       8
12       5
13       3
14       3
15       1
16       1
17       1
18       1
19       1
dtype: int64

The distribution of pregordr is consistent with the summary in the codebook.

After running a few checks like this, I document them using assert statements.


In [14]:
assert len(preg) == 13593

assert preg.caseid[13592] == 12571
assert preg.pregordr.value_counts()[1] == 5033
assert preg.nbrnaliv.value_counts()[1] == 8981
assert preg.babysex.value_counts()[1] == 4641
assert preg.birthwgt_lb.value_counts()[7] == 3049
assert preg.birthwgt_oz.value_counts()[0] == 1037
assert preg.prglngth.value_counts()[39] == 4744
assert preg.outcome.value_counts()[1] == 9148
assert preg.birthord.value_counts()[1] == 4413
assert preg.agepreg.value_counts()[22.75] == 100
assert preg.totalwgt_lb.value_counts()[7.5] == 302

weights = preg.finalwgt.value_counts()
key = max(weights.keys())
assert weights[key] == 6
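
If the list of checks grows, one option is to put the expected counts in a dictionary and check them in a loop. This is only a sketch: check_counts is a hypothetical helper, not part of nsfg.py or thinkstats2, and the Series here is toy data.

```python
import pandas as pd

def check_counts(series, expected):
    """Asserts that each value appears the expected number of times.

    series: Series to validate
    expected: dict that maps from value to expected count

    (hypothetical helper, shown for illustration only)
    """
    counts = series.value_counts()
    for value, n in expected.items():
        # values that never appear count as zero
        actual = counts.get(value, 0)
        assert actual == n, '%s: expected %d rows with value %r, found %d' % (
            series.name, n, value, actual)

# Toy data standing in for a column like pregordr
toy = pd.Series([1, 1, 2, 3, 1], name='pregordr')
check_counts(toy, {1: 3, 2: 1, 3: 1})
```

The advantage over bare assert statements is the error message, which says which variable and value failed and what count was found.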

And once I have this code working in a notebook, I wrap it up in a module so I can import it from other notebooks and scripts. The code from this notebook is in nsfg.py.