Import and Validation

License: Creative Commons Attribution 4.0 International



In [1]:

    
from __future__ import print_function, division

import numpy as np
import thinkstats2

The NSFG data is in a fixed-width format, documented in a Stata dictionary file. ReadFemPreg reads the dictionary and then reads the data into a Pandas DataFrame.



In [2]:

    
def ReadFemPreg(dct_file='2002FemPreg.dct',
                dat_file='2002FemPreg.dat.gz'):
    """Reads the NSFG pregnancy data.

    dct_file: string file name
    dat_file: string file name

    returns: DataFrame
    """
    dct = thinkstats2.ReadStataDct(dct_file)
    preg = dct.ReadFixedWidth(dat_file, compression='gzip')
    return preg
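To make the fixed-width idea concrete, here is a minimal sketch that parses a few made-up records with pandas.read_fwf. The three-column layout and the character positions are assumptions for illustration only; in the real file the positions come from the Stata dictionary, and this sketch does not claim to show how thinkstats2 does it.

```python
import io

import pandas as pd

# Three made-up records in a fixed-width layout (positions are hypothetical):
# chars 0-4 hold caseid, chars 5-6 hold pregordr, chars 7-11 hold agepreg.
raw = (
    "    1 1 3316\n"
    "    1 2 3925\n"
    "    2 1 1433\n"
)

# Each (start, end) pair is a half-open character range, like a Python slice.
colspecs = [(0, 5), (5, 7), (7, 12)]
names = ['caseid', 'pregordr', 'agepreg']

sample = pd.read_fwf(io.StringIO(raw), colspecs=colspecs,
                     names=names, header=None)
print(sample)
```

The dictionary file plays the role of colspecs and names here: it says where each variable starts and what it is called.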

After loading a DataFrame, I check the shape.



In [3]:

    
preg = ReadFemPreg()
print(preg.shape)









    



(13593, 243)

And take a look at the first few rows.



In [4]:

    
preg.head()









    Out[4]:






  
    
      
   caseid  pregordr  howpreg_n  howpreg_p  moscurrp  nowprgdk  pregend1  pregend2  nbrnaliv  multbrth  ...  poverty_i  laborfor_i  religion_i  metro_i      basewgt  adj_mod_basewgt      finalwgt  secu_p  sest  cmintvw
0       1         1        NaN        NaN       NaN       NaN         6       NaN         1       NaN  ...          0           0           0        0  3410.389399      3869.349602   6448.271112       2     9     1231
1       1         2        NaN        NaN       NaN       NaN         6       NaN         1       NaN  ...          0           0           0        0  3410.389399      3869.349602   6448.271112       2     9     1231
2       2         1        NaN        NaN       NaN       NaN         5       NaN         3         5  ...          0           0           0        0  7226.301740      8567.549110  12999.542264       2    12     1231
3       2         2        NaN        NaN       NaN       NaN         6       NaN         1       NaN  ...          0           0           0        0  7226.301740      8567.549110  12999.542264       2    12     1231
4       2         3        NaN        NaN       NaN       NaN         6       NaN         1       NaN  ...          0           0           0        0  7226.301740      8567.549110  12999.542264       2    12     1231

5 rows × 243 columns

Then I validate the variables I am likely to need. The encoding of agepreg is non-obvious.



In [5]:

    
preg.agepreg









    Out[5]:





0     3316
1     3925
2     1433
3     1783
4     1833
5     2700
6     2883
7     3016
8     2808
9     3233
10    2575
11    2300
12    2458
13    2983
14    2750
...
13578    2400
13579    2591
13580    2825
13581    3066
13582    3325
13583    2366
13584    2691
13585    2141
13586    2241
13587    2341
13588    1791
13589    1850
13590    1975
13591    2158
13592    2158
Name: agepreg, Length: 13593, dtype: float64

The values are in hundredths of a year (centiyears), which is why you have to read the codebook:

http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=labelDetails&fileCode=PREG&section=A&subSec=8016&srtLabel=611935

We can convert to a more obvious representation easily enough:



In [6]:

    
preg.agepreg /= 100
preg.agepreg.mean()









    Out[6]:





24.688151197039499

For live births, birthweight is coded as two integers, birthwgt_lb and birthwgt_oz. We can use describe to summarize variables.



In [7]:

    
preg.birthwgt_lb.describe()









    Out[7]:





count    9144.000000
mean        7.431321
std         7.522723
min         0.000000
25%         6.000000
50%         7.000000
75%         8.000000
max        99.000000
Name: birthwgt_lb, dtype: float64

Most of that looks reasonable, but the maximum is 99 lbs! Let's look at the distribution of values:



In [8]:

    
preg.birthwgt_lb.value_counts().sort_index()









    Out[8]:





0        8
1       40
2       53
3       98
4      229
5      697
6     2223
7     3049
8     1889
9      623
10     132
11      26
12      10
13       3
14       3
15       1
51       1
97       1
98       1
99      57
dtype: int64

Consulting the codebook, we see that 97, 98, and 99 are sentinel values indicating "not ascertained", "refused", and "don't know" (that is, the respondent did not know).

Also, the 51 pound baby is undoubtedly an error. We can replace unrealistic values with NaN.



In [9]:

    
preg.loc[preg.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan

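And if we don't care about the different sentinel values, we can replace them all with NaN. The filter above already removed them from birthwgt_lb, but birthwgt_oz still needs the same treatment.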



In [10]:

    
na_vals = [97, 98, 99]
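# assign the result back rather than calling replace(inplace=True) on a
# column accessed by attribute, which newer versions of pandas may not
# propagate to the DataFrame
preg['birthwgt_lb'] = preg.birthwgt_lb.replace(na_vals, np.nan)
preg['birthwgt_oz'] = preg.birthwgt_oz.replace(na_vals, np.nan)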
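If the distinction between the sentinel values does matter, one option is to tally them before replacing. The following is a minimal sketch on made-up data; the toy Series and the labels dictionary are illustrative and are not part of nsfg.py.

```python
import numpy as np
import pandas as pd

# Toy pounds column: plausible weights, one bogus value, and the sentinels.
lb = pd.Series([7, 8, 6, 51, 97, 98, 99, 99])

# Labels follow the codebook meanings quoted above.
labels = {97: 'not ascertained', 98: 'refused', 99: "don't know"}

# Tally the sentinel codes before discarding them.
sentinel_counts = lb[lb.isin(list(labels))].map(labels).value_counts()

# Replace the sentinels with NaN, then mask anything else implausibly large.
cleaned = lb.replace(list(labels), np.nan)
cleaned = cleaned.mask(cleaned > 20)

print(sentinel_counts)
print(cleaned.mean())
```

Because NaN is skipped by summary statistics such as mean, the cleaned column no longer has its average pulled upward by the codes.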

Next, it will be convenient to recode birthwgt_lb and birthwgt_oz with a single floating-point value.



In [11]:

    
preg['totalwgt_lb'] = preg.birthwgt_lb + preg.birthwgt_oz / 16.0
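As a small worked example of this recoding, on made-up values rather than the NSFG data: 7 lb 8 oz becomes 7.5 lb, and a missing value in either part makes the total missing.

```python
import numpy as np
import pandas as pd

# Made-up birth weights; NaN marks a missing part.
toy = pd.DataFrame({
    'birthwgt_lb': [7.0, 8.0, np.nan, 6.0],
    'birthwgt_oz': [8.0, 0.0, 4.0, np.nan],
})

# 16 ounces per pound; NaN in either column propagates to the total.
toy['totalwgt_lb'] = toy.birthwgt_lb + toy.birthwgt_oz / 16.0
print(toy)
```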

After testing these cleanings and recodings, we can encapsulate them in a function. As we work with additional variables, we might add more lines to this function.



In [12]:

    
def CleanFemPreg(preg):
    """Recodes variables from the pregnancy frame.

    preg: DataFrame
    """
    # mother's age is encoded in centiyears; convert to years
    preg.agepreg /= 100.0

    # birthwgt_lb contains at least one bogus value (51 lbs)
    # replace with NaN
    preg.loc[preg.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan
    
    # replace 'not ascertained', 'refused', 'don't know' with NaN
    na_vals = [97, 98, 99]
    # assign results back instead of using replace(inplace=True) on a
    # column accessed by attribute
    preg['birthwgt_lb'] = preg.birthwgt_lb.replace(na_vals, np.nan)
    preg['birthwgt_oz'] = preg.birthwgt_oz.replace(na_vals, np.nan)
    preg['hpagelb'] = preg.hpagelb.replace(na_vals, np.nan)

    preg['babysex'] = preg.babysex.replace([7, 9], np.nan)
    preg['nbrnaliv'] = preg.nbrnaliv.replace([9], np.nan)

    # birthweight is stored in two columns, lbs and oz.
    # convert to a single column in lb
    # NOTE: creating a new column requires dictionary syntax,
    # not attribute assignment (like preg.totalwgt_lb)
    preg['totalwgt_lb'] = preg.birthwgt_lb + preg.birthwgt_oz / 16.0    

    # due to a bug in ReadStataDct, the last variable gets clipped;
    # so for now set it to NaN
    preg.cmintvw = np.nan
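The note in the function about dictionary syntax can be checked on a toy DataFrame. This is a sketch with made-up values; the variable names created_by_attribute and created_by_brackets are only for illustration.

```python
import warnings

import pandas as pd

toy = pd.DataFrame({'birthwgt_lb': [7.0, 8.0], 'birthwgt_oz': [8.0, 4.0]})
total = toy.birthwgt_lb + toy.birthwgt_oz / 16.0

# Attribute assignment to a name that is not yet a column only sets a
# Python attribute on the object; pandas warns and adds no column.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    toy.totalwgt_lb = total
created_by_attribute = 'totalwgt_lb' in toy.columns

# Bracket (dictionary-style) assignment creates the column.
toy['totalwgt_lb'] = total
created_by_brackets = 'totalwgt_lb' in toy.columns

print(created_by_attribute, created_by_brackets)
```

Attribute assignment does work for columns that already exist, which is why lines like preg.cmintvw = np.nan in the function behave as intended.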

The NSFG codebook includes summaries for many variables, which we can use to check that the data is uncorrupted and that we are interpreting it correctly.



In [13]:

    
preg.pregordr.value_counts().sort_index()









    Out[13]:





1     5033
2     3766
3     2334
4     1224
5      613
6      308
7      158
8       78
9       38
10      17
11       8
12       5
13       3
14       3
15       1
16       1
17       1
18       1
19       1
dtype: int64

The distribution of pregordr is consistent with the summary in the codebook.

After running a few checks like this, I document them using assert statements.



In [14]:

    
assert len(preg) == 13593

assert preg.caseid[13592] == 12571
assert preg.pregordr.value_counts()[1] == 5033
assert preg.nbrnaliv.value_counts()[1] == 8981
assert preg.babysex.value_counts()[1] == 4641
assert preg.birthwgt_lb.value_counts()[7] == 3049
assert preg.birthwgt_oz.value_counts()[0] == 1037
assert preg.prglngth.value_counts()[39] == 4744
assert preg.outcome.value_counts()[1] == 9148
assert preg.birthord.value_counts()[1] == 4413
assert preg.agepreg.value_counts()[22.75] == 100
assert preg.totalwgt_lb.value_counts()[7.5] == 302

weights = preg.finalwgt.value_counts()
key = max(weights.keys())
assert weights[key] == 6
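When there are many such checks, one possible way to organize them is a small helper that compares value counts against a table of expected counts and reports which value disagrees. The function check_counts below is hypothetical (it is not part of nsfg.py or thinkstats2), and the data is made up.

```python
import pandas as pd


def check_counts(series, expected):
    """Asserts that value counts match an expected {value: count} mapping.

    series: Series
    expected: dict mapping each value to the count given in the codebook
    """
    counts = series.value_counts()
    for value, count in expected.items():
        actual = counts.get(value, 0)
        assert actual == count, (
            '%s: expected %d rows with value %r, found %d'
            % (series.name, count, value, actual))


# Made-up column standing in for a variable summarized in a codebook.
toy = pd.Series([1, 1, 1, 2, 2, 3], name='pregordr')
check_counts(toy, {1: 3, 2: 2, 3: 1})
```

A failing check then names the variable and the value that disagree, which is easier to diagnose than a bare assert.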

And once I have this code working in a notebook, I wrap it up in a module so I can import it from other notebooks and scripts. The code from this notebook is in nsfg.py.
