Examples and Exercises from Think Stats, 2nd Edition

MIT License: https://opensource.org/licenses/MIT



In [1]:

    
from __future__ import print_function, division

import nsfg

Examples from Chapter 1

Read NSFG data into a Pandas DataFrame.



In [2]:

    
preg = nsfg.ReadFemPreg()
preg.head()









    Out[2]:







  
    
      
   caseid  pregordr  howpreg_n  howpreg_p  moscurrp  nowprgdk  pregend1  pregend2  nbrnaliv  multbrth  ...  laborfor_i  religion_i  metro_i      basewgt  adj_mod_basewgt      finalwgt  secu_p  sest  cmintvw  totalwgt_lb
0       1         1        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  3410.389399      3869.349602   6448.271112       2     9      NaN       8.8125
1       1         2        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  3410.389399      3869.349602   6448.271112       2     9      NaN       7.8750
2       2         1        NaN        NaN       NaN       NaN       5.0       NaN       3.0       5.0  ...           0           0        0  7226.301740      8567.549110  12999.542264       2    12      NaN       9.1250
3       2         2        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  7226.301740      8567.549110  12999.542264       2    12      NaN       7.0000
4       2         3        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  7226.301740      8567.549110  12999.542264       2    12      NaN       6.1875
    
  

5 rows × 244 columns

Print the column names.



In [3]:

    
preg.columns









    Out[3]:





Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=244)

Select a single column name.



In [4]:

    
preg.columns[1]









    Out[4]:





'pregordr'

Select a column and check what type it is.



In [5]:

    
pregordr = preg['pregordr']
type(pregordr)









    Out[5]:





pandas.core.series.Series

Print a column.



In [6]:

    
pregordr









    Out[6]:





0        1
1        2
2        1
3        2
4        3
5        1
6        2
7        3
8        1
9        2
10       1
11       1
12       2
13       3
14       1
15       2
16       3
17       1
18       2
19       1
20       2
21       1
22       2
23       1
24       2
25       3
26       1
27       1
28       2
29       3
        ..
13563    2
13564    3
13565    1
13566    1
13567    1
13568    2
13569    1
13570    2
13571    3
13572    4
13573    1
13574    2
13575    1
13576    1
13577    2
13578    1
13579    2
13580    1
13581    2
13582    3
13583    1
13584    2
13585    1
13586    2
13587    3
13588    1
13589    2
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64

Select a single element from a column.



In [7]:

    
pregordr[0]









    Out[7]:





1

Select a slice from a column.



In [8]:

    
pregordr[2:5]









    Out[8]:





2    1
3    2
4    3
Name: pregordr, dtype: int64
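In the cells above, the index labels of pregordr happen to be 0, 1, 2, ..., so selecting by label and selecting by position give the same answer. As a small sketch on made-up data (the Series `s` and its values are hypothetical, not from the NSFG files), `.loc` and `.iloc` make the difference explicit when labels and positions do not line up:

```python
import pandas as pd

# Hypothetical Series whose index labels run in the opposite order
# from the positions, so label and position disagree.
s = pd.Series([10, 20, 30, 40], index=[3, 2, 1, 0])

first_by_position = s.iloc[0]   # element stored first
value_at_label_0 = s.loc[0]     # element whose index label is 0
middle = s.iloc[1:3]            # positions 1 and 2 (end is exclusive)

print(first_by_position, value_at_label_0)
print(middle)
```

Using `.loc` or `.iloc` states the intent directly, which may help when a DataFrame has been filtered or reordered.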

Select a column using dot notation.



In [9]:

    
pregordr = preg.pregordr

Count the number of times each value occurs.



In [10]:

    
preg.outcome.value_counts().sort_index()









    Out[10]:





1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64
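The call to sort_index matters because value_counts, by default, orders its result by frequency rather than by value. A minimal sketch with invented outcome codes (the Series below is hypothetical) shows the two orderings:

```python
import pandas as pd

# Made-up outcome codes, chosen so that no two counts tie.
outcome = pd.Series([1, 4, 1, 2, 1, 4])

by_frequency = outcome.value_counts()            # most common value first
by_value = outcome.value_counts().sort_index()   # ordered by the code itself

print(by_frequency)
print(by_value)
```

Sorting by index makes the result easier to compare line by line with a codebook table, which lists codes in numerical order.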

Check the values of another variable.



In [11]:

    
preg.birthwgt_lb.value_counts().sort_index()









    Out[11]:





0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64

Make a dictionary that maps from each respondent's caseid to a list of indices into the pregnancy DataFrame. Use it to select the pregnancy outcomes for a single respondent.



In [12]:

    
caseid = 10229
preg_map = nsfg.MakePregMap(preg)
indices = preg_map[caseid]
preg.outcome[indices].values









    Out[12]:





array([4, 4, 4, 4, 4, 4, 1])
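To make the idea of the map concrete, here is one way such a dictionary could be built. This is a sketch under assumptions: `make_case_map` and the toy DataFrame are hypothetical names and data, and the code is not claimed to match how nsfg.MakePregMap is written.

```python
from collections import defaultdict

import pandas as pd


def make_case_map(df):
    """Map each caseid to the list of row labels where it appears."""
    case_map = defaultdict(list)
    for index, caseid in df["caseid"].items():
        case_map[caseid].append(index)
    return case_map


# Hypothetical pregnancy records: two respondents, five pregnancies.
toy = pd.DataFrame({"caseid": [1, 1, 2, 2, 2],
                    "outcome": [1, 1, 4, 1, 2]})

case_map = make_case_map(toy)
indices = case_map[2]                        # row labels for respondent 2
outcomes = toy.loc[indices, "outcome"].values

print(indices, outcomes)
```

Building the dictionary once means later lookups for a respondent do not have to scan the whole DataFrame again.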

Exercises

Select the birthord column, print the value counts, and compare to results published in the codebook.



In [13]:

    
# Solution goes here

We can also use isnull to count the number of NaNs.



In [14]:

    
preg.birthord.isnull().sum()









    Out[14]:





4445
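The separate count is useful because value_counts leaves NaN out of its result by default. A short sketch on made-up values (the Series is hypothetical) shows how the two counts relate:

```python
import numpy as np
import pandas as pd

# Hypothetical birth-order values with two missing entries.
birthord = pd.Series([1.0, 2.0, np.nan, 1.0, np.nan])

counts = birthord.value_counts().sort_index()           # NaN is omitted
n_missing = birthord.isnull().sum()                     # number of NaN entries
counts_with_nan = birthord.value_counts(dropna=False)   # NaN gets its own row

print(counts)
print(n_missing)
print(counts_with_nan)
```

Adding the NaN count to the sum of the value counts should give the length of the column, which is a quick consistency check.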

Select the prglngth column, print the value counts, and compare to results published in the codebook.



In [15]:

    
# Solution goes here

To compute the mean of a column, you can invoke the mean method on a Series. For example, here is the mean birthweight in pounds:



In [16]:

    
preg.totalwgt_lb.mean()









    Out[16]:





7.265628457623368
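Columns such as totalwgt_lb contain NaN for some rows, and the mean method of a Series skips those by default. As a sketch with invented weights (the values are hypothetical), the `skipna` parameter controls this behavior:

```python
import numpy as np
import pandas as pd

# Hypothetical weights in pounds, with one missing value.
weights = pd.Series([8.0, np.nan, 6.0])

mean_default = weights.mean()             # NaN ignored: (8.0 + 6.0) / 2
mean_strict = weights.mean(skipna=False)  # NaN propagates to the result

print(mean_default, mean_strict)
```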

Create a new column named totalwgt_kg that contains birth weight in kilograms. Compute its mean. Remember that when you create a new column, you have to use dictionary syntax, not dot notation.



In [17]:

    
# Solution goes here
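The reminder about dictionary syntax can be illustrated on unrelated toy data. In this sketch, the DataFrame and the column names `length_in`, `length_cm`, and `length_mm` are hypothetical; it shows that bracket assignment adds a column, while assigning to a new attribute name does not (pandas normally emits a warning in that case, which is silenced here).

```python
import warnings

import pandas as pd

# Hypothetical measurements in inches.
toy = pd.DataFrame({"length_in": [10.0, 20.0]})

# Dictionary (bracket) syntax creates a real column.
toy["length_cm"] = toy["length_in"] * 2.54

# Dot notation with a new name only sets a Python attribute.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toy.length_mm = toy["length_in"] * 25.4

columns = list(toy.columns)
print(columns)
```

Dot notation remains convenient for reading a column that already exists, as in the earlier cells.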

nsfg.py also provides ReadFemResp, which reads the female respondents file and returns a DataFrame:



In [18]:

    
resp = nsfg.ReadFemResp()

DataFrame provides a method head that displays the first five rows:



In [19]:

    
resp.head()









    Out[19]:







  
    
      
   caseid  rscrinf  rdormres  rostscrn  rscreenhisp  rscreenrace  age_a  age_r  cmbirth  agescrn  ...  pubassis_i      basewgt  adj_mod_basewgt     finalwgt  secu_r  sest  cmintvw  cmlstyr  screentime   intvlngth
0    2298        1         5         5            1          5.0     27     27      902       27  ...           0  3247.916977      5123.759559  5556.717241       2    18     1234     1222    18:26:36  110.492667
1    5012        1         5         1            5          5.0     42     42      718       42  ...           0  2335.279149      2846.799490  4744.191350       2    18     1233     1221    16:30:59   64.294000
2   11586        1         5         1            5          5.0     43     43      708       43  ...           0  2335.279149      2846.799490  4744.191350       2    18     1234     1222    18:19:09   75.149167
3    6794        5         5         4            1          5.0     15     15     1042       15  ...           0  3783.152221      5071.464231  5923.977368       2    18     1234     1222    15:54:43   28.642833
4     616        1         5         4            1          5.0     20     20      991       20  ...           0  5341.329968      6437.335772  7229.128072       2    18     1233     1221    14:19:44   69.502667
    
  

5 rows × 3087 columns

Select the age_r column from resp and print the value counts. How old are the youngest and oldest respondents?



In [20]:

    
# Solution goes here

We can use the caseid to match up rows from resp and preg. For example, we can select the row from resp for caseid 2298 like this:



In [21]:

    
resp[resp.caseid==2298]









    Out[21]:







  
    
      
   caseid  rscrinf  rdormres  rostscrn  rscreenhisp  rscreenrace  age_a  age_r  cmbirth  agescrn  ...  pubassis_i      basewgt  adj_mod_basewgt     finalwgt  secu_r  sest  cmintvw  cmlstyr  screentime   intvlngth
0    2298        1         5         5            1          5.0     27     27      902       27  ...           0  3247.916977      5123.759559  5556.717241       2    18     1234     1222    18:26:36  110.492667
    
  

1 rows × 3087 columns

And we can get the corresponding rows from preg like this:



In [22]:

    
preg[preg.caseid==2298]









    Out[22]:







  
    
      
      caseid  pregordr  howpreg_n  howpreg_p  moscurrp  nowprgdk  pregend1  pregend2  nbrnaliv  multbrth  ...  laborfor_i  religion_i  metro_i      basewgt  adj_mod_basewgt      finalwgt  secu_p  sest  cmintvw  totalwgt_lb
2610    2298         1        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  3247.916977      5123.759559   5556.717241       2    18      NaN       6.8750
2611    2298         2        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  3247.916977      5123.759559   5556.717241       2    18      NaN       5.5000
2612    2298         3        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  3247.916977      5123.759559   5556.717241       2    18      NaN       4.1875
2613    2298         4        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  3247.916977      5123.759559   5556.717241       2    18      NaN       6.8750
    
  

4 rows × 244 columns
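Both selections above rely on a boolean mask: comparing a column to a value yields a Series of True and False, and indexing the DataFrame with that Series keeps only the True rows. The sketch below uses two small made-up tables (`resp_toy` and `preg_toy` are hypothetical, as are their values) to show the intermediate mask and a single value pulled out of the matching row.

```python
import pandas as pd

# Hypothetical respondent and pregnancy tables sharing a caseid key.
resp_toy = pd.DataFrame({"caseid": [10, 11], "age_r": [27, 42]})
preg_toy = pd.DataFrame({"caseid": [10, 10, 11],
                         "prglngth": [39, 40, 38]})

mask = preg_toy["caseid"] == 10   # one True/False per pregnancy row
rows = preg_toy[mask]             # only the rows where mask is True

# One respondent row matches, so take the first value of the column.
age = resp_toy[resp_toy["caseid"] == 10]["age_r"].values[0]

print(mask.tolist())
print(rows)
print(age)
```

The rows keep their original index labels after filtering, which is why the preg output above starts at 2610 rather than 0.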

How old is the respondent with caseid 1?



In [23]:

    
# Solution goes here

What are the pregnancy lengths for the respondent with caseid 2298?



In [24]:

    
# Solution goes here

What was the birthweight of the first baby born to the respondent with caseid 5012?



In [25]:

    
# Solution goes here


