Examples and Exercises from Think Stats, 2nd Edition

http://thinkstats2.com

Copyright 2016 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT


In [1]:
from __future__ import print_function, division

import nsfg

Examples from Chapter 1

Read the NSFG pregnancy data into a pandas DataFrame.


In [2]:
preg = nsfg.ReadFemPreg()
preg.head()


Out[2]:
   caseid  pregordr  howpreg_n  howpreg_p  moscurrp  nowprgdk  pregend1  pregend2  nbrnaliv  multbrth  ...  laborfor_i  religion_i  metro_i      basewgt  adj_mod_basewgt      finalwgt  secu_p  sest  cmintvw  totalwgt_lb
0       1         1        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  3410.389399      3869.349602   6448.271112       2     9      NaN       8.8125
1       1         2        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  3410.389399      3869.349602   6448.271112       2     9      NaN       7.8750
2       2         1        NaN        NaN       NaN       NaN       5.0       NaN       3.0       5.0  ...           0           0        0  7226.301740      8567.549110  12999.542264       2    12      NaN       9.1250
3       2         2        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  7226.301740      8567.549110  12999.542264       2    12      NaN       7.0000
4       2         3        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  7226.301740      8567.549110  12999.542264       2    12      NaN       6.1875

5 rows × 244 columns

Print the column names.


In [3]:
preg.columns


Out[3]:
Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=244)

Select a single column name.


In [4]:
preg.columns[1]


Out[4]:
'pregordr'

Select a column and check what type it is.


In [5]:
pregordr = preg['pregordr']
type(pregordr)


Out[5]:
pandas.core.series.Series

Print a column.


In [6]:
pregordr


Out[6]:
0        1
1        2
2        1
3        2
4        3
5        1
6        2
7        3
8        1
9        2
10       1
11       1
12       2
13       3
14       1
15       2
16       3
17       1
18       2
19       1
20       2
21       1
22       2
23       1
24       2
25       3
26       1
27       1
28       2
29       3
        ..
13563    2
13564    3
13565    1
13566    1
13567    1
13568    2
13569    1
13570    2
13571    3
13572    4
13573    1
13574    2
13575    1
13576    1
13577    2
13578    1
13579    2
13580    1
13581    2
13582    3
13583    1
13584    2
13585    1
13586    2
13587    3
13588    1
13589    2
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64

Select a single element from a column.


In [7]:
pregordr[0]


Out[7]:
1

Select a slice from a column.


In [8]:
pregordr[2:5]


Out[8]:
2    1
3    2
4    3
Name: pregordr, dtype: int64

Select a column using dot notation.


In [9]:
pregordr = preg.pregordr

Count the number of times each value occurs.


In [10]:
preg.outcome.value_counts().sort_index()


Out[10]:
1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64
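
As a small illustration of why sort_index is chained after value_counts, here is a sketch on a made-up Series (the name toy and its values are not from the NSFG data). By default value_counts orders its result by frequency, most common first; sort_index reorders the result by the values themselves, which makes it easier to compare with a codebook table.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.Series([3, 1, 1, 2, 1, 3])

# Counts ordered by frequency, most common value first.
counts = toy.value_counts()

# The same counts, reordered by the values (the index of the result).
by_value = counts.sort_index()
print(by_value)
```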

Check the values of another variable.


In [11]:
preg.birthwgt_lb.value_counts().sort_index()


Out[11]:
0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64

Make a dictionary that maps from each respondent's caseid to a list of indices into the pregnancy DataFrame. Use it to select the pregnancy outcomes for a single respondent.


In [12]:
caseid = 10229
preg_map = nsfg.MakePregMap(preg)
indices = preg_map[caseid]
preg.outcome[indices].values


Out[12]:
array([4, 4, 4, 4, 4, 4, 1])
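
To make the idea of such a map concrete, here is a minimal sketch in plain Python. It is not necessarily how nsfg.MakePregMap is implemented; the function name make_index_map and the toy case IDs are assumptions made up for this illustration. It records row positions, which match the row labels only when the DataFrame has the default 0, 1, 2, ... index shown in Out[2].

```python
from collections import defaultdict

def make_index_map(caseids):
    """Map each case ID to the list of positions where it occurs.

    caseids: any iterable of case IDs, one per pregnancy record
    (hypothetical helper, for illustration only).
    """
    index_map = defaultdict(list)
    for position, caseid in enumerate(caseids):
        index_map[caseid].append(position)
    return index_map

# Toy case IDs: respondent 1 has two records, respondent 2 has three.
toy_caseids = [1, 1, 2, 2, 2, 7]
toy_map = make_index_map(toy_caseids)
print(toy_map[2])
```

With the real data, something like make_index_map(preg.caseid) would play the same role, under the same assumption about the index.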

Exercises

Select the birthord column, print the value counts, and compare to results published in the codebook.


In [13]:
# Solution goes here

We can also use isnull to count the number of NaNs.


In [14]:
preg.birthord.isnull().sum()


Out[14]:
4445
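
The reason a separate count is useful is that value_counts leaves out missing values by default. The sketch below shows this on a made-up Series (toy and its values are invented, not NSFG data): the counted values plus the NaNs add up to the length of the Series.

```python
import numpy as np
import pandas as pd

# Toy data with two missing values, for illustration only.
toy = pd.Series([1.0, 2.0, np.nan, 1.0, np.nan])

# value_counts skips NaN by default, so this totals the non-missing values.
counted = toy.value_counts().sum()

# isnull marks each NaN as True; summing the booleans counts them.
missing = toy.isnull().sum()

print(counted, missing, len(toy))
```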

Select the prglngth column, print the value counts, and compare to results published in the codebook.


In [15]:
# Solution goes here

To compute the mean of a column, call the mean method on the Series. For example, here is the mean birth weight in pounds:


In [16]:
preg.totalwgt_lb.mean()


Out[16]:
7.265628457623368

Create a new column named totalwgt_kg that contains birth weight in kilograms. Compute its mean. Remember that when you create a new column, you have to use dictionary syntax, not dot notation.


In [17]:
# Solution goes here
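
The difference between the two notations can be seen on a toy DataFrame. This is a sketch with invented column names (length_in, length_cm, length_mm), not a solution to the exercise. Bracket assignment adds a column, while dot assignment to a new name only attaches a Python attribute to the object; pandas issues a warning in that case, which is silenced here.

```python
import warnings
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({'length_in': [10.0, 20.0]})

# Dictionary-style (bracket) assignment creates a real column.
toy['length_cm'] = toy['length_in'] * 2.54

# Dot assignment to a new name sets an attribute, not a column.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    toy.length_mm = toy['length_in'] * 25.4

print(list(toy.columns))
print(toy['length_cm'].mean())
```

Dot notation is still fine for reading a column that already exists, as in preg.pregordr above.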

nsfg.py also provides ReadFemResp, which reads the female respondents file and returns a DataFrame:


In [18]:
resp = nsfg.ReadFemResp()

DataFrame provides a method head that displays the first five rows:


In [19]:
resp.head()


Out[19]:
   caseid  rscrinf  rdormres  rostscrn  rscreenhisp  rscreenrace  age_a  age_r  cmbirth  agescrn  ...  pubassis_i      basewgt  adj_mod_basewgt     finalwgt  secu_r  sest  cmintvw  cmlstyr  screentime   intvlngth
0    2298        1         5         5            1          5.0     27     27      902       27  ...           0  3247.916977      5123.759559  5556.717241       2    18     1234     1222    18:26:36  110.492667
1    5012        1         5         1            5          5.0     42     42      718       42  ...           0  2335.279149      2846.799490  4744.191350       2    18     1233     1221    16:30:59   64.294000
2   11586        1         5         1            5          5.0     43     43      708       43  ...           0  2335.279149      2846.799490  4744.191350       2    18     1234     1222    18:19:09   75.149167
3    6794        5         5         4            1          5.0     15     15     1042       15  ...           0  3783.152221      5071.464231  5923.977368       2    18     1234     1222    15:54:43   28.642833
4     616        1         5         4            1          5.0     20     20      991       20  ...           0  5341.329968      6437.335772  7229.128072       2    18     1233     1221    14:19:44   69.502667

5 rows × 3087 columns

Select the age_r column from resp and print the value counts. How old are the youngest and oldest respondents?


In [20]:
# Solution goes here

We can use the caseid to match up rows from resp and preg. For example, we can select the row from resp for caseid 2298 like this:


In [21]:
resp[resp.caseid==2298]


Out[21]:
   caseid  rscrinf  rdormres  rostscrn  rscreenhisp  rscreenrace  age_a  age_r  cmbirth  agescrn  ...  pubassis_i      basewgt  adj_mod_basewgt     finalwgt  secu_r  sest  cmintvw  cmlstyr  screentime   intvlngth
0    2298        1         5         5            1          5.0     27     27      902       27  ...           0  3247.916977      5123.759559  5556.717241       2    18     1234     1222    18:26:36  110.492667

1 rows × 3087 columns

And we can get the corresponding rows from preg like this:


In [22]:
preg[preg.caseid==2298]


Out[22]:
      caseid  pregordr  howpreg_n  howpreg_p  moscurrp  nowprgdk  pregend1  pregend2  nbrnaliv  multbrth  ...  laborfor_i  religion_i  metro_i      basewgt  adj_mod_basewgt     finalwgt  secu_p  sest  cmintvw  totalwgt_lb
2610    2298         1        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  3247.916977      5123.759559  5556.717241       2    18      NaN       6.8750
2611    2298         2        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  3247.916977      5123.759559  5556.717241       2    18      NaN       5.5000
2612    2298         3        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  3247.916977      5123.759559  5556.717241       2    18      NaN       4.1875
2613    2298         4        NaN        NaN       NaN       NaN       6.0       NaN       1.0       NaN  ...           0           0        0  3247.916977      5123.759559  5556.717241       2    18      NaN       6.8750

4 rows × 244 columns
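
Both selections above rely on the same pattern: comparing a column to a value yields a boolean Series, and indexing the DataFrame with that Series keeps only the rows where it is True. Here is a sketch on two toy tables; the data are invented, and only the column names are borrowed from the NSFG files.

```python
import pandas as pd

# Toy stand-ins for resp (one row per respondent) and
# preg (one row per pregnancy); values are made up.
toy_resp = pd.DataFrame({'caseid': [10, 11], 'age_r': [30, 25]})
toy_preg = pd.DataFrame({'caseid': [10, 10, 11], 'pregordr': [1, 2, 1]})

# The comparison produces one True/False per row.
mask = toy_preg.caseid == 10

# Indexing with the mask keeps the matching rows.
rows = toy_preg[mask]
print(rows)

# The same pattern on the respondent table selects a single row.
age = toy_resp[toy_resp.caseid == 10].age_r.values[0]
print(age)
```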

How old is the respondent with caseid 1?


In [23]:
# Solution goes here

What are the pregnancy lengths for the respondent with caseid 2298?


In [24]:
# Solution goes here

What was the birth weight of the first baby born to the respondent with caseid 5012?


In [25]:
# Solution goes here
