Exercise from Think Stats, 2nd Edition (thinkstats2.com)
Allen Downey

Read the female respondent file and display the variable names.


In [135]:
%matplotlib inline

import chap01soln
resp = chap01soln.ReadFemResp()
resp


Out[135]:
caseid rscrinf rdormres rostscrn rscreenhisp rscreenrace age_a age_r cmbirth agescrn ... pubassis_i basewgt adj_mod_basewgt finalwgt secu_r sest cmintvw cmlstyr screentime intvlngth
0 2298 1 5 5 1 5 27 27 902 27 ... 0 3247.916977 5123.759559 5556.717241 2 18 1234 1222 18:26:36 110.492667
1 5012 1 5 1 5 5 42 42 718 42 ... 0 2335.279149 2846.799490 4744.191350 2 18 1233 1221 16:30:59 64.294000
2 11586 1 5 1 5 5 43 43 708 43 ... 0 2335.279149 2846.799490 4744.191350 2 18 1234 1222 18:19:09 75.149167
3 6794 5 5 4 1 5 15 15 1042 15 ... 0 3783.152221 5071.464231 5923.977368 2 18 1234 1222 15:54:43 28.642833
4 616 1 5 4 1 5 20 20 991 20 ... 0 5341.329968 6437.335772 7229.128072 2 18 1233 1221 14:19:44 69.502667
5 845 1 5 4 1 5 42 42 727 42 ... 0 2335.279149 3725.796795 4705.681352 2 18 1234 1222 17:10:13 95.488000
6 10333 5 5 3 1 5 17 17 1029 17 ... 0 2335.279149 2687.399758 3139.151658 2 18 1236 1224 14:14:38 61.204333
7 855 5 5 4 5 5 22 22 965 22 ... 0 4670.558298 7122.614751 10019.382170 2 18 1235 1223 14:42:52 59.756333
8 8656 5 5 4 1 5 38 38 780 38 ... 0 5198.652195 6027.568848 6520.021223 2 18 1237 1225 15:32:34 56.978833
9 3566 5 5 4 5 5 21 21 974 21 ... 0 2764.142038 3240.986558 4559.095792 2 18 1231 1219 16:22:25 104.744667
10 5917 1 5 3 1 5 44 43 714 43 ... 0 2418.624283 2762.143030 3488.586646 2 18 1233 1221 15:38:06 96.850167
11 9200 5 5 3 1 5 26 26 923 26 ... 0 2418.624283 2754.293339 2987.031126 2 18 1237 1225 14:12:31 61.060667
12 6320 5 5 5 5 1 23 23 952 23 ... 0 5497.225851 6448.332868 7241.477811 1 18 1236 1224 14:27:20 69.906500
13 11700 1 5 4 1 5 34 34 822 34 ... 0 3362.448309 3677.062170 4666.559600 1 18 1236 1224 11:35:31 77.493333
14 7354 1 5 4 1 5 28 28 896 28 ... 0 2417.628123 2790.899197 3026.730179 1 18 1235 1223 14:40:18 79.018500
15 3697 5 5 4 5 5 28 28 896 28 ... 0 9670.512492 18559.585881 31215.367494 1 18 1236 1224 11:59:26 45.963500
16 4881 1 5 5 1 5 23 23 948 23 ... 0 3292.089359 3935.302679 4419.344908 1 18 1234 1222 20:37:54 110.416833
17 5862 1 5 4 1 5 33 33 831 33 ... 0 3056.771190 3456.489520 4386.630850 1 18 1234 1222 16:42:13 107.819667
18 8542 5 5 5 5 5 16 16 1036 16 ... 0 5900.163872 6697.056170 7822.831313 1 18 1237 1225 13:04:36 72.481500
19 2054 1 5 4 5 1 24 24 939 24 ... 0 2417.628123 2570.384230 2886.541491 1 18 1236 1224 13:58:43 87.417500
20 3719 5 5 2 5 5 22 22 972 22 ... 1 4168.324350 4879.253694 5479.401898 1 18 1238 1226 12:50:26 53.782000
21 11740 1 5 5 1 5 32 32 843 32 ... 0 2417.628123 2790.899197 3541.930171 1 18 1234 1222 14:23:08 92.950500
22 11343 1 5 5 1 5 41 41 741 41 ... 0 2417.628123 4394.216361 5549.895265 1 18 1237 1225 20:29:06 70.791667
23 7075 1 5 4 1 5 37 37 782 37 ... 0 4308.038516 5562.306199 6016.746616 2 18 1232 1220 16:30:34 80.659000
24 5422 1 5 2 1 5 38 38 773 38 ... 0 3363.768618 3948.605534 4271.206606 1 18 1234 1222 17:07:52 45.171500
25 2178 5 5 3 1 5 29 29 877 29 ... 0 3363.768618 5306.521605 5754.922681 1 18 1234 1222 18:28:12 77.489833
26 8358 5 5 4 5 5 21 21 977 21 ... 0 2418.624283 2688.131135 3018.771264 2 18 1237 1225 16:10:17 28.934833
27 5083 1 5 4 1 5 37 37 789 37 ... 0 2418.624283 3119.659072 3374.535218 2 18 1236 1224 15:14:10 58.307000
28 1545 5 5 4 1 5 39 39 762 39 ... 0 8034.280664 8972.807739 9705.886131 2 18 1237 1225 14:09:31 53.124833
29 5656 5 5 3 5 5 26 26 921 26 ... 0 4170.041867 6582.846660 7139.097203 2 18 1238 1226 11:20:54 61.501000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7613 4640 5 5 6 5 5 18 18 1016 18 ... 0 2273.045919 5800.771072 10202.976239 1 76 1234 1222 12:01:09 84.903833
7614 3998 5 5 2 5 5 24 24 935 24 ... 0 3544.869231 4031.967457 5671.768622 1 76 1228 1216 17:11:53 106.922500
7615 2432 5 5 5 5 5 15 15 1048 15 ... 0 7576.875171 8452.387480 14866.904339 2 76 1234 1222 16:41:31 40.803333
7616 5438 1 5 3 5 5 30 30 862 30 ... 0 2273.062551 2559.633775 4440.964600 2 76 1229 1217 12:50:02 86.785000
7617 9643 1 5 2 5 5 24 24 929 24 ... 0 2597.785773 2969.368501 4177.010670 2 76 1229 1217 12:26:15 89.036000
7618 3030 1 5 1 5 5 34 34 811 34 ... 0 2273.062551 2525.629369 4381.966956 2 76 1229 1217 14:23:39 80.175667
7619 11585 1 5 4 5 5 34 34 814 34 ... 0 1720.605492 2260.899193 3922.660100 1 76 1228 1216 15:59:28 113.677500
7620 6677 1 5 4 1 5 26 26 909 26 ... 0 3613.271533 4035.558807 4376.563526 1 76 1227 1215 16:35:57 110.918333
7621 10744 5 5 8 5 1 22 22 968 22 ... 0 3082.751506 3278.300013 4611.584628 1 76 1234 1222 14:08:25 95.713167
7622 8403 5 5 4 5 5 19 19 995 19 ... 0 2408.847688 2563.254954 4508.509139 1 76 1228 1216 17:36:32 62.217000
7623 7574 5 5 4 5 5 19 19 997 19 ... 0 2272.098197 2482.173894 4365.895661 2 76 1228 1216 16:09:34 96.809833
7624 8000 1 5 5 5 5 37 37 780 37 ... 0 4544.196393 4924.699453 8987.084031 2 76 1227 1215 16:40:18 140.088333
7625 6288 5 1 3 5 2 20 20 986 20 ... 0 2272.098197 2585.825443 3637.480651 2 76 1227 1215 17:40:15 93.622333
7626 5856 5 5 6 5 3 23 23 958 23 ... 0 12456.163382 31787.898176 44716.036364 1 76 1238 1226 16:04:09 69.623500
7627 8794 1 5 3 5 5 23 23 945 23 ... 0 6818.699775 7950.887074 11184.512847 1 76 1228 1216 18:23:51 77.508000
7628 6365 5 5 3 5 2 17 17 1028 17 ... 0 2272.899925 3456.977988 6080.478584 1 76 1235 1223 17:15:36 46.016667
7629 3537 1 5 3 5 2 36 36 797 36 ... 0 3918.792974 5258.971089 9597.096340 1 76 1239 1227 15:19:09 85.378000
7630 1515 1 5 1 5 5 44 44 688 44 ... 0 2272.899925 2680.745920 4467.463075 1 76 1228 1216 16:23:34 147.769667
7631 9174 1 5 3 5 3 32 32 834 32 ... 0 2272.899925 2608.931475 4526.496109 1 76 1228 1216 17:30:08 171.005000
7632 4213 1 5 3 5 2 40 40 748 40 ... 0 2272.948899 2659.474205 4432.013762 2 76 1229 1217 09:47:43 99.248667
7633 6804 5 5 3 5 2 35 35 805 35 ... 0 3247.069856 3814.222882 6960.575338 2 76 1229 1217 14:20:39 103.929167
7634 1282 1 5 4 5 3 35 35 798 35 ... 0 2068.394620 2222.154405 4055.209574 1 76 1228 1216 16:31:47 105.509000
7635 2954 1 5 4 5 3 30 30 862 30 ... 0 2068.394620 2356.019463 4087.693768 1 76 1228 1216 19:40:51 85.290167
7636 4964 1 5 3 5 3 41 41 727 41 ... 0 2068.394620 2222.154405 3703.220316 1 76 1227 1215 12:38:19 306.238000
7637 143 1 5 1 5 2 35 35 808 35 ... 0 2273.211779 2463.724427 4496.050707 2 76 1230 1218 17:45:36 83.798833
7638 11018 1 5 2 5 3 34 34 811 34 ... 0 3247.445399 3784.333145 6565.818007 2 76 1228 1216 15:57:38 82.907333
7639 6075 5 5 3 5 3 17 17 1014 17 ... 0 2273.211779 2497.234491 4392.385746 2 76 1228 1216 18:23:53 54.044833
7640 5649 1 5 2 5 5 29 29 873 29 ... 0 3247.445399 3569.313710 6003.228729 2 76 1228 1216 18:42:41 68.168000
7641 501 5 5 3 5 2 16 16 1034 16 ... 0 5304.160818 5954.644352 10473.623950 2 76 1228 1216 16:02:45 32.717333
7642 10252 1 5 2 5 2 28 28 889 28 ... 0 3247.445399 3476.637428 5847.356491 2 76 1230 1218 12:45:19 74.061500

7643 rows × 3087 columns
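
The cell above displays the whole table rather than only the variable names. A minimal sketch of listing just the names, using a small stand-in DataFrame (`toy` is a hypothetical name; its three columns and values are copied from the first rows printed in this notebook):

```python
import pandas as pd

# Small stand-in for resp: three columns, first three rows as printed above.
toy = pd.DataFrame({
    'caseid': [2298, 5012, 11586],
    'age_r': [27, 42, 43],
    'totincr': [9, 10, 5],
})

# DataFrame.columns holds the variable names; list() gives a plain list.
names = list(toy.columns)
print(names)
```

On the real data, `resp.columns` would hold all 3087 names.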

Make a histogram of totincr, the total income for the respondent's family. To interpret the codes, see the codebook.


In [13]:
import thinkstats2
hist = thinkstats2.Hist(resp.totincr)
print(resp.totincr)


0        9
1       10
2        5
3       13
4        6
5        7
6        6
7       12
8        6
9       12
10       9
11      13
12       7
13       8
14       6
15       9
16       5
17       2
18       9
19       4
20      12
21      12
22       8
23      13
24      10
25       6
26      11
27       2
28      11
29       9
        ..
7613    13
7614     7
7615    14
7616     4
7617    12
7618     8
7619    14
7620     8
7621    11
7622    14
7623    14
7624    12
7625    14
7626     2
7627     3
7628    14
7629     4
7630     5
7631     1
7632     6
7633     9
7634     5
7635     8
7636     3
7637     8
7638    14
7639    14
7640    10
7641    13
7642    11
Name: totincr, dtype: int64
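
A Hist maps each distinct value to the number of times it appears. The same idea can be sketched with the standard library's `collections.Counter`. The input here is only the first ten totincr codes from the output above, not the full column, and the variable names are illustrative:

```python
from collections import Counter

# First ten totincr codes from the printed output above.
codes = [9, 10, 5, 13, 6, 7, 6, 12, 6, 12]

# Counter maps each value to its frequency, which is what a histogram records.
counts = Counter(codes)

# Print value/frequency pairs in order of value.
for value in sorted(counts):
    print(value, counts[value])
```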

Display the histogram.


In [3]:
import thinkplot
thinkplot.Hist(hist, label='totincr')
thinkplot.Show()


Make a histogram of age_r, the respondent's age at the time of interview.


In [224]:
import matplotlib.pyplot as plt
age_r = resp.age_r.value_counts().sort_index()

In [225]:
age_r.plot(kind='bar', label = 'age_r', legend = True)


Out[225]:
<matplotlib.axes._subplots.AxesSubplot at 0x15b9bef10>

In [226]:
resp.age_r.plot(kind='hist', label='age_r', range=(0, 44), bins=45, ylim=(0, 300), legend=True)


Out[226]:
<matplotlib.axes._subplots.AxesSubplot at 0x15bca26d0>

Make a histogram of numfmhh, the number of people in the respondent's household.


In [89]:
df_numfmhh = resp.numfmhh
df = df_numfmhh.value_counts().sort_index()
df


Out[89]:
0     942
1    1716
2    1826
3    1740
4     906
5     313
6     118
7      78
8       4
dtype: int64

In [93]:
df.plot(kind='bar', label='numfmhh', legend = True)


Out[93]:
<matplotlib.axes._subplots.AxesSubplot at 0x13cc13350>

In [106]:
df_numfmhh.plot(kind='hist', label='numfmhh', legend = True)


Out[106]:
<matplotlib.axes._subplots.AxesSubplot at 0x13dcc4f10>
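
The bar chart and the `kind='hist'` plot can look different for the same data. `kind='hist'` uses 10 equal-width bins by default, and numfmhh takes only the nine integer values 0 to 8, so the bin edges do not line up with the integers. A sketch of the effect using `numpy.histogram` directly, with the values rebuilt from the counts in Out[89] (the variable names are illustrative):

```python
import numpy as np

# Rebuild the numfmhh values from the counts shown in Out[89].
counts = {0: 942, 1: 1716, 2: 1826, 3: 1740, 4: 906,
          5: 313, 6: 118, 7: 78, 8: 4}
values = np.repeat(list(counts.keys()), list(counts.values()))

# Default-style binning: 10 equal-width bins between the min and the max.
default_hist, default_edges = np.histogram(values, bins=10)

# One bin per integer: edges at -0.5, 0.5, ..., 8.5.
edges = np.arange(-0.5, 9.5, 1.0)
aligned_hist, _ = np.histogram(values, bins=edges)

print(default_hist)   # one of the ten bins stays empty
print(aligned_hist)   # matches the value counts exactly
```

With edges at the half-integers, every integer falls in the middle of its own bin and the bin counts agree with `value_counts()`.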

Make a histogram of parity, the number of children the respondent has borne. How would you describe this distribution?


In [158]:
import pandas as pd
parity_vc = resp.parity.value_counts().sort_index()
df_parity_vc = parity_vc.to_frame(name='frequency')

In [142]:
parity_vc.plot(kind='bar',label='parity',legend=True)


Out[142]:
<matplotlib.axes._subplots.AxesSubplot at 0x1422c4f10>

In [143]:
resp.parity.plot(kind='hist', label = 'parity', legend = True, range=(-5,25), ylim= (0,3500), bins = 30)


Out[143]:
<matplotlib.axes._subplots.AxesSubplot at 0x13fa56bd0>

Use Hist.Largest to find the largest values of parity.


In [121]:
resp.parity.describe()


Out[121]:
count    7643.000000
mean        1.223211
std         1.389722
min         0.000000
25%         0.000000
50%         1.000000
75%         2.000000
max        22.000000
Name: parity, dtype: float64
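
`describe()` reports only the single maximum (22). To see several of the largest values together with their frequencies, which is what the exercise asks for, one option is to sort the value counts by value in descending order. A sketch on made-up parities (not the survey data):

```python
import pandas as pd

# Made-up parity values for illustration; not the NSFG data.
parity = pd.Series([0, 1, 1, 2, 0, 3, 22, 16, 2, 10, 0, 1])

# Count each value, order by value descending, keep the three largest.
largest = parity.value_counts().sort_index(ascending=False).head(3)

for value, freq in largest.items():
    print(value, freq)
```

On the real data the equivalent call would be `resp.parity.value_counts().sort_index(ascending=False).head(10)`.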

Use totincr to select the respondents with the highest income. Compute the distribution of parity for just the high income respondents.


In [184]:
# Compute the highest totincr code once and reuse it for both selections.
max_income = resp.totincr.max()
resp_highest_income = resp[resp.totincr == max_income]
resp_other_income = resp[resp.totincr != max_income]
# Parity counts for the highest-income group, ordered by parity.
resp_highest_income_parity = resp_highest_income.parity
resp_highest_income_parity_vc = resp_highest_income_parity.value_counts().sort_index()
resp_highest_income_parity_vc.plot(kind='bar', title='parity of those respondents with highest income')


Out[184]:
<matplotlib.axes._subplots.AxesSubplot at 0x1404a5c90>
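
Raw counts are hard to compare between groups of different sizes, so it can help to express the distribution as proportions. A sketch on a made-up table, where the largest `totincr` value stands in for the top income code (`toy`, `high` and `dist` are illustrative names):

```python
import pandas as pd

# Made-up rows for illustration; not the NSFG data.
toy = pd.DataFrame({
    'totincr': [14, 14, 14, 14, 9, 5, 12, 14],
    'parity':  [0, 0, 1, 2, 3, 1, 0, 0],
})

# Boolean mask selecting the rows with the highest income code.
high = toy[toy.totincr == toy.totincr.max()]

# normalize=True returns proportions that sum to 1 instead of raw counts.
dist = high.parity.value_counts(normalize=True).sort_index()
print(dist)
```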

Find the largest parities for high income respondents.


In [181]:
resp_highest_income_parity.max()


Out[181]:
8

Compare the mean parity for high income respondents and others.


In [185]:
print(resp_highest_income_parity.mean())
print(resp_other_income.parity.mean())


1.07586206897
1.24957581367
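
The two means can also be computed in one step by labelling each row and grouping on the label. A sketch on made-up rows (the labels 'high' and 'other' and all variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not the NSFG data.
toy = pd.DataFrame({
    'totincr': [14, 14, 14, 9, 5, 12],
    'parity':  [0, 1, 2, 3, 1, 2],
})

# Label rows with the highest income code 'high' and all others 'other'.
group = np.where(toy.totincr == toy.totincr.max(), 'high', 'other')

# Mean parity within each group.
means = toy.parity.groupby(group).mean()
print(means)
```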

Investigate any other variables that look interesting.

Histogram of the age of respondents who had sex


In [208]:
resp_hadsex = resp[resp.hadsex == 1]
resp_hadsex_age_r = resp_hadsex.age_r
ax1 = resp_hadsex_age_r.plot(kind = 'hist')


Histogram of the age of respondents who never had sex


In [209]:
resp_nosex = resp[ resp.hadsex != 1]
resp_nosex_age_r = resp_nosex.age_r
resp_nosex_age_r.plot(kind = 'hist')


Out[209]:
<matplotlib.axes._subplots.AxesSubplot at 0x157c7dbd0>

In [312]:
# to_frame names the count column explicitly, whatever the Series was called.
df1 = resp_hadsex.age_r.value_counts().sort_index().to_frame(name='hadsex')
df2 = resp_nosex.age_r.value_counts().sort_index().to_frame(name='nosex')
df1['nosex'] = df2.nosex
df1


Out[312]:
hadsex nosex
15 36 181
16 62 161
17 105 129
18 158 77
19 192 49
20 205 53
21 224 43
22 262 25
23 267 15
24 253 16
25 250 17
26 248 12
27 249 6
28 247 5
29 254 8
30 286 6
31 272 6
32 272 1
33 253 4
34 252 3
35 259 3
36 263 3
37 266 5
38 252 4
39 211 4
40 252 4
41 246 4
42 212 3
43 246 7
44 231 4
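
A table like the one above can also be built in a single call with `pd.crosstab`, which counts rows for each combination of two variables and fills absent combinations with 0. A sketch on made-up rows, treating any `hadsex` value other than 1 as 'nosex', as the cells above do:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not the NSFG data.
toy = pd.DataFrame({
    'age_r':  [15, 15, 15, 16, 16, 17, 17, 17],
    'hadsex': [1, 2, 2, 1, 2, 1, 1, 1],
})

# Label each row, keeping the original index so the rows stay aligned.
label = pd.Series(np.where(toy.hadsex == 1, 'hadsex', 'nosex'),
                  index=toy.index, name='group')

# Count rows per (age, label) pair; missing pairs become 0.
table = pd.crosstab(toy.age_r, label)
print(table)
```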

In [313]:
df1.plot(kind = 'bar', alpha = 0.5, xlim = (0,15))


Out[313]:
<matplotlib.axes._subplots.AxesSubplot at 0x180590ad0>
