Exercise from Think Stats, 2nd Edition (thinkstats2.com)
Allen Downey

Read the female respondent file and display the variable names.



In [49]:

    
%matplotlib inline
from operator import itemgetter

import chap01soln

# Load the respondent file first so that resp exists before it is used.
resp = chap01soln.ReadFemResp()
resp.columns

    Out[49]:

Index([u'caseid', u'rscrinf', u'rdormres', u'rostscrn', u'rscreenhisp',
       u'rscreenrace', u'age_a', u'age_r', u'cmbirth', u'agescrn',
       ...
       u'pubassis_i', u'basewgt', u'adj_mod_basewgt', u'finalwgt', u'secu_r',
       u'sest', u'cmintvw', u'cmlstyr', u'screentime', u'intvlngth'],
      dtype='object', length=3087)

Make a histogram of totincr, the total income for the respondent's family. To interpret the codes, see the codebook.



In [43]:

    
import thinkstats2
hist = thinkstats2.Hist(resp.totincr)

Display the histogram.



In [3]:

    
import thinkplot
thinkplot.Hist(hist, label='totincr')
thinkplot.Show()

<matplotlib.figure.Figure at 0x10356d150>
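
For intuition, a histogram in this sense is a mapping from each value to the number of times it appears. The sketch below uses `collections.Counter` from the standard library as a stand-in for `thinkstats2.Hist`; the list `codes` is made up for illustration and is not NSFG data.

```python
from collections import Counter

# Hypothetical income codes, invented for illustration (not NSFG data).
codes = [14, 3, 14, 7, 3, 14, 1]

# Map each distinct code to the number of times it occurs.
freqs = Counter(codes)

# Print the value-frequency pairs in increasing order of value.
for value in sorted(freqs):
    print(value, freqs[value])
```

Plotting such a mapping as bars, one bar per value, gives the kind of figure produced by `thinkplot.Hist` above.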

Make a histogram of age_r, the respondent's age at the time of interview.



In [4]:

    
hist2 = thinkstats2.Hist(resp.age_r)
thinkplot.Hist(hist2, label='age_r')
thinkplot.Show()

<matplotlib.figure.Figure at 0x10f243c10>

Make a histogram of numfmhh, the number of people in the respondent's household.



In [5]:

    
hist3 = thinkstats2.Hist(resp.numfmhh)
thinkplot.Hist(hist3, label='numfmhh')
thinkplot.Show()

<matplotlib.figure.Figure at 0x113c1dc90>

Make a histogram of parity, the number of children the respondent has borne. How would you describe this distribution?



In [6]:

    
hist4 = thinkstats2.Hist(resp.parity)
thinkplot.Hist(hist4, label='parity')
thinkplot.Show()

<matplotlib.figure.Figure at 0x108543d50>

Use Hist.Largest to find the largest values of parity.



In [7]:

    
hist4.Largest()

    Out[7]:

[(22, 1),
 (16, 1),
 (10, 3),
 (9, 2),
 (8, 8),
 (7, 15),
 (6, 29),
 (5, 95),
 (4, 309),
 (3, 828)]
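
As the output above suggests, `Hist.Largest` returns the largest values paired with their frequencies, largest value first. A rough standard-library equivalent might look like the sketch below, assuming the histogram is a plain `Counter`; the helper name `largest` and the sample data are hypothetical.

```python
from collections import Counter


def largest(freqs, n=10):
    """Return the n largest values with their frequencies, largest value first."""
    # Sorting the (value, freq) pairs in reverse orders them by value, descending.
    return sorted(freqs.items(), reverse=True)[:n]


# Made-up parity values for illustration (not NSFG data).
parity = Counter([0, 0, 1, 2, 2, 2, 3, 8])
top = largest(parity, n=3)
print(top)
```

Looking at the extreme values this way is a quick check for outliers or data-entry errors.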

Use totincr to select the respondents with the highest income. Compute the distribution of parity for just the high-income respondents.



In [9]:

    
# totincr code 14 is taken here as the top income bracket; see the codebook.
hist5 = thinkstats2.Hist(resp.parity[resp.totincr == 14])
thinkplot.Hist(hist5, label='parity_hi')
thinkplot.Show()

<matplotlib.figure.Figure at 0x10b25c450>

Find the largest parities for high-income respondents.



In [10]:

    
hist5.Largest()

    Out[10]:

[(8, 1), (7, 1), (5, 5), (4, 19), (3, 123), (2, 267), (1, 229), (0, 515)]

Compare the mean parity for high-income respondents and others.



In [11]:

    
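# Parity for respondents in the top income category (totincr code 14).
hi_par = resp.parity[resp.totincr == 14]
# Note: par covers all respondents, including the high-income group.
par = resp.parity
hi_par.mean(), par.mean()

    Out[11]:
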
(1.0758620689655172, 1.2232107811068953)
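
The cell above compares the high-income group with the whole sample rather than with the remaining respondents. One way to compare the two disjoint groups is to split on the condition and its complement. The following is a minimal pure-Python sketch; the `rows` data and the `HIGH` constant are invented for illustration.

```python
from statistics import mean

# Hypothetical (totincr, parity) pairs, invented for illustration (not NSFG data).
rows = [(14, 0), (14, 2), (14, 1), (5, 3), (9, 1), (2, 2)]

HIGH = 14  # assumed code for the top income bracket

# Split parity values into the high-income group and everyone else.
high = [parity for totincr, parity in rows if totincr == HIGH]
rest = [parity for totincr, parity in rows if totincr != HIGH]

print(mean(high), mean(rest))
```

With the pandas Series used in this notebook, the analogous selection for the second group would be `resp.parity[resp.totincr != 14]`.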

Investigate any other variables that look interesting.



In [12]:

    
hi_par.std(), par.std()

    Out[12]:

(1.1761668844433986, 1.389721983997953)



In [34]:

    
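def Mode(h):
    """Returns the value with the highest frequency in h.

    h: Hist object
    """
    mode = None
    for val in h:
        # The first value seen becomes the initial candidate.
        if mode is None or h.Freq(val) > h.Freq(mode):
            mode = val
    return mode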



In [42]:

    
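def AllModes(h):
    """Returns value-freq pairs in decreasing order of frequency.

    h: Hist object
    """
    hist = h.Copy()  # work on a copy so the caller's Hist is unchanged
    result = []
    while len(hist) > 0:
        # Find the current mode, record it, then remove it and repeat.
        mode = Mode(hist)
        result.append((mode, hist.Freq(mode)))
        hist.Remove(mode)
    return result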



In [47]:

    
def AllModes2(hist):
    """Returns value-freq pairs in decreasing order of frequency.

    hist: Hist object

    returns: list of value-freq pairs
    """
    return sorted(hist.Items(), key=itemgetter(1), reverse=True)



In [51]:

    
%timeit AllModes(hist2)
%timeit AllModes2(hist2)

1000 loops, best of 3: 1.39 ms per loop
10000 loops, best of 3: 30.8 µs per loop
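
The gap in these timings is consistent with how the two functions work: AllModes rescans the whole histogram once for every value it removes, so its cost grows roughly with the square of the number of distinct values, while AllModes2 sorts the items once. The sketch below reproduces both strategies on a plain `Counter`; the function names and the sample ages are hypothetical.

```python
from collections import Counter
from operator import itemgetter


def all_modes_repeated_scan(freqs):
    """Repeatedly find and remove the most frequent value (quadratic work)."""
    remaining = dict(freqs)
    result = []
    while remaining:
        # Each pass scans every remaining value to find the current mode.
        mode = max(remaining, key=remaining.get)
        result.append((mode, remaining.pop(mode)))
    return result


def all_modes_sorted(freqs):
    """Sort the (value, freq) pairs once by frequency, descending."""
    return sorted(freqs.items(), key=itemgetter(1), reverse=True)


# Made-up ages with distinct frequencies, so the ordering is unambiguous.
ages = Counter([30, 30, 30, 22, 22, 41])
print(all_modes_repeated_scan(ages))
print(all_modes_sorted(ages))
```

When several values share the same frequency, the two strategies may order the tied values differently.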



In [52]:

    
%timeit?


