Exercise from Think Stats, 2nd Edition (thinkstats2.com)
Allen Downey

Read the pregnancy file.


In [1]:
%matplotlib inline

import nsfg
preg = nsfg.ReadFemPreg()


nsfg.py:42: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df.birthwgt_lb[df.birthwgt_lb > 20] = np.nan

Select live births, then make a CDF of totalwgt_lb.


In [4]:
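import thinkstats2 as ts

# Keep only pregnancies that ended in a live birth (outcome code 1).
live = preg[preg.outcome == 1]

# Build the CDF of total birth weight in pounds.
wgt_cdf = ts.Cdf(live.totalwgt_lb, label='weight')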

Display the CDF.


In [6]:
import thinkplot as tp

tp.Cdf(wgt_cdf, label = 'weight')
tp.Show()


[Figure: CDF of totalwgt_lb for live births, labeled 'weight']

Find out how much you weighed at birth, if you can, and compute CDF(x).


In [44]:



Out[44]:
0.81422881168400085
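
The cell above is left blank in the exercise. As a minimal stand-alone sketch of what evaluating CDF(x) means, the following computes the fraction of values at or below x using only the standard library. The function name eval_cdf and the weights list are hypothetical illustrations, not the thinkstats2 API and not NSFG data.

```python
from bisect import bisect_right


def eval_cdf(sample, x):
    """Return the fraction of values in `sample` that are <= x."""
    ordered = sorted(sample)
    # bisect_right counts how many sorted values are <= x.
    return bisect_right(ordered, x) / len(ordered)


# Hypothetical birth weights in pounds (not NSFG data).
weights = [5.5, 6.25, 6.5, 7.0, 7.375, 7.5, 8.125, 8.5, 9.0, 10.25]

print(eval_cdf(weights, 8.5))  # 8 of the 10 values are <= 8.5
```

A result of 0.8 would mean that 80% of the babies in this toy sample weighed 8.5 pounds or less.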

If you are a first child, look up your birthweight in the CDF of first children; otherwise use the CDF of other children.


In [59]:



Out[59]:
0.79657754010695192
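
One way to read this step: split the births by birth order, then evaluate the CDF only within the matching group. The sketch below does that in plain Python. The records list, its (birth_order, weight) layout, and the helper names are assumptions made for illustration; the exercise itself works on the NSFG DataFrame.

```python
from bisect import bisect_right


def eval_cdf(sample, x):
    """Return the fraction of values in `sample` that are <= x."""
    ordered = sorted(sample)
    return bisect_right(ordered, x) / len(ordered)


# Hypothetical (birth_order, weight_lb) records, not NSFG data.
records = [(1, 6.5), (1, 7.0), (1, 7.5), (1, 8.5),
           (2, 6.0), (2, 7.25), (3, 8.0), (2, 9.0)]

first_weights = [w for order, w in records if order == 1]
other_weights = [w for order, w in records if order != 1]


def cdf_for_birth_order(birth_order, x):
    """Evaluate CDF(x) within the group that matches the birth order."""
    group = first_weights if birth_order == 1 else other_weights
    return eval_cdf(group, x)


print(cdf_for_birth_order(1, 7.5))  # compared against first children only
print(cdf_for_birth_order(2, 7.5))  # compared against all other children
```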

Compute the percentile rank of your birth weight.


In [46]:



Out[46]:
81.422881168400082
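
The percentile rank is the CDF value expressed as a percentage, which is why the two outputs above differ by a factor of 100. A minimal sketch, with a hypothetical function name and made-up data:

```python
def percentile_rank(sample, x):
    """Return the percentage of values in `sample` that are <= x."""
    count = sum(1 for value in sample if value <= x)
    return 100.0 * count / len(sample)


# Hypothetical birth weights in pounds (not NSFG data).
weights = [5.5, 6.25, 6.5, 7.0, 7.375, 7.5, 8.125, 8.5, 9.0, 10.25]

print(percentile_rank(weights, 8.5))
```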

Compute the median birth weight by looking up the value associated with p=0.5.


In [45]:



Out[45]:
7.375

Compute the interquartile range (IQR) by computing percentiles corresponding to 25 and 75.


In [47]:



Out[47]:
(6.5, 8.125)
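
The median and the IQR both come from inverting the CDF: given a percentile rank, find the smallest value whose rank reaches it. The sketch below shows one simple convention for that lookup (no interpolation). The function name and the data are hypothetical, and other libraries may use a different interpolation rule.

```python
import math


def percentile(sample, rank):
    """Return the smallest value whose percentile rank is at least `rank`."""
    ordered = sorted(sample)
    # Position of the first value v with 100 * CDF(v) >= rank.
    index = max(math.ceil(rank * len(ordered) / 100) - 1, 0)
    return ordered[index]


# Hypothetical birth weights in pounds (not NSFG data).
weights = [5.5, 6.25, 6.5, 7.0, 7.375, 7.5, 8.125, 8.5, 9.0, 10.25]

median = percentile(weights, 50)
iqr = (percentile(weights, 25), percentile(weights, 75))
print(median, iqr)
```

The IQR is reported here as a pair of values, matching the form of the output above; the width of the range is the difference between the two.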

Make a random selection from cdf.


In [48]:



Out[48]:
7.0

Draw a random sample from cdf.


In [49]:



Out[49]:
[6.25, 5.1875, 8.1875, 6.5, 7.9375, 6.6875, 5.75, 6.5625, 7.8125, 5.25]
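
Random selection from a CDF can be done by inverse transform sampling: pick a probability p uniformly in [0, 1) and return the value at that position in the CDF. A minimal sketch for an empirical CDF follows; the function names, the seed, and the data are assumptions for illustration only.

```python
import random


def random_value(sample, rng):
    """Pick p uniformly in [0, 1) and return the value at that quantile."""
    ordered = sorted(sample)
    p = rng.random()
    # Map p to a position in the sorted values; guard against rounding up.
    index = min(int(p * len(ordered)), len(ordered) - 1)
    return ordered[index]


def draw_sample(sample, n, rng):
    """Draw n values independently, with replacement."""
    return [random_value(sample, rng) for _ in range(n)]


# Hypothetical birth weights in pounds (not NSFG data).
weights = [5.5, 6.25, 6.5, 7.0, 7.375, 7.5, 8.125, 8.5, 9.0, 10.25]

rng = random.Random(0)  # fixed seed so the run is repeatable
print(random_value(weights, rng))
print(draw_sample(weights, 10, rng))
```

For an empirical CDF this is equivalent to choosing uniformly among the observed values, so common values are drawn more often.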

Draw a random sample from cdf, then compute the percentile rank for each value, and plot the distribution of the percentile ranks.
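
The point of this step is that the percentile ranks of values drawn from a distribution are themselves roughly uniform between 0 and 100. The sketch below checks that by counting ranks rather than plotting them. The data, seed, and sample size are hypothetical choices.

```python
import random
from bisect import bisect_right
from collections import Counter

# Hypothetical birth weights in pounds (not NSFG data).
weights = [5.5, 6.25, 6.5, 7.0, 7.375, 7.5, 8.125, 8.5, 9.0, 10.25]
ordered = sorted(weights)

rng = random.Random(17)  # fixed seed so the run is repeatable
sample = [rng.choice(ordered) for _ in range(10000)]

# Percentile rank of each drawn value within the original data.
ranks = [100.0 * bisect_right(ordered, x) / len(ordered) for x in sample]

# If the ranks are uniform, each rank should appear about equally often.
rank_counts = Counter(ranks)
for rank in sorted(rank_counts):
    print(rank, rank_counts[rank])
```

With ten distinct values, each rank should appear in roughly one tenth of the draws, which corresponds to a CDF of ranks that is close to a straight line.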


In [50]:



Generate 1000 random values using random.random() and plot their PMF.


In [7]:
import random
random.random?

In [14]:
import random

# 1000 draws from the uniform distribution on [0, 1).
thousand = [random.random() for _ in range(1000)]
thousand_pmf = ts.Pmf(thousand, label='rando')
tp.Pmf(thousand_pmf, linewidth=0.1)
tp.Show()


[Figure: PMF of the 1000 random values, labeled 'rando']

In [22]:
t_hist = ts.Hist(thousand)
tp.Hist(t_hist, label = "rando")
tp.Show()


[Figure: histogram of the 1000 random values, labeled 'rando']

The PMF does not work well here: almost every one of the 1000 values is unique, so each has probability 1/1000 and the plot shows no structure. Try plotting the CDF instead.


In [15]:
thousand_cdf = ts.Cdf(thousand, label='rando')
tp.Cdf(thousand_cdf)
tp.Show()


[Figure: CDF of the 1000 random values, labeled 'rando']
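
To see numerically why the CDF is the more useful view for continuous random values, the following sketch counts distinct values and evaluates the empirical CDF at a few points without any plotting. The seed and variable names are assumptions for illustration.

```python
import random
from bisect import bisect_right
from collections import Counter

rng = random.Random(42)  # fixed seed so the run is repeatable
values = [rng.random() for _ in range(1000)]

# PMF view: floats from random() almost never repeat, so every value
# ends up with the same tiny probability and the plot is uninformative.
pmf = {v: count / len(values) for v, count in Counter(values).items()}
print(len(pmf), max(pmf.values()))

# CDF view: for a uniform distribution on [0, 1), CDF(x) should be close to x.
ordered = sorted(values)
for x in (0.25, 0.5, 0.75):
    print(x, bisect_right(ordered, x) / len(ordered))
```

If the values are uniform, the printed CDF values should be close to 0.25, 0.5, and 0.75, which is the straight diagonal line seen in a CDF plot of uniform data.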

In [17]:
import scipy.stats
scipy.stats?

In [64]:



Out[64]:
0.5
