Exercise from Think Stats, 2nd Edition (thinkstats2.com)
Allen Downey
Read the pregnancy file.
In [3]:
%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
import nsfg
preg = nsfg.ReadFemPreg()
preg
Out[3]:
Select live births, then make a CDF of totalwgt_lb.
In [4]:
import pandas as pd
preg_live = preg[preg.outcome == 1]
preg_live_totalwgt_lb = pd.DataFrame(preg_live.totalwgt_lb.dropna())
# DataFrame.sort() was removed from pandas; sort_values() replaces it
preg_live_totalwgt_lb = preg_live_totalwgt_lb.sort_values('totalwgt_lb').reset_index(drop=True)
# Map each weight to its rank scaled to [0, 1]; for tied weights the highest rank wins
cdf_dict = {}
for index, row in preg_live_totalwgt_lb.iterrows():
    cdf_dict[row['totalwgt_lb']] = index / float(len(preg_live_totalwgt_lb) - 1)
cdf = pd.DataFrame(list(cdf_dict.keys()), columns=['values'])
cdf['p_rank'] = list(cdf_dict.values())
cdf = cdf.sort_values('p_rank').reset_index(drop=True)
cdf
Out[4]:
Display the CDF.
In [5]:
cdf.plot(x = 'values', y = 'p_rank', legend = False)
Out[5]:
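For comparison, the CDF can also be evaluated straight from its definition: CDF(x) is the fraction of values less than or equal to x. The sketch below uses a small made-up list of weights rather than the NSFG data, and `eval_cdf` and `toy_weights` are illustrative names, not part of `nsfg`. This definition divides by n, while the cell above divides ranks by n - 1, so the two can differ slightly.

```python
def eval_cdf(sample, x):
    """Return the fraction of values in sample that are <= x."""
    count = sum(1 for value in sample if value <= x)
    return count / float(len(sample))

# Made-up birth weights in pounds, for illustration only
toy_weights = [5.5, 6.0, 7.5, 7.5, 9.0]

cdf_at_7_5 = eval_cdf(toy_weights, 7.5)  # 4 of the 5 values are <= 7.5
print(cdf_at_7_5)
```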
In [6]:
# lambda function to create the CDF of any Series
create_cdf_pandas = lambda data: data.value_counts().sort_index().cumsum()*1./len(data)
create_cdf_pandas(preg_live.totalwgt_lb.dropna()).plot(legend=True, label = 'cdf')
ser = create_cdf_pandas(preg_live.totalwgt_lb.dropna())
Find out how much you weighed at birth, if you can, and compute CDF(x).
In [7]:
cdf_value = lambda cdf, num : cdf[num] if num in cdf else cdf[min(cdf.keys(), key=lambda k: abs(k-num))]
cdf_value(ser.to_dict(), 7.5)
Out[7]:
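The nearest-key lookup above can land on an observed weight that is larger than x. Since an empirical CDF is a step function, a common convention is to use the largest observed value that is less than or equal to x. The following is a minimal sketch of that convention on made-up data; `cdf_at` and the `toy` values are hypothetical names introduced here for illustration.

```python
import pandas as pd

def cdf_at(cdf_series, x):
    """Evaluate a step-function CDF at x.

    cdf_series is indexed by value in ascending order and holds
    cumulative probabilities, like the output of create_cdf_pandas.
    """
    # Number of index values that are <= x
    position = cdf_series.index.searchsorted(x, side='right')
    if position == 0:
        return 0.0  # x is below every observed value
    return float(cdf_series.iloc[position - 1])

# Made-up weights in pounds, for illustration only
toy = pd.Series([5.5, 6.0, 7.5, 7.5, 9.0])
toy_cdf = toy.value_counts().sort_index().cumsum() / float(len(toy))

between = cdf_at(toy_cdf, 8.9)  # uses the entry for 7.5, not the nearer 9.0
print(between)
```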
If you are a first child, look up your birthweight in the CDF of first children; otherwise use the CDF of other children.
Compute the percentile rank of your birthweight.
In [192]:
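# First babies are identified by birthord == 1; pregordr == 1 selects first pregnancies,
# which misses first births that followed an earlier pregnancy without a live birth
#first_live = preg[(preg.birthord == 1) & (preg.outcome == 1)]
#first_live
df = preg.query('birthord == 1 and outcome == 1')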
ser = create_cdf_pandas(df.totalwgt_lb.dropna())
cdf_value(ser.to_dict(), 7.5)
Out[192]:
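The value returned above is a cumulative probability between 0 and 1. The percentile rank asked for in the exercise is the same quantity expressed on a 0 to 100 scale. A minimal sketch on made-up weights follows; `percentile_rank` here is an illustrative helper, not a function from `nsfg`.

```python
def percentile_rank(sample, x):
    """Return the percentage of values in sample that are <= x."""
    count = sum(1 for value in sample if value <= x)
    return 100.0 * count / len(sample)

# Made-up birth weights in pounds, for illustration only
toy_weights = [5.5, 6.0, 7.5, 7.5, 9.0]

rank = percentile_rank(toy_weights, 6.0)  # 2 of the 5 values are <= 6.0
print(rank)
```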
Compute the median birth weight by looking up the value associated with p=0.5.
In [8]:
#find_median_value = lambda cdf_dict: [key for key, value in cdf_dict.items() if value == 0.5]
#find_median_value(ser.to_dict())
cdf_dict = ser.to_dict()
def get_key(cdf_dict, p):
    # Look for an exact match on p first (not on a hard-coded 0.5)
    if p in cdf_dict.values():
        for key, value in cdf_dict.items():
            if value == p:
                return key
    else:
        # Otherwise return the value whose cumulative probability is closest to p
        return min(cdf_dict, key=lambda y: abs(float(cdf_dict[y]) - p))
print(get_key(cdf_dict, 0.5))
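An alternative to the closest-probability search is the usual inverse-CDF convention: return the smallest value whose cumulative probability is at least p. The sketch below assumes a Series shaped like the output of `create_cdf_pandas` (values in the index, cumulative probabilities as data); `value_at` and the toy data are hypothetical.

```python
import pandas as pd

def value_at(cdf_series, p):
    """Return the smallest value whose cumulative probability is >= p."""
    position = cdf_series.values.searchsorted(p, side='left')
    # Guard against p falling marginally above the last cumulative probability
    position = min(position, len(cdf_series) - 1)
    return cdf_series.index[position]

# Made-up weights in pounds, for illustration only
toy = pd.Series([5.5, 6.0, 7.5, 7.5, 9.0])
toy_cdf = toy.value_counts().sort_index().cumsum() / float(len(toy))

toy_median = value_at(toy_cdf, 0.5)
print(toy_median)
```

Because the cumulative probabilities are sorted, the binary search avoids scanning every entry of a dictionary.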
Compute the interquartile range (IQR) by computing percentiles corresponding to 25 and 75.
In [38]:
ser = create_cdf_pandas(preg_live.totalwgt_lb.dropna())
cdf_dict = ser.to_dict()
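print(get_key(cdf_dict, 0.25))
print(get_key(cdf_dict, 0.75))
# The IQR is the distance between the 75th and 25th percentiles
print(get_key(cdf_dict, 0.75) - get_key(cdf_dict, 0.25))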
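The same calculation can be sketched without a CDF dictionary by sorting the sample and indexing into it. This is one of several percentile conventions (the smallest value whose percentile rank reaches the target), and the data and the `percentile` helper below are made up for illustration.

```python
import math

def percentile(sample, rank):
    """Return the smallest value whose percentile rank is >= rank (0-100)."""
    ordered = sorted(sample)
    # Index of the first value with at least `rank` percent of the data at or below it
    position = int(math.ceil(rank / 100.0 * len(ordered))) - 1
    return ordered[max(position, 0)]

# Made-up birth weights in pounds, for illustration only
toy_weights = [5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0]

q1 = percentile(toy_weights, 25)
q3 = percentile(toy_weights, 75)
iqr = q3 - q1
print(q1, q3, iqr)
```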
Make a random selection from cdf.
In [39]:
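# Sampling rows without weights would give every distinct weight the same chance;
# weighting by the probability mass (the increments of the CDF) follows the distribution
pd.DataFrame(ser).sample(weights=ser.diff().fillna(ser.iloc[0]))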
Out[39]:
Draw a random sample from cdf.
In [40]:
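# Weight each value by its probability mass so the sample follows the distribution
pd.DataFrame(ser).sample(n=10, replace=True, weights=ser.diff().fillna(ser.iloc[0]))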
Out[40]:
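Drawing from a CDF is often done by inverse transform sampling: pick p uniformly from [0, 1) and return the value at which the CDF first reaches p. The sketch below illustrates the idea on made-up data; `sample_from_cdf` is a hypothetical helper, and the seed is fixed only to make the run repeatable.

```python
import random
import pandas as pd

def sample_from_cdf(cdf_series, n, rng=random):
    """Draw n values from a CDF Series by inverse transform sampling."""
    draws = []
    for _ in range(n):
        p = rng.random()  # uniform in [0, 1)
        position = cdf_series.values.searchsorted(p, side='left')
        position = min(position, len(cdf_series) - 1)
        draws.append(cdf_series.index[position])
    return draws

# Made-up weights in pounds, for illustration only
toy = pd.Series([5.5, 6.0, 7.5, 7.5, 9.0])
toy_cdf = toy.value_counts().sort_index().cumsum() / float(len(toy))

rng = random.Random(17)  # fixed seed so the sketch is repeatable
draws = sample_from_cdf(toy_cdf, 5000, rng)

# 7.5 makes up 2 of the 5 toy values, so its share should be near 0.4
share_7_5 = draws.count(7.5) / float(len(draws))
print(share_7_5)
```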
Draw a random sample from cdf, then compute the percentile rank for each value, and plot the distribution of the percentile ranks.
In [53]:
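# Weight each value by its probability mass (the increments of the CDF) so the draw follows the distribution
df1 = pd.DataFrame(ser).sample(n=1000, replace=True, weights=ser.diff().fillna(ser.iloc[0]))
df1.reset_index(inplace=True)
df1.columns = ['val', 'prob']
# prob is CDF(val), so 100 * prob is the percentile rank of each sampled value
df1_ranks = df1.prob * 100
# The CDF of the ranks should be close to a straight line, i.e. the ranks are roughly uniform
ranks_ser = create_cdf_pandas(df1_ranks)
ranks_ser.plot()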
Out[53]:
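The point of this exercise is that percentile ranks of values drawn from a distribution are themselves roughly uniform on 0 to 100. The following sketch checks that on a synthetic stand-in population (normally distributed made-up weights, not the NSFG data); every name in it is illustrative.

```python
import random

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Stand-in population: made-up weights, not the NSFG data
population = [rng.gauss(7.3, 1.2) for _ in range(2000)]

def percentile_rank(values, x):
    """Percentage of values that are <= x."""
    return 100.0 * sum(1 for v in values if v <= x) / len(values)

# Draw from the population, then rank each draw against the population
sample = [rng.choice(population) for _ in range(500)]
ranks = [percentile_rank(population, x) for x in sample]

# If the ranks are uniform, about a quarter of them fall in each quarter of 0-100
quarter_shares = [
    sum(1 for r in ranks if lo < r <= lo + 25) / float(len(ranks))
    for lo in (0, 25, 50, 75)
]
print(quarter_shares)
```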
Generate 1000 random values using random.random() and plot their PMF.
In [65]:
import random
t = [random.random() for _ in range(1000)]
ser_pmf = pd.Series(t).value_counts().sort_index()/ len(t)
ser_pmf.plot()
Out[65]:
Assuming that the PMF doesn't work very well, try plotting the CDF instead.
In [66]:
cdf_ser = create_cdf_pandas(pd.Series(t))
In [68]:
cdf_ser.plot()
Out[68]:
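The reason the PMF is uninformative here is that floating-point draws from `random.random()` almost never repeat, so each distinct value gets a probability of about 1/n and the plot is flat noise. The CDF, by contrast, should track the line CDF(x) = x for a uniform distribution. The sketch below checks both points numerically using only the standard library; the seed and variable names are arbitrary choices for illustration.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the sketch is repeatable
values = [rng.random() for _ in range(1000)]

# PMF: each distinct value's share of the sample
counts = Counter(values)
pmf = {v: c / float(len(values)) for v, c in counts.items()}
largest_pmf = max(pmf.values())

# CDF: rank of each sorted value divided by n
ordered = sorted(values)
n = len(ordered)
cdf_points = [(x, (i + 1) / float(n)) for i, x in enumerate(ordered)]

# For a uniform distribution CDF(x) = x, so the largest gap should be small
max_gap = max(abs(p - x) for x, p in cdf_points)
print(largest_pmf, max_gap)
```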