In [1]:
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
import pandas as pd
import thinkstats2
import thinkplot
In [2]:
hist = thinkstats2.Hist([1, 2, 2, 3, 5])
hist
Out[2]:
In [3]:
hist.Freq(2) # hist[2]
Out[3]:
In [4]:
hist.Values()
Out[4]:
In [5]:
thinkplot.Hist(hist)
thinkplot.Show(xlabel='value', ylabel='frequency')
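The `Hist` object above maps each value to the number of times it appears. As a rough illustration of that idea (not `thinkstats2`'s actual implementation), the same counts can be reproduced with the standard library's `collections.Counter`. The variable names below are chosen here for illustration only.

```python
from collections import Counter

# Same sample values as the thinkstats2.Hist example above
values = [1, 2, 2, 3, 5]

# Counter maps each distinct value to the number of times it occurs
counts = Counter(values)

# Analogous to hist.Freq(2): how often the value 2 appears
freq_of_2 = counts[2]

# Analogous to hist.Values(): the distinct values (sorted here for readability)
distinct_values = sorted(counts)

print(freq_of_2)
print(distinct_values)
```

A value that never appears, such as `counts[4]`, has a frequency of zero rather than raising an error.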
In [6]:
import nsfg
Histogram of birth weight in pounds
In [7]:
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]
hist = thinkstats2.Hist(live.birthwgt_lb, label='birthwgt_lb')
thinkplot.Hist(hist)
thinkplot.Show(xlabel='pounds', ylabel='frequency')
Histogram of pregnancy lengths
In [8]:
firsts = live[live.birthord == 1]
others = live[live.birthord != 1]
firsts.prglngth.plot(kind='hist', width=2)
others.prglngth.plot(kind='hist', width=2)
Out[8]:
Some of the characteristics we might want to report are:
$x_{i}-\overline{x}$ is called the “deviation from the mean”.
The variance $S^2$ is the mean squared deviation, and its square root $S$ is the standard deviation.
pandas data structures provide methods to compute the mean, variance, and standard deviation:
mean = live.prglngth.mean()
var = live.prglngth.var() # variance
std = live.prglngth.std() # standard deviation
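To make these quantities concrete, here is a minimal sketch that computes the mean, the deviations from the mean, and the variance by hand, using only the standard library. The sample values are made up for illustration. One detail worth knowing: pandas `.var()` and `.std()` divide by $n-1$ by default (`ddof=1`), which corresponds to `statistics.variance`, whereas dividing by $n$ corresponds to `statistics.pvariance`.

```python
import math
import statistics

# Made-up sample, loosely shaped like pregnancy lengths in weeks
sample = [38, 39, 39, 40, 41]

n = len(sample)
mean = sum(sample) / n

# Deviation from the mean for each observation
deviations = [x - mean for x in sample]
sum_sq = sum(dev ** 2 for dev in deviations)

# Dividing by n gives the population variance
var_n = sum_sq / n

# Dividing by n - 1 gives the sample variance (the pandas default, ddof=1)
var_n1 = sum_sq / (n - 1)
std_n1 = math.sqrt(var_n1)

print(mean, var_n, var_n1, std_n1)
```

For large samples such as the NSFG data the two variance conventions differ very little, but the choice matters when comparing hand-computed results with pandas output.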
An effect size is a quantitative measure of the strength of a phenomenon, such as the difference between two groups.
One obvious choice is the difference in the means. Another way to convey the size of the effect is to compare the difference between groups to the variability within groups.
Cohen's $d$ is a statistic that makes this comparison:
$$d = \frac{\overline{x}_1 - \overline{x}_2}{s}$$
where $s$ is the “pooled standard deviation”:
$$s=\sqrt{\frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1 +n_2 -2}}$$
Here $n_i$ is the sample size of $x_i$ and $S_i^2$ is its variance.
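The two formulas above translate directly into code. The following is a minimal sketch using only the standard library. The function names `pooled_std` and `cohen_d` and the sample data are hypothetical, and `statistics.variance` (which divides by $n_i - 1$) is assumed to be the intended definition of the variance in the pooled formula.

```python
import math
import statistics


def pooled_std(x1, x2):
    """Pooled standard deviation of two samples (hypothetical helper)."""
    n1, n2 = len(x1), len(x2)
    # Assumption: the per-group variance is the sample variance (n - 1 divisor)
    var1 = statistics.variance(x1)
    var2 = statistics.variance(x2)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var)


def cohen_d(x1, x2):
    """Difference in means divided by the pooled standard deviation."""
    mean_diff = statistics.mean(x1) - statistics.mean(x2)
    return mean_diff / pooled_std(x1, x2)


# Made-up data: two small groups whose means differ by 0.5
group_a = [7.0, 7.5, 8.0, 8.5]
group_b = [6.5, 7.0, 7.5, 8.0]

d = cohen_d(group_a, group_b)
print(d)
```

Because $d$ is expressed in units of standard deviations, it can be compared across variables measured on different scales. Swapping the two groups flips its sign.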
In [9]:
import thinkstats2
resp = thinkstats2.ReadStataDct('2002FemResp.dct').ReadFixedWidth('2002FemResp.dat.gz', compression='gzip')
Make a histogram of totincr, the total income for the respondent's family.
In [10]:
resp.totincr.plot.hist(bins=range(17))
Out[10]:
Make a histogram of age_r, the respondent's age at the time of interview.
In [11]:
resp.ager.plot.hist(bins=range(15,46))
Out[11]:
Use totincr to select the respondents with the highest income. Compute the distribution of parity for just the high-income respondents.
In [12]:
rich = resp[resp.totincr == resp.totincr.max() ]
rich.parity.plot.hist(bins=range(10))
Out[12]:
Compare the mean parity for high-income respondents and others.
In [13]:
rich = resp[resp.totincr == resp.totincr.max() ]
notrich = resp[resp.totincr < resp.totincr.max()]
rich.parity.mean(), notrich.parity.mean()
Out[13]:
In [14]:
preg = nsfg.ReadFemPreg()
In [15]:
first = preg[preg.birthord ==1 ]
others = preg[preg.birthord >1 ]
first.totalwgt_lb.mean(), others.totalwgt_lb.mean()
Out[15]:
In [16]:
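def CohenEffectSize(group1, group2):
    """Compute Cohen's d using the pooled standard deviation."""
    mean_diff = group1.mean() - group2.mean()
    # count() ignores NaN, matching what mean() and var() do
    n1 = group1.count()
    n2 = group2.count()
    # pandas var() is the sample variance (ddof=1), so weight each group by n - 1
    pooled_var = ((n1 - 1) * group1.var() + (n2 - 1) * group2.var()) / (n1 + n2 - 2)
    d = mean_diff / np.sqrt(pooled_var)
    return d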
In [17]:
CohenEffectSize(first.totalwgt_lb, others.totalwgt_lb)
Out[17]: