Distribution: records each value in a dataset and the number of times it appears

Histogram: shows the frequency of each value in a distribution
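As a minimal sketch in plain Python (not thinkstats2), a histogram is just a mapping from value to frequency, which `collections.Counter` captures directly:

```python
from collections import Counter

# a histogram maps each value to the number of times it appears
hist = Counter([1, 2, 2, 3, 5])
print(hist[2])  # 2 -- the value 2 appears twice
print(hist[4])  # 0 -- missing values have frequency 0
```

Like `thinkstats2.Hist`, `Counter` returns 0 for values that never appear, rather than raising a KeyError.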


In [5]:
import thinkstats2
hist = thinkstats2.Hist([1,2,2,3,5])
hist


Out[5]:
Hist({1: 1, 2: 2, 3: 1, 5: 1})

In [2]:
print hist.Freq(2) == hist[2]
print hist[4]
print hist.Values()
for val in sorted(hist.Values()):
    print (val, hist.Freq(val))
for val, freq in hist.Items():
    print (val, freq)


True
0
[1, 2, 3, 5]
(1, 1)
(2, 2)
(3, 1)
(5, 1)
(1, 1)
(2, 2)
(3, 1)
(5, 1)

In [7]:
import thinkplot

thinkplot.Hist(hist)
thinkplot.Show(xlabel='value', ylabel='frequency')

NOTE: documentation for thinkplot at this link


In [15]:
import nsfg
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]
hist = thinkstats2.Hist(live.birthwgt_lb, label='birthwgt_lb')
thinkplot.Hist(hist)
thinkplot.Show(xlabel='pounds',ylabel='frequency')
  • The best way to handle outliers is "domain knowledge" -- information about where the data come from and what they mean.
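One way to put that into practice is to inspect the extreme values and their frequencies before deciding whether to drop them (thinkstats2 provides `hist.Smallest(n)` and `hist.Largest(n)` for this). A plain-Python sketch, with a hypothetical `extremes` helper and made-up weights for illustration:

```python
from collections import Counter

def extremes(values, n=3):
    """Return the n smallest and n largest (value, frequency) pairs."""
    counts = Counter(values)
    items = sorted(counts.items())  # sort by value, not frequency
    return items[:n], items[-n:]

# illustrative data only -- not the NSFG birth weights
weights = [1, 4, 6, 6, 7, 7, 7, 8, 8, 9, 15]
low, high = extremes(weights)
print(low)   # [(1, 1), (4, 1), (6, 2)]
print(high)  # [(8, 2), (9, 1), (15, 1)]
```

Seeing that a value like 1 or 15 occurs only once is a prompt to ask, from domain knowledge, whether it is a plausible measurement or an error.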

In [23]:
firsts = live[live.birthord == 1]
others = live[live.birthord != 1]

first_hist = thinkstats2.Hist(firsts.prglngth)
other_hist = thinkstats2.Hist(others.prglngth)

width = 0.45

# PrePlot takes the number of histograms we plan to plot
# so it can choose distinct colors for them
thinkplot.PrePlot(2) 
thinkplot.Hist(first_hist, align='right',width=width)
thinkplot.Hist(other_hist, align='left', width=width)
thinkplot.Show(xlabel='weeks', 
               ylabel='frequency',
               axis=[27, 46, 0, 2700])


/Users/davidgoldberg/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py:475: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots.
  warnings.warn("No labelled objects found. "

Note: apparent differences may be due to sample size. Probability mass functions (PMFs) will help with that problem.

Summary Statistics:

  • central tendency -- Do the values cluster around a particular point?
  • modes -- Is there more than one cluster?
  • spread -- How much variability is there in the values?
  • tails -- How quickly do the probabilities drop off as we move away from the modes?
  • outliers -- Are there extreme values far from the modes?

mean versus average:

mean: a summary statistic computed as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where n is the number of samples and the $x_i$ are the values.

average: one of several summary statistics you might choose to describe a central tendency.

variance:

  • describes the variability or spread of a distribution
$$\sigma^2 = \frac{1} {n}\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2 $$
  • $x_i - \mu$ is called the "deviation from the mean," so variance is the "mean squared deviation"

  • variance is not a good summary statistic on its own because its units are the square of the units of the values, which makes it hard to interpret.
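The formula above can be checked from scratch. A sketch in plain Python; note that it divides by n, matching the formula, whereas pandas' `var()` (used in the next cell) divides by n - 1 by default:

```python
def mean(xs):
    # arithmetic mean: sum of the values over the number of values
    return sum(xs) / float(len(xs))

def variance(xs):
    # mean squared deviation from the mean (divides by n, not n - 1)
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / float(len(xs))

xs = [1, 2, 2, 3, 5]
print(mean(xs))      # 2.6
print(variance(xs))  # 1.84
```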


In [13]:
mean = live.prglngth.mean()
var = live.prglngth.var()
std = live.prglngth.std()

print "mean:    ",mean
print "variance:",var
print "std:     ",std


mean:     38.5605596852
variance: 7.30266206783
std:      2.70234381007

Effect Size

  • one method is to compute the difference in means and report it as a fraction of one of the values.

  • Cohen's d compares the difference between groups to the variability within groups.

$$ d = \frac{\bar{x}_1 - \bar{x}_2}{s} $$

where $\bar{x}_1$ and $\bar{x}_2$ are the means of the groups and $s$ is the "pooled standard deviation." Using this measurement, you report the difference in means as a number of standard deviations, $d$.


In [28]:
diff = abs(firsts.prglngth.mean() - others.prglngth.mean())
print "diff of means in weeks:",diff
print "diff of means in hours: ", diff * 7 * 24
print diff / firsts.prglngth.mean() * 100, '%'


diff of means in weeks: 0.0780372667775
diff of means in hours:  13.1102608186
0.202164100295 %

In [26]:
def CohenEffectSize(group1, group2):
    import math
    diff = group1.mean() - group2.mean()
    
    var1 = group1.var()
    var2 = group2.var()
    n1, n2 = len(group1), len(group2)
    
    # pooled variance: a weighted average of the two group variances
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    d = diff / math.sqrt(pooled_var)
    return d

print CohenEffectSize(firsts.prglngth,others.prglngth)


0.0288790446544

Exercise 2.1

Summarize the results in this chapter regarding whether or not first babies arrive late. Which summary statistics would you use if you wanted to get a story on the evening news? Which ones would you use if you wanted to reassure an anxious patient? Finally, answer the question clearly, precisely, and honestly.

My Answer:

a.) If I wanted to get the story on the evening news, I'd use the difference in means converted to hours, because in those units the difference sounds significant.

b.) If I wanted to reassure an anxious patient, I'd use the difference in means as a percentage of the mean, because it is easy to understand but conveys what a very small difference it is.

c.) There is very little difference in pregnancy length between first babies and others. I know this because Cohen's d for the means of these two groups is 0.029, meaning the difference between them is 0.029 standard deviations, which is extremely small.

Exercise 2.3

Write a function, Mode, that takes a hist and returns the most frequent value. Then, write a function called AllModes that returns a list of value-frequency pairs in descending order of frequency.


In [21]:
def mode(hist):
    # track the most frequent (value, frequency) pair seen so far;
    # starting largestFreq at 0 avoids comparing an int against None
    largestKey = None
    largestFreq = 0
    for key in hist:
        freq = hist[key]
        if freq > largestFreq:
            largestKey = key
            largestFreq = freq
    return largestKey, largestFreq

def allModes(hist):
    histList = []
    for key in hist:
        histList.append((key, hist[key]))
    return sorted(histList, key=lambda x: x[1], reverse=True)

print 'key, freq'
for key, freq in allModes(hist):
    print key, '\t',freq


key, freq
7.0 	3049
6.0 	2223
8.0 	1889
5.0 	697
9.0 	623
4.0 	229
10.0 	132
3.0 	98
2.0 	53
1.0 	40
11.0 	26
12.0 	10
0.0 	8
13.0 	3
14.0 	3
15.0 	1
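As a cross-check (not part of the exercise), Python's `collections.Counter` provides the same descending-frequency ordering via `most_common`:

```python
from collections import Counter

counts = Counter([1, 2, 2, 3, 5])
# most_common() returns (value, freq) pairs sorted by descending frequency;
# the ordering among tied frequencies can vary by Python version
print(counts.most_common())
print(counts.most_common(1)[0])  # the mode: (2, 2)
```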

Exercise 2.4

Using the variable totalwgt_lb investigate whether first babies are lighter or heavier than others. Compute Cohen's d to quantify the difference between the groups.


In [30]:
import math

diff = firsts.totalwgt_lb.mean() - others.totalwgt_lb.mean()

var1 = firsts.totalwgt_lb.var()
var2 = others.totalwgt_lb.var()

n1 = len(firsts)
n2 = len(others)

pooled_var = ((n1 * var1) + (n2 * var2)) / (n1 + n2)

print diff / math.sqrt(pooled_var)
print CohenEffectSize(firsts.totalwgt_lb, others.totalwgt_lb)


-0.0886729270726
-0.0886729270726

In [ ]: