Chapter 2: Descriptive Statistics

Mean

$\mu = \frac{1}{n} \sum_{i} x_i$ is a fancy way of saying $\frac{\mathrm{sum}(x)}{\mathrm{len}(x)}$.

Variance

$\sigma^2 = \frac{1}{n} \sum_{i} (x_i - \mu)^2$

We start with $x_i - \mu$, i.e. each value's deviation from the mean, square it, sum the squares over all values, and divide by $n$.
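To make the order of operations concrete, here is the calculation spelled out step by step. This is only a sketch on a small made-up list (the values in `xs` are hypothetical and not taken from the chapter's data), chosen so that the numbers come out round.

```python
# Hypothetical data, picked so the arithmetic is easy to follow by hand.
xs = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(xs)
mu = sum(xs) / n                         # mean: 40 / 8

deviations = [x - mu for x in xs]        # step 1: deviation of each value from the mean
squared = [d ** 2 for d in deviations]   # step 2: square each deviation
sigma2 = sum(squared) / n                # step 3: sum the squares, divide by n

print(mu)               # 5.0
print(sum(deviations))  # 0.0, the raw deviations always cancel out
print(sigma2)           # 4.0
```

The raw deviations sum to zero, which is why they have to be squared before being averaged.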

Let me try to write a very simple function for calculating the variance.


In [70]:
def mean_(lst):
    return(sum(lst)/len(lst))

def variancelst_(lst):
    mu = mean_(lst)
    return([abs(x-mu) for x in lst]) # generates a list showing the absolute deviation from the mean for each list item

def variance_(lst):
    varlst = [x**2 for x in variancelst_(lst)] # squares each absolute deviation, giving the squared deviation from the mean for each list item
    return(sum(varlst)/len(lst)) # returns the sum of the list, divided by number of elements. Average squared deviation.

example = [1,2,344,3,1,3,45,46,4,3]

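# var and std are presumably NumPy's, available through the pylab namespace; stddev_ is defined in a later cell
print(variance_(example) == var(example))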
print(std(example) == stddev_(example))


True
True
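The exact `==` comparison works here, but floating-point results from two different implementations can differ in the last digits. A more tolerant cross-check, sketched below, uses `math.isclose` and the standard library's `statistics.pvariance` and `statistics.pstdev`, which use the same divide-by-$n$ (population) definition as the formula above. The standard-library functions are a swapped-in reference for illustration, not what the original cell used, and `variance_` is restated compactly so the block runs on its own.

```python
import math
import statistics

def variance_(lst):
    # Same population-variance definition as above, restated so this block is self-contained.
    mu = sum(lst) / len(lst)
    return sum((x - mu) ** 2 for x in lst) / len(lst)

example = [1, 2, 344, 3, 1, 3, 45, 46, 4, 3]

# Compare with a tolerance instead of exact equality.
print(math.isclose(variance_(example), statistics.pvariance(example)))
print(math.isclose(math.sqrt(variance_(example)), statistics.pstdev(example)))
```

Note that `statistics.variance` and `statistics.stdev` (without the `p`) divide by $n - 1$ instead, so they would not match.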

Let's try figuring out Matplotlib, and plot the numbers against their absolute deviation from the mean, with the red line as the mean. Thus, where the green line (the number) meets the red line (the mean), the blue line (the deviation) is 0.


In [37]:
plot(example, color='green')
plot(variancelst_(example), color='blue')
axhline(mean_(example) ,color='red')
annotate('deviance = 0', xy=(6.5, 50), xytext=(7, 100),
    arrowprops=dict(facecolor='black', shrink=0.05))


Out[37]:
<matplotlib.text.Annotation at 0x1079ddc10>

I got carried away and decided to define the standard deviation, and the median and the mode as well.


In [ ]:
def stddev_(lst):
    return(sqrt(variance_(lst)))

def even(n):
    return n % 2 == 0

def median_(lst):
    srt = sorted(lst)
    
    mid = len(lst) // 2
    if even(len(lst)):
        return( (srt[mid-1] + srt[mid]) / 2)
    else:
        return(srt[mid])

from collections import Counter
def mode_(lst):
    cnt = Counter(lst)
    cntsort = sorted(cnt.items(), key=lambda x: x[1])
    return(cntsort[-1][0])  # return the key of the last item, ie. with most occurrences. Not deterministic without unique mode.
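
# When several values tie for the highest count, mode_ returns just one of them.
# A minimal sketch of an alternative that returns all tied values. The name modes_
# is a hypothetical helper, not part of the notes above; statistics.multimode
# (Python 3.8+) is the standard-library equivalent.

```python
from collections import Counter
import statistics

def modes_(lst):
    # Hypothetical helper: return every value that shares the highest count.
    cnt = Counter(lst)
    top = max(cnt.values())   # raises ValueError on an empty list
    return [value for value, count in cnt.items() if count == top]

print(modes_([1, 1, 2, 2, 3]))                # [1, 2]
print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2]
```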

Let's play with the formatting function, and a list comprehension with functions, to print a nice statistical summary.


In [72]:
def summary_(lst):
    fns = [
           ("Number of items", len),
           ("Mean", mean_),
           ("Median", median_),
           ("Mode", mode_),
           ("Variance", variance_),
           ("Std. dev", stddev_)
           ]
    
    outary = ["{:<20}{:8.2f}".format(title, fun(lst)) for (title, fun) in fns]
    print("\n".join(outary))

summary_(example)


Number of items        10.00
Mean                   45.20
Median                  3.00
Mode                    3.00
Variance            10209.56
Std. dev              101.04
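
The layout of each row comes from the format string `"{:<20}{:8.2f}"`: the title is left-aligned and padded to 20 characters, and the value is right-aligned in 8 characters with two decimals. A small sketch of one row in isolation, using the mean from the output above:

```python
row = "{:<20}{:8.2f}".format("Mean", 45.2)

print(repr(row))   # repr makes the padding visible
print(len(row))    # 28 = 20 for the title + 8 for the number

# The same spec also accepts an int, which is why the item count prints as 10.00.
print(repr("{:8.2f}".format(10)))
```

If the item count should print without decimals, one option would be to give each entry its own format spec, for example `{:8d}` for the count.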

Now let's load in the birth dataset, and look at the summary stats for it.


In [81]:
import pickle
with open("/Users/Stian/src/math-with-ipython/think-stats/preg-pandas.pickle", "rb") as f:
    df, df1, dfnot1 = pickle.load(f)
    
print("First child:")
summary_(df1.prglength)
print("\nNot first child:")
summary_(dfnot1.prglength)


First child:
Number of items      4413.00
Mean                   38.60
Median                 39.00
Mode                   39.00
Variance                7.79
Std. dev                2.79

Not first child:
Number of items      4735.00
Mean                   38.52
Median                 39.00
Mode                   39.00
Variance                6.84
Std. dev                2.62
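
To get a feel for how small the gap between the two means is, it can be converted into hours. This is a rough sketch that reuses the rounded means printed above and assumes `prglength` is measured in weeks (the notes do not state the unit), so the result is only approximate.

```python
# Means copied from the summary output above (already rounded to two decimals).
mean_first = 38.60
mean_other = 38.52

diff_weeks = mean_first - mean_other
diff_hours = diff_weeks * 7 * 24   # assumption: prglength is in weeks

print("{:.2f} weeks, about {:.0f} hours".format(diff_weeks, diff_hours))
```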

End on p 24

