$\mu = \frac{1}{n} \sum_{i} x_i$, a fancy way of saying $\frac{\mathrm{sum}(x)}{\mathrm{len}(x)}$
$\sigma^2 = \frac{1}{n} \sum_{i} (x_i - \mu)^2$
We start with $x_i - \mu$, i.e. the deviation of each value from the mean, square it, sum the squares over all values, and divide by $n$.
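As a small worked example (the numbers are made up for illustration), take $x = (1, 2, 3, 4)$, so $n = 4$:

$\mu = \frac{1 + 2 + 3 + 4}{4} = 2.5$

$\sigma^2 = \frac{(-1.5)^2 + (-0.5)^2 + 0.5^2 + 1.5^2}{4} = \frac{2.25 + 0.25 + 0.25 + 2.25}{4} = \frac{5}{4} = 1.25$

Each deviation is squared before summing. Summing the raw deviations first would always give 0, because the positive and negative deviations from the mean cancel out.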
Let me try to write a very simple function for calculating the variance.
In [70]:
def mean_(lst):
    return sum(lst) / len(lst)
def variancelst_(lst):
    mu = mean_(lst)
    return [abs(x - mu) for x in lst]  # absolute deviation from the mean for each list item
def variance_(lst):
    varlst = [x**2 for x in variancelst_(lst)]  # squared deviation from the mean for each list item
    return sum(varlst) / len(lst)  # sum of squared deviations divided by the number of elements: the average squared deviation
example = [1,2,344,3,1,3,45,46,4,3]
print(variance_(example) == var(example))
print(std(example) == stddev_(example))
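Comparing floats with `==` can fail even when two computations agree up to rounding error. As a hedged alternative, the sketch below checks a population-variance function against the standard library's `statistics.pvariance` and `statistics.pstdev`, using `math.isclose` for the comparison. The name `pop_variance` is hypothetical and not part of this notebook, and the sketch assumes that `var` and `std` above are NumPy's, which default to the same population formulas (`ddof=0`).

```python
import math
import statistics

def pop_variance(values):
    """Population variance: the mean of squared deviations from the mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

example = [1, 2, 344, 3, 1, 3, 45, 46, 4, 3]

# Compare with a tolerance rather than with exact equality.
variance_matches = math.isclose(pop_variance(example), statistics.pvariance(example))
stddev_matches = math.isclose(math.sqrt(pop_variance(example)), statistics.pstdev(example))
print(variance_matches, stddev_matches)
```

Note that `statistics.variance` and `statistics.stdev` (without the `p`) divide by $n - 1$ instead of $n$, so they would not match the formula used here.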
Let's try figuring out Matplotlib and plot the numbers against their deviance, with the red line as the mean. Thus, when the green line (the number) meets the red line (the mean), the blue line (the deviance) is 0.
In [37]:
plot(example, color='green')
plot(variancelst_(example), color='blue')
axhline(mean_(example) ,color='red')
annotate('deviance = 0', xy=(6.5, 50), xytext=(7, 100),
         arrowprops=dict(facecolor='black', shrink=0.05))
Out[37]:
I got carried away and decided to define the standard deviation, the median, and the mode as well.
In [ ]:
def stddev_(lst):
    return sqrt(variance_(lst))  # sqrt is assumed to come from the %pylab namespace
def even(n):
    return n % 2 == 0
def median_(lst):
    srt = sorted(lst)
    mid = len(lst) // 2
    if even(len(lst)):
        return (srt[mid-1] + srt[mid]) / 2  # even length: average the two middle values
    else:
        return srt[mid]
from collections import Counter
def mode_(lst):
    cnt = Counter(lst)
    cntsort = sorted(cnt.items(), key=lambda x: x[1])
    return cntsort[-1][0]  # key of the last item, i.e. the one with most occurrences. If several values tie, only one of them is returned.
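Since `mode_` returns a single value even when several values tie for the highest count, one possible alternative is to return all of them. The sketch below is only an illustration: the function name `modes` is hypothetical, and the first-seen ordering of the result relies on `Counter` preserving insertion order (Python 3.7+).

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest count, in first-seen order."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

print(modes([1, 2, 344, 3, 1, 3, 45, 46, 4, 3]))  # 3 occurs three times
print(modes([1, 1, 2, 2, 3]))                     # 1 and 2 tie
```

An empty list makes `max` raise a `ValueError`, so a caller would need to handle that case separately.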
Let's play with the formatting function, and a list comprehension with functions, to print a nice statistical summary.
In [72]:
def summary_(lst):
    fns = [
        ("Number of items", len),
        ("Mean", mean_),
        ("Median", median_),
        ("Mode", mode_),
        ("Variance", variance_),
        ("Std. dev", stddev_)
    ]
    outary = ["{:<20}{:8.2f}".format(title, fun(lst)) for (title, fun) in fns]
    print("\n".join(outary))
summary_(example)
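To make the format string used in the summary easier to read, here is a minimal sketch of what `"{:<20}{:8.2f}"` does to a single row: `{:<20}` left-aligns the title in a 20-character column, and `{:8.2f}` right-aligns the number in an 8-character column with two decimals. The variable names and sample values are made up for illustration.

```python
row_format = "{:<20}{:8.2f}"

# A float value: title padded to 20 characters, number padded to 8.
row = row_format.format("Mean", 45.2)
print(repr(row))
print(len(row))

# An integer is accepted by the "f" format too, and gets two decimals.
count_row = row_format.format("Number of items", 10)
print(repr(count_row))
```

This is why the item count in the summary is printed with a trailing `.00`, like the other statistics.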
Now let's load in the birth dataset, and look at the summary stats for it.
In [81]:
import pickle
with open("/Users/Stian/src/math-with-ipython/think-stats/preg-pandas.pickle", "rb") as f:
    df, df1, dfnot1 = pickle.load(f)
print("First child:")
summary_(df1.prglength)
print("\nNot first child:")
summary_(dfnot1.prglength)
End on p 24