probability density function - derivative of a CDF. Evaluating for x gives a probability density or "the probability per unit of x. In order to get a probability mass, you have to integrate over x.
Pdf class probides...
Density
take a value, x and returns the density at xRender
evaluates the density at a discrete set of values and returns a pair of sequences: sorted values, xs, and their probabilty densities.MakePmf
, evaluates Density at a discrete set of values and returns a normalized Pmf that approximates the Pdf.GetLinspace
, returns the default set of points used by Render and MakePmf...but they are implemented in children classes
In [2]:
%matplotlib inline
import thinkstats2
import thinkplot
import pandas as pd
import numpy as np
import math, random
In [11]:
mean, var = 163, 52.8
std = math.sqrt(var)
pdf = thinkstats2.NormalPdf(mean, std)
print "Density:",pdf.Density(mean + std)
thinkplot.Pdf(pdf, label='normal')
thinkplot.Show()
In [12]:
#by default, makes pmf stetching 3*sigma in either direction
pmf = pdf.MakePmf()
thinkplot.Pmf(pmf,label='normal')
thinkplot.Show()
Kernel density estimation - an algorithm that takes a sampel and finds an approximately smooth PDF that fits the data.
In [20]:
sample = [random.gauss(mean, std) for i in range(500)]
sample_pdf = thinkstats2.EstimatedPdf(sample)
thinkplot.Pdf(sample_pdf, label='sample PDF made by KDE')
##Evaluates PDF at 101 points
pmf = sample_pdf.MakePmf()
thinkplot.Pmf(pmf, label='sample PMF')
thinkplot.Show()
discretizing a PMF if you evaluate a PDF at discrete points, you can generate a PMF that is an approximation of the PDF.
statistic Any time you take a sample and reduce it to a single number, that number is a statistic.
raw moment if you have a sample of values, $x_i$, the $k$th raw moment is:
$$ m'_k = \frac{1}{n} \sum_i x_i^k $$when k = 1 the result is the sample mean.
central moments are more useful...
In [21]:
def RawMoment(xs, k):
return sum(x**k for x in xs) / len(xs)
def CentralMoment(xs, k):
mean = RawMoment(xs, 1)
return sum((x - mean)**k for x in xs) / len(xs)
...note that when k = 2, the second central moment is variance.
If we attach a weight along a ruler at each location, $x_i$, and then spin the ruler around the mean, the moment of inertia of the spinning weights is the variance of the values
Skewness describes the shape of a distribution. Negative means distribution skews left. Positive means skews right. To compute sample skewness $g1$...
In [22]:
##normalized so there are no units
def StandardizedMoment(xs, k):
var = CentralMoment(xs, 2)
std = math.sqrt(var)
return CentralMoment(xs, k) / std**k
def Skewness(xs):
return StandardizedMoment(xs, 3)
Pearson's median skewness coefficient is a measure of the skewness based on the difference between the sample mean and median: $$ g_p = 3(\bar{x}-m)/S $$
It is a more robust statistic than sample skewness because it is less sensitive to outliers.
In [23]:
def Median(xs):
cdf = thinkstats2.Cdf(xs)
return cdf.Value(0.5)
def PearsonMedianSkewness(xs):
median = Median(xs)
mean = RawMoment(xs, 1)
var = CentralMoment(xs, 2)
std = math.sqrt(var)
gp = 3 * (mean - meadian) / std
return gp
In [ ]:
import hinc, hinc2
print "starting..."
df = hinc.ReadData()
log_sample = hinc2.InterpolateSample(df)
log_cdf = thinkstats2.Cdf(log_sample)
thinkplot.Cdf(log_cdf)
thinkplot.Show(xlabel='household income',
ylabel='CDF')
# log_pmf = thinkstats2.Pmf(log_sample)
# thinkplot.Hist(log_pmf)
# thinkplot.Show()
In [ ]: