probability density function (PDF) - the derivative of a CDF. Evaluating a PDF at x gives a probability density, or "the probability per unit of x". To get a probability mass, you have to integrate over x.
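To make the density versus mass distinction concrete, here is a minimal sketch using only the standard library. The helper names `normal_density` and `probability_mass` are hypothetical (they are not part of thinkstats2), and the trapezoid rule is just one simple way to integrate.

```python
import math

def normal_density(x, mu, sigma):
    """Density of a normal distribution at x (probability per unit of x)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def probability_mass(density, low, high, n=1000):
    """Approximate the integral of density over [low, high] with the trapezoid rule."""
    width = (high - low) / float(n)
    total = 0.5 * (density(low) + density(high))
    for i in range(1, n):
        total += density(low + i * width)
    return total * width

mu, sigma = 0.0, 1.0  # standard normal, chosen only for illustration

# A density is not a probability: it has units of "probability per unit of x".
density_at_1 = normal_density(1.0, mu, sigma)

# Integrating the density over an interval gives a probability mass.
mass_within_1_sigma = probability_mass(
    lambda x: normal_density(x, mu, sigma), -1.0, 1.0)

print(density_at_1)
print(mass_within_1_sigma)
```

The mass within one standard deviation of the mean should come out close to the familiar 68%.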

The Pdf class provides...

Density takes a value, x, and returns the density at x.
Render evaluates the density at a discrete set of values and returns a pair of sequences: the sorted values, xs, and their probability densities.
MakePmf evaluates Density at a discrete set of values and returns a normalized Pmf that approximates the Pdf.
GetLinspace returns the default set of points used by Render and MakePmf.

...but Density and GetLinspace are implemented in child classes; the parent Pdf class builds Render and MakePmf on top of them.
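The parent/child split can be sketched as follows. This is not the thinkstats2 source: `SketchPdf` and `SketchNormalPdf` are hypothetical names, and a plain dict stands in for a Pmf. The 101 points spanning 3 sigma either side are assumptions taken from the comments in the cells in these notes.

```python
import math

class SketchPdf(object):
    """Hypothetical parent class: Render and MakePmf rely on methods a child supplies."""

    def Density(self, x):
        raise NotImplementedError("child classes implement Density")

    def GetLinspace(self):
        raise NotImplementedError("child classes implement GetLinspace")

    def Render(self):
        # Evaluate the density at the default points.
        xs = self.GetLinspace()
        ds = [self.Density(x) for x in xs]
        return xs, ds

    def MakePmf(self):
        # Normalize the densities so they sum to 1 (a dict stands in for a Pmf).
        xs, ds = self.Render()
        total = sum(ds)
        return dict((x, d / total) for x, d in zip(xs, ds))


class SketchNormalPdf(SketchPdf):
    """Hypothetical child class for a normal distribution."""

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def Density(self, x):
        z = (x - self.mu) / self.sigma
        return math.exp(-0.5 * z * z) / (self.sigma * math.sqrt(2 * math.pi))

    def GetLinspace(self, n=101):
        # n equally spaced points from mu - 3*sigma to mu + 3*sigma.
        low = self.mu - 3 * self.sigma
        step = 6.0 * self.sigma / (n - 1)
        return [low + i * step for i in range(n)]


sketch_sigma = math.sqrt(52.8)
sketch_pdf = SketchNormalPdf(163, sketch_sigma)
sketch_xs, sketch_ds = sketch_pdf.Render()
sketch_pmf = sketch_pdf.MakePmf()
print(len(sketch_xs), sum(sketch_pmf.values()))
```

Only `Density` and `GetLinspace` differ between distributions, so a new child class gets `Render` and `MakePmf` for free.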



In [2]:

    
%matplotlib inline
import thinkstats2
import thinkplot
import pandas as pd
import numpy as np
import math, random



In [11]:

    
mean, var = 163, 52.8
std = math.sqrt(var)
pdf = thinkstats2.NormalPdf(mean, std)
print("Density: %s" % pdf.Density(mean + std))
thinkplot.Pdf(pdf, label='normal')
thinkplot.Show()









    



Density: 0.0333001249896






    












    








In [12]:

    
# By default, MakePmf covers 3*sigma in either direction
pmf = pdf.MakePmf()
thinkplot.Pmf(pmf,label='normal')
thinkplot.Show()









    












    






Kernel density estimation (KDE) - an algorithm that takes a sample and finds an appropriately smooth PDF that fits the data.
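To show what such an algorithm does, here is a hand-rolled sketch of a Gaussian KDE: place a small normal "bump" on every sample point and average them. The function name `gaussian_kde_sketch` is hypothetical, and the bandwidth uses Scott's rule of thumb (std * n**(-1/5)), which is an assumption for illustration rather than a statement about what any particular library does.

```python
import math
import random

def gaussian_kde_sketch(sample):
    """Return a function that evaluates a Gaussian KDE of the sample."""
    n = len(sample)
    mean = sum(sample) / float(n)
    std = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    h = std * n ** (-1.0 / 5)  # Scott's rule-of-thumb bandwidth (assumed)
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))

    def density(x):
        # Average of normal kernels centered on each sample point.
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)

    return density

random.seed(17)  # fixed seed so the sketch is repeatable
kde_sample = [random.gauss(163, 7.27) for _ in range(500)]
kde_density = gaussian_kde_sketch(kde_sample)

# Sanity check: a density should integrate to about 1 over a wide enough range.
grid_step = 0.5
kde_area = sum(kde_density(120 + grid_step * i) for i in range(173)) * grid_step
print(kde_area)
```

A smaller bandwidth follows the sample more closely and looks bumpier; a larger one gives a smoother but flatter estimate.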



In [20]:

    
sample = [random.gauss(mean, std) for i in range(500)]
sample_pdf = thinkstats2.EstimatedPdf(sample)
thinkplot.Pdf(sample_pdf, label='sample PDF made by KDE')

##Evaluates PDF at 101 points
pmf = sample_pdf.MakePmf()
thinkplot.Pmf(pmf, label='sample PMF')
thinkplot.Show()









    












    






Advantages of KDE:

Visualization - an estimated PDF is easy to interpret when you look at it.
Interpolation - if you have reason to believe the population distribution is smooth, you can use KDE to estimate the density for values that do not appear in the sample.
Simulation - KDE smooths out a small sample, which lets a simulation explore a wider range of outcomes than the observed values alone.
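The simulation point can be illustrated with a short sketch. Drawing from a Gaussian KDE is equivalent to picking one of the observed values at random and adding normal noise whose standard deviation is the bandwidth. The helper `draw_from_kde`, the five sample values, and the bandwidth of 3.0 are all made up for illustration.

```python
import random

def draw_from_kde(sample, h, rng):
    """Draw one value from a Gaussian KDE: pick a data point, then add N(0, h) noise."""
    center = rng.choice(sample)
    return rng.gauss(center, h)

rng = random.Random(3)  # fixed seed so the sketch is repeatable
small_sample = [150.0, 158.0, 163.0, 165.0, 171.0]  # made-up values
bandwidth = 3.0                                     # chosen by hand, not estimated

# Plain resampling can only ever reproduce the five observed values...
resampled = [rng.choice(small_sample) for _ in range(1000)]

# ...while drawing from the smoothed distribution produces in-between values too.
smoothed = [draw_from_kde(small_sample, bandwidth, rng) for _ in range(1000)]

print(len(set(resampled)), len(set(smoothed)))
```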

discretizing - if you evaluate a PDF at discrete points, you can generate a PMF that approximates the PDF.

statistic - any time you take a sample and reduce it to a single number, that number is a statistic.

raw moment - if you have a sample of values, $x_i$, the $k$th raw moment is:

$$ m'_k = \frac{1}{n} \sum_i x_i^k $$

when k = 1 the result is the sample mean.

central moments are more useful...



In [21]:

    
def RawMoment(xs, k):
    return sum(x**k for x in xs) / len(xs)

def CentralMoment(xs, k):
    mean = RawMoment(xs, 1)
    return sum((x - mean)**k for x in xs) / len(xs)

...note that when k = 2, the second central moment is variance.

If we attach a weight along a ruler at each location, $x_i$, and then spin the ruler around the mean, the moment of inertia of the spinning weights is the variance of the values.
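A small worked example makes the moments concrete. The lowercase helpers below are hypothetical stand-ins that follow the same formulas as RawMoment and CentralMoment, with explicit float division so the result is the same under Python 2 and Python 3.

```python
def raw_moment(xs, k):
    """k-th raw moment: the mean of x**k."""
    return sum(x ** k for x in xs) / float(len(xs))

def central_moment(xs, k):
    """k-th central moment: the mean of (x - mean)**k."""
    mean = raw_moment(xs, 1)
    return sum((x - mean) ** k for x in xs) / float(len(xs))

weights_at = [1, 2, 3, 4, 5]             # positions of the weights on the ruler
mean = raw_moment(weights_at, 1)         # (1+2+3+4+5)/5 = 3.0
variance = central_moment(weights_at, 2) # (4+1+0+1+4)/5 = 2.0

# The variance also equals the second raw moment minus the squared mean: 11 - 9.
shortcut = raw_moment(weights_at, 2) - mean ** 2

print(mean, variance, shortcut)
```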

Skewness describes the shape of a distribution. Negative skewness means the distribution skews left; positive skewness means it skews right. To compute the sample skewness, $g_1$...



In [22]:

    
##normalized so there are no units
def StandardizedMoment(xs, k):
    var = CentralMoment(xs, 2)
    std = math.sqrt(var)
    return CentralMoment(xs, k) / std**k

def Skewness(xs):
    return StandardizedMoment(xs, 3)

Pearson's median skewness coefficient is a measure of the skewness based on the difference between the sample mean and median: $$ g_p = 3(\bar{x}-m)/S $$

It is a more robust statistic than sample skewness because it is less sensitive to outliers.



In [23]:

    
def Median(xs):
    cdf = thinkstats2.Cdf(xs)
    return cdf.Value(0.5)

def PearsonMedianSkewness(xs):
    median = Median(xs)
    mean = RawMoment(xs, 1)
    var = CentralMoment(xs, 2)
    std = math.sqrt(var)
    gp = 3 * (mean - median) / std
    return gp

To summarize the Moments:

the mean is a raw moment with k = 1

the variance is a central moment with k = 2

the sample skewness is a standardized moment with k = 3

note that Pearson Median Skewness is a more robust measure of skewness.
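One way to see the robustness claim is to add a single outlier to a symmetric sample and compare how far each statistic moves. This is a sketch with made-up data. The helper names are hypothetical, and `midpoint_median` uses the conventional midpoint rule for even-sized samples, which may differ slightly from a CDF-based median.

```python
import math

def central_moment(xs, k):
    mean = sum(xs) / float(len(xs))
    return sum((x - mean) ** k for x in xs) / float(len(xs))

def sample_skewness(xs):
    """Third standardized moment, g_1."""
    std = math.sqrt(central_moment(xs, 2))
    return central_moment(xs, 3) / std ** 3

def midpoint_median(xs):
    """Median; averages the two middle values when the sample size is even."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0

def pearson_median_skewness(xs):
    """g_p = 3 * (mean - median) / std."""
    mean = sum(xs) / float(len(xs))
    std = math.sqrt(central_moment(xs, 2))
    return 3 * (mean - midpoint_median(xs)) / std

symmetric = list(range(1, 10))   # 1..9, perfectly symmetric
with_outlier = symmetric + [100] # one extreme value added

g1_sym, gp_sym = sample_skewness(symmetric), pearson_median_skewness(symmetric)
g1_out, gp_out = sample_skewness(with_outlier), pearson_median_skewness(with_outlier)

print(g1_sym, gp_sym)  # both zero for the symmetric sample
print(g1_out, gp_out)  # roughly 2.6 versus roughly 0.94
```

Both statistics start at zero, but the single outlier moves the sample skewness nearly three times as far as Pearson's coefficient, because the outlier's deviation is cubed in $g_1$.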

Exercise



In [ ]:

    
import hinc, hinc2

print("starting...")
df = hinc.ReadData()
log_sample = hinc2.InterpolateSample(df)

log_cdf = thinkstats2.Cdf(log_sample)

thinkplot.Cdf(log_cdf)
thinkplot.Show(xlabel='household income',
               ylabel='CDF')

# log_pmf = thinkstats2.Pmf(log_sample)
# thinkplot.Hist(log_pmf)
# thinkplot.Show()



In [ ]: