probability density function - the derivative of a CDF. Evaluating it at x gives a probability density, or "the probability per unit of x." To get a probability mass, you have to integrate over a range of x.
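Since a density is not a probability, it can exceed 1; the probability mass over an interval is the difference of the CDF at the endpoints. A minimal sketch using scipy.stats (an assumption; the rest of this notebook uses thinkstats2):

In [ ]:
import math
from scipy import stats

# Probability mass in an interval = integral of the density
# = difference of the CDF at the endpoints.
mean, std = 163, math.sqrt(52.8)
dist = stats.norm(loc=mean, scale=std)
print dist.cdf(mean + std) - dist.cdf(mean - std)   # ~0.683, the mass within 1 sigma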

The Pdf class provides...

  • Density takes a value, x, and returns the density at x
  • Render evaluates the density at a discrete set of values and returns a pair of sequences: the sorted values, xs, and their probability densities
  • MakePmf evaluates Density at a discrete set of values and returns a normalized Pmf that approximates the Pdf
  • GetLinspace returns the default set of points used by Render and MakePmf

...but these are implemented in child classes, as the sketch below illustrates
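As a concrete sketch, here is a hypothetical child class (UniformPdf is not part of thinkstats2; it only shows where Density and GetLinspace plug in):

In [ ]:
import numpy as np
import thinkstats2

# Hypothetical child class: a uniform density on [a, b].
# Density and GetLinspace are supplied here; Render and MakePmf
# are inherited from the Pdf parent class.
class UniformPdf(thinkstats2.Pdf):
    def __init__(self, a=0.0, b=1.0):
        self.a, self.b = a, b

    def Density(self, xs):
        xs = np.asarray(xs)
        return np.where((xs >= self.a) & (xs <= self.b),
                        1.0 / (self.b - self.a), 0.0)

    def GetLinspace(self):
        return np.linspace(self.a, self.b, 101)

print UniformPdf(0, 5).Density([1, 6])   # [ 0.2  0. ]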


In [2]:
%matplotlib inline
import thinkstats2
import thinkplot
import pandas as pd
import numpy as np
import math, random

In [11]:
mean, var = 163, 52.8
std = math.sqrt(var)
pdf = thinkstats2.NormalPdf(mean, std)
print "Density:",pdf.Density(mean + std)
thinkplot.Pdf(pdf, label='normal')
thinkplot.Show()


Density: 0.0333001249896
[Figure: PDF of the normal distribution]

In [12]:
# by default, MakePmf evaluates the Pdf stretching 3*sigma in either direction of the mean
pmf = pdf.MakePmf()
thinkplot.Pmf(pmf,label='normal')
thinkplot.Show()


[Figure: PMF approximation of the normal PDF]

Kernel density estimation (KDE) - an algorithm that takes a sample and finds an appropriately smooth PDF that fits the data.


In [20]:
sample = [random.gauss(mean, std) for i in range(500)]
sample_pdf = thinkstats2.EstimatedPdf(sample)
thinkplot.Pdf(sample_pdf, label='sample PDF made by KDE')

# by default, MakePmf evaluates the estimated PDF at 101 points
pmf = sample_pdf.MakePmf()
thinkplot.Pmf(pmf, label='sample PMF')
thinkplot.Show()


[Figure: KDE-estimated PDF and its PMF approximation]
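Under the hood, thinkstats2.EstimatedPdf wraps scipy.stats.gaussian_kde. Conceptually, KDE places a Gaussian kernel on each data point and averages them; here is a hand-rolled sketch (the fixed bandwidth h is a crude assumption; scipy picks one automatically):

In [ ]:
# Hand-rolled KDE: average a Gaussian kernel centered on each data point.
def simple_kde(sample, xs, h):
    xs = np.asarray(xs, dtype=float)
    total = np.zeros_like(xs)
    for x_i in sample:
        total += np.exp(-0.5 * ((xs - x_i) / h)**2) / (h * math.sqrt(2 * math.pi))
    return total / len(sample)

xs = np.linspace(mean - 4 * std, mean + 4 * std, 101)
densities = simple_kde(sample, xs, h=2.0)
print densities.sum() * (xs[1] - xs[0])   # Riemann sum; should be close to 1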

Advantages of KDE:

  • Visualization - an estimated PDF is a simple way to visualize a distribution, which makes it useful during exploration.
  • Interpolation - if you have reason to think the distribution is smooth, you can use KDE to estimate densities for values that don't appear in the sample.
  • Simulation - KDE smooths out a small sample, allowing for a wider range of outcomes during simulation (as sketched below).
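For the simulation use case, you can draw new values from the estimated density. thinkstats2.EstimatedPdf is built on scipy.stats.gaussian_kde, which can resample directly (a sketch, assuming scipy is importable):

In [ ]:
from scipy import stats

# Fit a KDE to the sample, then draw a fresh simulated sample from it.
kde = stats.gaussian_kde(sample)
simulated = kde.resample(1000)[0]   # resample returns a (1, n) array for 1-D data
print simulated.mean(), simulated.std()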

discretizing a PDF - if you evaluate a PDF at discrete points, you can generate a PMF that is an approximation of the PDF.

statistic - any time you take a sample and reduce it to a single number, that number is a statistic.

raw moment - if you have a sample of values, $x_i$, the $k$th raw moment is:

$$ m'_k = \frac{1}{n} \sum_i x_i^k $$

When k = 1, the result is the sample mean.

central moments are more useful; the $k$th central moment is computed relative to the mean:

$$ m_k = \frac{1}{n} \sum_i (x_i - \bar{x})^k $$


In [21]:
def RawMoment(xs, k):
    # average of the values raised to the kth power
    return sum(x**k for x in xs) / float(len(xs))

def CentralMoment(xs, k):
    # like RawMoment, but measured relative to the mean
    mean = RawMoment(xs, 1)
    return sum((x - mean)**k for x in xs) / float(len(xs))

...note that when k = 2, the central moment is the variance.
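A quick check of that claim (np.var also uses the 1/n definition by default, matching CentralMoment):

In [ ]:
xs = [random.gauss(0, 1) for _ in range(1000)]
print CentralMoment(xs, 2), np.var(xs)   # the two should agree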

If we attach a weight to a ruler at each location, $x_i$, and then spin the ruler around the mean, the moment of inertia of the spinning weights is the variance of the values.

Skewness describes the shape of a distribution: negative skewness means the distribution skews left; positive means it skews right. To compute the sample skewness, $g_1$...


In [22]:
# normalized by the standard deviation so the result has no units
def StandardizedMoment(xs, k):
    var = CentralMoment(xs, 2)
    std = math.sqrt(var)
    return CentralMoment(xs, k) / std**k

def Skewness(xs):
    return StandardizedMoment(xs, 3)
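A quick sanity check: a symmetric sample should have skewness near 0, while an exponential sample skews right (the exponential distribution's skewness is exactly 2):

In [ ]:
symmetric = [random.gauss(0, 1) for _ in range(10000)]
right_skewed = [random.expovariate(1.0) for _ in range(10000)]
print Skewness(symmetric)      # near 0
print Skewness(right_skewed)   # near 2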

Pearson's median skewness coefficient is a measure of skewness based on the difference between the sample mean and median: $$ g_p = 3(\bar{x}-m)/S $$ where $m$ is the median and $S$ is the standard deviation.

It is a more robust statistic than sample skewness because it is less sensitive to outliers.


In [23]:
def Median(xs):
    cdf = thinkstats2.Cdf(xs)
    return cdf.Value(0.5)

def PearsonMedianSkewness(xs):
    median = Median(xs)
    mean = RawMoment(xs, 1)
    var = CentralMoment(xs, 2)
    std = math.sqrt(var)
    gp = 3 * (mean - median) / std
    return gp
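To see the robustness difference, add a single extreme outlier: the sample skewness blows up while Pearson's statistic barely moves (a sketch with synthetic data):

In [ ]:
xs = [random.gauss(0, 1) for _ in range(1000)]
print Skewness(xs), PearsonMedianSkewness(xs)

xs.append(1000)   # one extreme outlier
print Skewness(xs), PearsonMedianSkewness(xs)   # g1 jumps to ~30; gp stays small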

To summarize the moments:

  • the mean is a raw moment with k = 1
  • the variance is a central moment with k = 2
  • the sample skewness is a standardized moment with k = 3

Note that Pearson's median skewness is a more robust measure of skewness.

Exercise


In [ ]:
import hinc, hinc2

print "starting..."
df = hinc.ReadData()
log_sample = hinc2.InterpolateSample(df)

log_cdf = thinkstats2.Cdf(log_sample)

thinkplot.Cdf(log_cdf)
thinkplot.Show(xlabel='log10 household income',
               ylabel='CDF')

# log_pmf = thinkstats2.Pmf(log_sample)
# thinkplot.Hist(log_pmf)
# thinkplot.Show()
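One way to continue (an assumption about the exercise's direction) is to apply the moment statistics defined above to the log-income sample:

In [ ]:
# assumes the cell above has been run, so log_sample is defined
print "mean:", RawMoment(log_sample, 1)
print "median:", Median(log_sample)
print "skewness:", Skewness(log_sample)
print "Pearson median skewness:", PearsonMedianSkewness(log_sample)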
