ECML 2016: Nonparametric Statistics, Part 1



What is nonparametric statistics?

Frequentist definition

A nonparametric procedure is a statistical procedure that has certain desirable properties that hold under relatively mild assumptions regarding the underlying populations from which the data are obtained.


Hollander, Myles, Douglas A. Wolfe, and Eric Chicken. Nonparametric Statistical Methods. John Wiley & Sons, 2013.


Frequentist definition

The term nonparametric is imprecise. The related term distribution-free has a more precise meaning... The distribution-free property enables one to obtain the distribution of the statistic under the null hypothesis without specifying the underlying distribution of the data.


Hollander, Myles, Douglas A. Wolfe, and Eric Chicken. Nonparametric Statistical Methods. John Wiley & Sons, 2013.

BNP definition

Inferring the distribution of the data when it is completely unknown.

Bayesian parametric

Bayesian nonparametric

Distribution on distributions?




$$ x_1,\dots,x_n\sim P ~~~\textit{ and }~~~ P \sim \mu $$

Nonparametric Prior

Probability Space: Let $\mathbb{X}$ be a standard Borel space with Borel $\sigma$-field $\mathcal{B}_\mathbb{X}$ and $\mathbb{P}$ be the space of probability measures on $(\mathbb{X},\mathcal{B}_{\mathbb{X}})$.

Nonparametric Prior: Let $\mathbb{M}$ be the class of all probability measures on $(\mathbb{P},\mathcal{B}_\mathbb{P})$. We call the elements $\mu \in \mathbb{M}$ nonparametric priors.

Dirichlet Process

Developed by Ferguson (1973)

$$ x_1,\dots,x_n\sim P ~~~\textit{ and }~~~ P \sim \mu $$

We are interested in a particular $\mu$:

$$ x_1,\dots,x_n\sim P ~~~\textit{ and }~~~ P \sim Dp(s,\alpha) $$

Dirichlet Process

Developed by Ferguson (1973)

An element $\mu \in \mathbb{M}$ is called a Dirichlet process $\mathcal{D}p(s,\alpha)$ on $\mathbb{X}$, with

  • prior strength $s\in \mathbb{R}^+$
  • prior probability measure $\alpha$ on $\mathbb{X}$

if for every finite measurable partition $B_1,\dots,B_m$ of $\mathbb{X}$,

the vector $(P(B_1),\dots,P(B_m))$

has a Dirichlet distribution with parameters $(s\alpha(B_1),\dots,s\alpha(B_m))$
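The defining property can be illustrated by simulation: for any fixed partition, the vector of cell probabilities follows the stated Dirichlet distribution. A minimal Python sketch (stdlib only; the Dirichlet draw is built from Gamma variates, a standard construction), using the prior from the example that follows:

```python
import random
from statistics import mean

def dirichlet(params, rng):
    """Draw from Dir(params) by normalizing independent Gamma(a, 1) variates."""
    g = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(g)
    return [x / total for x in g]

rng = random.Random(0)

# Prior strength s = 4, alpha = 0.5*delta_{-1} + 0.5*delta_{1};
# partition B1 = (-inf, 0], B2 = (0, inf)  =>  (P(B1), P(B2)) ~ Dir(2, 2).
s = 4.0
alpha = [0.5, 0.5]                 # alpha(B1), alpha(B2)
params = [s * a for a in alpha]    # (2, 2)

draws = [dirichlet(params, rng) for _ in range(20000)]
print(mean(d[0] for d in draws))   # should be close to E[P(B1)] = alpha(B1) = 0.5
```

The empirical mean of $P(B_1)$ matches $\alpha(B_1)=0.5$, which is the general identity $E[P(B)]=\alpha(B)$ for the DP.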

Prior Base Measure

Equivalently, the DP can be parametrized by a single finite base measure $\beta(\cdot)=s\,\alpha(\cdot)$: since $\beta(\mathbb{X})=s\in \mathbb{R}^+$ is finite, we can normalize it!

Normalization: $$ \alpha(\cdot)=\frac{\beta(\cdot)}{s} ~~~ \text{ with $~~~\alpha(\mathbb{X})=1$} $$
Note that $\alpha(\cdot)$ is a probability measure.

We equivalently say: $~~~P\sim \mathcal{D}(\beta)~~~$ or $~~~P \sim Dp(s,\alpha)$

Example 1/3

  • $\mathbb{X}=\mathbb{R}$
  • prior strength: $s=4$
  • prior probability measure: $\alpha=0.5\delta_{\{-1\}}+0.5\delta_{\{1\}}$


In [8]:
using Gadfly
using Compose
myplot = plot(x=[0.0], y=[0.0],
    Guide.annotation(compose(context(),
        line([(-1, 0), (-1, 0.5)]),
        line([(1, 0), (1, 0.5)]),
        stroke(colorant"blue"), linewidth(0.4))),
    Stat.xticks(ticks=[-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]),
    Stat.yticks(ticks=[0.0, 0.5, 1.0]),
    Geom.line,
    Guide.ylabel("Probability"),
    Theme(major_label_font_size=15pt, minor_label_font_size=13pt))
draw(PNG("plots/real.png", 6inch, 3inch), myplot)

Dirichlet Process

We say that $P\sim \mathcal{D}p(s,\alpha)$ if

  • for every finite measurable partition $B_1,\dots,B_m$ of $\mathbb{X}$,
  • the vector $(P(B_1),\dots,P(B_m))$ has a Dirichlet distribution with parameters $(s\alpha(B_1),\dots,s\alpha(B_m))$

Example 2/3

$$ P \sim Dp(s,\alpha) $$

if

  • Partition: $B_1=(-\infty,0]$, $B_2=(0,\infty)$

  • then

$$ \begin{array}{rcl} (P(B_1),P(B_2))&\sim& Dir(s\alpha(B_1),s\alpha(B_2))\\ &=&Beta(s\alpha(B_1),s\alpha(B_2)) \end{array} $$

Example 3/3

Since $s=4$ and $\alpha=0.5\delta_{\{-1\}}+0.5\delta_{\{1\}}$ then

$$ \begin{array}{rcl} (P(B_1),P(B_2))&\sim& Dir(s\alpha(B_1),s\alpha(B_2))\\ &=& Dir(2,2)=Beta(2,2) \end{array} $$

In [2]:
xv=(0:0.01:1)
yv=6*xv.^(1).*(1-xv).^(1)

densplot=plot(x=xv,y=yv, Geom.line,Guide.xlabel("P(B<sub>1</sub>)"),Guide.ylabel("Density"),Theme(panel_fill=colorant"black", default_color=colorant"orange",major_label_font_size=15pt,minor_label_font_size=13pt))
draw(PNG("plots/densplot.png", 6inch, 3inch), densplot)

Posterior Dirichlet Process

One of the most remarkable properties of the DP priors is that the posterior of $P$ is again a DP.

  • let $x_1,\dots,x_n \sim P$ i.i.d.
  • assume that $P \sim Dp(s,\alpha)$

Then the posterior distribution of $P$ is
$$ P|x_1,\dots,x_n \sim Dp\left(s+n, \frac{s}{s+n} \alpha+ \frac{1}{s+n} \sum\limits_{i=1}^n \delta_{x_i}\right), $$

where $\delta_{x_i}$ is an atomic probability measure centered at $x_i$.
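The conjugate update can be written as a small helper returning the posterior strength and the posterior base measure; a Python sketch (it assumes, as in all the examples here, a purely atomic prior $\alpha$, stored as a dict from atom location to prior probability):

```python
from collections import defaultdict

def dp_posterior(s, alpha_atoms, data):
    """Posterior of P ~ Dp(s, alpha) after observing `data` i.i.d. from P.

    alpha_atoms: dict location -> prior probability (weights sum to 1).
    Returns (s_n, alpha_n), with alpha_n again a dict of atom weights.
    """
    n = len(data)
    s_n = s + n
    alpha_n = defaultdict(float)
    for loc, w in alpha_atoms.items():   # (s / s_n) * alpha
        alpha_n[loc] += s * w / s_n
    for x in data:                       # (1 / s_n) * sum_i delta_{x_i}
        alpha_n[x] += 1.0 / s_n
    return s_n, dict(alpha_n)

# Example from the slides: s = 4, alpha = 0.5*delta_{-1} + 0.5*delta_{1}
s_n, alpha_n = dp_posterior(4, {-1.0: 0.5, 1.0: 0.5},
                            [-1.5, -0.5, 0.25, 0.75, 1.25, 1.75])
print(s_n)            # 10
print(alpha_n[-1.0])  # 4*0.5/10 = 0.2
```

With the partition $B_1=(-\infty,0]$, $B_2=(0,\infty)$ this gives $s_n\alpha_n(B_1)=4$ and $s_n\alpha_n(B_2)=6$, i.e. the $Dir(4,6)$ of the worked example.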

Example cont.

  • $\mathbb{X}=\mathbb{R}$
  • prior strength: $s=4$
  • prior probability measure: $\alpha=0.5\delta_{\{-1\}}+0.5\delta_{\{1\}}$

Example cont.

  • Observations: $\{-1.5,-0.5, 0.25, 0.75, 1.25, 1.75\}$
  • posterior strength: $s_n=s+n=4+6=10$
  • posterior probability measure: $\alpha_n=\frac{s}{s+n} \alpha+ \frac{1}{s+n} \sum\limits_{i=1}^n \delta_{x_i}$


In [3]:
using Gadfly
using Compose
myplot = plot(x=[0.0], y=[0.0],
    Guide.annotation(compose(context(),
        line([(-1, 0), (-1, 0.4)]),
        line([(1, 0), (1, 0.4)]),
        stroke(colorant"blue"), linewidth(0.4))),
    Guide.annotation(compose(context(),
        line([(-1.5, 0), (-1.5, 0.1)]),
        line([(-0.5, 0), (-0.5, 0.1)]),
        line([(0.25, 0), (0.25, 0.1)]),
        line([(0.75, 0), (0.75, 0.1)]),
        line([(1.25, 0), (1.25, 0.1)]),
        line([(1.75, 0), (1.75, 0.1)]),
        stroke(colorant"orange"), linewidth(0.4))),
    Stat.xticks(ticks=[-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]),
    Stat.yticks(ticks=[0.0, 0.5, 1.0]),
    Geom.line,
    Guide.ylabel("Probability"),
    Theme(major_label_font_size=15pt, minor_label_font_size=13pt))
draw(PNG("plots/realpost.png", 6inch, 3inch), myplot)

  • Partition: $B_1=(-\infty,0]$, $B_2=(0,\infty)$
$$ \begin{array}{rcl} (P(B_1),P(B_2))&\sim& Dir(s_n\alpha_n(B_1),s_n\alpha_n(B_2))\\ &=&Dir(4,6)=Beta(4,6) \end{array} $$

In [4]:
xv=(0:0.01:1)
yv=504*xv.^(3).*(1-xv).^(5)

densplot=plot(x=xv,y=yv, Geom.line,Guide.xlabel("P(B<sub>1</sub>)"),Guide.ylabel("Density"),Theme(panel_fill=colorant"black", default_color=colorant"orange",major_label_font_size=15pt,minor_label_font_size=13pt))
draw(PNG("plots/densplotpost.png", 6inch, 3inch), densplot)
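The constants 6 and 504 hard-coded in the two density cells above are the Beta normalizing constants $1/B(2,2)$ and $1/B(4,6)$; a quick stdlib Python check:

```python
from math import gamma

def beta_const(a, b):
    """1 / B(a, b) = Gamma(a+b) / (Gamma(a) * Gamma(b))."""
    return gamma(a + b) / (gamma(a) * gamma(b))

print(beta_const(2, 2))   # 6.0   -> Beta(2, 2) density is 6 x (1 - x)
print(beta_const(4, 6))   # 504.0 -> Beta(4, 6) density is 504 x^3 (1 - x)^5
```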

Another example

  • Prior: strength $s=1$, probability measure $\alpha = \delta_{\{-1.5\}}$
  • Observations: $\{-1,-0.5, 0.25, 0.75, 1.25, 1.75\}$
  • posterior strength: $s_n=s+n=1+6=7$
  • posterior probability measure: $\alpha_n=\frac{s}{s+n} \alpha+ \frac{1}{s+n} \sum\limits_{i=1}^n \delta_{x_i}$
  • Partition: $B_1=(-\infty,0]$, $B_2=(0,\infty)$, then
$$ \begin{array}{rcl} (P(B_1),P(B_2))&\sim& Dir(s_n\alpha_n(B_1),s_n\alpha_n(B_2))\\ &=&Dir(3,4)=Beta(3,4) \end{array} $$

In [5]:
using Gadfly
using Compose
myplot = plot(x=[0.0], y=[0.0],
    Guide.annotation(compose(context(),
        line([(-1.5, 0), (-1.5, 1/7)]),
        stroke(colorant"blue"), linewidth(0.4))),
    Guide.annotation(compose(context(),
        line([(-1, 0), (-1, 1/7)]),
        line([(-0.5, 0), (-0.5, 1/7)]),
        line([(0.25, 0), (0.25, 1/7)]),
        line([(0.75, 0), (0.75, 1/7)]),
        line([(1.25, 0), (1.25, 1/7)]),
        line([(1.75, 0), (1.75, 1/7)]),
        stroke(colorant"orange"), linewidth(0.4))),
    Stat.xticks(ticks=[-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]),
    Stat.yticks(ticks=[0.0, 0.5, 1.0]),
    Geom.line,
    Guide.ylabel("Probability"),
    Theme(major_label_font_size=15pt, minor_label_font_size=13pt))
draw(PNG("plots/realpost1.png", 6inch, 3inch), myplot)

  • Partition: $B_1=(-\infty,-0.75]$, $B_2=(-0.75,0.5]$, $B_3=(0.5,\infty)$
$$ \begin{array}{rcl} (P(B_1),P(B_2),P(B_3))&\sim& Dir(s_n\alpha_n(B_1),s_n\alpha_n(B_2),s_n\alpha_n(B_3))\\ &=&Dir(2,2,3) \end{array} $$

Finest Partition

  • $\alpha_n=\frac{s}{s+n} \alpha+ \frac{1}{s+n} \sum\limits_{i=1}^n \delta_{x_i}$, $~~~~s_n=s+n$

  • $B_1=(-\infty,-1.5]$, $~B_2=(-1.5,-1]$, $~B_3=(-1,-0.5]$,.....

Finest Partition

We can actually consider just the atoms:

  • $B_1=\{-1.5\}$, $~B_2=\{-1\}$, $~B_3=\{-0.5\}$, ..., $~B_7=\{1.75\}$

    $$ \begin{array}{rcl} (P(B_1),\dots,P(B_{7}))&\sim& Dir(s_n\alpha_n(B_1),\dots,s_n\alpha_n(B_7))\\ &=& Dir(s,1,1,1,1,1,1) \end{array} $$

  • Note that the support of $P$ is atomic, hence

    $$ P=w_0 \delta_{\{-1.5\}}+ w_1 \delta_{\{-1\}}+ w_2 \delta_{\{-0.5\}}+\dots+ w_6 \delta_{\{1.75\}}, $$
    with $(w_0,w_1,\dots,w_6)\sim Dir(s,1,\dots,1)$

So we have a hierarchical model.

General property

$$ P=w_0 P_0+ \sum_{i=1}^n w_i \delta_{X_i}, $$

where

  • $(w_0,w_1,\dots,w_n)\sim Dir(s,1,\dots,1)$
  • $\sum_{i=0}^n w_i = 1$
  • $P_0 \sim Dp(s,\alpha)$
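Sampling from this representation only requires a Dirichlet draw over the weights. A Python sketch for the fully atomic case of the running example ($s=1$, $\alpha=\delta_{\{-1.5\}}$, so $P_0=\delta_{\{-1.5\}}$ and all the mass sits on known atoms):

```python
import random

def sample_posterior_P(s, prior_atom, data, rng):
    """One draw of P = w0 * delta_{prior_atom} + sum_i w_i * delta_{x_i},
    with (w0, w1, ..., wn) ~ Dir(s, 1, ..., 1) built from Gamma variates."""
    g = [rng.gammavariate(s, 1.0)] + [rng.gammavariate(1.0, 1.0) for _ in data]
    total = sum(g)
    atoms = [prior_atom] + list(data)
    return dict(zip(atoms, (x / total for x in g)))

rng = random.Random(1)
P = sample_posterior_P(1.0, -1.5, [-1, -0.5, 0.25, 0.75, 1.25, 1.75], rng)
print(sum(P.values()))   # approximately 1.0: P is a probability measure on 7 atoms
```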

Bayesian NonParametric Statistics

Our view

It reduces to asking the right question of the posterior.

Bayesian NonParametric Statistics

It reduces to asking the right question of the posterior.

$$ P=w_0 P_0+ \sum_{i=1}^n w_i \delta_{x_i}, $$

with

  • $(w_0,w_1,\dots,w_n)\sim Dir(s,1,\dots,1)$
  • $\sum_{i=0}^n w_i = 1$,
  • $P_0 \sim Dp(s,\alpha)$.

Sign Test

Goal: is the median of some distribution $P$ greater than zero?

  • Known $P$: just check whether $P(0,\infty)>P(-\infty,0]~~$ equiv. $~~P(0,\infty)>0.5$

  • Unknown $P$: use observations from $P$ to estimate $P$ and query the posterior (Bayesian way)

Our estimate of $P$ is $$ P=w_0 P_0+ \sum_{i=1}^n w_i \delta_{x_i}, $$ with

  • $(w_0,w_1,\dots,w_n)\sim Dir(s,1,\dots,1)$
  • $P_0 \sim Dp(s,\alpha)$.

Sign Test

Let us ask the posterior about

$$ P(-\infty,0] ~~~ \text{ and } ~~~ P(0,\infty) $$

This is equal to

$$ \begin{array}{rclcl} P(-\infty,0]&=&w_0 P_0(-\infty,0]&+& \sum_{i=1}^n w_i I_{(-\infty,0]}(x_i)\\ P(0,\infty)&=&w_0 P_0(0,\infty)&+& \sum_{i=1}^n w_i I_{(0,\infty)}(x_i) \end{array} $$

Bayesian Sign Test

The Bayesian sign test reduces to compute

$$ \mathcal{P}\Big(P(0,\infty)>P(-\infty,0]\Big) =\mathcal{P}\Big(P(0,\infty)>0.5\Big) $$

where $\mathcal{P}$ is computed w.r.t. $(w_0,w_1,\dots,w_n)\sim Dir(s,1,\dots,1)$ and $P_0 \sim Dp(s,\alpha)$

Example cont.

Continuing with $s=1$ and $\alpha=\delta_{\{-1.5\}}$ (so $P_0$ puts all its mass below zero):

$$ \begin{array}{rcl} P(-\infty,0]&=&w_0 P_0(-\infty,0]+ \sum_{i=1}^n w_i I_{(-\infty,0]}(x_i)\\ &=&w_0 + w_1 +w_2\\ P(0,\infty)&=& w_0 P_0(0,\infty)+ \sum_{i=1}^n w_i I_{(0,\infty)}(x_i)\\ &=& w_3 +w_4+w_5+w_6 \end{array} $$

The goal is to compute

$$ \mathcal{P}(P(0,\infty)>0.5) =\mathcal{P}(w_3 +w_4+w_5+w_6 >0.5) $$

where $\mathcal{P}$ is computed w.r.t. $(w_0,w_1,\dots,w_n)\sim Dir(s,1,\dots,1)$
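This probability can be checked directly. With $s=1$ (the value used in the preceding example), the weights are $Dir(1,\dots,1)$ over seven components, and by the aggregation property $w_3+w_4+w_5+w_6 \sim Beta(4,3)$, so the exact value is $\mathcal{P}(Beta(4,3)>0.5)=42/64=0.65625$, matching the Monte Carlo output of the Julia cell below. A stdlib Python sketch of the same Monte Carlo:

```python
import random

def dirichlet(params, rng):
    """Draw from Dir(params) by normalizing independent Gamma(a, 1) variates."""
    g = [rng.gammavariate(a, 1.0) for a in params]
    t = sum(g)
    return [x / t for x in g]

rng = random.Random(0)
s, n = 1.0, 6
N = 100000
# P(0, inf) = w3 + w4 + w5 + w6 (observations 0.25, 0.75, 1.25, 1.75 are positive)
hits = sum(sum(w[3:]) > 0.5 for w in
           (dirichlet([s] + [1.0] * n, rng) for _ in range(N)))
print(hits / N)   # close to the exact value 42/64 = 0.65625
```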


In [6]:
using Distributions, Gadfly, Compose, DataFrames
include("plots/Bsigntest1.jl")

#Bayesian Sign Test 
observ=[-1;-0.5;0.25;0.75;1.25;1.75];
data=Bsigntest(observ,zeros(length(observ),1),0,[1,0]);
#Plot
df0 = DataFrame(x=data[2,:][:])
p0=plot(df0, x=:x, Geom.histogram,Theme(major_label_font_size=13pt,minor_label_font_size=12pt,key_label_font_size=11pt),Guide.xlabel("P(0,&#8734;)"))
set_default_plot_size(15cm, 8cm)
display(p0)
prob=length(find(z->z> 0.5,data[2,:][:]))/length(data[2,:][:])
println("Probability that P(0,\u221e)>0.5 is $prob")


[Histogram of the Monte Carlo posterior samples; x-axis: P(0,∞)]
Probability that P(0,∞)>0.5 is 0.65675

Bayesian Answer

The median is greater than zero with posterior probability $0.66$.

  • No Null hypothesis
  • No $p$-value

Sensitivity to the prior (IDP: Imprecise Dirichlet Process)

Now move all the prior mass to the other side, $P_0(0,\infty)=1$:

$$ \begin{array}{rcl} P(-\infty,0]&=&w_0 P_0(-\infty,0]+ \sum_{i=1}^n w_i I_{(-\infty,0]}(x_i)\\ &=& w_1 +w_2\\ P(0,\infty)&=& w_0 P_0(0,\infty)+ \sum_{i=1}^n w_i I_{(0,\infty)}(x_i)\\ &=& w_0 +w_3 +w_4+w_5+w_6 \end{array} $$

The goal is to compute

$$ \mathcal{P}(P(0,\infty)>0.5) =\mathcal{P}(w_0 +w_3 +w_4+w_5+w_6 >0.5) $$

where $\mathcal{P}$ is computed w.r.t. $(w_0,w_1,\dots,w_n)\sim Dir(s,1,\dots,1)$
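With the prior weight $w_0$ now counting toward $P(0,\infty)$, aggregation gives $w_0+w_3+w_4+w_5+w_6 \sim Beta(5,2)$, so the upper probability is exactly $\mathcal{P}(Beta(5,2)>0.5)=57/64\approx 0.891$, matching the output of the cell below. The same Python sketch as before, with $w_0$ moved (again assuming $s=1$):

```python
import random

def dirichlet(params, rng):
    """Draw from Dir(params) by normalizing independent Gamma(a, 1) variates."""
    g = [rng.gammavariate(a, 1.0) for a in params]
    t = sum(g)
    return [x / t for x in g]

rng = random.Random(0)
N = 100000
# Upper probability: the prior weight w0 now counts toward P(0, inf)
hits = sum(w[0] + sum(w[3:]) > 0.5 for w in
           (dirichlet([1.0] * 7, rng) for _ in range(N)))
print(hits / N)   # close to the exact value 57/64 = 0.890625
```

Together with the previous cell this brackets the posterior probability in $[0.656, 0.891]$, which is exactly the prior-sensitivity interval the IDP reports.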


In [7]:
using Distributions, Gadfly, Compose, DataFrames
include("plots/Bsigntest1.jl")

#Bayesian Sign Test 
observ=[-1;-0.5;0.25;0.75;1.25;1.75];
data=Bsigntest(observ,zeros(length(observ),1),0,[0,1]);
#Plot
df0 = DataFrame(x=data[2,:][:])
p0=plot(df0, x=:x, Geom.histogram,Theme(major_label_font_size=13pt,minor_label_font_size=12pt,key_label_font_size=11pt),Guide.xlabel("P(0,&#8734;)"))
set_default_plot_size(15cm, 8cm)
display(p0)
prob=length(find(z->z> 0.5,data[2,:][:]))/length(data[2,:][:])
println("Probability that P(0,\u221e)>0.5 is $prob")


[Histogram of the Monte Carlo posterior samples; x-axis: P(0,∞)]
Probability that P(0,∞)>0.5 is 0.88956