Bayesian Hierarchy

1. Reconstruction without calibration (details)

2. Reconstruction with calibration

Example:

What we don't observe: $\left\{Y(t): ~t\in \{1000,...,1999\}\right\}$

What we do observe: $\left\{X(t): ~t\in\{1000,...,1999\}\right\}$

$X$ is "proxy" data: records such as tree rings, ice cores, and lake sediments, which are highly correlated with $Y$; here $Y$ is the global average temperature.

Problem: since we do not have data for $Y$ itself, we may seek other external data that also explain $Y$.

Solution: we assume we can explain $Y$ with physical factors.

  • $C=$carbon in atmosphere (GHG in general)
  • $V=$Volcanic activity
  • $S=$Solar activity

    Equation: $$Y(t) = \beta_0 + \beta_1 S(t) + \beta_2 V(t) + \beta_3 C(t) + \sigma_Y \epsilon_Y(t)$$ where $\sigma_Y \epsilon_Y(t) \overset{iid}{\sim} \mathcal{N}(0,\sigma_Y^2)$

    This is our prior equation on $Y$.

    This is the bottom-level in the hierarchy. The top-level of the hierarchy is an equation relating $X\&Y$.

    $$X(t) = \alpha_0 + \alpha_1 Y(t) + \sigma_X \epsilon_X(t)$$ where $\sigma_X \epsilon_X(t) \overset{iid}{\sim} \mathcal{N}(0,\sigma_X^2)$. This is the likelihood equation.

So, we see that the density of $Y(t)$ (the prior density) is the density of normal random variable $\mathcal{N}\left(\beta_0 + \beta_1 S(t) + \beta_2 V(t) + \beta_3 C(t), \sigma_Y^2\right)$

We also see that the conditional density of $X(t)$ given $Y(t)$ (the likelihood) is the density of $\mathcal{N}\left(\alpha_0 + \alpha_1 Y(t), \sigma_X^2\right)$
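As a quick, concrete illustration of this two-level structure, here is a minimal forward-simulation sketch. All forcing series and parameter values below are made up for illustration; nothing here comes from real climate data.

```python
# Forward simulation of the hierarchy:
#   bottom level (prior):      Y(t) = b0 + b1*S(t) + b2*V(t) + b3*C(t) + sigma_Y * eps_Y(t)
#   top level (likelihood):    X(t) = a0 + a1*Y(t) + sigma_X * eps_X(t)
# All numbers below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(1000, 2000)
n = len(years)

# made-up external forcings
S = rng.normal(0.0, 1.0, size=n)             # solar activity
V = -np.abs(rng.normal(0.0, 1.0, size=n))    # volcanic activity (cooling spikes)
C = np.linspace(0.0, 1.0, n)                 # greenhouse gases (slow increase)

# prior (process) level: unobserved temperature Y
b0, b1, b2, b3, sigma_Y = 14.0, 0.3, 0.5, 1.0, 0.2
Y = b0 + b1 * S + b2 * V + b3 * C + sigma_Y * rng.normal(size=n)

# likelihood (data) level: observed proxy X
a0, a1, sigma_X = -2.0, 1.5, 0.4
X = a0 + a1 * Y + sigma_X * rng.normal(size=n)

print(np.corrcoef(X, Y)[0, 1])   # the proxy is highly correlated with the temperature
```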

In summary, we have

  • $Y:$ unobserved
  • $X:$ observed
  • prior density of $Y$ involves external data $W = (S,V,C)$
    • Notation: $[Y]$ (really it should be $\mathbf{P}_Y(y)$)
  • likelihood of $X$ given $Y$:

    • Notation: $[X | Y]$ (this is really $\mathbf{P}_{X|Y}(x|y)$)

    Question: find $[Y|X]$?

i.e., what is the posterior density of what we do not observe given what we do observe?

Answer: $$[Y|X] = \frac{[X|Y] ~ [Y]}{[X]}$$

This is Bayes' theorem.

Very important note: We are interested in the behavior of that last formula as a function of $Y$; the behavior as a function of $X$ does not matter. So, we can basically ignore the denominator.

How can we compute $[X|Y][Y]$? More importantly, how can we sample from that distribution? The first question can rarely be answered. The second question is almost as good, because by sampling many times we can draw a histogram and see what this posterior density looks like.

Answer to the 2nd question: Gibbs sampler

  • Major issue: these posteriors depend on the values of $\theta = (\sigma_X^2, \sigma_Y^2, \beta_0, \beta_1, ..., \alpha_0, \alpha_1)$

If we do know $\theta$, we can work out $[X|Y] ~[Y]$ explicitly, and we don't need Gibbs. But we almost never know $\theta$, so we want to estimate it.

Gibbs sampler

To estimate $\theta$, we also need a prior and a likelihood: we give ourselves a prior density $[\theta]$ for $\theta$.

  • For the $\alpha$'s and $\beta$'s, we use normal priors.

  • For the $\sigma_X^2\&\sigma_Y^2$ we use inverse gammas

    Why such priors? Because when we combine them with the likelihoods coming from the two model equations (treating everything else as known), we get normal posteriors for the $\alpha$'s and $\beta$'s, and inverse-gamma posteriors for the $\sigma^2$'s.

    So this is a question of convenience. These priors are called conjugate priors: the posterior belongs to the same family as the prior.

    Why is convenience important? Because it ensures we can sample from those posteriors.

    Note: what are the likelihoods of the $\alpha$'s, $\beta$'s, and $\sigma^2$'s given everything else?

    Try it with $\alpha$'s: the $\alpha$'s come from $X(t) = \alpha_0 + \alpha_1 Y(t) + \sigma_X \epsilon_X(t)$

    $$[X | \alpha_0, \alpha_1, \sigma_X, Y]$$ is the likelihood of $X$.

    we see that $$X(t) \sim \mathcal{N}(\alpha_0 + \alpha_1 Y(t), \sigma_X^2)$$

    similarly,

$$\sigma_X^2 \sim IG(...)$$

and then we compute the posteriors. For instance, the posterior for $\alpha_0$:

$$[\alpha_0 | X, \alpha_1, \sigma_X, Y] = \frac{1}{\text{normalization factor}} [X|\alpha_0, \alpha_1, \sigma_X, Y]~[\alpha_0]$$
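For concreteness, here is that computation in the single-time-point case, assuming a normal prior $\alpha_0 \sim \mathcal{N}(m_0, s_0^2)$ (the hyperparameters $m_0, s_0^2$ are illustrative choices of ours). Completing the square in $\alpha_0$,

$$[\alpha_0 | X, \alpha_1, \sigma_X, Y] \propto \exp\left(-\frac{(X-\alpha_0-\alpha_1 Y)^2}{2\sigma_X^2}\right)\exp\left(-\frac{(\alpha_0-m_0)^2}{2 s_0^2}\right) \propto \exp\left(-\frac{1}{2}\left[\alpha_0^2\left(\frac{1}{\sigma_X^2}+\frac{1}{s_0^2}\right) - 2\alpha_0\left(\frac{X-\alpha_1 Y}{\sigma_X^2}+\frac{m_0}{s_0^2}\right)\right]\right),$$

which is a normal density in $\alpha_0$ with variance $\left(\frac{1}{\sigma_X^2}+\frac{1}{s_0^2}\right)^{-1}$ and mean equal to that variance times $\frac{X-\alpha_1 Y}{\sigma_X^2}+\frac{m_0}{s_0^2}$.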

Gibbs sampling step-by-step:

(0) Do the math and compute all these posteriors like the case we just did for $\alpha_0$.

  Note: since all parameters (except $\alpha_0$) are held fixed, in Bayes' rule here only $\alpha_0$ and $X$ switch sides.

(1) Initialize the values of all parameters for all variables, for example by sampling from their priors.

(2) Sample from all the posteriors computed in (0), with all other parameters equal to their current values.

(3) Repeat step (2) for each variable and each parameter. Do this $N$ times; a good choice is $N = 10000$.
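The sketch below turns these steps into code for a simplified single-regressor version of the model ($Y_t = \beta_0 + \beta_1 W_t$, as in the summary that follows). The toy data, prior hyperparameters ($\tau^2$, $a_0$, $b_0$), and starting values are illustrative choices of mine, and the full-conditional draws use the conjugate normal / inverse-gamma forms worked out later in these notes.

```python
# Gibbs sampler sketch for:
#   Y_t = beta0 + beta1*W_t + sigma_Y*eps_Y,   X_t = alpha0 + alpha1*Y_t + sigma_X*eps_X
# Unknowns: Y (whole series), alpha0, alpha1, beta0, beta1, sigma_X^2, sigma_Y^2.
import numpy as np

rng = np.random.default_rng(0)

# --- toy data (stand-in for real proxy / forcing series) ---
n = 200
W = rng.normal(size=n)
Y_true = 0.5 + 1.2 * W + 0.3 * rng.normal(size=n)    # "true" unobserved temperatures
X = 1.0 + 2.0 * Y_true + 0.4 * rng.normal(size=n)    # observed proxies

# --- illustrative priors: coefficients ~ N(0, tau2 * I), variances ~ IG(a0, b0) ---
tau2, a0, b0 = 10.0, 2.0, 1.0

def draw_inv_gamma(a, b):
    # If G ~ Gamma(shape=a, rate=b), then 1/G ~ IG(a, b).
    return 1.0 / rng.gamma(shape=a, scale=1.0 / b)

def draw_coeffs(design, target, noise_var):
    # Conjugate normal draw for regression coefficients given everything else.
    prec = design.T @ design / noise_var + np.eye(2) / tau2    # posterior precision
    cov = np.linalg.inv(prec)
    mean = cov @ (design.T @ target / noise_var)
    return rng.multivariate_normal(mean, cov)

# --- step (1): initialize (one could also sample these from the priors) ---
alpha0, alpha1, beta0, beta1 = 0.0, 1.0, 0.0, 1.0
sig2_X, sig2_Y = 1.0, 1.0
Y = X.copy()                       # crude starting guess for the unobserved series

samples = []
N = 10_000
for it in range(N):
    # --- step (2): sample each block from its full conditional, others held at current values ---
    # [Y | everything else]: normal, with the mean/variance derived later in these notes
    v = 1.0 / (alpha1**2 / sig2_X + 1.0 / sig2_Y)
    m = v * (alpha1 * (X - alpha0) / sig2_X + (beta0 + beta1 * W) / sig2_Y)
    Y = m + np.sqrt(v) * rng.normal(size=n)

    # [alpha | everything else] and [beta | everything else]: normal (conjugate regression update)
    alpha0, alpha1 = draw_coeffs(np.column_stack([np.ones(n), Y]), X, sig2_X)
    beta0, beta1 = draw_coeffs(np.column_stack([np.ones(n), W]), Y, sig2_Y)

    # [sigma^2 | everything else]: inverse gamma
    sig2_X = draw_inv_gamma(a0 + n / 2, b0 + 0.5 * np.sum((X - alpha0 - alpha1 * Y) ** 2))
    sig2_Y = draw_inv_gamma(a0 + n / 2, b0 + 0.5 * np.sum((Y - beta0 - beta1 * W) ** 2))

    # --- step (3): record the current state and repeat ---
    samples.append([alpha0, alpha1, beta0, beta1, sig2_X, sig2_Y])

samples = np.array(samples[N // 2:])          # drop the first half as burn-in
print("posterior means:", samples.mean(axis=0))
```

Histograms of the columns of `samples` (and of saved draws of $Y$) give a picture of the corresponding posterior densities, which is exactly the point of sampling.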

Quick Summary:

  • $Y$: unobserved variable

  • $X$: observed variable (data)

  • Likelihood: $[X|Y]$ (conditional probability density)

  • Prior: $[Y]$

    (in reality, in a hierarchical setting, we give a model for $Y$ which uses external data $W$)

  • Parameters: $\Theta \Longrightarrow$ we also need $[X|\Theta]$ and $[\Theta]$

Example:

$$X_t = \alpha_0 + \alpha_1 Y_t + \sigma_X \epsilon_{X,t}$$

$$Y_t = \beta_0 + \beta_1 W_t + \sigma_Y \epsilon_{Y,t}$$

$$\Rightarrow \Theta = \left(\alpha_0, \alpha_1, \sigma_X^2, \beta_0, \beta_1, \sigma_Y^2\right)$$
  • Perhaps the parameters need estimating: $[\Theta|X]$
  • For sure, we want $[Y|X]$

  • Bayes: $$[Y|X] \propto [X|Y] ~[Y]$$

    Note that we do not need the denominator for the exact equality. If we did need it, we would find it by integrating the numerator over all $Y$ values. Typically, that integral cannot be computed.

    $$[\Theta|X] \propto [X | \Theta] ~[\Theta]$$

  • Our mathematical setup:

    • Normal priors for $\alpha_0, \alpha_1, \beta_0, \beta_1$

    • Inverse Gamma priors for $\sigma_X^2, \sigma_Y^2$

  • Our shortcut:

    • use Gibbs sampler so that the only thing we really need is posteriors given everything else
  • Example: we pretend $X \& Y$ are one-dimensional: $N = 1$ (we ignore $t$)

$$[Y | X, \Theta] \propto [X|Y,\Theta] ~ [Y|\Theta]$$

We use the idea that we are flipping $X \& Y$, as if we were computing $[Y|X]$ with $\Theta$ known.

  • $X_t = \alpha_0 + \alpha_1 Y_t + \sigma_X \epsilon_{X,t} \Longrightarrow $ $\displaystyle X|Y,\Theta \sim \mathcal{N}(\alpha_0 + \alpha_1 Y, \sigma_X^2) $

    $$ = \frac{1}{\sqrt{2\pi \sigma_X^2}} \exp \left(-\frac{1}{2\sigma_X^2} \left(X - (\alpha_0 + \alpha_1Y)\right)^2\right)$$

  • $Y_t = \beta_0 + \beta_1 W_t + \sigma_Y \epsilon_{Y,t} \Longrightarrow Y|\Theta \sim \mathcal{N}(\beta_0 + \beta_1 W, \sigma_Y^2)$

    $$ = \frac{1}{\sqrt{2\pi \sigma_Y^2}} \exp \left(-\frac{1}{2\sigma_Y^2} \left(Y - (\beta_0 + \beta_1W)\right)^2\right)$$

  • Back to $[Y|X,\Theta] \propto \text{ product of both terms above} $

    $$\propto \exp \left(-\frac{1}{2} \left[ \frac{1}{\sigma_X^2} \left(X-(\alpha_0 + \alpha_1Y)\right)^2 + \frac{1}{\sigma_Y^2} \left(Y - (\beta_0 + \beta_1 W)\right)^2\right]\right) \\ \propto \exp \left(-\frac{1}{2} \left[\frac{-2X\alpha_1Y + 2\alpha_0 \alpha_1 Y + \alpha_1^2Y^2}{\sigma_X^2} + \frac{Y^2 - 2Y(\beta_0 + \beta_1W)}{\sigma_Y^2}\right]\right)$$

    (in the second line we expanded the squares and kept only the terms that involve $Y$)

Note: if a density is proportional to $\exp(aX^2 + bX + c)$ with $a<0$, then it is the density of $\displaystyle \mathcal{N}\left(-\frac{b}{2a}, \frac{1}{-2a}\right)$.
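A one-line check of this rule (for $a<0$), by completing the square:

$$\exp\left(aX^2 + bX + c\right) \;\propto\; \exp\left(a\left(X + \frac{b}{2a}\right)^{2}\right) \;=\; \exp\left(-\,\frac{\left(X - \left(-\frac{b}{2a}\right)\right)^{2}}{2\left(-\frac{1}{2a}\right)}\right),$$

so the mean is $-\frac{b}{2a}$ and the variance is $-\frac{1}{2a} = \frac{1}{-2a}$.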

Now that we know how to recognize a normal distribution, we see that this is a normal density because the exponent is quadratic in $Y$.

$$[Y|X,\Theta] \propto \exp\left(-\frac{1}{2}\left(Y^2\left(\frac{\alpha_1^2}{\sigma_X^2} + \frac{1}{\sigma_Y^2}\right) + Y\left(\frac{-2X\alpha_1 + 2\alpha_0 \alpha_1}{\sigma_X^2} + \frac{-2\beta_0 - 2\beta_1 W}{\sigma_Y^2}\right)\right)\right)$$

From the previous calculations, we see that this is a normal density with

  • mean: $\displaystyle - \frac{\frac{-2X\alpha_1 + 2\alpha_0 \alpha_1}{\sigma_X^2} + \frac{-2\beta_0 - 2\beta_1 W}{\sigma_Y^2}}{2\left(\frac{\alpha_1^2}{\sigma_X^2} + \frac{1}{\sigma_Y^2}\right)}$

  • variance: $\displaystyle \frac{1}{\frac{\alpha_1^2}{\sigma_X^2} + \frac{1}{\sigma_Y^2}}$
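Cancelling the factors of $2$, that mean can be rewritten as a precision-weighted combination of the two sources of information about $Y$:

$$\mathbb{E}\left[Y \mid X, \Theta\right] = \frac{\dfrac{\alpha_1 (X-\alpha_0)}{\sigma_X^2} + \dfrac{\beta_0 + \beta_1 W}{\sigma_Y^2}}{\dfrac{\alpha_1^2}{\sigma_X^2} + \dfrac{1}{\sigma_Y^2}},$$

i.e. a weighted average of the value of $Y$ suggested by the data, $(X-\alpha_0)/\alpha_1$, with weight $\alpha_1^2/\sigma_X^2$, and the prior mean $\beta_0 + \beta_1 W$, with weight $1/\sigma_Y^2$.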

Interesting fact: we notice that the posterior variance is equal to $\displaystyle \frac{1}{\displaystyle \frac{1}{\text{prior variance of }Y} + \frac{1}{\text{variance of }Y\text{ implied by the likelihood model}}}$; in other words, the posterior precision (inverse variance) is the sum of the two precisions.

We just proved that (at least for what we need in the Gibbs sampler) a normal prior and a normal likelihood imply a normal posterior. The same works for the $\alpha$'s and $\beta$'s.

In other words: $$\left[\left(\begin{array}{c}\alpha_0\\\alpha_1\end{array}\right) ~\biggr\vert~ X, Y, \sigma_X^2, \sigma_Y^2, \left(\begin{array}{c}\beta_0\\\beta_1\end{array}\right)\right] \sim \mathcal{N}$$

$$\left[\left(\begin{array}{c}\beta_0\\\beta_1\end{array}\right) ~\biggr\vert~ X, Y, \sigma_X^2, \sigma_Y^2, \left(\begin{array}{c}\alpha_0\\\alpha_1\end{array}\right)\right] \sim \mathcal{N}$$
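For instance, under an assumed $\mathcal{N}(0, \tau^2 I)$ prior on $(\alpha_0, \alpha_1)^{\top}$ (with $\tau^2$ a hyperparameter of our choosing), writing $A$ for the design matrix with rows $(1, Y_t)$, the standard conjugate regression update gives

$$\left(\begin{array}{c}\alpha_0\\\alpha_1\end{array}\right) \biggr\vert~ X, Y, \sigma_X^2 \;\sim\; \mathcal{N}\left(\Sigma\, \frac{A^{\top}X}{\sigma_X^2},\; \Sigma\right), \qquad \Sigma = \left(\frac{A^{\top}A}{\sigma_X^2} + \frac{I}{\tau^2}\right)^{-1},$$

and symmetrically for $(\beta_0, \beta_1)^{\top}$, with $Y$ as the response and rows $(1, W_t)$ in the design matrix.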

What about the $\sigma$s?

  • Using the same method, we can show that if the prior on $\sigma^2$ is inverse Gamma, and the equation involving $\sigma$ is linear with normal errors, then $[\sigma^2 \mid \text{everything else}]$ is also inverse Gamma.
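A sketch of that computation, assuming an $IG(a, b)$ prior on $\sigma_X^2$ (with $a, b$ hyperparameters of our choosing) and $n$ time points:

$$[\sigma_X^2 \mid X, Y, \alpha] \;\propto\; (\sigma_X^2)^{-n/2}\exp\left(-\frac{\sum_t (X_t-\alpha_0-\alpha_1 Y_t)^2}{2\sigma_X^2}\right)\,(\sigma_X^2)^{-a-1}\exp\left(-\frac{b}{\sigma_X^2}\right),$$

which is proportional to the density of $IG\left(a+\frac{n}{2},\; b+\frac{1}{2}\sum_t (X_t-\alpha_0-\alpha_1 Y_t)^2\right)$.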

These priors, especially those on the various components of $\Theta$, are called conjugate priors.

