Joint Embedding

by Shangsi Wang, reviewed by Eric Bridgeford

Setting

Task

Given $\left\{G_i\right\}_{i=1}^n$, a collection of graphs with adjacency matrices $A_i \in \mathbb{R}^{v\times v}$, where each graph is defined on the same vertex set of size $v$, we wish to embed each $G_i$ into $\mathbb{R}^d$ as a vector $\lambda_i \in \mathbb{R}^d$, where $\mathbb{R}^d$ is a $d$ dimensional space.

Notation:

\begin{align*} H &= \begin{bmatrix} h_1^{(1)} & ...&h_d^{(1)} \\ h_1^{(2)} & ...& h_d^{(2)} \\ \vdots & \ddots & \vdots \\ h_1^{(v)} & ... & h_d^{(v)} \end{bmatrix} \end{align*}

where $H \in \mathbb{R}^{v \times d}$ and each column $h_k \in \mathbb{R}^v$ is a norm-1 vector. We can then use the Multiple Random Eigen Graphs (MREG) model to form a probability matrix for each graph. First, we define the probability of each edge as a linear combination of the rank-one matrices $h_k h_k^T$ built from these norm-1 vectors, weighted by the coefficients $\lambda_i$:

\begin{align*} A_i[s, t] \sim Bern\left(p = \sum_{k=1}^d\lambda_i^{(k)} h_k^{(s)}h_k^{(t)} \right) \end{align*}

Collecting these edge probabilities, the probability matrix for graph $i$ is:

\begin{align*} P_i &= \sum_{k=1}^d \lambda_i^{(k)} h_k h_k^T \end{align*}

And we can say that our graphs are simulated from the MREG model with coefficients $\Lambda$ and $H$:

\begin{align*} \left\{A_i \right\}_{i=1}^n \sim MREG(\Lambda, H) \end{align*}

that is, each graph $A_i$ lies (in expectation) in the space spanned by the rank-one matrices $h_k h_k^T$, with coordinates given by $\lambda_i$.
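
To make the generative model concrete, here is a minimal numpy sketch of sampling adjacency matrices from MREG. The function name `sample_mreg` is ours, and it assumes the parameters are chosen so that every edge probability $\sum_k \lambda_i^{(k)} h_k^{(s)} h_k^{(t)}$ lies in $[0, 1]$.

```python
# A sketch of sampling undirected, hollow adjacency matrices from MREG.
import numpy as np

def sample_mreg(Lambda, H, rng=None):
    """Lambda: (n, d) coefficients, one row per graph; H: (v, d), unit-norm columns."""
    rng = np.random.default_rng(rng)
    n, _ = Lambda.shape
    v = H.shape[0]
    graphs = []
    for i in range(n):
        P = (H * Lambda[i]) @ H.T           # P_i = sum_k lambda_i^(k) h_k h_k^T
        upper = rng.random((v, v)) < P      # Bernoulli draw for each potential edge
        A = np.triu(upper, k=1).astype(float)
        graphs.append(A + A.T)              # symmetrize, no self-loops
    return graphs
```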

Loss Function

\begin{align*} L(\Lambda, H | A) = \sum_{i=1}^n \left|\left|A_i - \sum_{k=1}^d \lambda_i^{(k)} h_kh_k^T\right|\right|_F^2 \end{align*}
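
Written directly in numpy (a sketch; the norm above is taken to be the Frobenius norm, and `mreg_loss` is an illustrative name):

```python
# The MREG loss: sum over graphs of squared Frobenius reconstruction error.
import numpy as np

def mreg_loss(graphs, Lambda, H):
    loss = 0.0
    for i, A in enumerate(graphs):
        P = (H * Lambda[i]) @ H.T           # sum_k lambda_i^(k) h_k h_k^T
        loss += np.linalg.norm(A - P, "fro") ** 2
    return loss
```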

Statistical Goal

The statistical goal is to find the norm-1 vectors $h_k$ that define the subspace, and the coefficients $\lambda_i$ that define the unique combination of the $h_k h_k^T$ for each $A_i$, so as to minimize the loss over the $A_i$. Stated as an optimization problem, we have:

\begin{align*} \left(\hat{\Lambda}, \hat{H}\right) &= \textrm{argmin}_{\lambda_i, h_k:||h_k||=1} \sum_{i=1}^n \left|\left|A_i - \sum_{k=1}^d \lambda_i^{(k)} h_kh_k^T\right|\right|_F^2 \end{align*}

Another problem (addressed in other papers referenced by Shangsi) is how to choose $d$, the dimensionality of the subspace into which we embed. In experimental settings it is fine to set $d$ arbitrarily; in real-world settings, $d$ can be chosen by running the algorithm over a range $d = 1, 2, \dots, D$ for some sufficiently large $D$, plotting the objective attained at the optimal parameters for each $d$, and selecting the $d$ at which the objective roughly begins to flatten out. Intuitively, this amounts to fitting increasingly large $d$ until adding another dimension to the subspace yields only a negligible improvement in the objective, at which point we take that $d$ as approximately optimal.
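
Here is a sketch of this heuristic, where `joint_embedding` and `loss_fn` stand in for the fitting routine and loss evaluation (for example, the sketches elsewhere in this note); the elbow is still read off the plot by eye:

```python
# Fit the model for d = 1, ..., D and plot the attained objective (a sketch).
import matplotlib.pyplot as plt

def scree_over_d(graphs, D, joint_embedding, loss_fn):
    losses = []
    for d in range(1, D + 1):
        Lambda_hat, H_hat = joint_embedding(graphs, d)
        losses.append(loss_fn(graphs, Lambda_hat, H_hat))
    plt.plot(range(1, D + 1), losses, marker="o")
    plt.xlabel("embedding dimension d")
    plt.ylabel("objective at fitted parameters")
    plt.show()
    return losses    # choose d where the curve begins to flatten out
```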

Desiderata

It is clear that the hypothesis class is going to be heavily non-convex. This means that any algorithm is going to have to make concessions somewhere to arrive at a robust, reliable solution in a computationally efficient amount of time. So our desiderata are:

  1. works well in theory in certain settings
  2. empirically performs well in simulation settings
  3. empirically performs well in real data scenarios
  4. is relatively fast in that it doesn't traverse the entire hypothesis set, as the hypothesis set is incredibly large.

Approach

Algorithm

For this algorithm, the hypothesis set is incredibly large: we are attempting to estimate two high-dimensional quantities, $\Lambda$ and $H$, simultaneously as we pass over each of the graphs. To accomplish this, Shangsi chooses an alternating descent approach: hold $\lambda_k$ fixed and update $h_k$ with a gradient-descent step on the loss, then hold $h_k$ fixed and update $\lambda_k$ by solving the resulting least-squares problem, alternating until convergence for each dimension $k=1:d$ of the embedding. Our algorithm is as follows (a code sketch follows the pseudocode):

  1. Set residuals $R_i^{(1)} = A_i$ // initialize our error as the graphs themselves
  2. for $k=1:d$
    a. Initialize $h_k, \lambda_k$
    b. while not converged:

    • Fix $\lambda_k$, and update $h_k$ with a gradient-descent step on $L(\Lambda, H | A)$, minimizing with respect to $h_k$
    • Renormalize $h_k$ so that $||h_k|| = 1$
    • Fix $h_k$, and update $\lambda_k$ by minimizing $L(\Lambda, H | A)$ with respect to $\lambda_k$ (with $h_k$ fixed, this is a least-squares problem with a closed-form solution)
    • Recompute $L(\Lambda, H | A)$

    c. endwhile
    d. Update residuals $R_i^{(k+1)} = R_i^{(k)} - \lambda_{i}^{(k)}h_kh_k^T$

  3. endfor
  4. Output $\Lambda, H$
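
Below is a minimal Python sketch of the alternating-descent procedure above; it is not the paper's reference implementation, and the random initialization, fixed step size, and stopping tolerance are illustrative choices. The closed-form $\lambda$ update uses the fact that, with $||h_k|| = 1$, the least-squares minimizer is $\lambda_i^{(k)} = h_k^T R_i h_k$.

```python
# A sketch of the alternating-descent joint embedding described above.
import numpy as np

def joint_embedding(graphs, d, n_iter=100, step=1e-3, tol=1e-6, rng=None):
    rng = np.random.default_rng(rng)
    n, v = len(graphs), graphs[0].shape[0]
    Lambda, H = np.zeros((n, d)), np.zeros((v, d))
    R = [A.astype(float) for A in graphs]              # residuals R_i^(1) = A_i
    for k in range(d):
        h = rng.standard_normal(v)
        h /= np.linalg.norm(h)                         # initialize ||h_k|| = 1
        lam = np.ones(n)
        prev_loss = np.inf
        for _ in range(n_iter):
            # Fix lambda_k and take one gradient step in h_k, then renormalize.
            grad_h = sum(-4.0 * lam[i] * (R[i] - lam[i] * np.outer(h, h)) @ h
                         for i in range(n))
            h -= step * grad_h
            h /= np.linalg.norm(h)
            # Fix h_k and solve the least-squares problem for lambda_k:
            # with ||h_k|| = 1, the minimizer is lambda_i^(k) = h_k^T R_i h_k.
            lam = np.array([h @ R[i] @ h for i in range(n)])
            # Recompute the loss and stop once it no longer improves.
            loss = sum(np.linalg.norm(R[i] - lam[i] * np.outer(h, h), "fro") ** 2
                       for i in range(n))
            if abs(prev_loss - loss) < tol:
                break
            prev_loss = loss
        Lambda[:, k], H[:, k] = lam, h
        # Update residuals R_i^(k+1) = R_i^(k) - lambda_i^(k) h_k h_k^T.
        R = [R[i] - lam[i] * np.outer(h, h) for i in range(n)]
    return Lambda, H
```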

Evaluation Methods

  1. Generate a simple example with a $1$ dimensional subspace and vary the number of graphs to see how effectively we can find the joint embedding by checking the mean difference between our single prediction vector and the true norm-$1$ vector.
  2. Generate pairs of related graphs and find a classifier with minimal classification error from a $2$ dimensional subspace, measuring accuracy as the correct-classification rate.
  3. Predict the composite creativity index from samples of DTI

Results Overview

Simulations

Basic Simulation

For this experiment, we generate $\lambda_i \sim Unif(1, 2)$ and $h_1$, the single norm-1 vector on which our graphs embed. Then we simulate graphs $\left\{G_i\right\}_{i=1}^n \sim MREG(\left\{\lambda_i\right\}_{i=1}^n, h_1)$ using the $MREG$ model described above.

We run the joint-embedding algorithm 100 times, varying the number of simulated graphs as $n=2^j$ for $j=4, \dots, 16$. We compute the quantity $||\hat{h}_1^n - h_1||$, the error of the estimate $\hat{h}_1^n$ obtained from $n$ graphs after convergence; this is simply a measure of how close we are to the true $h_1$. As we can see in the plots below, the more graph examples we have, the more accurately we are able to estimate $h_1$. Additionally, we look at the model convergence $||\hat{h}_1^n - \hat{h}_1^{n/2}||$, which compares the estimate $\hat{h}_1^n$ on $n$ graphs to the previous estimate $\hat{h}_1^{n/2}$ on $n/2$ graphs, and verifies that our estimates converge towards a solution as we increase the number of graphs.

<img="joint embedding simple graph", src="./images/week_25/JE_simple.png">

As we can see, our estimate gets closer to the true $h_1$ (blue line), and our solution converges as we increase the number of graphs.
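
A compact version of this experiment is sketched below, assuming the `sample_mreg` and `joint_embedding` cells above have been run; the particular $v$, $n$, and the sign alignment of $\hat{h}_1$ are illustrative choices rather than the settings behind the figure.

```python
# Sketch of the one-dimensional recovery experiment; assumes sample_mreg and
# joint_embedding from the cells above are in scope.
import numpy as np

rng = np.random.default_rng(0)
v, n = 50, 2 ** 10
h1 = rng.random(v)
h1 /= np.linalg.norm(h1)                          # the true norm-1 vector
Lambda_true = rng.uniform(1, 2, size=(n, 1))      # lambda_i ~ Unif(1, 2)
graphs = sample_mreg(Lambda_true, h1[:, None], rng=rng)

Lambda_hat, H_hat = joint_embedding(graphs, d=1, rng=rng)
h1_hat = np.sign(h1 @ H_hat[:, 0]) * H_hat[:, 0]  # h_1 is only identified up to sign
print("||h1_hat - h1|| =", np.linalg.norm(h1_hat - h1))
```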

Classification Simulation

For this experiment, we generate $m$ pairs $\left\{(A_i, Y_i)\right\}_{i=1}^m$ of binary graphs with binary labels $Y_i \in \left\{0, 1\right\}$, with our embedding space defined by two vectors $h_1$ and $h_2$. To classify the graphs, $m=200$ graphs are sampled and embedded using the Joint-Embedding algorithm, a $1$-NN rule is used to classify them, and prediction accuracy is measured as the number of graphs varies from $m=4$ to $200$. As can be seen below, Joint Embedding (JE) vastly outperforms the comparable Laplacian Eigenmap (LE) algorithm in classification accuracy.

<img="JE vs LE", src="./images/week_25/JE_vs_LE.png">

Real Data

For this study, Shangsi investigates using the Joint-Embedding algorithm to predict an individual's composite creativity index (CCI) given sets of diffusion tensor imaging (DTI)-derived connectomes. Predictions are made by embedding 113 graphs with $d=10$ and fitting the linear model:

\begin{align*} CCI_i \sim \beta_0 + \hat{\lambda}^T_i \beta + \epsilon_i \end{align*}

that is, a linear regression of $CCI_i$ onto $\hat{\lambda}_i$. The model performs significantly better than the null model, with a p-value of 0.0018 from an $F$-test. We compare this to a regression on the raw graphs themselves, which produces a p-value of 0.0039. The joint-embedding model in particular reveals that having more connections across hemispheres is associated with greater creativity, while the experiment on the raw graphs only indicates that more overall connectivity is associated with greater creativity.

<img="Real Data", src="./images/week_25/JE_real.png">

