Each row of $X^T$ is one-sparse and of unit length. Each element of $Y$ is a probability.
A probabilistic realization of $P$ produces a synthetic dataset $A$: each element is drawn $A_{ij} \sim \text{Bernoulli}(P_{ij})$. We also observe only a subset of the entries of $A$; let $\Omega$ denote the set of indices $(i,j)$ of these observations.
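As a concrete illustration, here is a minimal sketch of this generative process in Julia. The function name realize, the observation probability obs_prob, and the convention of marking unobserved entries with NaN are assumptions for illustration (the notebook's own generate_reads below plays this role):

# Sketch: draw a binary realization of the m-by-n probability matrix P,
# hiding each entry independently with probability 1 - obs_prob (NaN = unobserved).
function realize(P, obs_prob)
    m, n = size(P)
    A = fill(NaN, m, n)
    for i = 1:m, j = 1:n
        if rand() < obs_prob
            A[i,j] = rand() < P[i,j] ? 1.0 : 0.0   # A_ij ~ Bernoulli(P_ij)
        end
    end
    return A
end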
Given a dataset matrix $A$, we would like to estimate the underlying probability matrix $P$ by the matrix factorization:
$$ \hat{X}^T \hat{Y} \approx P $$
However, we only observe $A$ (binary outcomes drawn from $P$). Given this loss of information, what is the "correct" solution for our estimates?
First, let's tackle the problem of estimating the cluster assignments, i.e. the matrix $\hat{X}$. Consider a single read $a_i$ (a binary vector and a row of $A$). We can use Bayes' Rule to derive the probability that $a_i$ belongs to a given cell type $h$:
$$p(h~|~a_i) = \frac{p(a_i~|~h) p(h)}{p(a_i)}$$
If we assume we know the true methylation profiles (the matrix $Y$), we can calculate $p(a_i~|~h)$, which is the likelihood for a multivariate Bernoulli distribution:
$$ p(a_i~|~h) = \prod_{j \in \Omega_i} \left[ A_{ij} Y_{hj} + (1-A_{ij})(1-Y_{hj}) \right] \text{ ,}$$
where $\Omega_i$ is the set of observed indices in row $i$ of $A$. If we make no further assumptions, and let $p(h)$ be uniform over the cell types, then we arrive at a solution that is in some sense the best possible estimate of $X$. Call this correct/optimal estimate $\tilde{X}$. Each column of $\tilde{X}$ is given by:
$$\tilde{x}_i = N_i \begin{bmatrix} p(a_i~|~h=1)\\ p(a_i~|~h=2)\\ \vdots \\ p(a_i~|~h=k) \end{bmatrix} $$
where $N_i = \frac{p(h)}{p(a_i)}$ is a normalizing constant that ensures $\tilde{x}_i$ sums to one. In practice we don't need to calculate this normalization constant explicitly; we just calculate each $p(a_i~|~h)$ and normalize the vector to sum to one.
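For concreteness, here is a minimal sketch of this computation for a single read in Julia. The names posterior, a, and Y are placeholders, the read a is assumed to mark unobserved positions with NaN, and the est_X function later in this notebook performs the same computation for every read:

# Posterior over the k cell types for one read a, under a uniform prior p(h).
# Unobserved positions (NaN) are skipped, so the product runs over Omega_i only.
function posterior(a, Y)
    k, n = size(Y)
    lkhs = ones(k)
    for h = 1:k, j = 1:n
        if !isnan(a[j])
            lkhs[h] *= a[j] == 1.0 ? Y[h,j] : 1 - Y[h,j]
        end
    end
    return lkhs / sum(lkhs)   # normalize so the posterior sums to one
end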
We could alternatively choose to assume that we know the prior probabilities, $p(h)$, for each cell type. This is equivalent to assuming we know the proportions of each cell type in the sample. Then the solution becomes
$$\tilde{x}_i = N_i \begin{bmatrix} p(a_i~|~h=1)p(h=1)\\ p(a_i~|~h=2)p(h=2)\\ \vdots \\ p(a_i~|~h=k)p(h=k) \end{bmatrix} $$
where $N_i = \frac{1}{p(a_i)}$. As before, we don't need to calculate $N_i$ explicitly.
In the previous section we assumed that $Y$ was perfectly reconstructed from $A$ in order to bound the best possible estimate of $X$. Now we reverse this logic by assuming $X$ is known, and $Y$ is to be estimated. What is the best possible estimate of $Y$, $\tilde{Y}$?
Let $\Theta_{h,j}$ denote the indices in column $j$ of the data matrix $A$ that are both observed (in $\Omega$) and belong to cell type $h$. Then the maximum likelihood estimate of $Y_{h,j}$ is the observed proportion of methylation at position $j$ among reads originating from cell type $h$:
$$ \tilde{Y}_{h,j} = \underset{p}{\text{argmax}} \prod_{i \in \Theta_{h,j}} \left[ A_{ij} p + (1-A_{ij})(1-p) \right] = \frac{\sum_{i \in \Theta_{h,j}} A_{ij}}{N_{h,j}} \text{ ,}$$
where $N_{h,j}$ is the number of reads for cell type $h$ at position $j$ (i.e. the number of elements in $\Theta_{h,j}$).
At the moment this analysis assumes $X$ is a hard clustering assignment. When $X$ contains soft assignments we should instead take a weighted MLE for each $Y_{h,j}$, as sketched below.
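One natural form for this weighted estimate (written here as an assumption, since the text leaves it unstated) weights each read's contribution by its soft assignment $\tilde{X}_{h,i}$:
$$ \tilde{Y}_{h,j} = \frac{\sum_{i \in \Omega_j} \tilde{X}_{h,i} A_{ij}}{\sum_{i \in \Omega_j} \tilde{X}_{h,i}} \text{ ,}$$
where $\Omega_j = \{i : (i,j) \in \Omega\}$ is the set of reads observed at position $j$. When each column of $\tilde{X}$ is a hard indicator vector this reduces to the estimate above.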
So far we have derived very optimistic bounds on performance by assuming we know the ground truth for either $X$ or $Y$ and then estimating the other. In order to derive more realistic bounds on performance we can alternate between fixing $X$ and $Y$: fix $Y$ and estimate $X$, then fix $X$ and estimate $Y$, and repeat.
In [1]:
using MethylClust
using PyPlot
In [2]:
m,n,k = 100,100,2
A,P,c = generate_reads(N=100,W=30);
# ground truth X
Xr = zeros(k,m)
for i = 1:m
    Xr[c[i],i] = 1
end
# ground truth Y
Yr = P
Out[2]:
In [3]:
# estimate best X given the data and Y
function est_X(A,Y)
    Xn = zeros(k,m)
    #priors = (1/m) * [sum(c .== ik) for ik=1:k]
    for i = 1:m
        # Restrict computations to the range of the read
        ind = find(!isnan(A[i,:]))
        # Compute likelihoods for each cell type
        lkhs = ones(k)
        for kk = 1:k
            for j = ind
                @assert !isnan(A[i,j])
                # Bernoulli likelihood contribution of position j (using the passed-in Y, not the ground truth P)
                lkhs[kk] *= (A[i,j]==1.0) ? Y[kk,j] : 1-Y[kk,j]
                #lkhs[kk] += A[i,j]*log(Y[kk,j])
                #lkhs[kk] += (1-A[i,j])*log(1-Y[kk,j])
            end
        end
        # normalize posterior probabilities so they sum to one
        Xn[:,i] = lkhs / sum(lkhs)
        #Xn[i,:] = (lkhs.*priors) / sum(lkhs.*priors)
    end
    return Xn
end
# estimate best Y given the data and X
function est_Y(A,X)
    Yn = zeros(k,n)
    #priors = (1/m) * [sum(c .== ik) for ik=1:k]
    # hard cluster assignment for each read
    c_est = [indmax(X[:,i]) for i=1:m]
    for j = 1:n
        # reads with an observation at position j
        obs_ind = find(!isnan(A[:,j]))
        for h = 1:k
            # observed reads assigned to cell type h
            ind = intersect(obs_ind,find(c_est .== h))
            # fraction of those reads that are methylated at position j
            Yn[h,j] = sum(A[ind,j] .== 1.0) / length(ind)
        end
    end
    return Yn
end
Out[3]:
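The hard-assignment est_Y above can be replaced by the weighted MLE discussed earlier. A minimal sketch of that variant, following the same NaN and global-variable conventions as est_Y (the name est_Y_weighted is an assumption, and it is not used in the runs below):

# Weighted MLE for Y: each observed read contributes to cell type h in
# proportion to its soft assignment X[h,i] rather than a hard argmax label.
function est_Y_weighted(A,X)
    Yn = zeros(k,n)
    for j = 1:n
        obs_ind = find(!isnan(A[:,j]))
        for h = 1:k
            num = 0.0; den = 0.0
            for i = obs_ind
                num += X[h,i] * A[i,j]
                den += X[h,i]
            end
            Yn[h,j] = num / den
        end
    end
    return Yn
end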
In [4]:
# First show true X and true Y
matshow(Xr,cmap=ColorMap("Greys")),xticks([]),yticks([]),title("True X")
matshow(Yr,cmap=ColorMap("Greys")),xticks([]),yticks([]),title("True Y")
Out[4]:
In [6]:
X = deepcopy(Xr)
Y = deepcopy(Yr)
println(mean(abs(X-Xr)),"\t",mean(abs(Y-Yr)))
for _ = 1:10
    X = est_X(A,Y)
    Y = est_Y(A,X)
    println(mean(abs(X-Xr)),"\t",mean(abs(Y-Yr)))
end
matshow(X,cmap=ColorMap("Greys"),clim=([0,1])),xticks([]),yticks([]),title("est X")
matshow(Y,cmap=ColorMap("Greys"),clim=([0,1])),xticks([]),yticks([]),title("est Y")
#println(mean(abs(X-Xr)))
#println(mean(abs(Y-Yr)))
Out[6]:
In [9]:
X = deepcopy(Xr)
Y = deepcopy(Yr)
for _ = 1:100
    Y = est_Y(A,X)
    X = est_X(A,Y)
end
matshow(X,cmap=ColorMap("Greys"),clim=([0,1])),xticks([]),yticks([]),title("est X")
matshow(Y,cmap=ColorMap("Greys"),clim=([0,1])),xticks([]),yticks([]),title("est Y")
println(mean(abs(X-Xr)))
println(mean(abs(Y-Yr)))