An iterative approach to estimate best-case performance?

ground truth of the dataset:

$$ P = X^T Y \text{ ; with } P \in \mathbf{R}^{m \times n} \text{ , } X \in \mathbf{R}^{k \times m} \text{ , } Y \in \mathbf{R}^{k \times n} $$

Each row of $X^T$ is one-sparse of unit length. Each element of $Y$ is a probability.

A probabilistic realization of $P$ produces a synthetic dataset $A$; for each element $A_{ij} \sim \text{Bernoulli}(P_{ij})$. We also only observe a subset of $A$; let $\Omega$ denote the set of indices $(i,j)$ of these observations.

Given a dataset matrix $A$, we would like to estimate the underlying probability matrix $P$ by the matrix factorization:

$$ \hat{X}^T \hat{Y} \approx P $$

However, we only observe $A$ (binary outcomes drawn from $P$). Given this loss of information, what is the "correct" solution for our estimates?

best solution for $X$

First, let's tackle the problem for estimating the cluster assignments; i.e. the matrix $\hat{X}$. Consider a single read $a_i$ (a binary vector and a row of $A$). We can use Bayes' Rule to derive the probability that $a_i$ belongs to a given cell type $h$:

$$p(h~|~a_i) = \frac{p(a_i~|~h) p(h)}{p(a_i)}$$

If we assume we know the true methylation profiles (the matrix $Y$), we can calculate $p(a_i~|~h)$, which is the likelihood for a multivariate Bernoulli distribution:

$$ p(a_i~|~h) = \prod_{j \in \Omega_i} A_{ij} Y_{hj} + (1-A_{ij})(1-Y_{hj}) \text{ ,}$$

where $\Omega_i$ is the set of observed indices in row $i$ of $A$. If we make no further assumptions, and let $p(h)$ be uniform over the cell types then we arrive a solution for that is in some sense the best possible estimate of $X$. Call this correct/optimal estimate $\tilde{X}$. Each column of $\tilde{X}$:

$$\tilde{x}_i = N_i \begin{bmatrix} p(a_i~|~h=1)\\ p(a_i~|~h=2)\\ \vdots \\ p(a_i~|~h=k) \end{bmatrix} $$

where $N_i = \frac{p(h)}{p(a_i)}$ is a normalizing constant that ensures $\tilde{x}_i$ sums to one. In practice we don't need to calculate this normalization constant explicitly; we just calculate each $p(a_i~|~h)$ and normalize the vector to sum to one.

We could alternatively choose to assume that we know the prior probailities, $p(h)$, for each cell type. This is equivalent to assuming we know the proportions of each cell type in the sample. Then the solution becomes

$$\tilde{x}_i = N_i \begin{bmatrix} p(a_i~|~h=1)p(h=1)\\ p(a_i~|~h=2)p(h=2)\\ \vdots \\ p(a_i~|~h=k)p(h=k) \end{bmatrix} $$

where $N_i = \frac{1}{p(a_i)}$. As before, we don't need to calculate $N$ explicitly.

best solution for $Y$

In the previous section we assumed that $Y$ was perfectly reconstructed from $A$ in order to bound the best possible estimate of $X$. Now we reverse this logic by assuming $X$ is known, and $Y$ is to be estimated. What is the best possible estimate of $Y$, $\tilde{Y}$?

Let $\Theta_{h,j}$ denote the indices in column $j$ of the data matrix $A$ that are both observed (in $\Omega$) and belonging to cell type $h$. Then the maximum likelihood estimate of $Y_{h,j}$ is observed proportion of methylated sites position $j$ for reads originating from cell type $h$:

$$ \tilde{Y}_{h,j} = \underset{p}{\text{argmax}} \prod_{i \in \Theta_{h,j}}A_{ij} p + (1-A_{ij})(1-p) = \frac{\sum_{i \in \Theta_{h,j}} A_{ij}}{N_{h,j}}$$

where $N_{h,j}$ is number of reads for cell type $h$ at position $j$ (i.e. the number of elements in $\Theta_{h,j}$).

At the moment this analysis assumes $X$ is a hard clustering assignment... Instead we should take a weighted MLE for each $Y_{h,j}$

best solution we could hope for?

So far we have derived very optimistic bounds on performance by assuming we know the ground truth for either $X$ or $Y$, and then estimating the other. In order to derive more realistic bounds on performance we can alternate between fixing $X$ and $Y$


In [1]:
using MethylClust
using PyPlot


INFO: Loading help data...

In [2]:
m,n,k = 100,100,2
A,P,c = generate_reads(N=100,W=30);

# ground truth X
Xr = zeros(k,m)
for i = 1:m
    Xr[c[i],i] = 1
end

# ground truth Y
Yr = P


Out[2]:
2x100 Array{Float64,2}:
 0.280212  0.45799   0.98048    0.569843  …  0.468104  0.230964  0.368798
 0.655155  0.265631  0.0761956  0.358521     0.590719  0.580406  0.521198

In [3]:
# estimate best X given the data and Y
function est_X(A,Y)
    Xn = zeros(k,m)
    #priors = (1/m) * [sum(c .== ik) for ik=1:k]

    for i = 1:m
        # Restrict computations to range of the read
        ind = find(!isnan(A[i,:]))

        # Compute likelihoods for each cell type
        lkhs = ones(k)
        for kk = 1:k
            for j = ind
                assert(!isnan(A[i,j]))
                #lkhs[kk] += (A[i,j]==1.0) ? log(P[kk,j]) : log(1-P[kk,j])
                lkhs[kk] *= (A[i,j]==1.0) ? P[kk,j] : 1-P[kk,j]
                #lkhs[kk] += A[i,j]*log(Y[kk,j])
                #lkhs[kk] += (1-A[i,j])*log(1-Y[kk,j])
            end
        end

        # normalize posterior probabilities so they sum to one
        Xn[:,i] = lkhs / sum(lkhs)
        #Xn[i,:] = (lkhs.*priors) / sum(lkhs.*priors)
    end
    return Xn
end


# estimate best Y given the data and X
function est_Y(A,X)
    Yn = zeros(k,n)
    #priors = (1/m) * [sum(c .== ik) for ik=1:k]
    
    c_est = [indmax(X[:,i]) for i=1:m]

    for j = 1:n
        obs_ind = find(!isnan(A[:,j]))
        for h = 1:k
            ind = intersect(obs_ind,find(c_est .== h))
            Yn[h,j] = sum(A[ind,j] .== 1.0) / length(ind)
        end
    end
    return Yn
end


Out[3]:
est_Y (generic function with 1 method)

In [4]:
# First show true X and true Y
matshow(Xr,cmap=ColorMap("Greys")),xticks([]),yticks([]),title("True X")
matshow(Yr,cmap=ColorMap("Greys")),xticks([]),yticks([]),title("True Y")


Out[4]:
(PyObject <matplotlib.image.AxesImage object at 0x112bb6210>,({},{}),({},{}),PyObject <matplotlib.text.Text object at 0x112b76f90>)

In [6]:
X = deepcopy(Xr)
Y = deepcopy(Yr)
println(mean(abs(X-Xr)),"\t",mean(abs(Y-Yr)))
for _ = 1:10
    X = est_X(A,Y)
    Y = est_Y(A,X)
    println(mean(abs(X-Xr)),"\t",mean(abs(Y-Yr)))
end
matshow(X,cmap=ColorMap("Greys"),clim=([0,1])),xticks([]),yticks([]),title("est X")
matshow(Y,cmap=ColorMap("Greys"),clim=([0,1])),xticks([]),yticks([]),title("est Y")
#println(mean(abs(X-Xr)))
#println(mean(abs(Y-Yr)))


0.0	0.0
0.009327208167505483	0.08570543026463932
0.009327208167505483	0.08570543026463932
0.009327208167505483	0.08570543026463932
0.009327208167505483	0.08570543026463932
0.009327208167505483	0.08570543026463932
0.009327208167505483	0.08570543026463932
0.009327208167505483	0.08570543026463932
0.009327208167505483	0.08570543026463932
0.009327208167505483	0.08570543026463932
0.009327208167505483	0.08570543026463932
Out[6]:
(PyObject <matplotlib.image.AxesImage object at 0x1165ddb50>,({},{}),({},{}),PyObject <matplotlib.text.Text object at 0x1165a89d0>)

In [9]:
X = deepcopy(Xr)
Y = deepcopy(Yr)
for _ = 1:100
    Y = est_Y(A,X)
    X = est_X(A,Y)
end
matshow(X,cmap=ColorMap("Greys"),clim=([0,1])),xticks([]),yticks([]),title("est X")
matshow(Y,cmap=ColorMap("Greys"),clim=([0,1])),xticks([]),yticks([]),title("est Y")
println(mean(abs(X-Xr)))
println(mean(abs(Y-Yr)))


0.009327208167505483
0.08570543026463932

In [13]:
X


Out[13]:
2x100 Array{Float64,2}:
 0.935705   0.976045  0.0211237  …  0.0274073  0.000946332  0.00429871
 0.0642948  0.023955  0.978876      0.972593   0.999054     0.995701  

In [8]:



Out[8]:
100-element Array{Any,1}:
 1
 1
 1
 1
 1
 1
 2
 1
 1
 1
 1
 1
 1
 ⋮
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2

In [ ]: