Implementation of Generalized correlation measure using count statistics for gene expression data with ordered samples

https://academic.oup.com/bioinformatics/article/34/4/617/4474788

Given two times series $\mathbf{x} = (x_1, x_2 \dots x_m)$ and $\mathbf{y} = (y_1, y_2, \dots, y_n)$, we want to define a correlation measure that captures local association between $\mathbf{x}$ and $\mathbf{y}$ even in cases when yhey are not aligned. Given there are localized correaltions, the two time series cannot be aligned globally. That is, there are local subsequences which exhibit local correlation. We do not know apriori the location of the subsequences, but probably know how much far apart at the maximum they can occur in.

Define $\phi$ a rank function, such that $\phi(x_i, x_{i+1} \dots x_{i+n}) = (i_1, i_2, i_3 \dots i_n)$ such that $x_({i_1}) = x_1$, $x_({i_2}) = x_2$,and so on, where $x_(j)$ represents the $j^{th}$ order statistic. To calrify, an example would be $\phi((7,3,4)) = (3,1,2)$ as $3 < 4 < 7$

Also define $\mathbb{I}_(i, j)^+$ to be the indicator when the length $k$ subsequence starting at $i^{th}$ position of $\mathbf{x}$ and $j^{th}$ position of $\mathbf{y}$ have the identical ranking. That is: $$ \mathbb{I}_(i, j)^+ = \mathbb{I}(\phi((x_{i}, x_{i+1}, \dots x_{i+k-1})) = \phi((y_{j}, y_{j+1}, \dots y_{j+k-1})) $$

Similary, when the rank patterns are completely reversed: $$ \mathbb{I}_(i, j)^- = \mathbb{I}(\phi((x_{i}, x_{i+1}, \dots x_{i+k-1})) = \phi((-y_{j}, -y_{j+1}, \dots y_{j+k-1})) $$

For some maximum lag difference of $d$ between $x_i$ and $y_j$, define $v =(i,j)$; $S = \{ v = (i, j) | 1 \leq i \leq m-k+1; 1 \leq j \leq n-k+1, |i-j| \leq d \}$, then correaltions measures are deined to be: \begin{align*} V^+ &= \sum_{v\in S} \mathbb{I}_v^+\\ V^- &= \sum_{v\in S} \mathbb{I}_v^-\\ V &= V^+ + V^- \end{align*}

Assumption 1: Either of $\mathbf{x}$ or $\mathbf{y}$ has an exchangable distribution, ithat is any permutation of $\mathbb{x} = (x_{i_1}, x_{i_2}, \dots x_{i_n})$ produces the same joint probability distirbution as $(x_1, x_2, \dots x_n$).

Assumption 2: $\mathbf{x}$ and $\mathbf{y}$ are independent,

WLOG consider $m=n$

Because of Assumption 1, $P\left(\phi((y_i, y_{i+1}, \dots y_{i+k-1})) = r \right) = \frac{1}{k!}$ as each pemutation is equally likely.

\begin{align*} \mathbb{E}[I_{i,j}]^+ &= \sum_r P\left(\phi((x_i, x_{i+1}, \dots x_{i+k-1})) = r , \phi((y_i, y_{i+1}, \dots y_{i+k-1})) = r \right) \\ &= \sum_r P\left(\phi((x_i, x_{i+1}, \dots x_{i+k-1})) = r | \phi((y_i, y_{i+1}, \dots y_{i+k-1})) = r \right) P\left(\phi((y_i, y_{i+1}, \dots y_{i+k-1})) = r \right) \\ &= \sum_r P\left(\phi((x_i, x_{i+1}, \dots x_{i+k-1})) = r \right) P\left(\phi((y_i, y_{i+1}, \dots y_{i+k-1})) = r \right) \because \text{$\mathbf{x,y}$ are independent}\\ &= \frac{1}{k!} \sum_r P\left(\phi((x_i, x_{i+1}, \dots x_{i+k-1})) = r \right) P\left(\phi((y_i, y_{i+1}, \dots y_{i+k-1})) = r \right) \\ &= \frac{1}{k!} \end{align*}

Similarly, $$ \mathbb{E}[I_{i,j}]^- = \frac{1}{k!} $$

Let $n_{k,d}$ represent the total number of terms given the constraints $1 \leq i \leq m-k+1; 1 \leq j \leq n-k+1, |i-j| \leq d \}$ We have $\bar{n} = n-k+1$ such $i$ to be summed over.

Consider $ j \leq d$, then $j-d \leq i \leq j+d$ then each $j$ has $2d+1$ positions to be summed over. However $j <d$ leads to edge effects. This contributes to $2 \times \frac{d(d+1)}{2}$

Thus, $$ \mathbb{E}[V] = \frac{2}{k!} ( (2d+1)(n-k+1) - d(d+1) ) $$

Variance calculations : TODO



In [37]:

    
%pylab inline









    



Populating the interactive namespace from numpy and matplotlib



In [38]:

    
def corr(x, y, k, d):
    x = np.asarray(x)
    y = np.asarray(y)
    m = len(x)
    n = len(y)
    Vp = np.zeros((m,n))
    Vn = np.zeros((m,n))
    for i in range(0, m-k+1):
        for j in range(max(0, i-d), min(n-k, i+d)+1):
            xsub = x[range(i, i+k)]
            ysub = y[range(j, j+k)]
            rankx = np.argsort(xsub)
            ranky = np.argsort(ysub)
            indp_ij = 0
            indn_ij = 0
            if np.array_equal(rankx, ranky):
                indp_ij = 1
            rankyn = np.argsort(-ysub)
            if np.array_equal(rankx, rankyn):
                indn_ij = 1
            Vp[i,j] = indp_ij
            Vn[i,j] = indn_ij
    return Vp, Vn



In [39]:

    
x = [1.2, 2.3, 3.5, 4.1, 5.8]
y = [3.1, 2.4, 0.9, 4.3, 5.5]
Vp, Vn = corr(x, y, 3, 1)



In [40]:

    
Vp









    Out[40]:





array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])



In [41]:

    
Vn









    Out[41]:





array([[ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])



In [ ]: