Personal view
In a narrow biological sense, data integration means taking multiple datasets, either at a single omics level or across a multi-omics stack, and extracting information that is holistic in nature. The way to achieve this is by learning model/feature representations from complex datasets and by designing algorithms for knowledge transfer. In wider industry circles, much of the drive in this area comes from recommender systems (a biological equivalent would be picking the right genes from a set of samples, or the right sample from a set of genes; unfortunately, biological research is a backwater in this respect, largely due to the p-value obsession driving the publishing industry).
Classes of methods:
We have a final-day task in data integration; for now, we will use a dimension-reduction technique that uses linear decompositions to find a subspace of maximal covariation between two datasets. First, let us construct a test dataset: a very useful exercise that lets you make your method more robust and tune its parameters.
In [1]:
import numpy as np
# number of samples
n = 500
# latent variables:
l1 = np.random.normal(size=n)
l2 = np.random.normal(size=n)
l3 = np.random.normal(size=n)
l4 = np.random.normal(size=n)
# X mixes l1, l2, l3 and Y mixes l2, l3, l4, so l2 and l3 are shared
latent_X = np.array([l1, l1, l2, l2, l2, l3]).T
latent_Y = np.array([l2, l3, l4, l4]).T
# observed data: half latent signal, half independent Gaussian noise
X = 0.50 * latent_X + 0.50 * np.random.normal(size=6 * n).reshape((n, 6))
Y = 0.50 * latent_Y + 0.50 * np.random.normal(size=4 * n).reshape((n, 4))
# split the samples into a training half and a test half
X_train = X[:n // 2]
Y_train = Y[:n // 2]
X_test = X[n // 2:]
Y_test = Y[n // 2:]
print("Corr(X)")
print(np.round(np.corrcoef(X.T), 2))
print("Corr(Y)")
print(np.round(np.corrcoef(Y.T), 2))
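Corr(X) and Corr(Y) only show the correlation structure within each dataset. As a sanity check on the construction (a minimal sketch, reusing the X and Y arrays from the cell above), we can also look at the cross-correlations between the columns of X and Y, which should be non-negligible only for the columns built from the shared latents l2 and l3:
In [ ]:
# np.corrcoef stacks its two inputs into one (6+4) x (6+4) matrix;
# slice out the 6 x 4 block of X-vs-Y cross-correlations
print("Corr(X, Y)")
print(np.round(np.corrcoef(X.T, Y.T)[:6, 6:], 2))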
Many clustering algorithms would now fail because the two datasets share only one common dimension (here, the samples; depending on how your data is oriented, think of genes or samples). Moreover, the biological signal is obscured by different latent variables that can be sources of biological and technical noise.
CCA
Canonical correlation analysis seeks linear combinations of X and Y that have maximum correlation with each other. The technique is sequential: the first pair of components has scores with the highest attainable correlation, the second pair is orthogonal (uncorrelated) to the first, and so on. We can visualize this with a couple of plots.
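Formally, the first pair of weight vectors a and b maximizes the correlation of the projected scores; in the standard formulation (with Sigma denoting the sample covariance blocks of the column-centred data):

$$\rho_1 = \max_{a,b} \operatorname{corr}(Xa,\, Yb) = \max_{a,b} \frac{a^{\top} \Sigma_{XY}\, b}{\sqrt{a^{\top} \Sigma_{XX}\, a}\,\sqrt{b^{\top} \Sigma_{YY}\, b}}$$

Each subsequent pair maximizes the same ratio subject to its scores being uncorrelated with all earlier pairs.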
In [2]:
from sklearn.cross_decomposition import CCA
# fit a two-component CCA on the training half
cca = CCA(n_components=2)
cca.fit(X_train, Y_train)
# project both views onto their canonical components
X_train_r, Y_train_r = cca.transform(X_train, Y_train)
X_test_r, Y_test_r = cca.transform(X_test, Y_test)
In [3]:
print(X_train_r.shape, Y_train_r.shape)
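The transforms return one score column per canonical component. To verify that CCA found maximally correlated directions, and that they generalize beyond the training half, we can compute the per-component correlation between the paired X and Y scores (a minimal sketch using the arrays from above):
In [ ]:
# correlation between the paired canonical scores, per component
for k in range(2):
    r_train = np.corrcoef(X_train_r[:, k], Y_train_r[:, k])[0, 1]
    r_test = np.corrcoef(X_test_r[:, k], Y_test_r[:, k])[0, 1]
    print("component %d: train corr = %.2f, test corr = %.2f"
          % (k + 1, r_train, r_test))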
In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 8))

# 1) On the diagonal: plot the X scores against the Y scores for each component
plt.subplot(221)
plt.scatter(X_train_r[:, 0], Y_train_r[:, 0], label="train",
            marker="o", c="b", s=25)
plt.scatter(X_test_r[:, 0], Y_test_r[:, 0], label="test",
            marker="o", c="r", s=25)
plt.xlabel("x scores")
plt.ylabel("y scores")
plt.title('Comp. 1: X vs Y (test corr = %.2f)' %
          np.corrcoef(X_test_r[:, 0], Y_test_r[:, 0])[0, 1])
plt.xticks(())
plt.yticks(())
plt.legend(loc="best")

plt.subplot(224)
plt.scatter(X_train_r[:, 1], Y_train_r[:, 1], label="train",
            marker="o", c="b", s=25)
plt.scatter(X_test_r[:, 1], Y_test_r[:, 1], label="test",
            marker="o", c="r", s=25)
plt.xlabel("x scores")
plt.ylabel("y scores")
plt.title('Comp. 2: X vs Y (test corr = %.2f)' %
          np.corrcoef(X_test_r[:, 1], Y_test_r[:, 1])[0, 1])
plt.xticks(())
plt.yticks(())
plt.legend(loc="best")

# 2) Off the diagonal: plot component 1 vs component 2 for X and for Y
plt.subplot(222)
plt.scatter(X_train_r[:, 0], X_train_r[:, 1], label="train",
            marker="*", c="b", s=50)
plt.scatter(X_test_r[:, 0], X_test_r[:, 1], label="test",
            marker="*", c="r", s=50)
plt.xlabel("X comp. 1")
plt.ylabel("X comp. 2")
plt.title('X comp. 1 vs X comp. 2 (test corr = %.2f)'
          % np.corrcoef(X_test_r[:, 0], X_test_r[:, 1])[0, 1])
plt.legend(loc="best")
plt.xticks(())
plt.yticks(())

plt.subplot(223)
plt.scatter(Y_train_r[:, 0], Y_train_r[:, 1], label="train",
            marker="*", c="b", s=50)
plt.scatter(Y_test_r[:, 0], Y_test_r[:, 1], label="test",
            marker="*", c="r", s=50)
plt.xlabel("Y comp. 1")
plt.ylabel("Y comp. 2")
plt.title('Y comp. 1 vs Y comp. 2 (test corr = %.2f)'
          % np.corrcoef(Y_test_r[:, 0], Y_test_r[:, 1])[0, 1])
plt.legend(loc="best")
plt.xticks(())
plt.yticks(())
plt.show()
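In the biological framing from above (picking the right genes), we usually also want to relate the canonical components back to the original features. A minimal sketch, assuming the fitted estimator from In [2] and the x_rotations_/y_rotations_ attributes that scikit-learn's CCA exposes:
In [ ]:
# rotation matrices map the (centred) original features to canonical scores:
# rows = features, columns = components; large entries flag influential features
print("X rotations (6 features x 2 components)")
print(np.round(cca.x_rotations_, 2))
print("Y rotations (4 features x 2 components)")
print(np.round(cca.y_rotations_, 2))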
Task: