SMI is a measure of the similarity between the dominant subspaces of two matrices. It comes in two flavours (projections): Orthogonal Projection (OP) and Procrustes Rotation (PR).
The former (default) compares subspaces using ordinary least squares and can be formulated as the explained variance when predicting one matrix's subspace from the other matrix's subspace. PR is a restriction of OP where only rotation and scaling are allowed in the similarity calculations.
Subspaces are by default computed using Principal Component Analysis (PCA). When the number of components extracted from one of the matrices is smaller than in the other, the explained variance is calculated by predicting the smaller subspace using the larger subspace.
X1 is a matrix of size (100 x 300) generated by drawing 30,000 pseudo-random numbers before column centring. X2 is a copy of X1 where the 3rd principal component has been omitted: singular value decomposition is performed on X1, and X2 is reconstructed without the 3rd left and right singular vectors and without the 3rd singular value.
In [2]:
import numpy as np
import hoggorm as ho

# Random matrix, column centred
X1 = ho.center(np.random.rand(100, 300))
# Remove the 3rd singular triplet to obtain X2
U, s, V = np.linalg.svd(X1, full_matrices=False)
X2 = np.delete(U, 2, 1) @ np.diag(np.delete(s, 2)) @ np.delete(V, 2, 0)
X1.shape
Out[2]:
(100, 300)
Having removed the 3rd principal component from one of the matrices, one would expect the first two components to be exactly equal. The third component should yield an explained variance equal to the proportion retained, i.e. 2/3, while the following components will have an increasing proportion of explained variance (SMI). In the prints below, rows correspond to the first matrix and columns correspond to the second matrix. An extra subspace dimension in X1 will always cover the extra component in X2, while the lost component in X2 can never be compensated for, hence the pattern below.
More details regarding the use of the SMI are found in the documentation.
In [3]:
smiOP = ho.SMI(X1, X2, ncomp1=10, ncomp2=10)
print(np.round(smiOP.smi, 2))
A hypothesis test can be performed regarding the similarity of two subspaces, where the null hypothesis is that they are equal and the alternative is that they are not. Permutation testing yields the following P-values (the probability of observing a difference at least as large as the one seen, given that the null hypothesis is true).
In [4]:
print(np.round(smiOP.significance(), 2))
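The permutation mechanics can be sketched as follows (a simplified illustration, not hoggorm's exact procedure; the helper `op_smi` and the one-sided tail convention are assumptions of this sketch): shuffling the rows of one matrix breaks the pairing between samples, and the observed SMI is compared against SMI values recomputed over many shuffles.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((50, 20))
Y = X + 0.1 * rng.standard_normal((50, 20))  # strongly related to X

def op_smi(A, B, p, q):
    # OP-SMI of the leading p- and q-dimensional score subspaces
    Ua = np.linalg.svd(A - A.mean(0), full_matrices=False)[0][:, :p]
    Ub = np.linalg.svd(B - B.mean(0), full_matrices=False)[0][:, :q]
    return np.sum((Ua.T @ Ub) ** 2) / min(p, q)

observed = op_smi(X, Y, 2, 2)
# permutation distribution: shuffle the rows of X, recompute the SMI
n_perm = 200
perm = np.array([op_smi(X[rng.permutation(50)], Y, 2, 2)
                 for _ in range(n_perm)])
# one-sided permutation P-value (tail convention assumed for this sketch)
p_value = (1 + np.sum(perm >= observed)) / (n_perm + 1)
```

With closely related matrices the observed SMI sits far above the permutation distribution, giving a small P-value; for unrelated matrices it falls inside the distribution.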
Now import the hoggormplot package and plot the similarities.
In [5]:
# import hoggormplot for plotting
import hoggormplot as hop
In [6]:
# Plot similarities
hop.plotSMI(smiOP, [10, 10])
The significance symbols in the diamond plot above indicate whether a chosen subspace from one matrix can be found inside the subspace from the other matrix ($\supset$, $\subset$, =), or whether there is a significant difference (P-values: < 0.001 ***, < 0.01 **, < 0.05 *, < 0.1 ., >= 0.1 no symbol).
In [7]:
smiPR = ho.SMI(X1, X2, ncomp1=10, ncomp2=10, projection="Procrustes")
print(np.round(smiPR.smi, 2))
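Both flavours can be related to the principal angles between the subspaces. As a cross-check (a sketch assuming SciPy is available; the matrices here are arbitrary examples), the OP-SMI equals the mean squared cosine of the principal angles between the two column spaces:

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(7)
A = rng.standard_normal((100, 5))
B = rng.standard_normal((100, 5))

# principal angles between the column spaces of A and B (SciPy)
theta = subspace_angles(A, B)
smi_from_angles = np.mean(np.cos(theta) ** 2)

# the same quantity from orthonormal bases: ||Qa' Qb||_F^2 / min(p, q)
Qa, _ = np.linalg.qr(A)
Qb, _ = np.linalg.qr(B)
smi_direct = np.sum((Qa.T @ Qb) ** 2) / 5

print(np.isclose(smi_from_angles, smi_direct))  # True
```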
The number of permutations can be controlled for quick (100) or accurate (>10000) computations of significance.
In [8]:
print(np.round(smiPR.significance(B=100), 2))
In [9]:
# precomputed scores (here the left singular vectors) can be supplied directly
smiCustom = ho.SMI(X1, X2, ncomp1=10, ncomp2=10, Scores1=U)
Reference:
Ulf Geir Indahl, Kristian Hovde Liland, Tormod Næs, A similarity index for comparing coupled matrices, Journal of Chemometrics, 32, e3049, 2018.