Iterative search for the variables that covary most with the Y response vector. After the first PV is found, the X matrix is reduced (deflated) and the search is repeated for the next variable (a sketch follows the reference below).
KW: supervised methods
Nørgaard, L., Saudland, A., Wagner, J., Nielsen, J. P., Munck, L., & Engelsen, S. B. (2000). Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy. Applied Spectroscopy, 54(3), 413–419. http://doi.org/10.1366/0003702001949500
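A minimal sketch of this covariance-based forward search, assuming a numeric X matrix and a single response vector y; the function name and the exact deflation step are illustrative, not taken from the reference:

import numpy as np

def cov_forward_selection(X, y, n_vars):
    """Greedily pick the columns of X that covary most with y.

    After each pick the remaining columns are deflated (projected onto the
    orthogonal complement of the selected column) before the next search.
    """
    Xw = X - X.mean(axis=0)          # work on centred copies
    yw = y - y.mean()
    selected = []
    for _ in range(n_vars):
        cov = Xw.T @ yw              # covariance of every remaining column with y
        j = int(np.argmax(np.abs(cov)))
        selected.append(j)
        xj = Xw[:, [j]]
        # deflate: remove the selected column's contribution from the matrix
        Xw = Xw - xj @ (xj.T @ Xw) / (xj.T @ xj)
    return selected

The function returns the column indices in the order they were picked, e.g. cov_forward_selection(X, y, n_vars=10).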
Apply a univariate statistic to every variable and validate the prediction on a test set for each of them. The variable with the lowest RMSEP (error on the test/validation set) is chosen. All two-variable models are then built and evaluated, and so on, until the inclusion of a new variable no longer improves the RMSEP. Care must be taken not to overfit during the selection, and an independent validation set is often required (see the sketch after the reference below).
KW: supervised methods
Nørgaard, L., Saudland, A., Wagner, J., Nielsen, J. P., Munck, L., & Engelsen, S. B. (2000). Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy. Applied Spectroscopy, 54(3), 413–419. http://doi.org/10.1366/0003702001949500
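A rough sketch of this greedy RMSEP-driven forward selection, using ordinary least squares as the per-subset model and a held-out validation set (all names are illustrative):

import numpy as np

def rmsep(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def forward_rmsep_selection(X_train, y_train, X_val, y_val):
    """Add, one at a time, the variable that lowers RMSEP on the validation
    set the most; stop when no remaining candidate improves it."""
    remaining = list(range(X_train.shape[1]))
    selected, best_rmsep = [], np.inf
    while remaining:
        scores = []
        for j in remaining:
            cols = selected + [j]
            # least-squares model with an intercept on the candidate subset
            A = np.column_stack([np.ones(len(y_train)), X_train[:, cols]])
            coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
            Av = np.column_stack([np.ones(len(y_val)), X_val[:, cols]])
            scores.append((rmsep(y_val, Av @ coef), j))
        score, j = min(scores)
        if score >= best_rmsep:          # no improvement: stop
            break
        best_rmsep = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_rmsep

Because the validation set drives the selection it is no longer independent, so the final error should be reported on a separate test set, as noted above.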
The loadings are used to weight the X matrix recursively until convergence. Only the number of variables is chosen arbitrarily (sketched after the reference below).
KW: supervised methods
Rinnan, Å., Andersson, M., Ridder, C., & Engelsen, S. B. (2014). Recursive weighted partial least squares (rPLS): An efficient variable selection method using PLS. Journal of Chemometrics, 28(5), 439–447. http://doi.org/10.1002/cem.2582
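A rough sketch of the recursive reweighting idea, using the absolute PLS regression coefficients as the weights; this is not the paper's exact algorithm, and sklearn's PLSRegression stands in for whatever PLS implementation is used:

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def rpls_weights(X, y, n_components=2, max_iter=100, tol=1e-8):
    """Recursively reweight the columns of X with the absolute PLS
    regression coefficients until the weights stabilise."""
    w = np.ones(X.shape[1])
    for _ in range(max_iter):
        pls = PLSRegression(n_components=n_components, scale=False)
        pls.fit(X * w, y)
        b = np.abs(np.ravel(pls.coef_))
        b = b / b.max()                   # keep the weights bounded
        w_new = w * b
        if np.max(np.abs(w_new - w)) < tol:
            break
        w = w_new
    return w

Uninformative variables see their weights shrink towards zero, so the variables with the largest final weights are the selected ones.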
The spectrum is split into intervals (of fixed or variable size) and a model is built for each of them. The intervals with the best results point to the regions of the spectrum of most interest for the global model (see the sketch after the reference below).
Remark: how does it behave with very small intervals (the literature uses intervals twice as big as the regular interval for PLS)? How does it compare with univariate variable selection?
Nørgaard, L., Saudland, A., Wagner, J., Nielsen, J. P., Munck, L., & Engelsen, S. B. (2000). Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy. Applied Spectroscopy, 54(3), 413–419. http://doi.org/10.1366/0003702001949500
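A minimal iPLS-style sketch, assuming equally sized, non-empty intervals and using cross-validated RMSE from sklearn's PLSRegression as the per-interval score (the interval width and the number of components are exactly the choices questioned in the remark above):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def ipls_scores(X, y, n_intervals=20, n_components=5, cv=5):
    """Fit one PLS model per spectral interval and return its cross-validated
    RMSE, so intervals can be ranked against the full-spectrum model."""
    edges = np.linspace(0, X.shape[1], n_intervals + 1, dtype=int)
    scores = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        n_comp = max(1, min(n_components, hi - lo))   # interval may be narrow
        pls = PLSRegression(n_components=n_comp)
        y_cv = cross_val_predict(pls, X[:, lo:hi], y, cv=cv)
        scores.append(np.sqrt(np.mean((y - np.ravel(y_cv)) ** 2)))
    return edges, np.array(scores)

Intervals whose RMSE beats (or approaches) the full-spectrum model point to the most informative regions.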
In [1]:
import numpy as np
import scipy as sp
import pandas as pd   # used further down for the density plots
import matplotlib.pyplot as plt
%matplotlib inline
R = 5*1.8  # km
A = np.pi*np.power(R, 2)  # area of a circle of radius R, in km^2
A
Out[1]:
Determining the number of components in a PCA is crucial because it makes it possible to distinguish the informative variance from the noise. The former is expected to be larger than the latter, so the first eigenvalues of the covariance matrix (the variances of the first components) should be larger than the following ones.
The Tracy-Widom probability density function describes the distribution of the largest eigenvalue of a completely random covariance matrix. It therefore provides a test of whether the data have structure (informative variables) or not; a compact version of the test is sketched after the note below.
Saccenti, E., & Timmerman, M. E. (2016). Approaches to sample size determination for multivariate data: Applications to PCA and PLS-DA of omics data. Journal of Proteome Research, 15(8), 2379–2393. http://doi.org/10.1021/acs.jproteome.5b01029
It did not work.
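A compact version of the test explored in the cells below, assuming n samples of p noise variables; the standardised largest eigenvalue is the quantity to compare against tabulated quantiles of the Tracy-Widom (TW1) distribution, which scipy does not provide:

import numpy as np

def tracy_widom_statistic(X):
    """Standardise the largest eigenvalue of X'X with Johnstone's constants.
    Under the null (pure noise) the statistic approximately follows TW1; a
    large value suggests real structure in the data."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # column-centre the data
    l1 = np.linalg.eigvalsh(Xc.T @ Xc).max()     # largest eigenvalue of X'X (not X'X/n)
    mu = (np.sqrt(n - 1) + np.sqrt(p)) ** 2
    sigma = (np.sqrt(n - 1) + np.sqrt(p)) * (1 / np.sqrt(n - 1) + 1 / np.sqrt(p)) ** (1 / 3)
    return (l1 - mu) / sigma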
In [2]:
# simulate pure noise: n samples of p i.i.d. standard-normal variables
n = 2000
p = 500
S = np.matrix( np.random.normal(0, 1, n*p).reshape(n, p) )
C = np.cov(np.transpose(S))                        # sample covariance (for comparison)
S = np.subtract(S, np.mean(S, axis=0))             # column-centre S
#np.round( np.mean( S, axis=0 ) )
C2 = np.divide(np.matmul(np.transpose(S), S), n)   # S'S/n, the (biased) covariance matrix
#C = C - np.diag(C)
#C = np.identity(p)
#C[0,0] = 10
#C[1,1] = 8
A = np.linalg.eig(C2)                              # A[0]: eigenvalues, A[1]: eigenvectors
plt.title('sorted eigenvalues')
#plt.plot(np.arange(C.shape[0]), A[0])
plt.plot(np.arange(C.shape[0]), sorted(np.real(A[0]), reverse=True))
plt.show()
plt.title('limiting histogram of eigenvalues')
plt.hist(sorted(np.real(A[0])), bins=50)
plt.show()
df = pd.DataFrame(np.real(A[0]))
df.plot.density(title='limiting density of eigenvalues')
In [360]:
np.var(S[:,1])
Out[360]:
In [314]:
np.mean(S[:,1])
Out[314]:
In [379]:
# distribution of the largest eigenvalue over many small pure-noise matrices
n = 5
p = 5
D = []
L = 10000
for i in range(0, L):
    #S = np.matrix( (sp.randn(n,p)) )
    S = np.matrix( np.random.normal(0, 1, n*p).reshape(n, p) )
    S = np.subtract(S, np.mean(S, axis=0))
    C = np.divide(np.matmul(np.transpose(S), S), n)
    A = np.linalg.eig(C)
    D.append(np.max(np.real(A[0])))    # keep the largest eigenvalue of each draw
plt.plot(np.arange(L), sorted(D, reverse=True))
plt.show()
In [380]:
plt.hist(D,bins=100)
plt.show()
In [381]:
df = pd.DataFrame(D)
df.plot(kind='density')
Out[381]:
In [391]:
C = np.identity(p)
C[0,0] = 10
C[1,1] = 8
In [397]:
C
Out[397]:
In [398]:
np.linalg.eig(C)
Out[398]:
In [401]:
C = np.arange(0,9).reshape(3,3)
C
Out[401]:
In [403]:
A = np.linalg.eig(C)
A
Out[403]:
In [407]:
np.matmul( A[1], np.transpose(A[0]) )
Out[407]:
In [390]:
plt.matshow(np.matmul(np.transpose(S),S))
plt.show()
np.matmul(np.transpose(S),S)
Out[390]:
In [385]:
np.percentile(D,50)
Out[385]:
In [310]:
plt.hist(sorted(np.real(A[0])),bins=50)
plt.show()
In [249]:
np.power( np.sqrt(n)+np.sqrt(p), 2 )
Out[249]:
In [328]:
# Johnstone's centring and scaling constants for the largest eigenvalue of S'S
# (they apply to the eigenvalues of S'S, not S'S/n, so an eigenvalue of the
#  C computed above would have to be multiplied by n before standardising)
mu_np = np.power( np.sqrt(n-1) + np.sqrt(p), 2)
sigma_np = np.multiply( np.sqrt(n-1) + np.sqrt(p), np.power(1/np.sqrt(n-1)+1/np.sqrt(p),1/3))
In [329]:
(np.max(np.real(A[0])) - mu_np) / sigma_np   # np.linalg.eig does not sort, so take the largest eigenvalue explicitly
Out[329]:
In [330]:
A[0][0]
Out[330]:
In [240]:
np.std(S[:,0])
Out[240]:
In [241]:
plt.hist(S[:,0],bins=50)
plt.show()
In [ ]:
In [215]:
plt.hist(S[:,0],bins=50)
plt.show()
In [216]:
C[1,1:10]
Out[216]:
In [190]:
C2[1,1:10]
Out[190]:
In [191]:
A[1]
Out[191]:
In [192]:
np.var(C[0,:])
Out[192]:
In [193]:
plt.matshow(C2)
plt.show()
In [194]:
plt.plot(np.arange(C.shape[0]), C[:,3])
plt.show()
In [325]:
from scipy.stats import gaussian_kde
density = gaussian_kde(np.real(A[0]))
x = np.arange(0., 8, .1)
plt.plot(x, density(x))
plt.show()
In [280]:
import pandas as pd
df = pd.DataFrame(np.real(A[0]))
df.plot(kind='density')
Out[280]:
In [172]:
np.mean(np.real(A[0]))   # mean of the eigenvalues
In [ ]:
In [60]:
import csv
In [61]:
with open('/Users/jul/git/pipe-generate-dataset/dataset.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    dataset = np.array(list(reader)).astype("float")
In [62]:
dataset.shape
Out[62]:
In [63]:
with open('/Users/jul/git/pipe-generate-dataset/classMatrix.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    dataClass = np.array(list(reader)).astype("float")
In [64]:
dataClass
Out[64]:
In [86]:
from matplotlib.mlab import PCA   # removed in recent matplotlib releases; sklearn.decomposition.PCA is the usual replacement
model = PCA(dataset, standardize=False)
In [87]:
model.Y[:,0]
Out[87]:
In [88]:
plt.scatter(model.Y[:,0], model.Y[:,1])
Out[88]:
In [90]:
plt.scatter(np.arange(dataset.shape[1]), model.a[0,:])
Out[90]:
In [71]:
plt.scatter(np.arange(dataset.shape[1]), np.std(model.a[:,:], 0))
Out[71]:
In [75]:
plt.scatter(np.arange(dataset.shape[1]), model.a[1,:])
Out[75]:
In [79]:
plt.scatter(np.arange(dataset.shape[1]), np.std(model.a[:,:], 0))
Out[79]:
In [ ]: