In [2]:
%matplotlib inline
Especially popular in SNP and gene expression studies, feature selection offers a myriad of methods for choosing the variables of interest. The module sklearn.feature_selection contains several feature selection options, such as univariate filter methods and recursive feature elimination.
Another class of feature selection, SelectFromModel, works with any estimator that can score features and order them by importance. Many sparse estimators, such as Lasso, L1-penalized logistic regression, and LinearSVC, can be used.
In [1]:
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)
# C controls the sparsity, try smaller for fewer features!
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(X_new.shape)
Rather than zeroing out coefficients, tree-based estimators compute feature importances, which SelectFromModel can use in the same way to discard irrelevant features.
In [2]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)
clf = ExtraTreesClassifier()
clf = clf.fit(X, y)
print(clf.feature_importances_)
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
print(X_new.shape)
Task:
Use the example below to fit a tree or a forest and compare results!
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
# Load the boston dataset.
boston = load_boston()
X, y = boston['data'], boston['target']
# We use the base estimator LassoCV since the L1 norm promotes sparsity of features.
clf = LassoCV()
# Set a minimum threshold of 0.25
sfm = SelectFromModel(clf, threshold=0.25)
sfm.fit(X, y)
n_features = sfm.transform(X).shape[1]
# Reset the threshold till the number of features equals two.
# Note that the attribute can be set directly instead of repeatedly
# fitting the metatransformer.
while n_features > 2:
    sfm.threshold += 0.1
    X_transform = sfm.transform(X)
    n_features = X_transform.shape[1]
# Plot the selected two features from X.
plt.title(
    "Features selected from Boston using SelectFromModel with "
    "threshold %0.3f." % sfm.threshold)
feature1 = X_transform[:, 0]
feature2 = X_transform[:, 1]
plt.plot(feature1, feature2, 'r.')
plt.xlabel("Feature number 1")
plt.ylabel("Feature number 2")
plt.ylim([np.min(feature2), np.max(feature2)])
plt.show()
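For the task above, a minimal sketch could look like the cell below. It reuses X and y from the cell above; RandomForestRegressor and the "median" threshold are just one possible choice, not the required solution. Compare which columns the forest keeps against the two features retained by LassoCV.
In [ ]:
# One possible solution sketch for the task above (RandomForestRegressor and
# the "median" threshold are assumptions, not the only valid choices).
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
# X, y are the Boston arrays loaded in the cell above
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
sfm_forest = SelectFromModel(forest, threshold="median", prefit=True)
X_forest = sfm_forest.transform(X)
print("Features kept by the forest:", X_forest.shape[1])
print("Feature importances:", forest.feature_importances_)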
Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms. This is partly because linear models do not cope well with data whose features exhibit multiple linear dependencies (see this for more).
Three benefits of performing feature selection before modeling your data are: reduced overfitting (less redundant data means fewer decisions driven by noise), improved accuracy (less misleading data lets the model fit the signal better), and reduced training time (fewer features train faster).
Often our data is not of full rank, which means that some variables can be written as linear combinations of others. PCA transforms the dataset into a number of uncorrelated components. The first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components.
For a visual explanation look here.
Mathematically, PCA is a linear transformation of a matrix $X(n,p)$ of $p$ features, defined by a set of $m$ $p$-dimensional vectors called loadings, ${w}_{(k)} = (w_1, \dots, w_p)_{(k)}$, that maps each row vector ${x}_{(i)}$ of $X$ to a vector of $m$ principal component scores ${t}_{(i)} = (t_1, \dots, t_m)_{(i)}$, given by
$${t_{k}}_{(i)} = {x}_{(i)} \cdot {w}_{(k)} \qquad \mathrm{for} \qquad i = 1,\dots,n \qquad k = 1,\dots,m$$
in such a way that the individual variables $t_1, \dots, t_m$, considered over the data set, successively inherit the maximum possible variance from $X$, with each loading vector $w$ constrained to be a unit vector. Numerically, there is a direct connection between the loading vectors and the spectral decomposition of the empirical covariance matrix (proportional to $X^TX$ when $X$ is centered): the first loading vector is the eigenvector associated with the largest eigenvalue, and so on for the following components.
With scikit-learn, PCA is part of the sklearn.decomposition module, and it is computed numerically via the singular value decomposition (SVD) of the centered data matrix.
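As a quick, optional sanity check of that statement, the sketch below compares scikit-learn's loading vectors with the right singular vectors of the centered Iris data; signs may differ, since each component is only defined up to a sign flip.
In [ ]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
X = load_iris().data
Xc = X - X.mean(axis=0)  # PCA operates on the centered data
# Loading vectors reported by scikit-learn (rows of components_)
W_sklearn = PCA(n_components=2).fit(X).components_
# Right singular vectors of the centered data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W_svd = Vt[:2]
# The two agree up to a possible sign flip per component
for k in range(2):
    same = np.allclose(W_sklearn[k], W_svd[k]) or np.allclose(W_sklearn[k], -W_svd[k])
    print("component", k + 1, "matches the SVD:", same)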
In [13]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets
np.random.seed(5)
iris = datasets.load_iris()
X = iris.data
y = iris.target
print("Shape of Iris dataset", iris.data.shape)
fig = plt.figure(1, figsize=(8, 6))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
#ax = Axes3D(fig, elev=-150, azim=110)
# feature extraction
plt.cla()
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X_reduced = pca.transform(X)
# summarize components
print("Explained Variance: ", pca.explained_variance_ratio_)
print("Principal components", pca.components_)
print("Shape of reduced dataset", X_reduced.shape)
for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X_reduced[y == label, 0].mean(),
              X_reduced[y == label, 1].mean() + 1.5,
              X_reduced[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels so the class colors group nicely in the plot
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y,
edgecolor='k')
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.set_ylabel("2nd eigenvector")
ax.set_zlabel("3rd eigenvector")
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
plt.show()
In [7]:
import matplotlib.pyplot as plt
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)
Out[7]:
Task:
First of all, note that PCA is not strictly a factor-analysis method, and the habit of speaking of scores versus loadings is either absent or used differently in other fields. Finish the code below to add proper labels for the loadings, and compute the actual loadings.
Further reading:
In [17]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X = iris.data
y = iris.target
# In general it is a good idea to scale the data first
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
pca = PCA()
x_new = pca.fit_transform(X)
def myplot(score, load, labels=None):
    xs = score[:, 0]
    ys = score[:, 1]
    n = load.shape[0]
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, c=y)
    for i in range(n):
        plt.arrow(0, 0, load[i, 0], load[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(load[i, 0] * 1.15, load[i, 1] * 1.15, "Var" + str(i + 1), color='g', ha='center', va='center')
        else:
            plt.text(load[i, 0] * 1.15, load[i, 1] * 1.15, labels[i], color='g', ha='center', va='center')
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()
# Call the function, using only the first 2 PCs.
myplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :]))
plt.show()
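One possible way to finish the task: pass iris.feature_names as labels and compute the loadings from pca.explained_variance_, under the common convention that loadings are the components scaled by the square root of their eigenvalues. Other texts use the raw components, so treat the sketch below as one convention rather than the definitive answer.
In [ ]:
# Task sketch: label the arrows with the real feature names and compute the
# loadings. Scaling components by sqrt(explained_variance_) is one common
# convention for "actual loadings"; the raw components are another.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
myplot(x_new[:, 0:2], loadings[:, 0:2], labels=iris.feature_names)
plt.show()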
ICA is an algorithm that finds directions in the feature space corresponding to projections with high non-Gaussianity.
Why:
Further reading:
In [3]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, FastICA
# Two independent, heavy-tailed sources: Student's t with 1.5 degrees of freedom
rng = np.random.RandomState(42)  # initialize the Mersenne Twister random number generator
S = rng.standard_t(1.5, size=(20000, 2))
S[:, 0] *= 2.
# Mix data
A = np.array([[1, 1], [0, 2]]) # Mixing matrix
X = np.dot(S, A.T) # Generate observations
# Train PCA and Fast ICA
pca = PCA()
S_pca_ = pca.fit(X).transform(X)
ica = FastICA(random_state=rng)
S_ica_ = ica.fit(X).transform(X) # Estimate the sources
S_ica_ /= S_ica_.std(axis=0)
# Plot results
def plot_samples(S, axis_list=None):
    plt.scatter(S[:, 0], S[:, 1], s=2, marker='o', zorder=10,
                color='steelblue', alpha=0.5)
    if axis_list is not None:
        colors = ['orange', 'red']
        for color, axis in zip(colors, axis_list):
            axis /= axis.std()
            x_axis, y_axis = axis
            # Trick to get legend to work
            plt.plot(0.1 * x_axis, 0.1 * y_axis, linewidth=2, color=color)
            plt.quiver(0, 0, x_axis, y_axis, zorder=11, width=0.01, scale=6,
                       color=color)
    plt.hlines(0, -3, 3)
    plt.vlines(0, -3, 3)
    plt.xlim(-3, 3)
    plt.ylim(-3, 3)
    plt.xlabel('x')
    plt.ylabel('y')
plt.figure()
plt.subplot(2, 2, 1)
plot_samples(S / S.std())
plt.title('True Independent Sources')
axis_list = [pca.components_.T, ica.mixing_]
plt.subplot(2, 2, 2)
plot_samples(X / np.std(X), axis_list=axis_list)
legend = plt.legend(['PCA', 'ICA'], loc='upper right')
legend.set_zorder(100)
plt.title('Observations')
plt.subplot(2, 2, 3)
plot_samples(S_pca_ / np.std(S_pca_, axis=0))
plt.title('PCA recovered signals')
plt.subplot(2, 2, 4)
plot_samples(S_ica_ / np.std(S_ica_))
plt.title('ICA recovered signals')
plt.subplots_adjust(0.09, 0.04, 0.94, 0.94, 0.26, 0.36)
plt.show()
Below, the components_ attribute returns an array containing measures of the relationship between the newly created factors, placed in rows, and the original features, placed in columns. Higher positive values indicate that a factor is positively associated with a feature. In this case, although we fit the factor model with four components, only two components seem to fit the dataset.
Factor analysis uses an expectation-maximization (EM) optimizer to find the Gaussian latent-variable model that best explains the data, stopping once the improvement falls below the tolerance tol. In simple terms, n_components is the dimensionality of the latent (factor) space.
In [2]:
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
iris = load_iris()
X, y = iris.data, iris.target
factor = FactorAnalysis(n_components=4, random_state=101).fit(X)
In [6]:
import pandas as pd
pd.DataFrame(factor.components_,columns=iris.feature_names)
Out[6]:
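To probe the claim that only two factors fit the data, one option is to compare the cross-validated log-likelihood for different values of n_components. This is a sketch in the spirit of scikit-learn's model-selection example for FactorAnalysis, not the only possible criterion; it reuses X and FactorAnalysis from the cells above.
In [ ]:
# Cross-validated log-likelihood for different numbers of factors; a score
# that stops improving after two components supports the claim above.
import numpy as np
from sklearn.model_selection import cross_val_score
for n in range(1, 5):
    fa = FactorAnalysis(n_components=n, random_state=101)
    print("n_components =", n,
          "mean CV log-likelihood =", round(np.mean(cross_val_score(fa, X)), 2))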
t-SNE (t-distributed Stochastic Neighbor Embedding) converts similarities between data points into joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and of the high-dimensional data. t-SNE has a cost function that is not convex, i.e. different initializations can give different results.
While t-SNE can be useful for dimensionality reduction, it is mostly used as a visualization technique for classification problems. One of its creators maintains a website with native implementations callable from multiple languages, including Python, but scikit-learn also provides an implementation.
One of the most popular demos is the visualization of the MNIST digit data.
In [1]:
import numpy as np
# fetch_mldata and mldata.org are gone; fetch_openml is the current loader
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X = mnist.data / 255.0
y = mnist.target
print(X.shape, y.shape)
In [2]:
import pandas as pd
feat_cols = [ 'pixel'+str(i) for i in range(X.shape[1]) ]
df = pd.DataFrame(X,columns=feat_cols)
df['label'] = y
df['label'] = df['label'].apply(lambda i: str(i))
#X, y = None, None
print('Size of the dataframe: {}'.format(df.shape))
In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
rndperm = np.random.permutation(df.shape[0])
# Plot the graph
plt.gray()
fig = plt.figure( figsize=(16,7) )
for i in range(0,30):
    ax = fig.add_subplot(3,10,i+1, title='Digit: ' + str(df.loc[rndperm[i],'label']) )
    ax.matshow(df.loc[rndperm[i],feat_cols].values.reshape((28,28)).astype(float))
plt.show()
In [4]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca_result = pca.fit_transform(df[feat_cols].values)
df['pca-one'] = pca_result[:,0]
df['pca-two'] = pca_result[:,1]
df['pca-three'] = pca_result[:,2]
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
In [5]:
from ggplot import *
chart = ggplot( df.loc[rndperm[:3000],:], aes(x='pca-one', y='pca-two', color='label') ) \
+ geom_point(size=75,alpha=0.8) \
+ ggtitle("First and Second Principal Components colored by digit")
chart
Out[5]:
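If the ggplot package is unavailable (it is no longer maintained and can conflict with recent pandas releases), a plain matplotlib scatter of the same columns is a reasonable substitute:
In [ ]:
# Same picture with matplotlib only: first two principal components,
# colored by digit label.
import matplotlib.pyplot as plt
subset = df.loc[rndperm[:3000], :]
plt.figure(figsize=(8, 6))
sc = plt.scatter(subset['pca-one'], subset['pca-two'],
                 c=subset['label'].astype(int), cmap='tab10', s=8, alpha=0.8)
plt.colorbar(sc, label='digit')
plt.title("First and Second Principal Components colored by digit")
plt.xlabel('pca-one')
plt.ylabel('pca-two')
plt.show()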
In [6]:
import time
from sklearn.manifold import TSNE
n_sne = 3000
time_start = time.time()
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=250)
tsne_results = tsne.fit_transform(df.loc[rndperm[:n_sne],feat_cols].values)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
In [7]:
df_tsne = df.loc[rndperm[:n_sne],:].copy()
df_tsne['x-tsne'] = tsne_results[:,0]
df_tsne['y-tsne'] = tsne_results[:,1]
chart = ggplot( df_tsne, aes(x='x-tsne', y='y-tsne', color='label') ) \
+ geom_point(size=70,alpha=0.1) \
+ ggtitle("tSNE dimensions colored by digit")
chart
Out[7]:
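Because the cost function is non-convex, rerunning the fit with a different random_state can produce a noticeably different embedding. The sketch below is an illustrative check, not part of the original demo: it refits the same 3,000 digits with two seeds (using random initialization) so the layouts can be compared side by side.
In [ ]:
# Refit t-SNE on the same subset with two different seeds; the cluster layout
# typically changes from run to run because the objective is non-convex.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
subset = df.loc[rndperm[:n_sne], feat_cols].values
labels = df.loc[rndperm[:n_sne], 'label'].astype(int).values
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, seed in zip(axes, [0, 1]):
    emb = TSNE(n_components=2, perplexity=40, init='random',
               random_state=seed).fit_transform(subset)
    ax.scatter(emb[:, 0], emb[:, 1], c=labels, cmap='tab10', s=5)
    ax.set_title("t-SNE, random_state={}".format(seed))
plt.show()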
Task:
Further reading:
In [ ]: