t-distributed Stochastic Neighbor Embedding.

t-SNE [1] is a tool for visualizing high-dimensional data, which makes it useful for inspecting datasets.

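It converts affinities between data points into probabilities, then looks for a low-dimensional embedding whose probabilities match them as closely as possible. See the scikit-learn user guide: http://scikit-learn.org/stable/modules/manifold.html#t-sne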

For advice on reading t-SNE plots, see https://distill.pub/2016/misread-tsne/
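The conversion from affinities to probabilities can be sketched in a few lines of NumPy. This is a simplified illustration, not what scikit-learn runs internally: it assumes a single fixed Gaussian bandwidth `sigma`, whereas t-SNE picks a separate bandwidth per point so that the perplexity matches the requested value. The function name `conditional_probabilities` and the toy `points` array are made up for this example.

```python
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    """Row i holds p(j|i): Gaussian affinities from point i to every other point, normalized to sum to 1."""
    X = np.asarray(X, dtype=float)
    # Squared Euclidean distance between every pair of rows.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian affinity; sigma is a fixed, assumed bandwidth.
    affinities = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)  # a point is not its own neighbor
    return affinities / affinities.sum(axis=1, keepdims=True)

# Two nearby points and one far away.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P = conditional_probabilities(points)
print(P.round(3))
```

Each row of `P` sums to 1, and nearly all of the first point's probability mass goes to its close neighbor rather than to the distant point.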

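Make sure all features are on the same scale before fitting. t-SNE works on distances between points, so a feature with a larger numeric range would dominate the result.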
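As a quick check of what scaling does, the sketch below standardizes the iris features with StandardScaler and compares the per-feature spread before and after. The variable names `X_raw` and `X_scaled` are illustrative only.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X_raw = datasets.load_iris().data

# StandardScaler centers each feature (column) and divides by its standard deviation.
X_scaled = StandardScaler().fit_transform(X_raw)

print("std before:", np.round(X_raw.std(axis=0), 2))
print("std after: ", np.round(X_scaled.std(axis=0), 2))
```

After scaling, every feature has mean 0 and standard deviation 1, so no single measurement dominates the distances.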


In [10]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler, Normalizer, RobustScaler
from sklearn.pipeline import make_pipeline

In [2]:
iris = datasets.load_iris()
X = iris.data    # 150 samples, 4 features
y = iris.target  # class labels, used only to color the plots

t-SNE has a tunable parameter, "perplexity," which says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has.
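One way to get a feel for the parameter is to fit the same data at several perplexity values and compare the results. The sketch below does this for iris. The values 5, 30 and 50 are arbitrary picks for illustration, and `random_state=0` is added only to make the runs repeatable.

```python
from sklearn import datasets
from sklearn.manifold import TSNE

X_iris = datasets.load_iris().data

embeddings = {}
for perplexity in (5, 30, 50):  # illustrative values; must stay below the number of samples
    tsne = TSNE(learning_rate=50, init='pca', perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X_iris)
    # kl_divergence_ is the final value of the cost that t-SNE minimizes.
    print(perplexity, embeddings[perplexity].shape, tsne.kl_divergence_)
```

Each entry of `embeddings` is a 150 x 2 array that can be passed to `plt.scatter` in the same way as in the cells that follow.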


In [51]:
# n_components defaults to 2, so the embedding is two-dimensional
model = TSNE(learning_rate=50, init='pca', perplexity=45)
transformed = model.fit_transform(X)
# One point per sample, colored by species
plt.scatter(transformed[:, 0], transformed[:, 1], c=y)
plt.show()



In [53]:
# Normalizer rescales each sample (row) to unit norm; it does not put features on a common scale
normalizer = Normalizer()
model = TSNE(learning_rate=100, init='pca', perplexity=30)
pipeline = make_pipeline(normalizer, model)
transformed = pipeline.fit_transform(X)
plt.scatter(transformed[:, 0], transformed[:, 1], c=y)
plt.show()



In [41]:
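# StandardScaler gives each feature (column) zero mean and unit variance
scaler = StandardScaler()
# perplexity is left at its default of 30
model = TSNE(learning_rate=50, init='pca')
pipeline = make_pipeline(scaler, model)
transformed = pipeline.fit_transform(X)
plt.scatter(transformed[:, 0], transformed[:, 1], c=y)
plt.show()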