t-SNE (t-distributed stochastic neighbor embedding) is a nonlinear dimensionality reduction technique for high-dimensional data.
More info in the usual place: https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding
In [ ]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy
import pickle
from dscribe.descriptors import MBTR
from visualise import view
We are going to apply this technique to a database of wine samples. The inputs are 13 chemical descriptors per wine; the output is its class index (cheap, ok, good). In principle we do not know the outputs.
In [ ]:
dataIn = numpy.genfromtxt('./data/wineInputs.txt', delimiter=',')
dataOut = numpy.genfromtxt('./data/wineOutputs.txt', delimiter=',')
# find the indices of the wines in each class
idx1 = numpy.where(dataOut == 1)[0]
idx2 = numpy.where(dataOut == 2)[0]
idx3 = numpy.where(dataOut == 3)[0]
In [ ]:
# compute the t-SNE transformation of the inputs in 2 dimensions
comp = TSNE(n_components=2).fit_transform(dataIn)
# plot the resulting 2D points
plt.plot(comp[:,0],comp[:,1],'ro')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
The transform had no knowledge of the output classes, and yet three clusters of points can be seen. We can overlay the known correct classification to check whether the clusters correspond to what we know:
In [ ]:
plt.plot(comp[idx1,0],comp[idx1,1],'go')
plt.plot(comp[idx2,0],comp[idx2,1],'ro')
plt.plot(comp[idx3,0],comp[idx3,1],'bo')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
In [ ]:
import ase.io
# load the database
samples = ase.io.read("data/clusters.extxyz", index=':')
# samples is now a list of ASE Atoms objects, ready to use!
# the first 55 clusters are FCC, the last 55 are BCC
# define MBTR setup
mbtr = MBTR(
    species=["Fe"],
    periodic=False,
    k2={
        "geometry": {"function": "distance"},
        "grid": {"min": 0, "max": 2, "sigma": 0.01, "n": 200},
        "weighting": {"function": "exp", "scale": 0.4, "cutoff": 1e-2},
    },
    k3={
        "geometry": {"function": "cosine"},
        "grid": {"min": -1.0, "max": 1.0, "sigma": 0.02, "n": 200},
        "weighting": {"function": "exp", "scale": 0.4, "cutoff": 1e-2},
    },
    flatten=True,
    sparse=False,
)
# calculate MBTR descriptor for each sample - takes a few secs
mbtrs = mbtr.create(samples)
print(mbtrs.shape)
Plot the t-SNE projection of the MBTR output and check whether the two classes of structures separate clearly
In [ ]:
# ...
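# One possible sketch (not the only solution): project the MBTR vectors with t-SNE
# and colour the points using the known ordering (first 55 samples FCC, last 55 BCC).
# The perplexity value is an arbitrary choice here, feel free to change it.
mbtr_comp = TSNE(n_components=2, perplexity=30).fit_transform(mbtrs)
plt.plot(mbtr_comp[:55, 0], mbtr_comp[:55, 1], 'go', label='FCC')
plt.plot(mbtr_comp[55:, 0], mbtr_comp[55:, 1], 'ro', label='BCC')
plt.xlabel('X1')
plt.ylabel('X2')
plt.legend()
plt.show()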
Plot the original MBTR descriptors and see if the structural differences are visible there
In [ ]:
# ...
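# One possible sketch: plot each flattened MBTR vector as a curve,
# FCC clusters (first 55) in green, BCC clusters (last 55) in red.
for i in range(mbtrs.shape[0]):
    plt.plot(mbtrs[i], 'g-' if i < 55 else 'r-', linewidth=0.5)
plt.xlabel('MBTR feature index')
plt.ylabel('MBTR value')
plt.show()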
Try changing the MBTR and t-SNE parameters and see how the projection changes
In [ ]:
# ...
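# One possible sketch for the t-SNE side: rerun the projection with a few
# different perplexity values (chosen arbitrarily here) and compare the maps.
# For the MBTR side, rebuild the descriptor above with e.g. a different sigma
# or grid size, recompute mbtrs, and project again.
for perp in [5, 15, 50]:
    comp_p = TSNE(n_components=2, perplexity=perp).fit_transform(mbtrs)
    plt.plot(comp_p[:55, 0], comp_p[:55, 1], 'go')
    plt.plot(comp_p[55:, 0], comp_p[55:, 1], 'ro')
    plt.title('perplexity = %d' % perp)
    plt.show()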