In this notebook we perform clustering tests on the fingerprint functions. Note that at this point we don't yet have the oxidation state fingerprints and only have those for ones, Z and Chi. We do the clustering with oxidation state fingerprints added in a later notebook. We have the fingerprints and formulae saved in the Pandas DataFrame Fingerprint_lt50_old.csv file. Note that we only consider compounds where the total number of atoms in the unit cell <50 and for which oxidation number can be calculated. Also note that the struct files are from the materials project which are not always accurate.
In [6]:
import pandas as pd
import numpy as np
import tqdm
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
In [7]:
%matplotlib inline
In [8]:
Df=pd.read_csv("../data/FingerPrint_lt50_old.csv",sep='\t',index_col=0)
In [9]:
Df.head()
Out[9]:
In [10]:
Formulas=Df["Formula"].values
Formulas
Out[10]:
In [11]:
Fingerprints=Df.drop("Formula",axis=1).values
In [12]:
from sklearn.metrics import euclidean_distances
In [13]:
from sklearn.cluster import KMeans
In [15]:
inertia_array=np.zeros(19)
for i in tqdm.tqdm_notebook(range(2,20)):
Km_run=KMeans(n_clusters=i)
Km_run.fit_predict(Fingerprints)
inertia_array[i-2]=Km_run.inertia_
In [16]:
inertia_array2=np.zeros(6)
for i in tqdm.tqdm_notebook(range(20,26)):
Km_run=KMeans(n_clusters=i)
Km_run.fit_predict(Fingerprints)
inertia_array2[i-20]=Km_run.inertia_
In [32]:
inertia_array=np.append(inertia_array,inertia_array2)
In [41]:
inertia_array=inertia_array[inertia_array>0]
In [42]:
plt.figure(figsize=(8,8))
plt.plot(np.arange(2,26),inertia_array)
plt.ylabel("Inertia")
plt.xlabel("Number of Clusters")
Out[42]:
We decide on 15 clusters. There is no real elbow to speak of and we shall see later that 15 gives us fairly tight clusters
In [8]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
pc=PCA(n_components=50)
In [9]:
pca_fing=pc.fit_transform(Fingerprints)
In [10]:
list(enumerate(np.cumsum(pc.explained_variance_ratio_)))
Out[10]:
In [10]:
Km_pca50=KMeans(n_clusters=15,random_state=42)
In [11]:
clusts=Km_pca50.fit_predict(pca_fing)
In [14]:
rst=np.random.RandomState(42)
c_arr=[tuple(rst.uniform(0,1,4)) for i in range(15)]
Plotting the points based on their cluster number as a function of their first 2 PCA components. The labels are the formula of the first compound in the cluster.
In [12]:
labels=[Formulas[clusts==i][0] for i in sorted(set(clusts))]
list(enumerate(labels))
Out[12]:
In [15]:
plt.figure(figsize=(18,12))
for i in range(15):
plt.scatter(pca_fing[clusts==i,0],pca_fing[clusts==i,1],c=c_arr[i],label="Cluster "+labels[i])
plt.legend()
Out[15]:
Doing the same thing in 3d for the first 3 PCA components. mplot3d isn't the best tool for 3d plotting due to occlusion effects. Final plots will need to be done with mayavi or something
In [17]:
%matplotlib inline
import mpl_toolkits.mplot3d.axes3d as p3
# Plot result
fig = plt.figure(figsize=(10,10))
ax = p3.Axes3D(fig)
#ax.view_init(7, -80)
for i in range(14,-1,-1):
ax.plot3D(pca_fing[clusts==i,0],pca_fing[clusts==i,1],pca_fing[clusts==i,3],'o',c=c_arr[i],label="Cluster "+labels[i])
#plt.title('Without connectivity constraints (time %.2fs)' % elapsed_time)
plt.legend()
Out[17]:
We have discovered a cluster which seems to contain only perovsklites. We print out the formulas in this cluster ( using Pandas) and see that they are almost exclusively perovskites or very close
In [19]:
Df["cluster_pca50"]=clusts
In [20]:
perov_clust_forms=Df[Df["cluster_pca50"]==Df.ix[0]["cluster_pca50"]]["Formula"]
In [21]:
perov_clust_forms
Out[21]:
Since the clusters in Kmeans are arbitrarily ordered, so in order to get pretty similarity matrices, we need to reorder them. We choose to do this based on the distance of cluster center to cluster center of cluster 0. This is not terribly necessary but leads to nicer plots
In [18]:
from collections import Counter
Counter(clusts)
Out[18]:
We plot the similarity matrix for the cluster centers before(1st figure) and after (2nd figure) reordering. We see that the second figure much more ordered
In [20]:
dist_centers=euclidean_distances(Km_pca50.cluster_centers_)
In [21]:
print dist_centers[0]
plt.imshow(dist_centers)
plt.colorbar()
Out[21]: