By this point we already have the fingerprints for Ones, Z, Chi, and oxidation number calculated. These are stored, along with the chemical formulae, in the file FingerPrint_lt50.csv in the data folder. Note that we only consider compounds with fewer than 50 atoms in the unit cell and for which oxidation numbers can be calculated.
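For context, a hypothetical sketch of what this filter might look like (the data-preparation code is not part of this notebook; the use of pymatgen and the helper name are assumptions):

from pymatgen.core import Composition

def keep_compound(structure):
    # Hypothetical filter: fewer than 50 atoms in the unit cell and at
    # least one charge-balanced assignment of oxidation states.
    if structure.num_sites >= 50:
        return False
    # oxi_state_guesses() returns an empty list when no consistent set of
    # oxidation states can be found for the composition
    return len(Composition(structure.formula).oxi_state_guesses()) > 0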
In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [5]:
Df=pd.read_csv("../data/FingerPrint_lt50.csv",sep='\t',index_col=0)
In [6]:
Df.head()
Out[6]:
We load all the formulas and fingerprints and check that they have the correct shapes.
In [7]:
Formulas=Df["Formula"].values
In [8]:
Fingerprint=Df.drop("Formula",axis=1).values
In [6]:
Fingerprint.shape
Out[6]:
In [7]:
Formulas.shape
Out[7]:
We perform a 100-component PCA on the fingerprints and then check how many components we need to keep in order to retain most of the variance. The cumulative explained variance ratio is plotted as a function of the number of components kept in the plot below.
In [9]:
from sklearn.decomposition import PCA
In [10]:
pca=PCA(n_components=100)
In [11]:
pca_fing=pca.fit_transform(Fingerprint)
In [13]:
plt.plot(np.arange(1,101),np.cumsum(pca.explained_variance_ratio_))
plt.ylabel("Explained Variance Ratio")
plt.xlabel("Number of Components")
Out[13]:
We enumerate the cumulative explained variance ratios plotted above.
In [12]:
list(enumerate(np.cumsum(pca.explained_variance_ratio_)))
Out[12]:
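Rather than scanning that list by eye, the cutoff can be computed directly; a small sketch:

cum=np.cumsum(pca.explained_variance_ratio_)
# smallest number of components whose cumulative explained variance reaches 95%
print(int(np.argmax(cum>=0.95))+1)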
With 50 components we might already be pushing the limits for the DBSCAN clustering we want to try, so let's stick with 50 and ramp it up if necessary. We capture almost 96% of the variance in this scenario.
In [13]:
pca_fing50=pca_fing[:,0:50]
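DBSCAN itself is not run in this notebook; as a forward-looking sketch of what that step might look like (eps is a placeholder that would need tuning, e.g. from a k-distance plot):

from sklearn.cluster import DBSCAN
from collections import Counter

db=DBSCAN(eps=5.0,min_samples=5)  # min_samples=5 is the sklearn default
clust_db=db.fit_predict(pca_fing50)
print(Counter(clust_db))  # label -1 marks points DBSCAN treats as noise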
In [14]:
from sklearn.cluster import KMeans
In [15]:
Km=KMeans(n_clusters=15,random_state=42,n_init=50)
In [16]:
clust_km50=Km.fit_predict(pca_fing50)
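The choice of n_clusters=15 is not motivated above; one way to sanity-check it (a sketch, not run in the original) is to compare silhouette scores across a few values of k:

from sklearn.metrics import silhouette_score

# higher silhouette scores indicate better-separated clusters
for k in (10,15,20):
    labels=KMeans(n_clusters=k,random_state=42,n_init=10).fit_predict(pca_fing50)
    print(k,silhouette_score(pca_fing50,labels))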
We output the number of compounds in each cluster.
In [17]:
from collections import Counter
print(Counter(clust_km50))
In [147]:
from sklearn.metrics.pairwise import euclidean_distances
We relabel the clusters in order of their centers' distance from the center of cluster 0, then use pandas sorting routines to sort the fingerprints by cluster label.
In [19]:
dist_center=euclidean_distances(Km.cluster_centers_,Km.cluster_centers_[:1]).ravel()
# cluster indices in order of increasing distance from the center of cluster 0
sort_key=np.argsort(dist_center)
In [20]:
clust_km50_s=np.zeros(len(clust_km50))
for i in range(15):
    clust_km50_s[clust_km50==sort_key[i]]=int(i)
In [21]:
Counter(clust_km50_s)
Out[21]:
In [22]:
Df["clust_pca50"]=clust_km50_s
In [23]:
Df_sorted=Df.sort_values(by="clust_pca50")
In [24]:
Formula_s=Df_sorted["Formula"]
In [25]:
Finger_s=Df_sorted.drop(["Formula","clust_pca50"],axis=1).values
In [26]:
Finger_s.shape
Out[26]:
We now perform PCA on this new, sorted set of fingerprints. We could simply have sorted the earlier PCA fingerprints instead; however, PCA is cheap and we did not have the PCA fingerprints in the pandas dataframe.
In [27]:
pca2=PCA(n_components=50)
fing_pca50_s=pca2.fit_transform(Finger_s)
print(np.sum(pca2.explained_variance_ratio_))
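For comparison, a minimal sketch of the alternative mentioned above, reordering the PCA fingerprints we already have (row order within a cluster may differ from the pandas sort, which does not affect the block structure of the similarity matrix):

# reorder the existing PCA fingerprints by the relabeled cluster id
order=np.argsort(clust_km50_s,kind="stable")
fing_pca50_alt=pca_fing50[order]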
We now calculate Euclidean distances and plot the similarity matrix.
In [28]:
dist_fing_s=euclidean_distances(fing_pca50_s)
In [29]:
np.amax(dist_fing_s)
Out[29]:
In [31]:
clust_km50_s=Df_sorted["clust_pca50"].values
plt.figure(figsize=(10,10))
# plot every second row/column to keep the rendered image a manageable size
plt.imshow(dist_fing_s[::2,::2],cmap=plt.cm.magma)
plt.title("Similarity Matrix after sorting")
plt.colorbar()
Out[31]:
Finally, we zoom in on a single cluster (here cluster 10) and plot its internal distance matrix.
In [32]:
fing_13=fing_pca50_s[clust_km50_s==10]
plt.figure(figsize=(8,8))
plt.imshow(euclidean_distances(fing_13),cmap=plt.cm.viridis)
plt.colorbar()
Out[32]:
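The same zoomed-in view can be produced for every cluster with a short loop along the lines of the cell above (a sketch):

for c in range(15):
    members=fing_pca50_s[clust_km50_s==c]
    plt.figure(figsize=(4,4))
    plt.imshow(euclidean_distances(members),cmap=plt.cm.viridis)
    plt.title("Cluster %d (%d compounds)"%(c,len(members)))
    plt.colorbar()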