1.4 Clustering_oxi_fingerprint-checkpoint


Lets Cluster the fingerprints with new oxidation number fingerprints included

Loading data and reading in formulas and fingerprints

By this point we already have the fingerprints for Ones,Z,Chi and oxidation number calculated. These are stored, along with the chemical formulae, in the file Fingerprint_lt50.csv in the data folder. Note that we only consider compounds where the total number of atoms in the unit cell <50 and for which oxidation number can be calculated.


In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


/usr/local/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

In [5]:
Df=pd.read_csv("../data/FingerPrint_lt50.csv",sep='\t',index_col=0)

In [6]:
Df.head()


Out[6]:
Formula Ones_1 Ones_2 Ones_3 Ones_4 Ones_5 Ones_6 Ones_7 Ones_8 Ones_9 ... Oxi_91 Oxi_92 Oxi_93 Oxi_94 Oxi_95 Oxi_96 Oxi_97 Oxi_98 Oxi_99 Oxi_100
0 Nb1 Ag1 O3 -1.0 -1.0 -1.0 -1.000000 -1.00000 -1.000000 -1.000000 -1.000000 -1.000000 ... -0.012953 -0.095604 -0.104102 -0.183833 -0.356473 -0.505818 -0.536482 -0.473015 -0.425665 -0.483642
1 Li2 Ag6 O4 -1.0 -1.0 -1.0 -1.000000 -1.00000 -1.000000 -1.000000 -1.000000 -0.999999 ... -0.097331 -0.222620 -0.219186 -0.149412 -0.046637 0.069006 0.131393 0.098103 -0.040059 -0.296037
2 Cs2 Ag2 Cl4 -1.0 -1.0 -1.0 -1.000000 -1.00000 -1.000000 -1.000000 -1.000000 -1.000000 ... 0.136590 0.131849 -0.020244 -0.112980 -0.003839 0.224747 0.338938 0.204959 -0.099369 -0.436527
3 Ag2 Hg1 I4 -1.0 -1.0 -1.0 -1.000000 -1.00000 -1.000000 -1.000000 -1.000000 -1.000000 ... -0.055794 -0.147931 -0.240709 -0.256481 -0.141461 0.054555 0.253675 0.414680 0.427099 0.152495
4 Ag2 C2 O6 -1.0 -1.0 -1.0 -0.999999 -0.99997 -0.999462 -0.993801 -0.954192 -0.782973 ... -0.153798 -0.262954 -0.317802 -0.218989 0.012370 0.284588 0.437742 0.290560 -0.120417 -0.541639

5 rows × 401 columns

We load up all the formulas and the fingerprints and check that they have the correct shapes


In [7]:
Formulas=Df["Formula"].values

In [8]:
Fingerprint=Df.drop("Formula",axis=1).values

In [6]:
Fingerprint.shape


Out[6]:
(14722, 400)

In [7]:
Formulas.shape


Out[7]:
(14722,)

Checking how many components of PCA we need

We perform hundred component PCA on the fingerprints and then check the number of components we want to keep in order to retain a large part of the covariance. We plot the total covariance captured as a function of number of components kept in the plot below


In [9]:
from sklearn.decomposition import PCA

In [10]:
pca=PCA(n_components=100)

In [11]:
pca_fing=pca.fit_transform(Fingerprint)

In [13]:
plt.plot(np.arange(1,101),np.cumsum(pca.explained_variance_ratio_))
plt.ylabel("Explained Variance Ratio")
plt.xlabel("Number of Components")


Out[13]:
<matplotlib.text.Text at 0x1147f7890>

We enumerate the elements of the plot below


In [12]:
list(enumerate(np.cumsum(pca.explained_variance_ratio_)))


Out[12]:
[(0, 0.15492476023975613),
 (1, 0.26728530460418154),
 (2, 0.37097694951145038),
 (3, 0.44911330754675549),
 (4, 0.50686173834960302),
 (5, 0.54818714088119103),
 (6, 0.58323300926758548),
 (7, 0.61567526741159406),
 (8, 0.64623301261143551),
 (9, 0.6753853103743449),
 (10, 0.70011819172487566),
 (11, 0.72254301595249826),
 (12, 0.74121355408205825),
 (13, 0.7579571516374779),
 (14, 0.77366155752734567),
 (15, 0.78832712266163718),
 (16, 0.80150606731796237),
 (17, 0.8130421783532743),
 (18, 0.82345035235024044),
 (19, 0.83304796471075315),
 (20, 0.84128592259890633),
 (21, 0.84891253858063653),
 (22, 0.85612295472810085),
 (23, 0.86324319793702187),
 (24, 0.86963881334327131),
 (25, 0.87567663526488293),
 (26, 0.8814613061272828),
 (27, 0.88708658630136927),
 (28, 0.89220666872309595),
 (29, 0.89720519583809344),
 (30, 0.90199058325264481),
 (31, 0.90650415919498806),
 (32, 0.910664538351043),
 (33, 0.91475877696229735),
 (34, 0.91832314786481828),
 (35, 0.92184296738755589),
 (36, 0.92502799745327802),
 (37, 0.92797838048097792),
 (38, 0.93083995533706443),
 (39, 0.93359494994852588),
 (40, 0.93629850693996486),
 (41, 0.93885631469169761),
 (42, 0.94118796491430512),
 (43, 0.94346161380692661),
 (44, 0.9456581168116599),
 (45, 0.94776312192981538),
 (46, 0.94979411263935365),
 (47, 0.95172814689819907),
 (48, 0.95353701589368189),
 (49, 0.95533286673473061),
 (50, 0.95702062847558489),
 (51, 0.95860214376507369),
 (52, 0.96010822881070557),
 (53, 0.9615894937874655),
 (54, 0.96294043321718892),
 (55, 0.96424470224888537),
 (56, 0.96551317435166328),
 (57, 0.96671472402433367),
 (58, 0.96787841304338018),
 (59, 0.96899651264141484),
 (60, 0.9701019619923319),
 (61, 0.97118533711317234),
 (62, 0.97222455091039539),
 (63, 0.97324018959278835),
 (64, 0.97422451626610385),
 (65, 0.97515178406429515),
 (66, 0.97604814105988658),
 (67, 0.97690252220910234),
 (68, 0.97770769793947354),
 (69, 0.97848200440333211),
 (70, 0.97923812684516931),
 (71, 0.97995537929644994),
 (72, 0.98064971759334563),
 (73, 0.98132967064422572),
 (74, 0.98195180709688024),
 (75, 0.98256518620625544),
 (76, 0.98317100515889366),
 (77, 0.98375568419439641),
 (78, 0.98433561619439247),
 (79, 0.98489320453973905),
 (80, 0.98542745237298235),
 (81, 0.98594422304702989),
 (82, 0.98645136251061649),
 (83, 0.9869401532124682),
 (84, 0.9874225119646699),
 (85, 0.98788801135567161),
 (86, 0.98833163749463104),
 (87, 0.98876475120261154),
 (88, 0.98917907792733772),
 (89, 0.98956437454205848),
 (90, 0.98993744486144664),
 (91, 0.99030179400213569),
 (92, 0.99065429916676195),
 (93, 0.99100122998011819),
 (94, 0.99133936412772694),
 (95, 0.99166729025381883),
 (96, 0.99198146956530953),
 (97, 0.99228373293134642),
 (98, 0.99257719807720723),
 (99, 0.99286402707305377)]

With 50 components we might already be pushing the limits for DBSCAN clustering which we need to try. So lets stick to 50 and then ramp it up if necessary. We are capturing almost 96% of the covariance in this scenario


In [13]:
pca_fing50=pca_fing[:,0:50]

Clustering using Kmeans

We perform 15 cluster Kmeans on this 50 component pca fingerprint to get representative clusters.


In [14]:
from sklearn.cluster import KMeans

In [15]:
Km=KMeans(n_clusters=15,random_state=42,n_init=50)

In [16]:
clust_km50=Km.fit_predict(pca_fing50)

We output the number of elements in each cluster


In [17]:
from collections import Counter
print Counter(clust_km50)


Counter({5: 1690, 2: 1595, 11: 1459, 0: 1407, 6: 1276, 13: 1251, 10: 1229, 1: 868, 8: 822, 12: 818, 7: 777, 3: 601, 4: 550, 14: 283, 9: 96})

Sorting the cluster centers in order of distance from cluster 0 to obtain dissimilarity matrix

We rename the clusters in ascending order based on distance from cluster 0. Then we reorder the fingerprints by cluster number and then compute similarity matrix based on the euclidean distance


In [147]:
from sklearn.metrics.pairwise import euclidean_distances

Using Pandas sorting routines to sort the Fingerprints.


In [19]:
dist_center=euclidean_distances(Km.cluster_centers_,Km.cluster_centers_[0])

dist_sorted=sorted(dist_center)
sort_key=[]
for i in range(15):
    sort_key.append(np.where(dist_center==dist_sorted[i])[0][0])


/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)

In [20]:
clust_km50_s=np.zeros(len(clust_km50))
for i in range(15):
    clust_km50_s[clust_km50==sort_key[i]]=int(i)

In [21]:
Counter(clust_km50_s)


Out[21]:
Counter({0.0: 1407,
         1.0: 1595,
         2.0: 1690,
         3.0: 1251,
         4.0: 1229,
         5.0: 601,
         6.0: 1459,
         7.0: 1276,
         8.0: 822,
         9.0: 868,
         10.0: 550,
         11.0: 818,
         12.0: 283,
         13.0: 777,
         14.0: 96})

In [22]:
Df["clust_pca50"]=clust_km50_s

In [23]:
Df_sorted=Df.sort_values(by="clust_pca50")

In [24]:
Formula_s=Df_sorted["Formula"]

In [25]:
Finger_s=Df_sorted.drop(["Formula","clust_pca50"],axis=1).values

In [26]:
Finger_s.shape


Out[26]:
(14722, 400)

We now perform pca on this new sorted set of fingerprints. We could have just sorted the earlier pca fingerprints instead, however PCA is cheap and we did not have the PCA fingerprints in the pandas dataframe


In [27]:
pca2=PCA(n_components=50)
fing_pca50_s=pca2.fit_transform(Finger_s)
print np.sum(pca2.explained_variance_ratio_)


0.955332866735

We now calculate eulidean distances and plot the similarity matrix


In [28]:
dist_fing_s=euclidean_distances(fing_pca50_s)

In [29]:
np.amax(dist_fing_s)


Out[29]:
21.518883998765343

In [31]:
clust_km50_s=Df_sorted["clust_pca50"].values
#fing_714=Finger_s[clust_km50_s>6]
#dist_fing_714=euclidean_distances(fing_714)
plt.figure(figsize=(10,10))
plt.imshow(dist_fing_s[::2,::2],cmap=plt.cm.magma)
plt.title("Similarity Matrix after sorting")
plt.colorbar()


Out[31]:
<matplotlib.colorbar.Colorbar at 0x10c15dcd0>

Let us try to estimate eps and number of elements for DBSCAN from one tight cluster (cluster 13)


In [32]:
fing_13=fing_pca50_s[clust_km50_s==10]
plt.figure(figsize=(8,8))
plt.imshow(euclidean_distances(fing_13),cmap=plt.cm.viridis)
plt.colorbar()


Out[32]:
<matplotlib.colorbar.Colorbar at 0x10e825390>