1.3 Fingerprint clustering


Let's cluster these fingerprints

In this notebook we perform clustering tests on the fingerprint functions. Note that at this point we don't yet have the oxidation-state fingerprints, only those for Ones, Z, and Chi; clustering with the oxidation-state fingerprints added is done in a later notebook. The fingerprints and formulae are saved as a pandas DataFrame in the FingerPrint_lt50_old.csv file. Note that we only consider compounds with fewer than 50 atoms in the unit cell and for which oxidation numbers can be calculated. Also note that the structure files come from the Materials Project and are not always accurate.
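The "fewer than 50 atoms in the unit cell" filter can be checked directly from formula strings of the form used here ("Nb1 Ag1 O3"). A minimal sketch, using a hypothetical `total_atoms` helper that simply sums the integer counts in the formula:

```python
import re

def total_atoms(formula):
    # Sum the integer counts in a formula string such as "Nb1 Ag1 O3".
    return sum(int(n) for n in re.findall(r"[A-Za-z]+(\d+)", formula))

print(total_atoms("Nb1 Ag1 O3"))    # 5
print(total_atoms("Cs9 Hg9 Cl27"))  # 45
print(total_atoms("Nb1 Ag1 O3") < 50)
```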

Loading the CSV into a pandas DataFrame and extracting the formulae and fingerprints


In [6]:
import pandas as pd
import numpy as np
import tqdm
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

In [7]:
%matplotlib inline

In [8]:
Df=pd.read_csv("../data/FingerPrint_lt50_old.csv",sep='\t',index_col=0)

In [9]:
Df.head()


Out[9]:
Formula Ones_1 Ones_2 Ones_3 Ones_4 Ones_5 Ones_6 Ones_7 Ones_8 Ones_9 ... Chi_91 Chi_92 Chi_93 Chi_94 Chi_95 Chi_96 Chi_97 Chi_98 Chi_99 Chi_100
0 Nb1 Ag1 O3 -1.0 -1.0 -1.0 -1.000000 -1.00000 -1.000000 -1.000000 -1.000000 -1.000000 ... -0.497277 -0.453894 -0.191895 0.064329 0.104619 -0.118050 -0.394161 -0.537855 -0.587967 -0.654697
1 Li2 Ag6 O4 -1.0 -1.0 -1.0 -1.000000 -1.00000 -1.000000 -1.000000 -1.000000 -0.999999 ... -0.120349 -0.206105 -0.217994 -0.183563 -0.075639 0.073079 0.137550 0.062229 -0.112490 -0.359644
2 Cs2 Ag2 Cl4 -1.0 -1.0 -1.0 -1.000000 -1.00000 -1.000000 -1.000000 -1.000000 -1.000000 ... 0.259454 0.218056 0.031980 -0.098497 -0.054451 0.137466 0.285717 0.202032 -0.097068 -0.462973
3 Ag2 Hg1 I4 -1.0 -1.0 -1.0 -1.000000 -1.00000 -1.000000 -1.000000 -1.000000 -1.000000 ... -0.059718 -0.141794 -0.241085 -0.272397 -0.160438 0.054698 0.275458 0.446003 0.474980 0.227803
4 Ag2 C2 O6 -1.0 -1.0 -1.0 -0.999999 -0.99997 -0.999462 -0.993801 -0.954192 -0.782973 ... -0.123078 -0.202793 -0.201147 -0.164696 -0.108094 0.009786 0.153578 0.153906 -0.070010 -0.394319

5 rows × 301 columns


In [10]:
Formulas=Df["Formula"].values
Formulas


Out[10]:
array(['Nb1 Ag1 O3', 'Li2 Ag6 O4', 'Cs2 Ag2 Cl4', ..., 'Y6 Co6 O18',
       'Y1 Co1 F5', 'Mg1 Co4 S8'], dtype=object)

In [11]:
Fingerprints=Df.drop("Formula",axis=1).values

In [12]:
from sklearn.metrics import euclidean_distances

Testing out the inertia of KMeans with different n_clusters to see if there is a well-defined "elbow" or a sharp drop.


In [13]:
from sklearn.cluster import KMeans

In [15]:
inertia_array=np.zeros(18)  # one slot per n_clusters in range(2, 20)
for i in tqdm.tqdm_notebook(range(2,20)):
    Km_run=KMeans(n_clusters=i)
    Km_run.fit_predict(Fingerprints)
    inertia_array[i-2]=Km_run.inertia_




In [16]:
inertia_array2=np.zeros(6)
for i in tqdm.tqdm_notebook(range(20,26)):
    Km_run=KMeans(n_clusters=i)
    Km_run.fit_predict(Fingerprints)
    inertia_array2[i-20]=Km_run.inertia_




In [32]:
inertia_array=np.append(inertia_array,inertia_array2)

In [41]:
inertia_array=inertia_array[inertia_array>0]

In [42]:
plt.figure(figsize=(8,8))
plt.plot(np.arange(2,26),inertia_array)
plt.ylabel("Inertia")
plt.xlabel("Number of Clusters")


Out[42]:
[<matplotlib.lines.Line2D at 0x1124a3f90>]

We decide on 15 clusters. There is no real elbow to speak of, and we shall see later that 15 gives us fairly tight clusters.
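When inertia alone gives no clear elbow, the "fairly tight clusters" claim can be quantified with the silhouette score, which rewards compact, well-separated clusters. A minimal sketch on synthetic blobs (a stand-in for the fingerprint matrix; the real notebook would pass `Fingerprints` or `pca_fing` instead):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in: 5 well-separated blobs, so k=5 should score best.
X, _ = make_blobs(n_samples=600, centers=5, cluster_std=0.6, random_state=42)

scores = {}
for k in (3, 5, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
print(scores)
```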

Let's use PCA to make the dataset smaller

We use the explained variance ratio to decide the number of components. Keeping 50 components retains ~97% of the variance.
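Scikit-learn can also pick the component count for a target variance fraction automatically: passing a float in (0, 1) as `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on random data standing in for the 300-column fingerprint matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(400, 300))  # random stand-in for the fingerprint matrix

# A float n_components asks for the smallest number of components whose
# cumulative explained variance reaches that fraction (needs svd_solver="full").
pc = PCA(n_components=0.97, svd_solver="full")
X_red = pc.fit_transform(X)
print(pc.n_components_, np.cumsum(pc.explained_variance_ratio_)[-1])
```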


In [8]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
pc=PCA(n_components=50)

In [9]:
pca_fing=pc.fit_transform(Fingerprints)

In [10]:
list(enumerate(np.cumsum(pc.explained_variance_ratio_)))


Out[10]:
[(0, 0.16149118125330231),
 (1, 0.27367676124440887),
 (2, 0.37250488053590131),
 (3, 0.44928486557661768),
 (4, 0.50712061537192721),
 (5, 0.54981904927756264),
 (6, 0.5853508611030771),
 (7, 0.6199933456382144),
 (8, 0.65102686933536513),
 (9, 0.68001285644235121),
 (10, 0.70764070985117467),
 (11, 0.73034622458312459),
 (12, 0.7499285382066283),
 (13, 0.76809153805162245),
 (14, 0.78435923367535199),
 (15, 0.79901524025260173),
 (16, 0.81279379573299504),
 (17, 0.82483369075007018),
 (18, 0.83574923304734461),
 (19, 0.84572638458306681),
 (20, 0.8541211132340204),
 (21, 0.86218460989479784),
 (22, 0.86988930772944628),
 (23, 0.87697220787494989),
 (24, 0.88343686872852345),
 (25, 0.88968751572190052),
 (26, 0.89574043598524988),
 (27, 0.90158254950196837),
 (28, 0.90718878573617356),
 (29, 0.91206621481073447),
 (30, 0.91678640751493468),
 (31, 0.92125512084878669),
 (32, 0.92520925187091885),
 (33, 0.92894818633254295),
 (34, 0.93261969138669665),
 (35, 0.93603947790603437),
 (36, 0.93922954087316757),
 (37, 0.94219798220305251),
 (38, 0.94504203146948418),
 (39, 0.947809515903),
 (40, 0.95046690454845262),
 (41, 0.9528462556121835),
 (42, 0.95513130271862212),
 (43, 0.95728572237699527),
 (44, 0.9593794476669365),
 (45, 0.96138569806994756),
 (46, 0.96329077418826958),
 (47, 0.96499316695709192),
 (48, 0.9665820358594166),
 (49, 0.96811447180119914)]

Performing KMeans with 15 clusters on these 50-component PCA fingerprints


In [10]:
Km_pca50=KMeans(n_clusters=15,random_state=42)

In [11]:
clusts=Km_pca50.fit_predict(pca_fing)

Now for some fancy plots!


In [14]:
rst=np.random.RandomState(42)
c_arr=[tuple(rst.uniform(0,1,4)) for i in range(15)]

Plotting the points, colored by cluster number, as a function of their first 2 PCA components. The labels are the formula of the first compound in each cluster.


In [12]:
labels=[Formulas[clusts==i][0] for i in sorted(set(clusts))]
list(enumerate(labels))


Out[12]:
[(0, 'Er2 Ag2 Te4'),
 (1, 'Ag2 Ru2 O8'),
 (2, 'Y2 Ag2 Te4'),
 (3, 'Li2 Ag1 Sb1'),
 (4, 'Ag2 C2 O6'),
 (5, 'Ce2 Ag2 Ge2'),
 (6, 'Li2 Ag6 O4'),
 (7, 'Ag4 Ru4 O12'),
 (8, 'Nb1 Ag1 O3'),
 (9, 'Mn2 Ag2 O6'),
 (10, 'Ag1 Bi1 S2'),
 (11, 'Li1 Ag1 Te1'),
 (12, 'Cs2 Ag2 Cl4'),
 (13, 'Ca2 Ag2 Bi2'),
 (14, 'Rb2 Ag2 O4')]

In [15]:
plt.figure(figsize=(18,12))
for i in range(15):
    plt.scatter(pca_fing[clusts==i,0],pca_fing[clusts==i,1],c=c_arr[i],label="Cluster "+labels[i])
plt.legend()


Out[15]:
<matplotlib.legend.Legend at 0x10d0a0690>

Doing the same thing in 3D for the first 3 PCA components. mplot3d isn't the best tool for 3D plotting due to occlusion effects; final plots will need to be done with Mayavi or a similar tool.


In [17]:
%matplotlib inline
import mpl_toolkits.mplot3d.axes3d as p3
# Plot result
fig = plt.figure(figsize=(10,10))
ax = p3.Axes3D(fig)
#ax.view_init(7, -80)
for i in range(14,-1,-1):
    # Plot components 0, 1, 2 (the original cell used component 3 on the z-axis).
    ax.plot3D(pca_fing[clusts==i,0],pca_fing[clusts==i,1],pca_fing[clusts==i,2],'o',c=c_arr[i],label="Cluster "+labels[i])
plt.legend()


Out[17]:
<matplotlib.legend.Legend at 0x106458890>

We have discovered a cluster which seems to contain only perovskites. Printing out the formulas in this cluster (using pandas) shows that they are almost exclusively perovskites or very close to it.


In [19]:
Df["cluster_pca50"]=clusts

In [20]:
perov_clust_forms=Df[Df["cluster_pca50"]==Df.loc[0,"cluster_pca50"]]["Formula"]  # .ix is deprecated; use .loc

In [21]:
perov_clust_forms


Out[21]:
0          Nb1 Ag1 O3
32          Ag1 B1 O3
33         Zn1 Ag1 F3
86        Cs2 Ag2 Cl6
108       Tl1 Ag1 Cl3
132        Ni1 Ag1 F3
173        Rb1 Ag1 O3
208        Mg1 Ag1 F3
230       Tl2 Ag2 Cl6
231        K2 Ag2 Cl6
240       Tl1 Ag1 Br3
241       Rb1 Ag1 Br3
246        Ta2 Ag2 O6
256        Cu2 Ag2 F6
257        Mg2 Ag2 F6
262        Cu1 Ag1 F3
269        Rb2 Ag2 F6
272        Cs2 Ag2 F6
308        K4 Ag4 F12
332       Cs1 Ag1 Br3
342       Rb1 Ag1 Cl3
343        K1 Ag1 Cl3
407        Al1 Ag1 O3
413         Y1 Ag1 O3
429        Tl1 Ag1 F3
445        K1 Ag1 Br3
485        Sr1 Ag1 O3
504        La1 Al1 O3
505        La2 Al2 O6
512        Al1 Bi1 O3
             ...     
13883    Mn4 Tl4 Cl12
13947     Tl1 Ge1 Cl3
13950     Cs1 Tm1 Cl3
13967     Rb2 In2 Cl6
13968     Tl1 In1 Cl3
14003    Rb6 Mg6 Cl18
14030     Tl1 Cu1 Cl3
14041    Cs9 Hg9 Cl27
14053     Rb1 Sn1 Cl3
14067     Li1 Ti1 Cl3
14075    Rb4 Mg4 Cl12
14103      K1 Ge1 Cl3
14134     Rb1 Pb1 Cl3
14195      K1 Cr1 Cl3
14196      K2 Cr2 Cl6
14197     Cs1 Sm1 Cl3
14211     Cs2 Pb2 Cl6
14248      La2 Co2 O6
14249      Sm1 Co1 O3
14250      Nd1 Co1 O3
14251      Pr1 Co1 O3
14256      La1 Co1 O3
14289      Co1 Te1 O3
14317       K1 Co1 F3
14368      Sr1 Co1 O3
14396      Tl1 Co1 F3
14440     La4 Co3 O10
14445     Rb6 Co6 F18
14528      Gd1 Co1 O3
14678     Sr8 Co8 O23
Name: Formula, dtype: object
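The "almost exclusively perovskites" observation can be checked quantitatively: perovskites share the ABX3 stoichiometry, i.e. element counts in a 1:1:3 ratio. A sketch with a hypothetical `is_abx3` helper (not part of the notebook) applied to formulas from the listing above:

```python
import re

def is_abx3(formula):
    # Parse the "El<count>" tokens and test for a 1:1:3 count ratio,
    # the ABX3 stoichiometry shared by perovskites.
    counts = sorted(int(n) for n in re.findall(r"[A-Za-z]+(\d+)", formula))
    if len(counts) != 3:
        return False
    a, b, x = counts
    return a == b and x == 3 * a

print(is_abx3("Nb1 Ag1 O3"))    # True
print(is_abx3("Cs9 Hg9 Cl27"))  # True  (9:9:27 reduces to 1:1:3)
print(is_abx3("La4 Co3 O10"))   # False (one of the "very close" cases)
```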

The clusters returned by KMeans are arbitrarily ordered, so in order to get pretty similarity matrices we need to reorder them. We choose to do this based on the distance of each cluster center from the center of cluster 0. This is not strictly necessary but leads to nicer plots.
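The reordering step can be sketched as an argsort of the distances from cluster 0's center, followed by a relabeling. A toy example with 4 centers in 2-D standing in for `Km_pca50.cluster_centers_`:

```python
import numpy as np
from sklearn.metrics import euclidean_distances

# Toy stand-in for the 15 cluster centers and per-sample labels.
centers = np.array([[0.0, 0.0], [5.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
labels = np.array([0, 1, 2, 3, 1, 2])

# Order clusters by their center's distance to cluster 0's center,
# then relabel so cluster indices increase with that distance.
order = np.argsort(euclidean_distances(centers)[0])
relabel = np.empty(len(centers), dtype=int)
relabel[order] = np.arange(len(centers))
new_labels = relabel[labels]
print(order)       # [0 2 3 1]
print(new_labels)  # [0 3 1 2 3 1]
```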


In [18]:
from collections import Counter
Counter(clusts)


Out[18]:
Counter({0: 938,
         1: 1728,
         2: 1507,
         3: 97,
         4: 891,
         5: 777,
         6: 998,
         7: 550,
         8: 491,
         9: 1505,
         10: 492,
         11: 283,
         12: 1204,
         13: 1212,
         14: 2049})

We plot the similarity matrix for the cluster centers before (first figure) and after (second figure) reordering. The second figure is much more ordered.


In [20]:
dist_centers=euclidean_distances(Km_pca50.cluster_centers_)

In [21]:
print(dist_centers[0])
plt.imshow(dist_centers)
plt.colorbar()


[  0.           3.93535105   3.90032104  11.7278847    4.79061938
   4.34218444   4.00873426   4.74158359   7.58895776   4.24982168
   6.47418226   7.54039307   5.19884462   4.48322222   4.32923034]
Out[21]:
<matplotlib.colorbar.Colorbar at 0x1040c2a90>