Lets Cluster the fingerprints with new oxidation number fingerprints included

Loading data and reading in formulas and fingerprints

By this point we already have the fingerprints for Ones,Z,Chi and oxidation number calculated. These are stored, along with the chemical formulae, in the file Fingerprint_lt50.csv in the data folder. Note that we only consider compounds where the total number of atoms in the unit cell <50 and for which oxidation number can be calculated.



In [4]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline









    



/usr/local/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')



In [5]:

    
Df=pd.read_csv("../data/FingerPrint_lt50.csv",sep='\t',index_col=0)



In [6]:

    
Df.head()









    Out[6]:






  
    
      
      Formula
      Ones_1
      Ones_2
      Ones_3
      Ones_4
      Ones_5
      Ones_6
      Ones_7
      Ones_8
      Ones_9
      ...
      Oxi_91
      Oxi_92
      Oxi_93
      Oxi_94
      Oxi_95
      Oxi_96
      Oxi_97
      Oxi_98
      Oxi_99
      Oxi_100
    
  
  
    
      0
      Nb1 Ag1 O3
      -1.0
      -1.0
      -1.0
      -1.000000
      -1.00000
      -1.000000
      -1.000000
      -1.000000
      -1.000000
      ...
      -0.012953
      -0.095604
      -0.104102
      -0.183833
      -0.356473
      -0.505818
      -0.536482
      -0.473015
      -0.425665
      -0.483642
    
    
      1
      Li2 Ag6 O4
      -1.0
      -1.0
      -1.0
      -1.000000
      -1.00000
      -1.000000
      -1.000000
      -1.000000
      -0.999999
      ...
      -0.097331
      -0.222620
      -0.219186
      -0.149412
      -0.046637
      0.069006
      0.131393
      0.098103
      -0.040059
      -0.296037
    
    
      2
      Cs2 Ag2 Cl4
      -1.0
      -1.0
      -1.0
      -1.000000
      -1.00000
      -1.000000
      -1.000000
      -1.000000
      -1.000000
      ...
      0.136590
      0.131849
      -0.020244
      -0.112980
      -0.003839
      0.224747
      0.338938
      0.204959
      -0.099369
      -0.436527
    
    
      3
      Ag2 Hg1 I4
      -1.0
      -1.0
      -1.0
      -1.000000
      -1.00000
      -1.000000
      -1.000000
      -1.000000
      -1.000000
      ...
      -0.055794
      -0.147931
      -0.240709
      -0.256481
      -0.141461
      0.054555
      0.253675
      0.414680
      0.427099
      0.152495
    
    
      4
      Ag2 C2 O6
      -1.0
      -1.0
      -1.0
      -0.999999
      -0.99997
      -0.999462
      -0.993801
      -0.954192
      -0.782973
      ...
      -0.153798
      -0.262954
      -0.317802
      -0.218989
      0.012370
      0.284588
      0.437742
      0.290560
      -0.120417
      -0.541639
    
  

5 rows × 401 columns

We load up all the formulas and the fingerprints and check that they have the correct shapes



In [7]:

    
Formulas=Df["Formula"].values



In [8]:

    
Fingerprint=Df.drop("Formula",axis=1).values



In [6]:

    
Fingerprint.shape









    Out[6]:





(14722, 400)



In [7]:

    
Formulas.shape









    Out[7]:





(14722,)

Checking how many components of PCA we need

We perform hundred component PCA on the fingerprints and then check the number of components we want to keep in order to retain a large part of the covariance. We plot the total covariance captured as a function of number of components kept in the plot below



In [9]:

    
from sklearn.decomposition import PCA



In [10]:

    
pca=PCA(n_components=100)



In [11]:

    
pca_fing=pca.fit_transform(Fingerprint)



In [13]:

    
plt.plot(np.arange(1,101),np.cumsum(pca.explained_variance_ratio_))
plt.ylabel("Explained Variance Ratio")
plt.xlabel("Number of Components")









    Out[13]:





<matplotlib.text.Text at 0x1147f7890>

We enumerate the elements of the plot below



In [12]:

    
list(enumerate(np.cumsum(pca.explained_variance_ratio_)))









    Out[12]:





[(0, 0.15492476023975613),
 (1, 0.26728530460418154),
 (2, 0.37097694951145038),
 (3, 0.44911330754675549),
 (4, 0.50686173834960302),
 (5, 0.54818714088119103),
 (6, 0.58323300926758548),
 (7, 0.61567526741159406),
 (8, 0.64623301261143551),
 (9, 0.6753853103743449),
 (10, 0.70011819172487566),
 (11, 0.72254301595249826),
 (12, 0.74121355408205825),
 (13, 0.7579571516374779),
 (14, 0.77366155752734567),
 (15, 0.78832712266163718),
 (16, 0.80150606731796237),
 (17, 0.8130421783532743),
 (18, 0.82345035235024044),
 (19, 0.83304796471075315),
 (20, 0.84128592259890633),
 (21, 0.84891253858063653),
 (22, 0.85612295472810085),
 (23, 0.86324319793702187),
 (24, 0.86963881334327131),
 (25, 0.87567663526488293),
 (26, 0.8814613061272828),
 (27, 0.88708658630136927),
 (28, 0.89220666872309595),
 (29, 0.89720519583809344),
 (30, 0.90199058325264481),
 (31, 0.90650415919498806),
 (32, 0.910664538351043),
 (33, 0.91475877696229735),
 (34, 0.91832314786481828),
 (35, 0.92184296738755589),
 (36, 0.92502799745327802),
 (37, 0.92797838048097792),
 (38, 0.93083995533706443),
 (39, 0.93359494994852588),
 (40, 0.93629850693996486),
 (41, 0.93885631469169761),
 (42, 0.94118796491430512),
 (43, 0.94346161380692661),
 (44, 0.9456581168116599),
 (45, 0.94776312192981538),
 (46, 0.94979411263935365),
 (47, 0.95172814689819907),
 (48, 0.95353701589368189),
 (49, 0.95533286673473061),
 (50, 0.95702062847558489),
 (51, 0.95860214376507369),
 (52, 0.96010822881070557),
 (53, 0.9615894937874655),
 (54, 0.96294043321718892),
 (55, 0.96424470224888537),
 (56, 0.96551317435166328),
 (57, 0.96671472402433367),
 (58, 0.96787841304338018),
 (59, 0.96899651264141484),
 (60, 0.9701019619923319),
 (61, 0.97118533711317234),
 (62, 0.97222455091039539),
 (63, 0.97324018959278835),
 (64, 0.97422451626610385),
 (65, 0.97515178406429515),
 (66, 0.97604814105988658),
 (67, 0.97690252220910234),
 (68, 0.97770769793947354),
 (69, 0.97848200440333211),
 (70, 0.97923812684516931),
 (71, 0.97995537929644994),
 (72, 0.98064971759334563),
 (73, 0.98132967064422572),
 (74, 0.98195180709688024),
 (75, 0.98256518620625544),
 (76, 0.98317100515889366),
 (77, 0.98375568419439641),
 (78, 0.98433561619439247),
 (79, 0.98489320453973905),
 (80, 0.98542745237298235),
 (81, 0.98594422304702989),
 (82, 0.98645136251061649),
 (83, 0.9869401532124682),
 (84, 0.9874225119646699),
 (85, 0.98788801135567161),
 (86, 0.98833163749463104),
 (87, 0.98876475120261154),
 (88, 0.98917907792733772),
 (89, 0.98956437454205848),
 (90, 0.98993744486144664),
 (91, 0.99030179400213569),
 (92, 0.99065429916676195),
 (93, 0.99100122998011819),
 (94, 0.99133936412772694),
 (95, 0.99166729025381883),
 (96, 0.99198146956530953),
 (97, 0.99228373293134642),
 (98, 0.99257719807720723),
 (99, 0.99286402707305377)]

With 50 components we might already be pushing the limits for DBSCAN clustering which we need to try. So lets stick to 50 and then ramp it up if necessary. We are capturing almost 96% of the covariance in this scenario



In [13]:

    
pca_fing50=pca_fing[:,0:50]

Clustering using Kmeans

We perform 15 cluster Kmeans on this 50 component pca fingerprint to get representative clusters.



In [14]:

    
from sklearn.cluster import KMeans



In [15]:

    
Km=KMeans(n_clusters=15,random_state=42,n_init=50)



In [16]:

    
clust_km50=Km.fit_predict(pca_fing50)

We output the number of elements in each cluster



In [17]:

    
from collections import Counter
print Counter(clust_km50)









    



Counter({5: 1690, 2: 1595, 11: 1459, 0: 1407, 6: 1276, 13: 1251, 10: 1229, 1: 868, 8: 822, 12: 818, 7: 777, 3: 601, 4: 550, 14: 283, 9: 96})

Sorting the cluster centers in order of distance from cluster 0 to obtain dissimilarity matrix

We rename the clusters in ascending order based on distance from cluster 0. Then we reorder the fingerprints by cluster number and then compute similarity matrix based on the euclidean distance



In [147]:

    
from sklearn.metrics.pairwise import euclidean_distances

Using Pandas sorting routines to sort the Fingerprints.



In [19]:

    
dist_center=euclidean_distances(Km.cluster_centers_,Km.cluster_centers_[0])

dist_sorted=sorted(dist_center)
sort_key=[]
for i in range(15):
    sort_key.append(np.where(dist_center==dist_sorted[i])[0][0])









    



/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)



In [20]:

    
clust_km50_s=np.zeros(len(clust_km50))
for i in range(15):
    clust_km50_s[clust_km50==sort_key[i]]=int(i)



In [21]:

    
Counter(clust_km50_s)









    Out[21]:





Counter({0.0: 1407,
         1.0: 1595,
         2.0: 1690,
         3.0: 1251,
         4.0: 1229,
         5.0: 601,
         6.0: 1459,
         7.0: 1276,
         8.0: 822,
         9.0: 868,
         10.0: 550,
         11.0: 818,
         12.0: 283,
         13.0: 777,
         14.0: 96})



In [22]:

    
Df["clust_pca50"]=clust_km50_s



In [23]:

    
Df_sorted=Df.sort_values(by="clust_pca50")



In [24]:

    
Formula_s=Df_sorted["Formula"]



In [25]:

    
Finger_s=Df_sorted.drop(["Formula","clust_pca50"],axis=1).values



In [26]:

    
Finger_s.shape









    Out[26]:





(14722, 400)

We now perform pca on this new sorted set of fingerprints. We could have just sorted the earlier pca fingerprints instead, however PCA is cheap and we did not have the PCA fingerprints in the pandas dataframe



In [27]:

    
pca2=PCA(n_components=50)
fing_pca50_s=pca2.fit_transform(Finger_s)
print np.sum(pca2.explained_variance_ratio_)









    



0.955332866735

We now calculate eulidean distances and plot the similarity matrix



In [28]:

    
dist_fing_s=euclidean_distances(fing_pca50_s)



In [29]:

    
np.amax(dist_fing_s)









    Out[29]:





21.518883998765343



In [31]:

    
clust_km50_s=Df_sorted["clust_pca50"].values
#fing_714=Finger_s[clust_km50_s>6]
#dist_fing_714=euclidean_distances(fing_714)
plt.figure(figsize=(10,10))
plt.imshow(dist_fing_s[::2,::2],cmap=plt.cm.magma)
plt.title("Similarity Matrix after sorting")
plt.colorbar()









    Out[31]:





<matplotlib.colorbar.Colorbar at 0x10c15dcd0>

Let us try to estimate eps and number of elements for DBSCAN from one tight cluster (cluster 13)



In [32]:

    
fing_13=fing_pca50_s[clust_km50_s==10]
plt.figure(figsize=(8,8))
plt.imshow(euclidean_distances(fing_13),cmap=plt.cm.viridis)
plt.colorbar()









    Out[32]:





<matplotlib.colorbar.Colorbar at 0x10e825390>

Testing DBSCAN (This didn't work so well)

I basically searched over a lot of eps and min_smaples values but never ended up getting very useful clusters. Either the number of unclassified points was too high or the clustering was terrible. We should probably do a proper GridSearchCV on this, but then the validation score isnt well defined. Most of the cells have been deleted in this section

Thought for later: Maybe use the adjusted_rand_score with the Kmens clusters as a validation metric.



In [71]:

    
from sklearn.cluster import DBSCAN



In [89]:

    
Db=DBSCAN(eps=1.5,min_samples=5,metric='precomputed')



In [90]:

    
for eps in np.linspace(0.5,12.0,25):
    Db.eps=eps
    clust_db=Db.fit_predict(dist_fing_s)
    print eps, Counter(clust_db)[-1], np.amax(clust_db)









    



0.5 14535 23
0.979166666667 13102 138
1.45833333333 10827 270
1.9375 8567 328
2.41666666667 6106 303
2.89583333333 3585 185
3.375 1979 85
3.85416666667 1043 47
4.33333333333 536 20
4.8125 236 6
5.29166666667 120 1
5.77083333333 53 0
6.25 20 0






    



---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
KeyboardInterrupt:

DBSCAN seems difficult to converge. Lets try Agglomerative CLustering

Another sensational Failure. The dataset is way too large and no reasonable number of clusters ever finishes running. May need to look into hyper-parameters more in depth, but nothing looks incredibly promising



In [161]:

    
from sklearn.cluster import AgglomerativeClustering



In [162]:

    
Ag=AgglomerativeClustering(n_clusters=5)



In [ ]:

    
clust_ag=Ag.fit_predict(fing_pca50_s[:,0:10])

Lets stick to kmeans and try other metrics

The Similarity matrix based on cosine distances looks rather prettier than the one based on euclidean distances



In [33]:

    
from sklearn.metrics.pairwise import cosine_distances



In [35]:

    
clust_km50_s=Df_sorted["clust_pca50"].values
dist_cosine_s=cosine_distances(fing_pca50_s)
plt.figure(figsize=(10,10))
plt.imshow(dist_cosine_s[::3,::3],cmap=plt.cm.magma)
plt.title("Similarity Matrix after sorting after using cosine distances")
plt.colorbar()









    Out[35]:





<matplotlib.colorbar.Colorbar at 0x10d4d3550>

Note that in the plot above we plot every second column and row because imshow otherwise took too long

Lets try 20 cluster Kmeans

We redo all the steps above for Kmeans with 20 clusters. Nothing super dramatic happens but the clusters tighten up a little bit, as expected



In [36]:

    
Km2=KMeans(n_clusters=20, random_state=42,n_init=50)



In [37]:

    
clust_km50_20=Km2.fit_predict(pca_fing50)



In [38]:

    
from sklearn.metrics import confusion_matrix



In [39]:

    
Counter(clust_km50_20)









    Out[39]:





Counter({0: 1072,
         1: 96,
         2: 1291,
         3: 284,
         4: 927,
         5: 920,
         6: 165,
         7: 417,
         8: 1196,
         9: 776,
         10: 569,
         11: 383,
         12: 566,
         13: 440,
         14: 581,
         15: 1242,
         16: 1088,
         17: 562,
         18: 832,
         19: 1315})

Lets reorder by this 20 cluster kmeans



In [40]:

    
dist_center2=euclidean_distances(Km2.cluster_centers_,Km2.cluster_centers_[0])

dist_sorted2=sorted(dist_center2)
sort_key2=[]
for i in range(20):
    sort_key2.append(np.where(dist_center2==dist_sorted2[i])[0][0])









    



/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)



In [41]:

    
print sort_key2









    



[0, 8, 5, 4, 15, 19, 18, 10, 2, 16, 14, 9, 12, 17, 13, 11, 3, 6, 7, 1]



In [42]:

    
clust_km50_s2=np.zeros(len(clust_km50_20),dtype=int)
for i in range(20):
    clust_km50_s2[clust_km50_20==sort_key2[i]]=int(i)



In [43]:

    
Counter(clust_km50_s2)









    Out[43]:





Counter({0: 1072,
         1: 1196,
         2: 920,
         3: 927,
         4: 1242,
         5: 1315,
         6: 832,
         7: 569,
         8: 1291,
         9: 1088,
         10: 581,
         11: 776,
         12: 566,
         13: 562,
         14: 440,
         15: 383,
         16: 284,
         17: 165,
         18: 417,
         19: 96})



In [44]:

    
Df["clust_pca50_20"]=clust_km50_s2



In [45]:

    
Df_sorted2=Df.sort_values(by="clust_pca50_20")



In [46]:

    
Df_sorted2.head()









    Out[46]:






  
    
      
      Formula
      Ones_1
      Ones_2
      Ones_3
      Ones_4
      Ones_5
      Ones_6
      Ones_7
      Ones_8
      Ones_9
      ...
      Oxi_93
      Oxi_94
      Oxi_95
      Oxi_96
      Oxi_97
      Oxi_98
      Oxi_99
      Oxi_100
      clust_pca50
      clust_pca50_20
    
  
  
    
      9696
      Na2 Ta6 O16
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -1.000000
      -1.000000
      ...
      -0.311716
      -0.355335
      -0.296676
      -0.130305
      0.056547
      0.137849
      0.036604
      -0.228062
      5.0
      0
    
    
      2340
      Ba1 Er2 F8
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -1.000000
      -1.000000
      ...
      0.081689
      0.064050
      0.008039
      -0.048995
      -0.056819
      -0.026578
      -0.052168
      -0.227748
      0.0
      0
    
    
      11096
      Pr12 Re4 O32
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -1.000000
      -0.999995
      ...
      0.100897
      0.070573
      -0.004767
      -0.091327
      -0.136998
      -0.160468
      -0.249889
      -0.447702
      0.0
      0
    
    
      5159
      Fe1 Sn1 F6
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -1.000000
      -0.999999
      ...
      -0.200641
      -0.036734
      0.101498
      0.146109
      0.092443
      -0.029515
      -0.178233
      -0.362784
      0.0
      0
    
    
      13213
      Mn2 Cd2 F10
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -1.0
      -0.999999
      -0.999985
      ...
      -0.015693
      -0.070280
      -0.091334
      -0.093592
      -0.097708
      -0.114368
      -0.194406
      -0.391995
      0.0
      0
    
  

5 rows × 403 columns



In [47]:

    
Formula_s2=Df_sorted2["Formula"]
Finger_s2=Df_sorted2.drop(["Formula","clust_pca50","clust_pca50_20"],axis=1).values



In [97]:

    
Finger_s2.shape









    Out[97]:





(14722, 400)



In [49]:

    
pca3=PCA(n_components=50)
finger_pca_s2=pca3.fit_transform(Finger_s2)



In [50]:

    
dist_fing_s2=euclidean_distances(finger_pca_s2)



In [51]:

    
plt.figure(figsize=(10,10))
plt.imshow(dist_fing_s2[::3,::3],cmap=plt.cm.magma)
plt.title("Similarity Matrix after sorting")
plt.colorbar()









    Out[51]:





<matplotlib.colorbar.Colorbar at 0x10d75d650>



In [52]:

    
dist_cosine_s2=cosine_distances(finger_pca_s2)



In [148]:

    
plt.figure(figsize=(10,10))
plt.imshow(dist_cosine_s2[::3,::3],cmap=plt.cm.magma)
plt.title("Similarity Matrix after sorting using cosine distances")
plt.colorbar()









    Out[148]:





<matplotlib.colorbar.Colorbar at 0x10efedf90>

Lets try DBSCAN with the cosine distances

My terrible luck with DBSCAN continues. Basically because the similarity matrix looked tighter with cosine distances, I tried DBSCAN with cosine distances. The same problems as earlier persist though



In [55]:

    
from sklearn.cluster import DBSCAN
Db_try=DBSCAN(eps=0.1,min_samples=5,metric='precomputed')



In [104]:

    
Db_try.eps=0.12
Db_try.min_samples=9
Db_cluster_cos=Db_try.fit_predict(dist_cosine_s2)
print Counter(Db_cluster_cos)









    



Counter({-1: 6025, 13: 3448, 1: 1118, 84: 862, 0: 849, 113: 165, 34: 116, 68: 95, 29: 93, 114: 90, 37: 77, 65: 71, 4: 58, 94: 55, 22: 52, 110: 43, 69: 36, 67: 34, 41: 33, 27: 32, 75: 30, 59: 28, 101: 28, 76: 27, 98: 27, 111: 27, 5: 25, 96: 25, 8: 24, 46: 24, 56: 24, 58: 24, 2: 23, 38: 23, 23: 22, 109: 22, 35: 20, 47: 18, 54: 18, 90: 18, 17: 17, 24: 17, 30: 17, 44: 17, 11: 16, 36: 16, 112: 16, 9: 15, 48: 15, 53: 15, 64: 15, 72: 15, 85: 15, 104: 15, 106: 15, 3: 14, 7: 14, 19: 14, 25: 14, 66: 14, 91: 14, 93: 14, 6: 13, 12: 13, 57: 13, 73: 13, 79: 13, 83: 13, 97: 13, 102: 13, 42: 12, 74: 12, 78: 12, 82: 12, 26: 11, 43: 11, 50: 11, 60: 11, 70: 11, 71: 11, 81: 11, 108: 11, 14: 10, 15: 10, 16: 10, 20: 10, 21: 10, 31: 10, 40: 10, 45: 10, 52: 10, 61: 10, 62: 10, 77: 10, 80: 10, 86: 10, 87: 10, 88: 10, 95: 10, 100: 10, 10: 9, 18: 9, 28: 9, 32: 9, 33: 9, 55: 9, 89: 9, 92: 9, 99: 9, 103: 9, 107: 9, 105: 8, 39: 7, 51: 7, 63: 7, 115: 7, 49: 4})



In [75]:

    
Counter(Db_cluster_cos)









    Out[75]:





Counter({-1: 4723,
         0: 8377,
         1: 38,
         2: 14,
         3: 58,
         4: 27,
         5: 15,
         6: 17,
         7: 14,
         8: 17,
         9: 14,
         10: 44,
         11: 13,
         12: 19,
         13: 11,
         14: 39,
         15: 33,
         16: 16,
         17: 19,
         18: 11,
         19: 26,
         20: 24,
         21: 88,
         22: 13,
         23: 35,
         24: 14,
         25: 8,
         26: 45,
         27: 24,
         28: 11,
         29: 32,
         30: 11,
         31: 24,
         32: 16,
         33: 11,
         34: 15,
         35: 19,
         36: 16,
         37: 11,
         38: 14,
         39: 44,
         40: 20,
         41: 137,
         42: 11,
         43: 21,
         44: 7,
         45: 17,
         46: 46,
         47: 32,
         48: 27,
         49: 12,
         50: 13,
         51: 11,
         52: 11,
         53: 16,
         54: 10,
         55: 11,
         56: 12,
         57: 11,
         58: 22,
         59: 165,
         60: 28,
         61: 10,
         62: 13,
         63: 13,
         64: 15,
         65: 11})

Vanity Project: Testing out label propogation

We label the first 10 elements of each of the 15 cluster KMeans with the cluster number and see how Kmeans compares with label propogation. The resulting similarity matrix ( after resorting) is actually pretty reasonable. But Kmeans still gives you the prettiest clusters



In [138]:

    
Ylabel=np.ones(len(clust_km50_s),dtype=int)*-1
Ylabel[0:10]=0
for i in range(1,15):
    Ylabel[np.where(clust_km50_s==i)[0][0:10]]=i



In [139]:

    
from sklearn.semi_supervised import LabelPropagation
Lab=LabelPropagation()



In [140]:

    
Lab=LabelPropagation(gamma=0.5)
Lab.fit(fing_pca50_s,Ylabel)
clusts_lab=Lab.predict(fing_pca50_s)
Counter(clusts_lab)









    Out[140]:





Counter({0: 1296,
         1: 1446,
         2: 1053,
         3: 1446,
         4: 1015,
         5: 837,
         6: 1040,
         7: 967,
         8: 706,
         9: 879,
         10: 1245,
         11: 1092,
         12: 277,
         13: 1330,
         14: 93})



In [127]:

    
from sklearn.metrics import adjusted_rand_score,confusion_matrix,accuracy_score



In [141]:

    
confusion_matrix(clust_km50_s,clusts_lab)









    Out[141]:





array([[ 990,   83,   13,   75,   38,  153,    4,   23,    0,    0,   27,
           0,    0,    1,    0],
       [  76, 1130,    0,  240,    0,    1,   61,    4,   72,    3,    8,
           0,    0,    0,    0],
       [  75,   17,  670,    1,   50,  112,    1,   75,   11,   14,  463,
           2,    0,  199,    0],
       [  15,    7,    2,  873,   22,    0,   81,   23,    5,    0,  212,
          11,    0,    0,    0],
       [  91,    3,   33,    4,  871,    1,   26,   35,    0,  106,    8,
           1,    0,   50,    0],
       [   6,    0,   19,    0,    0,  558,    0,    0,    0,    0,    0,
           4,   14,    0,    0],
       [   9,   78,  138,  138,   11,    0,  804,   32,    3,   96,    4,
          42,    0,  104,    0],
       [   5,    0,  156,   57,    5,    3,    7,  753,    1,    0,    3,
         281,    3,    2,    0],
       [   9,  128,    4,   11,    8,    8,   29,    3,  613,    0,    0,
           0,    0,    9,    0],
       [   1,    0,   13,    0,    0,    0,    6,    0,    0,  634,    0,
           0,    0,  214,    0],
       [   2,    0,    0,   27,    0,    0,    0,    0,    0,    0,  519,
           2,    0,    0,    0],
       [   0,    0,    3,   20,   10,    0,   21,   15,    1,    2,    1,
         745,    0,    0,    0],
       [  17,    0,    0,    0,    0,    1,    0,    4,    0,    0,    0,
           1,  260,    0,    0],
       [   0,    0,    2,    0,    0,    0,    0,    0,    0,   24,    0,
           0,    0,  751,    0],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           3,    0,    0,   93]])



In [142]:

    
adjusted_rand_score(clust_km50_s,clusts_lab)









    Out[142]:





0.46570994459981463



In [143]:

    
Counter(clusts_lab)









    Out[143]:





Counter({0: 1296,
         1: 1446,
         2: 1053,
         3: 1446,
         4: 1015,
         5: 837,
         6: 1040,
         7: 967,
         8: 706,
         9: 879,
         10: 1245,
         11: 1092,
         12: 277,
         13: 1330,
         14: 93})



In [144]:

    
Df_sorted["clust_lab"]=clusts_lab



In [133]:

    
Df_sorted.columns









    Out[133]:





Index([u'Formula', u'Ones_1', u'Ones_2', u'Ones_3', u'Ones_4', u'Ones_5',
       u'Ones_6', u'Ones_7', u'Ones_8', u'Ones_9',
       ...
       u'Oxi_93', u'Oxi_94', u'Oxi_95', u'Oxi_96', u'Oxi_97', u'Oxi_98',
       u'Oxi_99', u'Oxi_100', u'clust_pca50', u'clust_lab'],
      dtype='object', length=403)



In [145]:

    
Df_sort_lab=Df_sorted.sort_values(by="clust_lab")
fing_lab_s=Df_sort_lab.drop(['Formula','clust_pca50','clust_lab'],axis=1).values
pca4=PCA(n_components=50)
fing_lab_pca_s=pca4.fit_transform(fing_lab_s)
dist_lab_cos_s=cosine_distances(fing_lab_pca_s)



In [149]:

    
plt.figure(figsize=(10,10))
plt.imshow(dist_lab_cos_s[::3,::3],cmap=plt.cm.magma)
plt.title("Similarity Matrix after sorting by Label Propogation Algorithm")
plt.colorbar()









    Out[149]:





<matplotlib.colorbar.Colorbar at 0x1190f80d0>

Note that this is no way means Kmeans is awesome or that Label Propogation sucks. We were using the Kmeans labels to label things, so the best it could do was reproduce Kmeans, more or less. It is actually impressive that it gives pretty reasonable results. Higher number of labels would help, but this is not worth redoing as we know what we will get



In [ ]:

	Formula	Ones_1	Ones_2	Ones_3	Ones_4	Ones_5	Ones_6	Ones_7	Ones_8	Ones_9	...	Oxi_91	Oxi_92	Oxi_93	Oxi_94	Oxi_95	Oxi_96	Oxi_97	Oxi_98	Oxi_99	Oxi_100
0	Nb1 Ag1 O3	-1.0	-1.0	-1.0	-1.000000	-1.00000	-1.000000	-1.000000	-1.000000	-1.000000	...	-0.012953	-0.095604	-0.104102	-0.183833	-0.356473	-0.505818	-0.536482	-0.473015	-0.425665	-0.483642
1	Li2 Ag6 O4	-1.0	-1.0	-1.0	-1.000000	-1.00000	-1.000000	-1.000000	-1.000000	-0.999999	...	-0.097331	-0.222620	-0.219186	-0.149412	-0.046637	0.069006	0.131393	0.098103	-0.040059	-0.296037
2	Cs2 Ag2 Cl4	-1.0	-1.0	-1.0	-1.000000	-1.00000	-1.000000	-1.000000	-1.000000	-1.000000	...	0.136590	0.131849	-0.020244	-0.112980	-0.003839	0.224747	0.338938	0.204959	-0.099369	-0.436527
3	Ag2 Hg1 I4	-1.0	-1.0	-1.0	-1.000000	-1.00000	-1.000000	-1.000000	-1.000000	-1.000000	...	-0.055794	-0.147931	-0.240709	-0.256481	-0.141461	0.054555	0.253675	0.414680	0.427099	0.152495
4	Ag2 C2 O6	-1.0	-1.0	-1.0	-0.999999	-0.99997	-0.999462	-0.993801	-0.954192	-0.782973	...	-0.153798	-0.262954	-0.317802	-0.218989	0.012370	0.284588	0.437742	0.290560	-0.120417	-0.541639

	Formula	Ones_1	Ones_2	Ones_3	Ones_4	Ones_5	Ones_6	Ones_7	Ones_8	Ones_9	...	Oxi_93	Oxi_94	Oxi_95	Oxi_96	Oxi_97	Oxi_98	Oxi_99	Oxi_100	clust_pca50
9696	Na2 Ta6 O16	-1.0	-1.0	-1.0	-1.0	-1.0	-1.0	-1.0	-1.000000	-1.000000	...	-0.311716	-0.355335	-0.296676	-0.130305	0.056547	0.137849	0.036604	-0.228062	5.0
2340	Ba1 Er2 F8	-1.0	-1.0	-1.0	-1.0	-1.0	-1.0	-1.0	-1.000000	-1.000000	...	0.081689	0.064050	0.008039	-0.048995	-0.056819	-0.026578	-0.052168	-0.227748	0.0
11096	Pr12 Re4 O32	-1.0	-1.0	-1.0	-1.0	-1.0	-1.0	-1.0	-1.000000	-0.999995	...	0.100897	0.070573	-0.004767	-0.091327	-0.136998	-0.160468	-0.249889	-0.447702	0.0
5159	Fe1 Sn1 F6	-1.0	-1.0	-1.0	-1.0	-1.0	-1.0	-1.0	-1.000000	-0.999999	...	-0.200641	-0.036734	0.101498	0.146109	0.092443	-0.029515	-0.178233	-0.362784	0.0
13213	Mn2 Cd2 F10	-1.0	-1.0	-1.0	-1.0	-1.0	-1.0	-1.0	-0.999999	-0.999985	...	-0.015693	-0.070280	-0.091334	-0.093592	-0.097708	-0.114368	-0.194406	-0.391995	0.0