Lets cluster these Fingerprints

In this notebook we perform clustering tests on the fingerprint functions. Note that at this point we don't yet have the oxidation state fingerprints and only have those for ones, Z and Chi. We do the clustering with oxidation state fingerprints added in a later notebook. We have the fingerprints and formulae saved in the Pandas DataFrame Fingerprint_lt50_old.csv file. Note that we only consider compounds where the total number of atoms in the unit cell <50 and for which oxidation number can be calculated. Also note that the struct files are from the materials project which are not always accurate.

Loading the CSV into Pandas dataframe and extracting formulae and fingerprints


In [6]:
import pandas as pd
import numpy as np
import tqdm
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

In [7]:
%matplotlib inline

In [8]:
Df=pd.read_csv("../data/FingerPrint_lt50_old.csv",sep='\t',index_col=0)

In [9]:
Df.head()


Out[9]:
Formula Ones_1 Ones_2 Ones_3 Ones_4 Ones_5 Ones_6 Ones_7 Ones_8 Ones_9 ... Chi_91 Chi_92 Chi_93 Chi_94 Chi_95 Chi_96 Chi_97 Chi_98 Chi_99 Chi_100
0 Nb1 Ag1 O3 -1.0 -1.0 -1.0 -1.000000 -1.00000 -1.000000 -1.000000 -1.000000 -1.000000 ... -0.497277 -0.453894 -0.191895 0.064329 0.104619 -0.118050 -0.394161 -0.537855 -0.587967 -0.654697
1 Li2 Ag6 O4 -1.0 -1.0 -1.0 -1.000000 -1.00000 -1.000000 -1.000000 -1.000000 -0.999999 ... -0.120349 -0.206105 -0.217994 -0.183563 -0.075639 0.073079 0.137550 0.062229 -0.112490 -0.359644
2 Cs2 Ag2 Cl4 -1.0 -1.0 -1.0 -1.000000 -1.00000 -1.000000 -1.000000 -1.000000 -1.000000 ... 0.259454 0.218056 0.031980 -0.098497 -0.054451 0.137466 0.285717 0.202032 -0.097068 -0.462973
3 Ag2 Hg1 I4 -1.0 -1.0 -1.0 -1.000000 -1.00000 -1.000000 -1.000000 -1.000000 -1.000000 ... -0.059718 -0.141794 -0.241085 -0.272397 -0.160438 0.054698 0.275458 0.446003 0.474980 0.227803
4 Ag2 C2 O6 -1.0 -1.0 -1.0 -0.999999 -0.99997 -0.999462 -0.993801 -0.954192 -0.782973 ... -0.123078 -0.202793 -0.201147 -0.164696 -0.108094 0.009786 0.153578 0.153906 -0.070010 -0.394319

5 rows × 301 columns


In [10]:
Formulas=Df["Formula"].values
Formulas


Out[10]:
array(['Nb1 Ag1 O3', 'Li2 Ag6 O4', 'Cs2 Ag2 Cl4', ..., 'Y6 Co6 O18',
       'Y1 Co1 F5', 'Mg1 Co4 S8'], dtype=object)

In [11]:
Fingerprints=Df.drop("Formula",axis=1).values

In [12]:
from sklearn.metrics import euclidean_distances

Testing out Inertia for Kmeans with different n_clusters to see if there is a well defined "elbow" or a sharp drop.


In [13]:
from sklearn.cluster import KMeans

In [15]:
inertia_array=np.zeros(19)
for i in tqdm.tqdm_notebook(range(2,20)):
    Km_run=KMeans(n_clusters=i)
    Km_run.fit_predict(Fingerprints)
    inertia_array[i-2]=Km_run.inertia_




In [16]:
inertia_array2=np.zeros(6)
for i in tqdm.tqdm_notebook(range(20,26)):
    Km_run=KMeans(n_clusters=i)
    Km_run.fit_predict(Fingerprints)
    inertia_array2[i-20]=Km_run.inertia_




In [32]:
inertia_array=np.append(inertia_array,inertia_array2)

In [41]:
inertia_array=inertia_array[inertia_array>0]

In [42]:
plt.figure(figsize=(8,8))
plt.plot(np.arange(2,26),inertia_array)
plt.ylabel("Inertia")
plt.xlabel("Number of Clusters")


Out[42]:
[<matplotlib.lines.Line2D at 0x1124a3f90>]

We decide on 15 clusters. There is no real elbow to speak of and we shall see later that 15 gives us fairly tight clusters

Lets use PCA to make the dataset smaller

We use the explained variance ratio to decide the number of components. We keep 50 components which keeps ~97% of variance


In [8]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
pc=PCA(n_components=50)

In [9]:
pca_fing=pc.fit_transform(Fingerprints)

In [10]:
list(enumerate(np.cumsum(pc.explained_variance_ratio_)))


Out[10]:
[(0, 0.16149118125330231),
 (1, 0.27367676124440887),
 (2, 0.37250488053590131),
 (3, 0.44928486557661768),
 (4, 0.50712061537192721),
 (5, 0.54981904927756264),
 (6, 0.5853508611030771),
 (7, 0.6199933456382144),
 (8, 0.65102686933536513),
 (9, 0.68001285644235121),
 (10, 0.70764070985117467),
 (11, 0.73034622458312459),
 (12, 0.7499285382066283),
 (13, 0.76809153805162245),
 (14, 0.78435923367535199),
 (15, 0.79901524025260173),
 (16, 0.81279379573299504),
 (17, 0.82483369075007018),
 (18, 0.83574923304734461),
 (19, 0.84572638458306681),
 (20, 0.8541211132340204),
 (21, 0.86218460989479784),
 (22, 0.86988930772944628),
 (23, 0.87697220787494989),
 (24, 0.88343686872852345),
 (25, 0.88968751572190052),
 (26, 0.89574043598524988),
 (27, 0.90158254950196837),
 (28, 0.90718878573617356),
 (29, 0.91206621481073447),
 (30, 0.91678640751493468),
 (31, 0.92125512084878669),
 (32, 0.92520925187091885),
 (33, 0.92894818633254295),
 (34, 0.93261969138669665),
 (35, 0.93603947790603437),
 (36, 0.93922954087316757),
 (37, 0.94219798220305251),
 (38, 0.94504203146948418),
 (39, 0.947809515903),
 (40, 0.95046690454845262),
 (41, 0.9528462556121835),
 (42, 0.95513130271862212),
 (43, 0.95728572237699527),
 (44, 0.9593794476669365),
 (45, 0.96138569806994756),
 (46, 0.96329077418826958),
 (47, 0.96499316695709192),
 (48, 0.9665820358594166),
 (49, 0.96811447180119914)]

Performing KMeans with 15 clusters on these 50-component PCA fingerprints


In [10]:
Km_pca50=KMeans(n_clusters=15,random_state=42)

In [11]:
clusts=Km_pca50.fit_predict(pca_fing)

Now for some fancy plots!


In [14]:
rst=np.random.RandomState(42)
c_arr=[tuple(rst.uniform(0,1,4)) for i in range(15)]

Plotting the points based on their cluster number as a function of their first 2 PCA components. The labels are the formula of the first compound in the cluster.


In [12]:
labels=[Formulas[clusts==i][0] for i in sorted(set(clusts))]
list(enumerate(labels))


Out[12]:
[(0, 'Er2 Ag2 Te4'),
 (1, 'Ag2 Ru2 O8'),
 (2, 'Y2 Ag2 Te4'),
 (3, 'Li2 Ag1 Sb1'),
 (4, 'Ag2 C2 O6'),
 (5, 'Ce2 Ag2 Ge2'),
 (6, 'Li2 Ag6 O4'),
 (7, 'Ag4 Ru4 O12'),
 (8, 'Nb1 Ag1 O3'),
 (9, 'Mn2 Ag2 O6'),
 (10, 'Ag1 Bi1 S2'),
 (11, 'Li1 Ag1 Te1'),
 (12, 'Cs2 Ag2 Cl4'),
 (13, 'Ca2 Ag2 Bi2'),
 (14, 'Rb2 Ag2 O4')]

In [15]:
plt.figure(figsize=(18,12))
for i in range(15):
    plt.scatter(pca_fing[clusts==i,0],pca_fing[clusts==i,1],c=c_arr[i],label="Cluster "+labels[i])
plt.legend()


Out[15]:
<matplotlib.legend.Legend at 0x10d0a0690>

Doing the same thing in 3d for the first 3 PCA components. mplot3d isn't the best tool for 3d plotting due to occlusion effects. Final plots will need to be done with mayavi or something


In [17]:
%matplotlib inline
import mpl_toolkits.mplot3d.axes3d as p3
# Plot result
fig = plt.figure(figsize=(10,10))
ax = p3.Axes3D(fig)
#ax.view_init(7, -80)
for i in range(14,-1,-1):
    ax.plot3D(pca_fing[clusts==i,0],pca_fing[clusts==i,1],pca_fing[clusts==i,3],'o',c=c_arr[i],label="Cluster "+labels[i])
#plt.title('Without connectivity constraints (time %.2fs)' % elapsed_time)
plt.legend()


Out[17]:
<matplotlib.legend.Legend at 0x106458890>

We have discovered a cluster which seems to contain only perovsklites. We print out the formulas in this cluster ( using Pandas) and see that they are almost exclusively perovskites or very close


In [19]:
Df["cluster_pca50"]=clusts

In [20]:
perov_clust_forms=Df[Df["cluster_pca50"]==Df.ix[0]["cluster_pca50"]]["Formula"]

In [21]:
perov_clust_forms


Out[21]:
0          Nb1 Ag1 O3
32          Ag1 B1 O3
33         Zn1 Ag1 F3
86        Cs2 Ag2 Cl6
108       Tl1 Ag1 Cl3
132        Ni1 Ag1 F3
173        Rb1 Ag1 O3
208        Mg1 Ag1 F3
230       Tl2 Ag2 Cl6
231        K2 Ag2 Cl6
240       Tl1 Ag1 Br3
241       Rb1 Ag1 Br3
246        Ta2 Ag2 O6
256        Cu2 Ag2 F6
257        Mg2 Ag2 F6
262        Cu1 Ag1 F3
269        Rb2 Ag2 F6
272        Cs2 Ag2 F6
308        K4 Ag4 F12
332       Cs1 Ag1 Br3
342       Rb1 Ag1 Cl3
343        K1 Ag1 Cl3
407        Al1 Ag1 O3
413         Y1 Ag1 O3
429        Tl1 Ag1 F3
445        K1 Ag1 Br3
485        Sr1 Ag1 O3
504        La1 Al1 O3
505        La2 Al2 O6
512        Al1 Bi1 O3
             ...     
13883    Mn4 Tl4 Cl12
13947     Tl1 Ge1 Cl3
13950     Cs1 Tm1 Cl3
13967     Rb2 In2 Cl6
13968     Tl1 In1 Cl3
14003    Rb6 Mg6 Cl18
14030     Tl1 Cu1 Cl3
14041    Cs9 Hg9 Cl27
14053     Rb1 Sn1 Cl3
14067     Li1 Ti1 Cl3
14075    Rb4 Mg4 Cl12
14103      K1 Ge1 Cl3
14134     Rb1 Pb1 Cl3
14195      K1 Cr1 Cl3
14196      K2 Cr2 Cl6
14197     Cs1 Sm1 Cl3
14211     Cs2 Pb2 Cl6
14248      La2 Co2 O6
14249      Sm1 Co1 O3
14250      Nd1 Co1 O3
14251      Pr1 Co1 O3
14256      La1 Co1 O3
14289      Co1 Te1 O3
14317       K1 Co1 F3
14368      Sr1 Co1 O3
14396      Tl1 Co1 F3
14440     La4 Co3 O10
14445     Rb6 Co6 F18
14528      Gd1 Co1 O3
14678     Sr8 Co8 O23
Name: Formula, dtype: object

Since the clusters in Kmeans are arbitrarily ordered, so in order to get pretty similarity matrices, we need to reorder them. We choose to do this based on the distance of cluster center to cluster center of cluster 0. This is not terribly necessary but leads to nicer plots


In [18]:
from collections import Counter
Counter(clusts)


Out[18]:
Counter({0: 938,
         1: 1728,
         2: 1507,
         3: 97,
         4: 891,
         5: 777,
         6: 998,
         7: 550,
         8: 491,
         9: 1505,
         10: 492,
         11: 283,
         12: 1204,
         13: 1212,
         14: 2049})

We plot the similarity matrix for the cluster centers before(1st figure) and after (2nd figure) reordering. We see that the second figure much more ordered


In [20]:
dist_centers=euclidean_distances(Km_pca50.cluster_centers_)

In [21]:
print dist_centers[0]
plt.imshow(dist_centers)
plt.colorbar()


[  0.           3.93535105   3.90032104  11.7278847    4.79061938
   4.34218444   4.00873426   4.74158359   7.58895776   4.24982168
   6.47418226   7.54039307   5.19884462   4.48322222   4.32923034]
Out[21]:
<matplotlib.colorbar.Colorbar at 0x1040c2a90>

In [22]:
clust_centers=Km_pca50.cluster_centers_
sort_args=[np.argwhere(dist_centers[0]==i )[0][0] for i in np.sort(dist_centers[0])]
centers_sorted=clust_centers[[np.argwhere(dist_centers[0]==i )[0][0] for i in np.sort(dist_centers[0])]]
plt.imshow(euclidean_distances(centers_sorted))
plt.colorbar()


Out[22]:
<matplotlib.colorbar.Colorbar at 0x106e11b90>

We now resort the fingerprints based on these renamed cluster labels using pandas sort. We shall use these to plot similarity matrices which shall look more ordered


In [23]:
sort_args


Out[23]:
[0, 2, 1, 6, 9, 14, 5, 13, 7, 4, 12, 10, 11, 8, 3]

In [24]:
clusts_sorted=np.zeros(len(clusts),dtype=int)
for i in range(15):
    clusts_sorted[clusts==i]=sort_args.index(i)

In [25]:
Df["cluster_pca50"]=clusts_sorted
Df_sorted=Df.sort_values(by="cluster_pca50")
clust_sort=Df_sorted["cluster_pca50"].values
fing_sort=Df_sorted.drop(["cluster_pca50","Formula"],axis=1).values

In [26]:
Counter(clust_sort)


Out[26]:
Counter({0: 938,
         1: 1507,
         2: 1728,
         3: 998,
         4: 1505,
         5: 2049,
         6: 777,
         7: 1212,
         8: 550,
         9: 891,
         10: 1204,
         11: 492,
         12: 283,
         13: 491,
         14: 97})

We compute the euclidean distance between these fingerprints and store this distance matrix.As the dataset is large, we plot the whole similarity matrix in 2 seperate parts as shown below. We also plot the whole similarity matrix but after taking only every second row/column


In [29]:
fing_06=fing_sort[clust_sort<7]

dist_06=euclidean_distances(fing_06)

plt.figure(figsize=(10,10))
plt.title("Similarity Matrix for Clusters 0-6 after sorting")
plt.imshow(dist_06)
plt.colorbar()


Out[29]:
<matplotlib.colorbar.Colorbar at 0x10fbff610>

In [30]:
fing_714=fing_sort[clust_sort>6]

dist_714=euclidean_distances(fing_714)

plt.figure(figsize=(10,10))
plt.imshow(dist_714,cmap=plt.)
plt.title("Similarity Matrix for Clusters 7-14 after sorting")
plt.colorbar()


Out[30]:
<matplotlib.colorbar.Colorbar at 0x10ff49b50>

In [37]:
dist_014=euclidean_distances(fing_sort)
plt.figure(figsize=(10,10))
plt.imshow(dist_014[::2,::2])
plt.title("Similarity Matrix for all Clusters after sorting")
plt.colorbar()


Out[37]:
<matplotlib.colorbar.Colorbar at 0x1071454d0>

In [ ]:


In [31]:
fing_914=fing_sort[clust_sort>9]

dist_914=euclidean_distances(fing_914)

plt.figure(figsize=(10,10))
plt.title("Similarity Matrix for Clusters 10-14 after sorting")
plt.imshow(dist_914)
plt.colorbar()


Out[31]:
<matplotlib.colorbar.Colorbar at 0x10fff2a50>

We see that the last few clusters are very tight whereas the first few clusters, not so much. We probably need more clusters or these structures might be hard to seperate using Kmeans

Lets match results with Agglomerative clustering (Incredibly slow)

Tried Agglomerative clustering. However it is ridiculously slow and not usable even when reducing the number of PCA components


In [39]:
from sklearn.cluster import AgglomerativeClustering

In [40]:
Ag=AgglomerativeClustering(n_clusters=15)

In [42]:
pc2=PCA(n_components=10)
pca_fing_sort10=pc2.fit_transform(fing_sort)

In [47]:
import os

In [ ]:
clust_ag=Ag.fit_predict(pca_fing_sort40)

os.system('say "your program has finished"')

In [48]:
Counter(clust_ag)


Out[48]:
Counter({0: 585,
         1: 463,
         2: 425,
         3: 516,
         4: 470,
         5: 253,
         6: 234,
         7: 512,
         8: 143,
         9: 262,
         10: 263,
         11: 159,
         12: 292,
         13: 185,
         14: 238})

Lets refactor these structures into the periodic table

Now lets get onto the fun part. We shall create a new array refactor_array of dimensions 92x92x92x15, 92 being the number of elements in our dataset. Each elements [a,b,c,d] shall contain the number of structures having elements with atomic numbers (a,b,c) and in cluster D. This will allow is where in the periodic table all of these clusters lie


In [27]:
from pymatgen import Composition

In [28]:
Formulas_s=Df_sorted["Formula"].values

clusters_s=Df_sorted["cluster_pca50"].values

Comps_s=[Composition(i) for i in Formulas_s]

Z_max=max([ E.Z for comp in Comps_s for E in comp.elements ])

Z_array= np.array([[E.Z for E in comp.elements] for comp in Comps_s ])

In [113]:
Z_array.shape


Out[113]:
(14722, 3)

Note that the way I have set it up, the refactor array is 6 times degenerate as all permutations of (a,b,c) (the first three indices) have the same values stored


In [33]:
from itertools import permutations

In [402]:
refactor_array=np.zeros((Z_max,Z_max,Z_max,np.amax(clusters_s)+1))
for i in tqdm.tqdm_notebook(range(len(clusters_s))):
    for index in set(permutations(Z_array[i])):
        refactor_array[index[0]-1,index[1]-1,index[2]-1][clusters_s[i]]+=1




In [362]:
Counter(clusters_s)


Out[362]:
Counter({0: 938,
         1: 1507,
         2: 1728,
         3: 998,
         4: 1505,
         5: 2049,
         6: 777,
         7: 1212,
         8: 550,
         9: 891,
         10: 1204,
         11: 492,
         12: 283,
         13: 491,
         14: 97})

In [403]:
refactor_summed=np.sum(refactor_array,axis=3)

In [63]:
refactor_summed[7].shape


Out[63]:
(92, 92)

In [404]:
count=0
xarr=np.zeros(8464)
yarr=np.zeros(8464)
zarr=np.zeros(8464)
for (a,b),c in np.ndenumerate(refactor_summed[7]):
    if c>0:
        xarr[count]=a+1
        yarr[count]=b+1
        zarr[count]=c
        count=count+1

Plotting Barplots for all the Oxides ternaries as a function of atomic numbers


In [405]:
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
fig = plt.figure(figsize=(20,20))
ax = fig.add_subplot(111, projection='3d')
ax.view_init(45, 45)
ax.bar3d(xarr[0:count], yarr[0:count],np.zeros(count),0.85,0.85,zarr[0:count], color='c', zsort='average')
plt.title("Concentrations of all Ternaries with Oxygen in them",fontsize=32)
plt.xlabel("Z1-->",fontsize=32)
plt.ylabel("Z2-->",fontsize=32)


Out[405]:
<matplotlib.text.Text at 0x181616e90>

We now write up new code to create stacked barplots for all oxide ternaries where the different colors denote different clusters.


In [406]:
np.random.seed(13)
c2_arr=np.random.rand(15,4)
count=0
count8=0
xarr=np.zeros(8464)
yarr=np.zeros(8464)
zarr=np.zeros(8464)
dzarr=np.zeros(8464)
clust_arr=np.zeros((8464,4))
label_arr=np.zeros(8464)
Z_tag=7
for (a,b,c),d in np.ndenumerate(refactor_array[Z_tag]):
    if d>0:
        if c==8:
            count8=count8+1
        xarr[count]=a+1
        yarr[count]=b+1
        zarr[count]=np.sum(refactor_array[Z_tag,a,b,0:c])
        clust_arr[count]=c2_arr[c]
        label_arr[count]=c
        dzarr[count]=d
        count=count+1
xarr=xarr[0:count]
yarr=yarr[0:count]
zarr=zarr[0:count]
dzarr=dzarr[0:count]
clust_arr=clust_arr[0:count]
label_arr=label_arr[0:count]
print count


7800

In [407]:
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
fig = plt.figure(figsize=(25,25))

ax = fig.add_subplot(111, projection='3d')
ax.view_init(45, 45)
ax.bar3d(xarr, yarr,zarr,0.95,0.95,dzarr,color=clust_arr,zsort='average',alpha=0.7)
plt.title("Concentrations of all Ternaries with Oxygen by cluster",fontsize=32)
ax.set_xlabel("Z1-->",fontsize=32)
ax.set_ylabel("Z2-->",fontsize=32)
ax.set_zlabel("No of Compounds",fontsize=32)
proxies=[plt.Rectangle((0,0),2,2,fc=c2_arr[i]) for i in range(15)]
names=["Cluster"+str(i) for i in range(15)]
ax.legend(proxies,names)


Out[407]:
<matplotlib.legend.Legend at 0x167059210>

barplot for all oxide ternaries which belong to the perovskite cluster


In [410]:
np.random.seed(13)
c2_arr=np.random.rand(15,4)
count=0
xarr=np.zeros(8464)
yarr=np.zeros(8464)
zarr=np.zeros(8464)
dzarr=np.zeros(8464)
clust_arr=np.zeros((8464,4))
label_arr=np.zeros(8464)
Z_tag=7
for (a,b,c),d in np.ndenumerate(refactor_array[Z_tag]):
    if d>0 and c==13:
        xarr[count]=a+1
        yarr[count]=b+1
        clust_arr[count]=c2_arr[c]
        label_arr[count]=c
        dzarr[count]=d
        count=count+1
xarr=xarr[0:count]
yarr=yarr[0:count]
zarr=zarr[0:count]
dzarr=dzarr[0:count]
clust_arr=clust_arr[0:count]
label_arr=label_arr[0:count]

In [411]:
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection='3d')
ax.view_init(45, 45)
ax.bar3d(xarr, yarr,zarr,0.95,0.95,dzarr,color='c',zsort='average',alpha=0.7)
plt.title("Concentrations of all Ternaries with Oxygen in Perovskite cluster",fontsize=32)
ax.set_xlabel("Z1-->",fontsize=20)
ax.set_ylabel("Z2-->",fontsize=20)
ax.set_zlabel("No of Compounds",fontsize=20)


Out[411]:
<matplotlib.text.Text at 0x1765fa510>

We should probably plot only one half of all these graphs as the thing is symmetric about the diagonal


In [412]:
labels=[Formulas_s[clusters_s==i][0] for i in range(np.amax(clusters_s)+1)]
list(enumerate(labels))


Out[412]:
[(0, 'Li16 Fe2 S12'),
 (1, 'Sm4 Tm4 O12'),
 (2, 'Hg8 O2 F12'),
 (3, 'Mn1 In2 Te4'),
 (4, 'Na8 Ge8 Se20'),
 (5, 'Cs12 Bi8 O18'),
 (6, 'Ho1 Ge2 Rh2'),
 (7, 'Sm8 Ag4 Se14'),
 (8, 'Yb6 Mn6 O18'),
 (9, 'Sn2 C4 O8'),
 (10, 'K12 Ge4 Te12'),
 (11, 'Ba6 Zr1 O8'),
 (12, 'Nd2 O2 F2'),
 (13, 'K1 Ag1 Cl3'),
 (14, 'Si1 Hg1 P2')]

We write new code to plot stacked barplots for only a subset of all the clusters. Note that all of these code as well as the code required to plot the scatterplots found later have now been put in fingerprint_refactoring.py in the src folder


In [413]:
np.random.seed(45)
c2_arr=np.random.rand(15,4)

In [416]:
def get_barplot_arrays(ref_array,Z_tag,list_clusts):
    count=0
    xarr=np.zeros(8464)
    yarr=np.zeros(8464)
    zarr=np.zeros(8464)
    dzarr=np.zeros(8464)
    clust_arr=np.zeros((8464,4))
    #label_arr=np.zeros(8464)
    Z=Z_tag-1
    refactor_block=(ref_array[Z])[:,:,list_clusts]
    for (a,b,c),d in np.ndenumerate(refactor_block):
        if (d>0 and b>=a):
            xarr[count]=a+1
            yarr[count]=b+1
            zarr[count]=np.sum(refactor_block[a,b,0:c])
            clust_arr[count]=c2_arr[list_clusts[c]]
            dzarr[count]=d
            count=count+1
    xarr=xarr[0:count]
    yarr=yarr[0:count]
    zarr=zarr[0:count]
    dzarr=dzarr[0:count]
    clust_arr=clust_arr[0:count]
    #label_arr=label_arr[0:count]
    return (xarr,yarr,zarr,dzarr,clust_arr)

In [379]:
refactor_array.shape


Out[379]:
(92, 92, 92, 15)

In [417]:
%matplotlib inline

l_clust=[8,13,12]

(xarr,yarr,zarr,dzarr,clust_arr)=get_barplot_arrays(refactor_array,8,l_clust)


fig = plt.figure(figsize=(25,25))
ax = fig.add_subplot(111, projection='3d')
ax.view_init(45, 60)

ax.bar3d(xarr, yarr,zarr,0.95,0.95,dzarr,color=clust_arr,zsort='average',alpha=1.0)
plt.title("Concentrations of all Ternaries with Oxygen in a the clusters specified",fontsize=32)
ax.set_xlabel("Z1-->",fontsize=32)
ax.set_ylabel("Z2-->",fontsize=32)
ax.set_zlabel("No of Compounds",fontsize=32)
proxies=[plt.Rectangle((0,0),2,2,fc=c2_arr[i]) for i in l_clust]
names=["Cluster "+labels[i] for i in l_clust]
ax.legend(proxies,names,fontsize=28,loc=3)


Out[417]:
<matplotlib.legend.Legend at 0x171311350>

Rejigging code to only plot one half of these bar plots


In [418]:
def get_barplot_summed(ref_summed,Z_tag):
    count=0
    xarr=np.zeros(8464)
    yarr=np.zeros(8464)
    zarr=np.zeros(8464)
    dzarr=np.zeros(8464)
    clust_arr=np.zeros((8464,4))
    for (a,b),d in np.ndenumerate(ref_summed[Z_tag-1]):
        if (d>0 and b>=a):
            xarr[count]=a+1
            yarr[count]=b+1
            dzarr[count]=d
            count=count+1
    xarr=xarr[0:count]
    yarr=yarr[0:count]
    zarr=zarr[0:count]
    dzarr=dzarr[0:count]
    return (xarr,yarr,zarr,dzarr)

In [419]:
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
xarr,yarr,zarr,dzarr=get_barplot_summed(refactor_summed,8)
fig = plt.figure(figsize=(20,20))
ax = fig.add_subplot(111, projection='3d')
ax.view_init(40,-10)
ax.bar3d(xarr, yarr,zarr,0.85,0.85,dzarr, color='c', zsort='average')
plt.title("Concentrations of all Ternaries with Oxygen in them",fontsize=32)
plt.xlabel("Z1-->",fontsize=32)
plt.ylabel("Z2-->",fontsize=32)


Out[419]:
<matplotlib.text.Text at 0x190ad43d0>

Lets create Heat maps

We know plot heatmaps using scatter plot. These could also have been done using imshow but that has too much postprocessing and interpolation and is also slower. Note that the colorbars in all the scatterplts have been shortened in range so as to show more detail in the plots


In [420]:
def get_scatterplot_summed(ref_summed,Z_tag):
    count=0
    xarr=np.zeros(8464)
    yarr=np.zeros(8464)
    zarr=np.zeros(8464)
    dzarr=np.zeros(8464)
    clust_arr=np.zeros((8464,4))
    for (a,b),d in np.ndenumerate(ref_summed[Z_tag-1]):
        if d>0:
            xarr[count]=a+1
            yarr[count]=b+1
            dzarr[count]=d
            count=count+1
    xarr=xarr[0:count]
    yarr=yarr[0:count]
    zarr=zarr[0:count]
    dzarr=dzarr[0:count]
    return (xarr,yarr,zarr,dzarr)

In [421]:
%matplotlib inline
plt.rcParams["figure.subplot.wspace"]=0.05  
plt.rcParams["xtick.labelsize"]=18  
plt.rcParams["ytick.labelsize"]=18  
plt.figure(figsize=(25,12))
plt.subplot(1,2,1)
xarr,yarr,zarr,dzarr=get_scatterplot_summed(refactor_summed,8)
plt.scatter(xarr,yarr,c=dzarr,s=50)
plt.xlim([0,95])
plt.ylim([0,95])
plt.gca().set_xticks(np.arange(0,95,5))
plt.gca().set_yticks(np.arange(0,95,5))
plt.clim(0,10)
plt.colorbar()
xarr,yarr,zarr,dzarr=get_barplot_summed(refactor_summed,8)
ax = plt.subplot(1,2,2, projection='3d')
ax.view_init(50,-20)
ax.bar3d(xarr, yarr,zarr,0.85,0.85,dzarr, color='c', zsort='average')
ax.set_ylabel("Z2-->",fontsize=16)
ax.set_xlabel("Z1-->",fontsize=16)


Out[421]:
<matplotlib.text.Text at 0x115f1b990>

In [422]:
plt.rcParams["xtick.labelsize"]=10  
plt.rcParams["ytick.labelsize"]=10  
plt.figure(figsize=(20,20))
xarr,yarr,zarr,dzarr=get_scatterplot_summed(refactor_summed,8)
plt.scatter(xarr[dzarr>0],yarr[dzarr>0],c=dzarr[dzarr>0],s=100)
plt.xlim([0,93])
plt.ylim([0,93])
plt.gca().set_xticks(np.arange(0,93,1))
plt.gca().set_yticks(np.arange(0,93,1))
plt.clim(1,12)
plt.colorbar(shrink=0.7)


Out[422]:
<matplotlib.colorbar.Colorbar at 0x115eb2610>

Refactoring the Periodic Table

Instead of using just the atomic number as the index, let us regroup the rows and columns into representative blocks such as Group1,Group2 , Group3, electronegative P block, D-block and F-block elements and then relabel the x and y axis by the chemical symbol of the elements


In [389]:
Group1=[1,3,4,11,12,19,20,37,38,55,56,87,88]
Group2=[5,13,31,49,81]
Group3=[6,14,32,50,82]
Group4=[i+j+1 for i in Group3 for j in range(3)]
Nobel_gases=[2]+[i+4 for i in Group3]
D_block=[21,22,23,24,25,26,27,28,29,30,39,40,41,42,43,44,45,46,47,48,72,73,74,75,76,77,78,79,80]
F_block=[i for i in range(57,72)]
F_block.extend([j for j in range(89,93)])

In [390]:
refactored_Zlist=Group1+Group2+Group3+Group4+D_block+F_block+Nobel_gases

In [392]:
El_list=[]
with open("../data/temp.dat",'r') as f:
    for el in f.readlines():
        El_list.append(el.strip(" ").strip('\n').strip('"'))

In [394]:
from pymatgen import Element
El_refactored_list=[E for Z in refactored_Zlist for E in El_list  if Element(E).Z==Z]

In [423]:
def get_scatterplot_summed_refactored(ref_summed,Z_tag,Zlist):
    count=0
    xarr=np.zeros(8464)
    yarr=np.zeros(8464)
    dzarr=np.zeros(8464)
    clust_arr=np.zeros((8464,4))
    for (a,b),d in np.ndenumerate(ref_summed[Z_tag-1]):
        if d>0:
            xarr[count]=Zlist.index(a+1)
            yarr[count]=Zlist.index(b+1)
            dzarr[count]=d
            count=count+1
    xarr=xarr[0:count]
    yarr=yarr[0:count]
    dzarr=dzarr[0:count]
    return (xarr,yarr,dzarr)

In [424]:
plt.rcParams["xtick.labelsize"]=10 
plt.rcParams["ytick.labelsize"]=10  
plt.figure(figsize=(20,20))
xarr,yarr,dzarr=get_scatterplot_summed_refactored(refactor_summed,8,refactored_Zlist)
 
plt.xlim([-1,len(El_refactored_list)])
plt.ylim([-1,len(El_refactored_list)])
plt.xticks(np.arange(0,len(El_refactored_list)),El_refactored_list)
plt.yticks(np.arange(0,len(El_refactored_list)),El_refactored_list)
plt.clim(1,12)
plt.grid()



In [428]:
def get_scatterplot_refactored_clust_list(ref_array,Z_tag,list_clusts,Zlist):
    count=0
    xarr=np.zeros(8464)
    yarr=np.zeros(8464)
    dzarr=np.zeros(8464)
    clust_arr=np.zeros((8464,4))
    #label_arr=np.zeros(8464)
    Z=Z_tag-1
    refactor_block1=(ref_array[Z])[:,:,list_clusts]
    ref_summed=np.sum(refactor_block1,axis=2)
    for (a,b),d in np.ndenumerate(ref_summed):
        if d>0:
            xarr[count]=Zlist.index(a+1)
            yarr[count]=Zlist.index(b+1)
            dzarr[count]=d
            count=count+1
    xarr=xarr[0:count]
    yarr=yarr[0:count]
    dzarr=dzarr[0:count]
    return (xarr,yarr,zarr,dzarr)

In [435]:
plt.rcParams["xtick.labelsize"]=9 
plt.rcParams["ytick.labelsize"]=10  
plt.figure(figsize=(20,20))
xarr,yarr,zarr,dzarr=get_scatterplot_refactored_clust_list(refactor_array,8,[13],refactored_Zlist)
plt.scatter(xarr,yarr,c=dzarr,s=100)
plt.xlim([-1,len(El_refactored_list)])
plt.ylim([-1,len(El_refactored_list)])
plt.xticks(np.arange(0,len(El_refactored_list)),El_refactored_list)
plt.yticks(np.arange(0,len(El_refactored_list)),El_refactored_list)
plt.grid()
plt.clim(0,16)


Lets create subplots to show where the oxides in each of the clusters lies after this refactoring.


In [434]:
plt.rcParams["xtick.labelsize"]=9 
plt.rcParams["ytick.labelsize"]=10  
plt.figure(figsize=(20,20))

for i in range(15):
    plt.subplot(4,4,i+1)
    xarr,yarr,zarr,dzarr=get_scatterplot_refactored_clust_list(refactor_array,8,[i],refactored_Zlist)
    plt.scatter(xarr,yarr,c=dzarr,s=20)
    plt.xlim([-1,len(El_refactored_list)])
    plt.ylim([-1,len(El_refactored_list)])
    plt.xticks([])
    plt.yticks([])
plt.clim(0,16)



In [ ]: