Creating DataFrame for Clustering

Here we build a basic feature matrix that can be passed to the scikit-learn clustering algorithms. Scikit-learn expects all features to be stored in a 2-D array of shape (nsamples, nfeatures), which can be a NumPy array, a SciPy sparse matrix, or a pandas DataFrame. The aim is to mirror the structure of the sample datasets in scikit-learn. As this is unsupervised learning, there is no target array, only a feature matrix, which we shall call 'Dataframe'.


In [1]:
import scipy.sparse
import numpy as np
import sklearn as skl
import matplotlib.pyplot as plt



We read the data from the Unique_ICSD.dat file, which should always contain the raw data left after whatever filtering we apply to the master icsd-ternaries.csv file. Dataframe will have dimension (nsamples $\times$ nfeatures), where:

nsamples: Number of unique ternary compounds

nfeatures:  columns 0-104: Number of atoms of each element, with the column index given by the dictionary dict_element.
            column  105: Space group number

Dataframe will be a SciPy CSR sparse matrix, since this feature space is very sparse: each ternary compound fills only three element columns plus the space group column.
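
As a rough illustration of this layout, the sketch below builds a toy matrix with the same column convention (105 element columns followed by one space group column). The name `toy_rows` and the entries in it are placeholders chosen for illustration; they are not read from Unique_ICSD.dat.

```python
import numpy as np
import scipy.sparse

N_ELEMENT_COLUMNS = 105   # columns 0-104: element counts
SPACE_GROUP_COLUMN = 105  # last column: space group number

# Hypothetical rows: ({element column: count}, space group number).
toy_rows = [
    ({1: 8.0, 35: 1.0, 90: 6.0}, 216),
    ({62: 4.0, 66: 1.0, 97: 1.0}, 14),
]

dense = np.zeros((len(toy_rows), N_ELEMENT_COLUMNS + 1), dtype=float)
for i, (counts, space_group) in enumerate(toy_rows):
    for column, amount in counts.items():
        dense[i, column] = amount
    dense[i, SPACE_GROUP_COLUMN] = space_group

toy_matrix = scipy.sparse.csr_matrix(dense)
print(toy_matrix.shape)  # (nsamples, nfeatures)
print(toy_matrix.nnz)    # number of stored non-zero entries
```

Only the non-zero entries are stored, so memory use grows with the number of compounds rather than with the full 106-column width.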

Making an array of space group numbers


In [3]:
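import csv

# Each row of the tab-separated file is one compound: row[1] holds the
# space group symbol and row[2] the chemical formula.
with open('ICSD/Unique_ICSD.dat', 'r') as f:
    data_1 = csv.reader(f, "excel-tab")
    list_data1 = [[element.strip() for element in row] for row in data_1]

# Remove the spaces inside each space group symbol.
for row1 in list_data1:
    row1[1] = row1[1].replace(' ', '')

# Strip the trailing letters Z, S, H and R so the symbols match the lookup tables.
list_space = [row1[1].rstrip('Z').rstrip('S').rstrip("H").rstrip('R') for row1 in list_data1]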
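
To make the effect of the chained `rstrip` calls concrete, here is a small sketch applying the same steps to a few symbols. The helper name `normalise_symbol` and the example strings are assumptions written in the style of ICSD symbols; they are not taken from the data file.

```python
def normalise_symbol(raw_symbol):
    """Remove spaces, then strip trailing Z, S, H and R, in that order."""
    symbol = raw_symbol.replace(' ', '')
    return symbol.rstrip('Z').rstrip('S').rstrip('H').rstrip('R')

# Hypothetical examples of raw symbols.
examples = ['F d -3 m Z', 'F d -3 m S', 'R -3 m H', 'R -3 m R', 'P n m a']
for raw in examples:
    print(repr(raw), '->', repr(normalise_symbol(raw)))
```

Because the four calls run in a fixed order, a symbol ending in 'SH' would lose only the 'H': by the time the 'H' is removed, the `rstrip('S')` step has already run. This is worth keeping in mind if new suffix combinations ever appear in the data.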

In [4]:
# spacegroups.dat: tab-separated file with alternating (number, symbol) fields.
with open("ICSD/spacegroups.dat", 'r') as f:
    dat = csv.reader(f, dialect='excel-tab', quoting=csv.QUOTE_NONE)
    list_dat = [element.strip() for row in dat for element in row]
    list1 = [[int(list_dat[i*2]), list_dat[i*2+1]] for i in range(int(len(list_dat)/2))]

# Map space group symbol -> space group number.
dict_space = {}
for number, symbol in list1:
    dict_space[symbol] = number

# spacegroups_2.dat: whitespace-separated "number symbol"; add only symbols not seen yet.
with open('ICSD/spacegroups_2.dat', 'r') as f1:
    for line in f1:
        data2 = [element.strip() for element in line.split()]
        if data2[1] not in dict_space:
            dict_space[data2[1]] = int(data2[0])

# spacegroups_3.dat: whitespace-separated "symbol number" (columns in the opposite order).
with open('ICSD/spacegroups_3.dat', 'r') as f1:
    for line in f1:
        data3 = [element.strip() for element in line.split()]
        if data3[0] not in dict_space:
            dict_space[data3[0]] = int(data3[1])

In [5]:
# Look up the space group number of every compound (raises KeyError for an unknown symbol).
space_num_array = np.zeros(len(list_space), dtype=float)
for i, s in enumerate(list_space):
    space_num_array[i] = dict_space[s]
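
The dictionary lookup in this cell stops with a `KeyError` at the first symbol that is missing from all three tables. A minimal sketch of a check that lists every unmatched symbol up front is given below. The helper `find_missing_symbols` and the toy inputs are hypothetical stand-ins for `list_space` and `dict_space`.

```python
def find_missing_symbols(symbols, symbol_to_number):
    """Return the sorted unique symbols that have no entry in the lookup table."""
    return sorted(set(s for s in symbols if s not in symbol_to_number))

# Toy stand-ins for dict_space and list_space.
toy_dict_space = {'Fd-3m': 227, 'Pnma': 62}
toy_list_space = ['Fd-3m', 'Pnma', 'P21/c', 'Fd-3m', 'P21/c']

missing = find_missing_symbols(toy_list_space, toy_dict_space)
print(missing)
```

An empty result means every compound can be assigned a space group number. Otherwise the listed symbols point to entries that need to be added to one of the lookup files.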

Making an array storing element occurrences


In [9]:
from pymatgen.matproj.rest import MPRester
from pymatgen.core import Element, Composition

In [10]:
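# Map each element symbol to a column index, in the order pymatgen enumerates Element.
element_universe = [str(e) for e in Element]
dict_element = {}
for i, j in enumerate(element_universe):
    dict_element[str(j)] = i
# Deuterium and tritium get their own columns. The indices 103 and 104 assume
# that Element enumerates 103 symbols (columns 0-102).
dict_element['D'] = 103
dict_element['T'] = 104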

In [11]:
# One row per compound, one column per element symbol; entry[2] is the chemical formula.
stoich_array = np.zeros((len(list_data1), len(dict_element)), dtype=float)
for index, entry in enumerate(list_data1):
    comp = Composition(entry[2])
    temp_dict = dict(comp.get_el_amt_dict())
    for key in temp_dict.keys():
        stoich_array[index, dict_element[key]] = temp_dict[key]
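
If the input file really contains only ternary compounds, one would expect each row of the element array to have exactly three non-zero columns. The sketch below shows one way such a check might look. The function name and the toy array are assumptions for illustration, and the check runs on made-up data rather than the real `stoich_array`.

```python
import numpy as np

def rows_with_unexpected_element_count(stoich, expected=3):
    """Return indices of rows whose number of non-zero columns differs from `expected`."""
    nonzero_per_row = np.count_nonzero(stoich, axis=1)
    return np.flatnonzero(nonzero_per_row != expected)

# Toy array with 105 element columns.
toy_stoich = np.zeros((3, 105), dtype=float)
toy_stoich[0, [1, 35, 90]] = [8.0, 1.0, 6.0]
toy_stoich[1, [62, 66, 97]] = [4.0, 1.0, 1.0]
toy_stoich[2, [53, 66]] = [3.0, 0.5]  # deliberately only two elements

print(rows_with_unexpected_element_count(toy_stoich))
```

Rows flagged by such a check could indicate a formula that was parsed differently than intended, or a compound that slipped through the filtering step.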

Combining these two features to form Dataframe


In [17]:
Dataframe=scipy.sparse.csr_matrix(np.hstack((stoich_array,space_num_array[:,np.newaxis])))

In [19]:
print(Dataframe[0:3])


  (0, 1)	8.0
  (0, 35)	1.0
  (0, 90)	6.0
  (0, 105)	216.0
  (1, 62)	4.0
  (1, 66)	1.0
  (1, 97)	1.0
  (1, 105)	14.0
  (2, 53)	3.0
  (2, 66)	0.5
  (2, 82)	4.0
  (2, 105)	148.0
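
Each printed line has the form (row, column) value, so reading it requires mapping the column index back to an element or to the space group. A small sketch of that decoding follows, using a toy four-column matrix. The helper `describe_row`, the labels, and the values are hypothetical.

```python
import numpy as np
import scipy.sparse

def describe_row(matrix, row_index, column_labels):
    """Return {label: value} for the non-zero entries of one row of a CSR matrix."""
    row = matrix.getrow(row_index)
    return {column_labels[j]: v for j, v in zip(row.indices, row.data)}

# Toy matrix: three element columns plus one space group column.
toy_labels = ['Li', 'O', 'Fe', 'space_group']
toy = scipy.sparse.csr_matrix(np.array([[1.0, 2.0, 1.0, 166.0],
                                        [0.0, 4.0, 3.0, 227.0]]))
print(describe_row(toy, 1, toy_labels))
```

For the real matrix, a label list could presumably be built by inverting `dict_element` and appending a name for column 105.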

Saving Dataframe using numpy savez


In [20]:
def save_sparse_csr(filename,array):
    np.savez(filename,data = array.data ,indices=array.indices,
             indptr =array.indptr, shape=array.shape )

In [24]:
save_sparse_csr("Dataframe",Dataframe)
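
The notebook only saves the matrix. A matching loader could look like the sketch below, which rebuilds the CSR matrix from the four stored arrays. The function `load_sparse_csr` is not part of the original notebook, and the round trip is shown on a toy matrix written to a temporary directory.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    """Rebuild a CSR matrix from the arrays written by save_sparse_csr."""
    with np.load(filename) as loader:
        return scipy.sparse.csr_matrix(
            (loader['data'], loader['indices'], loader['indptr']),
            shape=tuple(loader['shape']))

toy = scipy.sparse.csr_matrix(np.array([[0.0, 2.0, 0.0], [1.0, 0.0, 3.0]]))
path = os.path.join(tempfile.mkdtemp(), 'toy_matrix')
save_sparse_csr(path, toy)                 # np.savez appends '.npz' to the name
restored = load_sparse_csr(path + '.npz')
print(np.array_equal(restored.toarray(), toy.toarray()))
```

Note that `np.savez` adds the `.npz` extension, so the call above with "Dataframe" produces a file named Dataframe.npz. Recent SciPy versions also provide `scipy.sparse.save_npz` and `scipy.sparse.load_npz`, which may be a simpler option where available.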
