We load the Dataframe using code found on Stack Overflow (http://stackoverflow.com/questions/8955448/save-load-scipy-sparse-csr-matrix-in-portable-data-format) and apply clustering algorithms to it.
In [25]:
import scipy.sparse
import numpy as np
import sklearn as skl
import pylab as plt
%matplotlib inline
In [2]:
def load_sparse_csr(filename):
    loader = np.load(filename)
    return scipy.sparse.csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                                   shape=loader['shape'])
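The Stack Overflow answer also provides the matching save routine. As a sketch of the full round trip (the filename demo.npz is just an example):

```python
import numpy as np
import scipy.sparse

def save_sparse_csr(filename, mat):
    # Store the three CSR arrays plus the shape in one .npz archive.
    np.savez(filename, data=mat.data, indices=mat.indices,
             indptr=mat.indptr, shape=mat.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return scipy.sparse.csr_matrix(
        (loader['data'], loader['indices'], loader['indptr']),
        shape=loader['shape'])

# Round-trip a small sparse matrix to check the pair works.
m = scipy.sparse.csr_matrix(np.array([[0, 1], [2, 0]]))
save_sparse_csr('demo.npz', m)
m2 = load_sparse_csr('demo.npz')
print((m != m2).nnz)  # 0 differing entries
```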
In [4]:
Dataframe=load_sparse_csr("Dataframe.npz")
In [5]:
from sklearn.cross_validation import train_test_split
In [6]:
train_Dat,test_Dat=train_test_split(Dataframe,test_size=0.2,random_state=42)
In [8]:
from sklearn.cluster import KMeans
In [10]:
clust=KMeans(n_clusters=10)
In [11]:
Dat=train_Dat.toarray()
In [12]:
clusters=clust.fit_predict(Dat)
In [13]:
print clusters[0:10]
In [18]:
cluster_freq=np.zeros(10,dtype=float)
for i in clusters:
    cluster_freq[i]+=1
print map(int,cluster_freq)
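The counting loop above can also be written with np.bincount, which tallies label occurrences in a single call; a small sketch with a made-up label vector:

```python
import numpy as np

clusters = np.array([0, 2, 2, 1, 0, 2])  # toy label vector
# minlength pads clusters that received no points with zero counts
cluster_freq = np.bincount(clusters, minlength=10)
print(list(cluster_freq))  # [2, 1, 3, 0, 0, 0, 0, 0, 0, 0]
```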
In [36]:
plt.figure(figsize=(8,8))
for i in range(10):
    plt.plot(np.arange(len(Dat[0])),clust.cluster_centers_[i],label="cluster"+str(i))
plt.xlabel("Feature Number")
plt.ylabel("value of feature for cluster center")
plt.legend()
Out[36]:
We can see that only 6 of the clusters have significant occupancies, so we are probably better off with a different number of clusters. Also, the cluster centers of the 4 clusters with very low occupancies have completely nonsensical coordinates, which can be seen from the diverging spikes in the plot above.
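One way to put the choice of the number of clusters on firmer footing is the silhouette score. The snippet below is a minimal sketch on synthetic blob data (not the actual Dataframe matrix), using sklearn.metrics.silhouette_score to pick the k that scores highest:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs stand in for the real data here.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0,
                  random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the true number of blobs, 4, should score highest
```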
In [44]:
Dat=Dataframe.toarray()
In [51]:
print "Number of clusters Number of points in each cluster Inertia"
for nclt in range(2,20):
    clust2=KMeans(n_clusters=nclt)
    clusters2=clust2.fit_predict(Dat)
    cluster_freq=np.zeros(nclt,dtype=float)
    for i in clusters2:
        cluster_freq[i]+=1
    print nclt,map(int,cluster_freq),clust2.inertia_
Let's see what is going on with 10 clusters.
In [91]:
nclt=10
clust2=KMeans(n_clusters=nclt,n_init=50,random_state=42)
clusters2=clust2.fit_predict(Dat)
cluster_freq=np.zeros(nclt,dtype=float)
for i in clusters2:
    cluster_freq[i]+=1
print nclt,map(int,cluster_freq),clust2.inertia_
Some of the clusters always end up almost empty; essentially, the KMeans algorithm is failing. Still, let us look at the cluster centers of the well-populated clusters.
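One knob worth checking when clusters come out nearly empty is the seeding. The sketch below (on synthetic make_blobs data, not the actual Dataframe) compares a single run of 'k-means++' seeding against 'random' seeding, reporting the inertia and the smallest cluster size for each:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with 10 blobs.
X, _ = make_blobs(n_samples=500, centers=10, random_state=42)

for init in ('k-means++', 'random'):
    km = KMeans(n_clusters=10, init=init, n_init=1, random_state=0).fit(X)
    sizes = np.bincount(km.labels_, minlength=10)
    print(init, km.inertia_, int(sizes.min()))
```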
Let's see what the spaceGroup numbers of the cluster center points are.
In [94]:
print clust2.cluster_centers_[:,105]
Let's plot the cluster centers in stoichiometric space. We don't plot the clusters with low occupancy, as those have garbage values.
In [95]:
plt.figure(figsize=(8,8))
num_x=104
for i in range(10):
    if i not in [2,4,5,8]:
        plt.plot(np.arange(num_x),clust2.cluster_centers_[i][0:num_x],label="cluster"+str(i))
plt.xlabel("Feature Number")
plt.ylabel("value of feature for cluster center")
plt.legend()
Out[95]:
In [88]:
nclt=10
clust3=KMeans(n_clusters=nclt,n_init=50,random_state=42)
X_new=clust3.fit_transform(Dat)
In [89]:
print X_new[0]
In [90]:
#min_dist=zeros(len(X_new))
plt.figure(figsize=(10,10))
min_dist=np.amin(X_new,axis=1)
plt.plot(np.arange(len(X_new)),min_dist)
Out[90]:
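fit_transform returns, for every point, its distance to each of the cluster centers, so the row-wise minimum plotted above measures how far each point sits from its nearest center. A self-contained sketch on random stand-in data (not the actual Dataframe):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
X = rng.normal(size=(100, 5))  # stand-in data

km = KMeans(n_clusters=3, n_init=10, random_state=42)
dists = km.fit_transform(X)      # shape (100, 3): distance to every center
min_dist = dists.min(axis=1)     # distance to the nearest center
labels_from_dist = dists.argmin(axis=1)  # nearest-center index per point

print(dists.shape, min_dist.shape)
```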
In [108]:
nclt=50
clust4=KMeans(n_clusters=nclt,n_init=10,init='random',random_state=42)
clusters4=clust4.fit_predict(Dat)
cluster_freq=np.zeros(nclt,dtype=float)
for i in clusters4:
    cluster_freq[i]+=1
print nclt,map(int,cluster_freq),clust4.inertia_
In [113]:
print(clust4.cluster_centers_[:,105])