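k-means is an unsupervised clustering algorithm that searches for cluster centers. It is iterative and non-deterministic: the result depends on the random initialization of the centers.
k-means takes the number of clusters k as an input parameter. There is no generally reliable way to choose k, so in practice the algorithm is run for several values of k and the results are compared.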
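The iterative procedure described above can be sketched in plain NumPy. This is a minimal illustration of the usual assign-then-update loop, not the implementation scikit-learn uses; the function name `lloyd_kmeans` and its parameters `n_iter` and `seed` are assumptions made for this sketch.

```python
import numpy as np

def lloyd_kmeans(x, k, n_iter=100, seed=0):
    """Plain assign/update iteration (illustrative helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct samples as the starting centers
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each sample goes to its nearest center
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its samples
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Changing `seed` changes the starting centers, which is why two runs on the same data can end in different clusterings.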
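Evaluating the quality of the result:
Preparation: each cluster produced by k-means can be evaluated with the metric below.
We use the silhouette coefficient to evaluate the k-means result. Its value lies between -1 and 1; values close to 1 indicate compact, well-separated clusters.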
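To make the metric concrete, here is a minimal sketch that computes the mean silhouette coefficient directly from its definition with Euclidean distance. For each sample, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the sample's score is (b - a) / max(a, b). The helper name `silhouette_by_hand` and the four toy points are assumptions made for illustration.

```python
import numpy as np

def silhouette_by_hand(x, labels):
    """Mean silhouette coefficient from its definition (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    scores = []
    for i in range(len(x)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            # Convention: a sample alone in its cluster scores 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the sample's own cluster
        # (the distance to itself is 0, so divide by n_same - 1)
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy data: two tight pairs of points that are far apart
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
good = silhouette_by_hand(points, [0, 0, 1, 1])  # pairs kept together
bad = silhouette_by_hand(points, [0, 1, 0, 1])   # pairs split apart
print(good, bad)
```

On the same inputs this should agree with `silhouette_score` from `sklearn.metrics`, which uses Euclidean distance by default.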
In [1]:
import numpy as np
import matplotlib.pyplot as plt
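def get_random_data():
    # Two Gaussian blobs of 100 samples each, with 100 features per sample
    x_1 = np.random.normal(loc=0.2, scale=0.2, size=(100, 100))
    x_2 = np.random.normal(loc=0.9, scale=0.1, size=(100, 100))
    # Stack the two blobs row-wise into a single (200, 100) array
    x = np.r_[x_1, x_2]
    return x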
In [2]:
x = get_random_data()
print(x)
In [5]:
print(np.shape(x))
In [6]:
# plot
plt.cla()
plt.figure(1)
plt.title("Generated Data")
plt.scatter(x[:,0],x[:,1])
plt.show()
In [8]:
# Define a function that runs k-means so we can cluster the given data
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
In [11]:
def form_clusters(x, k):
    '''
    Cluster x into k groups with k-means and return the silhouette score.
    '''
    # k is the number of clusters to form
    no_clusters = k
    model = KMeans(n_clusters=no_clusters, init='random')
    model.fit(x)
    labels = model.labels_
    print(labels)
    # Compute the silhouette coefficient
    sh_score = silhouette_score(x, labels)
    return sh_score
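Because `init='random'` makes each run start from different centers, scores can vary from run to run. A minimal sketch of one way to get repeatable results, assuming you are willing to fix a seed: pass `random_state` to `KMeans`, and set `n_init` so the best of several restarts is kept. The helper name `fit_labels`, the seed value, and the small demo array are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_labels(x, k, seed):
    """Fit k-means with a fixed seed (illustrative helper, not part of the notebook)."""
    # n_init=10 restarts from 10 random initializations and keeps the best run
    model = KMeans(n_clusters=k, init='random', n_init=10, random_state=seed)
    return model.fit(x).labels_

# Small demo data: two Gaussian blobs, 50 samples each, 2 features
rng = np.random.default_rng(0)
demo = np.r_[rng.normal(0.2, 0.2, size=(50, 2)),
             rng.normal(0.9, 0.1, size=(50, 2))]

first = fit_labels(demo, 2, seed=42)
second = fit_labels(demo, 2, seed=42)
# With the same seed, the two runs should produce identical labels
print(np.array_equal(first, second))
```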
In [12]:
sh_scores = []
# Try k = 2, 3, 4, 5 (the silhouette coefficient needs at least 2 clusters)
for i in range(1, 5):
    sh_score = form_clusters(x, i + 1)
    sh_scores.append(sh_score)
no_clusters = [i + 1 for i in range(1, 5)]
In [13]:
# plot
plt.figure(2)
plt.plot(no_clusters,sh_scores)
plt.title('Cluster Quality')
plt.xlabel('No of clusters k')
plt.ylabel('Silhouette coefficient')
plt.show()
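Instead of reading the best k off the plot, the highest-scoring value can also be picked programmatically. This is a minimal sketch; `pick_best_k` is a hypothetical helper, and the scores below are placeholders for illustration only, not output from the run above.

```python
import numpy as np

def pick_best_k(ks, scores):
    """Return the k with the highest silhouette score (hypothetical helper)."""
    ks = list(ks)
    return ks[int(np.argmax(scores))]

# Placeholder values for illustration only, not results from this notebook
example_ks = [2, 3, 4, 5]
example_scores = [0.7, 0.4, 0.3, 0.2]
print(pick_best_k(example_ks, example_scores))
```

In the notebook, the call would be `pick_best_k(no_clusters, sh_scores)`.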