k-近邻算法

优点：精度高、对异常值不敏感、无数据输入假定
缺点：计算复杂度高、空间复杂度高、无法给出数据的内在含义
适用数据范围：数值型和标称型

k-近邻算法更适用于数据集很大很全的情况？

算法思想及过程

对要分类的样本，在已有样本中寻找最近邻的 K 个样本，以这 K 个样本的分类标签中出现次数最多的标签作为待分类样本的分类标签。

寻找最近邻时，通过距离(欧几里得距离)来计算。

计算距离时注意归一化



In [1]:

    
from numpy import *
import operator
def createDataSet():
	group = array([[1.0, 1.1],[1.0, 1.0],[0, 0],[0, 0.1]])
	labels = ['A', 'A', 'B', 'B']
	return group, labels

先了解程序清单 2-1 要用到的几个函数

tile 函数将数组横向及纵向复制得到新的数组



In [2]:

    
from numpy import *
tile(1, 3)









    Out[2]:





array([1, 1, 1])



In [3]:

    
tile(2.5,(2,4))









    Out[3]:





array([[ 2.5,  2.5,  2.5,  2.5],
       [ 2.5,  2.5,  2.5,  2.5]])



In [4]:

    
tile([1,3],(2,3))









    Out[4]:





array([[1, 3, 1, 3, 1, 3],
       [1, 3, 1, 3, 1, 3]])



In [5]:

    
a=[1,3]
a=array(a)
tile(a,(2,3))









    Out[5]:





array([[1, 3, 1, 3, 1, 3],
       [1, 3, 1, 3, 1, 3]])

** 是幂运算



In [6]:

    
3**2









    Out[6]:





9



In [7]:

    
a=array([[1, 2], [3, 4]])
a**2









    Out[7]:





array([[ 1,  4],
       [ 9, 16]])



In [8]:

    
b=mat([[1, 2], [3, 4]])
b**2









    Out[8]:





matrix([[ 7, 10],
        [15, 22]])

可见幂运算对array来说是element-wise的，而对于matrix来说，就是矩阵的乘法做幂。

同样地，array的乘法是element-wise的，matrix的乘法就是线性代数的矩阵乘法

矩阵除法要用

linalg.solve(A, B)



In [9]:

    
linalg.solve(b**2,b)









    Out[9]:





matrix([[-2. ,  1. ],
        [ 1.5, -0.5]])

.I 求矩阵的逆

.T 求矩阵的转置



In [10]:

    
b.I









    Out[10]:





matrix([[-2. ,  1. ],
        [ 1.5, -0.5]])



In [11]:

    
b.T









    Out[11]:





matrix([[1, 3],
        [2, 4]])

sum 求和



In [12]:

    
a=array([[1, 2],[3, 4]])
a.sum()









    Out[12]:





10



In [13]:

    
a.sum(0)









    Out[13]:





array([4, 6])



In [14]:

    
a.sum(1)









    Out[14]:





array([3, 7])

sum(0) 按列求和

sum(1) 按行求和

min() max() 两个函数同样0列1行



In [15]:

    
a.min()









    Out[15]:





1



In [16]:

    
a.min(0)









    Out[16]:





array([1, 2])



In [17]:

    
a.min(1)









    Out[17]:





array([1, 3])

dict.get(x,0)

在字典中查找指定的键对应的值，若找不到则返回第二个参数的值



In [18]:

    
d={'a':1,'b':2,'c':3,'d':4}
d.get('b')









    Out[18]:





2



In [19]:

    
d.get('e')

该函数为

get(key,default=None)

第二个参数默认为None，则若不指定第二个参数，函数返回None



In [20]:

    
d.get('e',5)









    Out[20]:





5

Python 2.7 中

dict.iteritems() 返回迭代器
dict.items() 返回字典的复制

Python 3 中

dict.items() 返回迭代器
dict.iteritems() 该函数在 Python 3 中不存在了

我用的 Python 3，所以下面的代码中，我用的是 dict.items()

operator.itemgetter 函数可以获取一个对象指定序号的数据
operator.itemgetter 获取的不是值，而是一个函数，通过该函数作用到对象上才能获取值。
一般该函数用在 sorted 函数中。
需要 import operator 模块



In [21]:

    
a={'a':3,'b':2,'c':5,'d':1}
import operator
sorted(a.items(),key=operator.itemgetter(1),reverse=False)









    Out[21]:





[('d', 1), ('b', 2), ('a', 3), ('c', 5)]

以上方法可以对字典按值排序
排序从小到大，reverse=True则是从大到小



In [22]:

    
sorted(a.items(),key=operator.itemgetter(0),reverse=False)









    Out[22]:





[('a', 3), ('b', 2), ('c', 5), ('d', 1)]

以上方法按键排序

程序清单 2-1



In [23]:

    
def classify0(inX, dataSet, labels, k):
	dataSetSize = dataSet.shape[0]
	diffMat = tile(inX, (dataSetSize, 1)) - dataSet
	sqDiffMat = diffMat ** 2
	sqDistances = sqDiffMat.sum(axis=1)
	distances = sqDistances ** 0.5
	sortedDistIndicies = distances.argsort()
	classCount = {}
	for i in range(k):
		voteIlabel = labels[sortedDistIndicies[i]]
		classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
	sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
	return sortedClassCount[0][0]

程序清单 2-2



In [24]:

    
def file2matrix(filename):
	fr = open(filename)
	arrayOfLines = fr.readlines()
	numberOfLines = len(arrayOfLines)
	returnMat = zeros((numberOfLines, 3))
	classLabelVector = []
	index = 0
	for line in arrayOfLines:
		line = line.strip()
		listFromLine = line.split('\t')
		returnMat[index, :] = listFromLine[0:3]
		classLabelVector.append(int(listFromLine[-1]))
		index += 1
	return returnMat, classLabelVector



In [25]:

    
datingDataMat, datingLabels = file2matrix('Ch02/datingTestSet2.txt')



In [26]:

    
datingDataMat









    Out[26]:





array([[  4.09200000e+04,   8.32697600e+00,   9.53952000e-01],
       [  1.44880000e+04,   7.15346900e+00,   1.67390400e+00],
       [  2.60520000e+04,   1.44187100e+00,   8.05124000e-01],
       ..., 
       [  2.65750000e+04,   1.06501020e+01,   8.66627000e-01],
       [  4.81110000e+04,   9.13452800e+00,   7.28045000e-01],
       [  4.37570000e+04,   7.88260100e+00,   1.33244600e+00]])



In [27]:

    
datingLabels[0:20]









    Out[27]:





[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]



In [28]:

    
%matplotlib inline



In [29]:

    
import matplotlib 
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 2], 10.0*array(datingLabels), 255.0*array(datingLabels))
plt.show()



In [30]:

    
def autoNorm(dataset):
    minVals = dataset.min(0)
    maxVals = dataset.max(0)
    ranges = maxVals - minVals
    m = dataset.shape[0]
    normDataset = dataset - tile(minVals, (m, 1))
    normDataset = normDataset / (tile(ranges, (m, 1)))
    return normDataset, ranges, minVals



In [31]:

    
normMat, ranges, minVals = autoNorm(datingDataMat)



In [32]:

    
normMat









    Out[32]:





array([[ 0.44832535,  0.39805139,  0.56233353],
       [ 0.15873259,  0.34195467,  0.98724416],
       [ 0.28542943,  0.06892523,  0.47449629],
       ..., 
       [ 0.29115949,  0.50910294,  0.51079493],
       [ 0.52711097,  0.43665451,  0.4290048 ],
       [ 0.47940793,  0.3768091 ,  0.78571804]])



In [33]:

    
ranges









    Out[33]:





array([  9.12730000e+04,   2.09193490e+01,   1.69436100e+00])



In [34]:

    
minVals









    Out[34]:





array([ 0.      ,  0.      ,  0.001156])



In [35]:

    
def datingClassTest():
    hoRatio = 0.1
    datingDataMat, datingLabels = file2matrix('Ch02/datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with %d,the real answer is %d" %(classifierResult, datingLabels[i])) 
        #print(classifierResult)
        if classifierResult != datingLabels[i]:
            errorCount+=1.0
    print("the total error rate is %f" %(errorCount / float(numTestVecs)))



In [36]:

    
datingClassTest()









    



the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 3,the real answer is 3
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 3,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 1
the total error rate is 0.050000



In [37]:

    
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned consumed per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('Ch02/datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("you will probably like this person " + str(resultList[classifierResult - 1]))



In [38]:

    
classifyPerson()









    



percentage of time spent playing video games?10
frequent flier miles earned consumed per year?10000
liters of ice cream consumed per year?0.5
you will probably like this person in small doses



In [39]:

    
def img2vector(filename):
    returnVect=zeros((1, 1024))
    fr=open(filename)
    for i in range(32):
        linestr=fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(linestr[j])
    return returnVect



In [40]:

    
testVector = img2vector('Ch02/digits/trainingDigits/0_13.txt')



In [41]:

    
testVector[0, 0:31]









    Out[41]:





array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.])



In [42]:

    
from os import listdir
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('Ch02/digits/trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('Ch02/digits/trainingDigits/%s' %(fileNameStr))
    testFileList = listdir("Ch02/digits/testDigits/")
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('Ch02/digits/testDigits/%s' %(fileNameStr))
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        #print("the classifier came back with %d, the real answer is %d " %(classifierResult, classNumStr))
        if classifierResult != classNumStr:
            print("the classifier came back with %d, the real answer is %d " %(classifierResult, classNumStr))
            errorCount += 1.0
    print("the total number of errors is %d" %(errorCount))
    print("the total error rate is %f" %(errorCount / float(mTest)))



In [43]:

    
handwritingClassTest()









    



the classifier came back with 3, the real answer is 8 
the classifier came back with 1, the real answer is 8 
the classifier came back with 7, the real answer is 9 
the classifier came back with 6, the real answer is 5 
the classifier came back with 9, the real answer is 3 
the classifier came back with 1, the real answer is 8 
the classifier came back with 6, the real answer is 8 
the classifier came back with 3, the real answer is 5 
the classifier came back with 1, the real answer is 9 
the classifier came back with 7, the real answer is 1 
the total number of errors is 10
the total error rate is 0.010571



In [ ]: