
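  • Pros: high accuracy, insensitive to outliers, no assumptions about the input data
  • Cons: computationally expensive, memory-intensive, gives no insight into the underlying structure of the data
  • Works with: numeric and nominal values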



To classify a sample, find its K nearest neighbors among the existing samples, then assign it the label that occurs most often among those K neighbors.
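The rule above can be sketched in a few lines of plain Python. This is only an illustration, not the listing used later in these notes: the name `knn_predict` and the toy points are assumptions, and `math.dist` requires Python 3.8 or newer.

```python
import math
from collections import Counter

def knn_predict(query, samples, labels, k):
    """Return the majority label among the k samples closest to query."""
    # Euclidean distance from the query to every known sample
    distances = [math.dist(query, s) for s in samples]
    # Indices of the k smallest distances
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (assumed for illustration): two points per class
samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ['A', 'A', 'B', 'B']
prediction = knn_predict((0.0, 0.0), samples, labels, k=3)
print(prediction)
```

With k=3 the three nearest neighbors of (0, 0) carry the labels B, B and A, so the vote goes to B.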



In [1]:
from numpy import *
import operator
def createDataSet():
	group = array([[1.0, 1.1],[1.0, 1.0],[0, 0],[0, 0.1]])
	labels = ['A', 'A', 'B', 'B']
	return group, labels

First, a look at a few functions that Listing 2-1 relies on.

  • tile repeats an array horizontally and vertically to build a new array

In [2]:
from numpy import *
tile(1, 3)

array([1, 1, 1])

In [3]:

array([[ 2.5,  2.5,  2.5,  2.5],
       [ 2.5,  2.5,  2.5,  2.5]])

In [4]:

array([[1, 3, 1, 3, 1, 3],
       [1, 3, 1, 3, 1, 3]])

In [5]:

array([[1, 3, 1, 3, 1, 3],
       [1, 3, 1, 3, 1, 3]])
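As a small sketch of how the `reps` argument of `tile` works, and of the way Listing 2-1 uses it to line one point up against every row of a data set. The variable names and the sample values here are assumptions chosen for illustration.

```python
import numpy as np

row = np.array([1, 3])

# A scalar repeated 3 times gives a 1-D array
print(np.tile(1, 3))

# reps=(2, 3): 2 copies stacked vertically, 3 copies side by side
tiled = np.tile(row, (2, 3))
print(tiled.shape)
print(tiled)

# Repeat one point once per data row, then subtract row by row
data = np.array([[1.0, 1.1], [0.0, 0.0]])
point = np.array([0.0, 0.1])
diff = np.tile(point, (data.shape[0], 1)) - data
print(diff)
```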

  • ** is the power operator: element-wise on an array, matrix power on a matrix

In [6]:


In [7]:
a = array([[1, 2], [3, 4]])
a ** 2

array([[ 1,  4],
       [ 9, 16]])

In [8]:
b = mat([[1, 2], [3, 4]])
b ** 2

matrix([[ 7, 10],
        [15, 22]])




linalg.solve(A, B)

In [9]:

matrix([[-2. ,  1. ],
        [ 1.5, -0.5]])

.I gives the inverse of a matrix

.T gives the transpose of a matrix

In [10]:
b.I

matrix([[-2. ,  1. ],
        [ 1.5, -0.5]])

In [11]:
b.T

matrix([[1, 3],
        [2, 4]])
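The same operations can also be written with plain arrays instead of `mat`. The sketch below assumes a 2x2 system; the names `A` and `B` follow the `linalg.solve(A, B)` call mentioned above, while the values are made up for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([5.0, 6.0])

elementwise = A ** 2          # squares each entry of the array
matrix_square = A @ A         # matrix product, what ** 2 does on a matrix
inverse = np.linalg.inv(A)    # counterpart of .I
transpose = A.T               # .T works on arrays too

# Solve A x = B directly instead of multiplying by the inverse
x = np.linalg.solve(A, B)
print(x)
```

Calling `linalg.solve` is usually preferable to forming the inverse first when only the solution of the system is needed.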

  • sum adds up the elements of an array

In [12]:
a=array([[1, 2],[3, 4]])


In [13]:
a.sum(0)

array([4, 6])

In [14]:
a.sum(1)

array([3, 7])

sum(0) sums down each column

sum(1) sums across each row

min() and max() follow the same convention: 0 for columns, 1 for rows

In [15]:


In [16]:
a.min(0)

array([1, 2])

In [17]:
a.min(1)

array([1, 3])
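A short sketch of the axis convention using the explicit `axis=` keyword, which reads more clearly than a bare 0 or 1. It also shows the per-column range that the normalization code later in these notes is built on. The variable names are assumptions.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

col_sums = a.sum(axis=0)   # one value per column
row_sums = a.sum(axis=1)   # one value per row
col_max = a.max(axis=0)
row_max = a.max(axis=1)

# Per-column range, the quantity min-max normalization divides by
col_range = a.max(axis=0) - a.min(axis=0)
print(col_sums, row_sums, col_max, row_max, col_range)
```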

  • dict.get(x, 0) returns the value stored under key x, or 0 when x is not in the dictionary
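This is the idiom the classifier uses to count votes: a missing key counts as 0, so no explicit membership check is needed. A minimal sketch, with made-up labels:

```python
votes = ['B', 'A', 'B']   # hypothetical labels of the nearest neighbors

class_count = {}
for label in votes:
    # get() falls back to 0 the first time a label is seen
    class_count[label] = class_count.get(label, 0) + 1

print(class_count)
```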


In [18]:


In [19]:




In [20]:


Python 2.7

  • dict.iteritems() returns an iterator
  • dict.items() returns a list holding a copy of the dictionary's (key, value) pairs

Python 3

  • dict.items() returns a view object that can be iterated over (no list is built)
  • dict.iteritems() no longer exists in Python 3

I am using Python 3, so the code below uses dict.items().

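operator.itemgetter fetches the item at a given index from an object.
What operator.itemgetter returns is not the value itself but a function; the value is obtained by calling that function on the object.
It is typically used as the key argument of sorted.
It requires import operator.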

In [21]:
import operator

[('d', 1), ('b', 2), ('a', 3), ('c', 5)]


In [22]:

[('a', 3), ('b', 2), ('c', 5), ('d', 1)]
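A sketch of `itemgetter` combined with `sorted`. The dictionary contents are an assumption reconstructed from the two outputs shown above, and the name `scores` is made up.

```python
import operator

scores = {'a': 3, 'b': 2, 'c': 5, 'd': 1}

# itemgetter(1) builds a function that returns element 1 of its argument
get_value = operator.itemgetter(1)
print(get_value(('a', 3)))

# Sort the (key, value) pairs by value, then by key
by_value = sorted(scores.items(), key=get_value)
by_key = sorted(scores.items(), key=operator.itemgetter(0))

# Descending by value: the first pair holds the most frequent key
top_key = sorted(scores.items(), key=get_value, reverse=True)[0][0]
print(by_value, by_key, top_key)
```

The last line is the pattern the classifier uses to pick the winning label from its vote counts.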


Listing 2-1

In [23]:
def classify0(inX, dataSet, labels, k):
	dataSetSize = dataSet.shape[0]
	diffMat = tile(inX, (dataSetSize, 1)) - dataSet
	sqDiffMat = diffMat ** 2
	sqDistances = sqDiffMat.sum(axis=1)
	distances = sqDistances ** 0.5
	sortedDistIndicies = distances.argsort()
	classCount = {}
	for i in range(k):
		voteIlabel = labels[sortedDistIndicies[i]]
		classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
	sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
	return sortedClassCount[0][0]
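For comparison, here is a sketch of the same classifier written with NumPy broadcasting and `collections.Counter` in place of `tile` and the hand-rolled vote dictionary. It is an alternative formulation, not the listing itself; the name `classify_broadcast` is an assumption, and the data reuses the four points from `createDataSet`.

```python
import numpy as np
from collections import Counter

def classify_broadcast(in_x, data_set, labels, k):
    # Broadcasting subtracts in_x from every row, so tile() is not needed
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    # Indices of the k closest rows
    nearest = distances.argsort()[:k]
    # Majority vote among their labels
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
result = classify_broadcast(np.array([0, 0]), group, labels, 3)
print(result)
```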

Listing 2-2

In [24]:
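def file2matrix(filename):
	# Read every line up front so the matrix can be sized before filling it
	with open(filename) as fr:
		arrayOfLines = fr.readlines()
	numberOfLines = len(arrayOfLines)
	returnMat = zeros((numberOfLines, 3))
	classLabelVector = []
	index = 0
	for line in arrayOfLines:
		line = line.strip()
		listFromLine = line.split('\t')
		# The first three columns are the features
		returnMat[index, :] = listFromLine[0:3]
		# The last column is the integer class label
		classLabelVector.append(int(listFromLine[-1]))
		index += 1
	return returnMat, classLabelVector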
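The same parsing can be sketched with `numpy.loadtxt`, assuming every row has three numeric features followed by an integer label, all tab-separated. The three rows below are copied from the first rows of the dating data shown in this notebook; reading them from an in-memory `StringIO` instead of the real file is an assumption made to keep the example self-contained.

```python
import io
import numpy as np

# Three rows in the file's layout: 3 features, then the class label
sample = (
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
    "26052\t1.441871\t0.805124\t1\n"
)

table = np.loadtxt(io.StringIO(sample), delimiter='\t')
features = table[:, :3]                      # feature matrix
labels = table[:, 3].astype(int).tolist()    # label list
print(features.shape, labels)
```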

In [25]:
datingDataMat, datingLabels = file2matrix('Ch02/datingTestSet2.txt')

In [26]:

array([[  4.09200000e+04,   8.32697600e+00,   9.53952000e-01],
       [  1.44880000e+04,   7.15346900e+00,   1.67390400e+00],
       [  2.60520000e+04,   1.44187100e+00,   8.05124000e-01],
       [  2.65750000e+04,   1.06501020e+01,   8.66627000e-01],
       [  4.81110000e+04,   9.13452800e+00,   7.28045000e-01],
       [  4.37570000e+04,   7.88260100e+00,   1.33244600e+00]])

In [27]:
datingLabels[0:20]

[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]

In [28]:
%matplotlib inline

In [29]:
import matplotlib 
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 2], 10.0*array(datingLabels), 255.0*array(datingLabels))

In [30]:
def autoNorm(dataset):
    minVals = dataset.min(0)
    maxVals = dataset.max(0)
    ranges = maxVals - minVals
    m = dataset.shape[0]
    normDataset = dataset - tile(minVals, (m, 1))
    normDataset = normDataset / (tile(ranges, (m, 1)))
    return normDataset, ranges, minVals
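`autoNorm` rescales each column to the range 0 to 1 with newValue = (oldValue - min) / (max - min). A sketch of the same computation using broadcasting follows. The zero-range guard is an addition that the function above does not have, and the function name and sample data are assumptions.

```python
import numpy as np

def auto_norm_broadcast(dataset):
    min_vals = dataset.min(axis=0)
    ranges = dataset.max(axis=0) - min_vals
    # Assumption: treat a constant column as range 1 to avoid dividing by zero
    safe_ranges = np.where(ranges == 0, 1, ranges)
    # Broadcasting applies the per-column values to every row
    norm = (dataset - min_vals) / safe_ranges
    return norm, ranges, min_vals

data = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])
norm, ranges, min_vals = auto_norm_broadcast(data)
print(norm)
```

Without normalization, the column with the largest numeric range (frequent flier miles in the dating data) would dominate the Euclidean distance.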

In [31]:
normMat, ranges, minVals = autoNorm(datingDataMat)

In [32]:

array([[ 0.44832535,  0.39805139,  0.56233353],
       [ 0.15873259,  0.34195467,  0.98724416],
       [ 0.28542943,  0.06892523,  0.47449629],
       [ 0.29115949,  0.50910294,  0.51079493],
       [ 0.52711097,  0.43665451,  0.4290048 ],
       [ 0.47940793,  0.3768091 ,  0.78571804]])

In [33]:
ranges

array([  9.12730000e+04,   2.09193490e+01,   1.69436100e+00])

In [34]:
minVals

array([ 0.      ,  0.      ,  0.001156])

In [35]:
def datingClassTest():
    hoRatio = 0.1  # hold out the first 10% of the rows as the test set
    datingDataMat, datingLabels = file2matrix('Ch02/datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # Classify test row i against the remaining 90% of the rows
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with %d,the real answer is %d" %(classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is %f" %(errorCount / float(numTestVecs)))

In [36]:
datingClassTest()

the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 3,the real answer is 3
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 3,the real answer is 3
the classifier came back with 3,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 3
the classifier came back with 1,the real answer is 1
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 3
the classifier came back with 3,the real answer is 3
the classifier came back with 2,the real answer is 2
the classifier came back with 1,the real answer is 1
the classifier came back with 3,the real answer is 1
the total error rate is 0.050000

In [37]:
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('Ch02/datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Feature order must match the columns of the data file
    inArr = array([ffMiles, percentTats, iceCream])
    # Scale the new sample with the training minimums and ranges
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("you will probably like this person " + str(resultList[classifierResult - 1]))

In [38]:
classifyPerson()

percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
you will probably like this person in small doses

In [39]:
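def img2vector(filename):
    # Flatten a 32x32 text image of '0'/'1' characters into a 1x1024 vector
    returnVect = zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            linestr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(linestr[j])
    return returnVect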
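The flattening step can also be sketched without the index arithmetic by building a 32x32 grid and reshaping it. The helper name `lines_to_vector` and the synthetic image are assumptions; the function takes the text lines directly so that the example runs without the digit files.

```python
import numpy as np

def lines_to_vector(lines):
    """Flatten 32 strings of 32 '0'/'1' characters into a (1, 1024) array."""
    grid = np.array(
        [[int(ch) for ch in line[:32]] for line in lines[:32]],
        dtype=float,
    )
    # Row-major reshape puts pixel (i, j) at position 32 * i + j
    return grid.reshape(1, 1024)

# Hypothetical image: all zeros except a single 1 at row 2, column 5
lines = ['0' * 32 for _ in range(32)]
lines[2] = '0' * 5 + '1' + '0' * 26
vec = lines_to_vector(lines)
print(vec.shape, vec[0, 32 * 2 + 5])
```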

In [40]:
testVector = img2vector('Ch02/digits/trainingDigits/0_13.txt')

In [41]:
testVector[0, 0:31]

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.])

In [42]:
from os import listdir
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('Ch02/digits/trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)  # the digit before '_' in the file name is the label
        trainingMat[i, :] = img2vector('Ch02/digits/trainingDigits/%s' %(fileNameStr))
    testFileList = listdir("Ch02/digits/testDigits/")
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('Ch02/digits/testDigits/%s' %(fileNameStr))
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        #print("the classifier came back with %d, the real answer is %d " %(classifierResult, classNumStr))
        if classifierResult != classNumStr:
            print("the classifier came back with %d, the real answer is %d " %(classifierResult, classNumStr))
            errorCount += 1.0
    print("the total number of errors is %d" %(errorCount))
    print("the total error rate is %f" %(errorCount / float(mTest)))

In [43]:
handwritingClassTest()

the classifier came back with 3, the real answer is 8 
the classifier came back with 1, the real answer is 8 
the classifier came back with 7, the real answer is 9 
the classifier came back with 6, the real answer is 5 
the classifier came back with 9, the real answer is 3 
the classifier came back with 1, the real answer is 8 
the classifier came back with 6, the real answer is 8 
the classifier came back with 3, the real answer is 5 
the classifier came back with 1, the real answer is 9 
the classifier came back with 7, the real answer is 1 
the total number of errors is 10
the total error rate is 0.010571

