This article covers the APIs used in the program and the complete implementation; for the underlying algorithm, see the earlier KNN algorithm introduction.
一谜疤、基礎(chǔ)準備
1、python基礎(chǔ)
**sorted:**
Python's built-in sorting function. It returns a new sorted list; elements are compared with the less-than operator (`__lt__`).
print(sorted([5, 2, 3, 1, 4]))
[1, 2, 3, 4, 5]
print(sorted("This is a test string from Andrew".split(), key=str.lower))
['a', 'Andrew', 'from', 'is', 'string', 'test', 'This']
Sorting basics:
>>> print(sorted([5, 2, 3, 1, 4]))
[1, 2, 3, 4, 5]
>>> L = [5, 2, 3, 1, 4]
>>> L.sort()
>>> print(L)
[1, 2, 3, 4, 5]
Sorting with a comparison function (Python 3 removed the cmp parameter; wrap a comparator with functools.cmp_to_key):
>>> from functools import cmp_to_key
>>> L = [('b', 2), ('a', 1), ('c', 3), ('d', 4)]
>>> print(sorted(L, key=cmp_to_key(lambda x, y: x[1] - y[1])))
[('a', 1), ('b', 2), ('c', 3), ('d', 4)]
Sorting with a key:
>>> L = [('b', 2), ('a', 1), ('c', 3), ('d', 4)]
>>> print(sorted(L, key=lambda x: x[1]))
[('a', 1), ('b', 2), ('c', 3), ('d', 4)]
Sorting in reverse:
>>> print(sorted([5, 2, 3, 1, 4], reverse=True))
[5, 4, 3, 2, 1]
>>> print(sorted([5, 2, 3, 1, 4], reverse=False))
[1, 2, 3, 4, 5]
>>> L = [('d', 2), ('a', 4), ('b', 3), ('c', 2)]
>>> print(sorted(L, key=lambda x: (x[1], x[0])))
[('c', 2), ('d', 2), ('b', 3), ('a', 4)]
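The lambda keys above can also be written with `operator.itemgetter`, which is the idiomatic (and usually faster) choice for sorting tuples by a field:

```python
from operator import itemgetter

L = [('b', 2), ('a', 1), ('c', 3), ('d', 4)]
# itemgetter(1) is equivalent to lambda x: x[1]
print(sorted(L, key=itemgetter(1)))
# [('a', 1), ('b', 2), ('c', 3), ('d', 4)]
print(sorted(L, key=itemgetter(1), reverse=True))
# [('d', 4), ('c', 3), ('b', 2), ('a', 1)]
```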
**Counter.most_common:**
Returns a list of the n most common elements and their counts. If n is omitted, all elements are returned. When several elements have the same count, their relative order is unspecified.
from collections import Counter

data = [1, 2, 2, 3, 3, 2, 1, 1, 2]
print(Counter(data).most_common(1))
# >> [(2, 4)]
print(Counter(data).most_common(2))
# >> [(2, 4), (1, 3)]
print(Counter(data).most_common(1)[0][0])
# >> 2
print(Counter(data).most_common(1)[0][1])
# >> 4
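Calling `most_common()` with no argument, as described above, returns every element ordered from most to least frequent:

```python
from collections import Counter

data = [1, 2, 2, 3, 3, 2, 1, 1, 2]
# No argument: all elements, most frequent first
print(Counter(data).most_common())
# [(2, 4), (1, 3), (3, 2)]
```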
2硝桩、numpy基礎(chǔ)
** np.shape**
shape函數(shù)是numpy.core.fromnumeric中的函數(shù)沿猜,它的功能是讀取矩陣的長度
import numpy as np

data = [[1, 2, 2], [3, 3, 2]]
print(np.shape(data)[0])
#>> 2
print(np.shape(data)[1])
#>> 3
**np.zeros:**
Usage: zeros(shape, dtype=float, order='C')
Returns: a new array of the given shape and type, filled with zeros.
Parameters: shape: the shape of the array
dtype: data type, optional, defaults to numpy.float64
order: optional; 'C' means row-major (C-style) layout, 'F' means column-major (Fortran-style)
print(np.zeros(5))
# >> [ 0. 0. 0. 0. 0.]
print(np.zeros((5), dtype=bool))
# >> [False False False False False]
print(np.zeros([3, 2]))
# >> [[ 0.  0.]
#     [ 0.  0.]
#     [ 0.  0.]]
**np.tile:**
tile lives in numpy.lib.shape_base; it builds a new array by repeating an input array. For example, tile(A, n) repeats A n times along the last axis.
data = [[1, 2], [3, 4]]
print(np.tile(data, [2]))
# >> [[1 2 1 2]
#     [3 4 3 4]]
print(np.tile(data, [2, 2]))
# >> [[1 2 1 2]
#     [3 4 3 4]
#     [1 2 1 2]
#     [3 4 3 4]]
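The tile pattern above is exactly how the KNN program below computes distances: the query point is tiled to the shape of the training matrix before subtracting. NumPy broadcasting gives the same result without the explicit copy; a small sketch:

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])
predict = np.array([0, 1])

# Explicit tiling: repeat predict once per row of data, then subtract
tiled = np.tile(predict, (data.shape[0], 1)) - data
# Broadcasting: NumPy aligns the 1-D predict against each row automatically
broadcast = predict - data
print(np.array_equal(tiled, broadcast))
# True
```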
**np.linalg.norm:**
Computes a vector or matrix norm. For a vector x, ord=2 (the default) gives the Euclidean length, ord=1 the sum of absolute values, and ord=np.inf the maximum absolute value.
>>> x = np.array([3, 4])
>>> np.linalg.norm(x)
5.0
>>> np.linalg.norm(x, ord=2)
5.0
>>> np.linalg.norm(x, ord=1)
7.0
>>> np.linalg.norm(x, ord=np.inf)
4.0
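In the KNN setting, `np.linalg.norm` applied to a difference of two feature vectors gives their Euclidean distance directly:

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
# Euclidean distance between two feature vectors (a 3-4-5 triangle)
print(np.linalg.norm(a - b))
# 5.0
# The same distance written out explicitly
print(np.sqrt(((a - b) ** 2).sum()))
# 5.0
```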
二芬沉、完整程序
# -*- coding: utf-8 -*-
import math
import numpy as np
from collections import Counter
import warnings
## 經(jīng)典測試
# 流程
# 1躺同、對數(shù)據(jù)訓(xùn)練數(shù)據(jù)進行歸一化,(防止某些參數(shù)值過大影響距離計算)
# 2花嘶、按個取出測試數(shù)據(jù)(歸一化)笋籽,計算該個數(shù)據(jù)到所有訓(xùn)練數(shù)據(jù)的歐幾里得距離
# 3、對該個數(shù)據(jù)到各點的數(shù)據(jù)距離進行排序
# 4椭员、過濾出排名前幾名的點车海,列出各點的歸類
# 5、計算最大歸類就該分類了。
# k-Nearest Neighbor算法
def k_nearest_neighbors(data, predict, classLabel, k=5):
if len(data) >= k:
warnings.warn("k is too small")
distances = []
dataSize = data.shape[0]
# KNN的算法核心就是歐式距離的計算
# 第一種方案:
diffMat = np.tile(predict, (dataSize, 1)) - data
sqDiffMat = diffMat ** 2
euclidean_distance = sqDiffMat.sum(axis=1) ** 0.5
for i in range(dataSize):
# 將距離加上結(jié)果
distances.append([euclidean_distance[i], classLabel[i]])
# 第二種方案:
# for i in range(dataSize):
# features = data[i]
# # 計算predict點到各點的距離
# euclidean_distance = np.linalg.norm(np.array(features)-np.array(predict))
# # 將距離加上結(jié)果
# distances.append([euclidean_distance, classLabel[i]])
# 根據(jù)距離排序,并返回分類結(jié)果
sorted_distances =[i[1] for i in sorted(distances)]
# 取前K個值的數(shù)據(jù)
top_nearest = sorted_distances[:k]
# 返回最多分類
group_res = Counter(top_nearest).most_common(1)[0][0]
return group_res
def file2Mat(testFileName, parammterNumber):
fr = open(testFileName)
lines = fr.readlines()
lineNums = len(lines)
resultMat = np.zeros((lineNums, parammterNumber))
classLabelVector = []
for i in range(lineNums):
line = lines[i].strip()
itemMat = line.split('\t')
resultMat[i, :] = itemMat[0:parammterNumber]
classLabelVector.append(itemMat[-1])
fr.close()
return resultMat, classLabelVector;
# 為了防止某個屬性對結(jié)果產(chǎn)生很大的影響侍芝,所以有了這個優(yōu)化研铆,比如:10000,4.5,6.8 10000就對結(jié)果基本起了決定作用
def autoNorm(dataSet):
minVals = dataSet.min(0)
maxVals = dataSet.max(0)
ranges = maxVals - minVals
normMat = np.zeros(np.shape(dataSet))
size = normMat.shape[0]
normMat = dataSet - np.tile(minVals, (size, 1))
normMat = normMat / np.tile(ranges, (size, 1))
return normMat, minVals, ranges
if __name__=='__main__':
trainigSetFileName= 'data\\datingTrainingSet.txt'
testFileName = 'data\\datingTestSet.txt'
# 讀取訓(xùn)練數(shù)據(jù)
trianingMat, classLabel = file2Mat(trainigSetFileName, 3)
# 都數(shù)據(jù)進行歸一化的處理
trianingMat, minVals, ranges = autoNorm(trianingMat)
# 讀取測試數(shù)據(jù)
testMat, testLabel = file2Mat(testFileName, 3)
correct = 0
total = 0
testSize = testMat.shape[0]
print(testSize)
print(testMat.shape[1])
testSize = testMat.shape[0]
for i in range(testSize):
predict = testMat[i]
# 你可以調(diào)整這個k看看準確率的變化,你也可以使用matplotlib畫出k對應(yīng)的準確率州叠,找到最好的k值
res = k_nearest_neighbors(trianingMat, (predict - minVals) / ranges, classLabel, k = 5)
if testLabel[i] == res:
correct += 1
total += 1
print(correct) # 準確率
print(correct/(testSize*1.0)) # 準確率
print(k_nearest_neighbors(trianingMat, [4,2,1], classLabel, k = 5)) # 預(yù)測一條記錄
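The loop above suggests sweeping k to find the best accuracy. Here is a minimal, self-contained sketch of that idea; the two-class synthetic data and the small `knn` helper are hypothetical stand-ins so the snippet runs without the dating data files (with the real files you would reuse `k_nearest_neighbors`, `file2Mat`, and `autoNorm` from the program above):

```python
import numpy as np
from collections import Counter

# Hypothetical synthetic data: two Gaussian clusters standing in for the dating set
rng = np.random.RandomState(0)
train = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 3])
trainLabel = ['A'] * 20 + ['B'] * 20
test = np.vstack([rng.randn(10, 2), rng.randn(10, 2) + 3])
testLabel = ['A'] * 10 + ['B'] * 10

def knn(train, labels, x, k):
    # Euclidean distance from x to every training point, then majority vote
    d = np.sqrt(((train - x) ** 2).sum(axis=1))
    top = [labels[i] for i in np.argsort(d)[:k]]
    return Counter(top).most_common(1)[0][0]

# Sweep k and report accuracy for each value
for k in (1, 3, 5, 7):
    acc = sum(knn(train, trainLabel, x, k) == y
              for x, y in zip(test, testLabel)) / len(test)
    print(k, acc)
```

The same accuracies could then be fed to matplotlib to visualize how k affects the classifier.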