Advantages
High accuracy, insensitive to outliers, no assumptions about the input data
Disadvantages
High computational complexity and high space complexity
Applicable data types
Numeric values and nominal values
Nominal: a nominal target variable takes its values from a finite set, such as true/false (mainly used for classification)
Numeric: a numeric target variable can take values from an infinite set of numbers, such as 0.2300 or 1111.1111 (mainly used for regression)
How it works
There is a collection of samples, also called the training set, and every sample in it carries a label; that is, we know which class each sample in the set belongs to. After new, unlabeled data is input, we compare each feature of the new data against the corresponding features of the samples in the set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general we consider only the k most similar samples in the set (this is where the k in k-nearest neighbors comes from), and k is usually an integer no greater than 20. Finally, the class that appears most often among the k most similar samples is chosen as the class of the new data.
The explanation in 《統(tǒng)計(jì)學(xué)習(xí)方法》
Given a training data set, for a new input instance, find the k instances in the training set that are nearest to it; the class to which the majority of these k instances belong is the class assigned to the input instance.
General workflow of the k-nearest neighbors algorithm
1. Collect data: any method
2. Prepare data: the numeric values needed for the distance calculation, preferably in a structured data format
3. Analyze data: any method
4. Train the algorithm: this step does not apply to k-NN
5. Test the algorithm: compute the error rate
6. Use the algorithm: first input sample data and structured output values, then run the k-NN algorithm to determine which class the input data belongs to, and finally perform whatever follow-up processing the computed class calls for
For each point in the data set whose class is unknown, perform the following steps:
1. Compute the distance (defined below) between the current point and every point in the data set of known classes
2. Sort the distances in increasing order
3. Take the k points nearest to the current point
4. Determine the frequency of each class among those k points
5. Return the most frequent class among the k points as the predicted class of the current point
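The distance in step 1 is the Euclidean distance, which is also what the code below computes; for feature vectors $x$ and $y$ with $n$ features:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$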
The code
import numpy as np
import operator
import os
def create_data_set():
    # toy data set: four 2-D points in two classes
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
def classify0(in_x, data_set, labels, k):
    """
    in_x: the input vector to classify
    data_set: the matrix of training samples
    labels: the label vector (one label per training sample)
    k: the number of nearest neighbors to use in the vote
    """
    data_set_size = data_set.shape[0]
    # tile(original, (a, b)) repeats the matrix a times along the rows and b times along the columns
    # compute the Euclidean distance to every training sample
    diff_mat = np.tile(in_x, (data_set_size, 1)) - data_set
    sq_diff_mat = diff_mat ** 2
    # sum each row into a single squared distance per sample
    sq_distances = sq_diff_mat.sum(axis=1)
    # take the square root
    distances = sq_distances ** 0.5
    # sort ascending and return the indices into the original array
    sorted_dist_indices = distances.argsort()
    class_count = {}
    # count which class occurs most often among the k nearest neighbors
    for i in range(k):
        vote_label = labels[sorted_dist_indices[i]]
        class_count[vote_label] = class_count.get(vote_label, 0) + 1
    # sort the classes by vote count in descending order
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]
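A quick sanity check with the toy data set above; the expected answers follow directly from the four points defined in create_data_set:

group, labels = create_data_set()
print(classify0([0, 0], group, labels, 3))      # 'B': two B points and one A point are nearest
print(classify0([1.0, 1.2], group, labels, 3))  # 'A': both A points are nearest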
# read the file into a feature matrix and a label vector
def file2matrix(filename):
    with open(filename, 'r', encoding='UTF-8') as fr:
        lines = fr.readlines()
    number_of_lines = len(lines)
    mat = np.zeros((number_of_lines, 3))
    class_label_vector = []
    index = 0
    for line in lines:
        line = line.strip()
        content = line.split('\t')
        # the first three columns are the features, the last one is the class label
        mat[index, :] = content[0:3]
        class_label_vector.append(int(content[-1]))
        index += 1
    return mat, class_label_vector
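file2matrix expects each line of datingTestSet2.txt to hold three tab-separated numeric features (frequent flier miles, percentage of time spent playing video games, liters of ice cream consumed), followed by an integer class label from 1 to 3. Illustrative rows in that format (not actual contents of the file):

40920	8.326976	0.953952	3
14488	7.153469	1.673904	2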
# normalize feature values to the range [0, 1]
def auto_norm(data_set):
    # per-column (per-feature) minimum, maximum, and range
    min_value = data_set.min(0)
    max_value = data_set.max(0)
    ranges = max_value - min_value
    m = data_set.shape[0]
    norm_data_set = data_set - np.tile(min_value, (m, 1))
    norm_data_set = norm_data_set / np.tile(ranges, (m, 1))
    return norm_data_set, ranges, min_value
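auto_norm applies min-max rescaling to every feature so that a feature with a large raw range (such as flier miles) does not dominate the distance computation:

$$x' = \frac{x - \min}{\max - \min}$$

where min and max are taken per column, matching data_set.min(0) and data_set.max(0) above.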
# test the classifier on a hold-out portion of the dating data
def dating_class_test():
    # the first 20% of the rows are used for testing, the remaining 80% as training data
    ho_ratio = 0.2
    dating_data_mat, dating_labels = file2matrix("./MLiA_SourceCode/machinelearninginaction/Ch02"
                                                 "/datingTestSet2.txt")
    nor_mat, ranges, min_vals = auto_norm(dating_data_mat)
    m = nor_mat.shape[0]
    num_test_vecs = int(m * ho_ratio)
    error_count = 0.0
    for i in range(num_test_vecs):
        classifier_result = classify0(nor_mat[i, :], nor_mat[num_test_vecs:m, :],
                                      dating_labels[num_test_vecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifier_result, dating_labels[i]))
        if classifier_result != dating_labels[i]:
            error_count += 1
    print("the total error rate is: %f" % (error_count / float(num_test_vecs)))
# dating-site prediction function
def classify_person():
    result_list = ['not at all', 'in small doses', 'in large doses']
    percent_tats = float(input("percentage of time spent playing video games?"))
    ice_cream = float(input("liters of ice cream consumed per year?"))
    ff_miles = float(input("frequent flier miles earned per year?"))
    dating_data_mat, dating_labels = file2matrix("./MLiA_SourceCode/machinelearninginaction/Ch02"
                                                 "/datingTestSet2.txt")
    nor_mat, ranges, min_vals = auto_norm(dating_data_mat)
    # the feature order must match the columns of the data file
    in_arr = np.array([ff_miles, percent_tats, ice_cream])
    # normalize the new sample with the training minimum and range before classifying
    classifier_result = classify0((in_arr - min_vals) / ranges, nor_mat, dating_labels, 3)
    # labels run from 1 to 3, so shift by one to index result_list
    print("You will probably like this person: ", result_list[classifier_result - 1])
# convert a 32x32 text image into a 1x1024 vector
def img2vector(filename):
    vector = np.zeros((1, 1024))
    with open(filename, 'r', encoding='utf-8') as fp:
        for i in range(32):
            line_str = fp.readline()
            for j in range(32):
                vector[0, 32 * i + j] = int(line_str[j])
    return vector
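img2vector is the building block for the handwritten-digit example from the same chapter. Below is a minimal sketch of the test loop, assuming the book's directory layout (trainingDigits/ and testDigits/ with file names like 0_13.txt, where the prefix before the underscore is the digit's class); the paths are hypothetical and should be adjusted to wherever the digits data actually lives:

def hand_writing_class_test():
    # hypothetical paths, adjust as needed
    training_dir = "./MLiA_SourceCode/machinelearninginaction/Ch02/digits/trainingDigits"
    test_dir = "./MLiA_SourceCode/machinelearninginaction/Ch02/digits/testDigits"
    training_files = os.listdir(training_dir)
    training_mat = np.zeros((len(training_files), 1024))
    labels = []
    for i, name in enumerate(training_files):
        # the class is encoded in the file name, e.g. "0_13.txt" -> 0
        labels.append(int(name.split('_')[0]))
        training_mat[i, :] = img2vector(os.path.join(training_dir, name))
    error_count = 0
    test_files = os.listdir(test_dir)
    for name in test_files:
        expected = int(name.split('_')[0])
        result = classify0(img2vector(os.path.join(test_dir, name)), training_mat, labels, 3)
        if result != expected:
            error_count += 1
    print("error rate: %f" % (error_count / float(len(test_files))))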
Project repository: https://github.com/RJzz/Machine.git
On choosing the value of k
1. Decreasing k makes the model as a whole more complex: the prediction relies on training instances in a smaller neighborhood, which makes overfitting more likely.
2. Increasing k makes the model as a whole simpler; with a very large k, training instances far from the input (and largely irrelevant to it) also influence the prediction.
3. In practice, k is usually given a fairly small value, and cross-validation is typically used to select the best k; see the sketch below.
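A minimal sketch of selecting k by simple hold-out validation on the dating data, reusing file2matrix, auto_norm, and classify0 from above (a proper cross-validation would average over several splits; this is illustrative only):

def select_k(candidates=(1, 3, 5, 7, 9, 11), ho_ratio=0.2):
    data_mat, labels = file2matrix("./MLiA_SourceCode/machinelearninginaction/Ch02"
                                   "/datingTestSet2.txt")
    norm_mat, ranges, min_vals = auto_norm(data_mat)
    m = norm_mat.shape[0]
    num_test = int(m * ho_ratio)
    best_k, best_error = None, float('inf')
    for k in candidates:
        # classify each held-out row against the remaining rows and count errors
        errors = sum(
            classify0(norm_mat[i, :], norm_mat[num_test:m, :], labels[num_test:m], k) != labels[i]
            for i in range(num_test)
        )
        error_rate = errors / float(num_test)
        print("k=%d error rate=%f" % (k, error_rate))
        if error_rate < best_error:
            best_k, best_error = k, error_rate
    return best_k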
Follow-up
The kNN above is in fact computationally very expensive: every query scans the entire training set. One optimization is to build a kd-tree, a tree data structure that stores instance points in k-dimensional space so that they can be retrieved quickly.
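For reference, SciPy ships a kd-tree implementation; a minimal sketch of the same 3-nearest-neighbor query on the toy data using scipy.spatial.cKDTree (an extra dependency, not used by the code above):

from scipy.spatial import cKDTree

group, labels = create_data_set()
tree = cKDTree(group)                 # build the kd-tree once
dists, idx = tree.query([0, 0], k=3)  # retrieve the 3 nearest neighbors
print([labels[i] for i in idx])       # a majority vote over these yields 'B'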