1. The k-Nearest Neighbor Model
The k-nearest neighbor method (k-nearest neighbor, k-NN) is a basic algorithm for classification and regression. It has three key elements: the choice of k, the distance metric, and the classification decision rule.
The k-NN classification model:
$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$
where $N_k(x)$ denotes the neighborhood consisting of the k points nearest to x, and $I(\cdot)$ is the indicator function.
1. Choosing k
A small k gives a more complex model with a smaller approximation error, but it is prone to overfitting;
a large k gives a simpler model with a larger approximation error.
In practice, k is usually set to a relatively small value, and the best k can be determined by cross-validation. In addition, k is usually chosen to be odd, to prevent ties between classes.
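For example, here is a minimal sketch of picking k by cross-validation with scikit-learn's cross_val_score (the odd-only candidate range 1 to 15 is an assumption for illustration):

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)
scores = {}
for k in range(1, 16, 2):  # odd k only, to avoid ties in the vote
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()  # 5-fold CV accuracy
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])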
2. Distance Metric
The Euclidean distance is generally used, or the more general $L_p$ (Minkowski) distance:
$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}}$$
Here, $n$ is the feature dimension; $p = 2$ gives the Euclidean distance and $p = 1$ the Manhattan distance.
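A minimal NumPy sketch of this formula (the two sample vectors are made up):

import numpy as np

def minkowski_distance(x1, x2, p=2):
    # L_p distance: (sum_l |x1_l - x2_l|^p)^(1/p)
    return (np.abs(x1 - x2) ** p).sum() ** (1.0 / p)

x1, x2 = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski_distance(x1, x2, p=2))  # Euclidean distance: 5.0
print(minkowski_distance(x1, x2, p=1))  # Manhattan distance: 7.0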
3最盅、分類決策
多數(shù)表決突雪,即k個最近點類別多數(shù)的那個類別。
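For instance, the vote itself is one line with collections.Counter (the neighbor labels below are made up):

from collections import Counter

neighbor_labels = [1, 0, 1]  # labels of the k = 3 nearest points
print(Counter(neighbor_labels).most_common(1)[0][0])  # majority class: 1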
2. kd-Trees
A kd-tree (k-dimensional tree) is a data structure for partitioning a k-dimensional data space. (The k here refers to the dimensionality of the space, not the k of k-NN.)
Building a kd-tree:
1. In the current dimension, take the sample at the median as the parent node, splitting the remaining data into the left and right subtrees;
2. Move to the next dimension and repeat step 1 on the left and right subtrees.
Nearest-neighbor search:
1. Find the query point's "best leaf node" in the kd-tree.
Concretely, start from the root and descend according to the query point's value in each node's splitting dimension, until a leaf node is reached.
2. Backtrack upward.
Concretely, starting from the "best leaf node": if the current node is closer to the query point than the best point found so far, it becomes the current best point, and its brother node is also checked;
the search ends once the root is reached.
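As a quick sanity check of the build-and-search procedure above, here is a minimal sketch using SciPy's ready-made scipy.spatial.KDTree (the six sample points and the query point are made up for illustration):

import numpy as np
from scipy.spatial import KDTree

points = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
tree = KDTree(points)                # built by median splits, cycling dimensions
dist, idx = tree.query([6, 5], k=3)  # the 3 nearest neighbors of the query point
print(idx, dist)                     # indices into points, with their distances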
3. Practice
3.1 Brute Force
With a brute-force implementation on $N$ samples of $D$-dimensional data, classifying one query point costs about $O(DN + N \log N)$: $O(DN)$ to compute the distances, and $O(N \log N)$ to sort them and pick out the k nearest neighbors.
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
## Example 1: iris for classification( 3 classes)
# X, y = datasets.load_iris(return_X_y=True)
# Example 2
X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# my k-NN
class KNN():
    def __init__(self, data, K=3):
        self.X = data[0]
        self.y = data[1]
        self.K = K

    def fit(self, X_test):
        # Pairwise differences: every test point against every training point.
        diffX = np.repeat(X_test, len(self.X), axis=0) - np.tile(self.X, (len(X_test), 1))
        # Squared Euclidean distances, one row per test point.
        square_diffX = (diffX**2).sum(axis=1).reshape(len(X_test), len(self.X))
        # Training indices sorted by distance for each test point.
        sorted_index = square_diffX.argsort()
        predict = [0 for _ in range(len(X_test))]
        for i in range(len(X_test)):
            # Majority vote over the K nearest neighbors.
            class_count = {}
            for j in range(self.K):
                vote_label = self.y[sorted_index[i][j]]
                class_count[vote_label] = class_count.get(vote_label, 0) + 1
            sorted_count = sorted(class_count.items(), key=lambda x: x[1], reverse=True)
            predict[i] = sorted_count[0][0]
        return predict

    def predict(self, X_test):
        return self.fit(X_test)

    def score(self, X, y):
        y_pred = self.predict(X)
        return 1 - np.count_nonzero(y - y_pred) / len(y)
knn = KNN((X_train,y_train), K=3)
print(knn.score(X_test,y_test))
Output: 0.9415204678362573
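As an aside on the sorting step: argsort costs O(N log N) per query, while np.argpartition can select the k smallest distances in O(N). A minimal sketch of the idea (the distance values are made up; this is not part of the implementation above):

import numpy as np

distances = np.array([4.2, 0.5, 3.1, 0.9, 2.7])
k = 3
nearest = np.argpartition(distances, k)[:k]  # indices of the k smallest, unordered
print(nearest)                               # a permutation of [1 3 4]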
Using the sklearn API:
# sklearn API
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
print(neigh.score(X_test,y_test))
Output: 0.9415204678362573
As long as k-NN implementations agree on the choice of k and the definition of distance, their results are exactly identical.
3.2 kd-Tree
Now define the kd-tree. Time complexity: roughly $O(DN \log N)$ to build the tree, and about $O(D \log N)$ per query on average (in low dimensions).
# kd-tree
import math
from collections import deque

class KDNode:
    '''
    value: [X, y] -- the feature vector with its class label appended
    '''
    def __init__(self, value=None, parent=None, left=None, right=None):
        self.value = value
        self.parent = parent
        self.left = left
        self.right = right

    @property
    def brother(self):
        # The sibling node under the same parent (None for the root).
        if not self.parent:
            bro = None
        else:
            if self.parent.left is self:
                bro = self.parent.right
            else:
                bro = self.parent.left
        return bro

class KDTree():
    def __init__(self, K=3):
        self.root = KDNode()
        self.K = K

    def _build(self, data, axis=0, parent=None):
        '''
        data: [X, y] rows; split recursively on the median of the current axis.
        '''
        # choose the median point as the subtree root
        if len(data) == 0:
            return KDNode()
        data = np.array(sorted(data, key=lambda x: x[axis]))
        median = len(data) // 2
        loc = data[median]
        root = KDNode(loc, parent=parent)
        # cycle to the next feature dimension (the last column is the label)
        new_axis = (axis + 1) % (len(data[0]) - 1)
        if len(data[:median, :]) == 0:
            root.left = None
        else:
            root.left = self._build(data[:median, :], axis=new_axis, parent=root)
        if len(data[median + 1:, :]) == 0:
            root.right = None
        else:
            root.right = self._build(data[median + 1:, :], axis=new_axis, parent=root)
        return root

    def fit(self, X, y):
        # concat X, y so every node carries its label in the last column
        data = np.concatenate([X, y.reshape(-1, 1)], axis=1)
        self.root = self._build(data)
        return self.root

    def _get_eu_distance(self, arr1: np.ndarray, arr2: np.ndarray) -> float:
        return ((arr1 - arr2) ** 2).sum() ** 0.5

    def _search_node(self, current, point, result, class_count):
        # Get max_node, max_distance: the farthest of the candidates kept so far.
        if not result:
            max_node = None
            max_distance = float('inf')
        else:
            # find the farthest (node, distance) pair in the candidate set
            max_node, max_distance = sorted(result.items(), key=lambda n: n[1], reverse=True)[0]
        node_dist = self._get_eu_distance(current.value[:-1], point)
        if len(result) == self.K:
            # Candidate set is full: replace the farthest point if closer.
            if node_dist < max_distance:
                result.pop(max_node)
                result[current] = node_dist
                class_count[current.value[-1]] = class_count.get(current.value[-1], 0) + 1
                class_count[max_node.value[-1]] = class_count.get(max_node.value[-1], 1) - 1
        elif len(result) < self.K:
            result[current] = node_dist
            class_count[current.value[-1]] = class_count.get(current.value[-1], 0) + 1
        return result, class_count

    def search(self, point):
        # find the leaf node whose cell the query point falls into
        current = self.root
        axis = 0
        while current:
            if point[axis] < current.value[axis]:
                prev = current
                current = current.left
            else:
                prev = current
                current = current.right
            axis = (axis + 1) % len(point)
        current = prev
        # collect k nearest candidates while backtracking up to the root
        result = {}
        class_count = {}
        while current:
            result, class_count = self._search_node(current, point, result, class_count)
            if current.brother:
                result, class_count = self._search_node(current.brother, point, result, class_count)
            current = current.parent
        return result, class_count

    def predict(self, X_test):
        predict = [0 for _ in range(len(X_test))]
        for i in range(len(X_test)):
            _, class_count = self.search(X_test[i])
            # majority vote among the collected candidates
            sorted_count = sorted(class_count.items(), key=lambda x: x[1], reverse=True)
            predict[i] = sorted_count[0][0]
        return predict

    def score(self, X, y):
        y_pred = self.predict(X)
        return 1 - np.count_nonzero(y - y_pred) / len(y)

    def print_tree(self, X_train, y_train):
        # breadth-first print, roughly centering each level
        height = int(math.log(len(X_train)) / math.log(2))
        max_width = pow(2, height)
        node_width = 2
        in_level = 1
        root = self.fit(X_train, y_train)  # fit returns the root node
        q = deque()
        q.append(root)
        while q:
            count = len(q)
            width = int(max_width * node_width / in_level)
            in_level += 1
            while count > 0:
                node = q.popleft()
                if node.left:
                    q.append(node.left)
                if node.right:
                    q.append(node.right)
                node_str = (str(node.value) if node else '').center(width)
                print(node_str, end=' ')
                count -= 1
            print("\n")
kd = KDTree()
kd.fit( X_train, y_train)
print(kd.score(X_test,y_test))
Output: 0.9590643274853801
Q1: Why does the result differ from the brute-force result above?
A1: The problem is most likely in the upward backtracking: this implementation computes the distance to each node's brother, but when the brother becomes the current best point it does not keep searching downward into the brother's subtree. The correct approach should be: if the current node becomes the current best point, also search its brother, and if the brother becomes the best point, continue the search downward inside that brother's subtree (my guess).
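For reference, the standard backtracking test is sketched below. must_check_brother is a hypothetical helper (not part of the class above): it decides whether closer points could still hide on the other side of a parent's splitting plane.

# Hypothetical helper sketching the standard kd-tree backtracking test.
# `parent` splits on dimension `axis`; its splitting plane passes through
# parent.value[axis]. Points closer than best_dist can only lie on the other
# side if the plane itself is within best_dist of the query point.
def must_check_brother(parent, point, axis, best_dist):
    return abs(point[axis] - parent.value[axis]) < best_dist
# When this returns True, the brother's subtree must be searched recursively
# (descend to its leaf and backtrack again), not just the brother node itself.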
Comparing the runtime of the kd-tree and brute force
A custom sample set:
# Example 3
X, y = datasets.make_blobs(n_samples=10000, centers=3,
                           n_features=3, random_state=0)
# re-split the newly generated data (assumed: the same 70/30 split as before)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
import time
# brute force
st = time.time()
knn = KNN((X_train,y_train), K=3)
print(knn.score(X_test,y_test))
et = time.time()
print("use", et-st)
# kd tree
st = time.time()
kd = KDTree()
kd.fit( X_train, y_train)
print(kd.score(X_test,y_test))
et = time.time()
print("use", et-st)
Output:
# brute force
0.9993333333333333
use 2.8174679279327393
# kd tree
0.9973333333333333
use 0.7757344245910645
With comparable accuracy, the kd-tree's runtime is clearly shorter.
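The same trade-off can be checked against scikit-learn, whose KNeighborsClassifier exposes both strategies through its algorithm parameter; a minimal sketch reusing the split above (timings will of course vary by machine):

from sklearn.neighbors import KNeighborsClassifier
import time

for algo in ("brute", "kd_tree"):
    clf = KNeighborsClassifier(n_neighbors=3, algorithm=algo)
    clf.fit(X_train, y_train)
    st = time.time()
    print(algo, clf.score(X_test, y_test), "use", time.time() - st)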