李航統(tǒng)計學習方法(三)----k近鄰算法

k近鄰算法

給定一個訓練數(shù)據(jù)集，對新的輸入實例周霉，在訓練數(shù)據(jù)集中找到跟它最近的k個實例雨席，根據(jù)這k個實例的類判斷它自己的類（一般采用多數(shù)表決的方法）。

image.png

k近鄰模型

模型有3個要素——距離度量方法捌显、k值的選擇和分類決策規(guī)則茁彭。

模型

當3要素確定的時候，對任何實例（訓練或輸入）扶歪，它所屬的類都是確定的理肺，相當于將特征空間分為一些子空間。

image.png

距離度量
對n維實數(shù)向量空間Rn击罪，經(jīng)常用Lp距離或曼哈頓Minkowski距離哲嘲。

Lp距離定義如下：

image

當p=2時，稱為歐氏距離：

image

當p=1時媳禁，稱為曼哈頓距離：

image

當p=∞眠副，它是各個坐標距離的最大值，即：

image

用圖表示如下：

image.png

k值的選擇

k較小竣稽，容易被噪聲影響囱怕，發(fā)生過擬合。

k較大毫别，較遠的訓練實例也會對預(yù)測起作用娃弓，容易發(fā)生錯誤。

分類決策規(guī)則

使用0-1損失函數(shù)衡量岛宦，那么誤分類率是：

image

N_k是近鄰集合台丛，要使左邊最小，右邊的

image

必須最大，所以多數(shù)表決=經(jīng)驗最小化挽霉。

k近鄰法的實現(xiàn)：kd樹

算法核心在于怎么快速搜索k個近鄰出來防嗡，樸素做法是線性掃描，不可取侠坎，這里介紹的方法是kd樹蚁趁。

構(gòu)造kd樹

對數(shù)據(jù)集T中的子集S初始化S=T，取當前節(jié)點node=root取維數(shù)的序數(shù)i=0实胸，對S遞歸執(zhí)行：

找出S的第i維的中位數(shù)對應(yīng)的點他嫡，通過該點，且垂直于第i維坐標軸做一個超平面庐完。該點加入node的子節(jié)點钢属。該超平面將空間分為兩個部分，對這兩個部分分別重復此操作（S=S'假褪，++i署咽，node=current），直到不可再分生音。

image.png

   T = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
     
    class node:
        def __init__(self, point):
            self.left = None
            self.right = None
            self.point = point
            pass
        
    def median(lst):
        m = len(lst) / 2
        return lst[m], m
     
    def build_kdtree(data, d):
        data = sorted(data, key=lambda x: x[d])
        p, m = median(data)
        tree = node(p)
     
        del data[m]
        print data, p
     
        if m > 0: tree.left = build_kdtree(data[:m], not d)
        if len(data) > 1: tree.right = build_kdtree(data[m:], not d)
        return tree
     
    kd_tree = build_kdtree(T, 0)
    print kd_tree

可視化

可視化的話則要費點功夫保存中間結(jié)果宁否，并恰當?shù)卣故境鰜?/p>

    # -*- coding:utf-8 -*-
    # Filename: kdtree.py
    # Author：hankcs
    # Date: 2015/2/4 15:01
    import copy
    import itertools
    from matplotlib import pyplot as plt
    from matplotlib.patches import Rectangle
    from matplotlib import animation
     
    T = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
     
     
    def draw_point(data):
        X, Y = [], []
        for p in data:
            X.append(p[0])
            Y.append(p[1])
        plt.plot(X, Y, 'bo')
     
     
    def draw_line(xy_list):
        for xy in xy_list:
            x, y = xy
            plt.plot(x, y, 'g', lw=2)
     
     
    def draw_square(square_list):
        currentAxis = plt.gca()
        colors = itertools.cycle(["r", "b", "g", "c", "m", "y", '#EB70AA', '#0099FF'])
        for square in square_list:
            currentAxis.add_patch(
                Rectangle((square[0][0], square[0][1]), square[1][0] - square[0][0], square[1][1] - square[0][1],
                          color=next(colors)))
     
     
    def median(lst):
        m = len(lst) / 2
        return lst[m], m
     
     
    history_quare = []
     
     
    def build_kdtree(data, d, square):
        history_quare.append(square)
        data = sorted(data, key=lambda x: x[d])
        p, m = median(data)
     
        del data[m]
        print data, p
     
        if m >= 0:
            sub_square = copy.deepcopy(square)
            if d == 0:
                sub_square[1][0] = p[0]
            else:
                sub_square[1][1] = p[1]
            history_quare.append(sub_square)
            if m > 0: build_kdtree(data[:m], not d, sub_square)
        if len(data) > 1:
            sub_square = copy.deepcopy(square)
            if d == 0:
                sub_square[0][0] = p[0]
            else:
                sub_square[0][1] = p[1]
            build_kdtree(data[m:], not d, sub_square)
     
     
    build_kdtree(T, 0, [[0, 0], [10, 10]])
    print history_quare
     
     
    # draw an animation to show how it works, the data comes from history
    # first set up the figure, the axis, and the plot element we want to animate
    fig = plt.figure()
    ax = plt.axes(xlim=(0, 2), ylim=(-2, 2))
    line, = ax.plot([], [], 'g', lw=2)
    label = ax.text([], [], '')
     
    # initialization function: plot the background of each frame
    def init():
        plt.axis([0, 10, 0, 10])
        plt.grid(True)
        plt.xlabel('x_1')
        plt.ylabel('x_2')
        plt.title('build kd tree (www.hankcs.com)')
        draw_point(T)
     
     
    currentAxis = plt.gca()
    colors = itertools.cycle(["#FF6633", "g", "#3366FF", "c", "m", "y", '#EB70AA', '#0099FF', '#66FFFF'])
     
    # animation function.  this is called sequentially
    def animate(i):
        square = history_quare[i]
        currentAxis.add_patch(
            Rectangle((square[0][0], square[0][1]), square[1][0] - square[0][0], square[1][1] - square[0][1],
                      color=next(colors)))
        return
     
    # call the animator.  blit=true means only re-draw the parts that have changed.
    anim = animation.FuncAnimation(fig, animate, init_func=init, frames=len(history_quare), interval=1000, repeat=False,
                                   blit=False)
    plt.show()
    anim.save('kdtree_build.gif', fps=2, writer='imagemagick')

image.png

搜索kd樹

上面的代碼其實并沒有搜索kd樹，現(xiàn)在來實現(xiàn)搜索缀遍。

搜索跟二叉樹一樣來慕匠，是一個遞歸的過程。先找到目標點的插入位置域醇，然后往上走台谊，逐步用自己到目標點的距離畫個超球體，用超球體圈住的點來更新最近鄰（或k最近鄰）譬挚。以最近鄰為例锅铅，實現(xiàn)如下（本實現(xiàn)由于測試數(shù)據(jù)簡單，沒有做超球體與超立體相交的邏輯）：

    # -*- coding:utf-8 -*-
    # Filename: search_kdtree.py
    # Author：hankcs
    # Date: 2015/2/4 15:01
     
    T = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
     
     
    class node:
        def __init__(self, point):
            self.left = None
            self.right = None
            self.point = point
            self.parent = None
            pass
     
        def set_left(self, left):
            if left == None: pass
            left.parent = self
            self.left = left
     
        def set_right(self, right):
            if right == None: pass
            right.parent = self
            self.right = right
     
     
    def median(lst):
        m = len(lst) / 2
        return lst[m], m
     
     
    def build_kdtree(data, d):
        data = sorted(data, key=lambda x: x[d])
        p, m = median(data)
        tree = node(p)
     
        del data[m]
     
        if m > 0: tree.set_left(build_kdtree(data[:m], not d))
        if len(data) > 1: tree.set_right(build_kdtree(data[m:], not d))
        return tree
     
     
    def distance(a, b):
        print a, b
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
     
     
    def search_kdtree(tree, d, target):
        if target[d] < tree.point[d]:
            if tree.left != None:
                return search_kdtree(tree.left, not d, target)
        else:
            if tree.right != None:
                return search_kdtree(tree.right, not d, target)
     
        def update_best(t, best):
            if t == None: return
            t = t.point
            d = distance(t, target)
            if d < best[1]:
                best[1] = d
                best[0] = t
     
     
        best = [tree.point, 100000.0]
        while (tree.parent != None):
            update_best(tree.parent.left, best)
            update_best(tree.parent.right, best)
            tree = tree.parent
        return best[0]
     
     
    kd_tree = build_kdtree(T, 0)
    print search_kdtree(kd_tree, 0, [9, 4])

輸出

[8, 1] [9, 4]
[5, 4] [9, 4]  
[9, 6] [9, 4]
[9, 6]

最后編輯于：2018.02.11 23:06:56

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

人面猴
序言：七十年代末减宣，一起剝皮案震驚了整個濱河市盐须，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌漆腌，老刑警劉巖贼邓，帶你破解...
沈念sama閱讀 218,122評論 6贊 505
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場離奇詭異闷尿，居然都是意外死亡塑径，警方通過查閱死者的電腦和手機，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 93,070評論 3贊 395
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進店門填具，熙熙樓的掌柜王于貴愁眉苦臉地迎上來统舀，“玉大人，你說我怎么就攤上這事“笤郏” “怎么了绰筛？”我有些...
開封第一講書人閱讀 164,491評論 0贊 354
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵，是天一觀的道長描融。經(jīng)常有香客問我，道長衡蚂，這世上最難降的妖魔是什么窿克？我笑而不...
開封第一講書人閱讀 58,636評論 1贊 293
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮毛甲，結(jié)果婚禮上年叮，老公的妹妹穿的比我還像新娘。我一直安慰自己玻募，他們只是感情好只损，可當我...
茶點故事閱讀 67,676評論 6贊 392
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布。她就那樣靜靜地躺著七咧，像睡著了一般跃惫。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上艾栋，一...
開封第一講書人閱讀 51,541評論 1贊 305
城市分裂傳說
那天爆存，我揣著相機與錄音，去河邊找鬼蝗砾。笑死先较，一個胖子當著我的面吹牛，可吹牛的內(nèi)容都是我干的悼粮。我是一名探鬼主播闲勺，決...
沈念sama閱讀 40,292評論 3贊 418
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼，長吁一口氣：“原來是場噩夢啊……” “哼扣猫！你這毒婦竟也來了菜循？” 一聲冷哼從身側(cè)響起，我...
開封第一講書人閱讀 39,211評論 0贊 276
萬榮殺人案實錄
序言：老撾萬榮一對情侶失蹤苞笨，失蹤者是張志新（化名）和其女友劉穎债朵，沒想到半個月后，有當?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體瀑凝，經(jīng)...
沈念sama閱讀 45,655評論 1贊 314
?護林員之死
正文獨居荒郊野嶺守林人離奇死亡序芦，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點故事閱讀 37,846評論 3贊 336
?白月光啟示錄
正文我和宋清朗相戀三年，在試婚紗的時候發(fā)現(xiàn)自己被綠了粤咪。大學時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片谚中。...
茶點故事閱讀 39,965評論 1贊 348
活死人
序言：一個原本活蹦亂跳的男人離奇死亡，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出宪塔，到底是詐尸還是另有隱情磁奖，我是刑警寧澤，帶...
沈念sama閱讀 35,684評論 5贊 347
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布某筐，位于F島的核電站比搭，受9級特大地震影響，放射性物質(zhì)發(fā)生泄漏南誊。R本人自食惡果不足惜身诺，卻給世界環(huán)境...
茶點故事閱讀 41,295評論 3贊 329
男人毒藥：我在死后第九天來索命
文/蒙蒙一、第九天我趴在偏房一處隱蔽的房頂上張望抄囚。院中可真熱鬧霉赡，春花似錦、人聲如沸幔托。這莊子的主人今日做“春日...
開封第一講書人閱讀 31,894評論 0贊 22
一樁弒父案，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽重挑。三九已至嗓化，卻和暖如春，著一層夾襖步出監(jiān)牢的瞬間攒驰，已是汗流浹背蟆湖。一陣腳步聲響...
開封第一講書人閱讀 33,012評論 1贊 269
情欲美人皮
我被黑心中介騙來泰國打工，沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留玻粪，地道東北人隅津。一個月前我還...
沈念sama閱讀 48,126評論 3贊 370
代替公主和親
正文我出身青樓，卻偏偏與公主長得像劲室，于是被迫代替她去往敵國和親伦仍。傳聞我的和親對象是個殘疾皇子，可洞房花燭夜當晚...
茶點故事閱讀 44,914評論 2贊 355

李航統(tǒng)計學習方法(三)----k近鄰算法

k近鄰算法

k近鄰模型

模型

k值的選擇

分類決策規(guī)則

k近鄰法的實現(xiàn)：kd樹

可視化

搜索kd樹

推薦閱讀更多精彩內(nèi)容