Machine Learning Notes - 18. Clustering in Practice (Instructor: Zou Bo, 鄒博)

Goals for This Session

2018-07-23 19_54_56-【鄒博_chinahadoop】機器學習升級版VII（七）.png

K-Means

2018-07-23 19_58_14-【鄒博_chinahadoop】機器學習升級版VII（七）.png

Vector Quantization (VQ)

2018-07-23 19_58_33-【鄒博_chinahadoop】機器學習升級版VII（七）.png

Traditional Models

2018-07-23 19_59_50-【鄒博_chinahadoop】機器學習升級版VII（七）.png

Traditional Image Classification

2018-07-24 10_01_16-【鄒博_chinahadoop】機器學習升級版VII（七）.png

2018-08-05 18_51_36-【鄒博_chinahadoop】機器學習升級版VII（七）.png

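As the figure shows, the three samples below have different dimensions:
The first is 20x144.
The second is 10x144.
The third is 15x144.
Because the dimensions differ, they cannot be fed directly to logistic regression (LR).
Look at it another way, though: every row has length 144, so the 20, 10, and 15 rows can each be treated as independent samples. Think of each image as a bag of words: the first bag holds 20 samples, the second 10, and the third 15.
Then cluster all 45 samples together.
In the middle of the figure they are grouped into 5 clusters (K=5).
Next, count how the samples of each image fall into those 5 clusters.
The feature for each image becomes a 5-dimensional vector.
In other words, to the classifier each image is a column vector of length 5.
Laying those column vectors on their sides gives a matrix with 3 rows (3 images) and 5 columns, which can be fed to an SVM or a random forest.
You could equally map each image to a vector of 1000 numbers, giving a 3x1000 matrix.
Whether you map to 5 numbers or 1000, the result is far lower-dimensional than the original, so clustering achieves dimensionality reduction.
The middle step (the clustering) is unsupervised learning. It can be viewed as mapping each 144-dimensional vector to a single number, its cluster index. The vector has been quantized: this is vector quantization, or VQ.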
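The bag-of-words construction can be sketched in a few lines. This is a minimal illustration rather than the lecture's code: the descriptors are random synthetic data, and the names `bags`, `K`, and `features` are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three "images" with 20, 10 and 15 local descriptors of 144 dimensions each
# (synthetic stand-ins; real descriptors would come from a feature extractor).
bags = [rng.normal(size=(n, 144)) for n in (20, 10, 15)]

# Pool all 45 descriptors and cluster them into K = 5 groups.
all_descriptors = np.vstack(bags)
K = 5
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(all_descriptors)

# Describe each image by how many of its descriptors fall into each cluster.
features = np.array([
    np.bincount(kmeans.predict(bag), minlength=K) for bag in bags
])
print(features.shape)  # (3, 5): one fixed-length row per image
print(features)
```

Each row sums to the number of descriptors in that image, and the fixed-width matrix is what a classifier such as an SVM would receive.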

2018-08-05 19_02_48-【鄒博_chinahadoop】機器學習升級版VII（七）.png

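Now look back at the Vector Quantization slide.
Any image can be viewed as a three-channel RGB color image in which each pixel is a triple of numbers; in other words, every pixel is a vector.
An 800x600 image has 480,000 such vectors. Cluster them and replace each original pixel with the center of its cluster. The cluster centers act as a palette, and the image is compressed.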

2018-08-05 19_17_14-【鄒博_chinahadoop】機器學習升級版VII（七）.png

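Classroom Q&A

Q: Does noise in the image have an effect?
A: It should, but sometimes the method adapts to it. Say the image is reddish and contains a green patch as noise. If the noise forms its own cluster, it is not much of a problem. Salt-and-pepper noise with random colors, however, has a large effect. If the salt-and-pepper noise has a consistent color, it clusters together and the effect is small.

Q: Is clustering similar to PCA?
A: No.

Q: Where is clustering mainly used?
A: Clustering is generally not the main algorithm. For example, when a convolutional network is used to recognize objects, clustering may serve as an intermediate step that transforms the data, and that is where it adds value: as preprocessing, intermediate processing, or post-processing, one link in the overall pipeline. Clustering can also help with feature selection.

Q: Can clustering achieve dimensionality reduction?
A: Yes.

Q: With RGB data, are the R, G, and B channels clustered together or separately?
A: Together.

Q: With multiple images, are they clustered together or separately?
A: Multiple images are also clustered together, and the features are built jointly.

Q: How does feature selection relate to VQ?
A: We use VQ to produce the final features: those 1000 numbers.

Q: In practice, how do I choose among K-means, density-peak clustering, spectral clustering, and DBSCAN? Do I have to try them one by one?
A: If the dataset is very large, K-means is the only practical choice.
If the dataset is not especially large, density-peak clustering or DBSCAN are options.
If the dataset is medium-sized or small, spectral clustering is worth a try, because it does not care much about how the feature distances were derived.

Q: Does image compression mean squeezing all of the image's colors into 100 colors?
A: Yes.

Q: Can the result be understood as keeping only the parts that hold the image's dominant colors?
A: No. Every pixel is replaced.

Q: What happens after clustering?
A: The next step might be an SVM, logistic regression, or feeding the features to a convolutional network. It depends on the problem.

Q: Can clustering be used for binarization?
A: Certainly. For example, you can reduce an image to black and white.

Q: Can VQ also be used to turn an image into a character-based image?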

2018-08-05 19_43_18-【鄒博_chinahadoop】機器學習升級版VII（七）.png

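A: Running VQ first could well work.

Q: Can clustering be used for consensus problems, that is, evolving all points until they converge to a single point?
A: That depends on the application.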

AP Clustering (Affinity Propagation)

2018-08-05 19_12_38-【鄒博_chinahadoop】機器學習升級版VII（七）.png

Its time complexity is high and it runs slowly, so it only suits small to medium datasets.
Parameter tuning is manageable.

MeanShift

Shift the mean repeatedly, which produces the result below.

2018-08-05 19_46_40-【鄒博_chinahadoop】機器學習升級版VII（七）png.png

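Density-Based Clustering

The parameters are ε and m (the minimum number of samples); the figure also reports the resulting number of clusters.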

2018-08-05 19_47_44-【鄒博_chinahadoop】機器學習升級版VII（七）.png

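Spectral Clustering

The parameter sets the scale of the similarity values; in practice it is the standard deviation of the Gaussian kernel.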

2018-08-05 19_49_15-【鄒博_chinahadoop】機器學習升級版VII（七）.png

It is useful for segmentation:

2018-08-05 19_50_23-【鄒博_chinahadoop】機器學習升級版VII（七）.png

Code Walkthrough

Code accompanying the video: link: https://pan.baidu.com/s/1LOeY5sWr7dyX-LBhSFI9WQ password: cipq
See: 第十八課_代碼.zip

1. K-means

Code

# !/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib.colors
import matplotlib.pyplot as plt
import sklearn.datasets as ds
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score, adjusted_mutual_info_score,\
    adjusted_rand_score, silhouette_score
from sklearn.cluster import KMeans


def expand(a, b):
    d = (b - a) * 0.1
    return a-d, b+d


if __name__ == "__main__":
    N = 400
    centers = 4
    data, y = ds.make_blobs(N, n_features=2, centers=centers, random_state=2)
    data2, y2 = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=(1,2.5,0.5,2), random_state=2)
    data3 = np.vstack((data[y == 0][:], data[y == 1][:50], data[y == 2][:20], data[y == 3][:5]))
    y3 = np.array([0] * 100 + [1] * 50 + [2] * 20 + [3] * 5)
    m = np.array(((1, 1), (1, 3)))
    data_r = data.dot(m)

    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False
    cm = matplotlib.colors.ListedColormap(list('rgbm'))
    data_list = data, data, data_r, data_r, data2, data2, data3, data3
    y_list = y, y, y, y, y2, y2, y3, y3
    titles = 'Original data', 'KMeans++ clustering', 'Rotated data', 'KMeans++ on rotated data',\
             'Unequal-variance data', 'KMeans++ on unequal variance', 'Unequal-size data', 'KMeans++ on unequal sizes'

    model = KMeans(n_clusters=4, init='k-means++', n_init=5)
    plt.figure(figsize=(8, 9), facecolor='w')
    for i, (x, y, title) in enumerate(zip(data_list, y_list, titles), start=1):
        plt.subplot(4, 2, i)
        plt.title(title)
        if i % 2 == 1:
            y_pred = y
        else:
            y_pred = model.fit_predict(x)
        print(i)
        print('Homogeneity：', homogeneity_score(y, y_pred))
        print('completeness：', completeness_score(y, y_pred))
        print('V measure：', v_measure_score(y, y_pred))
        print('AMI：', adjusted_mutual_info_score(y, y_pred))
        print('ARI：', adjusted_rand_score(y, y_pred))
        print('Silhouette：', silhouette_score(x, y_pred), '\n')
        plt.scatter(x[:, 0], x[:, 1], c=y_pred, s=30, cmap=cm, edgecolors='none')
        x1_min, x2_min = np.min(x, axis=0)
        x1_max, x2_max = np.max(x, axis=0)
        x1_min, x1_max = expand(x1_min, x1_max)
        x2_min, x2_max = expand(x2_min, x2_max)
        plt.xlim((x1_min, x1_max))
        plt.ylim((x2_min, x2_max))
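        plt.grid(True, ls=':')    # the b= keyword is rejected by recent Matplotlib releases
    plt.tight_layout(pad=2, rect=(0, 0, 1, 0.97))    # pad must be passed by keyword
    plt.suptitle('Effect of data distribution on KMeans clustering', fontsize=18)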
    plt.show()

Results:

2018-08-05 19_54_52-Figure 1.png

1
Homogeneity： 1.0
completeness： 1.0
V measure： 1.0
AMI： 1.0
ARI： 1.0
Silhouette： 0.616436816839852 

2
Homogeneity： 0.9898828240244267
completeness： 0.9899006758819153
V measure： 0.9898917498726852
AMI： 0.9897991568445268
ARI： 0.9933165272203728
Silhouette： 0.6189656317733315 

3
Homogeneity： 1.0
completeness： 1.0
V measure： 1.0
AMI： 1.0
ARI： 1.0
Silhouette： 0.5275987244664399 

4
Homogeneity： 0.7248868671759175
completeness： 0.7260584887742589
V measure： 0.7254722049396414
AMI： 0.7226116135013659
ARI： 0.6703250071796025
Silhouette： 0.5349498852778517 

5
Homogeneity： 1.0
completeness： 1.0
V measure： 1.0
AMI： 1.0
ARI： 1.0
Silhouette： 0.4790725752982868 

6
Homogeneity： 0.7449364376693913
completeness： 0.7755445167472194
V measure： 0.7599323988656884
AMI： 0.7428234768685047
ARI： 0.7113213508090338
Silhouette： 0.5737260449304202 

7
Homogeneity： 1.0
completeness： 1.0
V measure： 1.0
AMI： 1.0
ARI： 1.0
Silhouette： 0.5975066093204152 

8
Homogeneity： 0.9776347312784609
completeness： 0.9728632742060752
V measure： 0.975243166591057
AMI： 0.9721283376882836
ARI： 0.9906840043816505
Silhouette： 0.6013877858619149

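The Silhouette Coefficient

The silhouette coefficient is one way to evaluate how good a clustering is. It was introduced by Peter J. Rousseeuw (the source dates it to 1986; the paper appeared in 1987). It combines two factors, cohesion and separation, and can be used to compare different algorithms, or different runs of the same algorithm, on the same data.

If s_i is close to 1, sample i is well clustered;
if s_i is close to -1, sample i would be better assigned to another cluster;
if s_i is approximately 0, sample i lies on the boundary between two clusters.
See also: Clustering evaluation: the silhouette coefficient
Take the second plot (KMeans++ clustering), whose silhouette score is 0.6189656317733315. For each point, let A be its mean distance to the other samples in its own cluster and B its mean distance to the samples of the nearest other cluster. The point's coefficient is (B - A) / max(A, B), which equals 1 - A/B whenever A < B; the reported score is the mean over all samples.
The ideal value is 1, but in practice no dataset reaches it: a point's mean distance to its own cluster is not zero, and its mean distance to another cluster cannot be infinite, so the score stays below 1.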
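To make the definition concrete, here is a small sketch that computes the silhouette score by hand and compares it with `silhouette_score`. The dataset and the variable names (`D`, `scores`, `manual`) are chosen for illustration only, and the loop assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, pairwise_distances

X, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=2)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

D = pairwise_distances(X)              # Euclidean distance between every pair of points
scores = np.empty(len(X))
for i in range(len(X)):
    same = labels == labels[i]
    same[i] = False                    # exclude the point itself
    a = D[i, same].mean()              # A: mean distance within its own cluster
    # B: mean distance to the nearest other cluster
    b = min(D[i, labels == k].mean() for k in set(labels) if k != labels[i])
    scores[i] = (b - a) / max(a, b)

manual = scores.mean()
print(manual, silhouette_score(X, labels))
```

The two printed numbers should agree up to floating-point error.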

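What K-Means Really Assumes

K-means implicitly assumes that every cluster follows a Gaussian distribution with the same variance. If the variances differ, the data should instead be modeled with the EM (expectation-maximization) algorithm, for example by fitting a Gaussian mixture.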
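As a rough illustration of that point, the sketch below runs K-means and an EM-fitted Gaussian mixture on blobs with unequal spread. `GaussianMixture` is used here as one possible EM model; it is not taken from the lecture code, and which method scores higher depends on the data and the seeds.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# Four blobs whose standard deviations differ, as in the notes' data2.
X, y = make_blobs(400, n_features=2, centers=4,
                  cluster_std=(1, 2.5, 0.5, 2), random_state=2)

km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# covariance_type='full' lets every component learn its own covariance via EM.
gmm = GaussianMixture(n_components=4, covariance_type='full',
                      n_init=5, random_state=0)
gmm_labels = gmm.fit_predict(X)

km_ari = adjusted_rand_score(y, km_labels)
gmm_ari = adjusted_rand_score(y, gmm_labels)
print('KMeans ARI:', km_ari)
print('GMM ARI:   ', gmm_ari)
```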

Code Walkthrough

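# make_blobs: generates samples drawn from several Gaussian distributions.
# N = 400: generate 400 samples
# centers = 4: generate 4 centers (clusters)
# n_features = 2: use 2-D data because it is easy to plot
# random_state = 2: seeds the random number generator so the same samples are produced on every run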
N = 400
centers = 4
data, y = ds.make_blobs(N, n_features=2, centers=centers, random_state=2)

# cluster_std = (1,2.5,0.5,2): sets the standard deviation of each Gaussian; by default every cluster uses 1.0.
data2, y2 = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=(1,2.5,0.5,2), random_state=2)

# The code below makes the four classes imbalanced
# y == 0: take all; y == 1: take 50; y == 2: take 20; y == 3: take 5
data3 = np.vstack((data[y == 0][:], data[y == 1][:50], data[y == 2][:20], data[y == 3][:5]))
# Build y3 so that its length matches data3
y3 = np.array([0] * 100 + [1] * 50 + [2] * 20 + [3] * 5)

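# Multiplying by a matrix applies a linear transformation to the data (the notes call it a rotation; this particular matrix also stretches and shears)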
m = np.array(((1, 1), (1, 3)))
data_r = data.dot(m)

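# Build the KMeans model. init is the initialization method; n_init=5 runs the algorithm 5 times and keeps the best result (the default was 10 when these notes were written)
model = KMeans(n_clusters=4, init='k-means++', n_init=5)
# Fit the model and predict the cluster labels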
y_pred = model.fit_predict(x)

from sklearn.metrics import homogeneity_score, \
completeness_score, \
v_measure_score, \
adjusted_mutual_info_score, \
adjusted_rand_score, \
silhouette_score
...
...
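# These metrics are all built into sklearn; just call them
# For every metric 1 is the best value. Homogeneity, completeness and V-measure bottom out at 0, while ARI, AMI and the silhouette score can go negative; in practice the silhouette score never reaches 1.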
print('Homogeneity：', homogeneity_score(y, y_pred))
print('completeness：', completeness_score(y, y_pred))
print('V measure：', v_measure_score(y, y_pred))
print('AMI：', adjusted_mutual_info_score(y, y_pred))
print('ARI：', adjusted_rand_score(y, y_pred))
print('Silhouette：', silhouette_score(x, y_pred), '\n')

# x is 2-D data, so it can be drawn as a scatter plot as follows
# The predicted labels y_pred are passed as c (the color value of each point)
# cmap: the color map: r: red, g: green, b: blue, m: magenta
cm = matplotlib.colors.ListedColormap(list('rgbm'))
plt.scatter(x[:, 0], x[:, 1], c=y_pred, s=30, cmap=cm, edgecolors='none')

Code for a 3-D Version of the Cluster Plots

# !/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib.colors
import matplotlib.pyplot as plt
import sklearn.datasets as ds
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score, adjusted_mutual_info_score,\
    adjusted_rand_score, silhouette_score
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D


def expand(a, b):
    d = (b - a) * 0.1
    return a-d, b+d


if __name__ == "__main__":
    N = 400
    centers = 4
    # data, y = ds.make_blobs(N, n_features=2, centers=centers, random_state=2)
    # data2, y2 = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=(1,2.5,0.5,2), random_state=2)
    data, y = ds.make_blobs(N, n_features=3, centers=centers, random_state=2)
    data2, y2 = ds.make_blobs(N, n_features=3, centers=centers, cluster_std=(1,2.5,0.5,2), random_state=2)
    data3 = np.vstack((data[y == 0][:], data[y == 1][:50], data[y == 2][:20], data[y == 3][:5]))
    y3 = np.array([0] * 100 + [1] * 50 + [2] * 20 + [3] * 5)
    # m = np.array(((1, 1), (1, 3)))
    # Multiplying the data by a 3x3 matrix applies a linear transformation: rotation, scaling, reflection, or shear (a translation would need an added offset as well)
    m = np.array(((1, 1, 1), (1, 3, 2), (3, 6, 1)))
    data_r = data.dot(m)

    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False
    cm = matplotlib.colors.ListedColormap(list('rgbm'))
    data_list = data, data, data_r, data_r, data2, data2, data3, data3
    y_list = y, y, y, y, y2, y2, y3, y3
    titles = 'Original data', 'KMeans++ clustering', 'Rotated data', 'KMeans++ on rotated data',\
             'Unequal-variance data', 'KMeans++ on unequal variance', 'Unequal-size data', 'KMeans++ on unequal sizes'

    model = KMeans(n_clusters=4, init='k-means++', n_init=5)
    # plt.figure(figsize=(8, 9), facecolor='w')
    fig = plt.figure(figsize=(8, 9), facecolor='w')
    for i, (x, y, title) in enumerate(zip(data_list, y_list, titles), start=1):
        # plt.subplot(4, 2, i)
        ax = fig.add_subplot(4, 2, i, projection='3d')
        plt.title(title)
        if i % 2 == 1:
            y_pred = y
        else:
            y_pred = model.fit_predict(x)
        print(i)
        print('Homogeneity：', homogeneity_score(y, y_pred))
        print('completeness：', completeness_score(y, y_pred))
        print('V measure：', v_measure_score(y, y_pred))
        print('AMI：', adjusted_mutual_info_score(y, y_pred))
        print('ARI：', adjusted_rand_score(y, y_pred))
        print('Silhouette：', silhouette_score(x, y_pred), '\n')
        # plt.scatter(x[:, 0], x[:, 1], c=y_pred, s=30, cmap=cm, edgecolors='none')
        ax.scatter(x[:, 0], x[:, 1], x[:, 2], s=30, c=y_pred, cmap=cm, edgecolors='none', depthshade=True)
        # x1_min, x2_min = np.min(x, axis=0)
        # x1_max, x2_max = np.max(x, axis=0)
        # x1_min, x1_max = expand(x1_min, x1_max)
        # x2_min, x2_max = expand(x2_min, x2_max)
        # plt.xlim((x1_min, x1_max))
        # plt.ylim((x2_min, x2_max))
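        ax.grid(True, ls=':')    # the b= keyword is rejected by recent Matplotlib releases
    plt.tight_layout(pad=2, rect=(0, 0, 1, 0.97))    # pad must be passed by keyword
    plt.suptitle('Effect of data distribution on KMeans clustering', fontsize=18)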
    plt.show()

The result:

2018-08-17 14_29_31-Figure 1.png

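Q&A

Q: How do you plot four features?
A: Four features are awkward; they cannot be plotted directly.
The usual approach is to plot three of the features, or to run PCA (principal component analysis) first and plot the result.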
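A minimal sketch of the PCA route, assuming synthetic 4-feature blobs: cluster in the original space, then project to two components purely for plotting. The names `X2` and `labels` are illustrative, and the plotting call is left as a comment so the snippet runs without a display.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# 4-feature data cannot be drawn directly.
X, _ = make_blobs(400, n_features=4, centers=4, random_state=2)

# Cluster in the full 4-D space ...
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# ... then project onto the first two principal components for display only.
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print(X2.shape)
print('variance kept:', pca.explained_variance_ratio_.sum())

# With matplotlib available:
# plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=30)
```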

2. Criteria

# !/usr/bin/python
# -*- coding:utf-8 -*-

from sklearn import metrics


if __name__ == "__main__":
    y = [0, 0, 0, 1, 1, 1]
    y_hat = [0, 0, 1, 1, 2, 2]
    h = metrics.homogeneity_score(y, y_hat)
    c = metrics.completeness_score(y, y_hat)
    print('y: {0}, y_hat: {1}'.format(y, y_hat))
    print('同一性(Homogeneity)：', h)
    print('完整性(Completeness)：', c)
    v2 = 2 * c * h / (c + h)
    v = metrics.v_measure_score(y, y_hat)
    print('V-Measure：', v2, v)

    print()
    y = [0, 0, 0, 1, 1, 1]
    y_hat = [0, 0, 1, 3, 3, 3]
    print('y: {0}, y_hat: {1}'.format(y, y_hat))
    h = metrics.homogeneity_score(y, y_hat)
    c = metrics.completeness_score(y, y_hat)
    v = metrics.v_measure_score(y, y_hat)
    print('同一性(Homogeneity)：', h)
    print('完整性(Completeness)：', c)
    print('V-Measure：', v)

    # The label values may differ: only the grouping matters, not the label names
    print()
    y = [0, 0, 0, 1, 1, 1]
    y_hat = [1, 1, 1, 0, 0, 0]
    print('y: {0}, y_hat: {1}'.format(y, y_hat))
    h = metrics.homogeneity_score(y, y_hat)
    c = metrics.completeness_score(y, y_hat)
    v = metrics.v_measure_score(y, y_hat)
    print('同一性(Homogeneity)：', h)
    print('完整性(Completeness)：', c)
    print('V-Measure：', v)

    print()
    y = [0, 0, 1, 1]
    y_hat = [0, 1, 0, 1]
    print('y: {0}, y_hat: {1}'.format(y, y_hat))
    ari = metrics.adjusted_rand_score(y, y_hat)
    print('adjusted_rand_score: {0}'.format(ari))

    print()
    y = [0, 0, 0, 1, 1, 1]
    y_hat = [0, 0, 1, 1, 2, 2]
    print('y: {0}, y_hat: {1}'.format(y, y_hat))
    ari = metrics.adjusted_rand_score(y, y_hat)
    print('adjusted_rand_score: {0}'.format(ari))

Output:

y: [0, 0, 0, 1, 1, 1], y_hat: [0, 0, 1, 1, 2, 2]
同一性(Homogeneity)： 0.6666666666666669
完整性(Completeness)： 0.420619835714305
V-Measure： 0.5158037429793889 0.5158037429793889

y: [0, 0, 0, 1, 1, 1], y_hat: [0, 0, 1, 3, 3, 3]
同一性(Homogeneity)： 1.0
完整性(Completeness)： 0.6853314789615865
V-Measure： 0.8132898335036762

y: [0, 0, 0, 1, 1, 1], y_hat: [1, 1, 1, 0, 0, 0]
同一性(Homogeneity)： 1.0
完整性(Completeness)： 1.0
V-Measure： 1.0

y: [0, 0, 1, 1], y_hat: [0, 1, 0, 1]
adjusted_rand_score: -0.49999999999999994

y: [0, 0, 0, 1, 1, 1], y_hat: [0, 0, 1, 1, 2, 2]
adjusted_rand_score: 0.24242424242424246

Terminology

Homogeneity

homo.png

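This measure is built on the entropy of the classes within each cluster: ideally a cluster contains samples from only one class. The larger h is, the better the clustering.

Completeness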

comple.png

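This measure is built on the entropy of the clusters within each class: ideally all samples of one class fall in the same cluster. The larger c is, the better the clustering.

V-Measure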

vmeasure.png

The harmonic mean of homogeneity and completeness:
2 * homogeneity * completeness / (homogeneity + completeness)
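The three measures can be reproduced from entropies. The sketch below assumes the standard definitions h = 1 - H(C|K)/H(C) and c = 1 - H(K|C)/H(K), where C are the true classes and K the predicted clusters; the helper functions `entropy` and `conditional_entropy` are written for this example and are not part of sklearn.

```python
import numpy as np
from sklearn import metrics

def entropy(labels):
    """Shannon entropy (in nats) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def conditional_entropy(a, b):
    """H(a | b): entropy of a inside each group of b, weighted by group size."""
    a, b = np.asarray(a), np.asarray(b)
    return sum((b == v).mean() * entropy(a[b == v]) for v in np.unique(b))

y = [0, 0, 0, 1, 1, 1]
y_hat = [0, 0, 1, 1, 2, 2]

h = 1 - conditional_entropy(y, y_hat) / entropy(y)      # homogeneity
c = 1 - conditional_entropy(y_hat, y) / entropy(y_hat)  # completeness
v = 2 * h * c / (h + c)                                 # V-measure (harmonic mean)

print(h, metrics.homogeneity_score(y, y_hat))
print(c, metrics.completeness_score(y, y_hat))
print(v, metrics.v_measure_score(y, y_hat))
```

For this pair of label vectors the values match the first block of the output listed earlier (about 0.667, 0.421, and 0.516).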

ARI

ari.png
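The adjusted Rand index can be computed from the contingency table of classes against clusters. The function `ari` below is a hand-written sketch of the usual pair-counting formula, checked against `adjusted_rand_score`; it does not guard the degenerate case where the denominator is zero.

```python
from math import comb
import numpy as np
from sklearn.metrics import adjusted_rand_score

def ari(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: rows are true classes, columns are predicted clusters.
    table = np.array([[np.sum((y_true == c) & (y_pred == k)) for k in clusters]
                      for c in classes])
    # Count pairs of samples that share a cell, a row, or a column.
    sum_cells = sum(comb(int(v), 2) for v in table.ravel())
    sum_rows = sum(comb(int(v), 2) for v in table.sum(axis=1))
    sum_cols = sum(comb(int(v), 2) for v in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(n, 2)   # value expected by chance
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

print(ari([0, 0, 1, 1], [0, 1, 0, 1]))
print(ari([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))
```

Because the chance-level value is subtracted, the index is 0 for a random labeling on average and becomes negative when the agreement is worse than chance, as in the first call.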

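3. Vector Quantization

Code
The code below clusters the image's pixels into 100 clusters. Each cluster center represents a specific RGB color, so we end up with 100 colors.
Each of the 512x512 pixels is then assigned to one of the 100 clusters (taking the value of that cluster's center), and the image is redrawn.
In essence, the original image is repainted with the 100 colors found by clustering.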

# !/usr/bin/python
# -*- coding: utf-8 -*-

from PIL import Image
import numpy as np
from sklearn.cluster import KMeans
import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


def restore_image(cb, cluster, shape):
    row, col, dummy = shape
    image = np.empty((row, col, 3))
    index = 0
    for r in range(row):
        for c in range(col):
            image[r, c] = cb[cluster[index]]
            index += 1
    return image


def show_scatter(a):
    N = 10
    print('Original data:\n', a)
    density, edges = np.histogramdd(a, bins=[N,N,N], range=[(0,1), (0,1), (0,1)])
    # Print numbers in plain decimal form instead of scientific notation
    np.set_printoptions(suppress=True, linewidth=500, edgeitems=N)
    print('Raw density: \n', density)
    print('Raw density.shape: \n', density.shape)
    density /= density.sum()
    print('Normalized density: \n', density)
    x = y = z = np.arange(N)
    d = np.meshgrid(x, y, z)
    print('d = \n', d)
    fig = plt.figure(1, facecolor='w')
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(d[1], d[0], d[2], c='r', s=100*density/density.max(), marker='o', edgecolors='k', depthshade=True)
    ax.set_xlabel('Red component')
    ax.set_ylabel('Green component')
    ax.set_zlabel('Blue component')
    plt.title('3-D frequency distribution of image colors', fontsize=13)

    plt.figure(2, facecolor='w')
    den = density[density > 0]
    print(den.shape)
    den = np.sort(den)[::-1]
    t = np.arange(len(den))
    plt.plot(t, den, 'r-', t, den, 'go', lw=2)
    plt.title('Frequency distribution of image colors', fontsize=13)
    plt.grid(True)

    plt.show()


if __name__ == '__main__':
    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False

    num_vq = 100
    im = Image.open('lena.png')     # son.bmp(100)/flower2.png(200)/son.png(60)/lena.png(50)
    # The image is 512x512 pixels
    image = np.array(im).astype(float) / 255    # np.float was removed in NumPy 1.24; the builtin float is equivalent
    # Keep only the RGB channels: a 512x512 array with three values per pixel
    image = image[:, :, :3]
    # Flatten into a 262144 x 3 matrix, one row per pixel
    image_v = image.reshape((-1, 3))
    show_scatter(image_v)

    N = image_v.shape[0]    # total number of pixels
    print('Total pixels: ', N)
    # Draw enough samples (e.g. 1000) to compute the cluster centers
    idx = np.random.randint(0, N, size=1000)
    print('idx is: \n', idx)
    image_sample = image_v[idx]
    print('image_sample: \n', image_sample)
    model = KMeans(num_vq)
    model.fit(image_sample)
    c = model.predict(image_v)  # cluster assignment of every pixel
    # Each assignment is the index of a cluster center
    print('Cluster assignments:\n', c, 'Number of assignments: \n', len(c))
    print('Cluster centers:\n', model.cluster_centers_, 'Number of centers: \n', len(model.cluster_centers_))
    # How to use the result:
    # to get the center (color) assigned to the first pixel, index the centers with its assignment:
    print('Center assigned to the first pixel: \n', model.cluster_centers_[c[0]])
    plt.figure(figsize=(12, 6), facecolor='w')
    plt.subplot(121)
    plt.axis('off')
    plt.title('Original image', fontsize=14)
    plt.imshow(image)
    # plt.savefig('1.png')

    plt.subplot(122)
    print('image shape: ', image.shape)
    vq_image = restore_image(model.cluster_centers_, c, image.shape)
    plt.axis('off')
    plt.title('Image after vector quantization: %d colors' % num_vq, fontsize=14)
    plt.imshow(vq_image)
    plt.savefig('lena100.png')

    plt.tight_layout(pad=2)    # pad must be passed by keyword in recent Matplotlib releases
    plt.show()

Sample Output

2018-08-17 15_04_04-Figure 2.png

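Red dominates, there is some green, and almost no blue, so the image looks mostly yellowish.
Only a handful of colors account for most of the pixels; the most frequent one covers more than 7%.
The code shows how to use np.histogramdd to build a histogram of an array and turn the counts into proportions.
The top 100 colors cover nearly all pixels, so setting num_vq to 100 gives a much better result.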
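The per-pixel loop in `restore_image` can be replaced by array indexing. The sketch below assumes a random synthetic image in place of lena.png and a smaller palette (`num_vq = 16`) so that it runs quickly; the approach, not the numbers, is the point.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))        # synthetic stand-in for an RGB image scaled to [0, 1]
pixels = image.reshape(-1, 3)          # one row per pixel

num_vq = 16
# Fit the palette on a random subset of pixels, as the lecture code does.
sample = pixels[rng.choice(len(pixels), size=1000, replace=False)]
model = KMeans(n_clusters=num_vq, n_init=10, random_state=0).fit(sample)

codes = model.predict(pixels)          # index of the nearest palette color per pixel
# Indexing the palette with the code array rebuilds the image in one step.
vq_image = model.cluster_centers_[codes].reshape(image.shape)

n_colors = len(np.unique(vq_image.reshape(-1, 3), axis=0))
print(vq_image.shape, n_colors)
```

Storing `codes` plus the small palette, instead of three floats per pixel, is where the compression comes from.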

2018-08-17 15_04_23-18.Clustering.png

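Q&A

Q: If K-means (with two or three hundred clusters) puts about half of the samples into one cluster and very few into the others, how do I judge whether the result is reasonable?
A: It is not reasonable. One exception: China's 56 ethnic groups, where the Han make up the vast majority.
In real work, though, there may be noise.

Q: Can adding more features make the clusters clearer?
A: Yes.

Q: With several hundred features, selecting features is quite a chore.
A: Yes.

Q: In TensorFlow, if I define the output dimension as K, use the squared error between input and output as the loss, and train with gradient descent, is that a workable way to cluster?
A: There is a problem with that. Clustering has no Y, so there are no labels. But you can try the following:
Say you have M vectors of 100 dimensions each. Feed them to TensorFlow, train a CNN or U-Net, and make the output reproduce the original data.
Then take the activations of the second-to-last layer: 20 units for dimensionality reduction, or 1000 units to expand the dimension. The goal is to obtain learned features of the training data.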

2018-09-07 18_14_37-【鄒博_chinahadoop】機器學習升級版VII（七）.png

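Q: If the number of clusters K is unknown in advance and the data is noisy, which algorithm intuitively fits better?
A: Choose density-based DBSCAN or Density Peak, since neither needs K in advance and density-based clustering naturally detects noise: any point that no core object reaches is judged to be noise.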

4. AP Clustering

# !/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as ds
import matplotlib.colors
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import euclidean_distances


if __name__ == "__main__":
    N = 400
    centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
    data, y = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)
    # Use (squared) Euclidean distances
    m = euclidean_distances(data, squared=True)
    preference = -np.median(m)
    print('Preference：', preference)

    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False
    plt.figure(figsize=(12, 9), facecolor='w')
    for i, mul in enumerate(np.linspace(1, 4, 9)):
        print(mul)
        p = mul * preference
        model = AffinityPropagation(affinity='euclidean', preference=p)
        af = model.fit(data)
        center_indices = af.cluster_centers_indices_
        n_clusters = len(center_indices)
        print(('p = %.1f' % mul), p, 'number of clusters:', n_clusters)
        y_hat = af.labels_

        plt.subplot(3, 3, i+1)
        plt.title('Preference: %.2f, clusters: %d' % (p, n_clusters))
        clrs = []
        for c in np.linspace(16711680, 255, n_clusters, dtype=int):
            clrs.append('#%06x' % c)
        # clrs = plt.cm.Spectral(np.linspace(0, 1, n_clusters))
        for k, clr in enumerate(clrs):
            cur = (y_hat == k)
            # Draw the points that belong to this center
            plt.scatter(data[cur, 0], data[cur, 1], s=15, c=clr, edgecolors='none')
            center = data[center_indices[k]]
            for x in data[cur]:
                # Connect each point to its center
                plt.plot([x[0], center[0]], [x[1], center[1]], color=clr, lw=0.5, zorder=1)
        # Draw the centers as stars
        plt.scatter(data[center_indices, 0], data[center_indices, 1], s=80, c=clrs, marker='*', edgecolors='k', zorder=2)
        plt.grid(True, ls=':')    # the b= keyword is rejected by recent Matplotlib releases
    plt.tight_layout()
    plt.suptitle('AP clustering', fontsize=20)
    plt.subplots_adjust(top=0.92)
    plt.show()

Example plot:

2018-09-26 21_00_10-Figure 1.png

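In theory the number of clusters should keep decreasing as the preference grows more negative, yet the figure shows cases with 55 and 107 clusters. The notes guess this is a small sklearn bug; AP failing to converge within its iteration limit is another possible explanation.
In the code, data has shape (400, 2). The Euclidean distances are computed pairwise, giving a 400x400 distance matrix. With a minus sign in front, the values can be read as similarities. np.median(m) takes the median of those 160,000 numbers, and -np.median(m) serves as the initial value of preference.
The code above is also a useful source of plotting techniques.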
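The preference sweep can be tried without plotting. The sketch below assumes a smaller sample (150 points) than the lecture code and adds `max_iter=500` and `random_state=0`, which are choices made for this example; it reports the number of exemplars together with the iterations used, so a run that hits the iteration limit is easy to spot.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import euclidean_distances

centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
data, _ = make_blobs(150, n_features=2, centers=centers,
                     cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)

# Negative median of the squared pairwise distances, as in the notes.
base = -np.median(euclidean_distances(data, squared=True))

results = {}
for mul in (1, 2, 4):
    ap = AffinityPropagation(preference=mul * base, max_iter=500,
                             random_state=0).fit(data)
    # Number of exemplars found, and how many iterations the run took.
    results[mul] = (len(ap.cluster_centers_indices_), ap.n_iter_)
    print('multiplier', mul, 'clusters', results[mul][0], 'iterations', results[mul][1])
```

If a run reports an unexpected cluster count together with an iteration count equal to `max_iter`, non-convergence is the first thing to check.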

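5. MeanShift Clustering

Mean: the average; Shift: to move
Mean shift is a hill-climbing algorithm based on kernel density estimation; it can be used for clustering, image segmentation, tracking, and more.
The code: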

# !/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as ds
import matplotlib.colors
from sklearn.cluster import MeanShift
from sklearn.metrics import euclidean_distances


if __name__ == "__main__":
    N = 1000
    centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
    data, y = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)

    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False
    plt.figure(figsize=(8, 7), facecolor='w')
    m = euclidean_distances(data, squared=True)
    bw = np.median(m)
    print(bw)
    for i, mul in enumerate(np.linspace(0.1, 0.4, 4)):
        band_width = mul * bw
        model = MeanShift(bin_seeding=True, bandwidth=band_width)
        ms = model.fit(data)
        centers = ms.cluster_centers_
        y_hat = ms.labels_
        n_clusters = np.unique(y_hat).size
        print('Bandwidth:', mul, band_width, 'number of clusters:', n_clusters)

        plt.subplot(2, 2, i+1)
        plt.title('Bandwidth: %.2f, clusters: %d' % (band_width, n_clusters))
        clrs = []
        # These colors do not look good; the fixed palette below is used instead
        # for c in np.linspace(16711680, 255, n_clusters, dtype=int):
        #     clrs.append('#%06x' % c)
        # 'rgbm' is unsuitable because it can only color 4 clusters
        # clrs = list('rgbm')
        clrs = plt.cm.Spectral(np.linspace(0, 1, n_clusters))
        for k, clr in enumerate(clrs):
            cur = (y_hat == k)
            # Draw the points around this center
            plt.scatter(data[cur, 0], data[cur, 1], s=10, c=clr, edgecolors='none')
        # Draw the centers
        plt.scatter(centers[:, 0], centers[:, 1], s=200, c=clrs, marker='*', edgecolors='k')
        plt.grid(True, ls=':')    # the b= keyword is rejected by recent Matplotlib releases
    plt.tight_layout(pad=2)    # pad must be passed by keyword
    plt.suptitle('MeanShift clustering', fontsize=15)
    plt.subplots_adjust(top=0.9)
    plt.show()

Result:

2018-09-27 11_22_40-Figure 1.png

MeanShift parameters:

bandwidth
bandwidth : float, optional
Bandwidth used in the RBF kernel.

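RBF: the Gaussian kernel function.
A radial basis function (RBF) is a scalar function that is symmetric along the radial direction. It is usually defined as a monotonic function of the Euclidean distance from any point x to some center xc, written k(||x-xc||), and its effect is typically local: the value is small when x is far from xc. The most common RBF is the Gaussian kernel, k(||x-xc||) = exp{-||x-xc||^2 / (2*σ^2)}, where xc is the kernel center and σ is the width parameter that controls the radial range of the function.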
If not given, the bandwidth is estimated using
sklearn.cluster.estimate_bandwidth; see the documentation for that
function for hints on scalability (see also the Notes, below).

bin_seeding
bin_seeding : boolean, optional
If true, initial kernel locations are not locations of all
points, but rather the location of the discretized version of
points, where points are binned onto a grid whose coarseness
corresponds to the bandwidth. Setting this option to True will speed
up the algorithm because fewer seeds will be initialized.
default value: False
Ignored if seeds argument is not None.

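The figure below shows how the center (the mean) keeps drifting (shifting) toward a sensible position; once it has converged and stops moving, the procedure ends.
The radius of the circle in the figure is the bandwidth.
If the bandwidth is too small, the clusters become too fragmented.
If the bandwidth is too large, the clusters stop being distinct or meaningful.
So choosing a sensible bandwidth is key.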
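The effect of the bandwidth can be checked numerically. This sketch uses `estimate_bandwidth` for the starting value instead of the median of squared distances used in the code above; the `quantile=0.2` setting and the multipliers are assumptions made for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift, estimate_bandwidth

centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
data, _ = make_blobs(300, n_features=2, centers=centers,
                     cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)

# A data-driven starting bandwidth.
bw = estimate_bandwidth(data, quantile=0.2, random_state=0)

counts = {}
for mul in (0.5, 1.0, 2.0):
    ms = MeanShift(bandwidth=mul * bw, bin_seeding=True).fit(data)
    counts[mul] = len(ms.cluster_centers_)
    print('bandwidth %.3f -> %d clusters' % (mul * bw, counts[mul]))
```

Smaller multipliers tend to split the data into more clusters and larger ones merge them, which is the trade-off described above.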

2018-09-27 11_38_35-【鄒博_chinahadoop】機器學習升級版VII（七）.png

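Q&A

Q: What is Preference?
A: It is the value in the AP algorithm that expresses how willing a sample initially is to serve as a cluster center. The algorithm tracks two things: how strongly a sample that wants to be a center attracts other samples, and how much a sample that does not want to be a center depends on a given center.
Q: Why does AP need initial cluster centers?
A: It does not. What we supply is a starting value for each sample, from which its initial a and r values follow; here that value is taken from the median.
Q: How do you choose the distance metric for k-means in sklearn, for example cosine similarity?
A: The KMeans constructor has a precompute_distances parameter whose default is auto. Note that it only controls whether distances are precomputed, not which metric is used: sklearn's KMeans works with Euclidean distance. The parameter was later deprecated and removed from scikit-learn.
The official description: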

precompute_distances : {'auto', True, False}
        Precompute distances (faster but takes more memory).
        'auto' : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.
        True : always precompute distances
        False : never precompute distances
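Since KMeans in sklearn is tied to Euclidean distance, one common workaround for cosine similarity is to L2-normalize every sample first: for unit vectors the squared Euclidean distance equals 2 - 2*cos, so the two orderings agree. The sketch below is an approximation of cosine-based k-means built on that identity, using synthetic data; it is not a dedicated spherical k-means implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Two groups that point in different directions but vary widely in length.
a = rng.normal([1, 0], 0.05, size=(50, 2)) * rng.uniform(1, 10, size=(50, 1))
b = rng.normal([0, 1], 0.05, size=(50, 2)) * rng.uniform(1, 10, size=(50, 1))
X = np.vstack([a, b])

# Project every row onto the unit circle so that only direction matters.
X_unit = normalize(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_unit)
print(labels[:5], labels[-5:])
```

After normalization the vector lengths no longer influence the assignment, so the two directional groups separate cleanly.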

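Q: For make_blobs in sklearn.datasets, is the returned y a label for the centers? What does it contain?
A: The returned y takes the values 0, 1, 2, 3, ..., k-1 and tells which cluster each row of data belongs to.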

6. DBSCAN Density-Based Clustering

Code:

# !/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as ds
import matplotlib.colors
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


def expand(a, b):
    d = (b - a) * 0.1
    return a-d, b+d


if __name__ == "__main__":
    N = 1000
    centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
    data, y = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)
    data = StandardScaler().fit_transform(data)
    # Parameters for dataset 1: (epsilon, min_samples)
    params = ((0.2, 5), (0.2, 10), (0.2, 15), (0.3, 5), (0.3, 10), (0.3, 15))

    # Dataset 2
    # t = np.arange(0, 2*np.pi, 0.1)
    # data1 = np.vstack((np.cos(t), np.sin(t))).T
    # data2 = np.vstack((2*np.cos(t), 2*np.sin(t))).T
    # data3 = np.vstack((3*np.cos(t), 3*np.sin(t))).T
    # data = np.vstack((data1, data2, data3))
    # # # Parameters for dataset 2: (epsilon, min_samples)
    # params = ((0.5, 3), (0.5, 5), (0.5, 10), (1., 3), (1., 10), (1., 20))

    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False

    plt.figure(figsize=(9, 7), facecolor='w')
    plt.suptitle('DBSCAN clustering', fontsize=15)

    for i in range(6):
        eps, min_samples = params[i]
        model = DBSCAN(eps=eps, min_samples=min_samples)
        model.fit(data)
        y_hat = model.labels_

        core_indices = np.zeros_like(y_hat, dtype=bool)
        core_indices[model.core_sample_indices_] = True
        # A value of -1 in y_hat marks noise.
        y_unique = np.unique(y_hat)
        n_clusters = y_unique.size - (1 if -1 in y_hat else 0)
        print(y_unique, 'number of clusters:', n_clusters)

        # clrs = []
        # for c in np.linspace(16711680, 255, y_unique.size):
        #     clrs.append('#%06x' % c)
        plt.subplot(2, 3, i+1)
        clrs = plt.cm.Spectral(np.linspace(0, 0.8, y_unique.size))
        print(clrs)
        for k, clr in zip(y_unique, clrs):
            cur = (y_hat == k)
            if k == -1:
                # Draw noise points in black
                plt.scatter(data[cur, 0], data[cur, 1], s=10, c='k')
                continue
            plt.scatter(data[cur, 0], data[cur, 1], s=15, c=clr, edgecolors='k')
            plt.scatter(data[cur & core_indices][:, 0], data[cur & core_indices][:, 1], s=30, c=clr, marker='o', edgecolors='k')
        x1_min, x2_min = np.min(data, axis=0)
        x1_max, x2_max = np.max(data, axis=0)
        x1_min, x1_max = expand(x1_min, x1_max)
        x2_min, x2_max = expand(x2_min, x2_max)
        plt.xlim((x1_min, x1_max))
        plt.ylim((x2_min, x2_max))
        plt.plot()
        plt.grid(True, ls=':', color='#606060')    # the b= keyword is rejected by recent Matplotlib releases
        plt.title(r'$\epsilon$ = %.1f  m = %d, clusters: %d' % (eps, min_samples, n_clusters), fontsize=12)
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()

Result:

2018-09-27 16_28_20-Figure 1.png

Using dataset 2 (uncomment the dataset 2 block in the code) gives the clustering of ring-shaped data:

2018-09-27 16_44_53-Figure 1.png

As a rule, when epsilon is larger, m should be raised as well: a larger epsilon means a larger radius, so each neighborhood contains more points.
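A minimal sketch of how DBSCAN reports noise, assuming two tight synthetic blobs and one far-away point. The `eps` and `min_samples` values are picked for this toy data and are not taken from the lecture.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob1 = rng.normal([0, 0], 0.1, size=(50, 2))
blob2 = rng.normal([3, 3], 0.1, size=(50, 2))
outlier = np.array([[10.0, -10.0]])          # an isolated point, far from both blobs
data = np.vstack([blob1, blob2, outlier])

model = DBSCAN(eps=0.3, min_samples=5).fit(data)
labels = model.labels_

# The label -1 marks noise, so it is excluded from the cluster count.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('clusters:', n_clusters)
print('label of the isolated point:', labels[-1])
print('core samples:', len(model.core_sample_indices_))
```

Note that K is never specified: the number of clusters falls out of the density parameters, and the isolated point is left unassigned.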

7. HDBSCAN Clustering

About HDBSCAN:
One way to think of it: HDBSCAN is a clustering algorithm that unifies density-based clustering and hierarchical clustering.

HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.
In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning -- and the primary parameter, minimum cluster size, is intuitive and easy to select.
HDBSCAN is ideal for exploratory data analysis; it's a fast and robust algorithm that you can trust to return meaningful clusters (if there are any).

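A more detailed introduction: How HDBSCAN Works

hdbscan was not part of scikit-learn when these notes were written (scikit-learn 1.3 and later include sklearn.cluster.HDBSCAN), so the package has to be installed separately: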

pip install hdbscan

As the code shows, there is no epsilon parameter any more, and the clustering results become much more reasonable.
Code:

# !/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as ds
import matplotlib.colors
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import hdbscan


def expand(a, b):
    d = (b - a) * 0.1
    return a-d, b+d


if __name__ == "__main__":
    N = 1000
    centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
    data, y = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)
    data = StandardScaler().fit_transform(data)
    # Parameters for dataset 1: (epsilon, min_samples); HDBSCAN only uses min_samples
    params = ((0.2, 5), (0.2, 10), (0.2, 15), (0.3, 5), (0.3, 10), (0.3, 15))

    # Dataset 2
    # t = np.arange(0, 2*np.pi, 0.1)
    # data1 = np.vstack((np.cos(t), np.sin(t))).T
    # data2 = np.vstack((2*np.cos(t), 2*np.sin(t))).T
    # data3 = np.vstack((3*np.cos(t), 3*np.sin(t))).T
    # data = np.vstack((data1, data2, data3))
    # # # Parameters for dataset 2: (epsilon, min_samples)
    # params = ((0.5, 3), (0.5, 5), (0.5, 10), (1., 3), (1., 10), (1., 20))

    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False

    plt.figure(figsize=(12, 8), facecolor='w')
    plt.suptitle('HDBSCAN clustering', fontsize=16)

    for i in range(6):
        eps, min_samples = params[i]
        model = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=min_samples)
        model.fit(data)
        y_hat = model.labels_

        core_indices = np.zeros_like(y_hat, dtype=bool)
        core_indices[y_hat != -1] = True

        y_unique = np.unique(y_hat)
        n_clusters = y_unique.size - (1 if -1 in y_hat else 0)
        print(y_unique, 'number of clusters:', n_clusters)

        # clrs = []
        # for c in np.linspace(16711680, 255, y_unique.size):
        #     clrs.append('#%06x' % c)
        plt.subplot(2, 3, i+1)
        clrs = plt.cm.Spectral(np.linspace(0, 0.8, y_unique.size))
        for k, clr in zip(y_unique, clrs):
            cur = (y_hat == k)
            if k == -1:
                plt.scatter(data[cur, 0], data[cur, 1], s=20, c='k')
                continue
            plt.scatter(data[cur, 0], data[cur, 1], s=60*model.probabilities_[cur], marker='o', c=clr, edgecolors='k', alpha=0.9)
            plt.scatter(data[cur & core_indices][:, 0], data[cur & core_indices][:, 1], s=60, c=clr, marker='o', edgecolors='k')
        x1_min, x2_min = np.min(data, axis=0)
        x1_max, x2_max = np.max(data, axis=0)
        x1_min, x1_max = expand(x1_min, x1_max)
        x2_min, x2_max = expand(x2_min, x2_max)
        plt.xlim((x1_min, x1_max))
        plt.ylim((x2_min, x2_max))
        plt.grid(True, ls=':', color='#808080')    # the b= keyword is rejected by recent Matplotlib releases
        plt.title(r'$\epsilon$ = %.1f  m = %d, clusters: %d' % (eps, min_samples, n_clusters), fontsize=13)
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()

Resulting plots.
Plot for dataset 1: every panel finds 4 clusters, and no parameter setting collapses to an unreasonable 1~2 clusters:

2018-09-27 17_06_49-Figure 1.png

Plot for dataset 2: the number of clusters is 2~3, and the unreasonable single-cluster result does not occur:

2018-09-27 17_10_55-Start.png

8. Spectral Clustering

Like density-based clustering, spectral clustering can handle irregularly shaped data distributions, such as ring-shaped data.
The implementation is in: sklearn.cluster.SpectralClustering
Code:

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as ds
import matplotlib.colors
from sklearn.cluster import SpectralClustering
from sklearn.metrics import euclidean_distances


def expand(a, b):
    d = (b - a) * 0.1
    return a-d, b+d


if __name__ == "__main__":
    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['axes.unicode_minus'] = False

    t = np.arange(0, 2*np.pi, 0.1)
    data1 = np.vstack((np.cos(t), np.sin(t))).T
    data2 = np.vstack((2*np.cos(t), 2*np.sin(t))).T
    data3 = np.vstack((3*np.cos(t), 3*np.sin(t))).T
    data = np.vstack((data1, data2, data3))

    n_clusters = 3
    # Pairwise squared Euclidean distances between all points
    m = euclidean_distances(data, squared=True)

    plt.figure(figsize=(12, 8), facecolor='w')
    plt.suptitle('Spectral clustering', fontsize=16)
    clrs = plt.cm.Spectral(np.linspace(0, 0.8, n_clusters))
    for i, s in enumerate(np.logspace(-2, 0, 6)):
        print(s)
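        # Gaussian-style affinity. m already holds squared distances, so the
        # exponent is d^4 / s^2; the 1e-6 keeps every affinity strictly positive.
        af = np.exp(-m ** 2 / (s ** 2)) + 1e-6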
        model = SpectralClustering(n_clusters=n_clusters, affinity='precomputed', assign_labels='kmeans', random_state=1)
        y_hat = model.fit_predict(af)
        plt.subplot(2, 3, i+1)
        for k, clr in enumerate(clrs):
            cur = (y_hat == k)
            plt.scatter(data[cur, 0], data[cur, 1], s=40, c=clr, edgecolors='k')
        x1_min, x2_min = np.min(data, axis=0)
        x1_max, x2_max = np.max(data, axis=0)
        x1_min, x1_max = expand(x1_min, x1_max)
        x2_min, x2_max = expand(x2_min, x2_max)
        plt.xlim((x1_min, x1_max))
        plt.ylim((x2_min, x2_max))
        plt.grid(True, ls=':', color='#808080')
        plt.title(r'$\sigma$ = %.2f' % s, fontsize=13)
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()
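
The SpectralClustering call in the listing hides the actual steps. As a rough illustration only, the sketch below spells them out with plain NumPy for the simplest case of two rings: build an affinity matrix, form the normalized graph Laplacian, and split the points by the sign of its second eigenvector. This is a simplified stand-in, not what scikit-learn does internally: it assumes the standard Gaussian affinity exp(-d^2 / (2σ^2)) rather than the formula in the listing, it handles only two clusters, and it replaces the final K-Means step with a sign test. The names two_rings and spectral_bipartition are made up for this example.

```python
import numpy as np


def two_rings(step=0.1):
    """Two concentric rings (radius 1 and 2), same construction as the listing."""
    t = np.arange(0, 2 * np.pi, step)
    ring1 = np.vstack((np.cos(t), np.sin(t))).T
    ring2 = 2 * ring1
    return np.vstack((ring1, ring2)), t.size


def spectral_bipartition(data, sigma=0.2):
    """Split data into two groups using the second Laplacian eigenvector."""
    # Pairwise squared Euclidean distances via broadcasting.
    diff = data[:, None, :] - data[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Assumed affinity: standard Gaussian kernel, no self-loops.
    w = np.exp(-sq_dist / (2 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=1))
    lap = np.eye(len(data)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; column 1 is the
    # eigenvector of the second-smallest eigenvalue (the Fiedler vector).
    _, eigvecs = np.linalg.eigh(lap)
    return (eigvecs[:, 1] > 0).astype(int)


if __name__ == "__main__":
    data, n = two_rings()
    labels = spectral_bipartition(data)
    print('inner ring labels:', np.unique(labels[:n]))
    print('outer ring labels:', np.unique(labels[n:]))
```

With a small σ the two rings are almost disconnected in the affinity graph, so each ring ends up with a single label. Which ring gets 0 and which gets 1 depends on the sign of the eigenvector and is arbitrary.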

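The plots from the SpectralClustering listing show that once σ is large enough, e.g. 1.0, the Gaussian kernel becomes flatter and flatter, so the affinity barely decays with distance.
The features that are finally selected may then lose most of their discriminating power, and spectral clustering degenerates into something close to K-Means clustering.
Plot: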

2018-09-27 17_19_20-Start.png
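
The flattening effect can also be checked numerically. The sketch below reuses the affinity formula from the listing (including its 1e-6 offset) and compares the affinity of two neighbouring points on the inner ring with the affinity of two points at the same angle on adjacent rings. The helper names gaussian_affinity and affinity_contrast are hypothetical and exist only for this illustration.

```python
import numpy as np


def gaussian_affinity(sq_dist, sigma):
    """Affinity as written in the listing: exp(-(d^2)^2 / sigma^2) + 1e-6."""
    return np.exp(-sq_dist ** 2 / sigma ** 2) + 1e-6


def affinity_contrast(sigma, step=0.1):
    """Same-ring neighbour affinity divided by adjacent-ring affinity."""
    t = np.arange(0, 2 * np.pi, step)
    ring1 = np.vstack((np.cos(t), np.sin(t))).T
    ring2 = 2 * ring1
    near = np.sum((ring1[0] - ring1[1]) ** 2)  # neighbours on the inner ring
    far = np.sum((ring1[0] - ring2[0]) ** 2)   # same angle, next ring out
    return gaussian_affinity(near, sigma) / gaussian_affinity(far, sigma)


if __name__ == "__main__":
    for sigma in np.logspace(-2, 0, 6):
        print('sigma = %.2f  contrast = %.3g' % (sigma, affinity_contrast(sigma)))
```

For small σ, neighbours on the same ring are several orders of magnitude more similar than points on different rings. At σ = 1.0 the ratio drops to roughly e (about 2.7), so the affinity matrix carries little ring structure.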

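Class Q&A

Q: When tuning a clustering algorithm, do the results need to be visualized to judge the effect?
A: Yes, it is needed in practice; a visualization makes the effect of the parameters relatively intuitive.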

Last edited: 2018.09.27 17:30:10

