機器學習經(jīng)典算法 - 聚類算法

K-Means Clustering

作為非監(jiān)督學習的典型代表浙踢，K-Means 可以通過對于數(shù)據(jù)點的聚類來完成分類，其計算過程如下：

按照設(shè)定的 K 值隨機初始化 K 個聚類的中心點
將所有的數(shù)據(jù)點根據(jù)離這 K 個中心點的距離進行分類
計算每一個分類中的數(shù)據(jù)點的均值纷捞，并將這 K 個均值點作為新的中心點，重新進行聚類
上述聚類過程循環(huán)多次，算法最終會按照設(shè)定的收斂條件收斂到指定的 K 個中心點以完成聚類

在計算機相關(guān)應(yīng)用中悉尾，可以利用 K-Means 來完成圖像的分割，并在此基礎(chǔ)上對分割后的圖像進行操作：

criteria = (cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS, 100, 0.2)

# then perform K-Means clustering
# the number 10 after the criteria is asking the algorithm to try 10 different initial centers
k = 3
retval, labels, centers = cv2.kmeans(pixel_vals, k, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)

# convert data into 8-bit values
centers = np.uint8(centers)
segmented_data = centers[labels.flatten()]

# reshape data into the original image dimensions
segmented_image = segmented_data.reshape((image.shape))

# display the image with colors according to each pixel's class centers intensity
plt.imshow(segmented_data)

層次聚類 Hierarchical Clustering

K-Means 算法在預(yù)先知道分類的數(shù)量 K 時挫酿，使用起來一般更加有效构眯，但由于 K-Means 算法中類的中心的定義方式使得其更加傾向于基于圓形、球形或超球面進行分類早龟，例如針對下面這個明顯的月牙形數(shù)據(jù)分布的分類來說惫霸，K-Means 就無法做出合適的分類。

K-means would fail in tasks like this

此時葱弟，如果我們對于評價是否歸屬一個分類的距離有一個大致的概念時壹店，可以選擇層次聚類的方式：層次聚類如其名稱所述，通過衡量兩個特征點之間的距離翘悉，分層級的不斷擴大歸屬于同一個聚類的點的數(shù)量茫打，最終當算法計算出來的分類取得設(shè)定值時停止聚類。根據(jù)層次聚類中距離的定義方式妖混，其又可以進一步分解為多種不同的算法老赤。

在 Scikit-Learn 的聚類算法中層次聚類的實施算法采用的是 Agglomerative clustering，其采用從底向上的方式進行聚類制市，算法開始時會假定每一個輸入數(shù)據(jù)都屬于一個單獨的分類抬旺，在此基礎(chǔ)上根據(jù)選定的判斷標準逐漸將相同的類別進行融合。

The Agglomerative Clustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together. The linkage criteria determines the metric used for the merge strategy:

Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.

Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.

Average linkage minimizes the average of the distances between all observations of pairs of clusters.

在 Scikit-Learn 中使用層次聚類的代碼如下：

from sklearn import cluster
# Specify the parameter for linkage method, 'ward' is default, 
# and can also use 'complete' and 'average'
cluster = cluster.AgglomerativeClustering(n_clusters=3, linkage='ward')
# labels will be an array with the same length of the data which represents which cluster each point belongs to.
labels = cluster.fit_predict(X)

層次聚類的較為突出的缺點：

對于異常值敏感祥楣，因此應(yīng)該盡量確保數(shù)據(jù)的質(zhì)量
計算復(fù)雜度較高开财，O(n²)

DBSCAN

DBSCAN 算法的全稱是 Density Based Spatial Clustering Applications with Noise汉柒，這個聚類算法最終不會將所有的數(shù)據(jù)劃歸到某一個分類當中，僅會對足夠緊湊的數(shù)據(jù)集進行聚類责鳍，而將其他零散數(shù)據(jù)視作異常值碾褂，因此對于數(shù)據(jù)噪聲不再敏感。

在 Scikit-Learn 中使用 DBSCAN 聚類的代碼如下：

from sklearn import cluster
# Specify the parameter for distance and minimum number of points in a cluster
dbscan = cluster.DBSCAN(eps=5, min_samples=5)
# labels will be an array with the same length of the data which represents which cluster each point belongs to.
dbscan.fit(X)
# The result can be found via `dbscan.labels_` with -1 being noise

在 DBSCAN 中無需事先指定分類的數(shù)量历葛，并且對于異常數(shù)據(jù)不敏感正塌，其主要缺點為：

在分類過程中假設(shè)位于邊界的點與兩個不同分類的距離相同，那么算法會采取先到先得的方式恤溶，因而可能無法準確分類邊界點

Gaussian Mixture Model Clustering

高斯混合模型聚類假定同一類中的數(shù)據(jù)點服從同一個概率（高斯）分布乓诽，在兩個類的重合部分，會分別計算樣本點屬于其中某一個類的概率咒程，進而將其劃歸到概率更高的類中鸠天。

算法實現(xiàn)的步驟如下：

指定潛在的分類的數(shù)量 K，進而建立 K 個高斯分布帐姻，也即指定每一個分布的均值和方差稠集，在實際實現(xiàn)過程中可以先對數(shù)據(jù)集執(zhí)行 K-Means 聚類，在聚類的基礎(chǔ)上再計算相應(yīng)的均值和方差
按照當前的分布的對數(shù)據(jù)進行分類
以最大化期望為目標饥瓷，在前述分類的基礎(chǔ)上對于之前建立的多個分布進行評估
評估算法的收斂性巍杈，如果未能收斂則從第二步開始執(zhí)行

通過 Scikit-Learn 執(zhí)行高斯混合模型聚類的步驟如下：

from sklearn import mixture

# Specify the parameters for the clustering
gmm = mixture.GaussianMixture(n_components=3)

# Fit the model
gmm.fit(X)
# clustering contains an array representing class labels each data belongs to
clustering = gmm.predict(X)

聚類的結(jié)果評估

在聚類完成后，我們還需要通過一些客觀的指標對于聚類的結(jié)果進行評價扛伍，其主要的評價指標的來源如下：

數(shù)據(jù)的外部標注
數(shù)據(jù)的內(nèi)部評價指標：同一類的聚合程度 Compactness 和類間的分隔程度 Seperability

具體的計算過程可以參考 Scikit-Learn 的相應(yīng)文檔部分。

參考閱讀

Clustering in Scikit-Learn