TaoYan
Using k-means
Packages required for clustering:
- factoextra
- cluster
Load the packages
library(factoextra)
library(cluster)
Data preparation
Use the built-in R dataset USArrests
#load the dataset
data("USArrests")
# Remove any missing values (i.e., NA, "not available")
# that might be present in the data
USArrests <- na.omit(USArrests)
# View the first 6 rows of the data
head(USArrests, n=6)
In this dataset, columns are variables and rows are observations.
Before clustering, it is worth running some basic checks on the data, namely descriptive statistics such as the mean and standard deviation.
desc_stats <- data.frame(
  Min  = apply(USArrests, 2, min),     # minimum
  Med  = apply(USArrests, 2, median),  # median
  Mean = apply(USArrests, 2, mean),    # mean
  SD   = apply(USArrests, 2, sd),      # standard deviation
  Max  = apply(USArrests, 2, max)      # maximum
)
desc_stats <- round(desc_stats, 1)  # keep one decimal place
head(desc_stats)
The variables differ widely in mean and variance, so they need to be standardized before clustering.
df <- scale(USArrests)
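As a quick check of what the standardization does, the sketch below verifies that each column of the scaled matrix has mean 0 and standard deviation 1, and rebuilds the same result by hand. The name manual is introduced here only for illustration.

```r
# Standardize the built-in USArrests data
df <- scale(USArrests)

# After scaling, every column has mean ~0 and standard deviation 1
col_means <- colMeans(df)
col_sds <- apply(df, 2, sd)
print(round(col_means, 10))
print(col_sds)

# The same result by hand: subtract column means, divide by column SDs
centered <- sweep(as.matrix(USArrests), 2, colMeans(USArrests), "-")
manual <- sweep(centered, 2, apply(USArrests, 2, sd), "/")
print(all.equal(manual, df, check.attributes = FALSE))
```

Without this step, Assault (measured in the hundreds) would dominate the Euclidean distances used by k-means.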
Assessing clustering tendency
Use get_clust_tendency() to compute the Hopkins statistic
res <- get_clust_tendency(df, 40, graph = TRUE)
res$hopkins_stat
## [1] 0.3440875
#Visualize the dissimilarity matrix
res$plot
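The Hopkins statistic is below 0.5, which, under the convention used by the factoextra version in this analysis, indicates that the data are highly clusterable. The plot of the dissimilarity matrix also suggests that the data are clusterable. Later factoextra releases appear to use the opposite convention (values close to 1 indicate clusterable data), so check the documentation of the installed version before interpreting the value.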
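To make the idea behind the statistic concrete, here is a minimal base-R sketch of one common form of the Hopkins statistic. The function hopkins_sketch is hypothetical and is not factoextra's implementation; it assumes the convention in which values below 0.5 suggest clusterable data, and its result will not match the value reported above exactly because the random sampling differs.

```r
# Hypothetical helper: one common form of the Hopkins statistic
hopkins_sketch <- function(x, n, seed = 123) {
  set.seed(seed)
  x <- as.matrix(x)
  # n real observations sampled without replacement
  idx <- sample(nrow(x), n)
  # n artificial points drawn uniformly within each variable's range
  rand <- apply(x, 2, function(col) runif(n, min(col), max(col)))
  # Euclidean distance from a point p to every row of x
  dist_to_rows <- function(p) sqrt(rowSums(sweep(x, 2, p, "-")^2))
  # w: distance from each sampled real point to its nearest other real point
  w <- sapply(idx, function(i) min(dist_to_rows(x[i, ])[-i]))
  # u: distance from each artificial point to its nearest real point
  u <- apply(rand, 1, function(p) min(dist_to_rows(p)))
  # Real points that sit close together make w small, pushing the ratio below 0.5
  sum(w) / (sum(u) + sum(w))
}

h <- hopkins_sketch(scale(USArrests), n = 40)
print(h)
```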
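Estimating the number of clusters
Because k-means clustering requires the number of clusters to be specified in advance, we use the function clusGap() to compute the gap statistic, which is used to estimate the optimal number of clusters. The function fviz_gap_stat() visualizes the result.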
set.seed(123)
## Compute the gap statistic
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 500)
# Plot the result
fviz_gap_stat(gap_stat)
The plot shows that the optimal solution is four clusters (k = 4).
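The number of clusters can also be read off numerically instead of from the plot. The sketch below applies cluster::maxSE() to the gap table returned by clusGap(). It assumes the "firstSEmax" selection rule and uses a smaller B than above to keep the run short, so the selected k may differ from the one shown in the plot.

```r
library(cluster)

set.seed(123)
df <- scale(USArrests)

# Smaller B than in the main analysis, only to keep this sketch fast
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 50)

# Tab holds logW, E.logW, gap and SE.sim for k = 1..K.max
tab <- gap_stat$Tab
print(round(tab, 3))

# Pick k with the "firstSEmax" rule (an assumption; other rules exist)
k_hat <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "firstSEmax")
print(k_hat)
```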
Performing the clustering
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)
head(km.res$cluster, 20)
# Visualize clusters using factoextra
fviz_cluster(km.res, USArrests)
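Beyond the plot, the fitted object can be inspected directly. This sketch lists the cluster sizes and summarizes each cluster by the mean of the original, unscaled variables, which is often easier to interpret than the standardized centers. The name cluster_means is introduced here for illustration.

```r
set.seed(123)
df <- scale(USArrests)
km.res <- kmeans(df, 4, nstart = 25)

# Number of observations in each cluster
print(km.res$size)

# Cluster centers on the standardized scale
print(round(km.res$centers, 2))

# Mean of each original variable within each cluster
cluster_means <- aggregate(USArrests,
                           by = list(cluster = km.res$cluster),
                           FUN = mean)
print(cluster_means)
```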
Checking the cluster silhouette plot
Recall that the silhouette width S_i measures how similar an object i is to the other objects in its own cluster compared with those in the neighboring cluster. S_i values range from 1 to -1:
- A value of S_i close to 1 indicates that the object is well clustered. In other words, object i is similar to the other objects in its group.
- A value of S_i close to -1 indicates that the object is poorly clustered, and that assigning it to some other cluster would probably improve the overall result.
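The definition can be checked by hand for a single observation. With a_i the mean distance from object i to the other members of its own cluster and b_i the smallest mean distance from i to the members of any other cluster, the silhouette width is S_i = (b_i - a_i) / max(a_i, b_i). The sketch below computes this for the first observation and compares it with the output of silhouette(). It assumes that observation's cluster has more than one member.

```r
library(cluster)

set.seed(123)
df <- scale(USArrests)
cl <- kmeans(df, 4, nstart = 25)$cluster
d <- as.matrix(dist(df))

i <- 1  # first observation (Alabama)

# a_i: mean distance to the other members of i's own cluster
own <- setdiff(which(cl == cl[i]), i)
a_i <- mean(d[i, own])

# b_i: smallest mean distance to the members of any other cluster
others <- setdiff(unique(cl), cl[i])
b_i <- min(sapply(others, function(k) mean(d[i, cl == k])))

s_i <- (b_i - a_i) / max(a_i, b_i)

# Compare with the value computed by cluster::silhouette()
sil_check <- silhouette(cl, dist(df))
print(c(manual = s_i, package = unname(sil_check[i, "sil_width"])))
```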
sil <- silhouette(km.res$cluster, dist(df))
rownames(sil) <- rownames(USArrests)
head(sil[, 1:3])
#Visualize
fviz_silhouette(sil)
The plot shows that some silhouette widths are negative. The output of silhouette() can be used to find which observations these are:
neg_sil_index <- which(sil[, "sil_width"] < 0)
sil[neg_sil_index, , drop = FALSE]
##          cluster neighbor   sil_width
## Missouri       3        2 -0.07318144
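eclust(): enhanced cluster analysis
Compared with other cluster analysis functions, eclust() has the following advantages:
- It simplifies the cluster analysis workflow
- It can compute both hierarchical and partitioning clustering
- It automatically estimates the optimal number of clusters
- It automatically provides a silhouette plot
- It works with ggplot2 to draw elegant graphics
K-means clustering with eclust()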
# Compute k-means
res.km <- eclust(df, "kmeans")
# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)
# Silhouette plot
fviz_silhouette(res.km)
##   cluster size ave.sil.width
## 1       1   13          0.31
## 2       2   29          0.38
## 3       3    8          0.39
Hierarchical clustering with eclust()
# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust") # compute hclust
fviz_dend(res.hc, rect = TRUE) # dendrogram
# The following R code generates the silhouette plot and the scatter plot for the hierarchical clustering.
fviz_silhouette(res.hc) # silhouette plot
##   cluster size ave.sil.width
## 1       1   19          0.26
## 2       2   19          0.28
## 3       3   12          0.43
fviz_cluster(res.hc) # scatter plot
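For comparison, a similar hierarchical clustering can be sketched with base R alone. The choices below (Euclidean distance, Ward linkage via "ward.D2", and k = 3 to match the three clusters reported above) are assumptions made for illustration; they are not taken from the eclust() call, so the cluster sizes may differ from those shown.

```r
df <- scale(USArrests)

# Agglomerative clustering on Euclidean distances with Ward linkage
hc <- hclust(dist(df), method = "ward.D2")

# Cut the tree into 3 groups (assumed, to match the output above)
grp <- cutree(hc, k = 3)

# Cluster sizes and the first few assignments
print(table(grp))
print(head(grp, 10))
```

Calling plot(hc) draws the plain base-R dendrogram for the same tree.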
Infos
This analysis has been performed using R software (R version 3.3.2)