KMeans Clustering
Among clustering algorithms, the best known is probably k-means (KMeans): almost every data mining / machine learning book covers it, and some beginners even confuse it with KNN. K-means is a clustering algorithm and belongs to unsupervised learning, whereas KNN is a supervised learning / classification method.
Clustering: as the name suggests, it means grouping similar things together so that they form a cluster. This raises several questions:
1. How do we represent an object? Usually we prepare a dataset in which each row is one data point and each column is one feature of that point. In the classic iris dataset, for example, each row represents one flower, and each flower has 4 feature columns: sepal length, sepal width, petal length, and petal width.
2. How do we measure the distance between objects? Once we have the feature values of each data point, we compute the distance between two points in a way that fits the problem; common choices are Euclidean distance, Mahalanobis distance, and cosine distance (a small sketch of points 1 and 2 follows this list).
3. What procedure do we follow to cluster? This is where the concrete algorithm comes in. The popular clustering approaches fall roughly into a few families: partition-based, hierarchical, density-based, and grid-based. K-means belongs to the partition-based family; representative algorithms of the other families will be covered in later posts.
4. How do we know when the clustering is finished? When the membership of every cluster becomes stable, the clustering is done.
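As a minimal sketch of points 1 and 2 (assuming scikit-learn is available; other metrics such as Mahalanobis or cosine distance would simply replace the formula at the end):
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                       # shape (150, 4): 150 flowers, 4 features each
print(X.shape, iris.feature_names)  # sepal/petal length and width

# Euclidean distance between the first two flowers
d = np.sqrt(np.sum((X[0] - X[1]) ** 2))
print(d)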
The KMeans Procedure
The figure above is taken from another blog (reference 1) and explains the process quite clearly. In short: pick k points as the initial cluster centers; assign every data point to its nearest center; recompute each cluster's center as the mean of its members; and repeat the last two steps until the centers stop changing.
Here is the code:
#!/usr/bin/python
# -*- coding:utf-8 -*-
"""
Author LiHao
Time 2018/11/26 9:21
"""
import numpy as np
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Euclidean distance helper. The original post imported this as
# `from ml_learn.algorithm.distance import eculide` (the author's own module);
# it is defined inline here so the listing runs on its own.
def eculide(x, y):
    return np.sqrt(np.sum((np.array(y) - np.array(x)) ** 2, axis=-1))

def load_data():
    """
    Load the standard iris dataset.
    :return:
    """
    iris = load_iris()
    data = iris.data
    target = iris.target
    target_names = iris.target_names
    return data, target, target_names

class Group(object):
    """
    A cluster (group) of samples -- also reused in later posts.
    """
    def __init__(self):
        self._name = ""
        self._no = None
        self._members = []
        self._center = None

    @property
    def no(self):
        return self._no

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, no):
        self._no = no
        self._name = "G" + str(self._no)

    @property
    def members(self):
        return self._members

    @members.setter
    def members(self, member):
        if member is None:
            raise TypeError("member is None,please set value")
        if isinstance(member, list):
            self._members.extend(member)
            return
        self._members.append(member)

    def clear_members(self):
        self._members = []

    @property
    def center(self):
        return self._center

    @center.setter
    def center(self, c):
        self._center = c

class KMeans(object):
    def __init__(self, k=2):
        if (k is None) or (k <= 1):
            raise ValueError("k's num must not none and must > 1.")
        self._k = k
        # the k clusters
        self._groups = self._make_groups(k)
        self._pre_mean_value = 0
        self._current_mean_value = 1

    def _make_groups(self, k):
        """
        Create the k empty clusters.
        :param k:
        :return:
        """
        groups = []
        for i in range(k):
            g = Group()
            g.name = i + 1
            groups.append(g)
        return groups

    def _random_x_index(self, xlen):
        indexes = np.random.randint(0, xlen, self._k).tolist()
        return indexes

    def _compute_mean_value(self):
        sum = 0
        for i in range(len(self._groups)):
            average = self._compute_members_mean(self._groups[i].members)
            self._groups[i].center = average
            sum += average
        return sum / (len(self._groups))

    def _compute_members_mean(self, members):
        np_members = np.array(members)
        average = np.average(np_members, axis=0)
        return average

    def _find_most_nearby_group(self, x):
        np_groups = np.array([group.center for group in self._groups])
        distances = eculide(x, np_groups)
        most_similarity_index = np.argmin(distances).squeeze()
        self._groups[most_similarity_index].members = x
        return most_similarity_index

    def _clear_groups_members(self):
        for group in self._groups:
            group.clear_members()

    def fit(self, X):
        rows, cols = X.shape
        # 1. pick k points at random as the initial cluster centers
        init_indexes = self._random_x_index(rows)
        for i, index in enumerate(init_indexes):
            self._groups[i].center = X[index]
            self._groups[i].members = X[index]
        # 2. compute the distance from each point to every center and
        #    assign the point to the nearest cluster
        while True:
            for i in range(rows):
                # find the nearest group and add the sample to it
                self._find_most_nearby_group(X[i])
            # 3. recompute each cluster's center (mean of its members)
            #    and keep the mean over all cluster centers
            self._current_mean_value = self._compute_mean_value()
            epos = np.sum(np.abs(self._current_mean_value - self._pre_mean_value)).squeeze()
            if epos <= 0.00001:
                break
            # clear the old members and keep the current mean for the next round
            self._clear_groups_members()
            self._pre_mean_value = self._current_mean_value
        # 4. steps 2-3 repeat until the cluster means no longer change

    def plot_example(self):
        figure = plt.figure()
        ax = figure.add_subplot(111)
        ax.set_title("KMeans Iris Example")
        plt.xlabel("first dim")
        plt.ylabel("third dim")
        cxs = []
        cys = []
        for i in range(len(self._groups)):
            group = self._groups[i]
            members = group.members
            x = [member[0] for member in members]
            y = [member[2] for member in members]
            ax.scatter(x, y, marker='o', label=group.name)
            cx = group.center[0]
            cy = group.center[2]
            cxs.append(cx)
            cys.append(cy)
        plt.scatter(cxs, cys, marker='+', c='k')
        plt.legend(loc="best")
        plt.show()
data,target,target_names = load_data()
kmeans = KMeans(k=3)
kmeans.fit(data)
kmeans.plot_example()
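As a quick sanity check, you can also look at how many samples landed in each cluster; note that this peeks at the internal _groups attribute of the KMeans class above, so it is only a convenience for inspection:
# number of members and final center of each cluster
for group in kmeans._groups:
    print(group.name, len(group.members), group.center)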
Running the script produces a result similar to the figure below: the iris dataset is used, and the plot shows the data projected onto the 2-D plane spanned by the first and third feature dimensions. The real iris data contains 3 classes with 50 samples each.
The result may differ from run to run because the initial centers are chosen at random, which makes the outcome unstable. For this reason, many improved versions of k-means optimize the selection of the initial centers; others replace the mean with the median (the k-medians algorithm).
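For comparison, scikit-learn's own KMeans uses the k-means++ initialization by default, which is exactly the kind of initial-center improvement mentioned above. A rough sketch on the same iris data (the class is imported under an alias so it does not clash with the KMeans class defined earlier; cluster numbering is arbitrary):
import numpy as np
from sklearn.cluster import KMeans as SKKMeans
from sklearn.datasets import load_iris

X = load_iris().data
sk = SKKMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = sk.fit_predict(X)
print(sk.cluster_centers_)   # the 3 learned centers
print(np.bincount(labels))   # samples per cluster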