Hierarchical text clustering with BIRCH: grouping texts by content
Cluster similar texts, then pick the single most representative item from each cluster.
1. Input data (text.dat, one title per line):
data.png
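2. The output is shown below: there are 9 items before clustering and 6 after.
The dictionary keys are cluster labels, and each value lists the indices (0-based line numbers in text.dat) that belong to that cluster: {0: [0, 1, 2], 1: [3, 4], 2: [5], 3: [6], 4: [7], 5: [8]}
Items 0, 1 and 2 were grouped into one cluster, and the title closest to that cluster's center was output: "吳亦凡陳偉霆“互懟”酷狗賽道TOP1學員壓軸來襲".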
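As a minimal sketch of how such a dictionary and a representative item can be derived, the snippet below groups row indices by label and, for one group, picks the row nearest to the group's center. The label array, the 2-D vectors, and the helper names `group_by_label` and `pick_representative` are made up for illustration. The center here is simply the mean of the group's rows, which is an assumption; the full listing uses the centers reported by Birch instead.

```python
from collections import defaultdict

import numpy as np


def group_by_label(labels):
    """Map each cluster label to the list of row indices carrying it."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[int(label)].append(index)
    return dict(groups)


def pick_representative(vectors, indices):
    """Return the index in `indices` whose row is closest to the group mean."""
    rows = vectors[indices]
    center = rows.mean(axis=0)  # assumption: the group mean stands in for the cluster center
    distances = np.linalg.norm(rows - center, axis=1)
    return indices[int(np.argmin(distances))]


# Hypothetical labels, chosen to reproduce the dictionary shown above.
labels = np.array([0, 0, 0, 1, 1, 2, 3, 4, 5])
groups = group_by_label(labels)
print(groups)

# Hypothetical 2-D vectors for the three members of cluster 0.
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
print(pick_representative(vectors, [0, 1, 2]))
```

Using `np.linalg.norm(..., axis=1)` computes all Euclidean distances in one call, which avoids the explicit loop over members.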
Changing the threshold parameter in Birch(threshold=0.7, n_clusters=None) adjusts how aggressively texts are merged (the listing below sets it to 0.6).
result.png
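To illustrate how threshold changes the number of clusters, here is a small sketch on made-up 2-D points rather than the tf-idf vectors from this post; `count_clusters` is a hypothetical helper. With n_clusters=None, Birch skips its final global clustering step and returns the subclusters as they are, so a smaller threshold tends to split the data into more groups.

```python
import numpy as np
from sklearn.cluster import Birch

# Made-up points: two tight pairs and one isolated point.
points = np.array([
    [0.0, 0.0], [0.1, 0.0],
    [5.0, 5.0], [5.1, 5.0],
    [10.0, 0.0],
])


def count_clusters(threshold):
    """Fit Birch with the given threshold and count the distinct labels."""
    model = Birch(threshold=threshold, n_clusters=None)
    labels = model.fit_predict(points)
    return len(set(labels.tolist()))


for threshold in (0.5, 0.01):
    print(threshold, count_clusters(threshold))
```

On this toy data the larger threshold merges each tight pair into one subcluster, while the smaller one keeps every point separate. On real tf-idf vectors a suitable value has to be found by trial.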
Reference:
https://blog.csdn.net/Eastmount/article/details/50473675?fps=1&locationNum=4
Source code:
https://github.com/codingMrHu/test_cluster
# coding=utf-8
import jieba
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import Birch

'''
TF-IDF in sklearn relies mainly on two classes: CountVectorizer and TfidfTransformer.
CountVectorizer.fit_transform converts the words in the texts into a term-frequency matrix.
Element weight[i][j] is the frequency of word j in text i, i.e. how many times the word occurs.
get_feature_names_out() lists every term in the vocabulary; toarray() shows the term-frequency matrix.
TfidfTransformer also has a fit_transform method, which computes the tf-idf values.
'''


class Cluster():
    def init_data(self):
        # corpus: one entry per document, tokens joined by spaces
        corpus = []
        # f_write = open("jieba_result.dat", "w")
        self.title_dict = {}
        with open('text.dat', 'r', encoding='utf-8') as f:
            index = 0
            for line in f:
                title = line.strip()
                self.title_dict[index] = title
                seglist = jieba.cut(title, cut_all=False)  # accurate mode
                output = ' '.join(seglist)  # join tokens with spaces
                # print(index, output)
                index += 1
                corpus.append(output.strip())
        # Converts the texts into a term-frequency matrix; a[i][j] is the frequency of word j in text i
        vectorizer = CountVectorizer()
        # Computes the tf-idf weight of every term
        transformer = TfidfTransformer()
        # The inner fit_transform builds the term-frequency matrix; the outer one computes tf-idf
        tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
        # All terms in the bag-of-words model
        word = vectorizer.get_feature_names_out()
        # Dense tf-idf matrix; w[i][j] is the tf-idf weight of word j in text i
        self.weight = tfidf.toarray()
        # print(self.weight)

    def birch_cluster(self):
        print('start cluster Birch -------------------')
        self.cluster = Birch(threshold=0.6, n_clusters=None)
        self.cluster.fit_predict(self.weight)

    def get_title(self):
        # self.cluster.labels_ gives, for each text index in corpus, its cluster label (an int);
        # equal labels mean the same cluster
        cluster_dict = {}
        # cluster_dict: key is a Birch cluster label, value is the list of title indices in that cluster
        for index, value in enumerate(self.cluster.labels_):
            if value not in cluster_dict:
                cluster_dict[value] = [index]
            else:
                cluster_dict[value].append(index)
        print(cluster_dict)
        print("-----before cluster Birch count title:", len(self.title_dict))
        # result_dict: key is the title closest to the cluster center, value is the cluster size
        result_dict = {}
        for indexs in cluster_dict.values():
            latest_index = indexs[0]
            similar_num = len(indexs)
            if len(indexs) >= 2:
                # Euclidean distance from the first member to its subcluster center
                min_s = np.sqrt(np.sum(np.square(self.weight[indexs[0]] - self.cluster.subcluster_centers_[self.cluster.labels_[indexs[0]]])))
                for index in indexs:
                    s = np.sqrt(np.sum(np.square(self.weight[index] - self.cluster.subcluster_centers_[self.cluster.labels_[index]])))
                    if s < min_s:
                        min_s = s
                        latest_index = index
            title = self.title_dict[latest_index]
            result_dict[title] = similar_num
        print("-----after cluster Birch count title:", len(result_dict))
        for title in result_dict:
            print(title, result_dict[title])
        return result_dict

    def run(self):
        self.init_data()
        self.birch_cluster()
        self.get_title()


if __name__ == '__main__':
    cluster = Cluster()
    cluster.run()
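The feature-extraction step described in the docstring can be tried on its own. The sketch below runs CountVectorizer followed by TfidfTransformer on a tiny corpus that is assumed to be already segmented and space-joined (the English words are placeholders for jieba output), and checks that TfidfVectorizer produces the same matrix in one step. One thing to keep in mind: CountVectorizer's default token_pattern keeps only tokens of two or more characters, so single-character words are dropped unless the pattern is changed.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Placeholder corpus: each document is a space-joined token string.
corpus = [
    "apple banana",
    "apple cherry",
    "banana cherry durian",
]

# Step 1: raw term counts, one row per document, one column per term.
counts = CountVectorizer().fit_transform(corpus)

# Step 2: reweight the counts to tf-idf (rows are L2-normalised by default).
weight = TfidfTransformer().fit_transform(counts).toarray()

# One-step equivalent of the two calls above.
weight_direct = TfidfVectorizer().fit_transform(corpus).toarray()

print(weight.shape)
print(np.allclose(weight, weight_direct))
```

Because each row is normalised to unit length, the Euclidean distances that Birch works with stay on a comparable scale across documents of different lengths.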