「Naive Bayes」
We won't go into the theory in detail here; a quick web search turns up plenty of material. The core idea is: Naive Bayes = conditional independence assumption + Bayes' rule. It runs fast and classifies well when the independence assumption roughly holds, but words that never appear in the training set need smoothing, and numeric features are assumed to follow a normal distribution by default.
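To make the smoothing point concrete, here is a toy sketch (the words, classes, and helper are invented for illustration and are not the code from this post) of add-one (Laplace) smoothing in a hand-rolled Naive Bayes score:

```python
import math
from collections import Counter

# Toy corpus: segmented documents per class (hypothetical data)
docs = {
    "sports": [["ball", "team", "win"], ["team", "score"]],
    "tech":   [["code", "python"], ["python", "data", "code"]],
}

def log_prob(words, cls, alpha=1.0):
    """log P(class) + sum of log P(word|class), with add-alpha smoothing."""
    counts = Counter(w for doc in docs[cls] for w in doc)
    total = sum(counts.values())
    vocab = {w for ds in docs.values() for d in ds for w in d}
    n_docs = sum(len(ds) for ds in docs.values())
    lp = math.log(len(docs[cls]) / n_docs)  # class prior
    for w in words:
        # an unseen word gets a small but non-zero probability thanks to alpha
        lp += math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
    return lp

# "python" never appears under "sports", yet its smoothed probability is not zero
print(max(docs, key=lambda c: log_prob(["python", "code"], c)))  # → tech
```

Without smoothing, a single unseen word would zero out the whole product of probabilities; add-alpha keeps the score finite.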
「Python implementation」
Loading the stopwords
Load the stopword list, using strip() on each line to trim the unwanted whitespace characters ('\n', '\r', '\t', ' ').
import codecs

def make_word_set(words_file):
    words_set = set()
    with codecs.open(words_file, 'r', 'utf-8') as fp:
        for line in fp:
            word = line.strip()
            if len(word) > 0 and word not in words_set:
                words_set.add(word)
    return words_set
Text processing and sample generation
Each news article is a txt file inside the folder of its class; the directory layout is:
"folder_path"
|-- C000008 -- 1.txt / 2.txt / ... / 19.txt
|-- C000010
|-- C000013
|-- ...
|-- C000024
Here os.listdir() reads all folder names (i.e. the class labels) under the given directory; we then walk the text files inside each folder (class), segment every txt file into words, and use zip() to pair each news article with its class, for 90 samples in total.
To draw the training and test sets at random, random.shuffle() shuffles the pairs, 20% of the data is held out for testing, and the features and class labels are then split apart.
Finally, word frequencies are counted over the training set. sorted() does the ordering; its (Python 2) signature is sorted(iterable, cmp = None, key = None, reverse = False), where:
- iterable: any iterable (here, the items of a dict)
- cmp: a comparison function, removed in Python 3 (not used here)
- key: a function or element attribute used as the sort key (here, the dict value, i.e. the word count)
- reverse: True for descending order, False for ascending
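A quick standalone illustration of this pattern, sorting a small (made-up) frequency dict by value in descending order and then unzipping the result:

```python
counts = {"apple": 3, "data": 7, "ball": 5}

# sort (word, count) pairs by count, largest first
pairs = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(pairs)  # → [('data', 7), ('ball', 5), ('apple', 3)]

# unzip back into parallel tuples of words and counts
words, freqs = zip(*pairs)
print(words)  # → ('data', 'ball', 'apple')
```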
import os
import random

import jieba

def text_processing(folder_path, test_size = 0.2):
    folder_list = os.listdir(folder_path)
    data_list = []
    class_list = []
    # Walk each class folder
    for folder in folder_list:
        new_folder_path = os.path.join(folder_path, folder)
        files = os.listdir(new_folder_path)
        j = 1
        for file in files:
            if j > 100:  # cap at 100 files per class to keep memory in check
                break
            with codecs.open(os.path.join(new_folder_path, file), 'r', 'utf-8') as fp:
                raw = fp.read()
            word_cut = jieba.cut(raw, cut_all = False)  # accurate-mode segmentation
            word_list = list(word_cut)
            data_list.append(word_list)  # feature: the segmented document
            class_list.append(folder)    # label: the folder (class) name
            j += 1
    # Split into training and test sets
    data_class_list = list(zip(data_list, class_list))
    random.shuffle(data_class_list)  # shuffle before splitting
    index = int(len(data_class_list) * test_size) + 1  # size of the test split
    train_list = data_class_list[index:]
    test_list = data_class_list[:index]
    train_data_list, train_class_list = zip(*train_list)  # features vs. labels
    test_data_list, test_class_list = zip(*test_list)
    # Count word frequencies over the training set
    all_words_dict = {}
    for word_list in train_data_list:
        for word in word_list:
            if word in all_words_dict:
                all_words_dict[word] += 1
            else:
                all_words_dict[word] = 1
    # Sort descending by frequency (key function)
    all_words_tuple_list = sorted(all_words_dict.items(), key = lambda f: f[1], reverse = True)
    all_words_list = list(zip(*all_words_tuple_list))[0]
    return all_words_list, train_data_list, test_data_list, train_class_list, test_class_list
Feature selection
Here we keep only the 1000 highest-frequency words as feature words (dimensions), dropping digits, stopwords, and words outside 2-4 characters in length.
def words_dict(all_words_list, deleteN, stopwords_set=set()):
    feature_words = []
    n = 1
    for t in range(deleteN, len(all_words_list), 1):
        if n > 1000:  # keep at most 1000 dimensions
            break
        # drop pure digits, stopwords, and words shorter than 2 or longer than 4 characters
        if not all_words_list[t].isdigit() and all_words_list[t] not in stopwords_set and 1 < len(all_words_list[t]) < 5:
            feature_words.append(all_words_list[t])
            n += 1
    return feature_words
Building a 0-1 matrix from the selected feature words
For each segmented document in train_data_list, build a feature vector over the 1000 selected feature words: an entry is 1 if that word appears in the document and 0 otherwise. The 90 articles thus form a [90, 1000] 0-1 matrix (71 rows of training data, 19 rows of test data).
(Figures: a sample of the training set, and the resulting 0-1 matrix.)
def text_features(train_data_list, test_data_list, feature_words):
    def text_features(text, feature_words):
        # e.g. text = train_data_list[0]
        text_words = set(text)
        features = [1 if word in text_words else 0 for word in feature_words]
        return features
    # 0-1 matrices (one column per feature word, 1000 columns)
    train_feature_list = [text_features(text, feature_words) for text in train_data_list]
    test_feature_list = [text_features(text, feature_words) for text in test_data_list]
    return train_feature_list, test_feature_list
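To make the 0-1 encoding concrete, here is the same idea applied to one toy document (the feature words and document are made up for the example):

```python
# hypothetical feature words and a toy segmented document
feature_words = ["economy", "market", "goal", "match"]
doc = ["market", "rose", "economy", "market"]

doc_words = set(doc)  # set membership keeps the containment test O(1)
vector = [1 if w in doc_words else 0 for w in feature_words]
print(vector)  # → [1, 1, 0, 0]
```

Note the encoding records presence only: "market" appears twice in the document but still contributes a single 1.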
Naive Bayes classifier
Here we use the Naive Bayes classifier from the open-source scikit-learn library. The inputs are the training 0-1 feature matrix (train_feature_list) and the training labels (train_class_list); comparing the classifier's output on the test data against the true labels gives an accuracy of 0.68.
from sklearn.naive_bayes import MultinomialNB

def text_classifier(train_feature_list, test_feature_list, train_class_list, test_class_list):
    # scikit-learn multinomial Naive Bayes classifier
    classifier = MultinomialNB().fit(train_feature_list, train_class_list)  # feature vectors and labels
    test_accuracy = classifier.score(test_feature_list, test_class_list)
    return test_accuracy
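As a standalone illustration of the fit/score calls above, here they are on tiny made-up 0-1 rows (the split of columns into "sports" and "tech" words is invented for the example; MultinomialNB applies alpha=1.0 Laplace smoothing by default):

```python
from sklearn.naive_bayes import MultinomialNB

# Hypothetical 0-1 feature rows: first two columns are "sports" words,
# last two are "tech" words
X_train = [[1, 1, 0, 0],
           [1, 0, 0, 0],
           [0, 0, 1, 1],
           [0, 1, 1, 1]]
y_train = ["sports", "sports", "tech", "tech"]

clf = MultinomialNB().fit(X_train, y_train)  # fit on features and labels

X_test = [[1, 1, 0, 0], [0, 0, 1, 1]]
y_test = ["sports", "tech"]
print(clf.score(X_test, y_test))  # → 1.0
```

score() is simply the fraction of test rows whose predicted class matches the true label, the same accuracy number reported in this post.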
We only took the 1000 most frequent words as the feature vector; trying other keyword choices (by varying deleteN), the accuracy stays above 0.63 throughout, so the classification works reasonably well, as shown below:
import matplotlib.pyplot as plt

if __name__ == '__main__':
    # Text preprocessing (segmentation, train/test split, frequency sort)
    folder_path = '...'
    all_words_list, train_data_list, test_data_list, train_class_list, test_class_list = text_processing(folder_path, test_size = 0.2)
    # Build the stopword set
    stopwords_file = '...'
    stopwords_set = make_word_set(stopwords_file)
    # Feature extraction and classification
    deleteNs = range(0, 1000, 20)
    test_accuracy_list = []
    for deleteN in deleteNs:
        # select up to 1000 feature words, skipping the deleteN most frequent
        feature_words = words_dict(all_words_list, deleteN, stopwords_set)
        # build the 0-1 feature matrices
        train_feature_list, test_feature_list = text_features(train_data_list, test_data_list, feature_words)
        # accuracy from the scikit-learn classifier
        test_accuracy = text_classifier(train_feature_list, test_feature_list, train_class_list, test_class_list)
        # record accuracy for each choice of deleteN
        test_accuracy_list.append(test_accuracy)
    print(test_accuracy_list)
    # Evaluate the results
    plt.figure()
    plt.plot(deleteNs, test_accuracy_list)
    plt.title('Relationship of deleteNs and test_accuracy')
    plt.xlabel('deleteNs')
    plt.ylabel('test_accuracy')
    plt.savefig('result.png', dpi = 100)
    plt.show()