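Naive Bayes is one of the most commonly used models in NLP. Here we use a simple example to see how to apply it to an NLP classification task. (This post is practically oriented; if you want to understand the theory behind the algorithm, please look it up separately.)
As with most model building, there are a few main steps: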
- Data preprocessing
- Labeling the dataset by class
- Feature extraction, model building, and training
- Testing
For this small example I used sklearn. There are two text collections, hotel and travel: one consists entirely of hotel listings and the other entirely of travel information.
The full code is as follows:
import os
import jieba
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
import time
"""
1. Data preprocessing
"""
def preprocess(path):
    # Read the file, segment it into words with jieba, and join the words
    # with spaces so that CountVectorizer can tokenize on whitespace.
    with open(path, "r", encoding="utf8") as f:
        textfile = f.read()
    return " ".join(jieba.cut(textfile))
"""
2. Labeling the dataset by class
"""
def loadtrainset(path, classtag):
    # Preprocess every file in the directory and give each one the same class tag.
    allfiles = os.listdir(path)
    processed_textset = []
    allclasstags = []
    for thisfile in allfiles:
        path_name = os.path.join(path, thisfile)
        processed_textset.append(preprocess(path_name))
        allclasstags.append(classtag)
    return processed_textset, allclasstags

# The class tags are the Chinese labels for "hotel" (賓館) and "travel" (旅游).
processed_textdata1, class1 = loadtrainset("/Users/fengyang/PycharmProjects/NLP/dataset/train/hotel", "賓館")
processed_textdata2, class2 = loadtrainset("/Users/fengyang/PycharmProjects/NLP/dataset/train/travel", "旅游")
train_data = processed_textdata1 + processed_textdata2
classtags_list = class1 + class2
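# Convert the words in the texts into a term-count matrix
count_vector = CountVectorizer()
vector_matrix = count_vector.fit_transform(train_data)
"""
3. Feature extraction and training
"""
# TF-IDF
# Extract features. Note that use_idf=False disables the IDF weighting,
# so the features are L2-normalized term frequencies only.
train_tfidf = TfidfTransformer(use_idf=False).fit_transform(vector_matrix)
# Train the classifier on the features
clf = MultinomialNB().fit(train_tfidf, classtags_list)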
"""
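4. Testing
"""
path = "/Users/fengyang/PycharmProjects/NLP/dataset/test/hotel"
allfiles = os.listdir(path)
hotel = 0
travel = 0
for thisfile in allfiles:
    path_name = os.path.join(path, thisfile)
    # Reuse the vectorizer fitted on the training data so the vocabulary matches.
    new_count_vector = count_vector.transform([preprocess(path_name)])
    # With use_idf=False the transformer only normalizes term frequencies, so a
    # fresh instance gives the same result as the training one. With use_idf=True
    # the transformer fitted on the training data would have to be reused instead.
    new_tfidf = TfidfTransformer(use_idf=False).fit_transform(new_count_vector)
    predict_result = clf.predict(new_tfidf)
    print(predict_result)
    print(thisfile)
    # predict() returns an array with one label per input document.
    if predict_result[0] == "賓館":
        hotel += 1
    elif predict_result[0] == "旅游":
        travel += 1
print("賓館" + str(hotel))
print("旅游" + str(travel))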
Result:
['賓館']
三亞市春節賓館房價不亂漲價違者將受到嚴處_seg_pos.txt
['賓館']
住宿-賓館名錄_seg_pos.txt
['賓館']
nj7_seg_pos.txt
['賓館']
dali09_seg_pos.txt
['賓館']
bj6_seg_pos.txt
['賓館']
xm7_seg_pos.txt
['賓館']
dujiangyan09_seg_pos.txt
['賓館']
wuyishan12_seg_pos.txt
['賓館']
zhuhai06_seg_pos.txt
['賓館']
kuerle01_seg_pos.txt
['賓館']
xm3_seg_pos.txt
賓館11
旅游0
From the results we can see that all 11 test texts, every one of them taken from the hotel test directory, were classified correctly.
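The same steps can also be written more compactly. The following is a minimal sketch, not the code used above: it assumes a tiny in-memory corpus whose texts are already segmented into space-separated words (the job `jieba.cut` does in the script), and the texts and the `hotel`/`travel` labels are made up for illustration. It chains the vectorizer, the term-frequency transformer, and the classifier with scikit-learn's `Pipeline`, so test documents automatically pass through the same fitted steps as the training data, and it checks documents from both classes rather than only the hotel one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (an assumption for illustration): each document is
# already segmented into space-separated words.
train_texts = [
    "酒店 房间 价格 早餐 服务",
    "宾馆 房间 预订 入住 价格",
    "酒店 入住 服务 前台 房间",
    "景点 门票 路线 游客 风景",
    "旅游 路线 景点 导游 行程",
    "游客 行程 风景 门票 旅游",
]
train_labels = ["hotel", "hotel", "hotel", "travel", "travel", "travel"]

# Held-out documents from both classes.
test_texts = ["宾馆 房间 早餐 价格", "旅游 景点 门票 行程"]
test_labels = ["hotel", "travel"]

# Same three stages as the script: counts, normalized term frequencies, Naive Bayes.
model = Pipeline([
    ("counts", CountVectorizer()),
    ("tf", TfidfTransformer(use_idf=False)),
    ("nb", MultinomialNB()),
])
model.fit(train_texts, train_labels)

predicted = model.predict(test_texts)
accuracy = accuracy_score(test_labels, predicted)
print(list(predicted), accuracy)
```

With real data, `train_texts` and `test_texts` would be built from the segmented files, for example with the `preprocess` function from the script.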