Theory
Bayes' theorem
Bayes' theorem describes the relationship between conditional probabilities:
$$P(A|B) = \cfrac{P(B|A)\,P(A)}{P(B)}$$
Naive Bayes classifier
The naive Bayes classifier is a probability-based classifier. We make the following definitions:
- B: the sample has feature vector B
- A: the sample belongs to class A
With these definitions, the terms in Bayes' formula read as follows:
- P(A|B): the probability that a sample with feature vector B belongs to class A (the quantity to compute)
- P(B|A): the probability of feature vector B occurring within class A (estimated from the training samples)
- P(A): the probability of class A (its frequency in the training samples)
- P(B): the probability of feature vector B (its frequency in the training samples)
The naive Bayes classifier further assumes that the individual features are conditionally independent given the class, so the formula becomes $$P(A|B) = \cfrac{P(A)\prod_{i} P(B_{i}\mid A)}{P(B)}$$
Every term on the right-hand side can be estimated from the training samples. At prediction time, compute this probability for each class and pick the class with the highest value; since P(B) is the same for every class, it can be dropped from the comparison.
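The counting-and-argmax procedure above can be sketched directly; a minimal toy example (hypothetical weather data, categorical features, no smoothing):

```python
# Toy naive Bayes: estimate P(A) and each P(B_i|A) by counting, then take
# the class maximizing P(A) * prod_i P(B_i|A); P(B) is constant and dropped.
samples = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "yes"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "no"),
    (("sunny", "mild"), "yes"),
]

def predict(x):
    labels = [y for _, y in samples]
    scores = {}
    for c in set(labels):
        prior = labels.count(c) / len(labels)              # P(A)
        in_class = [f for f, y in samples if y == c]
        likelihood = 1.0
        for i, value in enumerate(x):                      # prod_i P(B_i|A)
            likelihood *= sum(f[i] == value for f in in_class) / len(in_class)
        scores[c] = prior * likelihood
    return max(scores, key=scores.get)

print(predict(("sunny", "mild")))  # → yes
```

Real implementations add smoothing so that a feature value never seen in a class does not zero out the whole product.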
Naive Bayes with continuous-valued features
Continuous feature values can be handled in either of two ways:
- Discretize the continuous values into intervals
- Assume each feature follows a normal (or some other) distribution, which is a strong prior assumption; estimate the distribution's parameters from the samples and substitute the probability density into Bayes' formula
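The second option is what sklearn's GaussianNB implements: it fits a per-class mean and variance for each feature. A minimal sketch on the bundled iris dataset (continuous measurements):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gnb = GaussianNB()                # per-class Gaussian density for each feature
gnb.fit(X_train, y_train)
print(gnb.score(X_test, y_test))  # mean accuracy on the held-out split
```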
Implementation
Loading the data: the 20 newsgroups text dataset
# The following lines would download the dataset over the network:
# from sklearn.datasets import fetch_20newsgroups
# news = fetch_20newsgroups(subset='all')
# print(len(news.data))
# print(news.data[0])
# Here we load a local copy of the same corpus instead:
from sklearn import datasets
train = datasets.load_files("./20newsbydate/20news-bydate-train")
test = datasets.load_files("./20newsbydate/20news-bydate-test")
print(train.DESCR)
print(len(train.data))
print(train.data[0])
None
11314
b"From: cubbie@garnet.berkeley.edu ( )\nSubject: Re: Cubs behind Marlins? How?\nArticle-I.D.: agate.1pt592$f9a\nOrganization: University of California, Berkeley\nLines: 12\nNNTP-Posting-Host: garnet.berkeley.edu\n\n\ngajarsky@pilot.njin.net writes:\n\nmorgan and guzman will have era's 1 run higher than last year, and\n the cubs will be idiots and not pitch harkey as much as hibbard.\n castillo won't be good (i think he's a stud pitcher)\n\n This season so far, Morgan and Guzman helped to lead the Cubs\n at top in ERA, even better than THE rotation at Atlanta.\n Cubs ERA at 0.056 while Braves at 0.059. We know it is early\n in the season, we Cubs fans have learned how to enjoy the\n short triumph while it is still there.\n"
Preprocessing: feature extraction (text vectorization)
from sklearn.feature_extraction.text import CountVectorizer
# Bag-of-words counts; the vocabulary is fitted on the training set only,
# and the test set is transformed with that same vocabulary
vec = CountVectorizer(stop_words="english", decode_error="ignore")
train_vec = vec.fit_transform(train.data)
test_vec = vec.transform(test.data)
print(train_vec.shape)
(11314, 129782)
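To see what the vectorizer produces, here is a tiny sketch on two hypothetical sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]
vec = CountVectorizer()
X = vec.fit_transform(docs)     # sparse document-term count matrix

print(sorted(vec.vocabulary_))  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(X.toarray())              # one row per document, one column per word
```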
Model training
from sklearn.naive_bayes import MultinomialNB
# Multinomial naive Bayes is suited to discrete word-count features
bays = MultinomialNB()
bays.fit(train_vec, train.target)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
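The alpha=1.0 in the repr above is Laplace (add-one) smoothing: it keeps a word that never appeared in a class from zeroing out the whole product of likelihoods. A sketch of the smoothed estimate (toy counts, hypothetical helper name):

```python
# Smoothed estimate of P(word | class) with vocabulary size V:
# (count + alpha) / (total + alpha * V)
def smoothed_prob(count, total, vocab_size, alpha=1.0):
    return (count + alpha) / (total + alpha * vocab_size)

print(smoothed_prob(0, 100, 50))   # an unseen word still gets 1/150, not 0
print(smoothed_prob(10, 100, 50))  # seen words keep proportionally more mass
```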
Model evaluation
Using the estimator's built-in scorer (mean accuracy)
bays.score(test_vec,test.target)
0.80244291024960168
Using the metrics module
from sklearn.metrics import classification_report
y = bays.predict(test_vec)
print(classification_report(test.target, y, target_names=test.target_names))
precision recall f1-score support
alt.atheism 0.80 0.81 0.80 319
comp.graphics 0.65 0.80 0.72 389
comp.os.ms-windows.misc 0.80 0.04 0.08 394
comp.sys.ibm.pc.hardware 0.55 0.80 0.65 392
comp.sys.mac.hardware 0.85 0.79 0.82 385
comp.windows.x 0.69 0.84 0.76 395
misc.forsale 0.89 0.74 0.81 390
rec.autos 0.89 0.92 0.91 396
rec.motorcycles 0.95 0.94 0.95 398
rec.sport.baseball 0.95 0.92 0.93 397
rec.sport.hockey 0.92 0.97 0.94 399
sci.crypt 0.80 0.96 0.87 396
sci.electronics 0.79 0.70 0.74 393
sci.med 0.88 0.87 0.87 396
sci.space 0.84 0.92 0.88 394
soc.religion.christian 0.81 0.95 0.87 398
talk.politics.guns 0.72 0.93 0.81 364
talk.politics.mideast 0.93 0.94 0.94 376
talk.politics.misc 0.68 0.62 0.65 310
talk.religion.misc 0.88 0.44 0.59 251
avg / total 0.81 0.80 0.78 7532
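The report shows where precision or recall collapses (e.g. comp.os.ms-windows.misc has recall 0.04), but not which classes absorb the misclassified samples; a confusion matrix does. A self-contained sketch with toy labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows are true classes, columns are predicted classes
```

On the news data the same call would be confusion_matrix(test.target, y).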