01. Rule-Based Classifiers
0.1 Key Concepts
Classify records using a collection of "if ... then ..." rules
Rule: (condition) --> y
condition: a conjunction of attribute tests
Rule coverage: the fraction of all records that satisfy the rule's condition
Rule accuracy: the fraction of records covered by the rule that are correctly predicted
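For example, a rule that covers 10 of 100 records and correctly classifies 9 of those 10 has coverage = 10/100 = 10% and accuracy = 9/10 = 90%.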
0.2 Properties
Mutually exclusive rules: the rules do not overlap, so each record is covered by at most one rule
Exhaustive rules: every combination of attribute values is accounted for, so each record is covered by at least one rule
0.3 Rule Simplification and Its Side Effects
After simplification, the rules are no longer mutually exclusive: a record may satisfy more than one rule
After simplification, the rules are no longer exhaustive: a record may not trigger any rule at all
Solutions:
Use a default class for records that trigger no rule
(Note: a decision tree can be converted into a rule-based classifier; the extracted rules are mutually exclusive and exhaustive until simplified)
0.4 Ordered Rules
Rules are ranked in order of priority ----> a decision list
At test time, a record is assigned the class predicted by the highest-priority rule it satisfies; if it satisfies no rule, it is assigned the default class
Rule ordering schemes
(1) Rule-based ordering: individual rules are ranked by some measure of rule quality
(2) Class-based ordering: rules that predict the same class are grouped together
0.5 Building a Rule-Based Classifier
1. Direct methods: extract rules directly from the data
e.g., RIPPER, CN2, Holte's 1R
Sequential covering
Steps (see the sketch after this list):
(1) Start from an empty rule
(2) Grow a rule using the Learn-One-Rule function
(3) Remove the training records covered by the rule
(4) Repeat steps 2 and 3 until the stopping criterion is met
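A minimal Python sketch of this loop, assuming a hypothetical learn_one_rule helper (which grows a single rule) and rule objects with a covers(record) method; this is the covering skeleton only, not RIPPER itself:

```python
def sequential_covering(records, labels, target_class, learn_one_rule, min_coverage=1):
    """Greedy sequential covering: learn rules one at a time, removing
    the records each new rule covers before learning the next."""
    rules = []
    data = list(zip(records, labels))
    while any(y == target_class for _, y in data):
        rule = learn_one_rule(data, target_class)        # hypothetical helper
        covered = [(x, y) for x, y in data if rule.covers(x)]
        if len(covered) < min_coverage:                  # stopping criterion
            break
        rules.append(rule)
        data = [(x, y) for x, y in data if not rule.covers(x)]
    return rules
```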
Background knowledge involved:
(1) Rule growing
Two strategies: general-to-specific and specific-to-general
(2) Instance elimination
Why do we need to eliminate instances?
-- otherwise, the next rule learned will be identical to the previous rule
Why do we remove positive instances?
-- to ensure that the next rule is different
Why do we remove negative instances?
-- to prevent underestimating the accuracy of the rule
(3) Rule evaluation (see the sketch below)
accuracy = nc / n
Laplace = (nc + 1) / (n + k)
m-estimate = (nc + k*p) / (n + k)
n: number of records covered by the rule
nc: number of covered records that belong to the class predicted by the rule
k: number of classes
p: prior probability of the class
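A small sketch, under the definitions above, comparing the three estimates on made-up numbers (note that with p = 1/k the m-estimate reduces to the Laplace estimate):

```python
def rule_metrics(n, nc, k, p):
    """Evaluate a rule covering n records, nc of them correctly,
    with k classes and class prior p."""
    return {
        "accuracy":   nc / n,
        "laplace":    (nc + 1) / (n + k),
        "m_estimate": (nc + k * p) / (n + k),
    }

# A rule covering 60 records, 55 correctly, with 2 classes and prior 0.5:
print(rule_metrics(n=60, nc=55, k=2, p=0.5))
# accuracy ≈ 0.917; laplace = m_estimate ≈ 0.903 here, because p = 1/k
```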
(4) Stopping criterion
Compute the gain (e.g., information gain) of the candidate rule
If the gain is not significant, discard the rule
(5) Rule pruning (see the sketch below)
Similar to post-pruning of decision trees
Reduced error pruning:
Remove one of the conjuncts in the rule
Compare the error rate on a validation set before and after pruning
If the error rate improves after pruning, keep the pruned rule
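A minimal sketch of reduced error pruning, assuming a rule is represented as a list of conjuncts and a hypothetical error_rate(conjuncts, validation_set) helper:

```python
def reduced_error_prune(conjuncts, validation_set, error_rate):
    """Greedily drop conjuncts while the rule's error on the
    validation set does not get worse."""
    improved = True
    while improved and len(conjuncts) > 1:
        improved = False
        base_err = error_rate(conjuncts, validation_set)
        for i in range(len(conjuncts)):
            candidate = conjuncts[:i] + conjuncts[i + 1:]   # drop conjunct i
            if error_rate(candidate, validation_set) <= base_err:
                conjuncts, improved = candidate, True       # keep the pruned rule
                break
    return conjuncts
```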
Summary of the direct method:
Grow a single rule
Remove the instances covered by the rule
Prune the rule (if necessary)
Add the rule to the current rule set
Repeat
2. Indirect methods
Extract rules from other classification models (e.g., decision trees, neural networks)
e.g., C4.5rules, generated from a decision tree
Steps (see the sketch after this list):
Extract rules from an unpruned decision tree
For each rule r: A --> y:
(1) Consider an alternative rule r': A' --> y, where A' is obtained by removing one of the conjuncts in A
(2) Compare the pessimistic error rate of r against that of every r'
(3) Prune r if one of the r' has a lower pessimistic error rate
(4) Repeat until the generalization error can no longer be improved
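The conjunct-dropping loop has the same shape as reduced error pruning above, but the comparison uses a pessimistic error estimate computed on the training data. A sketch, assuming a simplified pessimistic estimate (observed errors plus a fixed penalty; C4.5 itself uses a binomial confidence upper bound):

```python
def pessimistic_error(n_covered, n_errors, penalty=0.5):
    """Simplified pessimistic estimate for a rule: observed training
    errors plus a penalty term, per covered record. (An assumption for
    illustration; C4.5 uses a binomial confidence upper bound.)"""
    return (n_errors + penalty) / n_covered

# Rule r covers 40 records with 2 errors; dropping one conjunct gives r',
# which covers 50 records with 2 errors.
print(pessimistic_error(40, 2))   # 0.0625 for r
print(pessimistic_error(50, 2))   # 0.05 for r' -> lower, so prune the conjunct
```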
Advantages of rule-based classifiers
As highly expressive as decision trees
Easy to understand and interpret
Easy to generate
Fast at classifying new instances
Performance comparable to decision trees
02. Instance-Based Classifiers
0.1 Two Approaches
Rote-learner: memorize the entire training data; classify a new record only if its attributes match a training example exactly
Nearest neighbor: use the k closest points (nearest neighbors) to perform the classification
0.2 Requirements
The set of stored training records
A distance metric to compute the distance between records
The value of k, the number of nearest neighbors to retrieve
To classify an unknown record (see the sketch below):
Compute its distance to the training records
Identify the k nearest neighbors
Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
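A self-contained sketch of this procedure with Euclidean distance and majority voting (plain Python, no library k-NN):

```python
import math
from collections import Counter

def knn_classify(query, train_X, train_y, k=3):
    """Classify `query` by majority vote among its k nearest
    training records under Euclidean distance."""
    dists = [
        (math.dist(query, x), y)            # distance to each training record
        for x, y in zip(train_X, train_y)
    ]
    dists.sort(key=lambda d: d[0])          # identify the k nearest neighbors
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]       # majority vote

# Toy usage: two 2-D classes
X = [(1, 1), (1, 2), (5, 5), (6, 5)]
y = ["a", "a", "b", "b"]
print(knn_classify((1.5, 1.5), X, y, k=3))  # -> "a"
```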
0.3 Other Issues
(1) Distance measures
Euclidean distance: the norm of the difference of the two vectors, d(p, q) = sqrt(sum_i (p_i - q_i)^2)
High-dimensional data ---> curse of dimensionality
Remedy: normalize the vectors (e.g., to unit length)
(2) Choosing k
Too small: the classifier is overly sensitive to noise points
Too large: the neighborhood may include points from other classes
(3) Scaling
Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes (see the sketch below)
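A small sketch of per-attribute standardization (z-scores), so that, e.g., an income attribute in the tens of thousands does not swamp an age attribute:

```python
import statistics

def standardize_columns(rows):
    """Rescale each attribute to zero mean and unit standard deviation
    so no single attribute dominates the distance computation."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stdevs = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]

# Age vs. income: raw Euclidean distances would be dominated by income
rows = [(25, 30000), (45, 90000), (30, 40000)]
print(standardize_columns(rows))
```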
03. Bayesian Classifiers
0.1 Key Concepts
A probabilistic framework for classification
Consider each attribute and the class label as random variables
Given a record with attributes (A1, A2, ..., An)
Goal: predict class C
Specifically, find the value of C that maximizes P(C | A1, A2, ..., An)
Can we estimate P(C | A1, A2, ..., An) directly from the data?
Approach:
Compute the posterior probability P(C | A1, A2, ..., An) for all values of C using Bayes' theorem
Choose the value of C that maximizes P(C | A1, A2, ..., An)
This is equivalent to choosing the value of C that maximizes P(A1, A2, ..., An | C) P(C), since the denominator P(A1, A2, ..., An) is the same for every C
How do we estimate P(A1, A2, ..., An | C)?
Naive Bayes assumption: the attributes are conditionally independent of one another given the class
Estimate P(Ai | C) from the data, then compute P(A1, A2, ..., An | C) = P(A1 | C) P(A2 | C) ... P(An | C) (see the sketch below)
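A compact sketch of these counts and the product for categorical attributes (no smoothing yet; the zero-probability fix comes in 0.3):

```python
from collections import Counter, defaultdict

def train_nb(records, labels):
    """Estimate P(C) and P(Ai = v | C) by counting."""
    class_counts = Counter(labels)
    cond = defaultdict(Counter)                 # (class, attr index) -> value counts
    for x, c in zip(records, labels):
        for i, v in enumerate(x):
            cond[(c, i)][v] += 1
    return class_counts, cond

def predict_nb(x, class_counts, cond):
    """Choose C maximizing P(C) * prod_i P(Ai | C)."""
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / total                      # prior P(C)
        for i, v in enumerate(x):
            score *= cond[(c, i)][v] / nc       # P(Ai = v | C)
        if score > best_score:
            best, best_score = c, score
    return best
```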
0.2 How to Use Naive Bayes with Continuous Attributes?
(1) Discretize the range into bins
One ordinal attribute per bin
This may violate the independence assumption
(2) Two-way split: (A < v) or (A > v)
Choose only one of the two splits as the new attribute
(3) Probability density estimation
Assume the attribute follows a normal distribution
Use the data to estimate the parameters of the distribution (e.g., mean and standard deviation)
Once the probability distribution is known, it can be used to compute the conditional probability P(Ai | c) (see the sketch below)
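A minimal sketch of the normal-distribution estimate for one continuous attribute within one class:

```python
import math
import statistics

def gaussian_likelihood(value, sample):
    """Estimate P(Ai = value | c) for a continuous attribute by fitting
    a normal distribution to the class-c sample of that attribute."""
    mu = statistics.mean(sample)
    sigma = statistics.stdev(sample)            # sample standard deviation
    coef = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coef * math.exp(-((value - mu) ** 2) / (2 * sigma ** 2))

# Incomes (in $K) of class "yes" records; likelihood of a new income of 120
print(gaussian_likelihood(120, [125, 100, 70, 120, 60, 220, 75]))
```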
0.3 Issues in Practice
(1) Zero conditional probabilities
If one of the conditional probabilities is zero, the entire product becomes zero
Remedies:
Original estimate: P(Ai | C) = Nic / Nc (the fraction of class-C records whose attribute Ai takes the given value)
Laplace: P(Ai | C) = (Nic + 1) / (Nc + c), where c is the number of classes
m-estimate: P(Ai | C) = (Nic + m*p) / (Nc + m), where p is the prior probability and m is a parameter
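For example, with Nc = 10, Nic = 0, and c = 3 classes, the original estimate gives 0 and zeroes out the whole product, while the Laplace estimate gives (0 + 1) / (10 + 3) = 1/13 ≈ 0.077, so the record can still be classified.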
0.4 Summary
Robust to isolated noise points
Handles missing values by ignoring the instance during the probability estimate calculations
Robust to irrelevant attributes
The independence assumption may not hold for some attributes; in that case, use other techniques such as Bayesian Belief Networks (BBN)
04. SVM (Support Vector Machines)
Find a linear hyperplane that separates the data
Assumes the data are linearly separable
For data that are not linearly separable ---> map them into a space where they become linearly separable: kernel functions
Evaluation criterion: the margin, i.e., the distance between the two boundary hyperplanes
Maximize the margin 2 / ||w||
Equivalent to minimizing ||w||^2 / 2 (see the sketch below)
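A short sketch using scikit-learn (assuming it is installed): fit a linear SVM on separable toy data and recover the margin width 2 / ||w|| from the learned weights:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable 2-D classes
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                    # normal vector of the hyperplane
margin = 2.0 / np.linalg.norm(w)    # margin width 2 / ||w||
print("margin:", margin)
print("prediction for (2, 2):", clf.predict([[2, 2]])[0])
```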
05. Ensemble Methods
Construct a set of classifiers from the training data
Predict the class label of previously unseen records by aggregating the predictions made by the multiple classifiers
General idea
Analogous to an election: each classifier casts a vote and the majority wins (see the sketch below)
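A tiny sketch of the aggregation step, assuming each base classifier is just a callable returning a label:

```python
from collections import Counter

def ensemble_predict(classifiers, record):
    """Aggregate base-classifier predictions by majority vote."""
    votes = Counter(clf(record) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Three toy "classifiers" voting on a record
stumps = [lambda x: "yes", lambda x: "no", lambda x: "yes"]
print(ensemble_predict(stumps, record={"age": 30}))  # -> "yes"
```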
0.1 Methods for Generating Ensemble Classifiers
(1) Bagging: build each classifier on a bootstrap sample (sampled with replacement) of the training data
(2) Boosting
An iterative procedure: adaptively change the distribution of the training data by focusing more on the records misclassified by the previous classifiers
Initially, all N records are assigned equal weights
Unlike bagging, the weights may change at the end of each boosting round:
The weights of misclassified records are increased
The weights of correctly classified records are decreased
Example: AdaBoost (see the sketch below)
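A sketch of one AdaBoost round's weight update, assuming labels and predictions in {-1, +1}; alpha = 0.5 * ln((1 - err) / err) is the standard AdaBoost classifier weight:

```python
import math

def adaboost_round(weights, predictions, labels):
    """One AdaBoost round: compute the classifier's weight alpha and
    re-weight the records (labels and predictions in {-1, +1})."""
    err = sum(w for w, p, t in zip(weights, predictions, labels) if p != t)
    err = max(min(err, 1 - 1e-10), 1e-10)          # guard degenerate cases
    alpha = 0.5 * math.log((1 - err) / err)        # classifier importance
    new_w = [
        w * math.exp(-alpha if p == t else alpha)  # shrink correct, boost wrong
        for w, p, t in zip(weights, predictions, labels)
    ]
    z = sum(new_w)                                 # normalize to a distribution
    return alpha, [w / z for w in new_w]

# Four records, one misclassified: its weight grows from 0.25 to 0.5
alpha, w = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
print(alpha, w)
```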