1悦析、Naive Bayes classification
The Naive Bayes classifier is widely used in text classification because it is simple, efficient, and performs well on large sample sets. However, NB only reflects statistical information, so when the data for a class is insufficient its performance cannot be guaranteed. This article dissects the NB implementation in Spark MLlib, so that problems can be analyzed and located when classification results are poor. The NB classification process is as follows:
- 1) Let x = {a1, a2, ..., am} be the sample to classify, where each ai is a feature of the sample. In NLP the data being processed is text, so x here is the data after vectorization; how to convert text into the numeric vectors the model accepts will be covered in another article.
- 2) Let the class set be C = {c1, c2, ..., cn}. Compute the prior probability of each class and take the logarithm:
p(ci) = log((number of samples in class i + smoothing factor) / (total number of samples + number of classes * smoothing factor))
- 3) Compute the conditional probability of each feature under each class, and take the logarithm:
theta(i)(j) = log(sumTermFreq(j) + smoothing factor) - thetaLogDenom
Here theta(i)(j) denotes the j-th feature under class i, and sumTermFreq(j) is the number of occurrences of feature j within that class (in fact, the value at position j of the aggregated vector, which depends on the vectorization scheme). thetaLogDenom takes one of two forms:
Multinomial model:
thetaLogDenom = log(sumTermFreq.values.sum + numFeatures * lambda)
Bernoulli model:
thetaLogDenom = log(n + 2.0 * lambda)
In text classification, sumTermFreq.values.sum is the total number of words in class i; numFeatures is the number of features; lambda is the smoothing factor; and n is the number of documents/samples in class i.
2舔哪、模型訓(xùn)練
The core of NB is the run method, located in spark\mllib\classification\NaiveBayes.scala. The main idea of the code: first aggregate the samples by label into (label, (number of samples with that label, sum of their feature vectors)), then use each (label, (n, sumTermFreqs)) entry to compute the conditional and prior probabilities. The code is as follows:
@Since("0.9.0")
class NaiveBayes private (
private var lambda: Double, // 平滑因子
private var modelType: String) extends Serializable with Logging {
import NaiveBayes.{Bernoulli, Multinomial} // 兩種分類(lèi)模式布持,樣本向量化的格式不同,
@Since("1.4.0")
def this(lambda: Double) = this(lambda, NaiveBayes.Multinomial)
@Since("0.9.0")
def this() = this(1.0, NaiveBayes.Multinomial)
/** Set the smoothing parameter. Default: 1.0. */
@Since("0.9.0")
def setLambda(lambda: Double): NaiveBayes = { // 設(shè)置平滑因子陕悬,默認(rèn)1.0
require(lambda >= 0,
s"Smoothing parameter must be nonnegative but got $lambda")
this.lambda = lambda
this
}
/** Get the smoothing parameter. */
@Since("1.4.0")
def getLambda: Double = lambda
/**
* Set the model type using a string (case-sensitive).
* Supported options: "multinomial" (default) and "bernoulli".
*/
@Since("1.4.0")
def setModelType(modelType: String): NaiveBayes = { // 設(shè)置模式
require(NaiveBayes.supportedModelTypes.contains(modelType),
s"NaiveBayes was created with an unknown modelType: $modelType.")
this.modelType = modelType
this
}
/** Get the model type. */
@Since("1.4.0")
def getModelType: String = this.modelType
// NB的關(guān)鍵方法题暖,用于模型訓(xùn)練
@Since("0.9.0")
def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
val values = v match { // 如果是Multinomial,向量的所有值,進(jìn)行校驗(yàn),所有值都必須非負(fù)
case sv: SparseVector => sv.values
case dv: DenseVector => dv.values
}
if (!values.forall(_ >= 0.0)) {
throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
}
}
val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
val values = v match { // 如果是Bernoulli模型胧卤,向量的所有值只能為0或1
case sv: SparseVector => sv.values
case dv: DenseVector => dv.values
}
if (!values.forall(v => v == 0.0 || v == 1.0)) {
throw new SparkException(
s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
}
}
// 根據(jù)標(biāo)簽進(jìn)行聚合唯绍,并統(tǒng)計(jì)標(biāo)簽下樣本數(shù)
val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, DenseVector)](
createCombiner = (v: Vector) => { // 創(chuàng)建combiner,用于聚合vectors
if (modelType == Bernoulli) {
requireZeroOneBernoulliValues(v)
} else {
requireNonnegativeValues(v)
}
(1L, v.copy.toDense) // 將樣本vector轉(zhuǎn)換為DenseVector并計(jì)次數(shù)為1枝誊,
},
mergeValue = (c: (Long, DenseVector), v: Vector) => { // 創(chuàng)建合并options推捐,用于合并vector的值
requireNonnegativeValues(v)
BLAS.axpy(1.0, v, c._2) // 該方法的作用為c._2 = c._2 + v
(c._1 + 1L, c._2) // 計(jì)數(shù)加1,(c._1 + 1, c._2 + v)
},
mergeCombiners = (c1: (Long, DenseVector), c2: (Long, DenseVector)) => {
BLAS.axpy(1.0, c2._2, c1._2) // 用法同上侧啼,c1._2 = c1._2 + c2._2
(c1._1 + c2._1, c1._2)
} // 最終的形式為(label, (樣本數(shù),features之和))
).collect().sortBy(_._1)
val numLabels = aggregated.length // 標(biāo)簽個(gè)數(shù)
var numDocuments = 0L
aggregated.foreach { case (_, (n, _)) => // 訓(xùn)練集樣本數(shù)
numDocuments += n
}
// 獲取樣本特征數(shù)即樣本向量的大小
val numFeatures = aggregated.head match { case (_, (_, v)) => v.size }
val labels = new Array[Double](numLabels)
val pi = new Array[Double](numLabels)
val theta = Array.fill(numLabels)(new Array[Double](numFeatures))
val piLogDenom = math.log(numDocuments + numLabels * lambda)
var i = 0
aggregated.foreach { case (label, (n, sumTermFreqs)) =>
labels(i) = label
pi(i) = math.log(n + lambda) - piLogDenom // 類(lèi)別的先驗(yàn)概率
val thetaLogDenom = modelType match { // sumTermFreqs.values.sum將vector中的所有values進(jìn)行累計(jì)
case Multinomial => math.log(sumTermFreqs.values.sum + numFeatures * lambda)
case Bernoulli => math.log(n + 2.0 * lambda)
case _ =>
// This should never happen.
throw new UnknownError(s"Invalid modelType: $modelType.")
}
var j = 0
while (j < numFeatures) { // 計(jì)算每個(gè)特征的條件概率
theta(i)(j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
j += 1
}
i += 1
}
new NaiveBayesModel(labels, pi, theta, modelType)
}
}
Summary: to train NB, Spark MLlib first aggregates the samples by label. The aggregation converts each sample vector to a DenseVector, accumulates the vector values, and counts the samples under each label, yielding (label, (count, sum of features)). Summing the counts gives the total number of samples, after which the class priors and per-feature conditional probabilities can be computed.
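The aggregation step described in the summary can be emulated in plain Scala (a sketch only: groupBy stands in for combineByKey, plain arrays stand in for MLlib vectors, and the labeled samples are toy values invented for illustration):

```scala
object NBAggregate {
  // Toy labeled samples: (label, feature vector), values assumed nonnegative
  val data = Seq(
    (0.0, Array(1.0, 0.0, 2.0)),
    (0.0, Array(2.0, 1.0, 0.0)),
    (1.0, Array(0.0, 3.0, 1.0))
  )

  // (label, (sample count, element-wise sum of feature vectors)):
  // the same shape that combineByKey produces inside run()
  val aggregated: Map[Double, (Long, Array[Double])] =
    data.groupBy(_._1).map { case (label, samples) =>
      val sum = samples.map(_._2).reduce { (a, b) =>
        a.zip(b).map { case (x, y) => x + y } // stands in for BLAS.axpy
      }
      label -> (samples.size.toLong, sum)
    }
}
```

From this shape, summing the counts gives numDocuments, and each entry supplies the (n, sumTermFreqs) pair used in the probability formulas.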