Name: Cui Shaojie        Student ID: 16040510021
Reposted (with modifications) from: http://www.reibang.com/p/04d965e35b6d
【嵌牛 Digest】: mRMR can be viewed as an approximation to maximizing the dependency between the joint distribution of the selected feature subset and the target variable.
【嵌牛 Keywords】: feature engineering, the mRMR algorithm
【嵌牛 Question】: Do different models favor different types of features?
【嵌牛 Main text】:
Searching the space of feature subsets:
(1) The subset search problem.
For example, gradually adding relevant features (forward search), gradually removing irrelevant features (backward search), or combining the two (bidirectional search).
The drawback is that this strategy is greedy: the choice that is optimal in one round is not necessarily globally optimal, and unless exhaustive search is feasible the problem cannot be avoided (a minimal sketch of forward search follows this list).
This subset search strategy belongs to the maximum-relevance family of selection strategies.
(2) Evaluating and measuring feature subsets.
Evaluation criteria include information gain, cross entropy, correlation, cosine similarity, and so on.
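To make the greedy forward strategy concrete, here is a minimal Scala sketch. It assumes a hypothetical score function supplied by the caller (for example, cross-validated accuracy of a model trained on the candidate subset); the function and the object name are illustrative assumptions, not taken from the original post or from any library.

object ForwardSearch {
  // Greedy forward selection: repeatedly add the single feature that
  // improves the (caller-supplied, hypothetical) score the most.
  def select(allFeatures: Set[Int], k: Int, score: Set[Int] => Double): Set[Int] = {
    var selected = Set.empty[Int]
    while (selected.size < k && selected.size < allFeatures.size) {
      val remaining = allFeatures -- selected
      val best = remaining.maxBy(f => score(selected + f))
      selected += best
    }
    selected
  }
}

Each round re-evaluates only the candidate additions, which is exactly why the result is locally rather than globally optimal.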
Typical feature selection methods
二厨疙、從決策樹(shù)模型的特征重要性說(shuō)起
決策樹(shù)可以看成是前向搜索與信息熵相結(jié)合的算法洲守,樹(shù)節(jié)點(diǎn)的劃分屬性所組成的集合就是選擇出來(lái)的特征子集。
(1)決策樹(shù)劃分屬性的依據(jù)
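The repost leaves this item without a body, so the standard textbook definitions are restated here for reference (they are not quoted from the original article): ID3/C4.5-style trees choose the attribute with the largest entropy-based information gain, while CART trees use the Gini index.

\mathrm{Ent}(D) = -\sum_{k=1}^{K} p_k \log_2 p_k
\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)
\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2

Here p_k is the proportion of class k among the samples in node D, and D^v is the subset of D taking value v on attribute a; the split with the largest gain (or the largest Gini decrease) is selected.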
(2) Computing feature importance from Gini impurity
Both scikit-learn and MLlib build their random forest and GBDT algorithms on decision trees, generally CART trees, and compute feature importance from the Gini index.
For example, scikit-learn's sklearn.feature_selection.SelectFromModel can select and transform features according to their importance scores.
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier()
>>> clf = clf.fit(X, y)
>>> clf.feature_importances_
array([ 0.04...,  0.05...,  0.4...,  0.4...])
>>> model = SelectFromModel(clf, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 2)
(3) Source code for computing feature importance in MLlib's ensemble learning algorithms
Since Spark 2.0, the decision tree algorithms in MLlib provide a featureImportances method, and both the random forest algorithms (the RandomForestRegressionModel and RandomForestClassificationModel classes) and the GBDT algorithms (the GBTClassificationModel and GBTRegressionModel classes) reuse the decision tree machinery for computing feature impurity and feature importance to obtain the importances of the ensemble model.
The implementation classes of these ensemble methods all mix in the trait TreeEnsembleModel[M <: DecisionTreeModel], and featureImportances is implemented in that trait.
The basic idea of the featureImportances method is as follows (summarized in the formula after this list):
For a single decision tree, the importance of feature j is the sum of the gains of all tree nodes that split on feature j.
Normalize each tree's feature importance vector so it sums to 1.
Normalize the ensemble model's feature importance vector so it sums to 1.
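Written as a formula, this is a restatement of the three steps above, using the notation of the code below: gain(n) is the information gain at internal node n and N(n) the number of training instances reaching it.

\mathrm{imp}_T(j) = \sum_{n \in T:\ \mathrm{split}(n) = j} \mathrm{gain}(n) \cdot N(n),
\qquad
\mathrm{Imp}(j) \;\propto\; \sum_{T \in \text{ensemble}} \frac{\mathrm{imp}_T(j)}{\sum_{j'} \mathrm{imp}_T(j')}

with the final vector Imp normalized so its entries sum to 1.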
The source code walkthrough follows:
def featureImportances[M <: DecisionTreeModel](trees: Array[M], numFeatures: Int): Vector = {
  val totalImportances = new OpenHashMap[Int, Double]()
  // Iterate over every decision tree model in the ensemble
  trees.foreach { tree =>
    // Aggregate feature importance vector for this tree
    val importances = new OpenHashMap[Int, Double]()
    // Starting from the root, traverse all internal nodes of the tree and
    // accumulate the importance contributed to each feature
    computeFeatureImportance(tree.rootNode, importances)
    // Normalize importance vector for this tree, and add it to total.
    // TODO: In the future, also support normalizing by tree.rootNode.impurityStats.count?
    val treeNorm = importances.map(_._2).sum
    if (treeNorm != 0) {
      importances.foreach { case (idx, impt) =>
        val normImpt = impt / treeNorm
        totalImportances.changeValue(idx, normImpt, _ + normImpt)
      }
    }
  }
  // Normalize the overall feature importances
  normalizeMapValues(totalImportances)
  // Construct the final feature importance vector
  val d = if (numFeatures != -1) {
    numFeatures
  } else {
    // Find max feature index used in trees
    val maxFeatureIndex = trees.map(_.maxSplitFeatureIndex()).max
    maxFeatureIndex + 1
  }
  if (d == 0) {
    assert(totalImportances.size == 0, s"Unknown error in computing feature" +
      s" importance: No splits found, but some non-zero importances.")
  }
  val (indices, values) = totalImportances.iterator.toSeq.sortBy(_._1).unzip
  Vectors.sparse(d, indices.toArray, values.toArray)
}
where the computeFeatureImportance method is:
// Recursive method that computes the feature importances of a single decision tree
def computeFeatureImportance(
    node: Node,
    importances: OpenHashMap[Int, Double]): Unit = {
  node match {
    // Internal node, i.e. a node that splits on some feature
    case n: InternalNode =>
      // Index of the splitting feature
      val feature = n.split.featureIndex
      // Scaled gain: the information gain times the number of training instances at this node
      val scaledGain = n.gain * n.impurityStats.count
      importances.changeValue(feature, scaledGain, _ + scaledGain)
      // Pre-order traversal of the binary decision tree
      computeFeatureImportance(n.leftChild, importances)
      computeFeatureImportance(n.rightChild, importances)
    case n: LeafNode =>
      // do nothing
  }
}
(4) Source code for computing feature impurity in MLlib's decision tree algorithm
The InternalNode class uses a private instance of the ImpurityCalculator class, impurityStats, to record impurity information and state; which impurity measure is used is selected through the getCalculator method:
def getCalculator(impurity: String, stats: Array[Double]): ImpurityCalculator = {
  impurity match {
    case "gini" => new GiniCalculator(stats)
    case "entropy" => new EntropyCalculator(stats)
    case "variance" => new VarianceCalculator(stats)
    case _ =>
      throw new IllegalArgumentException(
        s"ImpurityCalculator builder did not recognize impurity type: $impurity")
  }
}
Taking the Gini index as an example, its impurity is computed as follows:
@Since("1.1.0")
@DeveloperApi
override def calculate(counts: Array[Double], totalCount: Double): Double = {
  if (totalCount == 0) {
    return 0
  }
  val numClasses = counts.length
  var impurity = 1.0
  var classIndex = 0
  while (classIndex < numClasses) {
    val freq = counts(classIndex) / totalCount
    impurity -= freq * freq
    classIndex += 1
  }
  impurity
}
The walkthrough above traces the path from how the ensemble methods compute feature importance down to how the decision tree algorithm computes the impurity at each node.
三什黑、最大相關(guān)最小冗余(mRMR)算法
(1)互信息
互信息可以看成是一個(gè)隨機(jī)變量中包含的關(guān)于另一個(gè)隨機(jī)變量的信息量犬绒,或者說(shuō)是一個(gè)隨機(jī)變量由于已知另一個(gè)隨機(jī)變量而減少的不確定性《以洌互信息本來(lái)是信息論中的一個(gè)概念,用于表示信息之間的關(guān)系, 是兩個(gè)隨機(jī)變量統(tǒng)計(jì)相關(guān)性的測(cè)度。
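For discrete random variables X and Y, the standard definition is:

I(X;Y) = \sum_{x}\sum_{y} p(x, y)\,\log \frac{p(x, y)}{p(x)\,p(y)} = H(X) - H(X \mid Y)

so I(X;Y) is zero when X and Y are independent and grows with the strength of their statistical dependence.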
(2) mRMR
The mRMR algorithm was proposed for feature selection mainly to address the problem that the m features obtained by maximizing the relevance between individual features and the target variable are not necessarily the m features that yield the best prediction accuracy.
The evaluation methods described earlier essentially keep features that are strongly related to the target variable, but such a set may still contain redundant features (for example, if the target variable is the volume of a box and the features are the length of the base, the width of the base, and the area of the base, then the area can be derived from the length and the width and can therefore be regarded as redundant information). The mRMR algorithm removes redundant features while preserving maximum relevance, which amounts to obtaining the "purest" feature subset: features that differ greatly from one another while each remaining strongly related to the target variable.
As a special case, correlation between variables can be replaced by statistical dependency, and mutual information is one way to measure that dependency.
mRMR can be viewed as an approximation to maximizing the dependency between the joint distribution of the selected feature subset and the target variable.
mRMR itself is still a filter-type feature selection method.
Relevance and redundancy can be considered jointly through max(V − W) or max(V / W), which serves as the criterion for evaluating features.
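In the usual mRMR notation (the symbols V and W match the line above; this follows the standard Peng et al. formulation rather than anything quoted from the repost), for a candidate subset S and class variable c:

V = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c),
\qquad
W = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j)

and the two combined criteria are the difference form \max_S (V - W) (MID) and the quotient form \max_S (V / W) (MIQ).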
(3) The Spark implementation of mRMR (source code)
The mRMR algorithm consists of several steps:
Transform the data (note: to compute the joint and marginal distributions of pairs of features, the data must be normalized into [0, 255], and each feature dimension must be stored in a suitable data structure).
Compute the distributions and the mutual information between features, and between each feature and the response variable.
Score each feature with the mRMR criterion and rank the features (the incremental form of the criterion is given below).
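In practice the subset criterion is optimized incrementally, which is what the selectFeatures code further below does: given the already selected set S_{m-1}, the next feature is the one maximizing

\max_{x_j \in X \setminus S_{m-1}} \left[ I(x_j; c) \;-\; \frac{1}{m-1} \sum_{x_i \in S_{m-1}} I(x_j; x_i) \right]

i.e. high relevance to the class c minus the average redundancy with the features chosen so far. This is the standard first-order incremental mRMR update, stated here for reference rather than quoted from the original post.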
private[feature] def run(
    data: RDD[LabeledPoint],
    nToSelect: Int,
    numPartitions: Int) = {
  val nPart = if (numPartitions == 0) data.context.getConf.getInt(
      "spark.default.parallelism", 500) else numPartitions

  val requireByteValues = (l: Double, v: Vector) => {
    val values = v match {
      case SparseVector(size, indices, values) =>
        values
      case DenseVector(values) =>
        values
    }
    val condition = (value: Double) => value <= 255 && value >= 0
    if (!values.forall(condition(_)) || !condition(l)) {
      throw new SparkException(s"Info-Theoretic Framework requires positive values in range [0, 255]")
    }
  }

  val nAllFeatures = data.first.features.size + 1
  // Lay the data out in columnar form; in effect, assign every value a unique index
  val columnarData: RDD[(Long, Short)] = data.zipWithIndex().flatMap ({
    case (LabeledPoint(label, values: SparseVector), r) =>
      requireByteValues(label, values)
      // Not implemented yet!
      throw new NotImplementedError()
    case (LabeledPoint(label, values: DenseVector), r) =>
      requireByteValues(label, values)
      val rindex = r * nAllFeatures
      val inputs = for (i <- 0 until values.size) yield (rindex + i, values(i).toShort)
      val output = Array((rindex + values.size, label.toShort))
      inputs ++ output
  }).sortByKey(numPartitions = nPart) // put numPartitions parameter

  columnarData.persist(StorageLevel.MEMORY_AND_DISK_SER)

  require(nToSelect < nAllFeatures)
  // Run the mRMR procedure and rank the features
  val selected = selectFeatures(columnarData, nToSelect, nAllFeatures)

  columnarData.unpersist()

  // Print best features according to the mRMR measure
  val out = selected.map { case F(feat, rel) => (feat + 1) + "\t" + "%.4f".format(rel) }.mkString("\n")
  println("\n*** mRMR features ***\nFeature\tScore\n" + out)

  // Features must be sorted
  new SelectorModel(selected.map { case F(feat, rel) => feat }.sorted.toArray)
}
The feature selection process based on mutual information and the mRMR criterion is as follows:
/**
* Perform a info-theory selection process.
*
* @param data Columnar data (last element is the class attribute).
* @param nToSelect Number of features to select.
* @param nFeatures Number of total features in the dataset.
* @return A list with the most relevant features and its scores.
*
 */
private[feature] def selectFeatures(
    data: RDD[(Long, Short)],
    nToSelect: Int,
    nFeatures: Int) = {

  // Index of the label column
  val label = nFeatures - 1
  // data is keyed as (index, feature value), so this is the number of instances
  val nInstances = data.count() / nFeatures
  // Group the values of each feature by key and take max + 1
  // (used later as the histogram size for that feature)
  val counterByKey = data.map({ case (k, v) => (k % nFeatures).toInt -> v })
        .distinct().groupByKey().mapValues(_.max + 1).collectAsMap().toMap

  // calculate relevance
  val MiAndCmi = IT.computeMI(
      data, 0 until label, label, nInstances, nFeatures, counterByKey)
  // Mutual information pool used by the mRMR criterion; pool maps feat -> MrmrCriterion
  var pool = MiAndCmi.map { case (x, mi) => (x, new MrmrCriterion(mi)) }
    .collectAsMap()
  // Print most relevant features
  val strRels = MiAndCmi.collect().sortBy(-_._2)
    .take(nToSelect)
    .map({ case (f, mi) => (f + 1) + "\t" + "%.4f" format mi })
    .mkString("\n")
  println("\n*** MaxRel features ***\nFeature\tScore\n" + strRels)

  // get maximum and select it
  // The feature with the highest score, together with its mRMR criterion value
  val firstMax = pool.maxBy(_._2.score)
  var selected = Seq(F(firstMax._1, firstMax._2.score))
  // Remove the key corresponding to firstMax from the pool map
  pool = pool - firstMax._1

  while (selected.size < nToSelect) {
    // update pool
    val newMiAndCmi = IT.computeMI(data, pool.keys.toSeq,
        selected.head.feat, nInstances, nFeatures, counterByKey)
        .map({ case (x, crit) => (x, crit) })
        .collectAsMap()

    pool.foreach({ case (k, crit) =>
      // For feature k in the pool, fetch its MI with the newly selected feature from newMiAndCmi
      newMiAndCmi.get(k) match {
        case Some(mi) => crit.update(mi)
        case None =>
      }
    })

    // get maximum and save it
    val max = pool.maxBy(_._2.score)
    // select the best feature and remove from the whole set of features
    selected = F(max._1, max._2.score) +: selected
    pool = pool - max._1
  }
  selected.reverse
}
The code that actually computes the mutual information is:
/**
* Method that calculates mutual information (MI) and conditional mutual information (CMI)
* simultaneously for several variables. Indexes must be disjoint.
*
* @param rawData RDD of data (first element is the class attribute)
* @param varX Indexes of primary variables (must be disjoint with Y and Z)
* @param varY Indexes of secondary variable (must be disjoint with X and Z)
* @param nInstances Number of instances
* @param nFeatures Number of features (including output ones)
* @return? RDD of (primary var, (MI, CMI))
*
 */
def computeMI(
    rawData: RDD[(Long, Short)],
    varX: Seq[Int],
    varY: Int,
    nInstances: Long,
    nFeatures: Int,
    counter: Map[Int, Int]) = {

  // Pre-requisites
  require(varX.size > 0)

  // Broadcast variables
  val sc = rawData.context
  val label = nFeatures - 1
  // A boolean vector that indicates the variables involved on this computation
  // (one boolean flag per feature dimension of the data)
  val fselected = Array.ofDim[Boolean](nFeatures)
  fselected(varY) = true // output feature
  varX.map(fselected(_) = true) // mark the X variables as selected
  val bFeatSelected = sc.broadcast(fselected)
  val getFeat = (k: Long) => (k % nFeatures).toInt
  // Filter data by these variables
  // (keep only the columns flagged in bFeatSelected)
  val data = rawData.filter({ case (k, _) => bFeatSelected.value(getFeat(k)) })

  // Broadcast Y vector
  val yCol: Array[Short] = if (varY == label) {
    // classCol corresponds with output attribute, which is re-used in the iteration
    classCol = data.filter({ case (k, _) => getFeat(k) == varY }).values.collect()
    classCol
  } else {
    data.filter({ case (k, _) => getFeat(k) == varY }).values.collect()
  }

  // data holds all selected feature columns; (varY, yCol) is y's column index and its values
  // Build histograms relating each feature to y
  val histograms = computeHistograms(data, (varY, yCol), nFeatures, counter)
  // Normalize the counts into proportions: the joint distribution of each feature and the target
  val jointTable = histograms.mapValues(_.map(_.toFloat / nInstances))
  // sum(h(*, ::)) sums each row, giving the marginal distribution
  val marginalTable = jointTable.mapValues(h => sum(h(*, ::)).toDenseVector)

  // If y corresponds with output feature, we save for CMI computation
  if (varY == label) {
    marginalProb = marginalTable.cache()
    jointProb = jointTable.cache()
  }

  val yProb = marginalTable.lookup(varY)(0)
  // Remove output feature from the computations
  val fdata = histograms.filter { case (k, _) => k != label }
  // fdata holds the joint counts of each feature with y; yProb is y's marginal distribution
  computeMutualInfo(fdata, yProb, nInstances)
}
The method that computes the distribution histograms of the data:
private def computeHistograms(
    data: RDD[(Long, Short)],
    yCol: (Int, Array[Short]),
    nFeatures: Long,
    counter: Map[Int, Int]) = {

  val maxSize = 256
  val byCol = data.context.broadcast(yCol._2)
  val bCounter = data.context.broadcast(counter)
  // Number of distinct values of y (its maximum value + 1)
  val ys = counter.getOrElse(yCol._1, maxSize).toInt

  // mapPartitions operates on each RDD partition; `it` is the partition iterator
  // The result is a map of (feature, count matrix) pairs
  data.mapPartitions({ it =>
    var result = Map.empty[Int, BDM[Long]]
    for ((k, x) <- it) {
      val feat = (k % nFeatures).toInt
      val inst = (k / nFeatures).toInt
      // Number of distinct values of this feature
      val xs = bCounter.value.getOrElse(feat, maxSize).toInt
      // Create (or reuse) an (xMax, yMax) count matrix for this feature
      val m = result.getOrElse(feat, BDM.zeros[Long](xs, ys))
      m(x, byCol.value(inst)) += 1
      result += feat -> m
    }
    result.toIterator
  }).reduceByKey(_ + _)
}
The mutual information itself is computed as follows:
private def computeMutualInfo(
    data: RDD[(Int, BDM[Long])],
    yProb: BDV[Float],
    n: Long) = {

  val byProb = data.context.broadcast(yProb)
  data.mapValues({ m =>
    var mi = 0.0d
    // Aggregate by row (x)
    val xProb = sum(m(*, ::)).map(_.toFloat / n)
    for (i <- 0 until m.rows) {
      for (j <- 0 until m.cols) {
        val pxy = m(i, j).toFloat / n
        val py = byProb.value(j); val px = xProb(i)
        if (pxy != 0 && px != 0 && py != 0) // To avoid NaNs
          // I(x,y) = sum[ p(x,y) log( p(x,y) / (p(x)p(y)) ) ]
          mi += pxy * (math.log(pxy / (px * py)) / math.log(2))
      }
    }
    mi
  })
}
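In terms of the count matrices built by computeHistograms, the value produced above is the plug-in estimate of the mutual information, the same quantity defined in section (1) but estimated from empirical frequencies:

\hat{I}(X;Y) = \sum_{x, y} \frac{n_{xy}}{n} \log_2 \frac{n_{xy}/n}{(n_x/n)\,(n_y/n)}

where n_{xy} = m(x, y) is a cell of the per-feature count matrix, n_x is its row sum (xProb before division), n_y/n is the corresponding entry of yProb, and n is the total number of instances.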