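1. What is One-Hot Encoding
One-Hot Encoding is a code with as many bits as there are states, in which exactly one bit is 1 and all the others are 0. In machine learning (Logistic Regression, SVM, and so on), discrete categorical data has to be converted to numbers before it can be used. Take gender as an attribute that can only take three values: male, female, or other. How should these three values be expressed numerically? A simple approach is male = 0, female = 1, other = 2. What is wrong with that?
Representing categories with such a simple integer sequence can hurt training, because the model treats the codes as magnitudes. The arbitrary numbers imply an order and a distance between categories that do not exist, and a larger code gives the same feature more weight in a sample. If a category were encoded as 1000 instead of 1, it would influence the model far more, even though nothing about the category itself has changed. One-hot encoding is introduced to remove this side effect of the numeric representation: every value of a categorical feature gets its own 0/1 position.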
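The idea can be sketched in a few lines of plain Scala, with no Spark involved. The object name `OneHotSketch` and the function `oneHot` are made up for this illustration; the three gender values are the ones from the text above.

```scala
object OneHotSketch {
  /** Encode `value` as a vector with one slot per category:
    * 1.0 in the slot of `value`, 0.0 everywhere else. */
  def oneHot(categories: Seq[String], value: String): Vector[Double] = {
    val idx = categories.indexOf(value)
    require(idx >= 0, s"unknown category: $value")
    Vector.tabulate(categories.length)(i => if (i == idx) 1.0 else 0.0)
  }

  def main(args: Array[String]): Unit = {
    // The three values used as an example in the text
    val genders = Seq("male", "female", "other")
    genders.foreach { g =>
      println(s"$g -> ${oneHot(genders, g).mkString("[", ", ", "]")}")
    }
  }
}
```

Every encoded value has the same magnitude, so no category outweighs another merely because of the number assigned to it.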
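2. One-Hot Encoding in Spark
2.1 Dataset preview
The fields in the data have the following meanings:
affairs:Double //whether the person has had an extramarital affair
gender:String //gender
age:Double //age
yearsmarried:Double //years married
children:String //whether the person has children
religiousness:Double //degree of religiousness (5-point scale; 1 = opposed to religion, 5 = very religious)
education:Double //education level
occupation:Double //occupation (Gordon's 7-category classification, reverse-numbered)
rating:Double //self-rating of the marriage (5-point scale; 1 = very unhappy, 5 = very happy)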
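2.2 Loading the dataset
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Run locally with 4 threads; spark.testing.memory overrides the amount of memory Spark assumes it has
val conf = new SparkConf().setMaster("local[4]").setAppName(getClass.getSimpleName).set("spark.testing.memory", "2147480000")
val sparkContext = new SparkContext(conf)
val sqlContext = new HiveContext(sparkContext)
// Feature column names (not used in the steps below)
val colArray2 = Array("gender", "age", "yearsmarried", "children", "religiousness", "education", "occupation", "rating")
val logPath = "E:\\spark_workspace\\spark-study\\src\\main\\files\\lr_test03.json"
import sqlContext.implicits._
// Read the JSON file and keep the label column plus the feature columns
val dataDF = sqlContext.read.json(logPath).select($"affairs", $"gender", $"age", $"yearsmarried", $"children", $"religiousness", $"education", $"occupation", $"rating")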
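2.3 Processing the dataset with OneHotEncoder
import scala.collection.mutable.ListBuffer
import org.apache.spark.ml.PipelineStage
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

/** Columns to be one-hot encoded */
val categoricalColumns = Array("gender", "children")
/** Collect the stages so that the machine learning workflow can run as a Pipeline */
val stagesArray = new ListBuffer[PipelineStage]()
for (cate <- categoricalColumns) {
  /** Use StringIndexer to build a numeric index for each category */
  val indexer = new StringIndexer().setInputCol(cate).setOutputCol(s"${cate}Index")
  /** Use OneHotEncoder to convert the category index into a sparse binary vector */
  val encoder = new OneHotEncoder().setInputCol(indexer.getOutputCol).setOutputCol(s"${cate}classVec")
  stagesArray.append(indexer, encoder)
}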
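What the two stages do to a single column can be approximated in plain Scala. This is only a sketch: `IndexThenEncodeSketch`, `fitIndex`, and `encodeDropLast` are hypothetical names, and the sample values are made up. It assumes the default behaviour of the two Spark stages: `StringIndexer` gives index 0 to the most frequent label, and `OneHotEncoder` drops the last category (`dropLast = true`), so a column with n categories becomes a vector of length n - 1. Ties in frequency are broken alphabetically here, which is an assumption of the sketch.

```scala
object IndexThenEncodeSketch {
  /** Assign indices by descending frequency; ties are broken alphabetically (sketch assumption). */
  def fitIndex(values: Seq[String]): Map[String, Int] =
    values.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .toMap

  /** One-hot vector with the last category dropped: length is numCategories - 1,
    * and the last category is encoded as all zeros. */
  def encodeDropLast(index: Int, numCategories: Int): Vector[Double] =
    Vector.tabulate(numCategories - 1)(i => if (i == index) 1.0 else 0.0)

  def main(args: Array[String]): Unit = {
    val gender = Seq("female", "male", "female", "female", "male") // made-up sample
    val index = fitIndex(gender)
    gender.distinct.foreach { g =>
      println(s"$g -> index ${index(g)} -> ${encodeDropLast(index(g), index.size)}")
    }
  }
}
```

With two categories the encoded vector has a single position, so the category with index 1 is represented by a zero in that position rather than by a slot of its own.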
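2.4 Merging all features into a single vector with VectorAssembler
import org.apache.spark.ml.feature.VectorAssembler

// Caution: "affairs" is also the label in section 2.6. The outputs shown below were produced with it in this list,
// but using the label as a feature leaks the answer to the model; drop it from numericCols in real use.
val numericCols = Array("affairs", "age", "yearsmarried", "religiousness", "education", "occupation", "rating")
val assemblerInputs = categoricalColumns.map(_ + "classVec") ++ numericCols
/** Use VectorAssembler to concatenate all input columns into one vector column */
val assembler = new VectorAssembler().setInputCols(assemblerInputs).setOutputCol("features")
stagesArray.append(assembler)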
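Conceptually, the assembler just concatenates the values of the input columns in the order they are listed. The plain-Scala sketch below mimics that for one row; `AssembleSketch` and `assemble` are hypothetical names, and the row values are taken from the first row of the output table in section 2.5.

```scala
object AssembleSketch {
  /** Concatenate the values of each column, in the given column order, into one feature vector. */
  def assemble(columns: Seq[String], row: Map[String, Seq[Double]]): Vector[Double] =
    columns.flatMap(c => row(c)).toVector

  def main(args: Array[String]): Unit = {
    // Same order as assemblerInputs: encoded categorical columns first, then numeric columns
    val columns = Seq("genderclassVec", "childrenclassVec", "affairs", "age",
      "yearsmarried", "religiousness", "education", "occupation", "rating")
    val row = Map(
      "genderclassVec" -> Seq(0.0), "childrenclassVec" -> Seq(0.0),
      "affairs" -> Seq(0.0), "age" -> Seq(37.0), "yearsmarried" -> Seq(10.0),
      "religiousness" -> Seq(3.0), "education" -> Seq(18.0),
      "occupation" -> Seq(7.0), "rating" -> Seq(4.0))
    println(assemble(columns, row).mkString("[", ",", "]"))
  }
}
```

The order of the input columns therefore fixes the meaning of each position in the `features` vector.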
2.5 Running the PipelineStages as a Pipeline
import org.apache.spark.ml.Pipeline

val pipeline = new Pipeline()
pipeline.setStages(stagesArray.toArray)
/** fit() computes whatever feature statistics the stages need */
val pipelineModel = pipeline.fit(dataDF)
/** transform() performs the actual feature conversion */
val dataset = pipelineModel.transform(dataDF)
dataset.show(false)
The dataset after One-Hot Encoding is shown below:
+-------+------+----+------------+--------+-------------+---------+----------+------+-----------+--------------+-------------+----------------+----------------------------------------+
|affairs|gender|age |yearsmarried|children|religiousness|education|occupation|rating|genderIndex|genderclassVec|childrenIndex|childrenclassVec|features |
+-------+------+----+------------+--------+-------------+---------+----------+------+-----------+--------------+-------------+----------------+----------------------------------------+
|0.0 |male |37.0|10.0 |no |3.0 |18.0 |7.0 |4.0 |1.0 |(1,[],[]) |1.0 |(1,[],[]) |[0.0,0.0,0.0,37.0,10.0,3.0,18.0,7.0,4.0]|
|0.0 |female|27.0|4.0 |no |4.0 |14.0 |6.0 |4.0 |0.0 |(1,[0],[1.0]) |1.0 |(1,[],[]) |[1.0,0.0,0.0,27.0,4.0,4.0,14.0,6.0,4.0] |
|0.0 |female|32.0|15.0 |yes |1.0 |12.0 |1.0 |4.0 |0.0 |(1,[0],[1.0]) |0.0 |(1,[0],[1.0]) |[1.0,1.0,0.0,32.0,15.0,1.0,12.0,1.0,4.0]|
|0.0 |male |57.0|15.0 |yes |5.0 |18.0 |6.0 |5.0 |1.0 |(1,[],[]) |0.0 |(1,[0],[1.0]) |[0.0,1.0,0.0,57.0,15.0,5.0,18.0,6.0,5.0]|
|0.0 |male |22.0|0.75 |no |2.0 |17.0 |6.0 |3.0 |1.0 |(1,[],[]) |1.0 |(1,[],[]) |[0.0,0.0,0.0,22.0,0.75,2.0,17.0,6.0,3.0]|
|0.0 |female|32.0|1.5 |no |2.0 |17.0 |5.0 |5.0 |0.0 |(1,[0],[1.0]) |1.0 |(1,[],[]) |[1.0,0.0,0.0,32.0,1.5,2.0,17.0,5.0,5.0] |
|0.0 |female|22.0|0.75 |no |2.0 |12.0 |1.0 |3.0 |0.0 |(1,[0],[1.0]) |1.0 |(1,[],[]) |[1.0,0.0,0.0,22.0,0.75,2.0,12.0,1.0,3.0]|
|0.0 |male |57.0|15.0 |yes |2.0 |14.0 |4.0 |4.0 |1.0 |(1,[],[]) |0.0 |(1,[0],[1.0]) |[0.0,1.0,0.0,57.0,15.0,2.0,14.0,4.0,4.0]|
|0.0 |female|32.0|15.0 |yes |4.0 |16.0 |1.0 |2.0 |0.0 |(1,[0],[1.0]) |0.0 |(1,[0],[1.0]) |[1.0,1.0,0.0,32.0,15.0,4.0,16.0,1.0,2.0]|
|0.0 |male |22.0|1.5 |no |4.0 |14.0 |4.0 |5.0 |1.0 |(1,[],[]) |1.0 |(1,[],[]) |[0.0,0.0,0.0,22.0,1.5,4.0,14.0,4.0,5.0] |
|0.0 |male |37.0|15.0 |yes |2.0 |20.0 |7.0 |2.0 |1.0 |(1,[],[]) |0.0 |(1,[0],[1.0]) |[0.0,1.0,0.0,37.0,15.0,2.0,20.0,7.0,2.0]|
|0.0 |male |27.0|4.0 |yes |4.0 |18.0 |6.0 |4.0 |1.0 |(1,[],[]) |0.0 |(1,[0],[1.0]) |[0.0,1.0,0.0,27.0,4.0,4.0,18.0,6.0,4.0] |
|0.0 |male |47.0|15.0 |yes |5.0 |17.0 |6.0 |4.0 |1.0 |(1,[],[]) |0.0 |(1,[0],[1.0]) |[0.0,1.0,0.0,47.0,15.0,5.0,17.0,6.0,4.0]|
|0.0 |female|22.0|1.5 |no |2.0 |17.0 |5.0 |4.0 |0.0 |(1,[0],[1.0]) |1.0 |(1,[],[]) |[1.0,0.0,0.0,22.0,1.5,2.0,17.0,5.0,4.0] |
|0.0 |female|27.0|4.0 |no |4.0 |14.0 |5.0 |4.0 |0.0 |(1,[0],[1.0]) |1.0 |(1,[],[]) |[1.0,0.0,0.0,27.0,4.0,4.0,14.0,5.0,4.0] |
|0.0 |female|37.0|15.0 |yes |1.0 |17.0 |5.0 |5.0 |0.0 |(1,[0],[1.0]) |0.0 |(1,[0],[1.0]) |[1.0,1.0,0.0,37.0,15.0,1.0,17.0,5.0,5.0]|
|0.0 |female|37.0|15.0 |yes |2.0 |18.0 |4.0 |3.0 |0.0 |(1,[0],[1.0]) |0.0 |(1,[0],[1.0]) |[1.0,1.0,0.0,37.0,15.0,2.0,18.0,4.0,3.0]|
|0.0 |female|22.0|0.75 |no |3.0 |16.0 |5.0 |4.0 |0.0 |(1,[0],[1.0]) |1.0 |(1,[],[]) |[1.0,0.0,0.0,22.0,0.75,3.0,16.0,5.0,4.0]|
|0.0 |female|22.0|1.5 |no |2.0 |16.0 |5.0 |5.0 |0.0 |(1,[0],[1.0]) |1.0 |(1,[],[]) |[1.0,0.0,0.0,22.0,1.5,2.0,16.0,5.0,5.0] |
|0.0 |female|27.0|10.0 |yes |2.0 |14.0 |1.0 |5.0 |0.0 |(1,[0],[1.0]) |0.0 |(1,[0],[1.0]) |[1.0,1.0,0.0,27.0,10.0,2.0,14.0,1.0,5.0]|
+-------+------+----+------------+--------+-------------+---------+----------+------+-----------+--------------+-------------+----------------+----------------------------------------+
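The `genderclassVec` and `childrenclassVec` columns are printed in sparse form, `(size,[indices],[values])`, where every position not listed is 0. The small plain-Scala sketch below expands that notation into a dense vector; `SparseNotationSketch` and `toDense` are hypothetical names used only for this illustration.

```scala
object SparseNotationSketch {
  /** Expand (size, indices, values) into a dense vector; positions not listed are 0.0. */
  def toDense(size: Int, indices: Seq[Int], values: Seq[Double]): Vector[Double] = {
    require(indices.length == values.length, "indices and values must have the same length")
    val nonZero = indices.zip(values).toMap
    Vector.tabulate(size)(i => nonZero.getOrElse(i, 0.0))
  }

  def main(args: Array[String]): Unit = {
    println(toDense(1, Seq(0), Seq(1.0))) // the form shown as (1,[0],[1.0])
    println(toDense(1, Seq.empty, Seq.empty)) // the form shown as (1,[],[])
  }
}
```

Read this way, `(1,[0],[1.0])` is the one-element vector `[1.0]` and `(1,[],[])` is `[0.0]`, which matches the first two entries of each `features` value in the table.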
2.6 Training and evaluating the model
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

/** Randomly split the data into a training set and a test set; a fixed seed makes the split reproducible */
val Array(trainingDF, testDF) = dataset.randomSplit(Array(0.6, 0.4), seed = 12345)
println(s"trainingDF size=${trainingDF.count()},testDF size=${testDF.count()}")
val lrModel = new LogisticRegression().setLabelCol("affairs").setFeaturesCol("features").fit(trainingDF)
val predictions = lrModel.transform(testDF).select($"affairs".as("label"), $"features", $"rawPrediction", $"probability", $"prediction")
predictions.show(false)
/** Evaluate the model with BinaryClassificationEvaluator; the metric is chosen through the metricName parameter */
val evaluator = new BinaryClassificationEvaluator()
evaluator.setMetricName("areaUnderROC")
val auc = evaluator.evaluate(predictions)
println(s"areaUnderROC=$auc")
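To make the `areaUnderROC` metric more concrete, the following plain-Scala sketch computes AUC from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. This is not how Spark implements the evaluator (Spark derives the value from the ROC curve); `AucSketch` and `auc` are hypothetical names, and the scores in `main` are made up.

```scala
object AucSketch {
  /** AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as 0.5.
    * Each element of `scored` is (score for the positive class, label), with label 1.0 or 0.0. */
  def auc(scored: Seq[(Double, Double)]): Double = {
    val pos = scored.collect { case (score, label) if label == 1.0 => score }
    val neg = scored.collect { case (score, label) if label == 0.0 => score }
    require(pos.nonEmpty && neg.nonEmpty, "both classes must be present")
    val wins = for (p <- pos; n <- neg) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    wins.sum / (pos.size.toLong * neg.size)
  }

  def main(args: Array[String]): Unit = {
    // Made-up (score, label) pairs
    val scored = Seq((0.9, 1.0), (0.8, 0.0), (0.7, 1.0), (0.1, 0.0))
    println(s"auc=${auc(scored)}")
  }
}
```

A value of 1.0 means every positive example is ranked above every negative one, while 0.5 is what a random ranking would achieve on average.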
The data after prediction with the model is shown below:
+-----+-----------------------------------------+----------------------------------------+-------------------------------------------+----------+
|label|features |rawPrediction |probability |prediction|
+-----+-----------------------------------------+----------------------------------------+-------------------------------------------+----------+
|0.0 |[1.0,0.0,0.0,22.0,0.125,4.0,14.0,4.0,5.0]|[24.24907721362884,-24.24907721362884] |[0.999999999970572,2.942792055040055E-11] |0.0 |
|0.0 |[1.0,0.0,0.0,22.0,0.417,1.0,17.0,6.0,4.0]|[21.290119589459323,-21.290119589459323]|[0.9999999994326925,5.673075233382041E-10] |0.0 |
|0.0 |[1.0,0.0,0.0,22.0,0.417,5.0,14.0,1.0,4.0]|[24.17979109657276,-24.17979109657276] |[0.9999999999684608,3.1539162239002745E-11]|0.0 |
|0.0 |[1.0,1.0,0.0,22.0,0.417,3.0,14.0,3.0,5.0]|[22.67775610810491,-22.67775610810491] |[0.9999999998583633,1.4163665456478983E-10]|0.0 |
|0.0 |[1.0,0.0,0.0,22.0,0.75,2.0,12.0,1.0,3.0] |[18.511403509878832,-18.511403509878832]|[0.9999999908672915,9.13270857267764E-9] |0.0 |
|0.0 |[1.0,0.0,0.0,22.0,0.75,4.0,16.0,1.0,5.0] |[25.35929557565844,-25.35929557565844] |[0.999999999990304,9.69611742832185E-12] |0.0 |
|0.0 |[1.0,0.0,0.0,22.0,0.75,5.0,14.0,3.0,5.0] |[25.260012900022847,-25.260012900022847]|[0.9999999999892919,1.070818300382037E-11] |0.0 |
|0.0 |[1.0,0.0,0.0,22.0,0.75,5.0,18.0,1.0,5.0] |[27.56176640273893,-27.56176640273893] |[0.9999999999989282,1.0717091528412073E-12]|0.0 |
|0.0 |[1.0,0.0,0.0,22.0,1.5,2.0,14.0,4.0,5.0] |[21.806773356131036,-21.806773356131036]|[0.9999999996615936,3.3840647423836113E-10]|0.0 |
|0.0 |[1.0,0.0,0.0,22.0,1.5,2.0,16.0,5.0,5.0] |[22.87962909201085,-22.87962909201085] |[0.9999999998842548,1.1574529263994485E-10]|0.0 |
|0.0 |[1.0,0.0,0.0,22.0,1.5,2.0,16.0,5.0,5.0] |[22.87962909201085,-22.87962909201085] |[0.9999999998842548,1.1574529263994485E-10]|0.0 |
|0.0 |[1.0,0.0,0.0,22.0,1.5,4.0,16.0,5.0,3.0] |[22.617887847315348,-22.617887847315348]|[0.9999999998496247,1.5037516453560028E-10]|0.0 |
|0.0 |[1.0,1.0,0.0,22.0,1.5,3.0,16.0,5.0,5.0] |[23.505953663596607,-23.505953663596607]|[0.9999999999381279,6.187198251529256E-11] |0.0 |
|0.0 |[1.0,0.0,0.0,22.0,4.0,4.0,17.0,5.0,5.0] |[25.142053761516753,-25.142053761516753]|[0.9999999999879512,1.2048827525325212E-11]|0.0 |
|0.0 |[1.0,0.0,0.0,27.0,1.5,2.0,16.0,6.0,5.0] |[23.342953469838886,-23.342953469838886]|[0.9999999999271745,7.282560759398736E-11] |0.0 |
|0.0 |[1.0,0.0,0.0,27.0,1.5,2.0,18.0,6.0,5.0] |[24.454819713457812,-24.454819713457812]|[0.9999999999760445,2.3955582882827004E-11]|0.0 |
|0.0 |[1.0,0.0,0.0,27.0,1.5,3.0,18.0,5.0,2.0] |[21.920009187230548,-21.920009187230548]|[0.9999999996978233,3.021766947986581E-10] |0.0 |
|0.0 |[1.0,0.0,0.0,27.0,4.0,2.0,18.0,5.0,5.0] |[24.01911260197023,-24.01911260197023] |[0.9999999999629634,3.703667040712842E-11] |0.0 |
|0.0 |[1.0,0.0,0.0,27.0,4.0,3.0,16.0,5.0,4.0] |[22.776375736003562,-22.776375736003562]|[0.9999999998716649,1.2833517289922962E-10]|0.0 |
|0.0 |[1.0,1.0,0.0,27.0,4.0,2.0,18.0,6.0,1.0] |[18.629921259118063,-18.629921259118063]|[0.999999991887999,8.112000996701378E-9] |0.0 |
+-----+-----------------------------------------+----------------------------------------+-------------------------------------------+----------+
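The `rawPrediction` and `probability` columns in this table are consistent with the logistic (sigmoid) function that binary logistic regression uses to turn a raw score into a probability. The sketch below checks this for the first row of the table; `SigmoidSketch` and `sigmoid` are names chosen for the illustration.

```scala
object SigmoidSketch {
  /** Logistic function: maps a raw score in (-inf, +inf) to a probability in (0, 1). */
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def main(args: Array[String]): Unit = {
    // First entry of rawPrediction in the first row of the table above
    val raw0 = 24.24907721362884
    println(s"P(label = 0) = ${sigmoid(raw0)}")
    println(s"P(label = 1) = ${sigmoid(-raw0)}")
  }
}
```

Raw scores of magnitude 18 to 27, as in every row shown, push the probabilities to within about 1e-8 of 0 or 1, which is why each prediction looks almost certain.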