數(shù)據(jù)與變量:
加載數(shù)據(jù)
> cup98 <- read.csv("F:/cup98lrn.txt")
> dim(cup98[,1:10])
[1] 95412? ? 10
> n.missing<-rowSums(is.na(cup98))
#計算存在NA值的行數(shù)
> sum(n.missing>0)
[1] 95412
選擇變量
> varSet<-c(
#demographics
"ODATEDW","OSOURCE","STATE","ZIP","PVASTATE","DOB","RECINHSE","MDMAUD","DOMAIN","CLUSTER","AGE","HOMEOWNR","CHILD03","CHILD07","CHILD12","CHILD18","NUMCHLD","INCOME","GENDER","WEALTH1","HIT",
#donor interests
"COLLECT1","VETERANS","BIBLE","CATLG","HOMEE","PETS","CDPLAY","STEREO","PCOWNERS","PHOTO","CRAFTS","FISHER","GARDENIN","BOATS","WALKER","KIDSTUFF","CARDS","PLATES","PEPSTRFL",
#summary variables of promotion?history "CARDPROM","MAXADATE","NUMPROM","CARDPM12","NUMPRM12",
#summary variables of giving?history "RAMNTALL","NGIFTALL","CARDGIFT","MINRAMNT","MAXRAMNT","LASTGIFT","LASTDATE","FISTDATE","TIMELAG","AVGGIFT",
#ID & targets
"CONTROLN","TARGET_B","TARGET_D","HPHONE_D",
#RFA
"RFA_2F","RFA_2A","MDMAUD_R","MDMAUD_F","MDMAUD_A",
#OTHERS
"CLUSTER2","GEOCODE2")
#刪除Id和TARGET_D屬性
> vars <- setdiff(varSet, c("CONTROLN", "TARGET_D"))
> cup98 <- cup98[, vars]
> dim(cup98)
[1] 95412? ? 64
隨機(jī)森林:
使用randomForest包和party包創(chuàng)建隨機(jī)森林主卫。
randomForest包無法處理包含缺失值或者擁有超過32個等級水平的分類變量庭再。但學(xué)習(xí)數(shù)據(jù)集中的所有記錄都包含了一個或多個缺失值牍帚。即使是varset中的變量值牺弄,也有大于93%的記錄含有一個或者多個缺失值妆绞。這在真實數(shù)據(jù)中很常見隐圾。集中的很多分類變量都含有超過32個等級水平龙致。例如: 表示國家,郵政編碼刘莹,職位以及制造商中的變量阎毅。這些變量分類中有一部分可以通過分組方式來減少等級水平,例如表示職位的變量点弯。通過將只含有少量記錄水平劃分為同一組來減少分類變量的等級水平數(shù)量扇调。例如小國家和制造商。
#查看缺失值以及分類變量等級超過10 的數(shù)據(jù)
> ? library(randomForest)
> ? rf <- randomForest(TARGET_B~.,data=cup98)
下面查看含有缺失值及分類變量等級超過10 的數(shù)據(jù)抢肛。
> #checking missing values
> n.missing <- rowSums(is.na(cup98))
> (tab.missing <-table(n.missing))
n.missing
? ? ? 0 1 2 3 4 5 6 7 6782 36864 23841 13684 11716 2483 41 1
> #percentage of records without missing values
> (tab.missing <-table(n.missing))
> round(tab.missing["0"]/nrow(cup98),digits=2)
0
0.07
> #check levels of categorial variables
> idx.cat <- which(sapply(cup98,is.factor))
> all.levels <- sapply(names(idx.cat),function(x)
+ nlevels(cup98[,x]))
> all.levels[all.levels>10]
OSOURCE STATE ZIP MDMAUD DOMAIN
896 57 19938 28 17
下面將數(shù)據(jù)劃分為訓(xùn)練集和測試集
> trainPercentage <-80
> testPercentage <-20
> ind <- sample(2,nrow(cup98),replace=TRUE
+ ,prob=c(trainPercentage,testPercentage))
> trainData <- cup98[ind==1,]
> testData <- cup98[ind==2,]
接下來狼钮,使用party包中的函數(shù)cforest()創(chuàng)建隨機(jī)森林。對于80%的訓(xùn)練集雌团,創(chuàng)建一棵決策樹需要花費2min ,創(chuàng)建一個含有50棵決策樹的隨機(jī)森林需要花費1.5h燃领。
> # cforest
> library(party)
> (time1 <- Sys.time())
> cf <- cforest(TARGET_B~.,data=trainData, + control = cforest_unbiased(mtry=2,ntree=50)) 錯誤: 無法分配大小為11.3 Gb的矢量
> (time2 <- Sys.time())
[1] "2017-12-26 14:20:21 CST"
> time2-time1
Time difference of 4.906754 mins
> print(object.size(cf),units = "Mb")
Error in structure(.Call(C_objectSize, x), class = "object_size") : 找不到對象'cf'
> myPrediction <-predict(cf,newdata = testData)
Error in predict(cf, newdata = testData) : 找不到對象'cf'
> (time3 <-Sys.time())
[1] "2017-12-26 14:22:46 CST"
> time3-time2
Time difference of 2.417677 mins?
內(nèi)存問題
用party包中的函數(shù)ctree()創(chuàng)建決策樹士聪。
> memory.limit(4095)
[1] 8099
Warning message:
In memory.limit(4095) : 無法減少內(nèi)存極限:忽視不用
> library(party)
> ct <- ctree(TARGET_B ~.,data=trainData)
錯誤: 無法分配大小為11.3 Gb的矢量
? ? ? ?函數(shù)memory.limit() 對R設(shè)置內(nèi)存限制锦援,memory.size()當(dāng)前R運(yùn)行所占用的內(nèi)存或者最大運(yùn)存。memory.profile()用來查看使用了哪些內(nèi)存空間剥悟。函數(shù)object.size()返回R對象占用的內(nèi)存大小灵寺。關(guān)于R內(nèi)存空間分配的細(xì)節(jié)參照運(yùn)行?memory.size返回的信息区岗。
? ? ? ?當(dāng)運(yùn)行以上代碼創(chuàng)建決策樹時略板,遇到了一個內(nèi)存空間的問題。
※ 減少內(nèi)存需求的一種方法是對含有多個等級水平的分類變量進(jìn)行分組或者刪除慈缔。
? ? ? 首先叮称,使用20%數(shù)據(jù)進(jìn)行訓(xùn)練, 其中包含了大約19200行和62列。加入ZIP運(yùn)行瓤檐,函數(shù)ctree()返回了一個錯誤“ reach total memory allocation "赂韵。刪除了ZIP之后可以成功運(yùn)行,但需要花費25 min挠蛉。刪除OSOURCE后創(chuàng)建一棵決策樹只需要5s祭示。同樣,輸入80%的已經(jīng)刪除了變量ZIP和OSOURCE的數(shù)據(jù)(大約76000行和60列)谴古,執(zhí)行ctree()函數(shù)需要25s质涛。
樣本數(shù)據(jù)的訓(xùn)練模型
為了找到哪些變量將用于建模,在本節(jié)需要對創(chuàng)建決策樹的過程重復(fù)10次掰担。然后收集出現(xiàn)在所有決策樹中的每一個變量汇陆,并將收集到的變量用于建立最終模型。
首先带饱,將數(shù)據(jù)集劃分為3個子集:訓(xùn)練數(shù)據(jù)集30%瞬测,測試數(shù)據(jù)集20%和其余的數(shù)據(jù)。
劃分出一部分的數(shù)據(jù)是為了縮減訓(xùn)練數(shù)據(jù)和測試數(shù)據(jù)的規(guī)模纠炮,以便在內(nèi)存受限的環(huán)境下成功地執(zhí)行訓(xùn)練和測試月趟。
> library(party)#for tree
> trainPercentage <-30
> testPercentage <-20
> restPercentage <-100 -trainPercentage-testPercentage
> filename <-paste("cup98-ctree",trainPercentage,testPercentage,sep="-")
> vars <-setdiff(varSet,c("TARGET_D","CONTROLN","ZIP","OSOURCE"))
> # partition the data into training and test datasets
> ind <- sample(3,nrow(cup98),replace=T,
+ prob = c(trainPercentage,testPercentage,restPercentage))
> trainData <-cup98[ind==1,vars]
> testData <-cup98[ind==2,vars]
檢驗抽樣后得到的訓(xùn)練集和測試集中的目標(biāo)變量,看其分布與原始數(shù)據(jù)中的分布是否一致恢口。如果不一致孝宗,可以使用分層抽樣。
> #check the percentage of classes
> round(prop.table(table(cup98$TARGET_B)),digits=3)
? ? 0? ? 1
0.949 0.051
> round(prop.table(table(cup98$TARGET_B)),digits=3)
0 1
0.949 0.051
> round(prop.table(table(cup98$TARGET_B)),digits=3)
0 1
0.949 0.051
此時保存工作空間
> #remove raw data to save memory
> rm(cup98,ind)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1610444 86.1 2637877 140.9 2637877 140.9
Vcells 4657403 35.6 647222195 4938.0 877743600 6696.7
> memory.size()
[1] 156.11
? ? ? 接下來耕肩,在訓(xùn)練數(shù)據(jù)上調(diào)用函數(shù)ctree()創(chuàng)建決策樹因妇。為了簡化案例并增強(qiáng)可讀性,本章使用函數(shù)ctree()默認(rèn)設(shè)置來訓(xùn)練決策樹猿诸。例如婚被,決策樹的默認(rèn)設(shè)置在13.4節(jié)。在下面的代碼中梳虽,函數(shù)object.size()返回一個數(shù)據(jù)對象的大小址芯。
加載工作空間(因為在上一步rm操作)
> # build ctree
> myCtree <-NULL
> startTime <-Sys.time()
> myCtree <-ctree(TARGET_B~.,data = trainData)
> Sys.time() - startTime
Time difference of 30.86209 secs
> print(object.size(myCtree),units = "Mb")
9.6 Mb
> #print(myCtree) > memory.size()
[1] 2571.15
> #plot the tree and save it in a .PDF file
> pdf(paste(filename,".pdf",sep=""),width=12,height=9,
+ paper="a4r",pointsize=6)
> plot(myCtree,type="simple",ip_args=list(pval=F),
+ ep_args=list(digits=0),main=filename)
> graphics.off()
構(gòu)建10棵決策樹
使用已選變量建立模型
? ? ?建立了10棵決策樹之后,選取包含的所有變量來建立最后的模型窜觉。這一次所有的數(shù)據(jù)都用于學(xué)習(xí)谷炸,80%作為訓(xùn)練集和20%作為測試集。
> vars.selected<- c("CARDS", "CARDGIFT", "CARDPM12", "CHILD12", "CLUSTER2", "DOMAIN", "GENDER", "GEOCODE2", "HIT", "HOMEOWNR", "INCOME", "LASTDATE", "MINRAMNT", "NGIFTALL", "PEPSTRFL", "RECINHSE", "RFA_2A", "RFA_2F", "STATE", "WALKER")
> trainPercentage <- 80
> ?testPercentage <- 20
> (fileName <- paste("cup98-ctree", trainPercentage, testPercentage, sep="-"))
[1] "cup98-ctree-80-20" > vars <- c("TARGET_B", vars.selected)
> ind <- sample(2, nrow(cup98), replace=T, prob=c(trainPercentage, testPercentage))
> trainData <- cup98[ind==1, vars]
> testData <- cup98[ind==2, vars]
> #build a decision tree
> myCtree <-ctree(TARGET_B~.,data = trainData)
> print(object.size(myCtree),units="Mb")
48.1 Mb
> memory.size()
[1] 4994.07
> print(myCtree)
Conditional inference tree with 25 terminal nodes Response:
TARGET_B Inputs:
CARDS, CARDGIFT, CARDPM12, CHILD12, CLUSTER2, DOMAIN, GENDER, GEOCODE2, HIT, H;.'tatistic = 70.324 18)* weights = 207 17) CARDPM12 > 4 19)* weights = 2751 16) CARDGIFT > 3 20) LASTDATE <= 9610; criterion = 0.958, statistic = 43.626 21) CARDPM12 <= 4; criterion = 1, statistic = 40.985 22)* weights = 152 21) CARDPM12 > 4 23)* weights = 1057 20) LASTDATE > 9610 24)* weights = 47 1) RFA_2A == {F, G} 25) RFA_2F <= 1; criterion = 1, statistic = 117.069 26) PEPSTRFL == {X}; criterion = 1, statistic = 85.627 27) MINRAMNT <= 13; criterion = 0.999, statistic = 56.749 28)* weights = 8622 27) MINRAMNT > 13 29) RFA_2A == {F}; criterion = 0.987, statistic = 36.194 30)* weights = 87 29) RFA_2A == {G} 31)* weights = 274 26) PEPSTRFL == { } 32) CLUSTER2 <= 27; criterion = 1, statistic = 65.473 33)* weights = 12301 32) CLUSTER2 > 27 34) RFA_2A == {F}; criterion = 0.955, statistic = 27.535 35)* weights = 9557 34) RFA_2A == {G} 36)* weights = 3385 25) RFA_2F > 1 37) PEPSTRFL == {X}; criterion = 1, statistic = 95.078 38) LASTDATE <= 9609; criterion = 0.995, statistic = 62.087 39) GENDER == { , A}; criterion = 0.994, statistic = 54.878 40)* weights = 301 39) GENDER == {F, J, M, U} 41)* weights = 8208 38) LASTDATE > 9609 42) WALKER == {Y}; criterion = 0.969, statistic = 22.234 43)* weights = 58 42) WALKER == { } 44)* weights = 396 37) PEPSTRFL == { } 45) CARDGIFT <= 5; criterion = 1, statistic = 88.729 46) INCOME <= 3; criterion = 0.976, statistic = 84.344 47)* weights = 2770 46) INCOME > 3 48)* weights = 6668 45) CARDGIFT > 5 49)* weights = 543
? ? ? ?將所有已建立的決策樹保存為一個Rdata文件禀挫,并將決策樹的圖像保存在一個pdf 中旬陡。如果一棵決策樹很大,其中的節(jié)點以及節(jié)點中的文本將會出現(xiàn)重疊语婴。避免類似情況的一種方法是紙張的寬和高都相應(yīng)調(diào)大描孟,并使用pointsize將字體調(diào)小驶睦。此外,在繪制決策樹時可以減少圖像的文字匿醒。設(shè)置ip_args=list(pval=FALSE)來壓縮P值啥繁,并設(shè)置ep_args=list(digists=0)縮減數(shù)值的長度。
> save(myCtree,file=paste(fileName,".Rdata",sep=""))
> pdf(paste(fileName,".pdf",sep=""),width=12,height=9,
+ paper="a4r",pointsize=6)
> plot(myCtree,type="simple",ip_args=list(pval=F),ep_args=list(digits=0),main=filename)
>plot(myCtree,terminal_panel=node_barplot(myCtree),ip_args=list(pval=F),ep_args=list(digits=0),main=filename)
> graphics.off()
?然后使用測試數(shù)據(jù)對決策樹模型進(jìn)行測試青抛。
> rm(trainData)
> myPrediction <-predict(myCtree,newdata=testData)
> # check predicted results
> testResult <-table(myPrediction,testData$TARGET_B)
> percentageOfOne <-round(100*testResult[,2]/(testResult[,1]+testResult[,2]),digits=1)
> print(testResult)
myPrediction 0 1 0.0303567702900063 4627 168 0.0382599580712788 2375 119 0.0382771164021164 2887 123 0.0418929402637704 303 11 0.0466228332337119 1202 55 0.0517171717171717 612 40 0.0538709677419355 1467 84 0.063 712 39 0.0666666666666667 142 7 0.0670955882352941 492 47 0.0716565129097766 803 59 0.0723014256619145 715 45 0.0746705710102489 151 7 0.0775269872423945 251 21 0.0782828282828283 112 8 0.0939875604699378 649 80 0.0950413223140496 71 6 0.0985507246376812 97 5 0.114043355325165 247 30 0.115384615384615 20 0 0.135135135135135 153 20 0.167844522968198 133 15 0.176795580110497 36 7 0.22 10 1
> boxplot(myPrediction~testData$TARGET_B,xlab = "TARGET_B",ylab="Prediction",ylim=c(0,0.25))
> sl <-sort(myPrediction,decreasing=TRUE,method="quick",index.return=TRUE)
> testSize <- nrow(testData)
> TotalNumOfTarget <-sum(testData$TARGET_B)
> NumOfTarget <-rep(0,testSize)
> NumOfTarget[1] <- (testData$TARGET_B)[sl$ix[1]]
> for(i in 2:testSize){
+ NumOfTarget[i] <-NumOfTarget[i-1]+testData$TARGET_B[sl$ix[i]]}
> plot(1:testSize,NumOfTarget,pty=".",type="l",lty="solid",col="red",ylab="Count Of Responses in Top k",xlab="Top k",main=filename)
> grid(col="gray",lty="dotted")
> percentile <-100*(1:testSize)/testSize
>percentileTarget<-100*NumOfTarget/TotalNumOfTarget
>plot(percentile,percentileTarget,pty=".",type="l",lty="solid",col="red",ylab="Percentage of Predicted Donations(%)",xlab="Percentage of Pool",main=filename)
> grid(col="grey",lty="dotted")
評分
? ? ? ?使用一棵較大的決策樹進(jìn)行對大數(shù)據(jù)評分時旗闽,將會出現(xiàn)內(nèi)存溢出。為了減少內(nèi)存損耗蜜另,將評分?jǐn)?shù)據(jù)劃分為多少個子集适室,并對每一個子集分別使用預(yù)測模型,然后再將所以的評分結(jié)果進(jìn)行融合举瑰。
> memory.limit(4095)
> #read scoring data and training data
> cup98val <-read.csv("F://cup98VAL.txt")
>? cup98 <-read.csv("F://cup98LRN.txt")
> library(party)
> treeFileName <-"cup98-ctree-80-20"
> splitNum <-10
? ? ? ?評分之前捣辆,我們需要查看scoreData 和分類變量的等級水平是否一致。如果不一致此迅,需要根據(jù)scoreData中的因子水平對trianData中的進(jìn)行設(shè)置汽畴。只對于predict()的執(zhí)行十分關(guān)鍵。評分?jǐn)?shù)據(jù)中分類變量的缺失值和新值都設(shè)置成NA(缺失值)耸序。
> #check and set levels of categorical variables
> trainData <-cup98[,vars]
> vars2 <-setdiff(c(vars,"CONTROLN"),"TARGET_B")
>scoreData <-cup98val[,vars2]
>rm(cup98,cup98val)
?>scoreNames <-names(scoreData)
>#cat("\n checking and setting variable values \n")
>newScoreData <-scoreData
>variableList <-intersect(trainNames,scoreNames)
下面代碼一開始因為格式不正確報錯忍些,正確格式截圖如下:
? ? 查看新數(shù)據(jù)之后,再加載預(yù)測模型并查看內(nèi)存的使用情況坎怪。還要使用函數(shù)gc()將不再使用或者產(chǎn)生垃圾的對象刪除罢坝。
> load(paste(treeFileName,".Rdata",sep=""))
> print(object.size(trainData),units="Mb") 8 Mb
> print(object.size(scoreData),units="Mb") 187 Mb
> print(object.size(newScoreData),units="Mb") 8.1 Mb
> print(object.size(myCtree),units="Mb") 45 Mb
> gc()
used (Mb) gc trigger (Mb) max used (Mb) Ncells 1700741 90.9 2637877 140.9 2554658 136.5 Vcells 146612276 1118.6 296119905 2259.3 295382927 2253.6
> memory.size() [1] 1268.69
> rm(trainNames,scoreNames) >
rm(variableList)
> rm(trainLevels,scoreLevels)
> rm(trainData,scoredata)
> gc()
used (Mb) gc trigger (Mb) max used (Mb) Ncells 1701286 90.9 2637877 140.9 2554658 136.5 Vcells 146618456 1118.7 296119905 2259.3 295382927 2253.6 > memory.size() [1] 1268.72
將評分?jǐn)?shù)據(jù)劃分為多個子集,并對每個子集建立一棵決策樹以便降低內(nèi)存消耗搅窿,評分結(jié)果的分布情況如圖14-5所示嘁酿。
> nScore <-dim(newScoreData)[1]
> (splitSize <-round(nScore/splitNum))
[1] 9637?
> for (i in 1:splitNum) {
+ startPos <- 1+(i-1)*splitSize
+ if(i==splitNum) {
+ endPos <-nScore
+ }
+ else {
+ endPos <-i*splitSize
+ }
+ print(paste("Predicing:",startPos,"-",endPos))
+ #make prediction + tmpPred <- predict(myCtree,newdata=newScoreData + [startPos:endPos,])
+ myPred <-c(myPred,tmpPred)
+ }
[1] "Predicing: 1 - 9637" [1] "Predicing: 9638 - 19274" [1] "Predicing: 19275 - 28911" [1] "Predicing: 28912 - 38548" [1] "Predicing: 38549 - 48185" [1] "Predicing: 48186 - 57822" [1] "Predicing: 57823 - 67459" [1] "Predicing: 67460 - 77096" [1] "Predicing: 77097 - 86733" [1] "Predicing: 86734 - 96367"
> #cumulative count and percentage
> length(myPred)
[1] 96367
> rankedLevels <-table(round(myPred,digits=4))
> ? ? ? ? ? ? ?# ? put highest rank first by reversing the vector
> rankedLevels<-rankedLevels[length(rankedLevels):1]
> levelNum<-length(rankedLevels)
> cumCnt<-rep(0,levelNum)
> cumCnt[1]<-rankedLevels[1]
>? for (i in 2:levelNum) {
+ cumCnt[i]<-cumCnt[i-1]+rankedLevels[i]}
?>? cumPercent<-100*cumCnt/nScore
> cumPercent<-round(cumPercent,digits=1)
> percent<-100*rankedLevels/nScore
> percent<-round(percent,digits=1)
> cumBanking<-data.frame(rankedLevels,cumCnt,percent,cumPercent)
>?? names(cumRanking)<-c("Frequency","CumFrequency","Percentage","CumPercentage")
>? print(cumRanking)
> write.csv(cumRanking,"cup98-cumulative-ranking.csv",row.names=T)
> pdf(paste("cup98-score-distribution.pdf",sep=""))
>plot(rankedLevels,x= names(rankedLevels),type="h",xlab="Score",ylab="#of Customers")
>? ?sl <-sort(myPred,decreasing=TRUE,method="quick",index.return=TRUE)
> ?varToOutput <-c("CONTROLN")
> ?score<-round(myPred[sl$ix],digits=4)
> table(score,useNA="ifany")
score 0.0163 0.0285 0.031 0.0361 0.0414 0.0416 0.0486 0.0488 0.0534 0.0539 0.054 3624 10088 3178 18551 12758 5187 1230 5753 938 592 8383 0.0608 0.0703 0.0731 0.075 0.0884 0.125 0.1402 0.1636 0.1687 0.1833 0.1957 3856 6957 2082 6010 5618 138 745 51 213 161 183 0.2308 71
> result<-data.frame(cbind(newScoreData[sl$ix,varToOutput]),score)
> names(result) <-c(varToOutput,"score")
> write.csv(result,"cup98-predicted-score.csv",row.names=F)
##32位寫入EXCEL
> library(RODBC)
>? xlsFile <- obdcConnectExcel("cup98-predicted-score.xls")
> sqlSave(xlsFile,result,rownames=F)
> obdcCloseAll()
##64位(也可以安裝XLconnect包)寫入EXCEL(.csv)
> write.table(score,"cup98-predicted-score.csv",sep=",")
輸出規(guī)則
> print.TerminalNode<-function(x,rule=NULL,...){
+ n.rules<<-n.rules+1
+ node.ids<<-c(node.ids,x$nodeID)
+ n.records<<-c(n.records,sum(x$weights))
+ scores<<-c(scores,x$prediction)
+ ruleset<<-c(ruleset,rule)
+ }
> print.SplittingNode<-function(x,rule=NULL,...){
+ if(!is.null(rule)) {
+ rule<-paste(rule,"\n")
+ }
+ rule2<-print(x$psplit,left=TRUE,rule=rule)
+ print(x$left,rule=rule2)
+ rule3<-print(x$psplit,left=FALSE,rule=rule)
+ print(x$right,rule=rule3)
+ }
>? print.orderedSplit<-function(x,left=TRUE,rule=NULL,...){
if(!is.null(attr(x$splitpoint,"levels"))){ sp<-attr(x$splitpoint,"levels")[x$splitpoint]
}else{
sp<-x$splitpoint } n.pad<-20-nchar(x$variableName) pad<-paste(rep(" ",n.pad),collapse=" ")
if(!is.null(x$toleft)) { left<-as.logical(x$toleft)==left } if(left) {
rule2<-paste(rule,x$variableName,pad,"<=",sp,sep="")
} else{ rule2<-paste(rule,x$variableName,pad,">",sp,sep="")
}
rule2
}
> print.nominalSplit<-function(x,left=TRUE,rule=NULL,...){
+ levels<-attr(x$splitpoint,"levels")
+ ###is>0 for levels available in this code
+ tab<-x$table
+ if(left) {
+ lev<-levels[as.logical(x$splitpoint)&(tab>0)]
+ }else{
+ lev<-levels[!as.logical(x$splitpoint)&(tab>0)]
+ }
+ txt<-paste("'",paste(lev,collapse="','"),"'",sep="")
+ n.pad<-20-nchar(x$variableName)
+ pad<-paste(rep("",n.pad),collapse="")
+ rule2<-paste(rule,x$variableName,pad,txt,sep="")
?+ rule2
?+ }
? ? ? ? 調(diào)用print(myCtree@tree)抽取出規(guī)則的所有相關(guān)信息并保存到5個全局變量中國。借此案例按照規(guī)則的評分對其進(jìn)行排序男应,并輸出規(guī)則及其所包含記錄的百分比和累計百分比闹司。cumsum()計算數(shù)值型向量的積累和。為了節(jié)省空間沐飘,這里只輸出5個規(guī)則游桩。
> load(paste(treeFileName,".Rdata",sep=""))
> # extract rules from treeFileName
> n.rules<-0
> node.ids<-NULL
> n.records<-NULL
> scores<-NULL
> ruleset<-NULL
> print(myCtree@tree)
> n.rules
[1] 50
? ? ? 按照評分的規(guī)則進(jìn)行排序
> sl <-sort(scores,decreasing=T,method="quick",index.return=T)
> percentage <-100*n.records[sl$ix]/sum(myCtree@weights)
> cumPercentage<-round(cumsum(percentage),digits=1)
>percentage<-round(percentage,digits=1)
>load(paste(treeFileName,".Rdata",sep=""))
> #print all rules
>for (i in 1:n.rules) {
cat("Rrule",i,"\n")
cat("Node:",node.ids[sl$ix[i]])
cat(",score:",percentage[i],'%',sep="")
cat(",Percentage:",percentage[i],'%',sep="")
cat(",Cumulative Percentage:",cumPercentage[i],"&",sep="")?
cat("ruleset[sl$ix[i]]","\n\n")}
由于篇幅限制 , 只展示其中前五個規(guī)則:
Rrule 1 Node: 38,Percentage:0.0680192%,Cumulative Percentage:0.1&Rrule 2 Node: 38,Percentage:0.0680192%,Cumulative Percentage:0.1&Rrule 3 Node: 32,Percentage:0.1805125%,Cumulative Percentage:0.3&Rrule 4 Node: 32,Percentage:0.1805125%,Cumulative Percentage:0.5&Rrule 5 Node: 42,Percentage:0.1569674%,Cumulative Percentage:0.7 ……
? ? ? ?輸出SAS規(guī)則的得分
下面為四個改進(jìn)的輸出函數(shù):
> #functions for printing rules in SAS statement for scoring with a DATA step
> # based on "Print.R"from package party
> print.TerminalNode <-function(x,rule=NULL,...){
+ rule<-sub(' +',"",rule) #remove leading spaces + n.rules<<-n.rules+1
+ node.ids<<-c(node.ids,x$nodeID)
+ n.records<<-c(n.records,sum(x$weights))
+ scores<<-c(scores,x$prediction) +
ruleset<<-c(ruleset,rule)?
+ }
> print.SplittingNode<-function(x,rule=NULL,...){
+ if(!is.null(rule)) { + rule<-paste(rule,"\n and")
+ }#endif
+ rule2<-print(x$psplit,left=TRUE,rule=rule)
+ print(x$left,rule=rule2)
+ rule3<-print(x$psplit,left=FALSE,rule=rule)
+ print(x$right,rule=rule3)
+ }
> print.orderedSplit<-function(x,left=TRUE,rule=NULL,...){
+ if(!is.null(attr(x$splitpoint,"levels"))){
+ sp<-attr(x$splitpoint,"levels")[x$splitpoint]
+ }else{
+ sp<-x$splitpoint
+ } + if(!is.null(x$toleft)) {
+ left<-as.logical(x$toleft)==left
+ } + if(left) { + rule2<-paste(rule,x$variableName,"<=",sp,sep="")
+ } else{
+ rule2<-paste(rule,x$variableName,">",sp,sep="")
+ }
+ rule2
+ }
> print.nominalSplit<-function(x,left=TRUE,rule=NULL,...){
+ levels<-attr(x$splitpoint,"levels")
+ ###is>0 for levels available in this code + tab<-x$table
+ if(left) {
+ lev<-levels[as.logical(x$splitpoint)&(tab>0)]
+ }else{ + lev<-levels[!as.logical(x$splitpoint)&(tab>0)]
+ }
+ txt<-paste("'",paste(lev,collapse="','"),"'",sep="")
+ rule2<-paste(rule, " ",x$variableName,"in (",txt,")",sep="")
+ rule2
+ }
> library(party)#for tree
> load(paste(treeFileName,".Rdata",sep=""))
> n.rules<-0
> node.ids<-NULL
> n.records<-NULL
> scores<-NULL
> ruleset<-NULL
> print(myCtree@tree)
> n.rules
[1] 48
按照得分對其排名并輸出。
篇幅有限薪铜,只給出前4個众弓。
總結(jié):
? ? ? ? 本章介紹了一個內(nèi)存受限的環(huán)境下對大數(shù)據(jù)建立預(yù)測模型的案例。通過在樣本數(shù)據(jù)上建立決策樹隔箍,查找并收集有用的變量來建立最終的預(yù)測模型。此方法適用于內(nèi)存受限的大數(shù)據(jù)建模脚乡。
? ? ? ? ?另一種方法是對變量進(jìn)行抽樣蜒滩,每一次進(jìn)行變量抽樣后都建立一個模型滨达,建立了10個或者20個模型之后,從所有的這些變量中收集有用變量俯艰,并建立最終模型捡遍。該方法類似于隨機(jī)森林,隨機(jī)森林中的每一棵樹都由變量的隨機(jī)抽樣子集構(gòu)建而得到的竹握。但是画株,與構(gòu)建隨機(jī)森林比,該方法需要更少的內(nèi)存空間啦辐。