1.引言
信貸行業(yè)中常見的評分卡包括:申請評分卡(Application)、行為評分卡(Behavior)厕怜、催收評分卡(Collection)以及反欺詐評分卡(Anti-Fraud)衩匣,簡稱為A卡、B卡粥航、C卡和F卡琅捏。
A卡,主要應用于貸前準入環(huán)節(jié)對新用戶的信用評級递雀。
B卡柄延,主要應用于貸中管理環(huán)節(jié)對存量用戶的行為預測。
C卡缀程,主要應用于貸后催收環(huán)節(jié)對存量用戶是否催收的預測管理搜吧。
F卡,主要應用于貸前準入環(huán)節(jié)對新用戶可能存在的欺詐行為進行預測杨凑。
本文通過歷史數(shù)據(jù)建立Logistic回歸模型滤奈,預測用戶出現(xiàn)違約的概率,從而建立申請評分卡模型撩满。
本文數(shù)據(jù)來自“klaR”包中的German credit data蜒程。
2.數(shù)據(jù)導入與觀察
加載要用到的數(shù)據(jù),并進行初步數(shù)據(jù)觀察:
> library(klaR) #數(shù)據(jù)集包
> library(VIM) #缺失值可視化
> library(party) #隨機森林
> library(InformationValue) #求IV值
> library(smbinning) #最優(yōu)分段
> library(ggplot2) #可視化
> library(gridExtra) #可視化
> library(woe) #求woe值
> library(car) #檢驗多重共線性
> library(pROC)
> data(GermanCredit)
> str(GermanCredit)
略
> summary(GermanCredit)
略
該數(shù)據(jù)集包含了1000個樣本伺帘,每個樣本包括21個變量昭躺,變量含義如下:
3.數(shù)據(jù)清洗
數(shù)據(jù)清洗主要工作包括缺失值和異常值處理。
3.1 缺失值處理
查看缺失值情況:
> aggr(x = GermanCredit,prop=T,numbers=T,combined=F)
從以上結果可看出伪嫁,本數(shù)據(jù)集不存在缺失值窍仰。
3.2 異常值處理
查看定量指標異常值情況:
> quan_index <-c("duration","amount","installment_rate","present_residence",
+ "age","number_credits","people_liable")
> quan_vars <- GermanCredit[,quan_index]
> boxplot(scale(quan_vars),col="lightgray")
從圖2可以看出,定量指標中存在異常值礼殊。下面驹吮,讓我們具體來看一下:
> table(boxplot.stats(quan_variables$duration)$out)
45 47 48 54 60 72
5 1 48 2 13 1
> table(boxplot.stats(quan_variables$amount)$out)
7966 7980 8065 8072 8086 8133 8229 8318 8335 8358 8386 8471 8487 8588
1 1 1 1 1 1 1 1 1 1 1 1 1 1
8613 8648 8858 8947 8978 9034 9055 9157 9271 9277 9283 9398 9436 9566
1 1 1 1 1 1 1 1 1 1 1 1 1 1
9572 9629 9857 9960 10127 10144 10222 10297 10366 10477 10623 10722 10875 10961
1 1 1 1 1 1 1 1 1 1 1 1 1 1
10974 11054 11328 11560 11590 11760 11816 11938 11998 12169 12204 12389 12579 12612
1 1 1 1 1 1 1 1 1 1 1 1 1 1
12680 12749 12976 13756 14027 14179 14318 14421 14555 14782 14896 15653 15672 15857
1 1 1 1 1 1 1 1 1 1 1 1 1 1
15945 18424
1 1
> table(boxplot.stats(quan_variables$age)$out)
65 66 67 68 70 74 75
5 5 3 3 1 4 2
> table(boxplot.stats(quan_variables$number_credits)$out)
4
6
> table(boxplot.stats(quan_variables$people_liable)$out)
2
155
根據(jù)具體情況來看针史,定量指標中存在的異常值是基本符合實際情況的,而且數(shù)據(jù)集樣本數(shù)量較少碟狞,因此不對異常值做處理啄枕。(本例比較特殊,實際工作中的情況肯定會比較復雜)
4.特征變量選擇
本數(shù)據(jù)集包含了定量和定性兩類指標族沃,接下來我們用不同的方法频祝,篩選出對違約狀態(tài)影響最大的指標,作為構建模型的變量脆淹。
首先常空,根據(jù)簡單隨機抽樣,將數(shù)據(jù)集劃分為訓練集和測試集:
> set.seed(1234)
> GermanCredit$credit_risk <- ifelse(GermanCredit$credit_risk=="good",0,1)
> sam <- sample(nrow(GermanCredit), 800, replace = F)
> train <- GermanCredit[sam, ]
> test <- GermanCredit[-sam, ]
4.1 定量指標
以下用隨機森林法和Logistic回歸方法盖溺,尋找對因變量影響最顯著的自變量:
> # 提取定量指標
> quant_vars<-c("duration","amount","installment_rate","present_residence","age",
+ "number_credits","people_liable","credit_risk")
> quant_data<-GermanCredit[,quant_vars]
> # 隨機森林法
> fit1 <- cforest(credit_risk~.,data = quant_data,controls = cforest_unbiased(mtry = 2, ntree = 50))
> # 調整變量間的相關系數(shù)漓糙,獲取自變量的重要性
> sort(varimp(fit1,conditional = T),decreasing=T)
duration age amount people_liable installment_rate
0.0046574142 0.0032983348 0.0028402490 0.0009693390 0.0007194029
number_credits present_residence
0.0005876221 -0.0001082808
> # 調整樣本變量不平衡性,獲取自變量的重要性
> sort(varimpAUC(fit1),decreasing = T)
duration amount age installment_rate people_liable
0.0185098322 0.0118307629 0.0089927377 0.0040057809 0.0026379584
number_credits present_residence
0.0011849564 -0.0009655443
> # Logistic回歸
> fit2 <- glm(credit_risk~.,data = quant_data,family = binomial())
> fit2 <- step(fit2,trace = 0)
> summary(fit2)
Call:
glm(formula = credit_risk ~ duration + installment_rate + age,
family = binomial(), data = quant_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4943 -0.8464 -0.6890 1.2015 2.2806
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.427935 0.362767 -3.936 8.28e-05 ***
duration 0.037535 0.006553 5.728 1.02e-08 ***
installment_rate 0.158810 0.074329 2.137 0.03263 *
age -0.021756 0.007686 -2.831 0.00464 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
---
略
綜合以上兩種方法的結果烘嘱,我們篩選出了對違約狀態(tài)影響最顯著的四個指標 :duration(3)昆禽、age(3)、amount(2)和installment_rate(2)蝇庭。
4.2 定性指標
通過R中的informationvalue包醉鳖,計算各指標的IV值,得到各定性指標間的重要性度量哮内,選取其中的high predictive指標:
> # 提取定性指標
> qualt_vars<-c("status","credit_history","purpose","savings","employment_duration",
+ "personal_status_sex","other_debtors","property",
+ "other_installment_plans","housing","job","telephone","foreign_worker","credit_risk")
> qualt_data<-train[,qualt_vars]
> # 求指標iv值
> all_iv <- data.frame(vars=qualt_vars,iv=numeric(length(qualt_vars)),
+ strength=character(length(qualt_vars)),stringsAsFactors = F)
> for (i in qualt_vars){
+ all_iv[all_iv$vars==i,]$iv <- IV(X=qualt_data[,i], Y=qualt_data$credit_risk)
+ all_iv[all_iv$vars==i,]$strength <- attr(IV(X=qualt_data[,i], Y=qualt_data$credit_risk),"howgood")
+ }
> (all_iv<-all_iv[order(-all_iv$iv),] )
vars iv strength
1 status 0.62161414 Highly Predictive
2 credit_history 0.28847143 Highly Predictive
4 savings 0.21358882 Highly Predictive
3 purpose 0.21326499 Highly Predictive
8 property 0.09828292 Somewhat Predictive
5 employment_duration 0.08935235 Somewhat Predictive
10 housing 0.07040316 Somewhat Predictive
6 personal_status_sex 0.06871949 Somewhat Predictive
9 other_installment_plans 0.04371291 Somewhat Predictive
7 other_debtors 0.04350575 Somewhat Predictive
13 foreign_worker 0.04262905 Somewhat Predictive
12 telephone 0.01669733 Not Predictive
11 job 0.01002765 Not Predictive
14 credit_risk 0.00000000 Not Predictive
根據(jù)以上結果盗棵,我們選擇status、credit_history北发、savings和purpose四個high predictive指標構建模型漾根。
綜上,我們共選擇了8個變量作為入模變量鲫竞。
> # 入模指標
> quant_model_vars <- c("duration","amount","installment_rate","age")
> qualt_model_vars <- c("status","credit_history","savings","purpose")
5.WOE計算
5.1 變量分箱
5.1.1 定量指標
計算定量指標的WOE之前,需要先對定量指標進行分段逼蒙。下面从绘,優(yōu)先采用最優(yōu)分段,其原理是基于條件推理樹(conditional inference trees, Ctree)的遞歸分割算法是牢,核心算法用函數(shù)ctree()表示僵井。
> # 對duration進行最優(yōu)分段
> result1<-smbinning(df=train,y="credit_risk",x="duration",p=0.05)
> 查看分段效果
> smbinning.plot(result1,option="WoE",sub="Duration")
> breaks1 <- c(0,7,30,Inf)
> train$cut_duration <- cut(train$duration,breaks = breaks1)
從上圖可以看出,woe值相差較大驳棱,分段效果不錯批什。以下針對amount、age采用相同分段方法社搅。
> #對amount進行最優(yōu)分段
> result2<-smbinning(df=train,y="credit_risk",x="amount")
> smbinning.plot(result2,option="WoE",sub="Amount")
> breaks2 <- c(0,1372,3913,Inf)
> train$cut_amount <- cut(train$amount,breaks = breaks2)
> result3<-smbinning(df=train,y="credit_risk",x="age")
> smbinning.plot(result3,option="WoE",sub="Age")
> breaks3 <- c(0,25,Inf)
> train$cut_age <- cut(train$age,breaks = breaks3)
> train$cut_rate <- cut(train$installment_rate,4)
> rate_woe <- woe(Data = train,Independent ="cut_rate",Continuous = F,Dependent = "credit_risk",
+ C_Bin = 4,Bad = 0,Good = 1)
> ggplot(rate_woe, aes(x = BIN, y = WOE)) +
+ geom_bar(stat = "identity",fill = "blue", colour = "grey60",
+ size = 0.2, alpha = 0.2)+
+ labs(title = "等距分段")+
+ theme(plot.title = element_text(hjust = 0.5))
installment_rate只有1驻债、2乳规、3、4四個值合呐,四個值對應的woe值差別較大暮的,且具有單調性,采用等距分段淌实。
5.1.2 定性指標
接下來我們需要對定性指標做必要的降維處理冻辩,方便計算其WOE值。首先拆祈,我們查看一下入模的定性指標的概況:
> discrete_data<-train[,qual_model_vars]
> summary(discrete_data)
status credit_history
... < 100 DM :211 no credits taken/all credits paid back duly: 34
0 <= ... < 200 DM :214 all credits at this bank paid back duly : 41
... >= 200 DM / salary for at least 1 year: 50 existing credits paid back duly till now :422
no checking account :325 delay in paying off in the past : 66
critical account/other credits existing :237
savings purpose
... < 100 DM :478 domestic appliances:216
100 <= ... < 500 DM : 84 car (new) :189
500 <= ... < 1000 DM : 48 radio/television :145
... >= 1000 DM : 41 car (used) : 83
unknown/no savings account:149 others : 83
retraining : 39
(Other) : 45
由以上概況可知恨闪,定性指標status、credit_history放坏、和savings的維數(shù)最高為5維咙咽,最低為4維,維數(shù)適中轻姿,可以不進行處理犁珠。
定性指標purpose的維數(shù)多于7維,明顯高于其他定性指標互亮。為了避免“維數(shù)災難”犁享,我們根據(jù)三條準則進行降維:1.維度間屬性相似;2.合并后woe有明顯變化豹休;3.單個維度樣本量不應過小炊昆。
> # 未進行降維前
> purpose_woe1 <- woe(Data = train,Independent ="purpose",Continuous = F,Dependent = "credit_risk",
+ C_Bin = 10,Bad = 0,Good = 1)
> ggplot(purpose_woe1, aes(x = BIN, y = WOE)) +
+ geom_bar(stat = "identity",fill = "blue", colour = "grey60",size = 0.2, alpha = 0.2)+
+ labs(title = "Purpose")+
+ theme(plot.title = element_text(hjust = 0.5))
> # 類似屬性合并
> train <- within(train,{
+ cut_purpose <- NA
+ cut_purpose[purpose=="car (new)"] <- "car(new/used)"
+ cut_purpose[purpose=="car (used)"] <- "car(new/used)"
+ cut_purpose[purpose=="furniture/equipment"] <- "furniture/equipment/radio/television/domestic appliances"
+ cut_purpose[purpose=="radio/television"] <- "furniture/equipment/radio/television/domestic appliances"
+ cut_purpose[purpose=="domestic appliances"] <- "furniture/equipment/radio/television/domestic appliances"
+ cut_purpose[purpose=="repairs"] <- "repairs/business/others"
+ cut_purpose[purpose=="education"] <- "education/retraining"
+ cut_purpose[purpose=="retraining"] <- "education/retraining"
+ cut_purpose[purpose=="business"] <- "repairs/business/others"
+ cut_purpose[purpose=="others"] <- "repairs/business/others"})
> purpose_woe2 <- woe(Data = train,Independent ="cut_purpose",Continuous = F,Dependent = "credit_risk",
+ C_Bin = 10,Bad = 0,Good = 1)
> ggplot(purpose_woe2, aes(x = BIN, y = WOE)) +
+ geom_bar(stat = "identity",fill = "blue", colour = "grey60",
+ size = 0.2, alpha = 0.2)+
+ labs(title = "Purpose")+
+ theme(plot.title = element_text(hjust = 0.5))+
+ theme(axis.text.x = element_text(vjust = 0.2, hjust = 0.2, angle = 10)
5.2 WOE計算
用klaR包中的woe()函數(shù)獲取入模變量的woe值。
> newtrain <- cbind(discrete_data[,-4],train[,c(21:26)])
> str(newtrain)
'data.frame': 800 obs. of 9 variables:
$ status : Factor w/ 4 levels "... < 100 DM",..: 4 4 4 1 4 4 2 3 3 4 ...
$ credit_history: Factor w/ 5 levels "no credits taken/all credits paid back duly",..: 5 5 3 3 5 4 5 3 3 3 ...
$ savings : Factor w/ 5 levels "... < 100 DM",..: 1 1 1 1 1 1 1 1 1 2 ...
$ credit_risk : num 1 1 0 1 0 0 1 1 0 0 ...
$ cut_duration : Factor w/ 3 levels "(0,7]","(7,30]",..: 3 2 2 3 2 3 2 3 2 3 ...
$ cut_amount : Factor w/ 3 levels "(0,1.37e+03]",..: 3 2 2 2 2 3 3 3 1 3 ...
$ cut_age : Factor w/ 2 levels "(0,25]","(25,Inf]": 1 2 2 1 2 1 2 2 1 2 ...
$ cut_rate : Factor w/ 4 levels "(0.997,1.75]",..: 4 3 4 4 4 2 4 4 3 4 ...
$ cut_purpose : chr "car(new/used)" "car(new/used)" "furniture/equipment/radio/television/domestic appliances" "furniture/equipment/radio/television/domestic appliances" ...
> newtrain$credit_risk <- as.factor(newtrain$credit_risk)
> newtrain$cut_purpose <- as.factor(newtrain$cut_purpose)
> # 獲取woe值
> woemodel<-klaR::woe(credit_risk~.,data = newtrain,zeroadj=0.5,applyontrain=TRUE)
> traindata <- predict(woemodel, newtrain, replace = TRUE)
> str(traindata)
'data.frame': 800 obs. of 9 variables:
$ credit_risk : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 2 2 1 1 ...
$ woe.status : num 1.127 1.127 1.127 -0.772 1.127 ...
$ woe.credit_history: num 0.6989 0.6989 -0.0638 -0.0638 0.6989 ...
$ woe.savings : num -0.292 -0.292 -0.292 -0.292 -0.292 ...
$ woe.cut_duration : num -0.8058 0.0808 0.0808 -0.8058 0.0808 ...
$ woe.cut_amount : num -0.55 0.419 0.419 0.419 0.419 ...
$ woe.cut_age : num -0.612 0.161 0.161 -0.612 0.161 ...
$ woe.cut_rate : num -0.15884 -0.00808 -0.15884 -0.15884 -0.15884 ...
$ woe.cut_purpose : num -0.0199 -0.0199 0.1853 0.1853 0.1853 ...
至此威根,我們已經(jīng)獲得了入模變量對應的woe值凤巨。值的注意的是,我們之前將好客戶設定為0洛搀,壞客戶設定為1敢茁,所以woe值越大,代表客戶違約的概率越大留美,但traindata中的woe實際是按照好客戶為1彰檬,壞客戶為0計算的,所以與之前變量分箱中計算的woe正好相反谎砾。
下面正式開始構建模型逢倍,并轉換為標準評分卡。
6 模型構建與驗證
6.1 構建邏輯回歸模型
> # 用獲得的woe數(shù)據(jù)進行邏輯回歸
> trainmodel<-glm(credit_risk~.,data=traindata,family = binomial())
> summary(trainmodel)
Call:
glm(formula = credit_risk ~ ., family = binomial(), data = traindata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7274 -0.7140 -0.4202 0.7805 2.5005
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9024 0.0924 -9.766 < 0.0000000000000002 ***
woe.status -0.7947 0.1205 -6.594 0.0000000000427 ***
woe.credit_history -0.8322 0.1712 -4.861 0.0000011658520 ***
woe.savings -0.8522 0.2102 -4.055 0.0000501789230 ***
woe.cut_duration -0.7785 0.1875 -4.151 0.0000330526950 ***
woe.cut_amount -0.7891 0.2259 -3.493 0.000478 ***
woe.cut_age -0.9719 0.2803 -3.468 0.000525 ***
woe.cut_rate -1.7358 0.5309 -3.269 0.001078 **
woe.cut_purpose -1.0409 0.3958 -2.630 0.008541 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 963.44 on 799 degrees of freedom
Residual deviance: 754.15 on 791 degrees of freedom
AIC: 772.15
Number of Fisher Scoring iterations: 5
從以上邏輯回歸的結果來看景图,各個變量都通過了顯著性檢驗较雕。同時,為防止多重共線性問題的出現(xiàn)挚币,我們對模型進行VIF檢驗:
> vif(trainmodel)
woe.status woe.credit_history woe.savings woe.cut_duration
1.043140 1.024758 1.039515 1.048207
woe.cut_amount woe.cut_age woe.cut_rate woe.cut_purpose
1.088700 1.038714 1.055447 1.036517
從結果可知亮蒋,所有變量VIF均小于4扣典,可以判斷模型中不存在多重共線性問題。
6.2 轉換為標準評分卡
根據(jù)信用評分卡模型的建立宛蚓,我們可以得到:
其中激捏,woe=ln(odds),odds為good_rate/bad_rate凄吏,beita為回歸系數(shù)远舅,altha為截距,n為變量個數(shù)痕钢,offset為偏移量(視風險偏好而定)图柏,比例因子factor。
假定odds=50任连,對應的評分為600蚤吹,在此基礎上評分值增加20分(1個PDO),可以使odds翻番随抠,則可以得出:
6.2.1 轉換為標準評分卡
獲取基礎分以及訓練集中各變量的分數(shù):
> # 設定alpha為比例因子factor裁着,beta為風險偏移量offset
> alpha_beta <- function(basepoints,baseodds,pdo){
+ alpha <- pdo/log(2)
+ beta <- basepoints-alpha*log(baseodds)
+ return(list(alpha=alpha,beta=beta))
+ }
> # 指定0dds=50時,基礎分為600分拱她,比率翻番的分數(shù)為20二驰,計算評分卡的系數(shù)alpha和beta
> (x <- alpha_beta(600,50,20))
$alpha
[1] 28.8539
$beta
[1] 487.1229
> # 獲得模型系數(shù)
> (coefficients <- trainmodel$coefficients)
(Intercept) woe.status woe.credit_history woe.savings
-0.9023751 -0.7946682 -0.8321739 -0.8522369
woe.cut_duration woe.cut_amount woe.cut_age woe.cut_rate
-0.7785099 -0.7891044 -0.9718816 -1.7357656
woe.cut_purpose
-1.0409225
> #構造計算分值函數(shù):
> vars_score<-function(i){
+ score = -round(x$alpha*coefficients[i]*traindata[,names(coefficients[i])])
+ return(score)
+ }
> # 計算基礎分值
> (basepoint <- round(x$beta-x$alpha*coefficients[1]))
(Intercept)
513
> # 1.status_score
> status_score <- vars_score(2)
> colnames(status_score)<-"status_score"
> # 2.credit_history_score
> credit_history_score <- vars_score(3)
> colnames(credit_history_score)<-"credit_history_score"
> # 3.savings_score
> savings_score <- vars_score(4)
> colnames(savings_score)<-"savings_score"
> # 4.duration_score
> duration_score <- vars_score(5)
> colnames(duration_score)<-"duration_score"
> # 5.amount_score
> amount_score <- vars_score(6)
> colnames(amount_score)<-"amount_score"
> # 6.age_score
> age_score <- vars_score(7)
> colnames(age_score)<-"age_score"
> # 7.rate_score
> rate_score <- vars_score(8)
> colnames(rate_score)<-"rate_score"
> # 8.purpose_score
> purpose_score <- vars_score(9)
> colnames(purpose_score)<-"purpose_score"
6.2.2 輸出標準評分卡
輸出CSV格式的標準評分卡:
> # 基礎分
> a <- c("","basepoint",513)
> b <- matrix(r1,nrow = 1)
> colnames(b)<-c("Variable","Basepoint","Score")
> #2.duration的分值
> duration_Cutpoint <- as.matrix(newtrain$cut_duration,stringsAsFactors=F)
> duration_scoreCard<-cbind(as.matrix(c("Duration","",""),ncol=1),
+ unique(cbind(duration_Cutpoint,duration_score)))
> #3.amount的分值
> amount_Cutpoint <- as.matrix(newtrain$cut_amount,stringsAsFactors=F)
> amount_scoreCard<-cbind(as.matrix(c("Amount","",""),ncol=1),
+ unique(cbind(amount_Cutpoint,amount_score)))
> #4.age的分值
> age_Cutpoint <- as.matrix(newtrain$cut_age,stringsAsFactors=F)
> age_scoreCard<-cbind(as.matrix(c("Age",""),ncol=1),
+ unique(cbind(age_Cutpoint,age_score)))
> #5.installment_rate的分值
> rate_Cutpoint <- as.matrix(newtrain$cut_rate,stringsAsFactors=F)
> rate_scoreCard<-cbind(as.matrix(c("Installment_rate","","",""),ncol=1),
+ unique(cbind(rate_Cutpoint,rate_score)))
> #6.status的分值
> status <- as.matrix(newtrain$status,stringsAsFactors=F)
> status_scoreCard<-cbind(as.matrix(c("Status","","",""),ncol=1),
+ unique(cbind(status,status_score)))
> #7.credit_history的分值
> credit_history <- as.matrix(newtrain$credit_history,stringsAsFactors=F)
> credit_history_scoreCard<-cbind(as.matrix(c("Credit_history","","","",""),ncol=1),
+ unique(cbind(credit_history,credit_history_score)))
> #8.savings的分值
> savings <- as.matrix(newtrain$savings,stringsAsFactors=F)
> savings_scoreCard<-cbind(as.matrix(c("Savings","","","",""),ncol=1),
+ unique(cbind(savings,savings_score)))
> #9.purpose的分值
> purpose <- as.matrix(newtrain$cut_purpose,stringsAsFactors=F)
> purpose_scoreCard<-cbind(as.matrix(c("Purpose","","",""),ncol=1),
+ unique(cbind(purpose,purpose_score)))
> scoreCard_CSV<-rbind(m1,duration_scoreCard,amount_scoreCard,age_scoreCard,
+ rate_scoreCard,status_scoreCard,credit_history_scoreCard,
+ savings_scoreCard,purpose_scoreCard)
> scoreCard_CSV<-rbind(b,duration_scoreCard,amount_scoreCard,age_scoreCard,
+ rate_scoreCard,status_scoreCard,credit_history_scoreCard,
+ savings_scoreCard,purpose_scoreCard)
> #輸出標準評分卡到文件中
> write.csv(scoreCard_CSV,"C:/Users/Administrator/Desktop/ScoreCard.CSV")
7 模型驗證
對測試集中的樣本做同樣的降維處理:
> # 對duration分段
> breaks1 <- c(0,7,30,Inf)
> test$cut_duration <- cut(test$duration,breaks = breaks1)
> # 對amount分段
> breaks2 <- c(0,1372,3913,Inf)
> test$cut_amount <- cut(test$amount,breaks = breaks2)
> # 對age分段
> breaks3 <- c(0,25,Inf)
> test$cut_age <- cut(test$age,breaks = breaks3)
> # 對installment_rate分段
> test$cut_rate <- cut(test$installment_rate,4)
> # 對purpose分段
> test <- within(test,{
+ cut_purpose <- NA
+ cut_purpose[purpose=="car (new)"] <- "car(new/used)"
+ cut_purpose[purpose=="car (used)"] <- "car(new/used)"
+ cut_purpose[purpose=="furniture/equipment"] <- "furniture/equipment/radio/television/domestic appliances"
+ cut_purpose[purpose=="radio/television"] <- "furniture/equipment/radio/television/domestic appliances"
+ cut_purpose[purpose=="domestic appliances"] <- "furniture/equipment/radio/television/domestic appliances"
+ cut_purpose[purpose=="repairs"] <- "repairs/business/others"
+ cut_purpose[purpose=="education"] <- "education/retraining"
+ cut_purpose[purpose=="retraining"] <- "education/retraining"
+ cut_purpose[purpose=="business"] <- "repairs/business/others"
+ cut_purpose[purpose=="others"] <- "repairs/business/others"})
> newtest <- cbind(test[,qualt_model_vars][-4],test[,c(21:26)])
> newtest$credit_risk <- as.factor(newtest$credit_risk)
> newtest$cut_purpose <- as.factor(newtest$cut_purpose)
> # 將newtest中的各個變量轉換為對應的woe值
> woemodel_test<-klaR::woe(credit_risk~.,data = newtest,zeroadj=0.5,applyontrain=TRUE)
> # 獲得woe數(shù)據(jù)框
> testdata <- predict(woemodel_test, newtest, replace = TRUE)
> # 測試集驗證
> prob <- predict(trainmodel,testdata,type="response")
> logit.pred <- ifelse(prob>0.5,1,0) #閾值簡單設為0.5
> (Freq <- table(logit.pred,testdata$credit_risk))
logit.pred 0 1
0 121 35
1 11 33
> # 準確率
> (ACC <- sum(diag(Freq))/sum(Freq))
[1] 0.77
> # AUC、Gini系數(shù)
> modelroc <- roc(testdata$credit_risk,prob)
> plot(modelroc, print.auc=TRUE, auc.polygon=T, grid=c(0.1, 0.2),
+ grid.col=c("green", "red"), max.auc.polygon=TRUE,
+ auc.polygon.col="skyblue", print.thres=T)
> modelauc<- auc(modelroc)
> (Gini <- 2*modelauc-1)
[1] 0.6106283
從以上結果可知秉沼,模型準確率ACC為0.77桶雀,AUC為0.805,Gini系數(shù)為0.61唬复,整體效果尚可矗积。
8 總結
本文通過對Germancredit數(shù)據(jù)的挖掘分析,從數(shù)據(jù)清洗敞咧、變量篩選棘捣、WOE計算、建模分析到模型驗證休建,創(chuàng)建了一個簡單的申請評分卡乍恐。
本文用到的數(shù)據(jù)集比較簡單,在實操中丰包,數(shù)據(jù)清洗應該會占用更多時間和精力。
本文僅進行了一次樣本抽樣壤巷,在實操中邑彪,應進行K折交叉檢驗,提升模型準確度胧华。
開發(fā)的模型是基于某一時間的特定樣本的寄症,隨著時間的推移和信貸政策的變化宙彪,樣本會發(fā)生變化,從而造成模型的區(qū)分能力和穩(wěn)定性變差有巧。一般需要定期對模型的使用情況進行檢測并報告模型區(qū)分能力和穩(wěn)定性的變化情況释漆,必要時應采取包括修正模型或重建模型等措施。這是后面需要認真學習的地方篮迎!