基于R語言的申請評分卡

1.引言

信貸行業(yè)中常見的評分卡包括:申請評分卡(Application)、行為評分卡(Behavior)厕怜、催收評分卡(Collection)以及反欺詐評分卡(Anti-Fraud)衩匣,簡稱為A卡、B卡粥航、C卡和F卡琅捏。
A卡,主要應用于貸前準入環(huán)節(jié)對新用戶的信用評級递雀。
B卡柄延,主要應用于貸中管理環(huán)節(jié)對存量用戶的行為預測。
C卡缀程,主要應用于貸后催收環(huán)節(jié)對存量用戶是否催收的預測管理搜吧。
F卡,主要應用于貸前準入環(huán)節(jié)對新用戶可能存在的欺詐行為進行預測杨凑。
本文通過歷史數(shù)據(jù)建立Logistic回歸模型滤奈,預測用戶出現(xiàn)違約的概率,從而建立申請評分卡模型撩满。
本文數(shù)據(jù)來自“klaR”包中的German credit data蜒程。

2.數(shù)據(jù)導入與觀察

加載要用到的數(shù)據(jù),并進行初步數(shù)據(jù)觀察:

> library(klaR)                     #數(shù)據(jù)集包
> library(VIM)                      #缺失值可視化
> library(party)                    #隨機森林
> library(InformationValue)         #求IV值
> library(smbinning)                #最優(yōu)分段
> library(ggplot2)                  #可視化
> library(gridExtra)                #可視化
> library(woe)                      #求woe值
> library(car)                      #檢驗多重共線性
> library(pROC)
> data(GermanCredit)
> str(GermanCredit)
略
> summary(GermanCredit)
略

該數(shù)據(jù)集包含了1000個樣本伺帘,每個樣本包括21個變量昭躺,變量含義如下:

變量解釋

3.數(shù)據(jù)清洗

數(shù)據(jù)清洗主要工作包括缺失值和異常值處理。

3.1 缺失值處理

查看缺失值情況:

> aggr(x = GermanCredit,prop=T,numbers=T,combined=F)
圖1

從以上結果可看出伪嫁,本數(shù)據(jù)集不存在缺失值窍仰。

3.2 異常值處理

查看定量指標異常值情況:

> quan_index <-c("duration","amount","installment_rate","present_residence",
+        "age","number_credits","people_liable")
> quan_vars <- GermanCredit[,quan_index]
> boxplot(scale(quan_vars),col="lightgray")
圖2

從圖2可以看出,定量指標中存在異常值礼殊。下面驹吮,讓我們具體來看一下:

> table(boxplot.stats(quan_variables$duration)$out)

45 47 48 54 60 72 
 5  1 48  2 13  1 
> table(boxplot.stats(quan_variables$amount)$out)

 7966  7980  8065  8072  8086  8133  8229  8318  8335  8358  8386  8471  8487  8588 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
 8613  8648  8858  8947  8978  9034  9055  9157  9271  9277  9283  9398  9436  9566 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
 9572  9629  9857  9960 10127 10144 10222 10297 10366 10477 10623 10722 10875 10961 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
10974 11054 11328 11560 11590 11760 11816 11938 11998 12169 12204 12389 12579 12612 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
12680 12749 12976 13756 14027 14179 14318 14421 14555 14782 14896 15653 15672 15857 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
15945 18424 
    1     1 
> table(boxplot.stats(quan_variables$age)$out)

65 66 67 68 70 74 75 
 5  5  3  3  1  4  2 
> table(boxplot.stats(quan_variables$number_credits)$out)

4 
6 
> table(boxplot.stats(quan_variables$people_liable)$out)

  2 
155 

根據(jù)具體情況來看针史,定量指標中存在的異常值是基本符合實際情況的,而且數(shù)據(jù)集樣本數(shù)量較少碟狞,因此不對異常值做處理啄枕。(本例比較特殊,實際工作中的情況肯定會比較復雜)

4.特征變量選擇

本數(shù)據(jù)集包含了定量和定性兩類指標族沃,接下來我們用不同的方法频祝,篩選出對違約狀態(tài)影響最大的指標,作為構建模型的變量脆淹。
首先常空,根據(jù)簡單隨機抽樣,將數(shù)據(jù)集劃分為訓練集和測試集:

> set.seed(1234)
> GermanCredit$credit_risk <- ifelse(GermanCredit$credit_risk=="good",0,1)
> sam <- sample(nrow(GermanCredit), 800, replace = F)
> train <- GermanCredit[sam, ]
> test <- GermanCredit[-sam, ]

4.1 定量指標

以下用隨機森林法和Logistic回歸方法盖溺,尋找對因變量影響最顯著的自變量:

> # 提取定量指標
> quant_vars<-c("duration","amount","installment_rate","present_residence","age",
+               "number_credits","people_liable","credit_risk")
> quant_data<-GermanCredit[,quant_vars]   
> # 隨機森林法
> fit1 <- cforest(credit_risk~.,data = quant_data,controls = cforest_unbiased(mtry = 2, ntree = 50))
> # 調整變量間的相關系數(shù)漓糙,獲取自變量的重要性
> sort(varimp(fit1,conditional = T),decreasing=T)
         duration               age            amount     people_liable  installment_rate 
     0.0046574142      0.0032983348      0.0028402490      0.0009693390      0.0007194029 
   number_credits present_residence 
     0.0005876221     -0.0001082808 
> # 調整樣本變量不平衡性,獲取自變量的重要性
> sort(varimpAUC(fit1),decreasing = T)
         duration            amount               age  installment_rate     people_liable 
     0.0185098322      0.0118307629      0.0089927377      0.0040057809      0.0026379584 
   number_credits present_residence 
     0.0011849564     -0.0009655443
> # Logistic回歸
> fit2 <- glm(credit_risk~.,data = quant_data,family = binomial())
> fit2 <- step(fit2,trace = 0)
> summary(fit2)

Call:
glm(formula = credit_risk ~ duration + installment_rate + age, 
    family = binomial(), data = quant_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4943  -0.8464  -0.6890   1.2015   2.2806  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -1.427935   0.362767  -3.936 8.28e-05 ***
duration          0.037535   0.006553   5.728 1.02e-08 ***
installment_rate  0.158810   0.074329   2.137  0.03263 *  
age              -0.021756   0.007686  -2.831  0.00464 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
---
略

綜合以上兩種方法的結果烘嘱,我們篩選出了對違約狀態(tài)影響最顯著的四個指標 :duration(3)昆禽、age(3)、amount(2)和installment_rate(2)蝇庭。

4.2 定性指標

通過R中的informationvalue包醉鳖,計算各指標的IV值,得到各定性指標間的重要性度量哮内,選取其中的high predictive指標:

> # 提取定性指標
> qualt_vars<-c("status","credit_history","purpose","savings","employment_duration",
+                "personal_status_sex","other_debtors","property",
+  "other_installment_plans","housing","job","telephone","foreign_worker","credit_risk")
> qualt_data<-train[,qualt_vars]
> # 求指標iv值
> all_iv <- data.frame(vars=qualt_vars,iv=numeric(length(qualt_vars)),
+                      strength=character(length(qualt_vars)),stringsAsFactors = F)
> for (i in qualt_vars){
+   all_iv[all_iv$vars==i,]$iv <- IV(X=qualt_data[,i], Y=qualt_data$credit_risk)
+   all_iv[all_iv$vars==i,]$strength <- attr(IV(X=qualt_data[,i], Y=qualt_data$credit_risk),"howgood")
+   }
> (all_iv<-all_iv[order(-all_iv$iv),] )
                      vars         iv            strength
1                   status 0.62161414   Highly Predictive
2           credit_history 0.28847143   Highly Predictive
4                  savings 0.21358882   Highly Predictive
3                  purpose 0.21326499   Highly Predictive
8                 property 0.09828292 Somewhat Predictive
5      employment_duration 0.08935235 Somewhat Predictive
10                 housing 0.07040316 Somewhat Predictive
6      personal_status_sex 0.06871949 Somewhat Predictive
9  other_installment_plans 0.04371291 Somewhat Predictive
7            other_debtors 0.04350575 Somewhat Predictive
13          foreign_worker 0.04262905 Somewhat Predictive
12               telephone 0.01669733      Not Predictive
11                     job 0.01002765      Not Predictive
14             credit_risk 0.00000000      Not Predictive

根據(jù)以上結果盗棵,我們選擇status、credit_history北发、savings和purpose四個high predictive指標構建模型漾根。
綜上,我們共選擇了8個變量作為入模變量鲫竞。

> # 入模指標
> quant_model_vars <- c("duration","amount","installment_rate","age")
> qualt_model_vars <- c("status","credit_history","savings","purpose")

5.WOE計算

5.1 變量分箱

5.1.1 定量指標

計算定量指標的WOE之前,需要先對定量指標進行分段逼蒙。下面从绘,優(yōu)先采用最優(yōu)分段,其原理是基于條件推理樹(conditional inference trees, Ctree)的遞歸分割算法是牢,核心算法用函數(shù)ctree()表示僵井。

> # 對duration進行最優(yōu)分段
> result1<-smbinning(df=train,y="credit_risk",x="duration",p=0.05)
> 查看分段效果
> smbinning.plot(result1,option="WoE",sub="Duration")
> breaks1 <- c(0,7,30,Inf)
> train$cut_duration <- cut(train$duration,breaks = breaks1)
圖3

從上圖可以看出,woe值相差較大驳棱,分段效果不錯批什。以下針對amount、age采用相同分段方法社搅。

> #對amount進行最優(yōu)分段
> result2<-smbinning(df=train,y="credit_risk",x="amount")
> smbinning.plot(result2,option="WoE",sub="Amount")
> breaks2 <-  c(0,1372,3913,Inf)
> train$cut_amount <- cut(train$amount,breaks = breaks2)
> result3<-smbinning(df=train,y="credit_risk",x="age")
> smbinning.plot(result3,option="WoE",sub="Age")
> breaks3 <- c(0,25,Inf)
> train$cut_age <- cut(train$age,breaks = breaks3)
> train$cut_rate <- cut(train$installment_rate,4)
> rate_woe <- woe(Data = train,Independent ="cut_rate",Continuous = F,Dependent = "credit_risk",
+                 C_Bin = 4,Bad = 0,Good = 1)
> ggplot(rate_woe, aes(x = BIN, y = WOE)) + 
+   geom_bar(stat = "identity",fill = "blue", colour = "grey60",
+            size = 0.2, alpha = 0.2)+
+   labs(title = "等距分段")+
+   theme(plot.title = element_text(hjust = 0.5))

installment_rate只有1驻债、2乳规、3、4四個值合呐,四個值對應的woe值差別較大暮的,且具有單調性,采用等距分段淌实。

圖4
5.1.2 定性指標

接下來我們需要對定性指標做必要的降維處理冻辩,方便計算其WOE值。首先拆祈,我們查看一下入模的定性指標的概況:

> discrete_data<-train[,qual_model_vars]
> summary(discrete_data)
                                        status                                        credit_history
 ... < 100 DM                              :211   no credits taken/all credits paid back duly: 34   
 0 <= ... < 200 DM                         :214   all credits at this bank paid back duly    : 41   
 ... >= 200 DM / salary for at least 1 year: 50   existing credits paid back duly till now   :422   
 no checking account                       :325   delay in paying off in the past            : 66   
                                                  critical account/other credits existing    :237   
                                                                                                    
                                                                                                    
                       savings                   purpose   
 ... < 100 DM              :478   domestic appliances:216  
 100 <= ... < 500 DM       : 84   car (new)          :189  
 500 <= ... < 1000 DM      : 48   radio/television   :145  
 ... >= 1000 DM            : 41   car (used)         : 83  
 unknown/no savings account:149   others             : 83  
                                  retraining         : 39  
                                  (Other)            : 45 

由以上概況可知恨闪,定性指標status、credit_history放坏、和savings的維數(shù)最高為5維咙咽,最低為4維,維數(shù)適中轻姿,可以不進行處理犁珠。
定性指標purpose的維數(shù)多于7維,明顯高于其他定性指標互亮。為了避免“維數(shù)災難”犁享,我們根據(jù)三條準則進行降維:1.維度間屬性相似;2.合并后woe有明顯變化豹休;3.單個維度樣本量不應過小炊昆。

> # 未進行降維前
> purpose_woe1 <- woe(Data = train,Independent ="purpose",Continuous = F,Dependent = "credit_risk",
+     C_Bin = 10,Bad = 0,Good = 1)
> ggplot(purpose_woe1, aes(x = BIN, y = WOE)) + 
+   geom_bar(stat = "identity",fill = "blue", colour = "grey60",size = 0.2, alpha = 0.2)+
+   labs(title = "Purpose")+
+   theme(plot.title = element_text(hjust = 0.5))
圖5
> # 類似屬性合并
> train <- within(train,{
+   cut_purpose <- NA
+   cut_purpose[purpose=="car (new)"] <- "car(new/used)"
+   cut_purpose[purpose=="car (used)"] <- "car(new/used)"
+   cut_purpose[purpose=="furniture/equipment"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="radio/television"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="domestic appliances"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="repairs"] <- "repairs/business/others"
+   cut_purpose[purpose=="education"] <- "education/retraining"
+   cut_purpose[purpose=="retraining"] <- "education/retraining"
+   cut_purpose[purpose=="business"] <- "repairs/business/others"
+   cut_purpose[purpose=="others"] <- "repairs/business/others"})
> purpose_woe2 <- woe(Data = train,Independent ="cut_purpose",Continuous = F,Dependent = "credit_risk",
+                       C_Bin = 10,Bad = 0,Good = 1)
>   ggplot(purpose_woe2, aes(x = BIN, y = WOE)) + 
+     geom_bar(stat = "identity",fill = "blue", colour = "grey60",
+              size = 0.2, alpha = 0.2)+
+     labs(title = "Purpose")+
+     theme(plot.title = element_text(hjust = 0.5))+
+     theme(axis.text.x = element_text(vjust = 0.2, hjust = 0.2, angle = 10)
圖6

5.2 WOE計算

用klaR包中的woe()函數(shù)獲取入模變量的woe值。

> newtrain <- cbind(discrete_data[,-4],train[,c(21:26)])
> str(newtrain)
'data.frame':   800 obs. of  9 variables:
 $ status        : Factor w/ 4 levels "... < 100 DM",..: 4 4 4 1 4 4 2 3 3 4 ...
 $ credit_history: Factor w/ 5 levels "no credits taken/all credits paid back duly",..: 5 5 3 3 5 4 5 3 3 3 ...
 $ savings       : Factor w/ 5 levels "... < 100 DM",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ credit_risk   : num  1 1 0 1 0 0 1 1 0 0 ...
 $ cut_duration  : Factor w/ 3 levels "(0,7]","(7,30]",..: 3 2 2 3 2 3 2 3 2 3 ...
 $ cut_amount    : Factor w/ 3 levels "(0,1.37e+03]",..: 3 2 2 2 2 3 3 3 1 3 ...
 $ cut_age       : Factor w/ 2 levels "(0,25]","(25,Inf]": 1 2 2 1 2 1 2 2 1 2 ...
 $ cut_rate      : Factor w/ 4 levels "(0.997,1.75]",..: 4 3 4 4 4 2 4 4 3 4 ...
 $ cut_purpose   : chr  "car(new/used)" "car(new/used)" "furniture/equipment/radio/television/domestic appliances" "furniture/equipment/radio/television/domestic appliances" ...
> newtrain$credit_risk <- as.factor(newtrain$credit_risk)
> newtrain$cut_purpose <- as.factor(newtrain$cut_purpose)
>  # 獲取woe值
> woemodel<-klaR::woe(credit_risk~.,data = newtrain,zeroadj=0.5,applyontrain=TRUE)
> traindata <- predict(woemodel, newtrain, replace = TRUE)  
> str(traindata)
'data.frame':   800 obs. of  9 variables:
 $ credit_risk       : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 2 2 1 1 ...
 $ woe.status        : num  1.127 1.127 1.127 -0.772 1.127 ...
 $ woe.credit_history: num  0.6989 0.6989 -0.0638 -0.0638 0.6989 ...
 $ woe.savings       : num  -0.292 -0.292 -0.292 -0.292 -0.292 ...
 $ woe.cut_duration  : num  -0.8058 0.0808 0.0808 -0.8058 0.0808 ...
 $ woe.cut_amount    : num  -0.55 0.419 0.419 0.419 0.419 ...
 $ woe.cut_age       : num  -0.612 0.161 0.161 -0.612 0.161 ...
 $ woe.cut_rate      : num  -0.15884 -0.00808 -0.15884 -0.15884 -0.15884 ...
 $ woe.cut_purpose   : num  -0.0199 -0.0199 0.1853 0.1853 0.1853 ...

至此威根,我們已經(jīng)獲得了入模變量對應的woe值凤巨。值的注意的是,我們之前將好客戶設定為0洛搀,壞客戶設定為1敢茁,所以woe值越大,代表客戶違約的概率越大留美,但traindata中的woe實際是按照好客戶為1彰檬,壞客戶為0計算的,所以與之前變量分箱中計算的woe正好相反谎砾。
下面正式開始構建模型逢倍,并轉換為標準評分卡。

6 模型構建與驗證

6.1 構建邏輯回歸模型

> # 用獲得的woe數(shù)據(jù)進行邏輯回歸
> trainmodel<-glm(credit_risk~.,data=traindata,family = binomial())
> summary(trainmodel)

Call:
glm(formula = credit_risk ~ ., family = binomial(), data = traindata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7274  -0.7140  -0.4202   0.7805   2.5005  

Coefficients:
                   Estimate Std. Error z value             Pr(>|z|)    
(Intercept)         -0.9024     0.0924  -9.766 < 0.0000000000000002 ***
woe.status          -0.7947     0.1205  -6.594      0.0000000000427 ***
woe.credit_history  -0.8322     0.1712  -4.861      0.0000011658520 ***
woe.savings         -0.8522     0.2102  -4.055      0.0000501789230 ***
woe.cut_duration    -0.7785     0.1875  -4.151      0.0000330526950 ***
woe.cut_amount      -0.7891     0.2259  -3.493             0.000478 ***
woe.cut_age         -0.9719     0.2803  -3.468             0.000525 ***
woe.cut_rate        -1.7358     0.5309  -3.269             0.001078 ** 
woe.cut_purpose     -1.0409     0.3958  -2.630             0.008541 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 963.44  on 799  degrees of freedom
Residual deviance: 754.15  on 791  degrees of freedom
AIC: 772.15

Number of Fisher Scoring iterations: 5

從以上邏輯回歸的結果來看景图,各個變量都通過了顯著性檢驗较雕。同時,為防止多重共線性問題的出現(xiàn)挚币,我們對模型進行VIF檢驗:

> vif(trainmodel)
        woe.status woe.credit_history        woe.savings   woe.cut_duration 
          1.043140           1.024758           1.039515           1.048207 
    woe.cut_amount        woe.cut_age       woe.cut_rate    woe.cut_purpose 
          1.088700           1.038714           1.055447           1.036517 

從結果可知亮蒋,所有變量VIF均小于4扣典,可以判斷模型中不存在多重共線性問題。

6.2 轉換為標準評分卡

根據(jù)信用評分卡模型的建立宛蚓,我們可以得到:


其中激捏,woe=ln(odds),odds為good_rate/bad_rate凄吏,beita為回歸系數(shù)远舅,altha為截距,n為變量個數(shù)痕钢,offset為偏移量(視風險偏好而定)图柏,比例因子factor。
假定odds=50任连,對應的評分為600蚤吹,在此基礎上評分值增加20分(1個PDO),可以使odds翻番随抠,則可以得出:

6.2.1 轉換為標準評分卡

獲取基礎分以及訓練集中各變量的分數(shù):

> # 設定alpha為比例因子factor裁着,beta為風險偏移量offset
> alpha_beta <- function(basepoints,baseodds,pdo){
+   alpha <- pdo/log(2)
+   beta <- basepoints-alpha*log(baseodds)
+   return(list(alpha=alpha,beta=beta))
+ }
> # 指定0dds=50時,基礎分為600分拱她,比率翻番的分數(shù)為20二驰,計算評分卡的系數(shù)alpha和beta
> (x <- alpha_beta(600,50,20))
$alpha
[1] 28.8539

$beta
[1] 487.1229
> # 獲得模型系數(shù)
> (coefficients <- trainmodel$coefficients)
       (Intercept)         woe.status woe.credit_history        woe.savings 
        -0.9023751         -0.7946682         -0.8321739         -0.8522369 
  woe.cut_duration     woe.cut_amount        woe.cut_age       woe.cut_rate 
        -0.7785099         -0.7891044         -0.9718816         -1.7357656 
   woe.cut_purpose 
        -1.0409225 
> #構造計算分值函數(shù):
> vars_score<-function(i){
+   score = -round(x$alpha*coefficients[i]*traindata[,names(coefficients[i])])
+   return(score)
+ }
> # 計算基礎分值
> (basepoint <- round(x$beta-x$alpha*coefficients[1]))
(Intercept) 
        513 
> # 1.status_score
> status_score <- vars_score(2)
> colnames(status_score)<-"status_score" 
> # 2.credit_history_score
> credit_history_score <- vars_score(3)
> colnames(credit_history_score)<-"credit_history_score"
> # 3.savings_score
> savings_score <- vars_score(4)
> colnames(savings_score)<-"savings_score"
> # 4.duration_score
> duration_score <- vars_score(5)
> colnames(duration_score)<-"duration_score"
> # 5.amount_score
> amount_score <- vars_score(6)
> colnames(amount_score)<-"amount_score"
> # 6.age_score
> age_score <- vars_score(7)
> colnames(age_score)<-"age_score"
> # 7.rate_score
> rate_score <- vars_score(8)
> colnames(rate_score)<-"rate_score"
> # 8.purpose_score
> purpose_score <- vars_score(9)
> colnames(purpose_score)<-"purpose_score"
6.2.2 輸出標準評分卡

輸出CSV格式的標準評分卡:

> # 基礎分
> a <- c("","basepoint",513)
> b <- matrix(r1,nrow = 1)
> colnames(b)<-c("Variable","Basepoint","Score")
> #2.duration的分值
> duration_Cutpoint <- as.matrix(newtrain$cut_duration,stringsAsFactors=F)
> duration_scoreCard<-cbind(as.matrix(c("Duration","",""),ncol=1),
+                           unique(cbind(duration_Cutpoint,duration_score)))
> #3.amount的分值
> amount_Cutpoint <- as.matrix(newtrain$cut_amount,stringsAsFactors=F)
> amount_scoreCard<-cbind(as.matrix(c("Amount","",""),ncol=1),
+                         unique(cbind(amount_Cutpoint,amount_score)))
> #4.age的分值
> age_Cutpoint <- as.matrix(newtrain$cut_age,stringsAsFactors=F)
> age_scoreCard<-cbind(as.matrix(c("Age",""),ncol=1),
+                      unique(cbind(age_Cutpoint,age_score)))
> #5.installment_rate的分值
> rate_Cutpoint <- as.matrix(newtrain$cut_rate,stringsAsFactors=F)
> rate_scoreCard<-cbind(as.matrix(c("Installment_rate","","",""),ncol=1),
+                       unique(cbind(rate_Cutpoint,rate_score)))
> #6.status的分值
> status <- as.matrix(newtrain$status,stringsAsFactors=F)
> status_scoreCard<-cbind(as.matrix(c("Status","","",""),ncol=1),
+                         unique(cbind(status,status_score)))
> #7.credit_history的分值
> credit_history <- as.matrix(newtrain$credit_history,stringsAsFactors=F)
> credit_history_scoreCard<-cbind(as.matrix(c("Credit_history","","","",""),ncol=1),
+                                 unique(cbind(credit_history,credit_history_score)))
> #8.savings的分值
> savings <- as.matrix(newtrain$savings,stringsAsFactors=F)
> savings_scoreCard<-cbind(as.matrix(c("Savings","","","",""),ncol=1),
+                          unique(cbind(savings,savings_score)))
> #9.purpose的分值
> purpose <- as.matrix(newtrain$cut_purpose,stringsAsFactors=F)
> purpose_scoreCard<-cbind(as.matrix(c("Purpose","","",""),ncol=1),
+                          unique(cbind(purpose,purpose_score)))
> scoreCard_CSV<-rbind(m1,duration_scoreCard,amount_scoreCard,age_scoreCard,
+                      rate_scoreCard,status_scoreCard,credit_history_scoreCard,
+                      savings_scoreCard,purpose_scoreCard)
> scoreCard_CSV<-rbind(b,duration_scoreCard,amount_scoreCard,age_scoreCard,
+                      rate_scoreCard,status_scoreCard,credit_history_scoreCard,
+                      savings_scoreCard,purpose_scoreCard)
> #輸出標準評分卡到文件中
> write.csv(scoreCard_CSV,"C:/Users/Administrator/Desktop/ScoreCard.CSV")

7 模型驗證

對測試集中的樣本做同樣的降維處理:

> # 對duration分段
> breaks1 <- c(0,7,30,Inf)
> test$cut_duration <- cut(test$duration,breaks = breaks1)
> # 對amount分段
> breaks2 <-  c(0,1372,3913,Inf)
> test$cut_amount <- cut(test$amount,breaks = breaks2)
> # 對age分段
> breaks3 <- c(0,25,Inf)
> test$cut_age <- cut(test$age,breaks = breaks3)
> # 對installment_rate分段
> test$cut_rate <- cut(test$installment_rate,4)
> # 對purpose分段
> test <- within(test,{
+   cut_purpose <- NA
+   cut_purpose[purpose=="car (new)"] <- "car(new/used)"
+   cut_purpose[purpose=="car (used)"] <- "car(new/used)"
+   cut_purpose[purpose=="furniture/equipment"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="radio/television"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="domestic appliances"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="repairs"] <- "repairs/business/others"
+   cut_purpose[purpose=="education"] <- "education/retraining"
+   cut_purpose[purpose=="retraining"] <- "education/retraining"
+   cut_purpose[purpose=="business"] <- "repairs/business/others"
+   cut_purpose[purpose=="others"] <- "repairs/business/others"}) 
> newtest <- cbind(test[,qualt_model_vars][-4],test[,c(21:26)])
> newtest$credit_risk <- as.factor(newtest$credit_risk)
> newtest$cut_purpose <- as.factor(newtest$cut_purpose)
> # 將newtest中的各個變量轉換為對應的woe值
> woemodel_test<-klaR::woe(credit_risk~.,data = newtest,zeroadj=0.5,applyontrain=TRUE)
> # 獲得woe數(shù)據(jù)框
> testdata <- predict(woemodel_test, newtest, replace = TRUE)  
> # 測試集驗證
> prob <- predict(trainmodel,testdata,type="response")
> logit.pred <- ifelse(prob>0.5,1,0)  #閾值簡單設為0.5
> (Freq <- table(logit.pred,testdata$credit_risk))
          
logit.pred   0   1
         0 121  35
         1  11  33 
> # 準確率
> (ACC <- sum(diag(Freq))/sum(Freq))
[1] 0.77
> # AUC、Gini系數(shù)
> modelroc <- roc(testdata$credit_risk,prob)
> plot(modelroc, print.auc=TRUE, auc.polygon=T, grid=c(0.1, 0.2),
+      grid.col=c("green", "red"), max.auc.polygon=TRUE,
+      auc.polygon.col="skyblue", print.thres=T)
> modelauc<- auc(modelroc)
> (Gini <- 2*modelauc-1)
[1] 0.6106283
圖7

從以上結果可知秉沼,模型準確率ACC為0.77桶雀,AUC為0.805,Gini系數(shù)為0.61唬复,整體效果尚可矗积。

8 總結

本文通過對Germancredit數(shù)據(jù)的挖掘分析,從數(shù)據(jù)清洗敞咧、變量篩選棘捣、WOE計算、建模分析到模型驗證休建,創(chuàng)建了一個簡單的申請評分卡乍恐。
本文用到的數(shù)據(jù)集比較簡單,在實操中丰包,數(shù)據(jù)清洗應該會占用更多時間和精力。
本文僅進行了一次樣本抽樣壤巷,在實操中邑彪,應進行K折交叉檢驗,提升模型準確度胧华。
開發(fā)的模型是基于某一時間的特定樣本的寄症,隨著時間的推移和信貸政策的變化宙彪,樣本會發(fā)生變化,從而造成模型的區(qū)分能力和穩(wěn)定性變差有巧。一般需要定期對模型的使用情況進行檢測并報告模型區(qū)分能力和穩(wěn)定性的變化情況释漆,必要時應采取包括修正模型或重建模型等措施。這是后面需要認真學習的地方篮迎!

參考

信用標準評分卡模型開發(fā)及實現(xiàn)
信用評分卡模型的建立

最后編輯于
?著作權歸作者所有,轉載或內容合作請聯(lián)系作者
  • 序言:七十年代末男图,一起剝皮案震驚了整個濱河市,隨后出現(xiàn)的幾起案子甜橱,更是在濱河造成了極大的恐慌逊笆,老刑警劉巖,帶你破解...
    沈念sama閱讀 217,277評論 6 503
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件岂傲,死亡現(xiàn)場離奇詭異难裆,居然都是意外死亡,警方通過查閱死者的電腦和手機镊掖,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 92,689評論 3 393
  • 文/潘曉璐 我一進店門乃戈,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人亩进,你說我怎么就攤上這事症虑。” “怎么了镐侯?”我有些...
    開封第一講書人閱讀 163,624評論 0 353
  • 文/不壞的土叔 我叫張陵侦讨,是天一觀的道長。 經(jīng)常有香客問我苟翻,道長韵卤,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 58,356評論 1 293
  • 正文 為了忘掉前任崇猫,我火速辦了婚禮沈条,結果婚禮上,老公的妹妹穿的比我還像新娘诅炉。我一直安慰自己蜡歹,他們只是感情好,可當我...
    茶點故事閱讀 67,402評論 6 392
  • 文/花漫 我一把揭開白布涕烧。 她就那樣靜靜地躺著月而,像睡著了一般。 火紅的嫁衣襯著肌膚如雪议纯。 梳的紋絲不亂的頭發(fā)上父款,一...
    開封第一講書人閱讀 51,292評論 1 301
  • 那天,我揣著相機與錄音,去河邊找鬼憨攒。 笑死世杀,一個胖子當著我的面吹牛,可吹牛的內容都是我干的肝集。 我是一名探鬼主播瞻坝,決...
    沈念sama閱讀 40,135評論 3 418
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢啊……” “哼杏瞻!你這毒婦竟也來了所刀?” 一聲冷哼從身側響起,我...
    開封第一講書人閱讀 38,992評論 0 275
  • 序言:老撾萬榮一對情侶失蹤伐憾,失蹤者是張志新(化名)和其女友劉穎勉痴,沒想到半個月后,有當?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體树肃,經(jīng)...
    沈念sama閱讀 45,429評論 1 314
  • 正文 獨居荒郊野嶺守林人離奇死亡蒸矛,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內容為張勛視角 年9月15日...
    茶點故事閱讀 37,636評論 3 334
  • 正文 我和宋清朗相戀三年,在試婚紗的時候發(fā)現(xiàn)自己被綠了胸嘴。 大學時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片雏掠。...
    茶點故事閱讀 39,785評論 1 348
  • 序言:一個原本活蹦亂跳的男人離奇死亡,死狀恐怖劣像,靈堂內的尸體忽然破棺而出乡话,到底是詐尸還是另有隱情,我是刑警寧澤耳奕,帶...
    沈念sama閱讀 35,492評論 5 345
  • 正文 年R本政府宣布绑青,位于F島的核電站,受9級特大地震影響屋群,放射性物質發(fā)生泄漏闸婴。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點故事閱讀 41,092評論 3 328
  • 文/蒙蒙 一芍躏、第九天 我趴在偏房一處隱蔽的房頂上張望邪乍。 院中可真熱鬧斟叼,春花似錦吏饿、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,723評論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽吕晌。三九已至,卻和暖如春临燃,著一層夾襖步出監(jiān)牢的瞬間睛驳,已是汗流浹背壁拉。 一陣腳步聲響...
    開封第一講書人閱讀 32,858評論 1 269
  • 我被黑心中介騙來泰國打工, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留柏靶,地道東北人。 一個月前我還...
    沈念sama閱讀 47,891評論 2 370
  • 正文 我出身青樓溃论,卻偏偏與公主長得像屎蜓,于是被迫代替她去往敵國和親。 傳聞我的和親對象是個殘疾皇子钥勋,可洞房花燭夜當晚...
    茶點故事閱讀 44,713評論 2 354

推薦閱讀更多精彩內容