Background:
Lending Club was founded in 2006 and is headquartered in San Francisco. Its main business is acting as an intermediary platform for P2P loans. In its early years the company offered only personal loans. When a borrower applies for a loan on the platform, Lending Club has the customer fill in a loan application form, online or offline, to collect basic information, and it also draws on data from third-party credit bureaus.
By fitting a logistic regression model on these attributes, Lending Club can predict whether a borrower will default and so decide whether to issue the loan.
Dataset source: the LendingClub official website, data from 2007 to 2011:
https://www.lendingclub.com/statistics/additional-statistics?
Import the packages and the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
%matplotlib inline
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
Loandata = pd.read_csv('C:/Users/Jason/Desktop/DAdata/LoanStats3a_securev2.csv',skiprows=1)
I. A first look at the dataset
Loandata.shape
(39786, 150)
Each row is one loan record with 150 fields. The fields are as follows:
Loandata.iloc[0]
Inspect the fields of the first record
II. Data preprocessing before the visual analysis
1. Drop columns that hold only a single value
orig_columns = Loandata.columns
drop_columns = []
for col in orig_columns:
    col_series = Loandata[col].dropna().unique()  # distinct non-null values
    if len(col_series) == 1:  # only one distinct value, so the feature carries no information
        drop_columns.append(col)
Loandata = Loandata.drop(drop_columns, axis=1)
print(drop_columns)
['pymnt_plan', 'out_prncp', 'next_pymnt_d', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'application_type', 'verification_status_joint', 'acc_now_delinq', 'bc_util', 'chargeoff_within_12_mths', 'delinq_amnt', 'percent_bc_gt_75', 'tax_liens', 'sec_app_mths_since_last_major_derog', 'hardship_flag', 'hardship_last_payment_amount']
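The same filter can be written without an explicit loop, since `DataFrame.nunique` counts distinct non-null values per column. A minimal sketch on a made-up frame (the column names `amount`, `flag` and `empty` are hypothetical, not taken from the Lending Club file):

```python
import pandas as pd

# Toy frame: "flag" has one distinct value, "empty" has only nulls.
toy = pd.DataFrame({
    "amount": [1000, 2500, 4000],
    "flag": ["n", "n", None],
    "empty": [None, None, None],
})

# nunique() ignores nulls by default, like dropna().unique() in the loop.
n_unique = toy.nunique()
single_valued = n_unique[n_unique == 1].index.tolist()
reduced = toy.drop(columns=single_valued)

print(single_valued)
print(reduced.columns.tolist())
```

Note that a column containing only nulls has zero distinct values, so it survives this step, just as it does in the loop; the missing-value filter in the next step is what removes it.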
2. Drop fields in which more than half of the values are missing
half_count = len(Loandata)/2
Loandata = Loandata.dropna(thresh=half_count,axis=1)
Loandata.shape
(39786, 50)
50 fields remain
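The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-null values a column needs in order to be kept, not the number of nulls allowed. A small sketch on invented data (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "full": [1.0, 2.0, 3.0, 4.0],
    "half": [1.0, np.nan, 3.0, np.nan],
    "sparse": [np.nan, np.nan, np.nan, 4.0],
})

# Keep columns with at least half of their rows filled (2 of 4 here).
min_non_null = len(toy) // 2
kept = toy.dropna(thresh=min_non_null, axis=1)

print(kept.columns.tolist())
```

A column that is exactly half filled is kept, so the rule is really "drop columns with more than half missing".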
Loandata.isnull().sum()
Check which fields contain null values
id 0
loan_amnt 0
funded_amnt 0
funded_amnt_inv 0
term 0
int_rate 0
installment 0
grade 0
sub_grade 0
emp_title 2467
emp_length 1078
home_ownership 0
annual_inc 0
verification_status 0
issue_d 0
loan_status 0
url 0
desc 12967
purpose 0
title 11
zip_code 0
addr_state 0
dti 0
delinq_2yrs 0
earliest_cr_line 0
fico_range_low 0
fico_range_high 0
inq_last_6mths 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 50
total_acc 0
initial_list_status 0
out_prncp_inv 0
total_pymnt 0
total_pymnt_inv 0
total_rec_prncp 0
total_rec_int 0
total_rec_late_fee 0
recoveries 0
collection_recovery_fee 0
last_pymnt_d 71
last_pymnt_amnt 1
last_credit_pull_d 2
last_fico_range_high 0
last_fico_range_low 0
policy_code 0
pub_rec_bankruptcies 697
debt_settlement_flag 1
dtype: int64
Columns with many nulls, such as desc and emp_title, do not help the analysis or the model, so they are dropped. id, url, title and zip_code are dropped along with them.
Loandata = Loandata.drop(['id','url','desc','title','emp_title','zip_code'],axis=1)
Loandata.isnull().sum()
loan_amnt 0
funded_amnt 0
funded_amnt_inv 0
term 0
int_rate 0
installment 0
grade 0
sub_grade 0
emp_length 1078
home_ownership 0
annual_inc 0
verification_status 0
issue_d 0
loan_status 0
purpose 0
addr_state 0
dti 0
delinq_2yrs 0
earliest_cr_line 0
fico_range_low 0
fico_range_high 0
inq_last_6mths 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 50
total_acc 0
initial_list_status 0
out_prncp_inv 0
total_pymnt 0
total_pymnt_inv 0
total_rec_prncp 0
total_rec_int 0
total_rec_late_fee 0
recoveries 0
collection_recovery_fee 0
last_pymnt_d 71
last_pymnt_amnt 1
last_credit_pull_d 2
last_fico_range_high 0
last_fico_range_low 0
policy_code 0
pub_rec_bankruptcies 697
debt_settlement_flag 1
dtype: int64
# Encode emp_length with a hand-written mapping (missing values become 0)
label_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        None: 0
    }
}
Loandata = Loandata.replace(label_dict)
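An alternative way to express the same encoding is `Series.map` followed by `fillna`, which makes the handling of missing values explicit. This is a sketch on a toy series, not the author's code:

```python
import pandas as pd

emp = pd.Series(["10+ years", "< 1 year", "3 years", None, "1 year"])

# Build the same mapping as above: "2 years".."9 years" map to 2..9.
mapping = {"< 1 year": 0, "1 year": 1, "10+ years": 10}
mapping.update({f"{n} years": n for n in range(2, 10)})

# map() gives NaN for anything not in the mapping, including missing values;
# fillna(0) then mirrors the None -> 0 rule used above.
encoded = emp.map(mapping).fillna(0).astype(int)

print(encoded.tolist())
```

Either way, a missing employment length ends up indistinguishable from "< 1 year", which is a modelling choice worth keeping in mind.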
3. Convert the issue_d column from string to datetime, check whether the conversion produced any nulls, then sort by date
Loandata['issue_d'] = pd.to_datetime(Loandata['issue_d'])
Loandata['issue_d'].isnull().any()
-->> False
# Sort by date
Loandata = Loandata.sort_values(by=['issue_d'],ascending=True)
Loandata = Loandata.reset_index(drop=True)
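If the date strings have a known layout, passing an explicit `format` avoids ambiguous parsing, and `errors="coerce"` turns unparseable entries into `NaT` so the null check is meaningful. The sketch below assumes month-year strings such as "Dec-2011"; that layout is an assumption and should be checked against the actual file.

```python
import pandas as pd

# Hypothetical sample values; the third entry is deliberately invalid.
raw = pd.Series(["Dec-2011", "Jun-2007", "not a date"])

parsed = pd.to_datetime(raw, format="%b-%Y", errors="coerce")
n_failed = int(parsed.isnull().sum())

# Drop failures, then sort oldest first and renumber the index.
ordered = parsed.dropna().sort_values().reset_index(drop=True)

print(n_failed)
print(ordered.dt.year.tolist())
```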
Clean up the interest rate field: strip the % sign and convert to float
Loandata["int_rate"] = Loandata["int_rate"].str.rstrip("%").astype("float")
III. A preliminary analysis of the data
1. States with the most borrowers:
Loandata.addr_state.value_counts()[:20].plot(kind='bar', figsize=(8, 4),title='StateLoan Count')
Lending Club is headquartered in California and has developed its local business most deeply, so California has far more loans than any other state, followed by New York, Florida and Texas.
2. Check the bad-debt rate
Loandata['loan_status'].value_counts()
-->>Fully Paid 34116
Charged Off 5670
Name: loan_status, dtype: int64
# Encode the repayment outcome
badloan = ['Charged Off']
Loandata['loan_condition'] = np.nan
def loan_condition(status):
    if status in badloan:
        return 0
    else:
        return 1
Loandata['loan_condition'] = Loandata['loan_status'].apply(loan_condition)
print('goodloan 1: badloan 0')
print(Loandata['loan_condition'].value_counts())
-->>goodloan 1: badloan 0
1 34116
0 5670
Name: loan_condition, dtype: int64
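The counts above translate directly into a bad-debt rate. A short sketch using only the two numbers printed by `value_counts()`:

```python
# Counts taken from the value_counts() output above.
charged_off = 5670
fully_paid = 34116

total = charged_off + fully_paid
bad_rate = charged_off / total

print(total)
print(f"{bad_rate:.2%}")
```

So roughly one loan in seven was charged off. On the real frame, `Loandata['loan_condition'].value_counts(normalize=True)` should give the same proportions without manual arithmetic.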
3. Loan volume per year
Loandata['year'] = Loandata['issue_d'].dt.year
sns.countplot(x='year', data=Loandata)
plt.title('Loan Count by Year', fontsize=10)
The number of loans, and with it the total amount lent, rises year over year.
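A count plot shows only the number of loans per year. To check the amount lent as well, one option is to aggregate `loan_amnt` by year. A sketch on invented numbers (the toy values are not from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2007, 2008, 2008, 2009, 2009, 2009],
    "loan_amnt": [5000, 3000, 7000, 10000, 12000, 8000],
})

# One row per year: number of loans and total amount lent.
per_year = toy.groupby("year")["loan_amnt"].agg(["count", "sum"])

print(per_year)
```

Calling `per_year["sum"].plot(kind="bar")` on the real data would give the amount-by-year chart that the count plot cannot show.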
4. Loan amounts and terms chosen by customers
plt.hist(Loandata.loan_amnt,bins=10,edgecolor='white',color='dodgerblue')
Loandata['term'].value_counts()
-->> 36 months 29096
60 months 10690
Name: term, dtype: int64
Loans of 4,000 to 12,000 are the most common, and most borrowers choose a 36-month term.
5. Range of interest rates
print(Loandata.int_rate.describe())
sns.distplot(Loandata.int_rate)
-->>count 39786.000000
mean 12.027873
std 3.727466
min 5.420000
25% 9.250000
50% 11.860000
75% 14.590000
max 24.590000
Name: int_rate, dtype: float64
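The mean interest rate is about 12%, and rates range from 5.42% to 24.59%.
IV. The preliminary analysis is done and modeling comes next. Before that, the data needs more processing: drop fields that contribute little to the model, to reduce computation, and, because scikit-learn does not accept string data, handle missing values, punctuation, % signs and other string values.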
Loandata.columns
-->>Index(['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate',
'installment', 'grade', 'sub_grade', 'emp_length', 'home_ownership',
'annual_inc', 'verification_status', 'issue_d', 'loan_status',
'purpose', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
'fico_range_low', 'fico_range_high', 'inq_last_6mths', 'open_acc',
'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
'initial_list_status', 'out_prncp_inv', 'total_pymnt',
'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
'last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d',
'last_fico_range_high', 'last_fico_range_low', 'policy_code',
'pub_rec_bankruptcies', 'debt_settlement_flag', 'loan_condition',
'year'],
dtype='object')
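Quite a few fields are still left. In real work, deciding which fields to keep and which to drop is a substantial task in its own right. Here I simply drop some fields that are not useful for modeling, such as the principal received to date and the funded amounts.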
Loandata = Loandata.drop(["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "issue_d"], axis=1)
Loandata = Loandata.drop(["out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
Loandata = Loandata.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
Loandata.head(1)
-->>
loan_amnt term int_rate installment emp_length home_ownership annual_inc verification_status loan_status purpose ... total_acc initial_list_status last_credit_pull_d last_fico_range_high last_fico_range_low policy_code pub_rec_bankruptcies debt_settlement_flag loan_condition year
0 7500 36 months 13.75 255.43 0 OWN 22000.0 Not Verified Fully Paid debt_consolidation ... 8 f 20-Jan 719 715 1 NaN N 1 2007
31 fields remain
null_counts = Loandata.isnull().sum()
null_counts
-->>
loan_amnt 0
term 0
int_rate 0
installment 0
emp_length 0
home_ownership 0
annual_inc 0
verification_status 0
loan_status 0
purpose 0
addr_state 0
dti 0
delinq_2yrs 0
earliest_cr_line 0
fico_range_low 0
fico_range_high 0
inq_last_6mths 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 50
total_acc 0
initial_list_status 0
last_credit_pull_d 2
last_fico_range_high 0
last_fico_range_low 0
policy_code 0
pub_rec_bankruptcies 697
debt_settlement_flag 1
loan_condition 0
year 0
dtype: int64
revol_util: strip the % sign and convert to float
Loandata["revol_util"] = Loandata["revol_util"].str.rstrip("%").astype("float")
There are not many missing values, so dropping those rows does little harm. They could also be filled with the maximum, minimum, mean and so on.
Loandata = Loandata.drop("pub_rec_bankruptcies", axis=1)
Loandata = Loandata.dropna(axis=0)
Loandata = Loandata.drop(['debt_settlement_flag', 'policy_code','initial_list_status','earliest_cr_line','addr_state','loan_status'],axis=1)
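For comparison, filling instead of dropping could look like the sketch below. It uses the median rather than one of the statistics named above because the median is less sensitive to outliers; that choice is an assumption of this example, not something the walkthrough does.

```python
import numpy as np
import pandas as pd

# Toy column standing in for revol_util after the % sign is stripped.
toy = pd.DataFrame({"revol_util": [51.5, np.nan, 10.2, 75.8]})

median = toy["revol_util"].median()   # computed over non-null values only
filled = toy["revol_util"].fillna(median)

print(median)
print(filled.tolist())
```

Filling keeps every row, which matters more when missing values are concentrated in one class.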
Label-encode the remaining string fields
import sklearn.preprocessing as sp
lbe = sp.LabelEncoder()
Loandata['home_ownership'] = lbe.fit_transform(Loandata['home_ownership'])
lbe = sp.LabelEncoder()
Loandata['verification_status'] = lbe.fit_transform(Loandata['verification_status'])
lbe = sp.LabelEncoder()
Loandata['purpose'] = lbe.fit_transform(Loandata['purpose'])
lbe = sp.LabelEncoder()
Loandata['term'] = lbe.fit_transform(Loandata['term'])
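`LabelEncoder` assigns integer codes in sorted order, so a linear model such as logistic regression will read `purpose` or `home_ownership` as if the categories had a numeric ranking. One common alternative for unordered categories is one-hot encoding. A minimal sketch with `pandas.get_dummies` on made-up values (this is not what the walkthrough does, only an option):

```python
import pandas as pd

toy = pd.DataFrame({"home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"]})

# One 0/1 column per category; the prefix "home" is an arbitrary choice.
dummies = pd.get_dummies(toy["home_ownership"], prefix="home", dtype=int)

print(dummies.columns.tolist())
print(dummies.sum(axis=1).tolist())
```

Each row has exactly one column set to 1, so no artificial ordering is introduced, at the cost of more columns.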
Convert the remaining numeric fields to int
Loandata['total_acc'] = Loandata['total_acc'].astype('int64')
Loandata['revol_bal'] = Loandata['revol_bal'].astype('int64')
Loandata['delinq_2yrs'] = Loandata['delinq_2yrs'].astype('int64')
Loandata.head()
-->>
loan_amnt term int_rate installment emp_length home_ownership annual_inc verification_status purpose dti ... fico_range_high inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc last_fico_range_high last_fico_range_low loan_condition
0 7500 0 13.75 255.43 0 3 22000.0 0 2 14.29 ... 664 0 7 0 4175 51.5 8 719 715 1
1 3500 0 10.28 113.39 0 4 20000.0 0 8 1.50 ... 684 0 17 0 1882 32.4 18 829 825 1
2 5750 0 7.43 178.69 10 0 125000.0 0 2 0.27 ... 794 0 10 0 2817 10.2 16 799 795 1
3 5000 0 7.43 155.38 6 4 40000.0 0 0 2.55 ... 774 2 4 0 2562 14.0 7 729 725 1
4 1200 0 11.54 39.60 0 4 20000.0 0 1 2.04 ... 664 2 3 0 1153 75.8 4 704 700 1
5 rows × 22 columns
Data cleaning is complete. The remaining 22 fields are used for model training. Save the clean data and read it back in.
Loandata.to_csv("C:/Users/Jason/Desktop/CleanLoanData.csv", index=False)
Loandata=pd.read_csv("C:/Users/Jason/Desktop/CleanLoanData.csv")
V. Predicting customer default with logistic regression
5.1
import sklearn.linear_model as lm
model = lm.LogisticRegression()
cols = Loandata.columns
train_cols = cols.drop('loan_condition')
x = Loandata[train_cols]
y = Loandata['loan_condition']
model.fit(x,y)
predict = model.predict(x)
predict[:10]
-->>array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
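0 means the loan was not repaid and 1 means it was repaid. Such a high predicted repayment rate looks suspicious. Let's look at the model's predicted probabilities.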
model.predict_proba(x)
-->>
array([[0.03725216, 0.96274784],
[0.00711186, 0.99288814],
[0.02119685, 0.97880315],
...,
[0.18928953, 0.81071047],
[0.04177887, 0.95822113],
[0.06569009, 0.93430991]])
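The second column is the predicted probability of class 1 (repaid). `predict` effectively labels a row 1 whenever that probability is above 0.5, which is why nearly every row comes out as 1. The sketch below applies two thresholds to the six rows shown above; the stricter value 0.95 is an arbitrary illustration, not a recommended setting.

```python
# Probability of class 1 (repaid) for the six rows printed above.
p_repaid = [0.96274784, 0.99288814, 0.97880315,
            0.81071047, 0.95822113, 0.93430991]

def classify(probs, threshold):
    """Label a loan 1 (approve) when its repayment probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

default_labels = classify(p_repaid, 0.5)
strict_labels = classify(p_repaid, 0.95)

print(default_labels)
print(strict_labels)
```

Raising the threshold is one way to trade approvals for fewer bad loans; the sections below take a different route and reweight the classes instead.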
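5.2 Wait, let's think about how to measure whether the model is any good. Consider the real setting: we lend money to people who can repay and earn a 10% profit on each loan. If one borrower in ten fails to repay, we lose 100% of that loan, so it takes ten correct predictions to make up for a single wrong one. Accuracy is clearly not a suitable metric for this model. To maximize profit, the model needs high recall on good loans while approving as few bad loans as possible, so we use two metrics: a higher TPR (True Positive Rate) and a lower FPR (False Positive Rate).
Actual  Predicted  Profit/Loss  Outcome
0       1          -1000        FP
1       1          100          TP
1       0          0            FN
0       0          0            TN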
fp_ = (predict ==1) & (Loandata['loan_condition']==0)
fp = len(predict[fp_])
print(fp)
tp_ = (predict ==1) & (Loandata['loan_condition']==1)
tp = len(predict[tp_])
print(tp)
fn_ = (predict == 0) & (Loandata["loan_condition"] == 1)
fn = len(predict[fn_])
print(fn)
tn_ = (predict ==0) & (Loandata['loan_condition']==0)
tn = len(predict[tn_])
print(tn)
-->>
4414
33118
962
1239
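Plugging the four counts just printed into the profit table makes the problem concrete. This sketch assumes the table's illustrative payoffs (+100 per true positive, -1000 per false positive, 0 otherwise):

```python
# Counts printed above, in the order fp, tp, fn, tn.
fp, tp, fn, tn = 4414, 33118, 962, 1239

GAIN_PER_TP = 100     # profit on a loan that is approved and repaid
LOSS_PER_FP = -1000   # loss on a loan that is approved and charged off

profit = tp * GAIN_PER_TP + fp * LOSS_PER_FP
accuracy = (tp + tn) / (fp + tp + fn + tn)

print(profit)
print(round(accuracy, 3))
```

Under these payoffs the model loses 1,102,200 even though its accuracy is above 86%, which is why accuracy alone is misleading here.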
5.3 Build a confusion matrix from 10-fold cross-validated predictions
import sklearn.model_selection as sm
model = lm.LogisticRegression()
predict = sm.cross_val_predict(model,x,y,cv=10)
predict = pd.Series(predict)
predict[:100]
-->>
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 1
20 1
21 1
22 1
23 1
24 1
25 1
26 1
27 1
28 1
29 1
..
70 1
71 1
72 1
73 1
74 1
75 1
76 1
77 1
78 1
79 1
80 1
81 1
82 0
83 0
84 1
85 1
86 1
87 1
88 1
89 1
90 1
91 1
92 1
93 1
94 1
95 0
96 1
97 1
98 1
99 1
Length: 100, dtype: int64
fp_ = (predict ==1) & (Loandata['loan_condition']==0)
fp = len(predict[fp_])
print(fp)
tp_ = (predict ==1) & (Loandata['loan_condition']==1)
tp = len(predict[tp_])
print(tp)
fn_ = (predict == 0) & (Loandata["loan_condition"] == 1)
fn = len(predict[fn_])
print(fn)
tn_ = (predict ==0) & (Loandata['loan_condition']==0)
tn = len(predict[tn_])
print(tn)
-->>
4420
33127
953
1233
tpr = tp/float((tp+fn))
fpr = fp/float((fp+tn))
print(tpr)
print(fpr)
-->>
0.9720363849765258
0.781885724394127
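The two rates are TPR = TP / (TP + FN) and FPR = FP / (FP + TN). Wrapping them in a small helper avoids repeating the arithmetic in each experiment; the function name `rates` is hypothetical, and the inputs are the counts printed above.

```python
def rates(tp, fp, tn, fn):
    """Return (TPR, FPR) from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)   # share of good loans that were approved
    fpr = fp / (fp + tn)   # share of bad loans that were approved
    return tpr, fpr

tpr, fpr = rates(tp=33127, fp=4420, tn=1233, fn=953)

print(tpr)
print(fpr)
```

With scikit-learn available, `sklearn.metrics.confusion_matrix(y, predict).ravel()` returns the counts in the order tn, fp, fn, tp and could replace the four boolean masks.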
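5.4 TPR and FPR are both very high, which is clearly not what we want. Because the two classes are heavily imbalanced, the next step is to adjust the class weights and train again, using scikit-learn's built-in class_weight='balanced' option.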
model = lm.LogisticRegression(class_weight='balanced')
predict = sm.cross_val_predict(model,x,y,cv=10)
predict = pd.Series(predict)
fp_ = (predict ==1) & (Loandata['loan_condition']==0)
fp = len(predict[fp_])
print(fp)
tp_ = (predict ==1) & (Loandata['loan_condition']==1)
tp = len(predict[tp_])
print(tp)
tn_ = (predict ==0) & (Loandata['loan_condition']==0)
tn = len(predict[tn_])
print(tn)
fn_ = (predict == 0) & (Loandata["loan_condition"] == 1)
fn = len(predict[fn_])
print(fn)
tpr = tp/float((tp+fn))
fpr = fp/float((fp+tn))
print(tpr)
print(fpr)
-->>
1517
26393
4136
7687
0.7744424882629108
0.26835308685653636
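With `class_weight='balanced'`, scikit-learn weights each class by n_samples / (n_classes * class_count). The sketch below applies that formula to the class totals implied by the counts above. Inside `cross_val_predict` the weights are computed on each training fold, so the real values differ slightly.

```python
# Class totals recovered from the printed counts: class 0 = fp + tn, class 1 = tp + fn.
counts = {0: 1517 + 4136, 1: 26393 + 7687}

n_samples = sum(counts.values())
n_classes = len(counts)

# The "balanced" heuristic: rarer classes receive proportionally larger weights.
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}
ratio = weights[0] / weights[1]

print(weights)
print(round(ratio, 2))
```

A bad loan ends up weighted about six times as heavily as a good one, which likely explains why the hand-picked 6:1 weights in 5.5 give almost the same result.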
5.5 Custom weights
penalty = {
    0: 6,
    1: 1
}
model = lm.LogisticRegression(class_weight=penalty)
predict = sm.cross_val_predict(model,x,y,cv=10)
predict = pd.Series(predict)
fp_ = (predict ==1) & (Loandata['loan_condition']==0)
fp = len(predict[fp_])
print(fp)
tp_ = (predict ==1) & (Loandata['loan_condition']==1)
tp = len(predict[tp_])
print(tp)
tn_ = (predict ==0) & (Loandata['loan_condition']==0)
tn = len(predict[tn_])
print(tn)
fn_ = (predict == 0) & (Loandata["loan_condition"] == 1)
fn = len(predict[fn_])
print(fn)
tpr = tp/float((tp+fn))
fpr = fp/float((fp+tn))
print(tpr)
print(fpr)
-->>
1521
26382
4132
7698
0.7741197183098592
0.2690606757473908