1. Project Introduction
- Background
Ctrip, a leading full-service travel company in China, serves more than 250 million members every day. From this massive site traffic, user-behavior data can be mined for latent insights, and customer churn rate is a key measure of business performance. The goal of this project is to understand user profiles and behavioral preferences in depth, find the best-performing algorithm, and uncover the key drivers of churn, in order to improve product design and the user experience.
- Evaluation metric
The score is recall at 97% precision: among all operating points with precision >= 0.97, take max(recall). A sketch of this rule follows the list below.
- Dataset
The dataset contains 49 fields (one week of data); the prediction target is the churn sample (label = 1). The fields fall into order-related, hotel-related, and user-behavior-related groups.
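The following minimal sketch illustrates the scoring rule, assuming hypothetical arrays y_true (labels) and y_prob (predicted churn probabilities); it is an illustration, not code from the project itself.

from sklearn.metrics import precision_recall_curve

def churn_score(y_true, y_prob, min_precision=0.97):
    # Best recall among all thresholds whose precision is at least min_precision
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    feasible = recall[precision >= min_precision]
    return feasible.max() if feasible.size else 0.0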
2. Project Workflow
2.1 Data Preprocessing
2.1.1 Target distribution
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics
df_orign = pd.read_csv('userlostprob_train.txt', sep='\t')
df = df_orign.copy()
df['label'].value_counts()
Output (omitted): the ratio of churned to retained users is roughly 2:5, so the classes are not severely imbalanced and no resampling is applied.
2.1.2 Handling outliers
- The describe() output shows that some price-related fields (delta_price1, delta_price2, lowestprice) contain negative values.
The counts and distributions of the negative values (plots omitted) show that delta_price1 and delta_price2 are each at least 25% negative, while lowestprice has just one negative record. Given the fields' roughly normal distributions and the business context, negative prices are replaced with 0.
# Replace negative prices with 0
df[['delta_price1','delta_price2','lowestprice']] = df[['delta_price1','delta_price2','lowestprice']].applymap(lambda x: 0 if x<0 else x)
- landhalfhours (login duration within the last 24 hours) should not exceed 24 hours, so values above 24 are capped at 24.
df.loc[df['landhalfhours']>24,['landhalfhours']] = 24
2.1.3 Handling missing values
Missing-value counts per field (chart omitted): nearly every field has missing values, and all affected fields are continuous. historyvisit_7ordernum is more than 80% missing, which leaves little worth analyzing, so it is dropped.
The remaining fields with missing values are filled with substitute values, as described below; a sketch of the missing-ratio check follows.
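A minimal sketch of that check and the drop (assumed here; the original listing omits this step):

# Share of missing values per field, descending
missing_ratio = df.isnull().mean().sort_values(ascending=False)
print(missing_ratio.head(10))
# Drop the field that is more than 80% missing
df = df.drop(columns=['historyvisit_7ordernum'])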
- Filling from correlated fields
Computing field correlations shows that:
commentnums_pre correlates strongly with novoters_pre;
commentnums_pre2 correlates strongly with novoters_pre2.
(df['commentnums_pre']/df['novoters_pre']).describe()
Output (omitted): the median of this ratio is about 65%, taken here as the review rate. A missing commentnums_pre is filled with novoters_pre * 0.65, and a missing novoters_pre with commentnums_pre / 0.65.
def fill_commentnum_novoter_pre(x):
    # Cross-fill one field from the other using the ~65% review rate
    if x.isnull()['commentnums_pre'] and x.notnull()['novoters_pre']:
        x['commentnums_pre'] = x['novoters_pre'] * 0.65
    elif x.notnull()['commentnums_pre'] and x.isnull()['novoters_pre']:
        x['novoters_pre'] = x['commentnums_pre'] / 0.65
    return x
df[['commentnums_pre','novoters_pre']] = df[['commentnums_pre','novoters_pre']].apply(fill_commentnum_novoter_pre,axis=1)
df[['commentnums_pre','novoters_pre']].info()
Output (omitted): part of the missing values in commentnums_pre and novoters_pre are now filled; the rest are filled with the median later.
Likewise, commentnums_pre2 and novoters_pre2 are cross-filled; their remaining missing values are filled with the mean.
def fill_commentnum_novoter_pre2(x):
    # Same cross-fill rule for the *_pre2 pair
    if x.isnull()['commentnums_pre2'] and x.notnull()['novoters_pre2']:
        x['commentnums_pre2'] = x['novoters_pre2'] * 0.65
    elif x.notnull()['commentnums_pre2'] and x.isnull()['novoters_pre2']:
        x['novoters_pre2'] = x['commentnums_pre2'] / 0.65
    return x
df[['commentnums_pre2','novoters_pre2']] = df[['commentnums_pre2','novoters_pre2']].apply(fill_commentnum_novoter_pre2,axis=1)
- Mean, median, and zero filling
# Mean: fields that are roughly normal, where extreme values have little influence
fill_mean = ['cancelrate','landhalfhours','visitnum_oneyear','starprefer','price_sensitive','lowestprice','customereval_pre2',
'uv_pre2','lowestprice_pre2','novoters_pre2','commentnums_pre2','businessrate_pre2','lowestprice_pre','hotelcr','cancelrate_pre']
df[fill_mean] = df[fill_mean].apply(lambda x:x.fillna(x.mean()))
# Median
fill_median = ['ordernum_oneyear','commentnums_pre','novoters_pre','uv_pre','ordercanncelednum','ordercanceledprecent',
               'lasthtlordergap','cityuvs','cityorders','lastpvgap','historyvisit_avghotelnum','businessrate_pre',
               'cr','cr_pre','novoters','hoteluv','ctrip_profits','customer_value_profit']
df[fill_median] = df[fill_median].apply(lambda x:x.fillna(x.median()))
# Zero fill
df[['deltaprice_pre2_t1','historyvisit_visit_detailpagenum']] = df[['deltaprice_pre2_t1','historyvisit_visit_detailpagenum']].apply(lambda x:x.fillna(0))
- Cluster-based filling
commentnums correlates strongly with novoters, cancelrate, and hoteluv, so we cluster on those three fields and fill missing commentnums with the per-cluster median.
# commentnums: review count of the current hotel
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
km = KMeans(n_clusters=4)
data = df.loc[:, ['commentnums','novoters','cancelrate','hoteluv']]
ss = StandardScaler()  # clustering uses distances, so standardize first
data[['novoters','cancelrate','hoteluv']] = pd.DataFrame(ss.fit_transform(data[['novoters','cancelrate','hoteluv']]))
km.fit(data.iloc[:, 1:])
data['label_pred'] = km.labels_
# metrics.calinski_harabasz_score(data.iloc[:, 1:], km.labels_)
# Fill missing commentnums with the median of its cluster
for k in range(4):
    mask = data['commentnums'].isnull() & (data['label_pred'] == k)
    data.loc[mask, 'commentnums'] = data.loc[data['label_pred'] == k, 'commentnums'].median()
df['commentnums'] = data['commentnums']
Similarly, missing avgprice is filled with the per-cluster mean of avgprice after clustering on starprefer and consuming_capacity.
# avgprice: clustered on starprefer and consuming_capacity
km = KMeans(n_clusters=5)
data = df.loc[:, ['avgprice','starprefer','consuming_capacity']]
ss = StandardScaler()  # standardize before clustering
data[['starprefer','consuming_capacity']] = pd.DataFrame(ss.fit_transform(data[['starprefer','consuming_capacity']]))
km.fit(data.iloc[:, 1:])
data['label_pred'] = km.labels_
# metrics.calinski_harabasz_score(data.iloc[:, 1:], km.labels_)
# Fill missing avgprice with the mean of its cluster
for k in range(5):
    mask = data['avgprice'].isnull() & (data['label_pred'] == k)
    data.loc[mask, 'avgprice'] = data.loc[data['label_pred'] == k, 'avgprice'].mean()
df['avgprice'] = data['avgprice']
delta_price1 is filled with per-cluster medians after clustering on consuming_capacity and avgprice.
# delta_price1: clustered on consuming_capacity and avgprice
km = KMeans(n_clusters=6)
data = df.loc[:, ['delta_price1','consuming_capacity','avgprice']]
ss = StandardScaler()  # standardize before clustering
data[['consuming_capacity','avgprice']] = pd.DataFrame(ss.fit_transform(data[['consuming_capacity','avgprice']]))
km.fit(data.iloc[:, 1:])
data['label_pred'] = km.labels_
# metrics.calinski_harabasz_score(data.iloc[:, 1:], km.labels_)
# Per-cluster medians of delta_price1, precomputed from the clustered data
fills = {0: 187, 1: 100, 2: 26, 3: 1269, 4: 323, 5: 573}
for k, v in fills.items():
    data.loc[data['delta_price1'].isnull() & (data['label_pred'] == k), 'delta_price1'] = v
df['delta_price1'] = data['delta_price1']
delta_price2 is filled the same way, with per-cluster medians after clustering on consuming_capacity and avgprice.
# delta_price2: clustered on avgprice and consuming_capacity
km = KMeans(n_clusters=5)
data = df.loc[:, ['delta_price2','avgprice','consuming_capacity']]
ss = StandardScaler()  # standardize before clustering
data[['avgprice','consuming_capacity']] = pd.DataFrame(ss.fit_transform(data[['avgprice','consuming_capacity']]))
km.fit(data.iloc[:, 1:])
data['label_pred'] = km.labels_
# metrics.calinski_harabasz_score(data.iloc[:, 1:], km.labels_)
# Per-cluster medians of delta_price2, precomputed from the clustered data
fills = {0: 91, 1: 419, 2: 18, 3: 205, 4: 1042}
for k, v in fills.items():
    data.loc[data['delta_price2'].isnull() & (data['label_pred'] == k), 'delta_price2'] = v
df['delta_price2'] = data['delta_price2']
- Bucketed filling
consuming_capacity correlates with starprefer, so missing consuming_capacity is filled based on starprefer bands.
Summary statistics for the two fields (output omitted):
Based on those statistics, starprefer is split into three bands, and missing consuming_capacity is filled with each band's mean.
fill1 = df.loc[df['starprefer'] < 60, 'consuming_capacity'].mean()
fill2 = df.loc[(df['starprefer'] >= 60) & (df['starprefer'] < 80), 'consuming_capacity'].mean()
fill3 = df.loc[df['starprefer'] >= 80, 'consuming_capacity'].mean()
def fill_consuming_capacity(x):
    # Fill missing consuming_capacity with the mean of its starprefer band
    if x.isnull()['consuming_capacity']:
        if x['starprefer'] < 60:
            x['consuming_capacity'] = fill1
        elif x['starprefer'] < 80:
            x['consuming_capacity'] = fill2
        else:
            x['consuming_capacity'] = fill3
    return x
df[['consuming_capacity','starprefer']] = df[['consuming_capacity','starprefer']].apply(fill_consuming_capacity, axis=1)
This completes the missing-value handling.
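A quick sanity check (not part of the original code) can confirm this:

# Fields that still contain missing values; ideally an empty result
remaining = df.isnull().sum()
print(remaining[remaining > 0])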
2.2 Feature Engineering
2.2.1 New fields
- Time fields
New fields: booking_gap (days between the visit date and the arrival date), week_day (day of week of arrival), and is_weekend (whether arrival falls on a weekend).
# d and arrival are in YYYY-MM-DD format
df[['d','arrival']] = df[['d','arrival']].apply(lambda x: pd.to_datetime(x, format='%Y-%m-%d'))
# Days between visit date and arrival date
df['booking_gap'] = ((df['arrival'] - df['d']) / np.timedelta64(1, 'D')).astype(int)
# Day of week of the arrival date (0 = Monday)
df['week_day'] = df['arrival'].map(lambda x: x.weekday())
# Whether the arrival date falls on a weekend
df['is_weekend'] = df['week_day'].map(lambda x: 1 if x in (5, 6) else 0)
- Flagging orders from the same user (using selected behavior fields)
Inspecting sid shows that about 95% of records come from returning users, and some users place several orders within the week. To support the later train/validation split, a user_tag field is added to identify orders that likely belong to the same user.
# Concatenate the user-level fields into one string and hash it as a per-user fingerprint
df['user_tag'] = (df['ordercanceledprecent'].map(str) + df['ordercanncelednum'].map(str) + df['ordernum_oneyear'].map(str) +
                  df['starprefer'].map(str) + df['consuming_capacity'].map(str) +
                  df['price_sensitive'].map(str) + df['customer_value_profit'].map(str) + df['ctrip_profits'].map(str) +
                  df['visitnum_oneyear'].map(str) + df['historyvisit_avghotelnum'].map(str) + df['businessrate_pre2'].map(str) +
                  df['historyvisit_visit_detailpagenum'].map(str) + df['delta_price2'].map(str) +
                  df['commentnums_pre2'].map(str) + df['novoters_pre2'].map(str) + df['customereval_pre2'].map(str) +
                  df['lowestprice_pre2'].map(str))
df['user_tag'] = df['user_tag'].apply(hash)
df['user_tag'].unique().shape
This returns 670226, i.e., about 670,226 distinct users placed orders during the week.
- User and hotel group fields
Selected user-related fields are clustered to form a user_group field, and selected hotel-related fields to form a hotel_group field.
user_group = ['ordercanceledprecent','ordercanncelednum','ordernum_oneyear',
'historyvisit_visit_detailpagenum','historyvisit_avghotelnum']
hotel_group = ['commentnums', 'novoters', 'lowestprice', 'hotelcr', 'hoteluv', 'cancelrate']
# Standardize before clustering
km_user = pd.DataFrame(df[user_group])
km_hotel = pd.DataFrame(df[hotel_group])
ss = StandardScaler()
km_user[user_group] = ss.fit_transform(df[user_group])
ss = StandardScaler()
km_hotel[hotel_group] = ss.fit_transform(df[hotel_group])
df['user_group'] = KMeans(n_clusters=3).fit_predict(km_user)
# score = metrics.calinski_harabasz_score(km_user, KMeans(n_clusters=3).fit(km_user).labels_)
# print('calinski_harabasz score: %f' % score)  # 3:218580.269018 4:218580.416497 5:218581.368953 6:218581.203569
df['hotel_group'] = KMeans(n_clusters=5).fit_predict(km_hotel)
# score = metrics.calinski_harabasz_score(km_hotel, KMeans(n_clusters=3).fit(km_hotel).labels_)
# print('calinski_harabasz score: %f' % score)  # 3:266853.481135 4:268442.314369 5:268796.468103 6:268796.707149
2.2.2 Discretizing continuous features
historyvisit_avghotelnum is mostly at or below 5, so it is binarized into <= 5 vs. > 5;
ordercanncelednum is likewise mostly at or below 5 and is binarized the same way;
sid == 1 marks a new visitor (encoded as 0); anything else is a returning user (encoded as 1);
avgprice, lowestprice, starprefer, consuming_capacity, and h are discretized into value ranges.
df['historyvisit_avghotelnum'] = df['historyvisit_avghotelnum'].apply(lambda x: 0 if x<=5 else 1)
df['ordercanncelednum'] = df['ordercanncelednum'].apply(lambda x: 0 if x<=5 else 1)
df['sid'] = df['sid'].apply(lambda x: 0 if x==1 else 1)
# Bucketed discretization
def discrete_avgprice(x):
    if x <= 200:
        return 0
    elif x <= 400:
        return 1
    elif x <= 600:
        return 2
    else:
        return 3

def discrete_lowestprice(x):
    if x <= 100:
        return 0
    elif x <= 200:
        return 1
    elif x <= 300:
        return 2
    else:
        return 3

def discrete_starprefer(x):
    if x == 0:
        return 0
    elif x <= 60:
        return 1
    elif x <= 80:
        return 2
    else:
        return 3

def discrete_consuming_capacity(x):
    if x < 0:
        return 0
    elif x <= 20:
        return 1
    elif x <= 40:
        return 2
    elif x <= 60:
        return 3
    else:
        return 4

def discrete_h(x):
    if 0 <= x < 6:    # early-morning visit
        return 0
    elif x < 12:      # morning visit
        return 1
    elif x < 18:      # afternoon visit
        return 2
    else:             # evening visit
        return 3
df['avgprice'] = df['avgprice'].map(discrete_avgprice)
df['lowestprice'] = df['lowestprice'].map(discrete_lowestprice)
df['starprefer'] = df['starprefer'].map(discrete_starprefer)
df['consuming_capacity'] = df['consuming_capacity'].map(discrete_consuming_capacity)
df['h'] = df['h'].map(discrete_h)
The resulting numeric categorical variables are one-hot encoded, here with OneHotEncoder:
discrete_field = ['historyvisit_avghotelnum','ordercanncelednum'
,'avgprice','lowestprice','starprefer','consuming_capacity','user_group',
'hotel_group','is_weekend','week_day','sid','h']
encode_df = pd.DataFrame(preprocessing.OneHotEncoder(handle_unknown='ignore').fit_transform(df[discrete_field]).toarray())
encode_df_new = pd.concat([df.drop(columns=discrete_field,axis=1),encode_df],axis=1)
2.2.3 Dropping fields
Two kinds of fields are dropped:
d, arrival, sampleid, and firstorder_bu, which carry no value for the analysis;
historyvisit_totalordernum, whose values equal ordernum_oneyear (ordernum_oneyear is kept);
decisionhabit_user, whose values largely match historyvisit_avghotelnum (historyvisit_avghotelnum is kept).
encode_df_new = encode_df_new.drop(columns=['d','arrival','sampleid','historyvisit_totalordernum','firstorder_bu','decisionhabit_user'],axis=1)
encode_df_new.shape
Excluding the target field label and the split key user_tag, 79 feature fields remain.
2.3 Model Training
2.3.1 Splitting training and validation sets
To keep the training and validation sets identically distributed and to avoid splitting one user's orders across both sets, the data are sorted by user_tag; the first 70% form the training set and the rest the validation set.
ss_df_new = encode_df_new
num = ss_df_new.shape[0]
# Sort by user_tag so one user's orders stay on the same side of the split
df_sort = ss_df_new.sort_values(by=['user_tag'], ascending=True)
train_df = df_sort.iloc[:int(num*0.7), :]
test_df = df_sort.iloc[int(num*0.7):, :]
train_y = train_df['label']
train_x = train_df.iloc[:, 1:]  # assumes label is the first column
test_y = test_df['label']
test_x = test_df.iloc[:, 1:]
2.3.2 Comparing model performance
All models are tuned with GridSearchCV grid search.
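As an illustration, here is a minimal sketch of one such tuning step; the grid values are placeholders, not the exact grids used in the project:

# Hedged example: tuning n_estimators for GBDT with 5-fold grid search
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [100, 200, 300]}
search = GridSearchCV(GradientBoostingClassifier(learning_rate=0.05, random_state=2019),
                      param_grid=param_grid, scoring='roc_auc', cv=5)
search.fit(train_x, train_y)
print(search.best_params_, search.best_score_)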
- GBDT
# Parameters tuned (in order):
# n_estimators
# max_depth and min_samples_split
# min_samples_split and min_samples_leaf
# max_features
# subsample
# learning_rate, re-tuning n_estimators to match
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
# Final tuned parameters
gbc = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
gbc.fit(train_x,train_y)
predict_train = gbc.predict_proba(train_x)[:,1]
predict_test = gbc.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('After tuning: max recall on the validation set at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))
Output:
0.15988300816140671 (recall)
0.8808204850185188 (AUC)
- xgboost
# Parameters tuned:
# n_estimators
# min_child_weight and max_depth
# gamma
# subsample and colsample_bytree
# learning_rate, re-tuning n_estimators to match
from xgboost.sklearn import XGBClassifier
xgbc = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1, scale_pos_weight=1, seed=27,
subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
xgbc.fit(train_x,train_y)
predict_train = xgbc.predict_proba(train_x)[:,1]
predict_test = xgbc.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))
Output:
0.7640022417597814 (recall)
0.9754939563495324 (AUC)
- Random forest
# Parameters tuned:
# n_estimators
# max_depth
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=300,max_depth=50)
rf.fit(train_x,train_y)
predict_train = rf.predict_proba(train_x)[:,1]
predict_test = rf.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))
Output:
0.666135416301797 (recall)
0.9616117844760916 (AUC)
- AdaBoost
from sklearn.ensemble import AdaBoostClassifier
bdt = AdaBoostClassifier(algorithm="SAMME", n_estimators=600, learning_rate=1)
bdt.fit(train_x,train_y)
predict_train = bdt.predict_proba(train_x)[:,1]
predict_test = bdt.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))
Output:
0.00019265123121650496 (recall)
0.7300356696791559 (AUC)
- DecisionTree
from sklearn.tree import DecisionTreeClassifier
bdt = DecisionTreeClassifier(random_state=0,max_depth=30, min_samples_split=70)
bdt.fit(train_x,train_y)
predict_train = bdt.predict_proba(train_x)[:,1]
predict_test = bdt.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))
Output:
0.0 (recall)
0.8340018840954033 (AUC)
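Summarizing the validation results above:

Model | Max recall @ precision >= 0.97 | AUC
GBDT | 0.1599 | 0.8808
xgboost | 0.7640 | 0.9755
Random forest | 0.6661 | 0.9616
AdaBoost | 0.0002 | 0.7300
Decision tree | 0.0000 | 0.8340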
From these results, xgboost performs best: at precision >= 0.97, its maximum recall reaches 76.4%.
2.3.3 Model stacking
Model stacking was then tried to see whether it could do better. Using the models above, 57 features were selected by feature importance. KFold 5-fold cross-validation then produced, for each base model, out-of-fold predictions on the training set and fold-averaged predictions on the test set; these served as the second-layer training and test data, on which a logistic regression was trained. The final result: at precision >= 0.97 the maximum recall reaches 78.3%, a modest improvement over the earlier 76.4%.
- Selecting important features
# Feature selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
def get_top_n_features(train_x, train_y):
    # Random forest
    rf_est = RandomForestClassifier(n_estimators=300, max_depth=50)
    rf_est.fit(train_x, train_y)
    feature_imp_sorted_rf = pd.DataFrame({'feature': train_x.columns,
        'importance': rf_est.feature_importances_}).sort_values('importance', ascending=False)
    # AdaBoost
    ada_est = AdaBoostClassifier(n_estimators=600, learning_rate=1)
    ada_est.fit(train_x, train_y)
    feature_imp_sorted_ada = pd.DataFrame({'feature': train_x.columns,
        'importance': ada_est.feature_importances_}).sort_values('importance', ascending=False)
    # GradientBoosting
    gb_est = GradientBoostingClassifier(loss='deviance', random_state=2019, learning_rate=0.05, n_estimators=200,
        min_samples_split=4, min_samples_leaf=1, max_depth=11, max_features='sqrt', subsample=0.8)
    gb_est.fit(train_x, train_y)
    feature_imp_sorted_gb = pd.DataFrame({'feature': train_x.columns,
        'importance': gb_est.feature_importances_}).sort_values('importance', ascending=False)
    # DecisionTree
    dt_est = DecisionTreeClassifier(random_state=0, min_samples_split=70, max_depth=30)
    dt_est.fit(train_x, train_y)
    feature_imp_sorted_dt = pd.DataFrame({'feature': train_x.columns,
        'importance': dt_est.feature_importances_}).sort_values('importance', ascending=False)
    # XGBoost
    xg_est = XGBClassifier(learning_rate=0.05, objective='binary:logistic', nthread=1, scale_pos_weight=1, seed=27,
        subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha=0, reg_lambda=1, max_depth=38, min_child_weight=1, n_estimators=210)
    xg_est.fit(train_x, train_y)
    feature_imp_sorted_xg = pd.DataFrame({'feature': train_x.columns,
        'importance': xg_est.feature_importances_}).sort_values('importance', ascending=False)
    return feature_imp_sorted_rf, feature_imp_sorted_ada, feature_imp_sorted_gb, feature_imp_sorted_dt, feature_imp_sorted_xg
feature_imp_sorted_rf,feature_imp_sorted_ada,feature_imp_sorted_gb,feature_imp_sorted_dt,feature_imp_sorted_xg = get_top_n_features(train_x, train_y)
top_n_features = 35
features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']
features_top_n_gb = feature_imp_sorted_gb.head(top_n_features)['feature']
features_top_n_dt = feature_imp_sorted_dt.head(top_n_features)['feature']
features_top_n_xg = feature_imp_sorted_xg.head(top_n_features)['feature']
features_top_n = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_gb, features_top_n_dt,features_top_n_xg],
ignore_index=True).drop_duplicates()
features_importance = pd.concat([feature_imp_sorted_rf, feature_imp_sorted_ada,
feature_imp_sorted_gb, feature_imp_sorted_dt,feature_imp_sorted_xg],ignore_index=True)
train_x_new = pd.DataFrame(train_x[features_top_n])
test_x_new = pd.DataFrame(test_x[features_top_n])
features_top_n
In the end, 57 of the 79 features are selected.
- First-layer training
# First layer
from sklearn.model_selection import KFold
ntrain = train_x_new.shape[0]
ntest = test_x_new.shape[0]
kf = KFold(n_splits=5, shuffle=False)  # random_state is only valid when shuffle=True
def get_out_fold(clf, x_train, y_train, x_test):
    # Out-of-fold class predictions and probabilities for the training set,
    # plus fold-averaged predictions for the test set
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((5, ntest))
    oof_train_prob = np.zeros((ntrain,))
    oof_test_prob = np.zeros((ntest,))
    oof_test_skf_prob = np.empty((5, ntest))
    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]
        clf.fit(x_tr, y_tr)
        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)
        oof_train_prob[test_index] = clf.predict_proba(x_te)[:, 1]
        oof_test_skf_prob[i, :] = clf.predict_proba(x_test)[:, 1]
        print('Fold {}'.format(i))
        print('Train indices:', train_index)
        print('Test indices:', test_index)
    oof_test[:] = oof_test_skf.mean(axis=0)
    oof_test_prob[:] = oof_test_skf_prob.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1), oof_train_prob.reshape(-1, 1), oof_test_prob.reshape(-1, 1)
rf = RandomForestClassifier(n_estimators=300,max_depth=50)
ada = AdaBoostClassifier(n_estimators=600,learning_rate=1)
gb = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
dt = DecisionTreeClassifier(random_state=0,min_samples_split=70,max_depth=30)
x_train = train_x_new.values
x_test = test_x_new.values
y_train =train_y.values
rf_oof_train, rf_oof_test,rf_oof_train_prob, rf_oof_test_prob = get_out_fold(rf, x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test,ada_oof_train_prob, ada_oof_test_prob = get_out_fold(ada, x_train, y_train, x_test) # AdaBoost
gb_oof_train, gb_oof_test,gb_oof_train_prob, gb_oof_test_prob = get_out_fold(gb, x_train, y_train, x_test) # Gradient Boost
dt_oof_train, dt_oof_test,dt_oof_train_prob, dt_oof_test_prob = get_out_fold(dt, x_train, y_train, x_test) # Decision Tree
xgbc = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1, scale_pos_weight=1, seed=27,
subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
xgbc_oof_train, xgbc_oof_test,xgbc_oof_train_prob, xgbc_oof_test_prob = get_out_fold(xgbc, x_train, y_train, x_test) # XGBClassifier
print("Training is complete")
- Second-layer training
The first-layer outputs serve as the second-layer training and test sets.
# Build the second-layer training and test sets from the first-layer probabilities
train_x2_prob = pd.DataFrame(np.concatenate((rf_oof_train_prob, ada_oof_train_prob, gb_oof_train_prob, dt_oof_train_prob), axis=1),columns=['rf_prob','ada_prob','gb_prob','dt_prob'])
test_x2_prob = pd.DataFrame(np.concatenate((rf_oof_test_prob, ada_oof_test_prob, gb_oof_test_prob, dt_oof_test_prob), axis=1),columns=['rf_prob','ada_prob','gb_prob','dt_prob'])
# Train the second-layer logistic regression
from sklearn.linear_model import LogisticRegression
# Tuning (commented out):
# param_lr = {'penalty': ['l1','l2'], 'C': [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]}
# lr_est = LogisticRegression()
# lr_search = GridSearchCV(estimator=lr_est, param_grid=param_lr, scoring='roc_auc', cv=5)
# lr_search.fit(train_x2_prob, train_y)
# print('Mean score per parameter value: {}'.format(lr_search.cv_results_['mean_test_score']))
# print('Best parameters: {}'.format(lr_search.best_params_))
# print('Best roc_auc score: {}'.format(lr_search.best_score_))
# Tuning result: C=0.1, penalty='l2'
lr = LogisticRegression(C=0.1,penalty='l2')
lr.fit(train_x2_prob,train_y)
predict_train = lr.predict_proba(train_x2_prob)[:,1]
predict_test = lr.predict_proba(test_x2_prob)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))
Output:
0.7832498511331395 (recall)
0.9763271659779821 (AUC)
Stacking thus lifts the maximum recall from 76.4% to 78.3%.