Ctrip User Churn Early-Warning Project

1. Project Introduction

  • Background
    As China's leading full-service travel company, Ctrip serves more than 250 million members with a complete range of travel services every day. This enormous volume of site traffic carries behavioral data that can be mined for latent information, and customer churn rate is one of the key metrics for judging business performance. The goal of this project is to understand user profiles and behavioral preferences in depth, find the best-performing algorithm, and surface the key factors behind user churn, in order to improve product design and the user experience.
  • Evaluation metric
    The score is the recall at 97% precision: among all points on the precision-recall curve with precision >= 0.97, take max(recall). (A code sketch of this metric follows the dataset description below.)
  • Dataset
    The dataset contains 49 features covering one week of data; the target samples to predict are churned users (label = 1). The features fall into order-related, hotel-related, and customer-behavior-related groups.


    [Figure: feature categories]
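For reference, the competition metric can be computed from sklearn's precision_recall_curve; below is a minimal sketch (the helper name recall_at_precision is mine, not from the original code):

from sklearn import metrics

def recall_at_precision(y_true, y_prob, min_precision=0.97):
    # Largest recall among PR-curve points whose precision is >= min_precision
    precision, recall, _ = metrics.precision_recall_curve(y_true, y_prob)
    mask = precision >= min_precision
    return recall[mask].max() if mask.any() else 0.0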

2. Project Workflow

2.1 Data Processing

2.1.1 Target Variable Distribution

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics
df_orign = pd.read_csv('userlostprob_train.txt', sep='\t')
df = df_orign.copy()
df['label'].value_counts()

Output:


[Figure: distribution of the label field]

Churned and non-churned users occur at roughly a 2:5 ratio, which is not a severe class imbalance, so no resampling is done here.

2.1.2 Handling Outliers

  • The describe() output shows that some price-related fields (delta_price1, delta_price2, lowestprice) contain negative values.


    [Figure: negative values in the price fields]

    The counts of negative records are as follows:


    [Figure: number of negative records per price field]

    The distributions of the three fields:

    [Figures: distributions of delta_price1, delta_price2, and lowestprice]

The plots show that negative values account for 25% or more of delta_price1 and delta_price2, while lowestprice has only a single negative record. Given these fields' roughly normal distributions and the business context, the negative prices are replaced with 0.

df[['delta_price1','delta_price2','lowestprice']] = df[['delta_price1','delta_price2','lowestprice']].applymap(lambda x: 0 if x<0 else x)
  • landhalfhours, the login duration within a 24-hour window, should not exceed 24 hours, so values above 24 are capped at 24.
df.loc[df['landhalfhours']>24,['landhalfhours']] = 24

2.1.3 Handling Missing Values

Missing values by field:


[Figure: missing-value counts per field]

As the chart shows, almost every field has missing values, and all of the affected fields are continuous. historyvisit_7ordernum is more than 80% missing and no longer worth analyzing, so it is dropped (see the sketch below).
The remaining fields with missing values are filled using other values.
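The post does not show the corresponding code; here is a minimal sketch of the missing-ratio check and the drop (assuming the df defined above):

# Fraction of missing values per column, highest first
na_ratio = df.isnull().mean().sort_values(ascending=False)
print(na_ratio.head(10))
# historyvisit_7ordernum is more than 80% missing, so drop it
df = df.drop(columns=['historyvisit_7ordernum'])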

  • Filling from correlated fields
    Computing pairwise correlations shows that:
    commentnums_pre is strongly correlated with novoters_pre;
    commentnums_pre2 is strongly correlated with novoters_pre2. (A quick check is sketched below.)
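The correlation computation itself is not shown in the post; a one-line check of the claim could look like this (a sketch against the df above):

df[['commentnums_pre','novoters_pre','commentnums_pre2','novoters_pre2']].corr()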
(df['commentnums_pre']/df['novoters_pre']).describe()

Output:


[Figure: describe() output for the ratio commentnums_pre / novoters_pre]

The median of this ratio, 65%, is taken as the review rate: a missing commentnums_pre is filled with novoters_pre * 65%, and a missing novoters_pre with commentnums_pre / 65%.

def fill_commentnum_novoter_pre(x):
    # If exactly one of the pair is missing, fill it from the other, assuming a 65% review rate
    if (x.isnull()['commentnums_pre'])&(x.notnull()['novoters_pre']):
        x['commentnums_pre'] = x['novoters_pre']*0.65
    elif (x.notnull()['commentnums_pre'])&(x.isnull()['novoters_pre']):
        x['novoters_pre'] = x['commentnums_pre']/0.65
    else:
        return x
    return x
df[['commentnums_pre','novoters_pre']] = df[['commentnums_pre','novoters_pre']].apply(fill_commentnum_novoter_pre,axis=1)
df[['commentnums_pre','novoters_pre']].info()

Output:


[Figure: missing-value counts after filling]

This fills part of the missing values in commentnums_pre and novoters_pre; the rest are filled with the median.

The same method fills commentnums_pre2 and novoters_pre2; their remaining missing values are filled with the mean.

def fill_commentnum_novoter_pre2(x):
    # Same idea for the pre2 pair, again assuming a 65% review rate
    if (x.isnull()['commentnums_pre2'])&(x.notnull()['novoters_pre2']):
        x['commentnums_pre2'] = x['novoters_pre2']*0.65
    elif (x.notnull()['commentnums_pre2'])&(x.isnull()['novoters_pre2']):
        x['novoters_pre2'] = x['commentnums_pre2']/0.65
    else:
        return x
    return x
df[['commentnums_pre2','novoters_pre2']] = df[['commentnums_pre2','novoters_pre2']].apply(fill_commentnum_novoter_pre2,axis=1)
  • Filling with the mean, median, or 0
# Mean (fields that are roughly normally distributed and not much affected by extreme values)
fill_mean = ['cancelrate','landhalfhours','visitnum_oneyear','starprefer','price_sensitive','lowestprice','customereval_pre2',
            'uv_pre2','lowestprice_pre2','novoters_pre2','commentnums_pre2','businessrate_pre2','lowestprice_pre','hotelcr','cancelrate_pre']
df[fill_mean] = df[fill_mean].apply(lambda x:x.fillna(x.mean()))
# Median
fill_median = ['ordernum_oneyear','commentnums_pre','novoters_pre','uv_pre','ordercanncelednum','ordercanceledprecent',
               'lasthtlordergap','cityuvs','cityorders','lastpvgap','historyvisit_avghotelnum','businessrate_pre','cr','cr_pre',
               'novoters','hoteluv','ctrip_profits','customer_value_profit']
df[fill_median] = df[fill_median].apply(lambda x:x.fillna(x.median()))
# Fill with 0
df[['deltaprice_pre2_t1','historyvisit_visit_detailpagenum']] = df[['deltaprice_pre2_t1','historyvisit_visit_detailpagenum']].apply(lambda x:x.fillna(0))
  • Cluster-based filling
    commentnums correlates fairly strongly with novoters, cancelrate, and hoteluv, so commentnums is filled with per-cluster medians after clustering on those three fields.
# commentnums: number of reviews for the current hotel
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
km = KMeans(n_clusters=4)
data = df.loc[:,['commentnums','novoters','cancelrate','hoteluv']]
ss = StandardScaler()  # clustering is distance-based, so standardize first
data[['novoters','cancelrate','hoteluv']] = pd.DataFrame(ss.fit_transform(data[['novoters','cancelrate','hoteluv']]))

km.fit(data.iloc[:,1:])
data['label_pred'] = km.labels_
#metrics.calinski_harabasz_score(data.iloc[:,1:],km.labels_)
# Fill missing commentnums with the median of the cluster the row falls in
for k in range(4):
    data.loc[(data['commentnums'].isnull())&(data['label_pred']==k),'commentnums'] = \
        data.loc[data['label_pred']==k,'commentnums'].median()
df['commentnums'] = data['commentnums']

Similarly, avgprice is filled with the per-cluster mean of avgprice after clustering on starprefer and consuming_capacity.

# avgprice: clustered on starprefer and consuming_capacity
km = KMeans(n_clusters=5)
data = df.loc[:,['avgprice','starprefer','consuming_capacity']]
ss = StandardScaler()  # clustering is distance-based, so standardize first
data[['starprefer','consuming_capacity']] = pd.DataFrame(ss.fit_transform(data[['starprefer','consuming_capacity']]))
km.fit(data.iloc[:,1:])
data['label_pred'] = km.labels_
#metrics.calinski_harabasz_score(data.iloc[:,1:],km.labels_)
# Fill missing avgprice with the mean of the cluster the row falls in
for k in range(5):
    data.loc[(data['avgprice'].isnull())&(data['label_pred']==k),'avgprice'] = \
        data.loc[data['label_pred']==k,'avgprice'].mean()
df['avgprice'] = data['avgprice']

delta_price1 is filled with per-cluster medians after clustering on consuming_capacity and avgprice.

# delta_price1: clustered on consuming_capacity and avgprice
km = KMeans(n_clusters=6)
data = df.loc[:,['delta_price1','consuming_capacity','avgprice']]
ss = StandardScaler()  # clustering is distance-based, so standardize first
data[['consuming_capacity','avgprice']] = pd.DataFrame(ss.fit_transform(data[['consuming_capacity','avgprice']]))

km.fit(data.iloc[:,1:])
data['label_pred'] = km.labels_
#metrics.calinski_harabasz_score(data.iloc[:,1:],km.labels_)
# Per-cluster medians of delta_price1, precomputed from this clustering run
fill_values = {0: 187, 1: 100, 2: 26, 3: 1269, 4: 323, 5: 573}
for k, v in fill_values.items():
    data.loc[(data['delta_price1'].isnull())&(data['label_pred']==k),'delta_price1'] = v
df['delta_price1'] = data['delta_price1']

delta_price2 is filled the same way, using per-cluster medians of delta_price2 after clustering on consuming_capacity and avgprice.

# delta_price2: clustered on consuming_capacity and avgprice
km = KMeans(n_clusters=5)
data = df.loc[:,['delta_price2','avgprice','consuming_capacity']]
ss = StandardScaler()  # clustering is distance-based, so standardize first
data[['avgprice','consuming_capacity']] = pd.DataFrame(ss.fit_transform(data[['avgprice','consuming_capacity']]))

km.fit(data.iloc[:,1:])
data['label_pred'] = km.labels_
#metrics.calinski_harabasz_score(data.iloc[:,1:],km.labels_)
# Per-cluster medians of delta_price2, precomputed from this clustering run
fill_values = {0: 91, 1: 419, 2: 18, 3: 205, 4: 1042}
for k, v in fill_values.items():
    data.loc[(data['delta_price2'].isnull())&(data['label_pred']==k),'delta_price2'] = v
df['delta_price2'] = data['delta_price2']
  • Piecewise filling
    consuming_capacity correlates with starprefer, so consuming_capacity is filled by binning starprefer.
    Summary statistics of the two fields:

    [Figures: describe() output and distributions for starprefer and consuming_capacity]

    Based on these statistics, starprefer is split into three bins, and the mean of consuming_capacity within each bin fills that bin's missing values.

fill1 = df.loc[df['starprefer']<60,'consuming_capacity'].mean()
fill2 = df.loc[(df['starprefer']>=60)&(df['starprefer']<80),'consuming_capacity'].mean()
fill3 = df.loc[df['starprefer']>=80,'consuming_capacity'].mean()
def fill_consuming_capacity(x):
    # Fill a missing consuming_capacity with the mean of its starprefer bin
    if x.isnull()['consuming_capacity']:
        if x['starprefer']<60:
            x['consuming_capacity'] = fill1
        elif x['starprefer']<80:
            x['consuming_capacity'] = fill2
        else:
            x['consuming_capacity'] = fill3
    return x
df[['consuming_capacity','starprefer']] = df[['consuming_capacity','starprefer']].apply(fill_consuming_capacity,axis=1)

This completes the missing-value handling.

2.2 Feature Engineering

2.2.1 New Features

  • Time features
    New fields: booking_gap (days between the visit date and the arrival date), week_day (day of week of the arrival date), and is_weekend (whether the arrival date falls on a weekend).
# dates are formatted year-month-day
df[['d','arrival']] = df[['d','arrival']].apply(lambda x:pd.to_datetime(x,format='%Y-%m-%d'))
# days between the visit date and the arrival date
df['booking_gap'] = ((df['arrival']-df['d'])/np.timedelta64(1,'D')).astype(int)
# day of week of the arrival date
df['week_day'] = df['arrival'].map(lambda x:x.weekday())
# whether the arrival date is a weekend
df['is_weekend'] = df['week_day'].map(lambda x: 1 if x in (5,6) else 0)
  • Same-user tag (built from selected customer-behavior features)
    Inspecting sid shows that 95% of visits come from returning users and very few are new; within the week, one user may place several orders. To make the later train/validation split cleaner, a user_tag field is added to mark orders belonging to the same user.
# Concatenate selected behavior fields into one string per row, then hash it
df['user_tag'] = (df['ordercanceledprecent'].map(str) + df['ordercanncelednum'].map(str) +
                  df['ordernum_oneyear'].map(str) + df['starprefer'].map(str) +
                  df['consuming_capacity'].map(str) + df['price_sensitive'].map(str) +
                  df['customer_value_profit'].map(str) + df['ctrip_profits'].map(str) +
                  df['visitnum_oneyear'].map(str) + df['historyvisit_avghotelnum'].map(str) +
                  df['businessrate_pre2'].map(str) + df['historyvisit_visit_detailpagenum'].map(str) +
                  df['delta_price2'].map(str) + df['commentnums_pre2'].map(str) +
                  df['novoters_pre2'].map(str) + df['customereval_pre2'].map(str) +
                  df['lowestprice_pre2'].map(str))
df['user_tag'] = df['user_tag'].apply(hash)
df['user_tag'].unique().shape

This returns 670226: in practice, 670,226 distinct users placed orders during the week.

  • User and hotel cluster features
    Selected user-related fields are clustered to create user_group, and selected hotel-related fields are clustered to create hotel_group.
user_group = ['ordercanceledprecent','ordercanncelednum','ordernum_oneyear',
             'historyvisit_visit_detailpagenum','historyvisit_avghotelnum']
hotel_group = ['commentnums', 'novoters', 'lowestprice', 'hotelcr', 'hoteluv', 'cancelrate']
# standardize before clustering
km_user = pd.DataFrame(df[user_group])
km_hotel = pd.DataFrame(df[hotel_group])
ss = StandardScaler()
for i in range(km_user.shape[1]):
    km_user[user_group[i]] = ss.fit_transform(df[user_group[i]].values.reshape(-1, 1)).ravel()
ss = StandardScaler()
for i in range(km_hotel.shape[1]):
    km_hotel[hotel_group[i]] = ss.fit_transform(df[hotel_group[i]].values.reshape(-1, 1)).ravel()
df['user_group'] = KMeans(n_clusters=3).fit_predict(km_user)
# score = metrics.calinski_harabasz_score(km_user,KMeans(n_clusters=3).fit(km_user).labels_)
# print('calinski_harabasz score: %f'%(score)) #3:218580.269018  4:218580.416497 5:218581.368953 6:218581.203569
df['hotel_group'] = KMeans(n_clusters=5).fit_predict(km_hotel)
# score = metrics.calinski_harabasz_score(km_hotel,KMeans(n_clusters=3).fit(km_hotel).labels_)
# print('calinski_harabasz score: %f'%(score))  #3:266853.481135  4:268442.314369 5:268796.468103 6:268796.707149

2.2.2 Discretizing Continuous Features

historyvisit_avghotelnum is mostly 5 or below, so it is binarized into <=5 vs. >5;
ordercanncelednum is likewise mostly 5 or below and is binarized the same way;
sid == 1 marks a first visit and is set to 0; anything else marks a returning user and is set to 1.
avgprice, lowestprice, starprefer, consuming_capacity, and h are discretized into value ranges.

df['historyvisit_avghotelnum'] = df['historyvisit_avghotelnum'].apply(lambda x: 0 if x<=5 else 1)
df['ordercanncelednum'] = df['ordercanncelednum'].apply(lambda x: 0 if x<=5 else 1)
df['sid'] = df['sid'].apply(lambda x: 0 if x==1 else 1)  
# piecewise discretization
def discrete_avgprice(x):
    if x<=200:
        return 0
    elif x<=400:
        return 1
    elif x<=600:
        return 2
    else:
        return 3
    
def discrete_lowestprice(x):
    if x<=100:
        return 0
    elif x<=200:
        return 1
    elif x<=300:
        return 2
    else:
        return 3
    
def discrete_starprefer(x):
    if x==0:
        return 0
    elif x<=60:
        return 1
    elif x<=80:
        return 2
    else:
        return 3
    
def discrete_consuming_capacity(x):
    if x<0:
        return 0
    elif x<=20:
        return 1
    elif x<=40:
        return 2
    elif x<=60:
        return 3
    else:
        return 4
    
def discrete_h(x):
    if x>=0 and x<6:  # early-morning visit
        return 0
    elif x<12:  # morning visit
        return 1
    elif x<18:  # afternoon visit
        return 2
    else:
        return 3  # evening visit
    
df['avgprice'] = df['avgprice'].map(discrete_avgprice)
df['lowestprice'] = df['lowestprice'].map(discrete_lowestprice)
df['starprefer'] = df['starprefer'].map(discrete_starprefer)
df['consuming_capacity'] = df['consuming_capacity'].map(discrete_consuming_capacity)
df['h'] = df['h'].map(discrete_h)

The resulting numeric categorical variables are then one-hot encoded, here with OneHotEncoder.

discrete_field = ['historyvisit_avghotelnum','ordercanncelednum'
                  ,'avgprice','lowestprice','starprefer','consuming_capacity','user_group',
                 'hotel_group','is_weekend','week_day','sid','h']
encode_df = pd.DataFrame(preprocessing.OneHotEncoder(handle_unknown='ignore').fit_transform(df[discrete_field]).toarray())
encode_df_new = pd.concat([df.drop(columns=discrete_field,axis=1),encode_df],axis=1)

2.2.3 Dropping Fields

Two kinds of fields are removed:
d, arrival, sampleid, and firstorder_bu, which carry nothing useful for the analysis;
historyvisit_totalordernum and ordernum_oneyear hold equal values, so ordernum_oneyear is kept and historyvisit_totalordernum dropped;
decisionhabit_user largely agrees with historyvisit_avghotelnum, so historyvisit_avghotelnum is kept and decisionhabit_user dropped. (A quick check is sketched below.)
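The overlap check itself is not shown in the post; a minimal sketch against the raw df_orign (before any of the transformations above):

# Share of rows where the two order-count fields are exactly equal (NaNs compare unequal)
print((df_orign['historyvisit_totalordernum'] == df_orign['ordernum_oneyear']).mean())
# Agreement between the two decision-habit fields
print(df_orign[['decisionhabit_user','historyvisit_avghotelnum']].corr())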

encode_df_new = encode_df_new.drop(columns=['d','arrival','sampleid','historyvisit_totalordernum','firstorder_bu','decisionhabit_user'],axis=1)
encode_df_new.shape

After excluding the target field label and the split key user_tag, 79 features remain.

2.3 Model Training

2.3.1 Splitting Training and Validation Sets

To keep the training and validation sets independent (so that one user's orders never straddle the split), the data is sorted by user_tag; the first 70% becomes the training set and the remainder the validation set.

ss_df_new = encode_df_new
num = ss_df_new.shape[0]
df_sort = ss_df_new.sort_values(by=['user_tag'],ascending=True)
train_df = df_sort.iloc[:int(num*0.7),:]
test_df = df_sort.iloc[int(num*0.7):,:]
train_y = train_df['label']
train_x = train_df.drop(columns=['label','user_tag'])  # drop the target and the split key from the features
test_y = test_df['label']
test_x = test_df.drop(columns=['label','user_tag'])

2.3.2 Comparing Models

All models are tuned with GridSearchCV grid search; an illustrative round is sketched below.
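The tuning runs themselves are not shown in the post; here is a minimal sketch of one round (the parameter grid is illustrative, not the author's):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Fix the other hyperparameters and search one or two at a time
param_grid = {'n_estimators': [100, 150, 200, 250]}
search = GridSearchCV(GradientBoostingClassifier(learning_rate=0.05, random_state=2019),
                      param_grid=param_grid, scoring='roc_auc', cv=5)
search.fit(train_x, train_y)
print(search.best_params_, search.best_score_)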

  • GBDT
# Parameters tuned, in order:
# n_estimators
# max_depth and min_samples_split
# min_samples_split and min_samples_leaf
# max_features
# subsample
# learning_rate, re-tuning n_estimators alongside it
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
# Final tuned parameters
gbc = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
                        min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
gbc.fit(train_x,train_y)
predict_train = gbc.predict_proba(train_x)[:,1]
predict_test = gbc.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('After tuning, max recall at precision>=0.97 on the validation set:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))

Output:
0.15988300816140671
0.8808204850185188

  • XGBoost
# Parameters tuned, in order:
# number of estimators n_estimators
# min_child_weight and max_depth
# gamma
# subsample and colsample_bytree
# learning_rate, re-tuning n_estimators alongside it


from xgboost.sklearn import XGBClassifier
xgbc = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1,  scale_pos_weight=1, seed=27,
                    subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
xgbc.fit(train_x,train_y)
predict_train = xgbc.predict_proba(train_x)[:,1]
predict_test = xgbc.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))

Output:
0.7640022417597814
0.9754939563495324

  • Random Forest
# Parameters tuned:
# n_estimators
# max_depth
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=300,max_depth=50)
rf.fit(train_x,train_y)
predict_train = rf.predict_proba(train_x)[:,1]
predict_test = rf.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))

Output:
0.666135416301797
0.9616117844760916

  • AdaBoost
from sklearn.ensemble import AdaBoostClassifier
bdt = AdaBoostClassifier(algorithm="SAMME",
                         n_estimators=600, learning_rate=1)
bdt.fit(train_x,train_y)
predict_train = bdt.predict_proba(train_x)[:,1]
predict_test = bdt.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))

Output:
0.00019265123121650496
0.7300356696791559

  • DecisionTree
from sklearn.tree import DecisionTreeClassifier
bdt = DecisionTreeClassifier(random_state=0,max_depth=30, min_samples_split=70)
bdt.fit(train_x,train_y)
predict_train = bdt.predict_proba(train_x)[:,1]
predict_test = bdt.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))

Output:
0.0
0.8340018840954033

By these results, XGBoost trains best: at precision >= 0.97, its maximum recall reaches 76.4%.

2.3.3 Model Stacking

Model stacking was also tried to see whether it could do better. First, using the models above, 57 features were selected by feature importance. Then 5-fold cross-validation (KFold) produced each model's out-of-fold validation and test predictions, which became the second layer's training and test data, and a logistic regression was trained on these five features. The final result: at precision >= 0.97, the maximum recall reaches 78.3%, a modest improvement over the earlier 76.4%.

  • Selecting important features
# feature selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier

def get_top_n_features(train_x, train_y):

    # random forest
    rf_est = RandomForestClassifier(n_estimators=300,max_depth=50)
    rf_est.fit(train_x, train_y)
    feature_imp_sorted_rf = pd.DataFrame({'feature': train_x.columns,
                                          'importance': rf_est.feature_importances_}).sort_values('importance', ascending=False)

    # AdaBoost
    ada_est =AdaBoostClassifier(n_estimators=600,learning_rate=1)
    ada_est.fit(train_x, train_y)
    feature_imp_sorted_ada = pd.DataFrame({'feature': train_x.columns,
                                           'importance': ada_est.feature_importances_}).sort_values('importance', ascending=False)

    
    # GradientBoosting
    gb_est = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
                        min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
    gb_est.fit(train_x, train_y)
    feature_imp_sorted_gb = pd.DataFrame({'feature':train_x.columns,
                                          'importance': gb_est.feature_importances_}).sort_values('importance', ascending=False)

    # DecisionTree
    dt_est = DecisionTreeClassifier(random_state=0,min_samples_split=70,max_depth=30)
    dt_est.fit(train_x, train_y)
    feature_imp_sorted_dt = pd.DataFrame({'feature':train_x.columns,
                                          'importance': dt_est.feature_importances_}).sort_values('importance', ascending=False)
    
    # xgbc
    xg_est = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1,  scale_pos_weight=1, seed=27,
                    subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
    xg_est.fit(train_x, train_y)
    feature_imp_sorted_xg = pd.DataFrame({'feature':train_x.columns,
                                          'importance': xg_est.feature_importances_}).sort_values('importance', ascending=False)

    
    return feature_imp_sorted_rf,feature_imp_sorted_ada,feature_imp_sorted_gb,feature_imp_sorted_dt,feature_imp_sorted_xg

feature_imp_sorted_rf,feature_imp_sorted_ada,feature_imp_sorted_gb,feature_imp_sorted_dt,feature_imp_sorted_xg = get_top_n_features(train_x, train_y)
top_n_features = 35
features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']
features_top_n_gb = feature_imp_sorted_gb.head(top_n_features)['feature']
features_top_n_dt = feature_imp_sorted_dt.head(top_n_features)['feature']
features_top_n_xg = feature_imp_sorted_xg.head(top_n_features)['feature']
features_top_n = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_gb, features_top_n_dt,features_top_n_xg], 
                               ignore_index=True).drop_duplicates()
    
features_importance = pd.concat([feature_imp_sorted_rf, feature_imp_sorted_ada, 
                                   feature_imp_sorted_gb, feature_imp_sorted_dt,feature_imp_sorted_xg],ignore_index=True)
train_x_new = pd.DataFrame(train_x[features_top_n])
test_x_new = pd.DataFrame(test_x[features_top_n])
features_top_n

Taking each model's top 35 features and deduplicating the union selects 57 of the 79 features.

  • First-layer training
# first layer
from sklearn.model_selection import KFold
ntrain = train_x_new.shape[0]
ntest = test_x_new.shape[0]
kf = KFold(n_splits = 5, shuffle=False)  # no shuffling, so random_state is not set

def get_out_fold(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((5, ntest))
    oof_train_prob = np.zeros((ntrain,))
    oof_test_prob = np.zeros((ntest,))
    oof_test_skf_prob = np.empty((5, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.fit(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)
        oof_train_prob[test_index] = clf.predict_proba(x_te)[:,1]
        oof_test_skf_prob[i, :] = clf.predict_proba(x_test)[:,1]
        print('Fold {}'.format(i))
        print('Training indices:')
        print(train_index)
        print('Validation indices:')
        print(test_index)
    oof_test[:] = oof_test_skf.mean(axis=0)
    oof_test_prob[:] = oof_test_skf_prob.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1),oof_train_prob.reshape(-1, 1), oof_test_prob.reshape(-1, 1)
rf = RandomForestClassifier(n_estimators=300,max_depth=50)
ada = AdaBoostClassifier(n_estimators=600,learning_rate=1)
gb = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
                        min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
dt = DecisionTreeClassifier(random_state=0,min_samples_split=70,max_depth=30)

x_train = train_x_new.values 
x_test = test_x_new.values 
y_train =train_y.values
rf_oof_train, rf_oof_test,rf_oof_train_prob, rf_oof_test_prob = get_out_fold(rf, x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test,ada_oof_train_prob, ada_oof_test_prob = get_out_fold(ada, x_train, y_train, x_test) # AdaBoost 
gb_oof_train, gb_oof_test,gb_oof_train_prob, gb_oof_test_prob = get_out_fold(gb, x_train, y_train, x_test) # Gradient Boost
dt_oof_train, dt_oof_test,dt_oof_train_prob, dt_oof_test_prob = get_out_fold(dt, x_train, y_train, x_test) # Decision Tree
xgbc = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1,  scale_pos_weight=1, seed=27,
                    subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
xgbc_oof_train, xgbc_oof_test,xgbc_oof_train_prob, xgbc_oof_test_prob = get_out_fold(xgbc, x_train, y_train, x_test) # XGBClassifier
print("Training is complete")
  • Second-layer training
    The first layer's out-of-fold outputs become the second layer's training and test sets.
# Assemble the second-layer training and test sets from the five models' out-of-fold probabilities
train_x2_prob = pd.DataFrame(np.concatenate((rf_oof_train_prob, ada_oof_train_prob, gb_oof_train_prob, dt_oof_train_prob, xgbc_oof_train_prob), axis=1),columns=['rf_prob','ada_prob','gb_prob','dt_prob','xgb_prob'])
test_x2_prob = pd.DataFrame(np.concatenate((rf_oof_test_prob, ada_oof_test_prob, gb_oof_test_prob, dt_oof_test_prob, xgbc_oof_test_prob), axis=1),columns=['rf_prob','ada_prob','gb_prob','dt_prob','xgb_prob'])
# Train the second-layer logistic regression
from sklearn.linear_model import LogisticRegression
# Tuning (commented out)
# param_rf4 = {'penalty': ['l1','l2'],'C':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]}
# rf_est4 = LogisticRegression()
# rfsearch4 = GridSearchCV(estimator=rf_est4,param_grid=param_rf4,scoring='roc_auc',iid=False,cv=5)
# rfsearch4.fit(train_x2_prob,train_y)
# print('mean score per parameter value: {}'.format(rfsearch4.cv_results_['mean_test_score']))
# print('best parameters: {}'.format(rfsearch4.best_params_))
# print('best roc_auc score: {}'.format(rfsearch4.best_score_))
# Tuning result: C=0.1, penalty='l2'
lr = LogisticRegression(C=0.1,penalty='l2')
lr.fit(train_x2_prob,train_y)
predict_train = lr.predict_proba(train_x2_prob)[:,1]
predict_test = lr.predict_proba(test_x2_prob)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))

Output:
0.7832498511331395
0.9763271659779821

Stacking raises the maximum recall from 76.4% to 78.3%.
