User Churn Early Warning for a Travel Platform

  • Project Overview
  • Data Exploration
  • Feature Engineering
  • Model Training
  • Model Ensembling

1. Project Overview

  • Background

    As China's leading full-service travel company, Ctrip serves more than 250 million members with a comprehensive range of travel products. The behavior data behind this huge volume of site traffic can be mined for valuable signals, and churn rate is one of the key indicators of business performance. The goal of this analysis is to understand user profiles and behavioral preferences, find the best-performing algorithm, and uncover the key factors that drive churn, so that product design and user experience can be improved.

  • Evaluation criterion

    Maximize recall subject to precision reaching at least 97% (a sketch of this metric in code follows).
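    A minimal sketch of how the metric can be read off a precision-recall curve (y_true and y_score are hypothetical names for the true labels and predicted probabilities):

from sklearn import metrics
import pandas as pd

def max_recall_at_precision(y_true, y_score, precision_floor=0.97):
    # compute the full precision-recall curve, then report the best recall
    # among the operating points whose precision meets the floor
    precision, recall, _ = metrics.precision_recall_curve(y_true, y_score)
    prt = pd.DataFrame({'precision': precision, 'recall': recall})
    return prt.loc[prt['precision'] >= precision_floor, 'recall'].max()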

  • Dataset

    Two datasets are provided: a training set userlostprob_train.txt and a test set userlostprob_test.txt. The training set covers one week of visits from 2016-05-15 to 2016-05-21, and the test set the following week, 2016-05-22 to 2016-05-28. The test set does not include the target label, which must be predicted. To protect customer privacy, no uid or similar identifiers are provided, and the data has been desensitized, so order volumes, page views, conversion rates and the like deviate somewhat from the real figures; this does not affect the solvability of the problem.

    Apart from id and label, the features fall into roughly three groups: order-level features, such as the booking date and the check-in date; user-level features; and hotel-level features, such as the number of hotel reviews and the star-rating preference.

2俊马、數(shù)據(jù)探索

# Load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.cluster import KMeans
from sklearn import metrics
pd.set_option('display.max_rows', 200) 
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 200)
# Load the data
df_orign = pd.read_csv('userlostprob_train.txt', sep='\t')
df = df_orign.copy()  # work on a copy
df.shape

(689945, 51)

2.1 Target Variable Distribution

df['label'].value_counts() 
0    500588
1    189357
Name: label, dtype: int64
  • Churned to retained users are roughly 2:5; the classes are not severely imbalanced, so no resampling is done here (quick check below).
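As a quick check of that ratio:
df['label'].value_counts(normalize=True)  # about 0.73 retained (0) vs 0.27 churned (1), i.e. roughly 5:2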

2.2 Outlier Handling

df.describe()

df.describe() shows negative values in the user price-preference fields delta_price1 and delta_price2 and in lowestprice, the current hotel's lowest bookable price; a price can never be negative. These distributions are fairly concentrated, so the negatives are replaced with the median. The customer-value fields customer_value_profit and ctrip_profits should not be negative either and are set to 0. deltaprice_pre2_t1, the mean price difference against competitors, can legitimately be negative and is left as is.

# Skip the leading non-numeric fields
df_min=df.min().iloc[4:]  
# Fields whose minimum is negative
index=df_min[df_min<0].index.tolist()
# Plot the distribution of each field that contains negatives
plt.figure(figsize=(20,10))
for i in range(len(index)):
    plt.subplot(2,3,i+1)
    plt.hist(df[index[i]],bins=100)
    plt.title(index[i])
  • The distributions are fairly concentrated, so the negative values are replaced with the median.
neg1=['delta_price1','delta_price2','lowestprice']   # fill with the median
neg2=['customer_value_profit','ctrip_profits']  # fill with 0
for col in neg1:
    df.loc[df[col]<0,col]=df[col].median()
for col in neg2:
    df.loc[df[col]<0,col]=0
  • landhalfhours, the login duration within the last 24 hours, cannot exceed 24 hours; values above 24 are capped at 24.
df.loc[df['landhalfhours']>24,['landhalfhours']] = 24

2.3 Type Conversion

  • The visit date d and the check-in date arrival are strings; convert them to datetime format.
df['d']=pd.to_datetime(df['d'],format="%Y-%m-%d")
df['arrival']=pd.to_datetime(df['arrival'],format="%Y-%m-%d")

3阳柔、特征工程

  • Data and features determine the upper bound of what machine learning can achieve; models and algorithms merely approach that bound. Feature engineering is therefore the key step before modeling: features handled well can noticeably lift model performance.

3.1 Missing Value Handling

  • Inspect the proportion of missing values in each field
na_rate=(len(df)-df.count())/len(df)  # missing ratio per column (a Series)
na_rate.sort_values(ascending=True,inplace=True) # sort
na_rate=pd.DataFrame(na_rate,columns=['rate'])  # convert to a DataFrame
# Plot
plt.figure(figsize=(6,12)) 
plt.barh(na_rate.index,na_rate['rate'],alpha = 0.5)
plt.xlabel('na_rate') # axis label
plt.xlim([0,1]) # axis range
for x,y in enumerate(na_rate['rate']):
    plt.text(y,x,'%.2f%%'%(y*100))  # value labels, shown as percentages

The chart shows that nearly every field has missing values, all of them continuous fields. historyvisit_7ordernum is more than 80% missing and no longer worth analyzing, so it is dropped. The remaining fields with gaps are filled from other values.

  • Fill using correlated fields
    Computing pairwise correlations shows that commentnums_pre correlates strongly with novoters_pre, and commentnums_pre2 with novoters_pre2.

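    A minimal sketch of that correlation check (pandas defaults to Pearson correlation):

# pairwise correlations among the review-count and voter-count fields
df[['commentnums_pre','novoters_pre','commentnums_pre2','novoters_pre2']].corr()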


    Taking a review rate of about 65% (the approximate median ratio of reviews to voters observed above), commentnums_pre is filled with novoters_pre*65%, and novoters_pre with commentnums_pre/65%. This fills part of the missing values in commentnums_pre and novoters_pre; the remaining gaps are filled with the median.

# If exactly one of the pair is missing, derive it from the other via the 65% review rate
def fill_commentnum_novoter_pre(x):
    if (x.isnull()['commentnums_pre'])&(x.notnull()['novoters_pre']): 
        x['commentnums_pre'] = x['novoters_pre']*0.65
    elif (x.notnull()['commentnums_pre'])&(x.isnull()['novoters_pre']):
        x['novoters_pre'] = x['commentnums_pre']/0.65
    else:
        return x
    return x
df[['commentnums_pre','novoters_pre']] = df[['commentnums_pre','novoters_pre']].apply(fill_commentnum_novoter_pre,axis=1)
def fill_commentnum_novoter_pre2(x):
    if (x.isnull()['commentnums_pre2'])&(x.notnull()['novoters_pre2']):
        x['commentnums_pre2'] = x['novoters_pre2']*0.65
    elif (x.notnull()['commentnums_pre2'])&(x.isnull()['novoters_pre2']):
        x['novoters_pre2'] = x['commentnums_pre2']/0.65
    else:
        return x
    return x
df[['commentnums_pre2','novoters_pre2']] = df[['commentnums_pre2','novoters_pre2']].apply(fill_commentnum_novoter_pre2,axis=1)
# Mean fill (fields that are roughly normally distributed and not dominated by extreme values)
fill_mean = ['cancelrate','landhalfhours','visitnum_oneyear','starprefer','price_sensitive','lowestprice','customereval_pre2',
            'uv_pre2','lowestprice_pre2','novoters_pre2','commentnums_pre2','businessrate_pre2','lowestprice_pre','hotelcr','cancelrate_pre']
df[fill_mean] = df[fill_mean].apply(lambda x:x.fillna(x.mean()))

# Median fill
fill_median = ['ordernum_oneyear','commentnums_pre','novoters_pre','uv_pre','ordercanncelednum','ordercanceledprecent',
               'lasthtlordergap','cityuvs','cityorders','lastpvgap','historyvisit_avghotelnum','businessrate_pre','cr','uv_pre','cr_pre'
                ,'novoters_pre','commentnums_pre','novoters','hoteluv','ctrip_profits','customer_value_profit']
df[fill_median] = df[fill_median].apply(lambda x:x.fillna(x.median()))

# Fill with 0
df[['deltaprice_pre2_t1','historyvisit_visit_detailpagenum']] = df[['deltaprice_pre2_t1','historyvisit_visit_detailpagenum']].apply(lambda x:x.fillna(0))
  • Segment-based fill
    consuming_capacity is correlated with starprefer, so consuming_capacity is filled according to starprefer segments. First, look at the summary statistics of the two fields:

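    A minimal way to reproduce that summary:

df[['starprefer','consuming_capacity']].describe()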


    starprefer is split into three bands: <60, 60~80, >80

# per-band means of consuming_capacity (scalars, so they can be assigned to single cells below)
fill1 = df.loc[df['starprefer']<60,'consuming_capacity'].mean()
fill2 = df.loc[(df['starprefer']<80)&(df['starprefer']>=60),'consuming_capacity'].mean()
fill3 = df.loc[df['starprefer']>=80,'consuming_capacity'].mean()
def fill_consuming_capacity(x):
    if x.isnull()['consuming_capacity']:
        if x['starprefer']<60:
            x['consuming_capacity'] = fill1
        elif (x['starprefer']<80)&(x['starprefer']>=60):
            x['consuming_capacity'] = fill2
        else:
            x['consuming_capacity'] = fill3
    else:
        return x
    return x
df[['consuming_capacity','starprefer']] = df[['consuming_capacity','starprefer']].apply(fill_consuming_capacity,axis=1)
  • Cluster-based fill
    commentnums correlates strongly with novoters, cancelrate and hoteluv,
    so commentnums is filled with the per-cluster median after clustering on those three fields.
# commentnums: number of reviews of the current hotel
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
km = KMeans(n_clusters=4)
data = df.loc[:,['commentnums','novoters','cancelrate','hoteluv']]
ss = StandardScaler()  # standardize first, since k-means is distance-based
data[['novoters','cancelrate','hoteluv']] = pd.DataFrame(ss.fit_transform(data[['novoters','cancelrate','hoteluv']]))

km.fit(data.iloc[:,1:])
label_pred = km.labels_
data['label_pred'] = label_pred
# fill each cluster's missing commentnums with that cluster's median
for k in range(4):
    mask = (data['commentnums'].isnull())&(data['label_pred']==k)
    data.loc[mask,'commentnums'] = data.loc[data['label_pred']==k,'commentnums'].median()
df['commentnums'] = data['commentnums']

Fill missing avgprice with the per-cluster mean of avgprice after clustering on starprefer and consuming_capacity.

# avgprice:starprefer,consuming_capacity
km = KMeans(n_clusters=5)
data = df.loc[:,['avgprice','starprefer','consuming_capacity']]
ss = StandardScaler()  # standardize first, since k-means is distance-based
data[['starprefer','consuming_capacity']] = pd.DataFrame(ss.fit_transform(data[['starprefer','consuming_capacity']]))
km.fit(data.iloc[:,1:])
label_pred = km.labels_
data['label_pred'] = label_pred
# metrics.calinski_harabaz_score(data.iloc[:,1:],km.labels_)
# fill each cluster's missing avgprice with that cluster's mean
for k in range(5):
    mask = (data['avgprice'].isnull())&(data['label_pred']==k)
    data.loc[mask,'avgprice'] = data.loc[data['label_pred']==k,'avgprice'].mean()
df['avgprice'] = data['avgprice']

Fill missing delta_price1 with the per-cluster median after clustering on consuming_capacity and avgprice.

# delta_price1:consuming_capacity,avgprice
km = KMeans(n_clusters=6)
data = df.loc[:,['delta_price1','consuming_capacity','avgprice']]
ss = StandardScaler()  # standardize first, since k-means is distance-based
data[['consuming_capacity','avgprice']] = pd.DataFrame(ss.fit_transform(data[['consuming_capacity','avgprice']]))

km.fit(data.iloc[:,1:])
label_pred = km.labels_
data['label_pred'] = label_pred
# metrics.calinski_harabaz_score(data.iloc[:,1:],km.labels_)
# fill each cluster's missing delta_price1 with that cluster's median
for k in range(6):
    mask = (data['delta_price1'].isnull())&(data['label_pred']==k)
    data.loc[mask,'delta_price1'] = data.loc[data['label_pred']==k,'delta_price1'].median()
df['delta_price1'] = data['delta_price1']

Fill missing delta_price2 with the per-cluster median of delta_price2 after clustering on consuming_capacity and avgprice.

# delta_price2: consuming_capacity,avgprice
km = KMeans(n_clusters=5)
data = df.loc[:,['delta_price2','avgprice','consuming_capacity']]
ss = StandardScaler()  # standardize first, since k-means is distance-based
data[['avgprice','consuming_capacity']] = pd.DataFrame(ss.fit_transform(data[['avgprice','consuming_capacity']]))

km.fit(data.iloc[:,1:])
label_pred = km.labels_
data['label_pred'] = label_pred
#metrics.calinski_harabaz_score(data.iloc[:,1:],km.labels_)
# fill each cluster's missing delta_price2 with that cluster's median
for k in range(5):
    mask = (data['delta_price2'].isnull())&(data['label_pred']==k)
    data.loc[mask,'delta_price2'] = data.loc[data['label_pred']==k,'delta_price2'].median()
df['delta_price2'] = data['delta_price2']
  • This completes the missing-value handling (sanity check below).
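A quick check that no gaps remain (historyvisit_7ordernum is excluded because it is only dropped in section 3.4):
df.drop(columns=['historyvisit_7ordernum']).isnull().sum().sum()  # should be 0 if every field above was covered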

3.2 New Features

  • Time features
    New fields: booking_gap, the number of days between the visit date and the check-in date; week_day, the day of the week of the check-in date; and is_weekend, whether the check-in date falls on a weekend.
# format: year-month-day
df[['d','arrival']] = df[['d','arrival']].apply(lambda x:pd.to_datetime(x,format='%Y-%m-%d'))
# days between the visit date and the check-in date
df['booking_gap'] = ((df['arrival']-df['d'])/np.timedelta64(1,'D')).astype(int)
# day of the week of the check-in date
df['week_day'] = df['arrival'].map(lambda x:x.weekday())
# whether the check-in date is a weekend
df['is_weekend'] = df['week_day'].map(lambda x: 1 if x in (5,6) else 0)
  • Same-user tag (built from selected behavioral fields)
    Inspecting sid shows that about 95% of records come from returning users, and some users place several orders within the week. To make the later train/validation split cleaner, a user_tag field is added to mark orders that likely belong to the same user.
df['user_tag'] = df['ordercanceledprecent'].map(str) + df['ordercanncelednum'].map(str) + df['ordernum_oneyear'].map(str) +\
                  df['starprefer'].map(str) + df['consuming_capacity'].map(str) + \
                 df['price_sensitive'].map(str) + df['customer_value_profit'].map(str) + df['ctrip_profits'].map(str) +df['visitnum_oneyear'].map(str) + \
                  df['historyvisit_avghotelnum'].map(str) + df['businessrate_pre2'].map(str) +\
                df['historyvisit_visit_detailpagenum'].map(str) + \
                  df['delta_price2'].map(str) +  \
                df['commentnums_pre2'].map(str) + df['novoters_pre2'].map(str) +df['customereval_pre2'].map(str) + df['lowestprice_pre2'].map(str)
df['user_tag'] = df['user_tag'].apply(lambda x : hash(x))
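One caveat: Python's built-in hash() is salted per interpreter session for strings, so user_tag values are not reproducible across runs. If reproducibility matters, a digest-based hash is a drop-in alternative (a sketch, not part of the original pipeline):
import hashlib
# deterministic 64-bit hash that could replace hash(x) in the apply above
stable_hash = lambda s: int(hashlib.md5(s.encode()).hexdigest()[:16], 16)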
  • User and hotel cluster features
    Cluster selected user-related fields to create a user feature user_group, and selected hotel-related fields to create a hotel feature hotel_group.
user_group = ['ordercanceledprecent','ordercanncelednum','ordernum_oneyear',
             'historyvisit_visit_detailpagenum','historyvisit_avghotelnum']
hotel_group = ['commentnums', 'novoters', 'lowestprice', 'hotelcr', 'hoteluv', 'cancelrate']
# standardize before clustering
km_user = pd.DataFrame(df[user_group])
km_hotel = pd.DataFrame(df[hotel_group])
ss = StandardScaler()
for i in range(km_user.shape[1]):
    km_user[user_group[i]] = ss.fit_transform(df[user_group[i]].values.reshape(-1, 1)).ravel()
ss = StandardScaler()
for i in range(km_hotel.shape[1]):
    km_hotel[hotel_group[i]] = ss.fit_transform(df[hotel_group[i]].values.reshape(-1, 1)).ravel()
df['user_group'] = KMeans(n_clusters=3).fit_predict(km_user)
# score = metrics.calinski_harabaz_score(km_user,KMeans(n_clusters=3).fit(km_user).labels_)
# print('calinski_harabasz score: %f'%(score)) # k=3: 218580.269018  4: 218580.416497  5: 218581.368953  6: 218581.203569
df['hotel_group'] = KMeans(n_clusters=5).fit_predict(km_hotel)
# score = metrics.calinski_harabaz_score(km_hotel,KMeans(n_clusters=3).fit(km_hotel).labels_)
# print('calinski_harabasz score: %f'%(score))  # k=3: 266853.481135  4: 268442.314369  5: 268796.468103  6: 268796.707149

3.3 Discretizing Continuous Features

historyvisit_avghotelnum is mostly below 5, so it is binarized into <=5 and >5;
ordercanncelednum is mostly below 5, so it is likewise binarized into <=5 and >5;
sid equal to 1 marks a new visit and is set to 0, anything else is set to 1 (returning user);
avgprice, lowestprice, starprefer, consuming_capacity and h are discretized into value bands.

df['historyvisit_avghotelnum'] = df['historyvisit_avghotelnum'].apply(lambda x: 0 if x<=5 else 1)
df['ordercanncelednum'] = df['ordercanncelednum'].apply(lambda x: 0 if x<=5 else 1)
df['sid'] = df['sid'].apply(lambda x: 0 if x==1 else 1)  
# banded discretization
def discrete_avgprice(x):
    if x<=200:
        return 0
    elif x<=400:
        return 1
    elif x<=600:
        return 2
    else:
        return 3
    
def discrete_lowestprice(x):
    if x<=100:
        return 0
    elif x<=200:
        return 1
    elif x<=300:
        return 2
    else:
        return 3
    
def discrete_starprefer(x):
    if x==0:
        return 0
    elif x<=60:
        return 1
    elif x<=80:
        return 2
    else:
        return 3
    
def discrete_consuming_capacity(x):
    if x<0:
        return 0
    elif x<=20:
        return 1
    elif x<=40:
        return 2
    elif x<=60:
        return 3
    else:
        return 4
    
def discrete_h(x):
    if x>=0 and x<6:  # early-morning visit
        return 0
    elif x<12:  # morning visit
        return 1
    elif x<18:  # afternoon visit
        return 2
    else:
        return 3  # evening visit
    
df['avgprice'] = df['avgprice'].map(discrete_avgprice)
df['lowestprice'] = df['lowestprice'].map(discrete_lowestprice)
df['starprefer'] = df['starprefer'].map(discrete_starprefer)
df['consuming_capacity'] = df['consuming_capacity'].map(discrete_consuming_capacity)
df['h'] = df['h'].map(discrete_h)
  • One-hot encode the remaining numeric categorical variables, here with OneHotEncoder
discrete_field = ['historyvisit_avghotelnum','ordercanncelednum'
                  ,'avgprice','lowestprice','starprefer','consuming_capacity','user_group',
                 'hotel_group','is_weekend','week_day','sid','h']
encode_df = pd.DataFrame(preprocessing.OneHotEncoder(handle_unknown='ignore').fit_transform(df[discrete_field]).toarray())
encode_df_new = pd.concat([df.drop(columns=discrete_field,axis=1),encode_df],axis=1)
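A side note: built this way, encode_df ends up with integer column names. In newer scikit-learn (>=1.0) the encoder can emit readable names instead (a hedged sketch):
enc = preprocessing.OneHotEncoder(handle_unknown='ignore').fit(df[discrete_field])
encode_df = pd.DataFrame(enc.transform(df[discrete_field]).toarray(),
                         columns=enc.get_feature_names_out(discrete_field))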

3.4 Dropping Fields

Two kinds of fields are dropped. First, fields with no analytical value: d, arrival, sampleid and firstorder_bu. Second, redundant fields: historyvisit_totalordernum always equals ordernum_oneyear, so ordernum_oneyear is kept and historyvisit_totalordernum dropped; decisionhabit_user largely matches historyvisit_avghotelnum, so historyvisit_avghotelnum is kept and decisionhabit_user dropped. The mostly-missing historyvisit_7ordernum flagged in section 3.1 is dropped here as well.

encode_df_new = encode_df_new.drop(columns=['d','arrival','sampleid','historyvisit_totalordernum','firstorder_bu','decisionhabit_user','historyvisit_7ordernum'],axis=1)
encode_df_new.shape

4. Model Training

4.1 Splitting Training and Validation Sets

So that orders from the same user do not straddle the split, the data is sorted by user_tag; the first 70% is taken as the training set and the remaining 30% as the validation set.

ss_df_new = encode_df_new
num = ss_df_new.shape[0]
df_sort = ss_df_new.sort_values(by=['user_tag'],ascending=True)
train_df = df_sort.iloc[:int(num*0.7),:]
test_df = df_sort.iloc[int(num*0.7):,:]
train_y = train_df['label']
train_x = train_df.iloc[:,1:]
test_y = test_df['label']
test_x = test_df.iloc[:,1:]

4.2 Comparing Model Performance

All models are tuned with GridSearchCV grid search; the per-model search code is omitted, but it follows the pattern sketched below.
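A minimal sketch of the tuning pattern (the parameter grid here is illustrative, not the grid actually used):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [30, 40, 50]}  # illustrative grid
search = GridSearchCV(RandomForestClassifier(), param_grid, scoring='roc_auc', cv=5)
search.fit(train_x, train_y)
print(search.best_params_, search.best_score_)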

  • Decision Tree
from sklearn.tree import DecisionTreeClassifier
bdt = DecisionTreeClassifier(random_state=0,max_depth=30, min_samples_split=70)
bdt.fit(train_x,train_y)
predict_train = bdt.predict_proba(train_x)[:,1]
predict_test = bdt.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))
Output:
0.0
0.8340018840954033
  • Random Forest
# Tuned parameters:
#n_estimators
#max_depth
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=300,max_depth=50)
rf.fit(train_x,train_y)
predict_train = rf.predict_proba(train_x)[:,1]
predict_test = rf.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))
Output:
0.666135416301797
0.9616117844760916
  • AdaBoost
from sklearn.ensemble import AdaBoostClassifier
bdt = AdaBoostClassifier(algorithm="SAMME",
                         n_estimators=600, learning_rate=1)
bdt.fit(train_x,train_y)
predict_train = bdt.predict_proba(train_x)[:,1]
predict_test = bdt.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))
Output:
0.00019265123121650496
0.7300356696791559
  • GBDT
# Tuned parameters:
#n_estimators
#max_depth and min_samples_split
#min_samples_split and min_samples_leaf
#max_features
#subsample
#learning_rate, adjusted together with n_estimators
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
# Final tuned parameters
gbc = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
                        min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
gbc.fit(train_x,train_y)
predict_train = gbc.predict_proba(train_x)[:,1]
predict_test = gbc.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('After tuning, max recall at precision>=0.97 on the validation set:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))
Output:
0.15988300816140671
0.8808204850185188
  • XGBoost
# Tuned parameters:
#n_estimators (number of boosting rounds)
#min_child_weight and max_depth
#gamma
#subsample and colsample_bytree
#learning_rate, adjusted together with n_estimators


from xgboost.sklearn import XGBClassifier
xgbc = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1,  scale_pos_weight=1, seed=27,
                    subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
xgbc.fit(train_x,train_y)
predict_train = xgbc.predict_proba(train_x)[:,1]
predict_test = xgbc.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))
Output:
0.7640022417597814
0.9754939563495324
  • By these results, XGBoost trains best: with precision>=0.97, recall reaches a maximum of 76.4%.

5. Model Ensembling

Stacking was also tried to see whether it could do better. Using the models above, 57 features were selected by feature importance; KFold 5-fold cross-validation then produced out-of-fold predictions from the five models, which serve as the training and test data for a second layer. A logistic regression model is trained on these five features. The final result: with precision>=0.97, recall reaches a maximum of 78.3%, a modest improvement over the earlier 76.4%.

  • Selecting important features
# Feature selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier

def get_top_n_features(train_x, train_y):

    # random forest
    rf_est = RandomForestClassifier(n_estimators=300,max_depth=50)
    rf_est.fit(train_x, train_y)
    feature_imp_sorted_rf = pd.DataFrame({'feature': train_x.columns,
                                          'importance': rf_est.feature_importances_}).sort_values('importance', ascending=False)

    # AdaBoost
    ada_est =AdaBoostClassifier(n_estimators=600,learning_rate=1)
    ada_est.fit(train_x, train_y)
    feature_imp_sorted_ada = pd.DataFrame({'feature': train_x.columns,
                                           'importance': ada_est.feature_importances_}).sort_values('importance', ascending=False)

    
    # GradientBoosting
    gb_est = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
                        min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
    gb_est.fit(train_x, train_y)
    feature_imp_sorted_gb = pd.DataFrame({'feature':train_x.columns,
                                          'importance': gb_est.feature_importances_}).sort_values('importance', ascending=False)

    # DecisionTree
    dt_est = DecisionTreeClassifier(random_state=0,min_samples_split=70,max_depth=30)
    dt_est.fit(train_x, train_y)
    feature_imp_sorted_dt = pd.DataFrame({'feature':train_x.columns,
                                          'importance': dt_est.feature_importances_}).sort_values('importance', ascending=False)
    
    # xgbc
    xg_est = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1,  scale_pos_weight=1, seed=27,
                    subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
    xg_est.fit(train_x, train_y)
    feature_imp_sorted_xg = pd.DataFrame({'feature':train_x.columns,
                                          'importance': xg_est.feature_importances_}).sort_values('importance', ascending=False)

    
    return feature_imp_sorted_rf,feature_imp_sorted_ada,feature_imp_sorted_gb,feature_imp_sorted_dt,feature_imp_sorted_xg

feature_imp_sorted_rf,feature_imp_sorted_ada,feature_imp_sorted_gb,feature_imp_sorted_dt,feature_imp_sorted_xg = get_top_n_features(train_x, train_y)
top_n_features = 35
features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']
features_top_n_gb = feature_imp_sorted_gb.head(top_n_features)['feature']
features_top_n_dt = feature_imp_sorted_dt.head(top_n_features)['feature']
features_top_n_xg = feature_imp_sorted_xg.head(top_n_features)['feature']
features_top_n = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_gb, features_top_n_dt,features_top_n_xg], 
                               ignore_index=True).drop_duplicates()
    
features_importance = pd.concat([feature_imp_sorted_rf, feature_imp_sorted_ada, 
                                   feature_imp_sorted_gb, feature_imp_sorted_dt,feature_imp_sorted_xg],ignore_index=True)
train_x_new = pd.DataFrame(train_x[features_top_n])
test_x_new = pd.DataFrame(test_x[features_top_n])
features_top_n

In the end, 57 of the 79 features were selected.

  • First-layer training
# First layer
from sklearn.model_selection import KFold
ntrain = train_x_new.shape[0]
ntest = test_x_new.shape[0]
kf = KFold(n_splits = 5, shuffle=False)  # random_state is only meaningful when shuffle=True

def get_out_fold(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((5, ntest))
    oof_train_prob = np.zeros((ntrain,))
    oof_test_prob = np.zeros((ntest,))
    oof_test_skf_prob = np.empty((5, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.fit(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)
        oof_train_prob[test_index] = clf.predict_proba(x_te)[:,1]
        oof_test_skf_prob[i, :] = clf.predict_proba(x_test)[:,1]
        print('Fold {}'.format(i))
        print('Training indices:')
        print(train_index)
        print('Validation indices:')
        print(test_index)
    oof_test[:] = oof_test_skf.mean(axis=0)
    oof_test_prob[:] = oof_test_skf_prob.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1),oof_train_prob.reshape(-1, 1), oof_test_prob.reshape(-1, 1)
rf = RandomForestClassifier(n_estimators=300,max_depth=50)
ada = AdaBoostClassifier(n_estimators=600,learning_rate=1)
gb = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
                        min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
dt = DecisionTreeClassifier(random_state=0,min_samples_split=70,max_depth=30)

x_train = train_x_new.values 
x_test = test_x_new.values 
y_train =train_y.values
rf_oof_train, rf_oof_test,rf_oof_train_prob, rf_oof_test_prob = get_out_fold(rf, x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test,ada_oof_train_prob, ada_oof_test_prob = get_out_fold(ada, x_train, y_train, x_test) # AdaBoost 
gb_oof_train, gb_oof_test,gb_oof_train_prob, gb_oof_test_prob = get_out_fold(gb, x_train, y_train, x_test) # Gradient Boost
dt_oof_train, dt_oof_test,dt_oof_train_prob, dt_oof_test_prob = get_out_fold(dt, x_train, y_train, x_test) # Decision Tree
xgbc = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1,  scale_pos_weight=1, seed=27,
                    subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
xgbc_oof_train, xgbc_oof_test,xgbc_oof_train_prob, xgbc_oof_test_prob = get_out_fold(xgbc, x_train, y_train, x_test) # XGBClassifier
print("Training is complete")
  • Second-layer training
    The first layer's out-of-fold outputs become the training and test sets of the second layer.
# Build the second-layer training and test sets
train_x2_prob = pd.DataFrame(np.concatenate((rf_oof_train_prob, ada_oof_train_prob, gb_oof_train_prob, dt_oof_train_prob, xgbc_oof_train_prob), axis=1),columns=['rf_prob','ada_prob','gb_prob','dt_prob','xgbc_prob'])
test_x2_prob = pd.DataFrame(np.concatenate((rf_oof_test_prob, ada_oof_test_prob, gb_oof_test_prob, dt_oof_test_prob, xgbc_oof_test_prob), axis=1),columns=['rf_prob','ada_prob','gb_prob','dt_prob','xgbc_prob'])
# Train the logistic regression meta-model
from sklearn.linear_model import LogisticRegression
# Tuning
# param_rf4 = {'penalty': ['l1','l2'],'C':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]}
# rf_est4 = LogisticRegression()
# rfsearch4 = GridSearchCV(estimator=rf_est4,param_grid=param_rf4,scoring='roc_auc',iid=False,cv=5)
# rfsearch4.fit(train_x2_prob,train_y)
# print('mean score per parameter value: {}'.format(rfsearch4.cv_results_['mean_test_score']))
# print('best parameters: {}'.format(rfsearch4.best_params_))
# print('best roc_auc score: {}'.format(rfsearch4.best_score_))
# Tuning result: C=0.1, penalty='l2'
lr = LogisticRegression(C=0.1,penalty='l2')
lr.fit(train_x2_prob,train_y)
predict_train = lr.predict_proba(train_x2_prob)[:,1]
predict_test = lr.predict_proba(test_x2_prob)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('Max recall at precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC: {}'.format(auc_test))
Output:
0.7832498511331395
0.9763271659779821

Stacking lifts the maximum recall from 76.4% to 78.3%.
