- Project Overview
- Data Exploration
- Feature Engineering
- Model Training
- Model Ensembling
1婆排、項(xiàng)目介紹
-
背景:
攜程作為中國(guó)領(lǐng)先的綜合性旅行服務(wù)公司蒋得,每天向超過2.5億會(huì)員提供全方位的旅行服務(wù),在這海量的網(wǎng)站訪問量中猪腕,我們可分析用戶的行為數(shù)據(jù)來(lái)挖掘潛在的信息資源姜胖。其中,客戶流失率是考量業(yè)務(wù)成績(jī)的一個(gè)非常關(guān)鍵的指標(biāo)纳猪。此次分析的目的是為了深入了解用戶畫像及行為偏好氧卧,找到最優(yōu)算法,挖掘出影響用戶流失的關(guān)鍵因素氏堤,從而更好地完善產(chǎn)品設(shè)計(jì)沙绝、提升用戶體驗(yàn)搏明。
- Evaluation criterion:
Maximize recall while keeping precision at or above 97%.
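This criterion can be read off a precision-recall curve; a minimal helper in the spirit of the evaluation code used later in this write-up (recall_at_precision is our naming, not part of the original):
import pandas as pd
from sklearn import metrics

def recall_at_precision(y_true, y_score, min_precision=0.97):
    # highest recall among operating points whose precision >= min_precision
    precision, recall, _ = metrics.precision_recall_curve(y_true, y_score)
    prt = pd.DataFrame({'precision': precision, 'recall': recall})
    return prt.loc[prt['precision'] >= min_precision, 'recall'].max()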
- Dataset:
Two datasets are provided: the training set userlostprob_train.txt and the test set userlostprob_test.txt. The training set covers one week of visits (2016-05-15 to 2016-05-21); the test set covers the following week (2016-05-22 to 2016-05-28). The test set omits the target variable label, which must be predicted. To protect customer privacy, no uid or similar identifiers are included, and the data have been desensitized, so order volumes, page views, conversion rates, and so on differ somewhat from the real figures; this does not affect the solvability of the problem.
Apart from id and label, the features fall roughly into three groups: order-level features, such as the booking date and the check-in date; user-level features; and hotel-level features, such as review counts and star-rating preference.
2俊马、數(shù)據(jù)探索
# load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.cluster import KMeans
from sklearn import metrics
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 200)
# load the data
df_orign = pd.read_csv('userlostprob_train.txt', sep='\t')
df = df_orign.copy()  # work on a copy
df.shape
(689945, 51)
2.1 Target variable distribution
df['label'].value_counts()
0 500588
1 189357
Name: label, dtype: int64
- Churned vs. retained users sit at roughly 2:5; the classes are not severely imbalanced, so no resampling is done here.
2.2 Handling outliers
df.describe()
The summary shows that the preferred-price fields delta_price1 and delta_price2, as well as lowestprice (the lowest bookable price at the current hotel), contain negative values, which is impossible for prices. These distributions are fairly concentrated, so negatives are replaced with the median. The customer-value fields customer_value_profit and ctrip_profits should not be negative either; those negatives are set to 0. deltaprice_pre2_t1, the mean price gap between the hotel and its competitors, can legitimately be negative and is left untouched.
# skip the leading non-numeric fields
df_min=df.min().iloc[4:]
# fields whose minimum is negative
index=df_min[df_min<0].index.tolist()
# distribution of each field that contains negative values
plt.figure(figsize=(20,10))
for i in range(len(index)):
    plt.subplot(2,3,i+1)
    plt.hist(df[index[i]].dropna(),bins=100)  # dropna: hist fails on NaN
    plt.title(index[i])
- The distributions are fairly concentrated, so the median is used for the fill.
neg1=['delta_price1','delta_price2','lowestprice']  # fill with the median
neg2=['customer_value_profit','ctrip_profits']      # fill with 0
for col in neg1:
    df.loc[df[col]<0,col]=df[col].median()
for col in neg2:
    df.loc[df[col]<0,col]=0
- landhalfhours, the login duration within a 24-hour window, cannot exceed 24 hours; values above 24 are capped at 24.
df.loc[df['landhalfhours']>24,['landhalfhours']] = 24
2.3 Type conversion
- The visit date d and the check-in date arrival are strings; convert them to datetime.
df['d']=pd.to_datetime(df['d'],format="%Y-%m-%d")
df['arrival']=pd.to_datetime(df['arrival'],format="%Y-%m-%d")
3阳柔、特征工程
- 數(shù)據(jù)和特征決定了機(jī)器學(xué)習(xí)效果的上限焰枢,而模型和算法只是逼近這個(gè)上限。特征工程是建模前的關(guān)鍵步驟舌剂,特征處理得好,可以提升模型的性能暑椰。
3.1 Missing-value treatment
- Missing ratio per field
na_rate=(len(df)-df.count())/len(df)              # fraction missing per column (a Series)
na_rate.sort_values(ascending=True,inplace=True)  # sort
na_rate=pd.DataFrame(na_rate,columns=['rate'])    # to a DataFrame
# plot
plt.figure(figsize=(6,12))
plt.barh(na_rate.index,na_rate['rate'],alpha = 0.5)
plt.xlabel('na_rate')  # axis label
plt.xlim([0,1])        # axis range
for x,y in enumerate(na_rate['rate']):
    plt.text(y,x,'%.2f%%'%(y*100))  # value labels as percentages
The chart shows that almost every field has missing values, and all of the affected fields are continuous. historyvisit_7ordernum is more than 80% missing, which leaves it no analytical value, so it is dropped. The gaps in the remaining fields are filled from other values.
- Fill from correlated fields
Computing the field correlations shows that commentnums_pre and novoters_pre are strongly correlated, as are commentnums_pre2 and novoters_pre2.
Taking 65% as the review rate (the median ratio), commentnums_pre is filled with novoters_pre*65%, and novoters_pre with commentnums_pre/65%. This fills part of the gaps in commentnums_pre and novoters_pre; whatever remains is filled with the median later.
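Both the correlation and the 65% rate can be checked directly (a quick look of ours, not in the original):
df[['commentnums_pre','novoters_pre','commentnums_pre2','novoters_pre2']].corr()
(df['commentnums_pre']/df['novoters_pre']).median()  # the write-up takes 0.65 as the review rate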
def fill_commentnum_novoter_pre(x):
    # when exactly one of the pair is missing, derive it from the other
    if (x.isnull()['commentnums_pre'])&(x.notnull()['novoters_pre']):
        x['commentnums_pre'] = x['novoters_pre']*0.65
    elif (x.notnull()['commentnums_pre'])&(x.isnull()['novoters_pre']):
        x['novoters_pre'] = x['commentnums_pre']/0.65
    return x
df[['commentnums_pre','novoters_pre']] = df[['commentnums_pre','novoters_pre']].apply(fill_commentnum_novoter_pre,axis=1)

def fill_commentnum_novoter_pre2(x):
    if (x.isnull()['commentnums_pre2'])&(x.notnull()['novoters_pre2']):
        x['commentnums_pre2'] = x['novoters_pre2']*0.65
    elif (x.notnull()['commentnums_pre2'])&(x.isnull()['novoters_pre2']):
        x['novoters_pre2'] = x['commentnums_pre2']/0.65
    return x
df[['commentnums_pre2','novoters_pre2']] = df[['commentnums_pre2','novoters_pre2']].apply(fill_commentnum_novoter_pre2,axis=1)
# mean fill (fields that are roughly normal and not dominated by extreme values)
fill_mean = ['cancelrate','landhalfhours','visitnum_oneyear','starprefer','price_sensitive','lowestprice','customereval_pre2',
             'uv_pre2','lowestprice_pre2','novoters_pre2','commentnums_pre2','businessrate_pre2','lowestprice_pre','hotelcr','cancelrate_pre']
df[fill_mean] = df[fill_mean].apply(lambda x:x.fillna(x.mean()))
# median fill
fill_median = ['ordernum_oneyear','commentnums_pre','novoters_pre','uv_pre','ordercanncelednum','ordercanceledprecent',
               'lasthtlordergap','cityuvs','cityorders','lastpvgap','historyvisit_avghotelnum','businessrate_pre','cr','cr_pre',
               'novoters','hoteluv','ctrip_profits','customer_value_profit']
df[fill_median] = df[fill_median].apply(lambda x:x.fillna(x.median()))
# fill with 0
df[['deltaprice_pre2_t1','historyvisit_visit_detailpagenum']] = df[['deltaprice_pre2_t1','historyvisit_visit_detailpagenum']].apply(lambda x:x.fillna(0))
- Piecewise fill
consuming_capacity correlates with starprefer, so consuming_capacity is filled by starprefer band. First, a look at how the two fields are distributed:
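The original shows the describe() output at this point; it can be reproduced with:
df[['starprefer','consuming_capacity']].describe()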
starprefer is split into three bands: <60, 60-80, >=80.
fill1 = df.loc[df['starprefer']<60,'consuming_capacity'].mean()
fill2 = df.loc[(df['starprefer']<80)&(df['starprefer']>=60),'consuming_capacity'].mean()
fill3 = df.loc[df['starprefer']>=80,'consuming_capacity'].mean()
def fill_consuming_capacity(x):
    # fill a missing consuming_capacity with the mean of its starprefer band
    if x.isnull()['consuming_capacity']:
        if x['starprefer']<60:
            x['consuming_capacity'] = fill1
        elif x['starprefer']<80:
            x['consuming_capacity'] = fill2
        else:
            x['consuming_capacity'] = fill3
    return x
df[['consuming_capacity','starprefer']] = df[['consuming_capacity','starprefer']].apply(fill_consuming_capacity,axis=1)
- Cluster-based fill
commentnums is strongly correlated with novoters, cancelrate, and hoteluv,
so commentnums is filled with per-cluster medians after clustering on those three fields.
#commentnums: review count of the current hotel
km = KMeans(n_clusters=4)
data = df.loc[:,['commentnums','novoters','cancelrate','hoteluv']]
ss = StandardScaler()  # k-means is distance-based, so standardize first
data[['novoters','cancelrate','hoteluv']] = ss.fit_transform(data[['novoters','cancelrate','hoteluv']])
km.fit(data.iloc[:,1:])
data['label_pred'] = km.labels_
# fill missing commentnums with the median of its cluster
for k in range(4):
    fill_k = data.loc[data['label_pred']==k,'commentnums'].median()
    data.loc[(data['commentnums'].isnull())&(data['label_pred']==k),'commentnums'] = fill_k
df['commentnums'] = data['commentnums']
avgprice is filled with the per-cluster mean of avgprice after clustering on starprefer and consuming_capacity.
# avgprice: cluster on starprefer and consuming_capacity
km = KMeans(n_clusters=5)
data = df.loc[:,['avgprice','starprefer','consuming_capacity']]
ss = StandardScaler()  # k-means is distance-based, so standardize first
data[['starprefer','consuming_capacity']] = ss.fit_transform(data[['starprefer','consuming_capacity']])
km.fit(data.iloc[:,1:])
data['label_pred'] = km.labels_
# metrics.calinski_harabasz_score(data.iloc[:,1:],km.labels_)
for k in range(5):
    fill_k = data.loc[data['label_pred']==k,'avgprice'].mean()
    data.loc[(data['avgprice'].isnull())&(data['label_pred']==k),'avgprice'] = fill_k
df['avgprice'] = data['avgprice']
delta_price1 is filled with per-cluster medians after clustering on consuming_capacity and avgprice.
# delta_price1: cluster on consuming_capacity and avgprice
km = KMeans(n_clusters=6)
data = df.loc[:,['delta_price1','consuming_capacity','avgprice']]
ss = StandardScaler()  # k-means is distance-based, so standardize first
data[['consuming_capacity','avgprice']] = ss.fit_transform(data[['consuming_capacity','avgprice']])
km.fit(data.iloc[:,1:])
data['label_pred'] = km.labels_
# metrics.calinski_harabasz_score(data.iloc[:,1:],km.labels_)
for k in range(6):
    fill_k = data.loc[data['label_pred']==k,'delta_price1'].median()
    data.loc[(data['delta_price1'].isnull())&(data['label_pred']==k),'delta_price1'] = fill_k
df['delta_price1'] = data['delta_price1']
delta_price2 is filled the same way, with per-cluster medians after clustering on consuming_capacity and avgprice.
# delta_price2: cluster on consuming_capacity and avgprice
km = KMeans(n_clusters=5)
data = df.loc[:,['delta_price2','avgprice','consuming_capacity']]
ss = StandardScaler()  # k-means is distance-based, so standardize first
data[['avgprice','consuming_capacity']] = ss.fit_transform(data[['avgprice','consuming_capacity']])
km.fit(data.iloc[:,1:])
data['label_pred'] = km.labels_
# metrics.calinski_harabasz_score(data.iloc[:,1:],km.labels_)
for k in range(5):
    fill_k = data.loc[data['label_pred']==k,'delta_price2'].median()
    data.loc[(data['delta_price2'].isnull())&(data['label_pred']==k),'delta_price2'] = fill_k
df['delta_price2'] = data['delta_price2']
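The four blocks above repeat one pattern and could be factored into a single helper; a sketch (fill_by_cluster and its arguments are our naming, not in the original):
def fill_by_cluster(frame, target, features, n_clusters, how='median'):
    # cluster the rows on the standardized helper features, then fill the
    # missing values of `target` with each cluster's median (or mean)
    data = frame[[target] + features].copy()
    data[features] = StandardScaler().fit_transform(data[features])
    data['cluster'] = KMeans(n_clusters=n_clusters).fit_predict(data[features])
    for k in range(n_clusters):
        members = data['cluster'] == k
        fill_k = getattr(data.loc[members, target], how)()
        frame.loc[members & frame[target].isnull(), target] = fill_k

# usage, equivalent in spirit to the block above:
# fill_by_cluster(df, 'delta_price2', ['avgprice','consuming_capacity'], 5)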
- This completes the missing-value treatment.
3.2 Derived fields
- Time fields
New fields: booking_gap, the number of days between the visit date and the check-in date; week_day, the day of the week of check-in; is_weekend, whether check-in falls on a weekend.
# year-month-day format (already converted in 2.3; repeating it is harmless)
df[['d','arrival']] = df[['d','arrival']].apply(lambda x:pd.to_datetime(x,format='%Y-%m-%d'))
# days between visit date and check-in date
df['booking_gap'] = ((df['arrival']-df['d'])/np.timedelta64(1,'D')).astype(int)
# day of week of check-in (Monday=0)
df['week_day'] = df['arrival'].map(lambda x:x.weekday())
# whether check-in falls on a weekend (Saturday=5, Sunday=6)
df['is_weekend'] = df['week_day'].map(lambda x: 1 if x in (5,6) else 0)
- Same-user tag (built from selected behavior fields)
Inspecting sid shows that about 95% of visits come from returning users and few from new ones, and some users place several orders within the week. To make the later train/validation split cleaner, a user_tag is added to mark orders that likely belong to the same user.
df['user_tag'] = df['ordercanceledprecent'].map(str) + df['ordercanncelednum'].map(str) + df['ordernum_oneyear'].map(str) +\
df['starprefer'].map(str) + df['consuming_capacity'].map(str) + \
df['price_sensitive'].map(str) + df['customer_value_profit'].map(str) + df['ctrip_profits'].map(str) +df['visitnum_oneyear'].map(str) + \
df['historyvisit_avghotelnum'].map(str) + df['businessrate_pre2'].map(str) +\
df['historyvisit_visit_detailpagenum'].map(str) + \
df['delta_price2'].map(str) + \
df['commentnums_pre2'].map(str) + df['novoters_pre2'].map(str) +df['customereval_pre2'].map(str) + df['lowestprice_pre2'].map(str)
df['user_tag'] = df['user_tag'].apply(lambda x : hash(x))
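One caveat: Python's built-in hash() is salted per interpreter session for strings, so the user_tag values above are not reproducible across runs. A stable alternative (our suggestion, standard library only) would replace the last line, where s is still the concatenated string:
import hashlib
df['user_tag'] = df['user_tag'].apply(lambda s: hashlib.md5(s.encode('utf-8')).hexdigest())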
- User and hotel group fields
Selected user-related fields are clustered to create user_group, and selected hotel-related fields are clustered to create hotel_group.
user_group = ['ordercanceledprecent','ordercanncelednum','ordernum_oneyear',
              'historyvisit_visit_detailpagenum','historyvisit_avghotelnum']
hotel_group = ['commentnums', 'novoters', 'lowestprice', 'hotelcr', 'hoteluv', 'cancelrate']
# standardize before clustering
km_user = pd.DataFrame(df[user_group])
km_hotel = pd.DataFrame(df[hotel_group])
ss = StandardScaler()
for i in range(km_user.shape[1]):
    km_user[user_group[i]] = ss.fit_transform(df[user_group[i]].values.reshape(-1, 1)).ravel()
ss = StandardScaler()
for i in range(km_hotel.shape[1]):
    km_hotel[hotel_group[i]] = ss.fit_transform(df[hotel_group[i]].values.reshape(-1, 1)).ravel()
df['user_group'] = KMeans(n_clusters=3).fit_predict(km_user)
# score = metrics.calinski_harabasz_score(km_user,KMeans(n_clusters=3).fit(km_user).labels_)
# print('calinski_harabasz score: %f'%(score)) # 3:218580.269018 4:218580.416497 5:218581.368953 6:218581.203569
df['hotel_group'] = KMeans(n_clusters=5).fit_predict(km_hotel)
# score = metrics.calinski_harabasz_score(km_hotel,KMeans(n_clusters=3).fit(km_hotel).labels_)
# print('calinski_harabasz score: %f'%(score)) # 3:266853.481135 4:268442.314369 5:268796.468103 6:268796.707149
3.3 Discretizing continuous features
historyvisit_avghotelnum is mostly at or below 5, so it becomes a binary flag (<=5 vs. >5);
ordercanncelednum is likewise mostly at or below 5 and becomes the same kind of flag;
sid==1 marks a new visitor and is coded 0; anything else is a returning user, coded 1;
avgprice, lowestprice, starprefer, consuming_capacity, and h (the visit hour) are discretized into value bands.
df['historyvisit_avghotelnum'] = df['historyvisit_avghotelnum'].apply(lambda x: 0 if x<=5 else 1)
df['ordercanncelednum'] = df['ordercanncelednum'].apply(lambda x: 0 if x<=5 else 1)
df['sid'] = df['sid'].apply(lambda x: 0 if x==1 else 1)
#piecewise discretization
def discrete_avgprice(x):
    if x<=200:
        return 0
    elif x<=400:
        return 1
    elif x<=600:
        return 2
    else:
        return 3

def discrete_lowestprice(x):
    if x<=100:
        return 0
    elif x<=200:
        return 1
    elif x<=300:
        return 2
    else:
        return 3

def discrete_starprefer(x):
    if x==0:
        return 0
    elif x<=60:
        return 1
    elif x<=80:
        return 2
    else:
        return 3

def discrete_consuming_capacity(x):
    if x<0:
        return 0
    elif x<=20:
        return 1
    elif x<=40:
        return 2
    elif x<=60:
        return 3
    else:
        return 4

def discrete_h(x):
    if x>=0 and x<6:   # early-morning visit
        return 0
    elif x<12:         # morning visit
        return 1
    elif x<18:         # afternoon visit
        return 2
    else:
        return 3       # evening visit
df['avgprice'] = df['avgprice'].map(discrete_avgprice)
df['lowestprice'] = df['lowestprice'].map(discrete_lowestprice)
df['starprefer'] = df['starprefer'].map(discrete_starprefer)
df['consuming_capacity'] = df['consuming_capacity'].map(discrete_consuming_capacity)
df['h'] = df['h'].map(discrete_h)
- One-hot encode the resulting discrete categorical variables, here with OneHotEncoder.
discrete_field = ['historyvisit_avghotelnum','ordercanncelednum'
,'avgprice','lowestprice','starprefer','consuming_capacity','user_group',
'hotel_group','is_weekend','week_day','sid','h']
encode_df = pd.DataFrame(preprocessing.OneHotEncoder(handle_unknown='ignore').fit_transform(df[discrete_field]).toarray())
encode_df_new = pd.concat([df.drop(columns=discrete_field,axis=1),encode_df],axis=1)
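As written, encode_df ends up with integer column names. Newer scikit-learn (>=1.0) can attach readable dummy-column names instead; a sketch of ours:
enc = preprocessing.OneHotEncoder(handle_unknown='ignore').fit(df[discrete_field])
encode_df = pd.DataFrame(enc.transform(df[discrete_field]).toarray(),
                         columns=enc.get_feature_names_out(discrete_field))
encode_df_new = pd.concat([df.drop(columns=discrete_field,axis=1),encode_df],axis=1)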
3.4 Dropping fields
Two kinds of fields are dropped. First, d, arrival, sampleid, and firstorder_bu, which carry nothing useful for this analysis. Second, redundant ones: historyvisit_totalordernum equals ordernum_oneyear, so ordernum_oneyear is kept and historyvisit_totalordernum is dropped; decisionhabit_user largely matches historyvisit_avghotelnum, so historyvisit_avghotelnum is kept and decisionhabit_user is dropped. historyvisit_7ordernum (over 80% missing, see 3.1) goes as well.
encode_df_new = encode_df_new.drop(columns=['d','arrival','sampleid','historyvisit_totalordernum','firstorder_bu','decisionhabit_user','historyvisit_7ordernum'],axis=1)
encode_df_new.shape
4. Model Training
4.1 Train/validation split
To keep the training and validation sets identically distributed, and to keep one user's orders on the same side of the split, the rows are sorted by user_tag; the first 70% become the training set and the rest the validation set.
ss_df_new = encode_df_new
num = ss_df_new.shape[0]
df_sort = ss_df_new.sort_values(by=['user_tag'],ascending=True)
train_df = df_sort.iloc[:int(num*0.7),:]
test_df = df_sort.iloc[int(num*0.7):,:]
train_y = train_df['label']
train_x = train_df.iloc[:,1:]
test_y = test_df['label']
test_x = test_df.iloc[:,1:]
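A quick sanity check of ours: because the rows were sorted by user_tag before the cut, one user's orders should not straddle the two sets; at most the single tag sitting on the 70% boundary can appear in both.
overlap = set(train_df['user_tag']) & set(test_df['user_tag'])
print(len(overlap))  # expect 0 or 1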
4.2 Comparing model performance
All models were tuned with GridSearchCV grid search, following the pattern sketched below.
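A representative sketch of that tuning pattern (this particular grid is illustrative, not the exact one used):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [30, 40, 50]}
search = GridSearchCV(RandomForestClassifier(), param_grid, scoring='roc_auc', cv=5)
search.fit(train_x, train_y)
print(search.best_params_, search.best_score_)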
- Decision tree
from sklearn.tree import DecisionTreeClassifier
bdt = DecisionTreeClassifier(random_state=0,max_depth=30, min_samples_split=70)
bdt.fit(train_x,train_y)
predict_train = bdt.predict_proba(train_x)[:,1]
predict_test = bdt.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('max recall when precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC score: {}'.format(auc_test))
Output:
0.0
0.8340018840954033
- Random forest
# tuned parameters:
# n_estimators
# max_depth
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=300,max_depth=50)
rf.fit(train_x,train_y)
predict_train = rf.predict_proba(train_x)[:,1]
predict_test = rf.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('max recall when precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC score: {}'.format(auc_test))
Output:
0.666135416301797
0.9616117844760916
- AdaBoost
from sklearn.ensemble import AdaBoostClassifier
bdt = AdaBoostClassifier(algorithm="SAMME",
n_estimators=600, learning_rate=1)
bdt.fit(train_x,train_y)
predict_train = bdt.predict_proba(train_x)[:,1]
predict_test = bdt.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('max recall when precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC score: {}'.format(auc_test))
Output:
0.00019265123121650496
0.7300356696791559
- GBDT
# tuned parameters:
# n_estimators
# max_depth and min_samples_split
# min_samples_split and min_samples_leaf
# max_features
# subsample
# learning_rate, adjusted together with n_estimators
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
# final parameter settings
gbc = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
gbc.fit(train_x,train_y)
predict_train = gbc.predict_proba(train_x)[:,1]
predict_test = gbc.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('after tuning: max recall when precision>=0.97 on the validation set:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC score: {}'.format(auc_test))
Output:
0.15988300816140671
0.8808204850185188
- XGBoost
# tuned parameters:
# number of estimators n_estimators
# min_child_weight and max_depth
# gamma
# subsample and colsample_bytree
# learning_rate, adjusted together with n_estimators
from xgboost.sklearn import XGBClassifier
xgbc = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1, scale_pos_weight=1, seed=27,
subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
xgbc.fit(train_x,train_y)
predict_train = xgbc.predict_proba(train_x)[:,1]
predict_test = xgbc.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('max recall when precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC score: {}'.format(auc_test))
Output:
0.7640022417597814
0.9754939563495324
- By the results above, XGBoost trains best: when precision >= 0.97, recall reaches up to 76.4%.
5淤堵、模型融合
后面也嘗試了模型堆疊的方法,看是否能得到更好的效果顷扩,首先利用上述提到的各個(gè)模型拐邪,根據(jù)特征重要性選取了57個(gè)特征,然后利用KFold方法進(jìn)行5折交叉驗(yàn)證隘截,得到五種模型的驗(yàn)證集和測(cè)試集結(jié)果扎阶,分別作為第二層的訓(xùn)練數(shù)據(jù)集和測(cè)試數(shù)據(jù)集,并用邏輯回歸模型來(lái)訓(xùn)練這五個(gè)特征婶芭,最終得到的結(jié)果是當(dāng)precision>=0.97時(shí)东臀,recall最大能達(dá)到78.3%,比原來(lái)的76.4%稍有提高犀农。
- Selecting important features
# feature selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
def get_top_n_features(train_x, train_y):
    # random forest
    rf_est = RandomForestClassifier(n_estimators=300,max_depth=50)
    rf_est.fit(train_x, train_y)
    feature_imp_sorted_rf = pd.DataFrame({'feature': train_x.columns,
                                          'importance': rf_est.feature_importances_}).sort_values('importance', ascending=False)
    # AdaBoost
    ada_est = AdaBoostClassifier(n_estimators=600,learning_rate=1)
    ada_est.fit(train_x, train_y)
    feature_imp_sorted_ada = pd.DataFrame({'feature': train_x.columns,
                                           'importance': ada_est.feature_importances_}).sort_values('importance', ascending=False)
    # GradientBoosting
    gb_est = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
                                        min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
    gb_est.fit(train_x, train_y)
    feature_imp_sorted_gb = pd.DataFrame({'feature':train_x.columns,
                                          'importance': gb_est.feature_importances_}).sort_values('importance', ascending=False)
    # DecisionTree
    dt_est = DecisionTreeClassifier(random_state=0,min_samples_split=70,max_depth=30)
    dt_est.fit(train_x, train_y)
    feature_imp_sorted_dt = pd.DataFrame({'feature':train_x.columns,
                                          'importance': dt_est.feature_importances_}).sort_values('importance', ascending=False)
    # XGBoost
    xg_est = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1, scale_pos_weight=1, seed=27,
                           subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
    xg_est.fit(train_x, train_y)
    feature_imp_sorted_xg = pd.DataFrame({'feature':train_x.columns,
                                          'importance': xg_est.feature_importances_}).sort_values('importance', ascending=False)
    return feature_imp_sorted_rf,feature_imp_sorted_ada,feature_imp_sorted_gb,feature_imp_sorted_dt,feature_imp_sorted_xg
feature_imp_sorted_rf,feature_imp_sorted_ada,feature_imp_sorted_gb,feature_imp_sorted_dt,feature_imp_sorted_xg = get_top_n_features(train_x, train_y)
top_n_features = 35
features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']
features_top_n_gb = feature_imp_sorted_gb.head(top_n_features)['feature']
features_top_n_dt = feature_imp_sorted_dt.head(top_n_features)['feature']
features_top_n_xg = feature_imp_sorted_xg.head(top_n_features)['feature']
features_top_n = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_gb, features_top_n_dt,features_top_n_xg],
ignore_index=True).drop_duplicates()
features_importance = pd.concat([feature_imp_sorted_rf, feature_imp_sorted_ada,
feature_imp_sorted_gb, feature_imp_sorted_dt,feature_imp_sorted_xg],ignore_index=True)
train_x_new = pd.DataFrame(train_x[features_top_n])
test_x_new = pd.DataFrame(test_x[features_top_n])
features_top_n
In the end, 57 of the 79 features were kept (top 35 per model, union, deduplicated).
- First-layer training
# first layer
from sklearn.model_selection import KFold
ntrain = train_x_new.shape[0]
ntest = test_x_new.shape[0]
kf = KFold(n_splits=5, shuffle=False)  # random_state dropped: it has no effect (and newer sklearn raises) when shuffle=False
def get_out_fold(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((5, ntest))
    oof_train_prob = np.zeros((ntrain,))
    oof_test_prob = np.zeros((ntest,))
    oof_test_skf_prob = np.empty((5, ntest))
    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]
        clf.fit(x_tr, y_tr)
        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)
        oof_train_prob[test_index] = clf.predict_proba(x_te)[:,1]
        oof_test_skf_prob[i, :] = clf.predict_proba(x_test)[:,1]
        print('fold {}'.format(i))
        print('train indices:')
        print(train_index)
        print('validation indices:')
        print(test_index)
    oof_test[:] = oof_test_skf.mean(axis=0)
    oof_test_prob[:] = oof_test_skf_prob.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1),oof_train_prob.reshape(-1, 1), oof_test_prob.reshape(-1, 1)
rf = RandomForestClassifier(n_estimators=300,max_depth=50)
ada = AdaBoostClassifier(n_estimators=600,learning_rate=1)
gb = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
dt = DecisionTreeClassifier(random_state=0,min_samples_split=70,max_depth=30)
x_train = train_x_new.values
x_test = test_x_new.values
y_train =train_y.values
rf_oof_train, rf_oof_test,rf_oof_train_prob, rf_oof_test_prob = get_out_fold(rf, x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test,ada_oof_train_prob, ada_oof_test_prob = get_out_fold(ada, x_train, y_train, x_test) # AdaBoost
gb_oof_train, gb_oof_test,gb_oof_train_prob, gb_oof_test_prob = get_out_fold(gb, x_train, y_train, x_test) # Gradient Boost
dt_oof_train, dt_oof_test,dt_oof_train_prob, dt_oof_test_prob = get_out_fold(dt, x_train, y_train, x_test) # Decision Tree
xgbc = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1, scale_pos_weight=1, seed=27,
subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
xgbc_oof_train, xgbc_oof_test,xgbc_oof_train_prob, xgbc_oof_test_prob = get_out_fold(xgbc, x_train, y_train, x_test) # XGBClassifier
print("Training is complete")
- Second-layer training
The first layer's out-of-fold outputs become the second layer's training and test sets.
# build the second-layer training and test sets
# note: as written, only the four tree models' probabilities are stacked here;
# the XGBoost out-of-fold probabilities computed above are not included
train_x2_prob = pd.DataFrame(np.concatenate((rf_oof_train_prob, ada_oof_train_prob, gb_oof_train_prob, dt_oof_train_prob), axis=1),columns=['rf_prob','ada_prob','gb_prob','dt_prob'])
test_x2_prob = pd.DataFrame(np.concatenate((rf_oof_test_prob, ada_oof_test_prob, gb_oof_test_prob, dt_oof_test_prob), axis=1),columns=['rf_prob','ada_prob','gb_prob','dt_prob'])
# train the logistic regression
from sklearn.linear_model import LogisticRegression
# tuning
# param_rf4 = {'penalty': ['l1','l2'],'C':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]}
# rf_est4 = LogisticRegression()
# rfsearch4 = GridSearchCV(estimator=rf_est4,param_grid=param_rf4,scoring='roc_auc',iid=False,cv=5)
# rfsearch4.fit(train_x2_prob,train_y)
# print('mean score per parameter value: {}'.format(rfsearch4.cv_results_['mean_test_score']))
# print('best parameters: {}'.format(rfsearch4.best_params_))
# print('best roc_auc score: {}'.format(rfsearch4.best_score_))
# tuning result: C=0.1, penalty='l2'
lr = LogisticRegression(C=0.1,penalty='l2')
lr.fit(train_x2_prob,train_y)
predict_train = lr.predict_proba(train_x2_prob)[:,1]
predict_test = lr.predict_proba(test_x2_prob)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('max recall when precision>=0.97:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('AUC score: {}'.format(auc_test))
Output:
0.7832498511331395
0.9763271659779821
Stacking lifts the recall from 76.4% to 78.3%.