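Ramblings

It has been three months since I decided to switch careers into data analysis. In that time I have learned SQL, Python, and PowerBI, and picked up the basic business logic of e-commerce and content platforms along with some simple machine learning algorithms. It is time to build up some hands-on experience.
I have done three projects so far: 1. Douban movie data analysis, with a focus on web scraping + matplotlib visualization + pandas operations; 2. CD website user behavior analysis, which mainly follows the approach of Qin Lu (秦路), with some of my own understanding added on top of the exercise plus an RFM model to segment users, with a focus on business understanding + pandas operations; 3. Titanic survivor prediction, where the main work was reproducing an open-source Kaggle project and learning the basic data mining workflow: analysis, feature engineering, modeling, parameter tuning, model ensembling, and model evaluation.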

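Project overview

Today's project also comes from Kaggle: a telecom operator's customer churn dataset.
The analysis centers on reducing the operator's churn rate. It looks at customers' personal profiles, service attributes, and contract information to find the key factors behind churn, builds a classification model for churn, and proposes early-warning and win-back strategies for customers at risk of churning.

Analysis approach

Part 1: Data preprocessing

Import the data, convert types, handle anomalous values

Part 2: Analysis from the churn-rate perspective

How customers' personal profiles, service attributes, and contract information affect the churn rate

Part 3: Analysis from the customer-value perspective

Distribution of customer charges, distribution of cumulative charges, customer lifetime value (LTV)

Part 4: Predicting churn with classification models

Feature engineering, model selection (single models, model ensembling), model evaluation

1. Data preprocessing

1.1 Importing the data and libraries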

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
import warnings
Filename='WA_Fn-UseC_-Telco-Customer-Churn.csv'
%matplotlib inline
Telco_Data_Origin=pd.read_csv(Filename)
Telco_Data=Telco_Data_Origin.copy()
Telco_Data.info()

Figure 1. Field overview

Looking at the fields, they fall into four groups:

Churn status: ChurnKey=['customerID','Churn']
Customer profile: CustomerAttributes=['gender','SeniorCitizen','Partner','Dependents','tenure']
Service attributes: ServiceAttributes=['PhoneService','MultipleLines','InternetService','OnlineSecurity','OnlineBackup', 'DeviceProtection','TechSupport','StreamingTV','StreamingMovies']
Contract information: ContractAttributes=['Contract','PaperlessBilling','PaymentMethod','MonthlyCharges','TotalCharges']

1.2 Type conversion

# Convert string columns to numeric
for i in ['Churn','Partner','Dependents','PhoneService','PaperlessBilling']:
    Telco_Data[i]=Telco_Data_Origin[i].apply(lambda x:1 if x=='Yes' else 0)
# Add-on services: 'Yes' -> 1, 'No' -> 0, anything else (service not available) -> NaN
for i in ['MultipleLines','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']:
    Telco_Data[i]=Telco_Data_Origin[i].apply(lambda x:1 if x=='Yes' else (0 if x=='No' else np.nan))
Telco_Data['gender']=Telco_Data_Origin['gender'].apply(lambda x:1 if x=='Male' else 0)
# Convert TotalCharges to numeric (convert_objects was removed from pandas; pd.to_numeric replaces it)
Telco_Data['TotalCharges']=pd.to_numeric(Telco_Data_Origin['TotalCharges'],errors='coerce')
# Fill the missing TotalCharges values with the same row's MonthlyCharges
Telco_Data['TotalCharges']=Telco_Data['TotalCharges'].fillna(Telco_Data['MonthlyCharges'])

The type conversion mainly turns string data into numeric data. Note that TotalCharges is missing for 11 customers. On inspection, all 11 have tenure=0: when the data was collected they had just signed up and had used the service for less than a month, so their TotalCharges is filled with MonthlyCharges.
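
The fill step can be sketched on a toy frame. This is only an illustration under assumptions: the three rows are invented, and the missing TotalCharges entries are assumed to be blank strings in the raw file, which `pd.to_numeric(..., errors='coerce')` turns into NaN.

```python
import pandas as pd

# Toy stand-in for the raw data; the values are invented for illustration
toy = pd.DataFrame({
    'tenure': [0, 2, 10],
    'MonthlyCharges': [52.55, 20.0, 70.0],
    'TotalCharges': [' ', '40.0', '700.0'],  # the raw column is read as strings
})

# Entries that cannot be parsed (here the blank string) become NaN
toy['TotalCharges'] = pd.to_numeric(toy['TotalCharges'], errors='coerce')
n_missing = int(toy['TotalCharges'].isnull().sum())

# Fill each gap with the same row's MonthlyCharges
toy['TotalCharges'] = toy['TotalCharges'].fillna(toy['MonthlyCharges'])
print(n_missing, toy['TotalCharges'].tolist())
```

Filling with one month of charges treats a brand-new customer as having paid for a single month, which is the choice described above.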

2. Churn rate analysis

2.1 Descriptive analysis

Telco_Data.describe().loc['mean']

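Share per field

Only the means are shown here. For the binary fields the mean is the share of positive customers; for tenure, MonthlyCharges, and TotalCharges it is the average tenure, average monthly charge, and average total charges per customer.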
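
Why the mean of a 0/1 column is a share can be seen in a tiny sketch (the five flags below are made up):

```python
import pandas as pd

# Toy 0/1 column: 1 marks a positive customer (for example, churned)
flags = pd.Series([1, 0, 0, 1, 0])

share_by_mean = flags.mean()                        # mean of the 0/1 values
share_by_count = (flags == 1).sum() / len(flags)    # positives divided by total
print(share_by_mean, share_by_count)
```

Both expressions give the same number, which is why `describe().loc['mean']` can be read as a share for the binary fields.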

2.2 Churn ratio

# Font setup (SimHei is only needed when the labels contain Chinese text)
plt.style.use('ggplot')
fontsize=18
plt.rcParams['font.sans-serif']=['SimHei'] 
plt.rcParams.update({'font.size': fontsize})
# Plot
fig=plt.figure()
ax=fig.add_subplot(1,1,1)
# Sort by label so that position 0 is Churn=0 and position 1 is Churn=1
Churn_Counts=Telco_Data['Churn'].value_counts().sort_index()
Churn_Counts.plot.bar(ax=ax)
# Annotate each bar with its share of all customers
Total=Churn_Counts.sum()
for pos,cnt in enumerate(Churn_Counts):
    ax.text(x=pos-0.12,y=cnt+50,s='%.2f %%' % (cnt/Total*100),fontsize=fontsize-3)
# Axes
ax.set_ylim([0,6000])
ax.set_xticklabels(['Retained','Churned'],rotation=0)
ax.set_title('Churn status')
plt.show()

Churn ratio

The bar chart shows a churn rate of 26.54%, so the dataset is imbalanced.

2.3 Customer profile analysis

Below are bar charts of the distribution by gender, age, partner, and dependents.

# Pivot: customer counts by gender and churn
Sex_Churn=pd.pivot_table(data=Telco_Data,values='customerID',index='gender',columns='Churn',aggfunc='count')
# Pivot: customer counts by SeniorCitizen and churn
Senior_Churn=pd.pivot_table(data=Telco_Data,values='customerID',index='SeniorCitizen',columns='Churn',aggfunc='count')
# Pivot: customer counts by Partner and churn
Partner_Churn=pd.pivot_table(data=Telco_Data,values='customerID',index='Partner',columns='Churn',aggfunc='count')
# Pivot: customer counts by Dependents and churn
Dependents_Churn=pd.pivot_table(data=Telco_Data,values='customerID',index='Dependents',columns='Churn',aggfunc='count')
# Gender and senior status
fig,ax=plt.subplots(1,2,figsize=(15, 6))

Pivot=[Sex_Churn,Senior_Churn]
Title=['Churn by gender','Churn by senior status']
Label=[['Female','Male'],['Non-senior','Senior']]
lim=[[0,3500],[0,5000]]
fontsize=15

for i in range(2):
    Pivot[i].plot.bar(title=Title[i],ax=ax[i])
    # Label each bar with its share within its group (row): retained on the left, churned on the right
    for r in range(2):
        for c,align in enumerate(['right','left']):
            ax[i].text(x=r,y=Pivot[i].iloc[r,c]+30,s='%.2f %%' % (Pivot[i].iloc[r,c]/Pivot[i].iloc[r,:].sum()*100),fontsize=fontsize,horizontalalignment=align)
    ax[i].set_xticklabels(Label[i],rotation=0)
    ax[i].set_ylim(lim[i])
    ax[i].legend(['Retained','Churned'],fontsize=16,loc='upper right')

Distribution by gender and age

# Partner and dependents
fig,ax=plt.subplots(1,2,figsize=(15, 6))

Pivot=[Partner_Churn,Dependents_Churn]
Title=['Churn by partner status','Churn by dependents status']
Label=[['No','Yes'],['No','Yes']]
lim=[[0,3500],[0,3800]]
fontsize=15

for i in range(2):
    Pivot[i].plot.bar(title=Title[i],ax=ax[i])
    # Label each bar with its share within its group (row): retained on the left, churned on the right
    for r in range(2):
        for c,align in enumerate(['right','left']):
            ax[i].text(x=r,y=Pivot[i].iloc[r,c]+30,s='%.2f %%' % (Pivot[i].iloc[r,c]/Pivot[i].iloc[r,:].sum()*100),fontsize=fontsize,horizontalalignment=align)

    ax[i].set_xticklabels(Label[i],rotation=0)
    ax[i].set_ylim(lim[i])
    ax[i].legend(['Retained','Churned'],fontsize=16,loc='upper right')

Partner and dependents

# Tenure in months
fig,ax=plt.subplots(1,2,figsize=(15, 6))
Telco_Data[Telco_Data.Churn==0].tenure.hist(bins=20,ax=ax[0],density=True,color='#054E9F',alpha=0.6,label='Histogram')
# fill=True replaces the deprecated shade=True in recent seaborn releases
sns.kdeplot(Telco_Data[Telco_Data.Churn==0].tenure,fill=True,color='Red',label='kde',ax=ax[0])
ax[0].set_xlim([-5,75]);ax[0].set_title('Tenure distribution in months (retained)')

Telco_Data[Telco_Data.Churn==1].tenure.hist(bins=20,ax=ax[1],density=True,color='#054E9F',alpha=0.6,label='Histogram')
sns.kdeplot(Telco_Data[Telco_Data.Churn==1].tenure,fill=True,color='Red',label='kde',ax=ax[1])
ax[1].set_xlim([-5,75]);ax[1].set_title('Tenure distribution in months (churned)')

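Tenure distribution

On customer profile, we can see that:

1. Gender makes no significant difference to churn: the churn rate of each gender is almost the same as the overall churn rate.
2. Younger customers outnumber senior customers, at 84% versus 16% of the total. Senior customers churn more readily: about 41.7% of them churned.
3. Customers with a partner or dependents churn less than customers without.
4. On tenure, retained customers cluster at both ends, with high shares of both newly signed-up and long-term customers. Churned customers are more concentrated: churn mostly happens within the first six months after sign-up, after which the churn share levels off. This fits the nature of telecom services, since people generally do not switch operators often.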
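
The per-group rates quoted in the list can be computed directly, without reading them off a chart. A minimal sketch on invented rows, using the same column names as the dataset:

```python
import pandas as pd

# Toy rows; SeniorCitizen and Churn are 0/1 flags as in the converted data
toy = pd.DataFrame({
    'SeniorCitizen': [0, 0, 0, 0, 1, 1],
    'Churn':         [0, 0, 0, 1, 1, 0],
})

# Churn rate inside each group = mean of the 0/1 Churn column
rate_by_group = toy.groupby('SeniorCitizen')['Churn'].mean()

# Share of each group in the whole customer base
group_share = toy['SeniorCitizen'].value_counts(normalize=True).sort_index()
print(rate_by_group)
print(group_share)
```

Swapping `'SeniorCitizen'` for `'gender'`, `'Partner'`, or `'Dependents'` gives the other comparisons.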

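2.4 Service attribute analysis

2.4.1 Phone and internet service

The pivoting and plotting steps are much the same as above and are not repeated here. The result:

Internet and phone service

The vast majority of customers have phone service; only 10% do not. The churn rate is around 25% with or without phone service, against an overall churn rate of 26.54%, so phone service has no significant effect on churn.
About 20% of customers have no internet service, and their churn rate, about 7.4%, is far lower than that of customers who do. Among customers with internet service, those on fiber optic churn at a high rate, about 42%. This suggests that internet service may be a major driver of churn.

2.4.2 Add-on services

Effect of add-on services

In the figure, item 1 is the add-on for phone service and items 2-7 are add-ons for internet service. Each bar is annotated with the share of churned or retained customers within that service. We can see that:

For phone service, customers with the MultipleLines add-on churn at 28.61%, against 26.71% for customers with phone service overall, which is not a significant difference.
For internet service, protection-type add-ons (OnlineSecurity, TechSupport, OnlineBackup, DeviceProtection) have lower churn rates than entertainment-type add-ons (StreamingTV, StreamingMovies).

Next, look at how each internet add-on affects the churn rate under the two connection types, DSL and Fiber Optic.

Effect of each internet add-on

Under every add-on, fiber optic customers churn more than DSL customers. This suggests that the operator's fiber optic service itself is causing dissatisfaction and heavy churn, rather than any single add-on under fiber optic.
In addition, protection-type add-ons do lower the churn rate.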
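
The two-way comparison (connection type against one add-on) can be sketched with a pivot table. The eight rows below are invented and only illustrate the shape of the computation, not the real rates:

```python
import pandas as pd

# Toy data: connection type, one add-on flag, and the churn flag
toy = pd.DataFrame({
    'InternetService': ['DSL', 'DSL', 'DSL', 'DSL',
                        'Fiber optic', 'Fiber optic', 'Fiber optic', 'Fiber optic'],
    'OnlineSecurity':  [1, 1, 0, 0, 1, 1, 0, 0],
    'Churn':           [0, 0, 0, 1, 0, 1, 1, 1],
})

# Rows: connection type; columns: add-on off (0) / on (1); cells: churn rate
rate_table = pd.pivot_table(toy, values='Churn', index='InternetService',
                            columns='OnlineSecurity', aggfunc='mean')
print(rate_table)
```

Reading across a row shows the effect of the add-on within one connection type; reading down a column compares DSL with fiber optic at the same add-on setting.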

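2.5 Contract information analysis

2.5.1 Payment method analysis

Effect of contract term

Customers on month-to-month plans churn more than customers who bought one-year or two-year plans, so encouraging customers to take long-term plans helps retention.

Effect of payment method
Payment method has a large effect on churn: customers who pay by electronic check churn at 45.29%.
Whether billing is paperless also has a fairly large effect: customers who receive paper bills churn less than customers on paperless billing.
Together, the two charts show that electronic payment comes with a higher churn rate than paper-based payment. The likely cause lies with the customers themselves rather than with the payment method.

2.5.2 Charge amount analysis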

# Distribution of total charges and monthly charges
fig,ax=plt.subplots(1,2,figsize=(15, 6))
# showmeans/showmedians are matplotlib violinplot arguments, not seaborn ones; seaborn's default inner box already marks the median
sns.violinplot(x='Churn',y='TotalCharges',data=Telco_Data,ax=ax[0])
ax[0].set_title('Total charges')
ax[0].set_xticklabels(['Retained','Churned'],rotation=0)
sns.violinplot(x='Churn',y='MonthlyCharges',data=Telco_Data,ax=ax[1])
ax[1].set_title('Monthly charges')
ax[1].set_xticklabels(['Retained','Churned'],rotation=0)

Violin plots of total charges and monthly charges

fig,ax=plt.subplots(1,2,figsize=(15, 6))
qparts=20
# Total charges: churn rate within each of 20 equal-width bins (the mean of the 0/1 Churn column is the rate)
Telco_Data_Cut=pd.cut(Telco_Data.TotalCharges,bins=qparts)
Telco_Data_Cut=pd.concat([Telco_Data_Cut,Telco_Data.Churn],axis=1)
Telco_Data_Cut.groupby('TotalCharges',observed=False).mean().plot.bar(ax=ax[0])
# Monthly charges
Telco_Data_Cut=pd.cut(Telco_Data.MonthlyCharges,bins=qparts)
Telco_Data_Cut=pd.concat([Telco_Data_Cut,Telco_Data.Churn],axis=1)
Telco_Data_Cut.groupby('MonthlyCharges',observed=False).mean().plot.bar(ax=ax[1])

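Churn rate by charge bracket

For monthly charges, churned customers cluster in three regions: (a) around 20; (b) 45-55; (c) 75-100. The two ranges with the highest churn rates are 28-48 and 68-109.
For total charges, the number of customers, churned or not, falls off steadily as total charges rise. The churn rate likewise trends downward as total charges rise.

2.6 Summary

This part analyzed customer churn. The overall churn rate is 26.54%. Looking at it from three angles, (1) customer profile, (2) service attributes, and (3) contract information (payment method and charge amount), we find:

On customer profile, age and having a partner or dependents have a large effect on churn: senior customers and customers without a partner or dependents are the high-churn groups.
On service attributes, customers on fiber optic internet churn more readily. The add-ons under internet service also matter: protection-type add-ons lower the churn rate, while entertainment-type add-ons go with a higher churn rate, possibly because they fall short of customers' expectations.
On payment method, customers on long-term contracts are less likely to churn, with retention ordered two-year > one-year > month-to-month. Customers who pay electronically and use paperless billing churn more readily.
On charge amount, looking at monthly charges, churned customers cluster in three regions: (a) around 20; (b) 45-55; (c) 75-100. The two ranges with the highest churn rates are 28-48 and 68-109.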

3. Customer value analysis

3.1 Distribution of customer charges

fig,ax=plt.subplots(1,2,figsize=(15, 6))
ax[0].scatter(x=Telco_Data.tenure,y=Telco_Data.TotalCharges)
ax[0].set_xlabel('Tenure (months)');ax[0].set_ylabel('Total charges')
# Treat tenure=0 as one month to avoid dividing by zero (those rows had TotalCharges filled with MonthlyCharges)
ax[1].scatter(x=Telco_Data.MonthlyCharges,y=Telco_Data.TotalCharges/Telco_Data.tenure.clip(lower=1))
ax[1].set_xlabel('Monthly charges');ax[1].set_ylabel('Total charges / tenure')

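Distribution of customer charges

Total charges averaged per month fall almost on a straight line against monthly charges, which shows that each customer's plan price has not changed much over time.

3.2 Distribution of cumulative charges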

fig,ax=plt.subplots(1,2,figsize=(15, 6))
# Cumulative share of monthly charges, customers sorted from lowest to highest
ax[0].set_xlabel('Customer rank');ax[0].set_ylabel('Cumulative share of monthly charges')
Monthly_Cum=Telco_Data.MonthlyCharges.sort_values(ascending=True).cumsum()
(Monthly_Cum/Monthly_Cum.max()).reset_index(drop=True).plot(ax=ax[0])

# Cumulative share of total charges (the original expression had unbalanced parentheses)
ax[1].set_xlabel('Customer rank');ax[1].set_ylabel('Cumulative share of total charges')
Total_Cum=Telco_Data.TotalCharges.sort_values(ascending=True).cumsum()
(Total_Cum/Total_Cum.max()).reset_index(drop=True).plot(ax=ax[1])

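Cumulative charges

Sorting customers by monthly charges and by total charges and plotting the cumulative curves shows that:

The total-charges curve changes more sharply than the monthly-charges curve: it is flatter early on and rises more steeply toward the end.
On monthly charges the curve changes little: the top 1000 customers by charges (about 14%) contribute roughly 23% of revenue. On total charges that figure reaches 40%. Customer value therefore tends to come from long-term, steady payment, which fits the telecom industry.
It also suggests that reducing churn pays off more than acquiring new customers or steering customers toward higher spending (plan upgrades).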
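
The "top N customers contribute X% of revenue" figures can be computed with a small helper. This is a sketch on invented numbers; `top_share` is a hypothetical helper name, not something from the original code:

```python
import pandas as pd

def top_share(charges, k):
    """Share of the column total contributed by the k largest values."""
    ordered = charges.sort_values(ascending=False)
    return ordered.head(k).sum() / ordered.sum()

# Toy charges for four customers (invented values)
monthly = pd.Series([20.0, 30.0, 50.0, 100.0])
total = pd.Series([20.0, 60.0, 500.0, 3420.0])

monthly_top1 = top_share(monthly, 1)   # top customer's share of monthly charges
total_top1 = top_share(total, 1)       # top customer's share of total charges
print(monthly_top1, total_top1)
```

On the real data the same call with `k=1000` on `Telco_Data.MonthlyCharges` and `Telco_Data.TotalCharges` would give the two shares compared in the text.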

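3.3 Customer lifetime value (LTV) analysis

Customers are grouped by tenure in months (tenure=0 is folded into 1). For each group we compute the churn rate, the mean monthly charge (ARPU), and the mean historical total charges. Remaining customer value = ARPU / churn rate.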
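
The code in this section divides the mean monthly charge by the churn rate. One way to read that ratio, under an assumption the original does not state, is a constant per-period churn rate c: the expected number of future payments is then 1 + (1-c) + (1-c)^2 + ..., which sums to 1/c. The sketch below checks the closed form numerically; `remaining_value` and the two input numbers are invented for illustration.

```python
def remaining_value(arpu, churn_rate, periods=10_000):
    """Sum arpu * (1 - churn_rate)**k over many periods (assumes constant churn)."""
    survive = 1.0   # probability the customer is still active
    total = 0.0
    for _ in range(periods):
        total += arpu * survive
        survive *= 1.0 - churn_rate
    return total

arpu, churn_rate = 60.0, 0.25
closed_form = arpu / churn_rate          # ARPU / churn rate
numeric = remaining_value(arpu, churn_rate)
print(closed_form, numeric)
```

Note that the churn rate used in the grouped table is the share of churned customers in each tenure group, not a measured monthly rate, so the result is best read as a relative indicator.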

Telco_Data_LTV=Telco_Data.copy()
# Fold tenure=0 customers into the tenure=1 group
IDXtelco=Telco_Data_LTV.loc[Telco_Data_LTV.tenure==0].index
Telco_Data_LTV.loc[IDXtelco,'tenure']=1
Telco_Data_LTV=Telco_Data_LTV.groupby(by='tenure').agg({'Churn':'mean','customerID':'count','MonthlyCharges':'mean','TotalCharges':'mean'})
Telco_Data_LTV.columns=['ChurnRate','CustomerNum','MonthlyChargesAvg','TotalChargesAvg']
# Remaining value = ARPU / churn rate (a group with no churned customers would give inf)
Telco_Data_LTV['RemaingCharges']=Telco_Data_LTV['MonthlyChargesAvg']/Telco_Data_LTV['ChurnRate']
fig,ax=plt.subplots(2,2,figsize=(15,15))
Telco_Data_LTV.ChurnRate.plot(ax=ax[0,0],title='Churn rate (ChurnRate)')
Telco_Data_LTV.MonthlyChargesAvg.plot(ax=ax[0,1],title='Mean monthly charges (MonthlyChargesAvg)')
Telco_Data_LTV.TotalChargesAvg.plot(ax=ax[1,0],title='Mean total charges (TotalChargesAvg)')
Telco_Data_LTV.RemaingCharges.plot(ax=ax[1,1],title='Mean remaining value (RemaingCharges)')

Churn rate and customer value versus tenure

# LTV = TotalChargesAvg + RemaingCharges
Telco_Data_LTV['LTV']=Telco_Data_LTV['TotalChargesAvg']+Telco_Data_LTV['RemaingCharges']
Telco_Data_LTV['LTV'].plot(title='Customer lifetime value (LTV)',figsize=(7,5))
# Columns: ChurnRate (churn rate), CustomerNum (customer count), MonthlyChargesAvg (mean monthly charges), TotalChargesAvg (mean total charges), RemaingCharges (mean remaining value)

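Customer lifetime value

The longer the tenure, the higher the customer lifetime value, which matches intuition. Note in particular that customers with a tenure of 72 months have an especially high lifetime value because their churn rate is extremely low; they can be regarded as the operator's core customers.
This shows that every long-term customer brings the operator steady, ongoing revenue. Telecom operators should pay more attention to churn and to cultivating long-term customers. If churn can be predicted and targeted measures taken for the customers at risk, the revenue gain will be lasting.

4. Retention prediction

4.1 Feature engineering

4.1.1 Feature extraction and encoding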

# One-Hot encode the discrete features
ConvertFeatures=['MultipleLines','OnlineSecurity','DeviceProtection','InternetService','OnlineBackup',
                 'TechSupport','StreamingTV','StreamingMovies','Contract','PaymentMethod']
Telco_DT=Telco_Data.copy()
for i in ConvertFeatures:
    Telco_DT[i]=pd.factorize(Telco_DT[i])[0]
    Dummy=pd.get_dummies(Telco_DT[i],prefix=i)
    Telco_DT=pd.concat([Telco_DT,Dummy],axis=1)

# Standardize the continuous features
from sklearn import preprocessing 
ConvertNumericalFeatures=['tenure','MonthlyCharges','TotalCharges']
scaler = preprocessing.StandardScaler().fit(Telco_DT[ConvertNumericalFeatures])
Telco_DT[ConvertNumericalFeatures]=scaler.transform(Telco_DT[ConvertNumericalFeatures])

4.1.2 Feature correlation analysis

colormap = plt.cm.viridis
fontsize=11
plt.rcParams['font.sans-serif']=['SimHei'] 
plt.rcParams.update({'font.size': fontsize})
plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features',y=1.05,size=15)
sns.heatmap(Telco_DT[Telco_Data.columns].drop('customerID',axis=1).astype(float).corr(),linewidths=0.1,vmax=1.0,square=True,cmap=colormap,linecolor='white',annot=True)

Correlation matrix between features

4.1.3 Data distribution across features

Features=['gender','SeniorCitizen','Partner','Dependents','tenure','PhoneService','InternetService','Contract',
          'PaperlessBilling','PaymentMethod','MonthlyCharges','TotalCharges','Churn']
# height replaces the deprecated size argument, and fill replaces shade, in recent seaborn releases
FeaturePlot = sns.pairplot(Telco_DT[Features],hue='Churn',
                                      palette = 'seismic',height=1.8,diag_kind ='kde',diag_kws=
                                      dict(fill=True),plot_kws=dict(s=10))
FeaturePlot.set(xticklabels=[])

Data distribution across features

4.2 Selecting features with different models

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import ensemble
from sklearn import model_selection
from imblearn.over_sampling import SMOTE

def get_top_n_features(train_data_X, train_data_Y, top_n_features):
    # Random Forest
    rf_est = RandomForestClassifier(random_state=0)
    rf_param_grid = {'n_estimators': [500], 'min_samples_split': [2, 3], 'max_depth': [20]}
    rf_grid = model_selection.GridSearchCV(rf_est, rf_param_grid, n_jobs=-1, cv=10, verbose=1)
    rf_grid.fit(train_data_X, train_data_Y)
    print('Top N Features Best RF Params:' + str(rf_grid.best_params_))
    print('Top N Features Best RF Score:' + str(rf_grid.best_score_))
    print('Top N Features RF Train Score:' + str(rf_grid.score(train_data_X, train_data_Y)))
    feature_imp_sorted_rf = pd.DataFrame({'feature': list(train_data_X),
                                          'importance': rf_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
    print('Sample 10 Features from RF Classifier')
    print(str(features_top_n_rf[:10]))

    # AdaBoost
    ada_est =AdaBoostClassifier(random_state=0)
    ada_param_grid = {'n_estimators': [500], 'learning_rate': [0.01, 0.1]}
    ada_grid = model_selection.GridSearchCV(ada_est, ada_param_grid, n_jobs=-1, cv=10, verbose=1)
    ada_grid.fit(train_data_X, train_data_Y)
    print('Top N Features Best Ada Params:' + str(ada_grid.best_params_))
    print('Top N Features Best Ada Score:' + str(ada_grid.best_score_))
    print('Top N Features Ada Train Score:' + str(ada_grid.score(train_data_X, train_data_Y)))
    feature_imp_sorted_ada = pd.DataFrame({'feature': list(train_data_X),
                                           'importance': ada_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']
    print('Sample 10 Feature from Ada Classifier:')
    print(str(features_top_n_ada[:10]))

    # ExtraTree
    et_est = ExtraTreesClassifier(random_state=0)
    et_param_grid = {'n_estimators': [500], 'min_samples_split': [3, 4], 'max_depth': [20]}
    et_grid = model_selection.GridSearchCV(et_est, et_param_grid, n_jobs=-1, cv=10, verbose=1)
    et_grid.fit(train_data_X, train_data_Y)
    print('Top N Features Best ET Params:' + str(et_grid.best_params_))
    print('Top N Features Best ET Score:' + str(et_grid.best_score_))
    print('Top N Features ET Train Score:' + str(et_grid.score(train_data_X, train_data_Y)))
    feature_imp_sorted_et = pd.DataFrame({'feature': list(train_data_X),
                                          'importance': et_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_et = feature_imp_sorted_et.head(top_n_features)['feature']
    print('Sample 10 Features from ET Classifier:')
    print(str(features_top_n_et[:10]))
    
    # GradientBoosting
    gb_est =GradientBoostingClassifier(random_state=0)
    gb_param_grid = {'n_estimators': [500], 'learning_rate': [0.01, 0.1], 'max_depth': [20]}
    gb_grid = model_selection.GridSearchCV(gb_est, gb_param_grid, n_jobs=-1, cv=10, verbose=1)
    gb_grid.fit(train_data_X, train_data_Y)
    print('Top N Features Best GB Params:' + str(gb_grid.best_params_))
    print('Top N Features Best GB Score:' + str(gb_grid.best_score_))
    print('Top N Features GB Train Score:' + str(gb_grid.score(train_data_X, train_data_Y)))
    feature_imp_sorted_gb = pd.DataFrame({'feature': list(train_data_X),
                                           'importance': gb_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_gb = feature_imp_sorted_gb.head(top_n_features)['feature']
    print('Sample 10 Feature from GB Classifier:')
    print(str(features_top_n_gb[:10]))
    
    # DecisionTree
    dt_est = DecisionTreeClassifier(random_state=0)
    dt_param_grid = {'min_samples_split': [2, 4], 'max_depth': [20]}
    dt_grid = model_selection.GridSearchCV(dt_est, dt_param_grid, n_jobs=-1, cv=10, verbose=1)
    dt_grid.fit(train_data_X, train_data_Y)
    print('Top N Features Best DT Params:' + str(dt_grid.best_params_))
    print('Top N Features Best DT Score:' + str(dt_grid.best_score_))
    print('Top N Features DT Train Score:' + str(dt_grid.score(train_data_X, train_data_Y)))
    feature_imp_sorted_dt = pd.DataFrame({'feature': list(train_data_X),
                                          'importance': dt_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_dt = feature_imp_sorted_dt.head(top_n_features)['feature']
    print('Sample 10 Features from DT Classifier:')
    print(str(features_top_n_dt[:10]))
    
    # merge five models
    features_top_n_5mods = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_et, features_top_n_gb, features_top_n_dt], 
                               ignore_index=True)
    
    features_importance_all = [feature_imp_sorted_rf, feature_imp_sorted_ada, feature_imp_sorted_et, 
                               feature_imp_sorted_gb, feature_imp_sorted_dt]
                                #pd.concat([feature_imp_sorted_rf, feature_imp_sorted_ada, feature_imp_sorted_et, 
                              #     feature_imp_sorted_gb, feature_imp_sorted_dt],ignore_index=True)
    
    return features_top_n_5mods , features_importance_all
# Select features
feature_to_pick = 40
TestTelcoX=Telco_DT.drop(ConvertFeatures+['customerID']+['Churn'],axis=1)
TestTelcoY=Telco_DT['Churn']
features_top_n_5mods,features_importance_all = get_top_n_features(TestTelcoX,TestTelcoY,feature_to_pick)

Five models (Random Forest, AdaBoost, Extra Tree, GBDT, and Decision Tree) are used to pick the 40 most important features from each model.

# features_top_n_5mods stacks the top N=feature_to_pick features of each of the 5 models, in model order;
# features_importance_all holds one importance table per model
# Below, show the NN=20 most important features for each algorithm
N=feature_to_pick;NN=20
RF_imp_idx=features_top_n_5mods[0:NN];RF_imp=features_importance_all[0].set_index('feature',drop=True).loc[RF_imp_idx,:]
ADA_imp_idx=features_top_n_5mods[N:N+NN];ADA_imp=features_importance_all[1].set_index('feature',drop=True).loc[ADA_imp_idx,:]
ET_imp_idx=features_top_n_5mods[N*2:N*2+NN];ET_imp=features_importance_all[2].set_index('feature',drop=True).loc[ET_imp_idx,:]
GB_imp_idx=features_top_n_5mods[N*3:N*3+NN];GB_imp=features_importance_all[3].set_index('feature',drop=True).loc[GB_imp_idx,:]
DT_imp_idx=features_top_n_5mods[N*4:N*4+NN];DT_imp=features_importance_all[4].set_index('feature',drop=True).loc[DT_imp_idx,:]

warnings.filterwarnings("ignore")
fig,ax=plt.subplots(3,2,figsize=(18,15))
RF_imp.sort_values(by='importance').plot.barh(ax=ax[0,0],title='Random Forest')
ADA_imp.sort_values(by='importance').plot.barh(ax=ax[0,1],title='Adaboost')
GB_imp.sort_values(by='importance').plot.barh(ax=ax[1,0],title='GBDT')
ET_imp.sort_values(by='importance').plot.barh(ax=ax[1,1],title='Extra Tree')
DT_imp.sort_values(by='importance').plot.barh(ax=ax[2,0],title='Decision Tree')

# Deduplicate the important features collected from the models
features_matters=features_top_n_5mods.drop_duplicates()
imp_all=pd.concat(features_importance_all)
imp_all=imp_all.groupby('feature').mean().loc[features_matters,:]
imp_all.sort_values(by='importance').tail(20).plot.barh(ax=ax[2,1],title='All')

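Feature importance ranking per model

The bar charts list the 20 most important features in each model; averaging the importances across the models gives the overall feature ranking.
Total charges, monthly charges, and tenure are the three most important features. After all, the longer a customer stays and the more they pay, the stickier they are. Contract payment method, type of internet service, and whether billing is paperless also rank fairly high, broadly in line with the earlier analysis.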
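
The averaging step is the `pd.concat(...).groupby('feature').mean()` call in the code above. A minimal sketch with two invented importance tables shows how the combined ranking comes out (the numbers are made up, and only two models are used for brevity):

```python
import pandas as pd

# Toy importance tables in the same shape as the per-model tables
rf_imp = pd.DataFrame({'feature': ['tenure', 'MonthlyCharges', 'gender'],
                       'importance': [0.5, 0.4, 0.1]})
dt_imp = pd.DataFrame({'feature': ['tenure', 'MonthlyCharges', 'gender'],
                       'importance': [0.3, 0.6, 0.1]})

# Stack the tables, then take the plain (unweighted) mean per feature
imp_all = pd.concat([rf_imp, dt_imp]).groupby('feature').mean()
ranking = imp_all.sort_values(by='importance', ascending=False).index.tolist()
print(ranking)
```

Every model counts equally here. Importances from different model families are not on a strictly comparable scale, so the combined ranking is a rough guide rather than an exact measure.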

4.3 Building the training and test sets

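# Use the top 35 features
features_matters=imp_all.sort_values(by='importance',ascending=False).index[0:35].tolist()
train_X,test_X,train_Y,test_Y=model_selection.train_test_split(Telco_DT[features_matters],Telco_DT['Churn'],test_size=0.2)
# Oversample the training set only (fit_sample was renamed fit_resample in imbalanced-learn)
over_samples=SMOTE(random_state=1234)
train_X,train_Y=over_samples.fit_resample(train_X,train_Y)

# Extract plain arrays (np.asarray works whether the resampler returns a DataFrame or an ndarray)
x_train = np.asarray(train_X,dtype=float)
x_test = np.asarray(test_X,dtype=float)
y_train = np.asarray(train_Y)
y_test = np.asarray(test_Y)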

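Because positive and negative samples are imbalanced, a model trained on the raw data favors the negative class: overall accuracy stays high, but recall is low. The SMOTE oversampling algorithm is therefore used to balance the training data after the split; the test set is left as it is.
The ratio of test set to training set is 1:4.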
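
The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbors. The sketch below is a hand-rolled NumPy illustration of that idea only; it is not the imbalanced-learn implementation, and `smote_like` and the four toy points are invented for this example.

```python
import numpy as np

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbours per sample
    base = rng.integers(0, len(minority), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                  # position along the segment, in [0, 1)
    return minority[base] + gap * (minority[pick] - minority[base])

# Four minority points at the corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(minority, n_new=6)
print(synthetic)
```

Because every synthetic row lies on a segment between two real minority rows, oversampling adds no information from outside the minority region. This is also why it is applied after the split, so that no synthetic row is derived from a test sample.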

4.4 Testing single models

Sections 4.4.1-4.4.6 below report training and test results for six models: Random Forest, AdaBoost, Decision Tree, KNN, Extra Tree, and GBDT.
For each model the main steps are: tune parameters with 10-fold cross-validation on the training set, evaluate the model on the test set by computing accuracy and recall, and plot the ROC curve and the confusion matrix.
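
Accuracy and recall answer different questions, which matters on an imbalanced test set. A small sketch with ten invented labels:

```python
import numpy as np

# Toy test set: 8 retained customers (0) and 2 churned customers (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A model that flags only one of the two churned customers
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

tp = int(((y_pred == 1) & (y_true == 1)).sum())   # churned and flagged
fn = int(((y_pred == 0) & (y_true == 1)).sum())   # churned but missed

accuracy = float((y_pred == y_true).mean())   # share of all predictions that are right
recall = tp / (tp + fn)                       # share of churned customers that are caught
print(accuracy, recall)
```

Here accuracy is high while half of the churned customers are missed, which is why recall is reported alongside accuracy for each model.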

4.4.1 Random Forest

# RF parameter tuning
rf_est = RandomForestClassifier(warm_start=True,max_features='sqrt',
                            min_samples_split=3,min_samples_leaf=2,n_jobs=-1,verbose=0)
rf_param_grid = {'n_estimators': [700], 'max_depth': [8],'min_samples_split':[10],'min_samples_leaf':[20]}
# Ranges searched:
#n_estimators:[500,600,700,800,900,1000]
#max_depth:[6,8,10,12,15,20]
#min_samples_split:range(10, 90, 20)
#min_samples_leaf:range(5, 65, 10),
rf_grid = model_selection.GridSearchCV(rf_est, rf_param_grid, n_jobs=-1, cv=10, verbose=1,scoring=None)
rf_grid.fit(x_train, y_train)
print('RandomForest best params',rf_grid.best_params_)
print('RandomForest best score',rf_grid.best_score_)

Random Forest tuning results

# Fit the RF model ('auto' was removed from max_features in recent scikit-learn; for classifiers it meant 'sqrt')
rf = RandomForestClassifier(max_depth=8, n_estimators=700,warm_start=False,max_features='sqrt',min_samples_leaf=20,
                            min_samples_split=10,n_jobs=-1,verbose=0)
rf.fit(x_train,y_train)
y_predict = rf.predict(x_test)
y_predict_proba = rf.predict_proba(x_test)[:,1]
# RST collects each model's predicted labels and AST its predicted probabilities, for the ensembling step
RST=[];AST=[];RST.append(y_predict);AST.append(y_predict_proba)
# Evaluate and visualize on the test set
from sklearn import metrics
print('Test-set accuracy:\n',metrics.accuracy_score(y_test,y_predict))
print('Test-set recall:\n',metrics.recall_score(y_test,y_predict))
fpr,tpr,threshold=metrics.roc_curve(y_test,y_predict_proba)
roc_auc=metrics.auc(fpr,tpr)
plt.stackplot(fpr,tpr,alpha=0.5,edgecolor='black',color='steelblue')
plt.plot(fpr,tpr,lw=1,color='black')
plt.plot([0,1],[0,1],color='red',linestyle='--')
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
plt.text(x=0.5,y=0.3, s="ROC curve (area=%0.2f)" % roc_auc)
plt.show()
# Confusion matrix (rows: predicted, columns: actual)
cm=pd.crosstab(y_predict,y_test)
sns.heatmap(cm,annot=True,cmap='GnBu',fmt='d')
plt.xlabel('Real')
plt.ylabel('Predict')
plt.show()

Prediction accuracy

ROC curve

Confusion matrix

4.4.2 AdaBoost

# AdaBoost parameter tuning
ada_est=AdaBoostClassifier(n_estimators=100,learning_rate=0.5)
ada_param_grid = {'n_estimators': [100,200,300,400,500,600]}
#n_estimators:[500,600,700,800,900,1000]
ada_grid = model_selection.GridSearchCV(ada_est, ada_param_grid, n_jobs=-1, cv=10, verbose=1,scoring=None)
ada_grid.fit(x_train, y_train)
print('AdaBoost best params',ada_grid.best_params_)
print('AdaBoost best score',ada_grid.best_score_)

AdaBoost tuning results

# Fit the AdaBoost model
ada=AdaBoostClassifier(n_estimators=300,learning_rate=0.1)
ada.fit(x_train,y_train)
y_predict = ada.predict(x_test)
y_predict_proba = ada.predict_proba(x_test)[:,1]
RST.append(y_predict);AST.append(y_predict_proba)
# Evaluate and visualize on the test set
from sklearn import metrics
print('Test-set accuracy:\n',metrics.accuracy_score(y_test,y_predict))
print('Test-set recall:\n',metrics.recall_score(y_test,y_predict))
fpr,tpr,threshold=metrics.roc_curve(y_test,y_predict_proba)
roc_auc=metrics.auc(fpr,tpr)
plt.stackplot(fpr,tpr,alpha=0.5,edgecolor='black',color='steelblue')
plt.plot(fpr,tpr,lw=1,color='black')
plt.plot([0,1],[0,1],color='red',linestyle='--')
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
plt.text(x=0.5,y=0.3, s="ROC curve (area=%0.2f)" % roc_auc)
plt.show()
# Confusion matrix (rows: predicted, columns: actual)
cm=pd.crosstab(y_predict,y_test)
sns.heatmap(cm,annot=True,cmap='GnBu',fmt='d')
plt.xlabel('Real')
plt.ylabel('Predict')
plt.show()

Prediction accuracy

ROC curve

Confusion matrix

4.4.3 Decision Tree

# Decision Tree parameter tuning
dt_est = DecisionTreeClassifier()
dt_param_grid = {'max_depth': [5,8,16,20]}
dt_grid = model_selection.GridSearchCV(dt_est, dt_param_grid, n_jobs=-1, cv=10,verbose=1,scoring='recall')
dt_grid.fit(x_train, y_train)
print('DecisionTree best params',dt_grid.best_params_)
print('DecisionTree best score',dt_grid.best_score_)

Decision Tree tuning results

# Fit the Decision Tree model
dt=DecisionTreeClassifier(max_depth=5)
dt.fit(x_train,y_train)
y_predict = dt.predict(x_test)
y_predict_proba = dt.predict_proba(x_test)[:,1]
RST.append(y_predict);AST.append(y_predict_proba)
# Evaluate and visualize on the test set
from sklearn import metrics
print('Test-set accuracy:\n',metrics.accuracy_score(y_test,y_predict))
print('Test-set recall:\n',metrics.recall_score(y_test,y_predict))
fpr,tpr,threshold=metrics.roc_curve(y_test,y_predict_proba)
roc_auc=metrics.auc(fpr,tpr)
plt.stackplot(fpr,tpr,alpha=0.5,edgecolor='black',color='steelblue')
plt.plot(fpr,tpr,lw=1,color='black')
plt.plot([0,1],[0,1],color='red',linestyle='--')
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
plt.text(x=0.5,y=0.3, s="ROC curve (area=%0.2f)" % roc_auc)
plt.show()
# Confusion matrix (rows: predicted, columns: actual), same steps as for the other models
cm=pd.crosstab(y_predict,y_test)
sns.heatmap(cm,annot=True,cmap='GnBu',fmt='d')
plt.xlabel('Real')
plt.ylabel('Predict')
plt.show()

Prediction accuracy

ROC curve

Confusion matrix

4.4.4 KNN

# KNN parameter tuning
knn_est = KNeighborsClassifier()
knn_param_grid = {'n_neighbors': [100,200,300]}
knn_grid = model_selection.GridSearchCV(knn_est, knn_param_grid, n_jobs=-1, cv=10,verbose=1,scoring=None)
knn_grid.fit(x_train, y_train)
print('knn best params',knn_grid.best_params_)
print('knn best score',knn_grid.best_score_)

KNN tuning results

# Fit the KNN model
knn=KNeighborsClassifier(n_neighbors=100)
knn.fit(x_train,y_train)
y_predict = knn.predict(x_test)
y_predict_proba = knn.predict_proba(x_test)[:,1]
RST.append(y_predict);AST.append(y_predict_proba)
# Evaluate and visualize on the test set
from sklearn import metrics
print('Test-set accuracy:\n',metrics.accuracy_score(y_test,y_predict))
print('Test-set recall:\n',metrics.recall_score(y_test,y_predict))
fpr,tpr,threshold=metrics.roc_curve(y_test,y_predict_proba)
roc_auc=metrics.auc(fpr,tpr)
plt.stackplot(fpr,tpr,alpha=0.5,edgecolor='black',color='steelblue')
plt.plot(fpr,tpr,lw=1,color='black')
plt.plot([0,1],[0,1],color='red',linestyle='--')
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
plt.text(x=0.5,y=0.3, s="ROC curve (area=%0.2f)" % roc_auc)
plt.show()
# Confusion matrix (rows: predicted, columns: actual)
cm=pd.crosstab(y_predict,y_test)
sns.heatmap(cm,annot=True,cmap='GnBu',fmt='d')
plt.xlabel('Real')
plt.ylabel('Predict')
plt.show()

Prediction accuracy

ROC curve

Confusion matrix

4.4.5 Extra Tree

# Extra Tree parameter tuning
et_est=ExtraTreesClassifier()
et_param_grid = {'n_estimators': [600], 'max_depth': [8],'min_samples_leaf':[5],'min_samples_split':[10]}
# Ranges searched:
#n_estimators:[500,600,700,800,900,1000]
#max_depth:[6,8,10,12,15,20]
#min_samples_split:range(10, 90, 20)
#min_samples_leaf:range(10, 60, 10),
et_grid = model_selection.GridSearchCV(et_est, et_param_grid, n_jobs=-1, cv=10,verbose=1,scoring=None)
et_grid.fit(x_train, y_train)
print('ExtraTree best params',et_grid.best_params_)
print('ExtraTree best score',et_grid.best_score_)

Extra Tree tuning results

# Fit the Extra Tree model
et=ExtraTreesClassifier(n_estimators=600,max_depth=8,min_samples_leaf=10,min_samples_split=20)
et.fit(x_train,y_train)
y_predict = et.predict(x_test)
y_predict_proba = et.predict_proba(x_test)[:,1]
RST.append(y_predict);AST.append(y_predict_proba)
# Evaluate and visualize on the test set
from sklearn import metrics
print('Test-set accuracy:\n',metrics.accuracy_score(y_test,y_predict))
print('Test-set recall:\n',metrics.recall_score(y_test,y_predict))
fpr,tpr,threshold=metrics.roc_curve(y_test,y_predict_proba)
roc_auc=metrics.auc(fpr,tpr)
plt.stackplot(fpr,tpr,alpha=0.5,edgecolor='black',color='steelblue')
plt.plot(fpr,tpr,lw=1,color='black')
plt.plot([0,1],[0,1],color='red',linestyle='--')
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
plt.text(x=0.5,y=0.3, s="ROC curve (area=%0.2f)" % roc_auc)
plt.show()
# Confusion matrix (rows: predicted, columns: actual)
cm=pd.crosstab(y_predict,y_test)
sns.heatmap(cm,annot=True,cmap='GnBu',fmt='d')
plt.xlabel('Real')
plt.ylabel('Predict')
plt.show()

Prediction accuracy

ROC curve

Confusion matrix

4.4.6 GBDT

#GBDT模型調(diào)參
gb_est=ExtraTreesClassifier()
gb_param_grid = {'n_estimators': [100], 'max_depth': [5],'min_samples_leaf':[10],'min_samples_split':[20]}
#n_estimators:[500,600,700,800,900,1000]
#max_depth:[6,8,10,12,15,20]
#min_samples_split:range(10, 90, 20)
#min_samples_leaf:range(10, 60, 10),
gb_grid = model_selection.GridSearchCV(gb_est, gb_param_grid, n_jobs=-1, cv=10,verbose=1,scoring=None)
gb_grid.fit(x_train, y_train)
print('ExtraTree 最佳參數(shù)',gb_grid.best_params_)
print('ExtraTree 最佳得分',gb_grid.best_score_)

GBDT調(diào)參結(jié)果

# Fit the GBDT model
gb = GradientBoostingClassifier(n_estimators=100,learning_rate=0.008,min_samples_split=20,min_samples_leaf=10,max_depth=5,verbose=0)
gb.fit(x_train,y_train)
y_predict = gb.predict(x_test)
y_predict_proba = gb.predict_proba(x_test)[:,1]
# Keep the hard predictions and churn probabilities for the Voting step
RST.append(y_predict)
AST.append(y_predict_proba)
# Evaluate and visualize the results on the test set
from sklearn import metrics
print('Accuracy on the test set:\n',metrics.accuracy_score(y_test,y_predict))
print('Recall on the test set:\n',metrics.recall_score(y_test,y_predict))
# ROC curve
fpr,tpr,threshold=metrics.roc_curve(y_test,y_predict_proba)
roc_auc=metrics.auc(fpr,tpr)
plt.stackplot(fpr,tpr,alpha=0.5,edgecolor='black',color='steelblue')
plt.plot(fpr,tpr,lw=1,color='black')
plt.plot([0,1],[0,1],color='red',linestyle='--')
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
plt.text(x=0.5,y=0.3, s="ROC curve (area=%0.2f)" % roc_auc)
plt.show()
# Confusion matrix (rows: predicted, columns: actual),
# i.e. the transpose of metrics.confusion_matrix(y_test,y_predict)
cm=pd.crosstab(y_predict,y_test)
sns.heatmap(cm,annot=True,cmap='GnBu',fmt='d')
plt.xlabel('Real')
plt.ylabel('Predict')
plt.show()

Model accuracy

ROC curve

Confusion matrix
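The same evaluation steps are repeated after every model. A small helper could collect the numbers in one place. This is only a sketch: `evaluate_classifier` is a hypothetical name, and the labels and scores below are made up for illustration, not project data.

```python
import numpy as np
from sklearn import metrics

def evaluate_classifier(y_true, y_pred, y_score):
    """Return accuracy, recall, ROC AUC and the confusion matrix for binary labels."""
    fpr, tpr, _ = metrics.roc_curve(y_true, y_score)
    return {
        "accuracy": metrics.accuracy_score(y_true, y_pred),
        "recall": metrics.recall_score(y_true, y_pred),
        "auc": metrics.auc(fpr, tpr),
        # rows = actual class, columns = predicted class
        "confusion": metrics.confusion_matrix(y_true, y_pred),
    }

# Tiny hand-made example (hypothetical values)
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.2, 0.6, 0.7, 0.4, 0.9, 0.1])
y_pred = (y_score > 0.5).astype(int)

report = evaluate_classifier(y_true, y_pred, y_score)
print(report)
```

Note that `metrics.confusion_matrix(y_true, y_pred)` puts the actual class on the rows, whereas `pd.crosstab(y_predict, y_test)` in the code above puts the predicted class on the rows.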

4.5 Model ensemble tests

Two ensemble strategies, Voting and Stacking, are tested in turn.

4.5.1 Voting

# Simple (unweighted) average over the models collected in RST/AST
# Averaged churn probability, used for the ROC curve
AVG_Pred=np.mean(AST,axis=0)
# Majority vote on the hard predictions: 1 only if more than half of the models predict churn
AVG=(np.mean(RST,axis=0)>0.5).astype(float)
# Evaluate and visualize the results on the test set
from sklearn import metrics
print('Accuracy on the test set:\n',metrics.accuracy_score(y_test,AVG))
print('Recall on the test set:\n',metrics.recall_score(y_test,AVG))
# ROC curve
fpr,tpr,threshold=metrics.roc_curve(y_test,AVG_Pred)
roc_auc=metrics.auc(fpr,tpr)
plt.stackplot(fpr,tpr,alpha=0.5,edgecolor='black',color='steelblue')
plt.plot(fpr,tpr,lw=1,color='black')
plt.plot([0,1],[0,1],color='red',linestyle='--')
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
plt.text(x=0.5,y=0.3, s="ROC curve (area=%0.2f)" % roc_auc)
plt.show()
# Confusion matrix (rows: predicted, columns: actual),
# i.e. the transpose of metrics.confusion_matrix(y_test,AVG)
cm=pd.crosstab(AVG,y_test)
sns.heatmap(cm,annot=True,cmap='GnBu',fmt='d')
plt.xlabel('Real')
plt.ylabel('Predict')
plt.show()

Voting model accuracy

ROC curve

Confusion matrix
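The Voting code above takes a majority vote over the models' hard predictions for the final label and averages the probabilities only for the ROC curve. The two rules can disagree. The sketch below contrasts them on made-up outputs of three hypothetical models for four users (not project data):

```python
import numpy as np

# Hypothetical P(churn) from three models for four users
proba = np.array([
    [0.90, 0.40, 0.55, 0.10],   # model 1
    [0.80, 0.45, 0.60, 0.20],   # model 2
    [0.30, 0.95, 0.40, 0.30],   # model 3
])
labels = (proba > 0.5).astype(int)  # each model's hard prediction

# Hard voting: flag a user when more than half of the models predict churn
hard_vote = (labels.mean(axis=0) > 0.5).astype(int)

# Soft voting: average the probabilities first, then apply the threshold
soft_vote = (proba.mean(axis=0) > 0.5).astype(int)

print("hard:", hard_vote)   # hard: [1 0 1 0]
print("soft:", soft_vote)   # soft: [1 1 1 0]
```

The second user is flagged only by soft voting, because one very confident model outweighs two mildly negative ones. If recall is the priority, this difference may be worth checking on the real predictions.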

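4.5.2 Stacking-LR

The first layer uses five models: Random Forest, Adaboost, KNeighbors, Decision Tree, and GBDT. Each learner is run through K-fold cross-validation, and its predictions on the held-out folds are pieced together to form the input of the next layer.
The second layer uses an LR model, which learns from the first-layer predictions as features.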

# The approach here is adapted from http://blog.csdn.net/koala_tree
from sklearn.model_selection import KFold
# K-fold settings; the sizes are taken from the arrays passed to get_out_fold below
ntrain = x_train.shape[0]
ntest = x_test.shape[0]
SEED = 0 # for reproducibility (only has an effect when shuffle=True)
NFOLDS = 7 # set folds for out-of-fold prediction
# random_state is left out: recent scikit-learn versions reject it when shuffle=False
kf = KFold(n_splits = NFOLDS,shuffle=False)

def get_out_fold(clf,x_train,y_train,x_test):
    # Assumes NumPy arrays, since rows are selected by position
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS,ntest))

    for i, (train_index,test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        # Fit on the other folds, then predict the held-out fold and the test set
        clf.fit(x_tr,y_tr)

        oof_train[test_index] = clf.predict_proba(x_te)[:,1]
        oof_test_skf[i,:] = clf.predict_proba(x_test)[:,1]

    # Average the per-fold predictions on the test set
    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1,1),oof_test.reshape(-1,1)

# First-layer training
# Produces the first-layer outputs, which become the second-layer inputs
ada_oof_train,ada_oof_test = get_out_fold(ada,x_train,y_train,x_test) #Ada
rf_oof_train,rf_oof_test = get_out_fold(rf,x_train,y_train,x_test)  # Random Forest
dt_oof_train,dt_oof_test = get_out_fold(dt,x_train,y_train,x_test)  # DT
knn_oof_train,knn_oof_test = get_out_fold(knn,x_train,y_train,x_test)  # KNeighbors
et_oof_train,et_oof_test = get_out_fold(et,x_train,y_train,x_test)  # ET (not used by the LR second layer)
gb_oof_train,gb_oof_test = get_out_fold(gb,x_train,y_train,x_test)  # GB

# Second-layer training on the five models' out-of-fold predictions
x_train_2 = np.concatenate((rf_oof_train,ada_oof_train,knn_oof_train,dt_oof_train,gb_oof_train),axis=1)
x_test_2 = np.concatenate((rf_oof_test,ada_oof_test,knn_oof_test,dt_oof_test,gb_oof_test),axis=1)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(tol=0.00001, C=0.1, random_state=1234, max_iter=20,solver='liblinear',class_weight=None,penalty='l1')
lr.fit(x_train_2, y_train)
y_predict = lr.predict(x_test_2)
y_predict_proba = lr.predict_proba(x_test_2)[:,1]

# Evaluate and visualize the results on the test set
from sklearn import metrics
print('Accuracy on the test set:\n',metrics.accuracy_score(y_test,y_predict))
print('Recall on the test set:\n',metrics.recall_score(y_test,y_predict))
# ROC curve
fpr,tpr,threshold=metrics.roc_curve(y_test,y_predict_proba)
roc_auc=metrics.auc(fpr,tpr)
plt.stackplot(fpr,tpr,alpha=0.5,edgecolor='black',color='steelblue')
plt.plot(fpr,tpr,lw=1,color='black')
plt.plot([0,1],[0,1],color='red',linestyle='--')
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
plt.text(x=0.5,y=0.3, s="ROC curve (area=%0.2f)" % roc_auc)
plt.show()
# Confusion matrix (rows: predicted, columns: actual),
# i.e. the transpose of metrics.confusion_matrix(y_test,y_predict)
cm=pd.crosstab(y_predict,y_test)
sns.heatmap(cm,annot=True,cmap='GnBu',fmt='d')
plt.xlabel('Real')
plt.ylabel('Predict')
plt.show()

Stacking model accuracy

ROC curve

Confusion matrix
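For comparison, scikit-learn ships a `StackingClassifier` that performs the out-of-fold step internally. The sketch below is not the author's implementation: it assumes synthetic data from `make_classification`, three base models with placeholder parameters, and a default logistic regression as the second layer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Telco dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,                          # out-of-fold predictions feed the second layer
    stack_method="predict_proba",  # use class probabilities as meta-features
)
stack.fit(X_tr, y_tr)
proba = stack.predict_proba(X_te)[:, 1]
print("test accuracy:", stack.score(X_te, y_te))
```

One difference from `get_out_fold`: `StackingClassifier` refits each base model on the full training set for test-time predictions, while the hand-written version averages the test predictions of the per-fold models.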

4.5.3 Stacking-Xgboost

Next, Xgboost is tried as the second-layer learner.

# Second-layer training and test sets (all six first-layer models)
x_train_2 = np.concatenate((rf_oof_train,ada_oof_train,knn_oof_train,dt_oof_train,et_oof_train,gb_oof_train),axis=1)
x_test_2 = np.concatenate((rf_oof_test,ada_oof_test,knn_oof_test,dt_oof_test,et_oof_test,gb_oof_test),axis=1)
#x_train = np.concatenate((rf_oof_train,ada_oof_train,et_oof_train,gb_oof_train,dt_oof_train,knn_oof_train,svm_oof_train),axis=1)
#x_test =np.concatenate((rf_oof_test,ada_oof_test,et_oof_test,gb_oof_test,dt_oof_test,knn_oof_test,svm_oof_test),axis=1)
from xgboost import XGBClassifier

# xgboost hyperparameter tuning
gbm_est = XGBClassifier(min_child_weight=3,gamma=0.9,subsample=0.8,
                    colsample_bytree=0.8,objective='binary:logistic',nthread=-1,scale_pos_weight=1)
gbm_param_grid = {'n_estimators': [50], 'max_depth': [6],'min_child_weight':[2]}
gbm_grid = model_selection.GridSearchCV(gbm_est, gbm_param_grid, n_jobs=-1, cv=5, verbose=1,scoring='recall')
gbm_grid.fit(x_train_2, y_train)
print('Best score:\n',gbm_grid.best_score_)
print('Best params:\n',gbm_grid.best_params_)
# xgboost training
# Note: gamma and subsample here differ from the values used during the search above
gbm = XGBClassifier(**gbm_grid.best_params_,gamma=0.0,subsample=1.0,
                    colsample_bytree=0.8,objective='binary:logistic',nthread=-1,scale_pos_weight=1)
# Fit on the second-layer features only
gbm.fit(x_train_2,y_train)
y_predict = gbm.predict(x_test_2)
y_predict_proba = gbm.predict_proba(x_test_2)[:,1]

# Evaluate and visualize the results on the test set
from sklearn import metrics
print('Accuracy on the test set:\n',metrics.accuracy_score(y_test,y_predict))
print('Recall on the test set:\n',metrics.recall_score(y_test,y_predict))
# ROC curve
fpr,tpr,threshold=metrics.roc_curve(y_test,y_predict_proba)
roc_auc=metrics.auc(fpr,tpr)
plt.stackplot(fpr,tpr,alpha=0.5,edgecolor='black',color='steelblue')
plt.plot(fpr,tpr,lw=1,color='black')
plt.plot([0,1],[0,1],color='red',linestyle='--')
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
plt.text(x=0.5,y=0.3, s="ROC curve (area=%0.2f)" % roc_auc)
plt.show()
# Confusion matrix (rows: predicted, columns: actual),
# i.e. the transpose of metrics.confusion_matrix(y_test,y_predict)
cm=pd.crosstab(y_predict,y_test)
sns.heatmap(cm,annot=True,cmap='GnBu',fmt='d')
plt.xlabel('Real')
plt.ylabel('Predict')
plt.show()

Model accuracy

ROC curve

Confusion matrix

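4.6 Summary of the prediction part

Sections 4.4 and 4.5 classify user churn with single models and with model ensembles, respectively. The main points are:

Feature selection: the 35 features ranked highest by importance are used for modeling.
Model selection: RandomForest, AdaBoost, DecisionTree, KNN, ExtraTree, and GBDT are each used for prediction.
Optimization target: the problem centers on churn, so the goal is to find as many churning users as possible and design retention measures for them. Prediction should therefore prioritize recall. Note: churned users make up only about 1/4 of the sample, so the dataset should first be balanced with the SMOTE algorithm.
Single-model results: after tuning each model separately, the single-model AUC values are 0.86, 0.86, 0.83, 0.85, 0.84, and 0.85 respectively, which indicates that all of these models predict churn reasonably well. RandomForest, AdaBoost, and KNN exceed 80% recall on the test set.
Model ensembling: a single learner may not give stable predictions, so ensembling was studied as well, using the voting and stacking strategies. Under stacking, logistic regression (LR) and Xgboost were each used as the second-level learner. Xgboost-Stacking performed worst, with an AUC of only 0.81. Voting and LR-stacking both reach an AUC of 0.85, with recall of 0.8275 and 0.805 respectively.
Given the practical problem, the recommended models are RandomForest, AdaBoost, or Voting.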
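Besides rebalancing the data, recall can also be raised by lowering the probability threshold at which a user is flagged as likely to churn. This is a complementary technique that the analysis above does not use. The sketch below uses made-up labels and scores, and `recall_and_false_alarms` is a hypothetical helper name.

```python
import numpy as np

def recall_and_false_alarms(y_true, y_score, threshold):
    """Recall of the churn class and number of non-churners flagged at a threshold."""
    y_true = np.asarray(y_true)
    flagged = np.asarray(y_score) >= threshold
    recall = (flagged & (y_true == 1)).sum() / (y_true == 1).sum()
    false_alarms = int((flagged & (y_true == 0)).sum())
    return recall, false_alarms

# Hypothetical labels (1 = churned) and predicted churn probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.45, 0.3, 0.55, 0.4, 0.2, 0.1]

for t in (0.5, 0.4, 0.25):
    recall, false_alarms = recall_and_false_alarms(y_true, y_score, t)
    print(f"threshold={t}: recall={recall:.2f}, false alarms={false_alarms}")
```

In this toy data, recall rises from 0.50 to 1.00 as the threshold drops, while the number of users flagged by mistake rises from 1 to 2. Whether that trade is worthwhile depends on the cost of a retention offer versus the cost of a lost user.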

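5 Summary and recommendations

Summary:

The analysis shows that high-churn users share the following traits: no partner or children, senior, on fiber-optic internet service, subscribed to entertainment add-ons rather than protection add-ons, paying monthly rather than yearly, paying online, using paperless billing, and new, with less than six months of tenure.
Data mining produced several classification models with AUC reaching 0.85. Because the dataset is imbalanced, accuracy on the test set is not high, but recall is above 80%, which covers the great majority of churning users.

Recommendations:

User type: for senior users and users without a partner or children, offer dedicated discount plans or time-limited promotions such as gifts. Payment amounts can also be drilled into further by computing churn rates per payment level for each service and age group, to check whether different user types differ in price sensitivity across services.
Internet service: the operator should investigate the fiber-optic service further, from two angles: a. service quality; b. users' satisfaction with the price of the fiber service. When providing internet service, some protection-type add-ons could be bundled for free to improve retention. Entertainment add-ons call for the same two investigations, a and b.
Contract and payment method: encourage users to sign long-term contracts, for example with discounted one-year and two-year plans that bundle entertainment or protection add-ons, to increase stickiness. Users who pay electronically and use paperless billing deserve further analysis. Given the trend toward electronic payment, their churn is probably not caused by the payment method itself, so the traits these users share need to be identified and addressed.
Prediction model: manage users with a high predicted churn probability as a separate group and design more targeted, personalized plans for them, turning them into sticky long-term users.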
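As a minimal sketch of how the last recommendation might be put into practice, the snippet below ranks users by predicted churn probability and keeps those above a cut-off. The scores are made up, `churn_proba` is a hypothetical column name, and `RISK_THRESHOLD` is an assumed value; in the project the scores would come from a fitted model's `predict_proba`.

```python
import pandas as pd

# Hypothetical scores for five users
scores = pd.DataFrame({
    "customerID": ["A", "B", "C", "D", "E"],
    "churn_proba": [0.82, 0.15, 0.67, 0.40, 0.91],
})

RISK_THRESHOLD = 0.6  # assumed cut-off, to be tuned against business cost

# Users to hand over to a dedicated retention program, highest risk first
high_risk = (scores[scores["churn_proba"] >= RISK_THRESHOLD]
             .sort_values("churn_proba", ascending=False)
             .reset_index(drop=True))
print(high_risk)
```

Sorting by probability lets a retention team work from the top of the list when the budget for offers is limited.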
