2017/07 -- 2017/09 Tianchi Smart Traffic Prediction Competition: Approach and Model Summary (Part 1)
Preface
ML modeling and data-processing methods seem to be must-have skills for a CS student these days, but simply studying basic models in my spare time never felt like real progress, so I decided to enter a competition over the summer to learn the field properly. This post organizes my materials from that competition, briefly outlines our approach, and attaches the source code. I'm happy to discuss with anyone who is interested.
(ps: we used three models in total; this post covers the traditional machine-learning models, XGBoost and LightGBM. LSTM and the static model will follow in later posts.)
If you need the data, feel free to message me.
Season 1 ranking: 61
Final ranking: 12
Team: moka, LRain, 過把火
Unfortunately, unlike JD's competition, Tianchi only sends the top five teams to the final; losing by a gap of a mere thousandth still stings a little.
1 Problem Description
The mobile-internet era has turned every traveler into a contributor of traffic information: massive volumes of location data are processed and fused in the cloud into city-wide, all-day, blind-spot-free traffic information. This year's algorithm challenge, themed "smart traffic prediction in the mobile-internet era", asks participants to build models on internet traffic data that accurately predict the travel time of each key road link in a given time slot, anticipating fluctuations in traffic conditions to support smart travel and intelligent traffic management. The organizers score submissions by the error between the predicted values and the recorded true values.
1.1 Data Source
Mobile apps collect users' location data in real time and anonymously; the data are processed and fused into city-wide, all-day traffic information. The competition provides the attributes of the city's key road links, the network topology between links, and each link's historical travel time in each time slot, for building and testing models.
1.2 Data Description
a. Link attribute table
Each direction of travel on a road consists of multiple links. The dataset provides each link's unique ID, length, width, and road type, as shown in Table 1; Figure 1 illustrates the attributes of surface-road links link1 and link2.
b. Link upstream/downstream table
Links form upstream/downstream relations along the permitted direction of travel. The dataset provides each link's direct upstream and direct downstream links, as shown in Table 2; Figure 2 illustrates link2's in_links and out_links.
c. Link historical travel-time table
The dataset records each link's average travel time in each historical time slot (one slot = 2 minutes); a slot's average travel time is derived from the travel times of the vehicles that entered the link during that slot. The fields are explained in Table 3:
1.3 Task
The preliminary and final rounds ask for the same thing: predict traffic conditions over a future period from the given data.
The preliminary round predicts the average travel time of every link in every 2-minute slot of [8:00-9:00) in June 2016.
Final round:
Final round, stage 1
Predict one hour of data in each of the morning peak, midday off-peak, and evening peak periods for July 1-15, 2017. The prediction windows are:
- Morning peak: [8:00 - 9:00)
- Midday off-peak: [15:00 - 16:00)
- Evening peak: [18:00 - 19:00)
Final round, stage 2
Predict one hour of data in each of the morning peak, midday off-peak, and evening peak periods for July 1-31, 2017. The prediction windows are:
- Morning peak: [8:00 - 9:00)
- Midday off-peak: [15:00 - 16:00)
- Evening peak: [18:00 - 19:00)
1.4 Evaluation Metric
Notation:
ttp: the participant's predicted travel time
ttr: the recorded true travel time
N: the number of links to predict
T_i: the number of predicted time slots on link i
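The formula itself appeared as an image in the original post; reconstructed from the symbol definitions above (so treat the exact notation as an assumption), the competition MAPE averages each link's absolute percentage errors over that link's time slots:

$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{T_i}\sum_{t=1}^{T_i}\frac{\left|ttp_{i,t} - ttr_{i,t}\right|}{ttr_{i,t}}\right)$$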
A lower MAPE indicates a more accurate model.
1.5 Model Architecture
We used three models:
LightGBM
LSTM
Static (linear rules)
2 Problem Analysis
1. From the task statement, this is fundamentally a regression problem.
2. The data are clearly time-series in nature, and the dataset is reasonably large, so LSTM is worth considering.
3. The raw data have few dimensions, so feature mining is constrained but straightforward; for traditional machine learning, XGBoost and LightGBM are natural baselines.
4. The hard part of the prediction:
The first instinct is single-step prediction: use the previous point to predict the next one. But the task requires predicting a whole contiguous window at once. One might still try predicting point by point, feeding each prediction back in as input for the next, but the accumulated error makes results degrade quickly; in our experiments, by the fourth recursive step the error was already beyond an acceptable range.
We therefore predict the whole target window directly from the data of the preceding period. All of the input signal then comes from that preceding period, so the historical and recent-window features have to be constructed with particular care (a minimal sketch of the two strategies follows).
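A minimal sketch (not the competition code) contrasting the two strategies; model_one_step and the per-horizon list models_direct are hypothetical fitted regressors:

import numpy as np

def predict_recursive(model_one_step, history, n_steps=30):
    # feed each prediction back in as input: errors compound step by step
    window = list(history)
    preds = []
    for _ in range(n_steps):
        y_hat = model_one_step.predict(np.array(window[-30:]).reshape(1, -1))[0]
        preds.append(y_hat)
        window.append(y_hat)   # the predicted value re-enters the input window
    return preds

def predict_direct(models_direct, features):
    # one regressor per horizon step: every step is predicted from
    # actually observed features only, so errors do not accumulate
    return [m.predict(features.reshape(1, -1))[0] for m in models_direct]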
With this understanding, the overall direction of the competition is basically set. The post-competition write-ups of the top teams confirmed the same idea; the differences lay in feature processing and model tuning.
3 Data Analysis
First, a general look at the data.
3.1 Visualization
We used Tableau to inspect the data directly; for time-series data in particular, visualization gives a fairly complete basis for reference and comparison.
Given the nature of road data, we focused on comparing same-period data in Tableau, especially the same hour and minute across different months: first to test the hypothesis that same-period traffic volume trends rise or fall overall, and second to pick a suitable granularity for designing the statistical features. See the figure below.
The figure highlights several links with distinct temporal signatures. Most links show similar same-period hourly trends from month to month, but some, like the light-blue link, have no clear monthly trend. Since traffic volume depends heavily on the day of the week, we also broke the data down by weekday versus weekend, as shown below:
After splitting the data by day of week, we found that the same link is strongly affected by the weekday. A plausible reading: links in industrial parks stay low and flat on weekends because of days off, but are congested at the morning and evening peaks on workdays; conversely, links near commercial or leisure districts are relatively calm on workdays, when visitors are few, but swing heavily on weekends and even Friday nights. This observation also led us to treat the day of the week as a separate factor when designing the linear weighted model (a small sketch of the weekday breakdown follows).
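A small pandas sketch of that weekday breakdown (the file name and columns are assumptions, not the competition's exact schema):

import pandas as pd

df = pd.read_csv('travel_time.csv', parse_dates=['time_interval_begin'])  # hypothetical input
df['weekday'] = df['time_interval_begin'].dt.weekday + 1   # Monday=1 ... Sunday=7
weekday_profile = (df.groupby(['link_ID', 'weekday'])['travel_time']
                     .mean()
                     .unstack('weekday'))                  # one row per link, one column per weekday
print(weekday_profile.head())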
3.2 Outliers
Measurement error and quirks of individual sampling algorithms make bad data unavoidable, and inspecting the full dataset did turn up such points. We therefore trimmed outliers, dropping the values outside the 95% quantile band at both tails.
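The post does not record the exact cut points, so the sketch below assumes a central 95% band (2.5%/97.5% quantiles), applied per link so that one chronically congested road does not set the thresholds for every other:

import pandas as pd

def trim_outliers(g, lower=0.025, upper=0.975):
    # keep only the central band of each link's travel_time distribution
    lo = g['travel_time'].quantile(lower)
    hi = g['travel_time'].quantile(upper)
    return g[(g['travel_time'] >= lo) & (g['travel_time'] <= hi)]

df = pd.read_csv('travel_time.csv')    # hypothetical input
df = df.groupby('link_ID', group_keys=False).apply(trim_outliers)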
3.3 Missing Values
Missing values are common in time-series data and are an active research topic. For example, 鄭宇 (Yu Zheng) from our lab, now one of the leads at Microsoft Research Asia, works on urban computing, which includes imputing missing spatio-temporal urban data; see his homepage for details.
For this competition's data, we only have a single dimension (the time domain), with no spatial or other auxiliary data, so the room for imputation is limited. In the end we filled interior gaps with the mean of the neighboring values on either side, and for series missing a head or a tail we substituted the nearest available value (a sketch follows).
4 特征工程
特征工程可參考如下流程圖
4.1 嘗試
數(shù)據(jù)為時間序列數(shù)據(jù)择吊,起初嘗試LGBM + 原始數(shù)據(jù)的前30個點來預測后30個點李根,但是效果很差(除了第一個點),打印出前30個點的特征重要性發(fā)現(xiàn)几睛,只有最后2個點的重要性最高房轿,前面的28個點都幾乎很小,因此放棄使用純原始數(shù)據(jù)來做時間序列預測。
(ps:因為不懂時間序列預測囱持,所以后來發(fā)現(xiàn)在做時間序列預測的時候夯接,不能直接使用純原始值,需要加上額外的統(tǒng)計特征)
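A hedged reconstruction of that importance check on a stand-in random-walk series (the lag_30 ... lag_1 column names are made up for illustration):

import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
series = rng.normal(size=2000).cumsum()      # stand-in for one link's travel_time series
X_lag = pd.DataFrame({f'lag_{30 - i}': series[i:i + len(series) - 30]
                      for i in range(30)})   # 30 raw lag features
y = series[30:]                              # the next point to predict

model = lgb.LGBMRegressor(n_estimators=200).fit(X_lag, y)
imp = pd.Series(model.feature_importances_, index=X_lag.columns)
print(imp.sort_values(ascending=False).head())   # the most recent lags dominate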
4.2 時間特征
由于是是時序數(shù)據(jù)纷妆,因此我們首先要對時間進行挖掘钻蹬。因為只有時間給好了相關(guān)的數(shù)據(jù)處理才好進行,例如一些groupby等聚合操作凭需。這里要感謝河北工大-“麻婆豆腐”學長的支援问欠。代碼如下:
import numpy as np
import pandas as pd

def AddBaseTimeFeature(df):
    # time_interval looks like "[YYYY-MM-DD HH:MM:SS,YYYY-MM-DD HH:MM:SS)";
    # characters 1..19 are the slot's start timestamp
    df['time_interval_begin'] = pd.to_datetime(df['time_interval'].map(lambda x: x[1:20]))
    df = df.drop(['date', 'time_interval'], axis=1)
    df['time_interval_month'] = df['time_interval_begin'].map(lambda x: x.strftime('%m'))
    df['time_interval_day'] = df['time_interval_begin'].map(lambda x: x.day)
    df['time_interval_begin_hour'] = df['time_interval_begin'].map(lambda x: x.strftime('%H'))
    df['time_interval_minutes'] = df['time_interval_begin'].map(lambda x: x.strftime('%M'))
    # Monday=1, Sunday=7
    df['time_interval_week'] = df['time_interval_begin'].map(lambda x: x.weekday() + 1)
    # index (1..30) of the 2-minute slot within the hour; // keeps it an integer on Python 3
    df['time_interval_point_num'] = df['time_interval_minutes'].map(lambda x: str((int(x) + 2) // 2))
    return df
link_info = pd.read_table(data_path + '/new_data' + '/gy_contest_link_info.txt',
                          sep=';', dtype={'link_ID': 'str'})
link_info = link_info.sort_values('link_ID')
training_data = pd.read_table(data_path + '/new_data' + '/gy_contest_traveltime_training_data_second.txt',
                              sep=';', dtype={'link_ID': 'str'})
feature_data = pd.merge(training_data, link_info, on='link_ID')
feature_data = feature_data.sort_values(['link_ID', 'time_interval'])
print('Building the final feature matrix')
feature_data_date = AddBaseTimeFeature(feature_data)
print('Writing the final feature matrix')
feature_data_date.to_csv(data_path + '/new_data' + '/feature_data_2017.csv', index=False)

print('Reading the feature matrix')
# keep link_ID as str (object) so it can be one-hot encoded later
feature_data = pd.read_csv(data_path + '/data' + '/feature_data_2017_without_missdata.csv',
                           dtype={'link_ID': str})
week = pd.get_dummies(feature_data['time_interval_week'], prefix='week')
del feature_data['time_interval_week']
print('Concatenating the feature matrix with the week one-hot columns')
feature_data = pd.concat([feature_data, week], axis=1)
# add, as one-hot categories, the index of each 2-minute point within the hour
print('Concatenating the feature matrix with the point_num one-hot columns')
point_num = pd.get_dummies(feature_data['time_interval_point_num'], prefix='point_num')
del feature_data['time_interval_point_num']
feature_data = pd.concat([feature_data, point_num], axis=1)
4.3 Statistical Features of travel_time
When we first tried statistical features, we simply computed all the usual statistics at once: mean, mode, std, max, min, and so on, plus the previous month's history. Thanks again to @麻婆豆腐, who shared his source code in the forum before the preliminary round had even ended. The Python itself is not hard, though it takes some DataFrame fluency; after a few uses it becomes second nature. Building on the date processing above, we then construct the statistical features; the code is below.
Note that, to normalize the features, we log-transformed all values before training (np.log1p in the code) so they more closely follow a normal distribution, and applied the inverse transform (np.expm1) to the predictions to return them to the original scale.
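Concretely, the transform pair used throughout the code below:

import numpy as np

y = np.array([12.4, 0.8, 3.0])   # travel times on the original scale
y_log = np.log1p(y)              # train and evaluate in log space
y_back = np.expm1(y_log)         # invert the transform on the predictions
assert np.allclose(y, y_back)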
Based on the time features, and to maximize the influence of the preceding window on the whole contiguous target hour, we compute sliding-window statistics over the previous hour's travel_time:
'''
Training data: April
'''
train = pd.DataFrame()
train4 = pd.DataFrame()
for curHour in [8, 15, 18]:
    print("train4 curHour", curHour)
    trainTmp = feature_data.loc[(feature_data.time_interval_month == 4) &
                                (feature_data.time_interval_begin_hour == curHour)
                                # & (feature_data.time_interval_day <= 15)
                                , :]
    # sliding windows over the preceding hour: minutes >= i keeps the last
    # 2, 12, 22, 32, 42 and 60 minutes of hour curHour-1
    # (mode_function, the statistical mode, is defined with the LightGBM code below)
    for i in [58, 48, 38, 28, 18, 0]:
        tmp = feature_data.loc[(feature_data.time_interval_month == 4) &
                               (feature_data.time_interval_begin_hour == curHour - 1) &
                               (feature_data.time_interval_minutes >= i), :]
        tmp = tmp.groupby(['link_ID', 'time_interval_day'])['travel_time'].agg(
            [('mean_%d' % i, np.mean), ('median_%d' % i, np.median),
             ('mode_%d' % i, mode_function)]).reset_index()
        trainTmp = pd.merge(trainTmp, tmp, on=['link_ID', 'time_interval_day'], how='left')
    train4 = pd.concat([train4, trainTmp], axis=0)
print("train4.shape", train4.shape)

# per-link, per-minute history over the whole of March
train4_history = feature_data.loc[(feature_data.time_interval_month == 3), :]
train4_history = train4_history.groupby(['link_ID', 'time_interval_minutes'])['travel_time'].agg(
    [('mean_m', np.mean), ('median_m', np.median), ('mode_m', mode_function)]).reset_index()
train4 = pd.merge(train4, train4_history, on=['link_ID', 'time_interval_minutes'], how='left')

# per-link, per-hour history over the whole of March
train_history2 = feature_data.loc[(feature_data.time_interval_month == 3), :]
train_history2 = train_history2.groupby(['link_ID', 'time_interval_begin_hour'])['travel_time'].agg(
    [('median_h', np.median), ('mode_h', mode_function)]).reset_index()
train4 = pd.merge(train4, train_history2, on=['link_ID', 'time_interval_begin_hour'], how='left')
print("train4.shape", train4.shape)

train = train4
train_label = np.log1p(train.pop('travel_time'))
train_time = train.pop('time_interval_begin')
train.drop(['time_interval_month'], inplace=True, axis=1)
train_link = train.pop('link_ID')   # (253001, 35)
print("train.shape", train.shape)
'''
Test data: evaluated on the whole of June
'''
test = pd.DataFrame()
for curHour in [8, 15, 18]:
    print("test curHour", curHour)
    testTmp = feature_data.loc[(feature_data.time_interval_month == 6) &
                               (feature_data.time_interval_begin_hour == curHour), :]
    for i in [58, 48, 38, 28, 18, 0]:
        tmp = feature_data.loc[(feature_data.time_interval_month == 6) &
                               (feature_data.time_interval_begin_hour == curHour - 1) &
                               (feature_data.time_interval_minutes >= i), :]
        tmp = tmp.groupby(['link_ID', 'time_interval_day'])['travel_time'].agg(
            [('mean_%d' % i, np.mean), ('median_%d' % i, np.median),
             ('mode_%d' % i, mode_function)]).reset_index()
        testTmp = pd.merge(testTmp, tmp, on=['link_ID', 'time_interval_day'], how='left')
    test = pd.concat([test, testTmp], axis=0)
print("test.shape", test.shape)

# per-minute and per-hour history from May
test_history = feature_data.loc[(feature_data.time_interval_month == 5), :]
test_history = test_history.groupby(['link_ID', 'time_interval_minutes'])['travel_time'].agg(
    [('mean_m', np.mean), ('median_m', np.median), ('mode_m', mode_function)]).reset_index()
test = pd.merge(test, test_history, on=['link_ID', 'time_interval_minutes'], how='left')

test_history2 = feature_data.loc[(feature_data.time_interval_month == 5), :]
test_history2 = test_history2.groupby(['link_ID', 'time_interval_begin_hour'])['travel_time'].agg(
    [('median_h', np.median), ('mode_h', mode_function)]).reset_index()
test = pd.merge(test, test_history2, on=['link_ID', 'time_interval_begin_hour'], how='left')

test_label = np.log1p(test.pop('travel_time'))
test_time = test.pop('time_interval_begin')
test.drop(['time_interval_month'], inplace=True, axis=1)
# drop link_ID
test_link = test.pop('link_ID')
5 Machine Learning Models
We started with the two competition workhorses: XGBoost and LightGBM.
I won't introduce the models themselves; they are famous enough.
The concrete pipeline is shown in the figure below:
The training set uses the March and April data from the feature section above; the test sets are the second half of May and the whole of June.
During training we pass the whole of June as the validation set; at the end we also write out predictions for the second half of May and for the whole of June, so the MAPE can be computed once more by hand.
Below is the source code.
Note that we need a custom MAPE-style evaluation function:
def mape_ln(y, d):
    # y: predictions (log scale); d: the validation DMatrix carrying the true labels
    c = d.get_label()
    result = -np.sum(np.abs(np.expm1(y) - np.abs(np.expm1(c))) / np.abs(np.expm1(c))) / len(c)
    return "mape", result

Here y is the prediction and d carries the true values.
XGBoost training, validation, and prediction:
import xgboost as xgb

xlf = xgb.XGBRegressor(max_depth=8,
                       learning_rate=0.01,
                       n_estimators=1000,
                       silent=True,
                       objective=mape_object,   # custom objective; defined with the metric helpers further below
                       # objective='reg:linear',
                       nthread=-1,
                       gamma=0,
                       min_child_weight=6,
                       max_delta_step=0,
                       subsample=0.9,
                       colsample_bytree=0.8,
                       colsample_bylevel=1,
                       reg_alpha=1e0,
                       reg_lambda=0,
                       scale_pos_weight=1,
                       seed=9,
                       missing=None)

xlf.fit(train.values, train_label.values, eval_metric=mape_ln,
        verbose=True, eval_set=[(test.values, test_label.values)],
        early_stopping_rounds=10)
'''
Predict and save the results: second half of May
'''
sub = pd.DataFrame()
for curHour in [8, 15, 18]:
    print("sub curHour", curHour)
    subTmp = feature_data.loc[(feature_data.time_interval_month == 5) &
                              (feature_data.time_interval_begin_hour == curHour)
                              # & (feature_data.time_interval_day > 15)
                              , :]
    for i in [58, 48, 38, 28, 18, 0]:
        tmp = feature_data.loc[(feature_data.time_interval_month == 5) &
                               (feature_data.time_interval_begin_hour == curHour - 1) &
                               (feature_data.time_interval_minutes >= i), :]
        tmp = tmp.groupby(['link_ID', 'time_interval_day'])['travel_time'].agg(
            [('mean_%d' % i, np.mean), ('median_%d' % i, np.median),
             ('mode_%d' % i, mode_function)]).reset_index()
        subTmp = pd.merge(subTmp, tmp, on=['link_ID', 'time_interval_day'], how='left')
    sub = pd.concat([sub, subTmp], axis=0)
print("sub.shape", sub.shape)

sub_history = feature_data.loc[(feature_data.time_interval_month == 4), :]
sub_history = sub_history.groupby(['link_ID', 'time_interval_minutes'])['travel_time'].agg(
    [('mean_m', np.mean), ('median_m', np.median), ('mode_m', mode_function)]).reset_index()
sub = pd.merge(sub, sub_history, on=['link_ID', 'time_interval_minutes'], how='left')

sub_history2 = feature_data.loc[(feature_data.time_interval_month == 4), :]
sub_history2 = sub_history2.groupby(['link_ID', 'time_interval_begin_hour'])['travel_time'].agg(
    [('median_h', np.median), ('mode_h', mode_function)]).reset_index()
sub = pd.merge(sub, sub_history2, on=['link_ID', 'time_interval_begin_hour'], how='left')

sub_label = np.log1p(sub.pop('travel_time'))
sub_time = sub.pop('time_interval_begin')
sub.drop(['time_interval_month'], inplace=True, axis=1)
# drop link_ID
sub_link = sub.pop('link_ID')

# predict; mape_ln1 (array version of the metric) is defined further below
sub_pred = xlf.predict(sub.values, ntree_limit=xlf.best_iteration)
mape_ln1(sub_pred, sub_label)   # ('mape', -0.27325180044232567)

sub_out = pd.concat([sub_link, sub], axis=1)
sub_out = pd.concat([sub_out, np.expm1(sub_label)], axis=1)
sub_out['xgb_pred'] = np.expm1(sub_pred)
sub_out.to_csv('./predict_result/xgb_pred_m5.csv', index=False)
'''
Predict and save the results: whole of June
'''
sub = pd.DataFrame()
for curHour in [8, 15, 18]:
    print("sub curHour", curHour)
    subTmp = feature_data.loc[(feature_data.time_interval_month == 6) &
                              (feature_data.time_interval_begin_hour == curHour), :]
    for i in [58, 48, 38, 28, 18, 0]:
        tmp = feature_data.loc[(feature_data.time_interval_month == 6) &
                               (feature_data.time_interval_begin_hour == curHour - 1) &
                               (feature_data.time_interval_minutes >= i), :]
        tmp = tmp.groupby(['link_ID', 'time_interval_day'])['travel_time'].agg(
            [('mean_%d' % i, np.mean), ('median_%d' % i, np.median),
             ('mode_%d' % i, mode_function)]).reset_index()
        subTmp = pd.merge(subTmp, tmp, on=['link_ID', 'time_interval_day'], how='left')
    sub = pd.concat([sub, subTmp], axis=0)
print("sub.shape", sub.shape)

sub_history = feature_data.loc[(feature_data.time_interval_month == 5), :]
sub_history = sub_history.groupby(['link_ID', 'time_interval_minutes'])['travel_time'].agg(
    [('mean_m', np.mean), ('median_m', np.median), ('mode_m', mode_function)]).reset_index()
sub = pd.merge(sub, sub_history, on=['link_ID', 'time_interval_minutes'], how='left')

sub_history2 = feature_data.loc[(feature_data.time_interval_month == 5), :]
sub_history2 = sub_history2.groupby(['link_ID', 'time_interval_begin_hour'])['travel_time'].agg(
    [('median_h', np.median), ('mode_h', mode_function)]).reset_index()
sub = pd.merge(sub, sub_history2, on=['link_ID', 'time_interval_begin_hour'], how='left')

sub_label = np.log1p(sub.pop('travel_time'))
sub_time = sub.pop('time_interval_begin')
sub.drop(['time_interval_month'], inplace=True, axis=1)
# drop link_ID
sub_link = sub.pop('link_ID')

# predict
sub_pred = xlf.predict(sub.values, ntree_limit=xlf.best_iteration)
mape_ln1(sub_pred, sub_label)

sub_out = pd.concat([sub_link, sub], axis=1)
sub_out = pd.concat([sub_out, np.expm1(sub_label)], axis=1)
sub_out['xgb_pred'] = np.expm1(sub_pred)
sub_out.to_csv('./predict_result/xgb_pred_m6.csv', index=False)
'''
Predict and save the results: first half of July
'''
sub = pd.DataFrame()
for curHour in [8, 15, 18]:
    print("sub curHour", curHour)
    subTmp = feature_data.loc[(feature_data.time_interval_month == 7) &
                              (feature_data.time_interval_begin_hour == curHour)
                              # & (feature_data.time_interval_day <= 15)
                              , :]
    for i in [58, 48, 38, 28, 18, 0]:
        tmp = feature_data.loc[(feature_data.time_interval_month == 7) &
                               (feature_data.time_interval_begin_hour == curHour - 1) &
                               (feature_data.time_interval_minutes >= i), :]
        tmp = tmp.groupby(['link_ID', 'time_interval_day'])['travel_time'].agg(
            [('mean_%d' % i, np.mean), ('median_%d' % i, np.median),
             ('mode_%d' % i, mode_function)]).reset_index()
        subTmp = pd.merge(subTmp, tmp, on=['link_ID', 'time_interval_day'], how='left')
    sub = pd.concat([sub, subTmp], axis=0)
print("sub.shape", sub.shape)

# note: the per-minute and per-hour history here still comes from May
sub_history = feature_data.loc[(feature_data.time_interval_month == 5), :]
sub_history = sub_history.groupby(['link_ID', 'time_interval_minutes'])['travel_time'].agg(
    [('mean_m', np.mean), ('median_m', np.median), ('mode_m', mode_function)]).reset_index()
sub = pd.merge(sub, sub_history, on=['link_ID', 'time_interval_minutes'], how='left')

sub_history2 = feature_data.loc[(feature_data.time_interval_month == 5), :]
sub_history2 = sub_history2.groupby(['link_ID', 'time_interval_begin_hour'])['travel_time'].agg(
    [('median_h', np.median), ('mode_h', mode_function)]).reset_index()
sub = pd.merge(sub, sub_history2, on=['link_ID', 'time_interval_begin_hour'], how='left')

sub_label = np.log1p(sub.pop('travel_time'))
sub_time = sub.pop('time_interval_begin')
sub.drop(['time_interval_month'], inplace=True, axis=1)
# drop link_ID
sub_link = sub.pop('link_ID')

# predict
sub_pred = xlf.predict(sub.values, ntree_limit=xlf.best_iteration)
mape_ln1(sub_pred, sub_label)

sub_out = pd.concat([sub_link, sub], axis=1)
sub_out = pd.concat([sub_out, np.expm1(sub_label)], axis=1)
sub_out['xgb_pred'] = np.expm1(sub_pred)
sub_out.to_csv('./predict_result/xgb_pred_m7.csv', index=False)
LightGBM training and prediction:
from scipy.stats import mode

# statistical mode of a series
def mode_function(df):
    counts = mode(df)
    return counts[0][0]

feature_data = pd.read_csv('E:/data/all_data_M34567.csv', dtype={'link_ID': str})
The training matrix (April targets with March history) and the June evaluation matrix (with May history) are built exactly as in the XGBoost script above, so the code is not repeated here. For reference, LightGBM reached valid_0's mape: 0.284432 on the June evaluation set at iteration 374.
def mape_ln1(y, d):
    # array version of mape_ln: d is a plain array of true (log-scale) labels
    c = d
    result = -np.sum(np.abs(np.expm1(y) - np.abs(np.expm1(c))) / np.abs(np.expm1(c))) / len(c)
    return "mape", result

def mape_object(d, y):
    # custom training objective: gradient/hessian of the weighted squared
    # relative error 0.5 * (y - d)**2 / d
    grad = 1.0 * (y - d) / d
    hess = 1.0 / d
    return grad, hess

def mape_ln_gbm(d, y):
    # LightGBM eval signature: returns (name, value, is_higher_better)
    result = np.sum(np.abs(np.expm1(y) - np.abs(np.expm1(d))) / np.abs(np.expm1(d))) / len(d)
    return "mape", result, False
import lightgbm as lgb

lgbmodel = lgb.LGBMRegressor(num_leaves=32,
                             # max_depth=9,
                             max_bin=511,
                             learning_rate=0.01,
                             n_estimators=2000,
                             silent=True,
                             objective=mape_object,
                             min_child_weight=6,
                             colsample_bytree=0.8,
                             reg_alpha=1e0,
                             reg_lambda=0)

lgbmodel.fit(train.values, train_label.values, eval_metric=mape_ln_gbm,
             verbose=True, eval_set=[(test.values, test_label.values)],
             early_stopping_rounds=100)

pred = lgbmodel.predict(test.values, num_iteration=lgbmodel.best_iteration)
'''
Predict and save the results: second half of May
'''
# sub, sub_link and sub_label are built from May data (with April history)
# exactly as in the corresponding XGBoost block above
sub_pred = lgbmodel.predict(sub.values, num_iteration=lgbmodel.best_iteration)
# mape_ln1(sub_pred, sub_label)   # ('mape', -0.27112186522435494)
sub_out = pd.concat([sub_link, sub], axis=1)
sub_out = pd.concat([sub_out, np.expm1(sub_label)], axis=1)
sub_out['gbm_pred'] = np.expm1(sub_pred)
sub_out.to_csv('./predict_result/gbm_pred_m5.csv', index=False)
'''
Predict and save the results: whole of June
'''
# sub is built from June data (with May history) exactly as in the
# corresponding XGBoost block above
sub_pred = lgbmodel.predict(sub.values, num_iteration=lgbmodel.best_iteration)
# mape_ln1(sub_pred, sub_label)
sub_out = pd.concat([sub_link, sub], axis=1)
sub_out = pd.concat([sub_out, np.expm1(sub_label)], axis=1)
sub_out['gbm_pred'] = np.expm1(sub_pred)
sub_out.to_csv('./predict_result/gbm_pred_m6.csv', index=False)
'''
Predict and save the results: first half of July
'''
# sub is built from July data (with May history) exactly as in the
# corresponding XGBoost block above
sub_pred = lgbmodel.predict(sub.values, num_iteration=lgbmodel.best_iteration)
# mape_ln1(sub_pred, sub_label)
sub_out = pd.concat([sub_link, sub], axis=1)
sub_out = pd.concat([sub_out, np.expm1(sub_label)], axis=1)
sub_out['gbm_pred'] = np.expm1(sub_pred)
sub_out.to_csv('./predict_result/gbm_pred_m7.csv', index=False)
Summary
This post covered the basic feature processing and the machine-learning models; the code is the main content. For lack of time I couldn't write up every model in one sitting; the rest will follow. Corrections and discussion are welcome.
My blog: https://NingSM.github.io
Please credit the original when reposting. Thanks.