Predicting Public Bicycle Usage

Contents

Background concepts

1. GBDT
2. The core algorithm of xgboost
3. Differences between GBDT and xgboost

Data analysis and parameter tuning

I. Overview
II. Data analysis
III. Notes before tuning
IV. Tuning with GridSearchCV
V. Further tuning
VI. Summary of results

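Background concepts:

1. GBDT
(1) GBDT comes first because xgboost is, at its core, still a GBDT; compared with a plain GBDT, xgboost pushes speed and efficiency as far as it can.
(2) The GBDT algorithm: the first learner predicts the target directly. Each later weak learner is fitted to the negative gradient of the loss function, which approximates the residual (the actual value minus the current prediction), so every new learner corrects the error left by the ones before it. The final prediction is the sum of the outputs of all weak learners, and each weak learner is a tree.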
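
As a rough illustration of the residual-fitting idea described above, the following pure-Python sketch boosts depth-1 trees (stumps) on a single feature with squared-error loss, where the negative gradient is exactly the residual. The function names, the toy data and the default values of n_trees and learning_rate are assumptions made for this example; it is not how xgboost or scikit-learn implement boosting.

```python
# Minimal gradient boosting for squared-error loss with depth-1 trees (stumps).
# Illustrative only: names and data are made up for this sketch.

def fit_stump(xs, residuals):
    """Pick the threshold on one feature that minimizes the squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_gbdt(xs, ys, n_trees=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # first model: a constant prediction
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        # For the loss 1/2 * (y - f)^2 the negative gradient is the residual y - f.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    # The final prediction is the base value plus the sum of all tree outputs.
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
model = fit_gbdt(xs, ys)

mean_y = sum(ys) / len(ys)
mse_const = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_model = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_const, mse_model)
```

On this toy data the boosted model's training error ends up well below that of the constant prediction, which is the effect of repeatedly fitting the remaining residual.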

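2. The core algorithm of xgboost
(1) Trees are added one at a time, and each tree (a binary tree) is grown by repeatedly splitting on features. Adding a tree amounts to learning a new function that fits the residual of the previous prediction.
(2) Once training has produced k trees, scoring a sample works as follows: based on its features, the sample falls into exactly one leaf in each tree, and each leaf carries a score.
(3) The predicted value of the sample is the sum of the scores from all k trees.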
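
The prediction step in (2) and (3) can be sketched in a few lines. The two trees, their thresholds and their leaf scores below are invented for illustration (only the feature names hour and temp_1 come from this data set), and a real xgboost model also adds a global base score, which is left out here.

```python
# Each tree is a nested dict: internal nodes test one feature against a
# threshold, leaves carry a score. These trees are hypothetical.
tree_1 = {"feature": "hour", "threshold": 12,
          "left": {"leaf": 10.0}, "right": {"leaf": 30.0}}
tree_2 = {"feature": "temp_1", "threshold": 20.0,
          "left": {"leaf": -2.0}, "right": {"leaf": 5.0}}

def leaf_score(tree, sample):
    """Walk one binary tree from the root to the leaf the sample falls into."""
    node = tree
    while "leaf" not in node:
        go_left = sample[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]

def predict(trees, sample):
    """The ensemble prediction is the sum of the leaf scores of all trees."""
    return sum(leaf_score(tree, sample) for tree in trees)

sample = {"hour": 17, "temp_1": 25.3}
prediction = predict([tree_1, tree_2], sample)
print(prediction)  # 30.0 + 5.0 = 35.0
```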

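3. Differences between GBDT and xgboost
The main difference lies in how the objective function is defined. xgboost applies a second-order Taylor expansion to the objective, so the final objective depends only on the first and second derivatives of the loss at each data point.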
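
Written out, and assuming the notation of the XGBoost paper (none of these symbols appear in the original text), the second-order approximation at boosting round t looks roughly as follows. Here l is the loss, \hat{y}_i^{(t-1)} is the prediction after t-1 trees, f_t is the new tree and \Omega is a complexity penalty.

```latex
\mathrm{Obj}^{(t)}
  = \sum_{i=1}^{n} l\bigl(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t)
  \approx \sum_{i=1}^{n} \Bigl[\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)
      + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \Bigr] + \Omega(f_t),

g_i = \left.\frac{\partial\, l(y_i, \hat{y})}{\partial \hat{y}}\right|_{\hat{y} = \hat{y}_i^{(t-1)}},
\qquad
h_i = \left.\frac{\partial^2 l(y_i, \hat{y})}{\partial \hat{y}^2}\right|_{\hat{y} = \hat{y}_i^{(t-1)}}
```

The first term inside the bracket is constant during round t, so fitting the new tree only requires g_i and h_i. For the squared-error loss l = (1/2)(y - \hat{y})^2 this gives g_i = \hat{y}_i^{(t-1)} - y_i and h_i = 1.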

Further reading: "通俗理解kaggle比賽大殺器xgboost" (an accessible introduction to xgboost, the Kaggle competition workhorse)

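Data analysis and parameter tuning

I. Overview
1. Task type: regression
2. Background:
Public bicycles are low-carbon, environmentally friendly and healthy, and they solve the "last mile" problem in urban transport, so they are becoming popular in cities across the country. The data for this practice competition comes from several public bicycle docking stations on certain streets in two cities. The goal is to predict, from the time, the weather and similar information, how many public bicycles are borrowed in that neighbourhood within one hour.
3. Baseline model
The baseline model's predictions have an RMSE of 18.947.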

# -*- coding: utf-8 -*-

# Import modules
from xgboost import XGBRegressor
import pandas as pd

# Read the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Separate the target y from the training set
y_train = train.pop('y')

# Build an xgboost regression model with default parameters
reg = XGBRegressor()
reg.fit(train, y_train)
y_pred = reg.predict(test)

# Write the predictions to my_XGB_prediction.csv
submit['y'] = y_pred
submit.to_csv('my_XGB_prediction.csv', index=False)

4. Variable description

image.png

5. Evaluation metric

image.png
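
Assuming the metric shown in the image above is the usual root mean squared error (the baseline result is reported as an RMSE), it can be computed as in this small sketch. The function name and the sample numbers are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Errors are 2, 2 and 3, so the result is sqrt((4 + 4 + 9) / 3).
score = rmse([10, 20, 30], [12, 18, 33])
print(score)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.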

II. Data analysis
1. Check for missing values

import pandas as pd

# Read the data
train = pd.read_csv(r"D:\m\sofasofa\train.csv")
test = pd.read_csv(r"D:\m\sofasofa\test.csv")
submit = pd.read_csv(r"D:\m\sofasofa\sample_submit.csv")

train.info()  # info() prints the summary itself and returns None, so no print() is needed

About info():
Purpose: prints an overview of the DataFrame: number of rows, number of columns, column names, non-null count per column, column dtypes and memory usage.
Usage: data.info()
Further reading: https://blog.csdn.net/Dreamer_rx/article/details/100804378

Output:

image.png

The training set has 10,000 observations and no missing values.

2. Basic descriptive statistics for each variable

import pandas as pd

# Read the data
train = pd.read_csv(r"D:\m\sofasofa\train.csv")
test = pd.read_csv(r"D:\m\sofasofa\test.csv")
submit = pd.read_csv(r"D:\m\sofasofa\sample_submit.csv")

print(train.describe())

About describe():
Purpose: returns basic summary statistics of the data, including the mean, standard deviation, minimum, maximum and quartiles.
Usage: data.describe()
Further reading: https://blog.csdn.net/Dreamer_rx/article/details/100804378

Output:

image.png

3. Outlier analysis
Box plots of the features show whether outliers are present.
About box plots: https://blog.csdn.net/uinglin/article/details/79895993

import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

# Read the data
train = pd.read_csv(r"D:\m\sofasofa\train.csv")
test = pd.read_csv(r"D:\m\sofasofa\test.csv")
submit = pd.read_csv(r"D:\m\sofasofa\sample_submit.csv")

fig, axes = plt.subplots(nrows=3, ncols=2)   # a 3 x 2 grid of six subplots
fig.set_size_inches(12, 10)  # figure size in inches
sn.boxplot(data=train, y="y", orient="v", ax=axes[0][0])   # axes[0][0] is the subplot in row 0, column 0
sn.boxplot(data=train, y="y", x="city", orient="v", ax=axes[0][1])
sn.boxplot(data=train, y="y", x="hour", orient="v", ax=axes[1][0])
sn.boxplot(data=train, y="y", x="is_workday", orient="v", ax=axes[1][1])
sn.boxplot(data=train, y="y", x="weather", orient="v", ax=axes[2][0])
sn.boxplot(data=train, y="y", x="wind", orient="v", ax=axes[2][1])


axes[0][0].set(ylabel='bike',title="Box Plot On bike")
axes[0][1].set(xlabel='city', ylabel='bike',title="Box Plot On bike Across city")
axes[1][0].set(xlabel='Hour Of The Day', ylabel='bike',title="Box Plot On bike Across Hour Of The Day")
axes[1][1].set(xlabel='is_workday', ylabel='bike',title="Box Plot On bike Across is_workday")
axes[2][0].set(xlabel='weather', ylabel='bike',title="Box Plot On bike Across weather")
axes[2][1].set(xlabel='wind', ylabel='bike',title="Box Plot On bike Across wind")

plt.show()

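Why are there no box plots for temp_1 and temp_2?

My guess is that these two features take many distinct values, so a plot grouped by them would contain a large number of boxes and be too cluttered to read. The box plot for temp_1, for example, looks like this: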

image.png

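Output:

image.png

(weather: 1 = clear, 2 = cloudy or overcast, 3 = light precipitation, 4 = heavy precipitation)

The target y has many outliers above the upper whisker. The plots also suggest the following:
(1) "weather" box plot: judging by the median line, usage is lower when weather is 3 or 4 (light or heavy precipitation). This matches common sense: fewer people cycle in the rain.
(2) "Hour Of The Day" box plot: the median is higher at 7-8 in the morning and 5-6 in the evening, which are the rush hours for commuting to and from school and work.
(3) "is_workday" box plot: most outliers come from is_workday = 1, i.e. working days rather than non-working days.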

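4. Removing outliers
Subtract the mean of y (bicycle usage) from each y, treat the points whose absolute difference exceeds 3 standard deviations of y as outliers and drop them, and keep the rest. This threshold follows the three-sigma rule for the normal distribution: if y were normally distributed, a value more than 3 standard deviations from the mean would be very unlikely (about 0.3% of observations).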

import pandas as pd
import numpy as np

# Read the data
train = pd.read_csv(r"D:\m\sofasofa\train.csv")
test = pd.read_csv(r"D:\m\sofasofa\test.csv")
submit = pd.read_csv(r"D:\m\sofasofa\sample_submit.csv")

# Drop outliers: keep rows where |y - mean(y)| <= 3 * std(y)
train_without_outliers = train[np.abs(train["y"]-train["y"].mean())<=(3*train["y"].std())]  # Series.mean() and Series.std(); pandas std is the sample standard deviation (ddof=1)

# Compare the number of rows before and after dropping outliers
print("Shape Before Removing Outliers: ",train.shape)   # DataFrame.shape is (rows, columns): shape[0] is the row count, shape[1] the column count
print("Shape After Removing Outliers: ",train_without_outliers.shape)

Output:

image.png
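
To see the three-sigma filter in isolation, here is a small standard-library sketch on made-up numbers. The function remove_outliers_3sigma and the sample values are hypothetical and only mirror the pandas expression used above.

```python
import statistics

def remove_outliers_3sigma(values, k=3.0):
    """Keep values whose distance from the mean is at most k sample standard deviations."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (ddof=1), like pandas Series.std()
    return [v for v in values if abs(v - mean) <= k * std]

# Hypothetical hourly counts: twenty ordinary values and one extreme value.
values = [50] * 10 + [52] * 10 + [1000]
kept = remove_outliers_3sigma(values)
print(len(values), len(kept))
```

Note that the extreme value itself inflates both the mean and the standard deviation, so with very few observations a single outlier may not exceed the 3-sigma threshold at all.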

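5. Correlation analysis
Next, examine the relationships between the variables, in particular which features influence the dependent variable (the number of bicycles used), so that weakly related features can be dropped. To do this, plot a correlation heat map of y together with ["city","hour","is_workday","weather","temp_1","temp_2","wind"]; the strength of each pairwise correlation can be read directly from the plot.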

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

# Read the data
train = pd.read_csv(r"D:\m\sofasofa\train.csv")
test = pd.read_csv(r"D:\m\sofasofa\test.csv")
submit = pd.read_csv(r"D:\m\sofasofa\sample_submit.csv")

# Plot the correlation heat map
corrMatt = train[["city","hour","is_workday","weather","temp_1","temp_2","wind","y"]].corr()
# Hide the cells above the diagonal; the lower triangle and the diagonal stay visible
mask = np.triu(np.ones_like(corrMatt, dtype=bool), k=1)
fig,ax= plt.subplots()
fig.set_size_inches(20,10)
sn.heatmap(corrMatt, mask=mask,vmax=0.8, square=True,annot=True)
plt.show()

Output:

image.png

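This result was computed without removing the outliers found earlier. The correlation between bicycle usage y and is_workday is only 0.029, which surprised me, because I had expected some relationship between the two. I wondered whether removing the outliers would change the coefficient, so I dropped them and ran the analysis again. The result was the same, probably because the outliers are a small fraction of the training set and have little effect on the correlations.

The heat map leads to the following conclusions:
(1) hour, temp_1 and temp_2 have a weak positive correlation with y; wind has an even weaker positive correlation with y; city and weather have a weak negative correlation with y.
(2) is_workday is almost uncorrelated with y, so dropping this feature is worth considering.
(3) temp_1 and temp_2 are strongly correlated. This matches common sense, since the air temperature and the perceived temperature are close. The two predictors are strongly collinear, so one of them should probably be dropped.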

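6. Dropping weakly correlated features

# Drop features that are weakly correlated with y
dropFeatures = ['temp_2',"is_workday"]
train = train.drop(dropFeatures,axis=1)
test = test.drop(dropFeatures,axis=1)

Only two features are dropped here: temp_2, which is collinear with temp_1, and is_workday, whose correlation with y is very weak (only 0.029).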

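III. Notes before tuning

1. The xgboost parameters that will be tuned:

(1) learning_rate: the step size (shrinkage weight) applied at each boosting iteration. If it is too large, accuracy suffers; if it is too small, training is slow. Shrinking the weight of each step makes the model more robust. Typical values are 0.01-0.2.
(2) min_child_weight: the minimum sum of instance weights required in a leaf. It is used to prevent overfitting: a larger value keeps the model from learning patterns specific to a few local samples, but a value that is too high leads to underfitting.
(3) max_depth: the maximum depth of a tree, also used to prevent overfitting. Commonly tuned in the range 3-10.
(4) gamma: a node is split only if the split reduces the loss. gamma sets the minimum loss reduction required to make a split, so the larger it is, the more conservative the algorithm.
(5) subsample: the fraction of training rows sampled at random for each tree. Lower values make the algorithm more conservative and help prevent overfitting, but a value that is too small may cause underfitting. Commonly tuned in the range 0.5-1.
(6) colsample_bytree: the fraction of columns (each column is a feature) sampled at random for each tree. Commonly tuned in the range 0.5-1.
(7) reg_alpha: the L1 regularization term on the weights. It can be useful with very high-dimensional data, where it makes the algorithm run faster.
(8) reg_lambda: the L2 regularization term on the weights. It controls the regularization part of XGBoost.
Further reading: xgboost parameters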
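
For reference, the way gamma, reg_lambda and min_child_weight enter tree building can be summarised with the leaf-weight and split-gain formulas from the XGBoost paper. This is a sketch that ignores the L1 term reg_alpha. The symbols are assumptions of this note and do not appear in the original text: g_i and h_i are the first and second derivatives of the loss with respect to the current prediction for sample i, G and H are their sums over the samples in a node, and the subscripts L and R denote the left and right child.

```latex
w^{*} = -\frac{G}{H + \lambda},
\qquad
\mathrm{Gain} = \frac{1}{2}\left[
    \frac{G_L^{2}}{H_L + \lambda}
  + \frac{G_R^{2}}{H_R + \lambda}
  - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda}
\right] - \gamma
```

A split is worthwhile only when Gain > 0, so a larger gamma rejects more splits; a larger lambda shrinks the leaf weights w*; and a child is allowed only if its H is at least min_child_weight. With squared-error loss h_i = 1, so H is simply the number of samples in the node.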

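2. Some GridSearchCV parameters and attributes

(1) estimator: the model to tune (a regressor in this task), constructed with every parameter except the ones being searched.
(2) param_grid: the values to try for the parameters being optimized, given as a dict or a list of dicts, for example param_grid = param_test1 with param_test1 = {'n_estimators': range(10, 71, 10)}.
(3) iid: defaults to True in older scikit-learn versions. When True, the samples are assumed to be identically distributed across folds and the score is averaged over all samples (each fold weighted by its size) rather than over folds. The parameter was deprecated in scikit-learn 0.22 and removed in 0.24, so on recent versions it has to be left out of the calls below.
(4) cv: the cross-validation setting. The default None meant 3-fold in older versions and means 5-fold since scikit-learn 0.22; an integer sets the number of folds.
(5) best_score_: the best mean cross-validated score observed during the search.
(6) best_params_: the parameter combination that achieved the best score.
Other parameters and methods: "GridSearchCV (grid search): parameters, methods and examples"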

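3. Tuning relies on one tool, GridSearchCV (grid search). It automates the search: given candidate values for each parameter, it evaluates every combination with cross-validation and reports the best-scoring parameters and their score. Because the number of fits grows quickly with the size of the grid, it is best suited to small data sets.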
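
What a grid search does internally can be sketched in plain Python. This is a simplified stand-in, not scikit-learn's implementation: the helper names, the unshuffled contiguous folds and the toy one-feature ridge model (with made-up hyperparameters lam and fit_intercept) are all assumptions chosen to keep the example short.

```python
from itertools import product

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, valid
        start += size

def grid_search(fit, score, X, y, param_grid, n_folds=5):
    """Try every parameter combination; return the best one and its mean fold score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, valid in k_fold_indices(len(X), n_folds):
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            fold_scores.append(score(model, [X[i] for i in valid], [y[i] for i in valid]))
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:  # higher is better, as in scikit-learn
            best_params, best_score = params, mean_score
    return best_params, best_score

def fit_ridge_1d(X_train, y_train, lam, fit_intercept):
    """Toy model: one-feature ridge regression with an optional intercept."""
    x_mean = sum(X_train) / len(X_train) if fit_intercept else 0.0
    y_mean = sum(y_train) / len(y_train) if fit_intercept else 0.0
    sxy = sum((x - x_mean) * (t - y_mean) for x, t in zip(X_train, y_train))
    sxx = sum((x - x_mean) ** 2 for x in X_train)
    slope = sxy / (sxx + lam)
    intercept = y_mean - slope * x_mean
    return lambda x: slope * x + intercept

def neg_mse(model, X_valid, y_valid):
    """Negative mean squared error, so that larger means better."""
    return -sum((model(x) - t) ** 2 for x, t in zip(X_valid, y_valid)) / len(X_valid)

X = list(range(20))
y = [2 * x + 1 for x in X]  # noise-free line, so the unpenalized fit is exact
param_grid = {"lam": [0, 1, 10], "fit_intercept": [True, False]}
best_params, best_score = grid_search(fit_ridge_1d, neg_mse, X, y, param_grid)
print(best_params, best_score)
```

With 3 x 2 = 6 combinations and 5 folds this runs 30 fits, which is why the cost of a grid search grows so quickly as more parameters or values are added.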

4. Initial values for some parameters (other values are possible):

(1) learning_rate = 0.1
(2) max_depth = 5: this parameter is best kept between 3 and 10. The starting value here is 5, but other values work too; anything from 4 to 6 is a reasonable start.
(3) min_child_weight = 1
(4) gamma = 0
(5) subsample = colsample_bytree = 0.8
(6) scale_pos_weight = 1
(7) iid = False
(8) cv = 5

5. A general approach to parameter tuning

(1) Choose a relatively high learning rate. 0.1 usually works, but the ideal value can lie anywhere between 0.05 and 0.3 depending on the problem. Then find the ideal number of trees for that learning rate.

(2) With the learning rate and number of trees fixed, tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree).

(3) Tune the xgboost regularization parameters (reg_lambda, reg_alpha). They reduce model complexity and can thereby improve performance.

(4) Lower the learning rate and settle on the final parameters.

IV. Tuning with GridSearchCV
1. First, with learning_rate fixed, tune the matching n_estimators (number of trees)
(1) Tuning on the data after removing outliers and uninformative features

from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np

# Read the data
train = pd.read_csv(r"D:\m\sofasofa\train.csv")
test = pd.read_csv(r"D:\m\sofasofa\test.csv")
submit = pd.read_csv(r"D:\m\sofasofa\sample_submit.csv")

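# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Drop outliers: keep rows where |y - mean(y)| <= 3 * std(y)
# Note: train_without_outliers is not used again below, so the model is still fitted on train with the outliers included
train_without_outliers = train[np.abs(train["y"]-train["y"].mean())<=(3*train["y"].std())]

# Drop features that are weakly correlated with y
dropFeatures = ['temp_2',"is_workday"]
train = train.drop(dropFeatures,axis=1)
test = test.drop(dropFeatures,axis=1)

# Separate the target y from the training set
y_train = train.pop('y')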

param_test1 = {
    'n_estimators': range(100, 1000, 50)
}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.1, max_depth=5,
                                               min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                               nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

Output:
{'n_estimators': 100} 0.7464271268024434
The score is fairly low.
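
A note on what this score measures: no scoring argument is passed to GridSearchCV, so the reported best_score_ should be the mean over the folds of the estimator's own score() method, which for scikit-learn-style regressors such as XGBRegressor is the coefficient of determination R², not the RMSE used by the competition. A minimal sketch of R² on made-up numbers (the function below is written out for illustration rather than imported):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [10, 20, 30, 40]
perfect = r2_score(y_true, y_true)              # every prediction is exact
mean_only = r2_score(y_true, [25, 25, 25, 25])  # always predict the mean
print(perfect, mean_only)
```

A score of 1 means perfect predictions and 0 means no better than always predicting the mean, so values such as 0.75 or 0.90 are fractions of explained variance. To select parameters by RMSE instead, scikit-learn 0.22 and later accepts scoring='neg_root_mean_squared_error' in GridSearchCV.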

(2) Next, drop every feature whose correlation coefficient with bicycle usage y is around 0.1 or lower. Along with removing outliers, drop "city", "is_workday", "weather", "wind" and "temp_2", then tune n_estimators again

from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np

# Read the data
train = pd.read_csv(r"D:\m\sofasofa\train.csv")
test = pd.read_csv(r"D:\m\sofasofa\test.csv")
submit = pd.read_csv(r"D:\m\sofasofa\sample_submit.csv")

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Drop outliers: keep rows where |y - mean(y)| <= 3 * std(y)
# Note: train_without_outliers is not used again below, so the model is still fitted on train with the outliers included
train_without_outliers = train[np.abs(train["y"]-train["y"].mean())<=(3*train["y"].std())]

# Drop features that are weakly correlated with y
dropFeatures = ['city','is_workday','weather','wind','temp_2']
train = train.drop(dropFeatures,axis=1)
test = test.drop(dropFeatures,axis=1)

# Separate the target y from the training set
y_train = train.pop('y')

param_test1 = {
    'n_estimators': range(100, 1000, 50)
}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.1, max_depth=5,
                                               min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                               nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

Output:
{'n_estimators': 100} 0.6238675776837738
The score is even lower.

(3) Tuning directly on the raw data, without any preprocessing:

from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
import pandas as pd

# Read the data
train = pd.read_csv(r"D:\m\sofasofa\train.csv")
test = pd.read_csv(r"D:\m\sofasofa\test.csv")
submit = pd.read_csv(r"D:\m\sofasofa\sample_submit.csv")

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Separate the target y from the training set
y_train = train.pop('y')

param_test1 = {
    'n_estimators': range(100, 1000, 50)
}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.1, max_depth=5,
                                               min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                               nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

Output:
{'n_estimators': 200} 0.903184767739565

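These experiments show that tuning on the processed data gives a low score, while tuning on the unprocessed data gives a much higher one. One possible reason is that the data had already been cleaned, so little processing was needed. Another is that, although some features have a very low correlation with the dependent variable, removing them leaves too few features for the algorithm to predict well. All the tuning below therefore uses the raw data.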
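
One further possible explanation, which is an assumption and has not been checked on this data set: the Pearson coefficient only measures linear association, so a feature can have a correlation near zero with y and still matter to a tree model through its interaction with another feature. The sketch below uses entirely made-up demand curves, in which usage peaks at hour 8 on workdays and at hour 15 otherwise.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def demand(hour, workday):
    """Hypothetical usage: same shape on both day types, but a different peak hour."""
    peak = 8 if workday else 15
    return 100 - 5 * abs(hour - peak)

hours = list(range(24)) * 2
is_workday = [1] * 24 + [0] * 24
usage = [demand(h, w) for h, w in zip(hours, is_workday)]

r = pearson(is_workday, usage)
print(r)                            # the average usage is the same on both day types
print(demand(8, 1), demand(8, 0))   # yet at 8 o'clock the day type changes usage a lot
```

In this toy setting, dropping is_workday would remove exactly the information a tree needs to tell the two daily patterns apart, even though its correlation with usage is zero.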

2. Tuning max_depth and min_child_weight
These two parameters are tuned first because they have a large effect on the final result.
(1) Coarse search over a wide range

param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)

}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.05, n_estimators=2500,
                                                gamma=0, subsample=0.8, colsample_bytree=0.8,
                                               nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

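Output:
{'max_depth': 5, 'min_child_weight': 5} 0.9033017174911162
The score is slightly higher.
(2) Fine search over a narrow range
The coarse search suggests max_depth = 5 and min_child_weight = 5. Searching around these values may find something better. The range is extended by 1 on either side, because the coarse search used a step of 2.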

param_test1 = {
    'max_depth': [4, 5, 6],
    'min_child_weight': [4, 5, 6]

}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.1, n_estimators=200,
                                                gamma=0, subsample=0.8, colsample_bytree=0.8,
                                               nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

Output: {'max_depth': 6, 'min_child_weight': 5} 0.9037898616664826
The score has gone up a little.
3. Tuning gamma
With the other parameters adjusted, gamma can be tuned next. The range searched here is fairly narrow; other ranges are possible.

param_test1 = {
    'gamma':[i / 20.0 for i in range(0, 11)]
}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.1, n_estimators=200, max_depth= 6,
                                                min_child_weight=5, subsample=0.8, colsample_bytree=0.8,
                                               nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

Output:
{'gamma': 0.5} 0.9041761291614719
4. Tuning subsample and colsample_bytree

param_test1 = {
      'subsample': [i / 10.0 for i in range(5, 10)],
    'colsample_bytree': [i / 10.0 for i in range(5, 10)]
}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.1, n_estimators=200, max_depth= 6,
                                                min_child_weight=5, gamma=0.5,
                                               nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

Output:
{'colsample_bytree': 0.8, 'subsample': 0.8} 0.9041761291614719
5. Tuning the regularization parameters reg_alpha and reg_lambda

param_test1 = {
    'reg_alpha': [0, 0.001, 0.005, 0.01, 0.05, 0.1],
    'reg_lambda': [0.05, 0.1, 0.5, 1, 1.5, 2, 2.5, 3]


}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.1, n_estimators=200, max_depth=6, min_child_weight=5,
                                               gamma=0.5, subsample=0.8, colsample_bytree=0.8,
                                               nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

Output:
{'reg_alpha': 0.1, 'reg_lambda': 3} 0.9048829604868451

First submission: RMSE of 15.042

from xgboost import XGBRegressor
import pandas as pd

# Read the data
train = pd.read_csv(r"D:\m\sofasofa\train.csv")
test = pd.read_csv(r"D:\m\sofasofa\test.csv")
submit = pd.read_csv(r"D:\m\sofasofa\sample_submit.csv")

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)  # drops the id column in place

# Separate the target y from the training set
y_train = train.pop('y')

reg = XGBRegressor(learning_rate=0.1, n_estimators=200, max_depth=6,
                   min_child_weight=5, gamma=0.5, subsample=0.8, colsample_bytree=0.8,
                   reg_alpha=0.1, reg_lambda=3, nthread=4, scale_pos_weight=1, seed=27)

reg.fit(train, y_train)
y_pred = reg.predict(test)

# Write the predictions to my_LR_prediction.csv
submit['y'] = y_pred
submit.to_csv('my_LR_prediction.csv', index=False)

V. Further tuning
1. More tuning of the regularization parameters
(1) The ranges used for reg_alpha and reg_lambda above were too narrow (both best values, 0.1 and 3, sat at the upper edge of their grids), so this round searches a wider range for both

param_test1 = {
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100],
    'reg_lambda': [1e-5, 1e-2, 0.1, 1, 100]
}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.1, n_estimators=200, max_depth=6, min_child_weight=5,
                                               gamma=0.5, subsample=0.8, colsample_bytree=0.8,
                                               nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

Output: {'reg_alpha': 100, 'reg_lambda': 1} 0.906035918446076
(2) Narrow the range and keep tuning reg_alpha and reg_lambda

param_test1 = {
    'reg_alpha': range(0, 110, 10),
    'reg_lambda': [i / 10.0 for i in range(0, 11)]
}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.1, n_estimators=200, max_depth=6, min_child_weight=5,
                                               gamma=0.5, subsample=0.8, colsample_bytree=0.8,
                                               nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

Output:
{'reg_alpha': 90, 'reg_lambda': 0.6} 0.9063725963693422
This is the highest score so far.

The parameter values after the second round of tuning:

image.png

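Submitting this result gives an RMSE of 15.083, which is worse than the previous submission. The cause may be that reg_alpha is now too large: the larger reg_alpha is, the less likely the model is to overfit, but too large a value may make it underfit and generalize poorly. (This is only my own guess and may be wrong.)

2. Lower learning_rate and increase the matching number of trees, n_estimators. Since the parameters from the second round may cause underfitting, adding more trees is worth trying
(1) Set learning_rate = 0.05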

param_test1 = {
    'n_estimators': range(1000, 5000, 500)
}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.05, max_depth=6,
                                               min_child_weight=5, gamma=0.5, subsample=0.8, colsample_bytree=0.8,
                                               reg_alpha=90, reg_lambda=0.6, nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

Output:
{'n_estimators': 1000} 0.9044943682927349

Submission:

image.png

RMSE: 15.048

Next, fine-tune n_estimators over a range that extends to either side of 1000

param_test1 = {
    'n_estimators': range(100, 1500, 50)
}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.05, max_depth=6,
                                               min_child_weight=5, gamma=0.5, subsample=0.8, colsample_bytree=0.8,
                                               reg_alpha=90, reg_lambda=0.6, nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

Output:
{'n_estimators': 350} 0.906418990131387

Submission:

image.png

RMSE: 14.943

(2) Lower learning_rate further, to learning_rate = 0.01

param_test1 = {
    'n_estimators': range(1000, 5000, 500)
}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.01, max_depth=6,
                                               min_child_weight=5, gamma=0.5, subsample=0.8, colsample_bytree=0.8,
                                               reg_alpha=90, reg_lambda=0.6, nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

Output:
{'n_estimators': 2000} 0.9066264768539061

Submission:

image.png

RMSE: 14.852

3. The first gamma search used too narrow a range, while gamma can take any value from 0 to positive infinity, so this round searches a wider range

param_test1 = {
    'gamma':[0,1,2,3,4,5,6,7,8,9]
}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.01, n_estimators=2000, max_depth=6,
                                               min_child_weight=5, subsample=0.8, colsample_bytree=0.8,
                                               reg_alpha=90, reg_lambda=0.6, nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

Output: {'gamma': 5} 0.9066822275408573
Narrow the range and fine-tune gamma

param_test1 = {
    'gamma':[i / 10.0 for i in range(41, 60)]
}
gsearch1 = GridSearchCV(estimator=XGBRegressor(learning_rate=0.01, n_estimators=2000, max_depth=6,
                                               min_child_weight=5, subsample=0.8, colsample_bytree=0.8,
                                               reg_alpha=90, reg_lambda=0.6, nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, iid=False, cv=5)
gsearch1.fit(train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)

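Output:
{'gamma': 5.0} 0.9066822275408573

Submission:

image.png

RMSE: 14.853. The RMSE barely changes, so gamma may not have much effect on this model. It is also possible that the effect is small only because gamma was tuned after all the other parameters had been settled.
This raised a question for me: does tuning have to follow a fixed order, or is it fine to tune everything else first and come back to gamma? To find out, I repeated the whole procedure from the start in the original order, but this time searched a wider gamma range before fine-tuning it. The results:

Submission:

image.png

RMSE: 14.93
Submission:

image.png

RMSE: 14.852
14.852 is very close to the earlier 14.853, so the point at which gamma is tuned probably has little effect on the error.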

VI. Summary of results

image.png

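The lowest RMSE reached so far is 14.852, which is 4.095 lower than the baseline and ranks 50th. Corrections to any mistakes in this article are welcome.

One question came up during tuning: the parameters are tuned one at a time or one group at a time, so how can the individually best values be guaranteed to be best in combination? I found a good answer online:
→ The result is not the global optimum, but it reaches a local optimum and saves a lot of time. The precondition is that the parameters are grouped well, so that the groups are as independent of each other as possible. Put differently, trying every parameter combination is impossible; the aim is to find the best combination we can within limited time. In the same amount of time, group-wise tuning can cover a much wider range per parameter than a single joint grid, and the result is not necessarily worse.
This answer comes from the comment section of "XGBoost參數調優完全指南（附Python代碼）" (Complete Guide to Parameter Tuning in XGBoost, with Python code).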
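
The time saving mentioned in that answer can be made concrete by counting candidates. The sketch below takes the first (coarse) grid of each parameter group from the GridSearchCV section of this article and compares tuning the groups one after another with a single joint grid over the same values; the variable and function names are made up for this example.

```python
from math import prod

# First-pass grids of each parameter group, as used earlier in this article.
stages = [
    {"max_depth": range(3, 10, 2), "min_child_weight": range(1, 6, 2)},
    {"gamma": [i / 20.0 for i in range(0, 11)]},
    {"subsample": [i / 10.0 for i in range(5, 10)],
     "colsample_bytree": [i / 10.0 for i in range(5, 10)]},
    {"reg_alpha": [0, 0.001, 0.005, 0.01, 0.05, 0.1],
     "reg_lambda": [0.05, 0.1, 0.5, 1, 1.5, 2, 2.5, 3]},
]

def n_candidates(grid):
    """Number of parameter combinations in one grid."""
    return prod(len(values) for values in grid.values())

staged = sum(n_candidates(grid) for grid in stages)   # groups tuned one after another
joint = prod(n_candidates(grid) for grid in stages)   # one grid over everything
print(staged, joint)  # 96 158400
```

Each candidate is fitted once per fold, so with cv=5 the staged approach needs 480 fits where the joint grid would need 792,000.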

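Besides the links mentioned above, this article also draws on:
1. 數據競賽實戰（3）：公共自行車使用量預測 (Data Competition in Practice (3): Predicting Public Bicycle Usage)
2. sofasofa競賽：一公共自行車使用量預測 (sofasofa competition: predicting public bicycle usage)
3. 共享單車需求預測問題：分析篇 (Bike-sharing demand prediction: analysis)
4. 共享單車需求預測問題：建模篇 (Bike-sharing demand prediction: modelling)
5. XGBoost參數調優完全指南（附Python代碼） (Complete Guide to Parameter Tuning in XGBoost, with Python code)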