08 Predicting Flight Ticket Prices with Six Regression Models

Importing libraries

import pandas as pd
import numpy as np
pd.set_option("display.max_columns",33)

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import accuracy_score,confusion_matrix

import warnings
warnings.filterwarnings("ignore")

Basic data information

df = pd.read_excel("Data_Train.xlsx")
df.head()

df.shape

df.isnull().sum()

df.dtypes

columns = df.columns.tolist()
columns

Meaning of each field:
Airline: the airline
Date_of_Journey: the date the passenger's journey starts
Source: the departure city
Destination: the destination city
Route: the flight route
Dep_Time: departure time
Arrival_Time: arrival time
Duration: total time the flight takes from departure to destination
Total_Stops: total number of stops
Additional_Info: other information, such as meals and amenities
Price: ticket price for the whole journey

df.info()

df.describe()

Handling missing values

import missingno as mso

mso.bar(df,color="blue")

plt.show()

# Drop rows with missing values
df.dropna(inplace=True)

df.isnull().sum()

Processing time-related fields

# Time handling
# pd.to_datetime() converts string columns directly to datetime
# dt.day and dt.month then give the day and the month
def change_to_datetime(col):
    df[col] = pd.to_datetime(df[col])

for col in ["Date_of_Journey","Dep_Time","Arrival_Time"]:
    change_to_datetime(col)
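
The date format of the dataset is not shown here. If Date_of_Journey holds day-first strings such as 24/03/2019 (an assumption), calling pd.to_datetime without a format can read an ambiguous value like 01/03/2019 as 3 January. A minimal sketch with hypothetical sample values that states the format explicitly:

```python
import pandas as pd

# Hypothetical sample values; the real column contents are not shown above.
dates = pd.Series(["01/03/2019", "24/03/2019"])

# An explicit day/month/year format removes any ambiguity about which number is the day.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

print(parsed.dt.day.tolist())    # [1, 24]
print(parsed.dt.month.tolist())  # [3, 3]
```

If the real data uses a different layout, the format string needs to change to match it.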

df.dtypes

# Extract day and month
df["day"] = df["Date_of_Journey"].dt.day
df["month"] = df["Date_of_Journey"].dt.month
df.head()

df.drop("Date_of_Journey",axis=1,inplace=True)

# Process departure and arrival times
def extract_hour(data,col):
    data[col+ "_hour"] = data[col].dt.hour
    
def extract_minute(data,col):
    data[col+ "_minute"] = data[col].dt.minute
    
def drop_col(data,col):
    data.drop(col,axis=1,inplace=True)

extract_hour(df,"Dep_Time")
extract_minute(df,"Dep_Time")
drop_col(df,"Dep_Time")

extract_hour(df,"Arrival_Time")
extract_minute(df,"Arrival_Time")
drop_col(df,"Arrival_Time")

df.head()

# Flight duration
# 1. Normalize every duration to the same "0h 1m" form
# (an earlier loop-based version is kept commented out below for reference)
# duration = list(df["Duration"])

# for i in range(len(duration)):
#     if len(duration[i].split(' ')) == 2:
#         pass
#     else:
#         if 'h' in duration[i]:
#             duration[i] = duration[i] + ' 0m'
#         else:
#             duration[i] = '0h ' + duration[i]

def change_duration(x):
    if "h" in x and "m" in x:
        return x
    else:
        if "h" in x:
            return x + " 0m"
        else:
            return "0h " + x
        
df["Duration"] = df["Duration"].apply(change_duration)
df.head()

# 2. Extract hours and minutes from the Duration field
df1 = df["Duration"].str.extract(r'(?P<dur_hour>\d+)h (?P<dur_minute>\d+)m')
df1.head()
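
The two duration steps can be tried on a few hand-written strings. This is a small sketch with hypothetical sample values; the dur_total_minutes column is an illustrative extra and is not used elsewhere in this article.

```python
import pandas as pd

# Hypothetical sample values covering the three shapes the normalizer handles.
durations = pd.Series(["2h 50m", "19h", "45m"])

def change_duration(x):
    """Pad a duration so it always has both an hour and a minute part."""
    if "h" in x and "m" in x:
        return x
    if "h" in x:
        return x + " 0m"
    return "0h " + x

normalized = durations.apply(change_duration)

# Named groups become column names; the captured digits are strings until cast.
parts = normalized.str.extract(r"(?P<dur_hour>\d+)h (?P<dur_minute>\d+)m").astype(int)

# Illustrative only: a single combined feature.
parts["dur_total_minutes"] = parts["dur_hour"] * 60 + parts["dur_minute"]
print(parts)
```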

df = df.join(df1)
df.head()

df.drop("Duration",inplace=True,axis=1)

# 3. Type conversion: check the dtypes of dur_hour and dur_minute before and after casting
df.dtypes

df["dur_hour"] = df["dur_hour"].astype(int)
df["dur_minute"] = df["dur_minute"].astype(int)

df.dtypes

Encoding fields

# 1. String (object) columns
column = [column for column in df.columns if df[column].dtype == "object"]
column

# 2. Numeric (continuous) columns
continuous_col = [column for column in df.columns if df[column].dtype != "object"]
continuous_col

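Two encoding techniques:
Nominal data: the categories have no inherent order; use one-hot encoding.
Ordinal data: the categories have an inherent order; use label (integer) encoding, for example LabelEncoder or an explicit mapping.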
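
The difference between the two techniques is easier to see on a tiny frame. The sketch below uses hypothetical sample rows; dtype=int is passed only so the dummy columns print as 0/1.

```python
import pandas as pd

# Hypothetical sample rows, not taken from the dataset.
sample = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo", "SpiceJet"],
    "Total_Stops": ["non-stop", "2 stops", "1 stop", "non-stop"],
})

# Nominal: one column per category; drop_first removes the first (alphabetical) category.
one_hot = pd.get_dummies(sample["Airline"], drop_first=True, dtype=int)

# Ordinal: an explicit mapping keeps the natural order of the categories.
stop_order = {"non-stop": 0, "1 stop": 1, "2 stops": 2}
ordinal = sample["Total_Stops"].map(stop_order)

print(one_hot)
print(ordinal.tolist())  # [0, 2, 1, 0]
```

Any value missing from the mapping becomes NaN after map, so it is worth checking isnull().sum() afterwards.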

# Build a DataFrame of the categorical columns (copy so later edits do not touch df)
categorical = df[column].copy()
categorical.head()

Encoding each field

# Airline
# 1. Number of flights per airline
airline = categorical["Airline"].value_counts().reset_index()
airline

# 2. Relationship between airline and price
plt.figure(figsize=(15,8))

sns.boxplot(x="Airline",y="Price",data=df.sort_values("Price",ascending=False))

plt.show()

Jet Airways Business has the highest ticket prices.
The median prices of the other airlines are fairly close to one another.

# 3. One-hot encode
Airline = pd.get_dummies(categorical["Airline"],drop_first=True)
Airline.head()

# Total_Stops
# 1. Relationship with price
plt.figure(figsize=(15,8))

sns.boxplot(x="Total_Stops",y="Price",data=df.sort_values("Price",ascending=False))

plt.show()

# 2. Apply a hand-written ordinal mapping, unlike the one-hot encoding used for Airline
dict_stops = {"non-stop":0, "1 stop":1, "2 stops":2, "3 stops":3, "4 stops":4}
categorical["Total_Stops"] = categorical["Total_Stops"].map(dict_stops)
categorical.head()

# Source
# Relationship between source and price
# catplot is figure-level and creates its own figure, so size it with height/aspect
sns.catplot(x="Source",y="Price",data=df.sort_values("Price",ascending=False),kind="boxen",height=8,aspect=1.5)

plt.show()

# One-hot encoding
source = pd.get_dummies(categorical["Source"],drop_first=True)
source.head()

# Destination
# Relationship between destination and price
plt.figure(figsize=(18, 12))

sns.boxplot(x="Destination",
           y="Price",
           data=df.sort_values("Price", ascending=False))

plt.show()

# One-hot encoding
destination = pd.get_dummies(categorical["Destination"], drop_first=True)
destination.head()

# Route
# 1. Number of flights per route
categorical["Route"].value_counts()

# 2. Extract the place names in each route
# The output above shows that the longest route contains 5 place names, so extract all 5 positions at once
# Positions that do not exist for a route come back as NaN
categorical["Route1"] = categorical["Route"].str.split("→").str[0]
categorical["Route2"] = categorical["Route"].str.split("→").str[1]
categorical["Route3"] = categorical["Route"].str.split("→").str[2]
categorical["Route4"] = categorical["Route"].str.split("→").str[3]
categorical["Route5"] = categorical["Route"].str.split("→").str[4]
categorical.head()
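
As a possible alternative to five separate split calls, str.split with expand=True returns all segments in one pass. The sketch assumes the route strings have spaces around the arrow (the raw values are not shown above) and uses hypothetical sample routes.

```python
import pandas as pd

# Hypothetical sample values; the spaces around the arrow are an assumption.
routes = pd.Series(["BLR → DEL", "CCU → IXR → BBI → BLR → DEL"])

# One split call yields one column per segment; short routes are padded with missing values.
parts = routes.str.split("→", expand=True)

# Strip stray whitespace so "BLR " and "BLR" are not treated as different places.
parts = parts.apply(lambda s: s.str.strip())

# Force exactly five columns, then give them the same names used in this article.
parts = parts.reindex(columns=range(5))
parts.columns = [f"Route{i + 1}" for i in range(5)]

# Same placeholder as in the article for segments that do not exist.
parts = parts.fillna("None")
print(parts)
```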

# 3. Columns with missing values
categorical.drop("Route", axis=1, inplace=True)
categorical.isnull().sum()

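# Assign the result back; calling fillna(inplace=True) on a selected column may not update the frame in newer pandas
for i in ["Route3", "Route4", "Route5"]:
    categorical[i] = categorical[i].fillna("None")

# 4. Label encoding with LabelEncoder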
from sklearn import preprocessing

le =preprocessing.LabelEncoder()
for i in ["Route1", "Route2", "Route3", "Route4", "Route5"]:
    categorical[i] = le.fit_transform(categorical[i])
    
categorical.head()
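
Because a single LabelEncoder is refit for every column in the loop above, only the mapping of the last column survives. If the codes need to be turned back into place names later, one option is to keep an encoder per column. A minimal sketch with hypothetical sample values (the encoders dict is an illustrative name):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values.
routes = pd.DataFrame({
    "Route1": ["DEL", "BLR", "BLR"],
    "Route2": ["COK", "DEL", "BOM"],
})

encoders = {}            # one fitted encoder per column
encoded = routes.copy()
for col in routes.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(routes[col])

# Codes follow the sorted order of the values seen in each column,
# so "DEL" is 1 in Route1 but 2 in Route2.
print(encoded)

# The stored encoder can map codes back to the original labels.
decoded = encoders["Route1"].inverse_transform(encoded["Route1"])
print(list(decoded))  # ['DEL', 'BLR', 'BLR']
```

The integer codes carry an alphabetical order that has no real meaning for place names, which tree-based models tend to tolerate better than linear or distance-based ones.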

# Arrival_Time_hour
# Relationship between arrival hour and price
df.plot.hexbin(x="Arrival_Time_hour", y="Price", gridsize=15)

plt.show()

Preparing data for modeling

# Drop unneeded columns
# All columns generated so far
categorical.columns

# Drop the original columns that have already been encoded or are not used
drop_col(categorical, "Airline")
drop_col(categorical, "Source")
drop_col(categorical, "Destination")
drop_col(categorical, "Additional_Info")

# Final dataset
final_df = pd.concat([categorical, Airline, source, destination, df[continuous_col]], axis=1)
final_df.head()

# Outlier detection
# Inspect the final dataset for outliers
def plot(data, col):
    fig, (ax1, ax2) = plt.subplots(2, 1)
    # histplot with a KDE curve replaces the deprecated distplot
    sns.histplot(data[col], kde=True, ax=ax1)
    sns.boxplot(x=data[col], ax=ax2)
    
plot(final_df, "Price")

# Replace outliers (Price >= 40000) with the median, then check the result
final_df["Price"] = np.where(final_df["Price"]>=40000,
                            final_df["Price"].median(),
                            final_df["Price"])
plot(final_df, "Price")
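
The effect of the np.where replacement can be checked on a handful of numbers. The prices below are hypothetical; only the 40000 threshold comes from the code above.

```python
import numpy as np
import pandas as pd

prices = pd.Series([3000, 5000, 7000, 80000])  # hypothetical sample values
threshold = 40000                              # same cut-off as above

# The median is computed on the full column, outlier included.
median_price = prices.median()  # 6000.0

# Values at or above the threshold are swapped for the median; the rest are kept.
capped = pd.Series(np.where(prices >= threshold, median_price, prices))
print(capped.tolist())  # [3000.0, 5000.0, 7000.0, 6000.0]
```

The median is used rather than the mean because the mean itself is pulled upward by the very outliers being replaced.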

# Train/test split
X = final_df.drop("Price", axis=1)
y = final_df["Price"]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)

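Feature selection

# Price is continuous, so use mutual_info_regression rather than mutual_info_classif
from sklearn.feature_selection import mutual_info_regression

imp = pd.DataFrame(mutual_info_regression(X, y), index=X.columns)
imp.columns = ["importance"]
imp.sort_values(by="importance", ascending=False)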
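
To get a feel for what the mutual information scores mean, here is a sketch on synthetic data in which one feature drives the target and the other is pure noise. All names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)

# Synthetic features: "informative" determines the target, "noise" is unrelated to it.
X_demo = pd.DataFrame({
    "informative": rng.uniform(0, 1, 500),
    "noise": rng.uniform(0, 1, 500),
})
y_demo = 3 * X_demo["informative"] + rng.normal(0, 0.05, 500)

scores = pd.Series(
    mutual_info_regression(X_demo, y_demo, random_state=0),
    index=X_demo.columns,
    name="importance",
)
print(scores.sort_values(ascending=False))
```

The informative feature should score well above the noise feature, whose score stays close to zero.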

Evaluation metrics

# r2_score (the main metric), mean_absolute_error, mean_squared_error
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def predict(ml_model):
    print("Model is: ", ml_model)
    model = ml_model.fit(X_train, y_train)
    print("Training score: ", model.score(X_train, y_train))
    
    predictions = model.predict(X_test)
    print("Predictions: ", predictions)
    print("----------")
    
    r2score = r2_score(y_test, predictions)
    print("r2 score is: ", r2score)
    print("MAE: ", mean_absolute_error(y_test, predictions))
    print("MSE: ", mean_squared_error(y_test, predictions))
    print("RMSE: ", np.sqrt(mean_squared_error(y_test, predictions)))
    
    # Distribution of the residuals; histplot replaces the deprecated distplot
    sns.histplot(y_test - predictions, kde=True)
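
For reference, the three error metrics printed by predict can be reproduced by hand with NumPy. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration.
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

errors = y_true - y_hat          # [ 1.  0. -2.]
mae = np.mean(np.abs(errors))    # mean absolute error: (1 + 0 + 2) / 3
mse = np.mean(errors ** 2)       # mean squared error:  (1 + 0 + 4) / 3
rmse = np.sqrt(mse)              # back in the same units as the target

print(mae, mse, rmse)
print(mean_absolute_error(y_true, y_hat), mean_squared_error(y_true, y_hat))
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two hints at a few badly mispredicted fares.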

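Modeling

# Import the models
# Linear models (LinearRegression is the regressor; LogisticRegression is a classifier)
from sklearn.linear_model import LinearRegression, LogisticRegression
# K-nearest neighbors regression
from sklearn.neighbors import KNeighborsRegressor
# Decision tree regression
from sklearn.tree import DecisionTreeRegressor
# Support vector regression
from sklearn.svm import SVR
# Gradient boosting regression and random forest regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Random forest regression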
predict(RandomForestRegressor())

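# Linear regression
# LogisticRegression is a classifier and would treat each distinct price as a class,
# so LinearRegression is used for this continuous target
from sklearn.linear_model import LinearRegression
predict(LinearRegression())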

# K-nearest neighbors regression
predict(KNeighborsRegressor())

# Decision tree regression
predict(DecisionTreeRegressor())

# Support vector regression
predict(SVR())

# Gradient boosting regression
predict(GradientBoostingRegressor())

Model tuning

# Hyperparameter search
# Tune with randomized search
from sklearn.model_selection import RandomizedSearchCV

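random_grid = {
    "n_estimators":[100, 120, 150, 180, 200, 220],
    # "auto" has been removed from recent scikit-learn releases; for regressors it meant all features, i.e. 1.0
    "max_features":[1.0, "sqrt"],
    "max_depth":[5, 10, 15, 20]
}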

rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, cv=3, verbose=2, n_jobs=-1)
rf_random.fit(X_train, y_train)
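
The same search can be tried end to end on synthetic data, which makes it easy to experiment without the flight dataset. This is only a sketch: the data comes from make_regression, the grid is deliberately small so it runs quickly, and n_iter and random_state are choices made here rather than settings from this article.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression problem standing in for the flight data.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A small grid so the sketch runs in a few seconds.
param_distributions = {
    "n_estimators": [20, 40, 60],
    "max_features": [1.0, "sqrt"],
    "max_depth": [5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled parameter combinations
    cv=3,
    random_state=0,    # makes the sampled combinations repeatable
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # R² of the refitted best model on held-out data
```

RandomizedSearchCV samples only n_iter combinations instead of trying the whole grid, which is why it scales better than GridSearchCV as the grid grows.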

rf_random.best_params_

# Results after tuning
prediction = rf_random.predict(X_test)
# histplot with a KDE curve replaces the deprecated distplot
sns.histplot(y_test - prediction, kde=True)

r2_score(y_test, prediction)

Two common ways to compute R²

# Compute it indirectly in Python
from sklearn.metrics import mean_squared_error

y_test = [1, 2, 3]
y_pred = [1.3, 2.1, 3.5]
1 - mean_squared_error(y_test, y_pred)/np.var(y_test)

# Compute it directly with sklearn
from sklearn.metrics import r2_score

y_test = [1, 2, 3]
y_pred = [1.3, 2.1, 3.5]
r2_score(y_test, y_pred)
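
Both approaches give the same number because R² = 1 - SS_res / SS_tot, and dividing both sums by n turns them into the MSE and the population variance (np.var defaults to ddof=0). A short check using the same three values:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.3, 2.1, 3.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares: 0.09 + 0.01 + 0.25
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 1 + 0 + 1

r2_manual = 1 - ss_res / ss_tot
print(r2_manual)                # approximately 0.825
print(r2_score(y_true, y_hat))  # same value
```

Using a sample variance (ddof=1) in the indirect formula would give a slightly different and incorrect result.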

Source: 尤而小屋

Last edited: 2023.04.02 21:20:36
