Next, following the original plan, I will start working through Kaggle machine learning projects, from easy to hard.
# coding: utf-8
# # Project Background
# Technology has been reshaping transportation.
# The dataset covers one year (2013-07-01 to 2014-06-30) of taxi trip trajectories in a single city.
#
# TRIP_ID: unique identifier for each trip
# CALL_TYPE: how the trip was requested. A: booked by phone through the central taxi dispatch office; B: demanded directly from a driver at a specific taxi stand; C: hailed on a random street
# ORIGIN_CALL: when CALL_TYPE='A', identifies the customer who requested the trip; empty otherwise
# ORIGIN_STAND: when CALL_TYPE='B', identifies the taxi stand where the trip started
# TAXI_ID: unique identifier for the taxi driver
# DAY_TYPE: three values. B: a holiday; C: the day before a holiday; A: a normal day, i.e. a workday
# MISSING_DATA: False when the GPS latitude/longitude stream is complete; True when data is missing
# POLYLINE: one (longitude, latitude) pair for every 15 s of the trip
#
# The total travel time of a trip (the prediction target of this competition) is defined as (number of points - 1) x 15 seconds. For example, a trip with 101 data points in POLYLINE lasts (101 - 1) * 15 = 1500 seconds. Some trips are missing data points in POLYLINE (flagged by the MISSING_DATA column), and how to exploit that information is one of the challenges.
#
# This is a regression problem: we need to predict the total travel time from departure to arrival at the destination.
#
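# A quick sanity check of the travel-time formula on a made-up three-point
# POLYLINE (illustrative values, not a real row from the dataset):
import json
sample_polyline = '[[-8.61,41.14],[-8.62,41.15],[-8.63,41.16]]'
print((len(json.loads(sample_polyline)) - 1) * 15)  # (3 - 1) * 15 = 30 seconds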
# In[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from numpy import sqrt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error as mse
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
# In[2]:
df=pd.read_csv(r'C:\Users\Lenovo\Desktop\hands-on-machine-learning\ml of finding job\train.csv')
df.head(10)
df.isnull().sum(axis=0)## ORIGIN_CALL and ORIGIN_STAND have heavy missingness: most trips are neither booked through the dispatch center nor picked up at a taxi stand
# In[16]:
df.info()# shows the dtype of every column
# In[19]:
## For numeric columns, describe() returns count, mean, std, min, the 1/4, 1/2 and 3/4 quantiles, and max.
"""
The include parameter restricts which dtypes are described:
object covers categorical-like columns, i.e. values such as 'a', 'b', 'c'
"""
df.describe(include=['object'])# per column: count of non-null entries, number of distinct values, the most frequent value, and how many times it appears
"""
結(jié)果顯示,關(guān)于出行時(shí)間
數(shù)據(jù)顯示都是在工作日
而且5901 沒法算其出行的時(shí)間
"""
# In[3]:
##將時(shí)間轉(zhuǎn)變?yōu)槟暝氯盏男问?學(xué)習(xí)fromtimesstamp的用法
df.sort_values(by='TIMESTAMP',axis=0,inplace=True)
df['year'] = df['TIMESTAMP'].apply(lambda x :datetime.datetime.fromtimestamp(x).year)
df['month'] = df['TIMESTAMP'].apply(lambda x :datetime.datetime.fromtimestamp(x).month)
df['month_day'] = df['TIMESTAMP'].apply(lambda x :datetime.datetime.fromtimestamp(x).day)
df['hour'] = df['TIMESTAMP'].apply(lambda x :datetime.datetime.fromtimestamp(x).hour)
df['week_day'] = df['TIMESTAMP'].apply(lambda x :datetime.datetime.fromtimestamp(x).weekday())
df.head(10)
type(df["year"].value_counts())# series類型
df['year'].value_counts().keys()#注意兩者連用
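# The row-wise .apply() calls above work but are slow on a frame this large; a
# vectorized sketch of the same extraction (note: pd.to_datetime treats the epoch
# as UTC, while datetime.fromtimestamp uses the local timezone, so hours can differ):
ts = pd.to_datetime(df['TIMESTAMP'], unit='s')
ts.dt.year.value_counts()  # same yearly counts as df['year'] when the local timezone is UTC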
# In[36]:
## pie chart
plt.figure(figsize=(10,10))
plt.pie(df["year"].value_counts(),labels=df["year"].value_counts().keys(),autopct='%.1f%%')
plt.show()# the two years each account for roughly half of the data
# In[40]:
plt.figure(figsize=(5,5))
plt.title('Count of trips per day of week')
sns.countplot(x="week_day",data=df)# the first argument (x or y) names the column to plot and the axis to draw it on; data is the input,
# usually a DataFrame
plt.xlabel("the day of week")
plt.ylabel("counts")
plt.show()
# In[41]:
plt.figure(figsize = (10,10))
plt.title('Count of trips per month')
sns.countplot(y = 'month', data = df)
plt.xlabel('Count')
plt.ylabel('Month')
## On average, every month has at least 120,000 taxi trips.
# In[42]:
plt.figure(figsize = (10,10))
plt.title('Count of trips per hour')
sns.countplot(x = 'hour', data = df)
plt.xlabel('Hour')
plt.ylabel('Count')
# In[4]:
## How much data is missing?
df["MISSING_DATA"].value_counts()
"""
10 trips have missing GPS data
"""
# Drop those rows.
"""
On the parameters of DataFrame.drop:
labels : single label or list-like
    Index or column labels to drop.
axis : {0 or 'index', 1 or 'columns'}, default 0
    Whether to drop labels from the index (0 or 'index') or columns (1 or 'columns').
index : single label or list-like
    Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
    To drop rows, pass the row index labels here, or pass them as labels with axis=0.
columns : single label or list-like
    Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
    To drop columns (i.e. fields of every row), pass the column names here, or pass them as labels with axis=1.
"""
df.drop(df[df["MISSING_DATA"]==True].index,axis=0,inplace=True)
df["MISSING_DATA"].unique()
df.drop(df[df["POLYLINE"]=='[]'].index,axis=0,inplace=True)
df["POLYLINE"].value_counts()
# In[8]:
## Convert the trajectory into a travel time.
df["polyline length"]=df["POLYLINE"].apply(lambda x :len(eval(x))-1)## number of 15 s intervals = number of coordinate pairs in the list minus 1
df["trip_time"]=df["polyline length"].apply(lambda x:x*15)
df.head(10)
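# POLYLINE holds valid JSON, so json.loads is a safer and noticeably faster
# drop-in for eval() here; a sketch that recomputes the same interval count:
import json
df["polyline length"] = df["POLYLINE"].apply(lambda x: len(json.loads(x)) - 1)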
# In[5]:
# one-hot encoding for CALL_TYPE
df = pd.get_dummies(df, columns=['CALL_TYPE'])# converts the column and joins the result back onto the frame; the row count is unchanged -- one column simply becomes several
"""
The result is three new columns:
CALL_TYPE_A, CALL_TYPE_B, CALL_TYPE_C
A row with CALL_TYPE A maps to [1, 0, 0], B to [0, 1, 0], and C to [0, 0, 1].
"""
df.head(10)### one-hot encoding turns a non-numeric object column into a vector form that is easier to work with
# In[7]:
df
"""
存在3個(gè)參數(shù)
subset 指定要去重的列名牍氛,列表的形式傳入晨继,比如['A','B'],就是表示 A,B兩類列重復(fù)的去掉,默認(rèn)為全部列
inplace 是否在原來的基礎(chǔ)上修改
keep first,last False 三個(gè)值搬俊,表示留下重復(fù)列的第一個(gè)紊扬,最后一個(gè),全部刪除
"""
df.head(10)
# In[6]:
#### Build the machine learning models
x=df[['polyline length','CALL_TYPE_A','CALL_TYPE_B','CALL_TYPE_C']]
y=df['trip_time']
# data standardization
s=StandardScaler()
x=s.fit_transform(x)## z-score scaling: subtract each feature's mean and divide by its standard deviation
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3)
print("The size of training input is", x_train.shape)
print("The size of training output is", y_train.shape)
print(50 *'*')
print("The size of testing input is", x_test.shape)
print("The size of testing output is", y_test.shape)
x_train
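# Note that the scaler above was fit on all of x before the split, which leaks the
# test set's mean/std into training. A leakage-free sketch of the same step
# (x_raw here stands for the unscaled feature frame selected above):
# x_train, x_test, y_train, y_test = train_test_split(x_raw, y, test_size=0.3)
# s = StandardScaler().fit(x_train)   # statistics from the training split only
# x_train, x_test = s.transform(x_train), s.transform(x_test)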
# In[31]:
# Several prediction models are compared below.
"""
First, a baseline that predicts the mean.
r2_score: R-squared = 1 - SS_res/SS_tot, where SS_tot = sum over samples of (actual - mean)^2; the closer to 1, the better the model.
"""
y_train_pred=np.ones(x_train.shape[0])*y_train.mean() # predict the training mean for every training sample
y_test_pred=np.ones(x_test.shape[0])*y_train.mean() # predict the training mean for every test sample
print("Train Results for Baseline Model:")
print(50 * '-')
print("root mean squared error as rmse", sqrt(mse(y_train.values,y_train_pred)))
print("R-squared: ", r2_score(y_train.values, y_train_pred))
# In[1]:
## the k-NN algorithm
"""
Two k-NN estimators exist. KNeighborsClassifier works as follows:
1) compute the distance between the test point and every training point;
2) sort the distances in increasing order;
3) take the K points with the smallest distances (K is chosen here by cross-validation via GridSearchCV);
4) count the class frequencies among those K points;
5) return the most frequent class among the K points as the predicted class of the test point.
The other estimator is KNeighborsRegressor. Its main idea: take a sample's K nearest neighbours and use the mean of their response values (y) as the prediction for that sample.
There are likewise DecisionTreeRegressor and, in SVM, SVR.
"""
"""
GridSearchCV
可以一次性的限定預(yù)測模型以及打分標(biāo)準(zhǔn)(score 參數(shù))
GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid=’warn’,
refit=True, cv=’warn’, verbose=0, pre_dispatch=‘2*n_jobs’,
error_score=’raise-deprecating’, return_train_score=False)
其中 param_grid 可以傳入一個(gè)列表或者字典
字典的話以參數(shù)為鍵名稱
KNeighborsRegressor(n_neighbors=5, weights=’uniform’, algorithm=’auto’,
leaf_size=30, p=2, metric=’minkowski’, metric_params=None, n_jobs=None, **kwargs)
"""
k_range=list(range(1,30))
param=dict(n_neighbors=k_range)# keyed by the KNeighborsRegressor parameter name, n_neighbors
knn_regressor=GridSearchCV(KNeighborsRegressor(),param,cv=10)
knn_regressor.fit(x_train,y_train)
print (knn_regressor.best_estimator_)
print(knn_regressor.best_params_)
# ## Cross-validation
#
# The first kind is simple (hold-out) cross-validation, "simple" relative to the other methods. We randomly split the sample into two parts (say 70% training set, 30% test set), train the model on the training set, and validate the model and its parameters on the test set. We then reshuffle the samples, pick a new training and test set, and train and validate again. Finally we use the loss function to select the best model and parameters.
#
# The second kind is S-fold cross-validation (S-Folder Cross Validation). Unlike the first method, it randomly splits the sample into S parts; each round takes S-1 of them as the training set and the remaining one as the test set. When a round finishes, a different choice of S-1 parts is used for training. After a number of rounds (up to S), we use the loss function to select the best model and parameters, as sketched in the code below.
#
# The third kind is leave-one-out cross-validation (Leave-one-out Cross Validation), the special case of the second in which S equals the sample size N: for N samples, each round trains on N-1 of them and leaves one sample out to check how well the model predicts. It is mainly used when the sample is very small; for an ordinary problem, I generally use leave-one-out when N is below 50.
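# A minimal illustration of S-fold cross-validation with S=5: cross_val_score splits
# the data into 5 folds and lets each fold take one turn as the held-out set.
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(Ridge(alpha=1.0), x_train, y_train, cv=5,
                            scoring='neg_mean_squared_error')
print("mean CV score:", cv_scores.mean())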
# In[ ]:
params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]}
ridge_regressor = GridSearchCV(Ridge(), params, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
ridge_regressor.fit(x_train, y_train)
y_train_pred = ridge_regressor.predict(x_train) ## Predict train result
y_test_pred = ridge_regressor.predict(x_test) ## Predict test result
print("Train Results for Ridge Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_train.values, y_train_pred)))
print("R-squared: ", r2_score(y_train.values, y_train_pred))
# In[ ]:
##Lasso Regression
params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]}
lasso_regressor = GridSearchCV(Lasso(), params, cv=15, scoring='neg_mean_absolute_error', n_jobs=-1)
lasso_regressor.fit(x_train, y_train)
y_train_pred = lasso_regressor.predict(x_train) ## Predict train result
y_test_pred = lasso_regressor.predict(x_test) ## Predict test result
print("Train Results for Lasso Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_train.values, y_train_pred)))
print("R-squared: ", r2_score(y_train.values, y_train_pred))
# In[ ]:
#Decision Tree Regression
depth = list(range(3, 30))
param_grid = dict(max_depth=depth)
tree = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=10)
tree.fit(x_train, y_train)
y_train_pred = tree.predict(x_train) ## Predict train result
y_test_pred = tree.predict(x_test) ## Predict test result
print("Train Results for Decision Tree Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_train.values, y_train_pred)))
print("R-squared: ", r2_score(y_train.values, y_train_pred))
# In[ ]:
tuned_params = {'max_depth': [1, 2, 3, 4, 5], 'learning_rate': [0.01, 0.05, 0.1], 'n_estimators': [100, 200, 300, 400, 500], 'reg_lambda': [0.001, 0.1, 1.0, 10.0, 100.0]}
model = RandomizedSearchCV(XGBRegressor(), tuned_params, n_iter=20, scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)
model.fit(x_train, y_train)
y_train_pred = model.predict(x_train)
y_test_pred = model.predict(x_test)
print("Train Results for XGBoost Regressor Model:")
print(50 * '-')
print("Root mean squared error: ", sqrt(mse(y_train.values, y_train_pred)))
print("R-squared: ", r2_score(y_train.values, y_train_pred))
Data download and original project: https://www.kaggle.com/akshaychavan123/taxi-trip-time-prediction
If you are interested, try implementing it yourself; this is an entry-level hands-on project.