Kaggle|Exercise8|Pipelines

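The most convenient thing about pipelines is that a pipeline wraps and manages every step as one streamlined workflow (streaming workflows with pipelines), so the parameters learned on the training data can easily be reused on new data such as the test set.
See https://zhuanlan.zhihu.com/p/42368821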
In this exercise, you will use pipelines to improve the efficiency of your machine learning code.
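
As a rough illustration of the reuse described above, the sketch below fits a two-step pipeline on a tiny made-up dataset and then calls it on new rows. The data, the variable names (`X_train_toy`, `toy_pipeline`, and so on), and the choice of `LinearRegression` are assumptions for illustration only and are not part of the exercise.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Made-up training data: one feature with one missing value
X_train_toy = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y_train_toy = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Made-up new data that also has a missing value
X_new_toy = np.array([[np.nan], [6.0]])

# Imputation and modeling bundled into a single object
toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),
])

# fit() learns the imputation value and the model from the training data only
toy_pipeline.fit(X_train_toy, y_train_toy)
learned_fill_value = toy_pipeline.named_steps["imputer"].statistics_[0]

# predict() reuses the learned imputation value on the new rows
preds_new = toy_pipeline.predict(X_new_toy)

print("Fill value learned from training data:", learned_fill_value)
print("Predictions on new data:", preds_new)
```

The point of the sketch is that no preprocessing code is repeated for the new rows: the single `predict` call applies the already-fitted imputer before the model.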

Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex4 import *
print("Setup Complete")

You will work with data from the Housing Prices Competition for Kaggle Learn Users.

Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y, 
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
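
To make the column selection above concrete, here is a small sketch on a made-up DataFrame. The column names and the threshold `max_cardinality = 3` are hypothetical; the exercise itself uses a threshold of 10 on the real housing data.

```python
import pandas as pd

# Made-up data: two text columns and two numerical columns
toy = pd.DataFrame({
    "street_type": pd.Series(["Pave", "Grvl", "Pave", "Pave"], dtype=object),
    "address": pd.Series(["1 A St", "2 B St", "3 C St", "4 D St"], dtype=object),
    "lot_area": pd.Series([8450, 9600, 11250, 9550], dtype="int64"),
    "lot_frontage": pd.Series([65.0, 80.0, None, 60.0], dtype="float64"),
})

max_cardinality = 3  # hypothetical threshold for this toy example

# Text columns with few unique values are cheap to one-hot encode
low_card_cols = [c for c in toy.columns
                 if toy[c].dtype == "object" and toy[c].nunique() < max_cardinality]

# Numerical columns are selected by dtype alone
numeric_cols = [c for c in toy.columns
                if toy[c].dtype in ["int64", "float64"]]

print("Low-cardinality categorical columns:", low_card_cols)
print("Numerical columns:", numeric_cols)
```

Here `address` is dropped because every row has a different value, so one-hot encoding it would add one column per row without helping the model.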

The next code cell uses code from the tutorial to preprocess the data and train a model (data preprocessing and modeling). Run this code without changes.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])  

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model 
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

# Preprocessing of training data, fit model 
clf.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)

print('MAE:', mean_absolute_error(y_valid, preds))

# Output:
MAE: 17861.780102739725

The code yields a value around 17862 for the mean absolute error (MAE). In the next step, you will amend the code to do better.
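
For readers who want to see what the MAE figure measures, the following sketch computes it by hand on four made-up prices. The helper name `mean_absolute_error_manual` and the numbers are assumptions for illustration; the exercise itself uses `sklearn.metrics.mean_absolute_error`.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one value is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up sale prices and predictions
actual_prices = [200000, 150000, 300000, 250000]
predicted_prices = [210000, 140000, 330000, 240000]

toy_mae = mean_absolute_error_manual(actual_prices, predicted_prices)
print("MAE:", toy_mae)
```

An MAE of about 17862 on the validation set therefore means the predictions are off by roughly that many dollars on average, in either direction.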

Step 1: Improve the performance

Part A

Now, it's your turn! In the code cell below, define your own preprocessing steps and random forest model. Fill in values for the following variables:

  • numerical_transformer
  • categorical_transformer
  • model

To pass this part of the exercise, you need only define valid preprocessing steps and a random forest model.

# Preprocessing for numerical data. Isn't this just filling in missing values? What else is meant here?
numerical_transformer = SimpleImputer(strategy='median') # Your code here: impute numerical variables with the median

# Preprocessing for categorical data has two parts, imputation and encoding, which can be bundled in a pipeline
# Note: sparse=False was added here; in scikit-learn 1.2 and later this parameter is named sparse_output
categorical_transformer = Pipeline(steps=[
    ('imputer',SimpleImputer(strategy='most_frequent')),
    ('onehot',OneHotEncoder(handle_unknown='ignore',sparse=False))]) # Your code here

# Bundle preprocessing for numerical and categorical data with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0) # Your code here

# Check your answer
step_1.a.check()
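
The difference between the tutorial's `strategy='constant'` and the `strategy='median'` chosen above can be seen on a single made-up column. This is only a sketch; the values and the names `column` and `filled` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numerical column with one missing entry and one large outlier
column = np.array([[1.0], [2.0], [np.nan], [100.0]])

filled = {}
for strategy in ["constant", "mean", "median"]:
    imputer = SimpleImputer(strategy=strategy)
    # Record the value that replaced the missing entry (row index 2)
    filled[strategy] = imputer.fit_transform(column)[2, 0]

for strategy, value in filled.items():
    print(strategy, "->", value)
```

With the default `fill_value`, the constant strategy fills numerical gaps with 0, while the median is less sensitive to the outlier than the mean. Which one gives a lower MAE on the housing data still has to be checked empirically.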

TO BE CONTINUED====================

Part B

Run the code cell below without changes.

To pass this step, you need to have defined a pipeline in Part A that achieves lower MAE than the code above. You're encouraged to take your time here and try out many different approaches, to see how low you can get the MAE! (If your code does not pass, please amend the preprocessing steps and model in Part A.)
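
One possible way to try several approaches is to loop over candidate settings and compare validation MAE. The sketch below does this on synthetic data; the data, the candidate list, and all variable names are assumptions for illustration and say nothing about which settings work best on the housing data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic regression data with roughly 10% of the entries knocked out
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, train_size=0.8, test_size=0.2, random_state=0)

# Hypothetical candidates: (imputation strategy, number of trees)
candidates = [("constant", 50), ("median", 50), ("median", 200)]

results = {}
for strategy, n_trees in candidates:
    candidate_pipeline = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy=strategy)),
        ("model", RandomForestRegressor(n_estimators=n_trees, random_state=0)),
    ])
    candidate_pipeline.fit(X_tr, y_tr)
    results[(strategy, n_trees)] = mean_absolute_error(
        y_va, candidate_pipeline.predict(X_va))

best_setting = min(results, key=results.get)
for setting, mae in results.items():
    print(setting, "MAE:", round(mae, 4))
print("Lowest validation MAE:", best_setting)
```

Because every candidate is a complete pipeline, each one is fitted and scored with the same two calls, which keeps the comparison fair and the loop short.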

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

# Check your answer
step_1.b.check()

Step 2: Generate test predictions

Now, you'll use your trained model to generate predictions with the test data.

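# Preprocessing of test data, get predictions
# The most convenient thing about a pipeline is that the test set gets the same treatment as the training set without repeating any code.
preds_test = my_pipeline.predict(X_test) # Your code here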

# Check your answer
step_2.check()
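
One reason the single `predict` call can cope with test data is the `handle_unknown='ignore'` setting on the encoder: a category that never appeared during fitting is encoded as all zeros instead of raising an error. The sketch below shows this on made-up category codes; the arrays and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column: "Ex" appears only in the test data
train_quality = np.array([["Gd"], ["TA"], ["Gd"]], dtype=object)
test_quality = np.array([["TA"], ["Ex"]], dtype=object)

# With handle_unknown="ignore", unseen categories become an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_quality)
encoded_test = encoder.transform(test_quality).toarray()
print("Categories learned:", list(encoder.categories_[0]))
print(encoded_test)

# With the default handle_unknown="error", the same input is rejected
strict_encoder = OneHotEncoder()
strict_encoder.fit(train_quality)
try:
    strict_encoder.transform(test_quality)
    strict_raised = False
except ValueError:
    strict_raised = True
print("Strict encoder raised ValueError:", strict_raised)
```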

Run the next code cell without changes to save your results to a CSV file that can be submitted directly to the competition.

# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
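
To check what the saved file looks like without touching the real data, the sketch below writes a made-up three-row submission to an in-memory buffer. The Id values and prices are invented for illustration.

```python
import io

import pandas as pd

# Made-up Ids and predictions standing in for X_test.index and preds_test
toy_index = pd.Index([1461, 1462, 1463], name="Id")
toy_preds = [120000.0, 155000.5, 180250.0]

toy_output = pd.DataFrame({"Id": toy_index, "SalePrice": toy_preds})

# Write to a string buffer instead of a file so the result can be inspected
buffer = io.StringIO()
toy_output.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()

for line in csv_lines:
    print(line)
```

Passing `index=False` keeps the DataFrame's own row index out of the file, so the file contains only the `Id` and `SalePrice` columns.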