The most convenient part of pipelines is that Pipeline encapsulates and manages all preprocessing and modeling steps as a single streaming workflow, so the parameters fitted on the training data can easily be reused on new data (for example, a test set).
See https://zhuanlan.zhihu.com/p/42368821
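As a minimal sketch of that idea (using a toy DataFrame invented here purely for illustration), one fitted pipeline applies identical preprocessing to both the training data and unseen data:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
# Toy data: one numeric feature with a missing value (hypothetical example)
X = pd.DataFrame({'feature': [1.0, 2.0, None, 4.0]})
y = pd.Series([10.0, 20.0, 30.0, 40.0])
pipe = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                       ('model', LinearRegression())])
pipe.fit(X, y)                    # imputer statistics are learned here
X_new = pd.DataFrame({'feature': [None, 5.0]})
print(pipe.predict(X_new))        # the same fitted imputer is reused on new data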
In this exercise, you will use pipelines to improve the efficiency of your machine learning code.
Setup
The questions below will give you feedback on your work. Run the following cell to set up the feedback system.
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex4 import *
print("Setup Complete")
You will work with data from the Housing Prices Competition for Kaggle Learn Users.
Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')
# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)
# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y,
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and
                    X_train_full[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if
                  X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
The next code cell uses code from the tutorial to preprocess the data and train a model (data preprocessing and modeling). Run this code without changes.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)
# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])
# Preprocessing of training data, fit model
clf.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)
print('MAE:', mean_absolute_error(y_valid, preds))
# Output:
MAE: 17861.780102739725
The code yields a value around 17862 for the mean absolute error (MAE). In the next step, you will amend the code to do better.
Step 1: Improve the performance
Part A
Now, it's your turn! In the code cell below, define your own preprocessing steps and random forest model. Fill in values for the following variables:
numerical_transformer
categorical_transformer
model
To pass this part of the exercise, you need only define valid preprocessing steps and a random forest model.
# Preprocessing for numerical data: for numerical columns, preprocessing here amounts to imputing missing values
numerical_transformer = SimpleImputer(strategy='median')  # Your code here: impute numerical variables with the median
# Preprocessing for categorical data: two steps, imputation and encoding, which can be bundled in a pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))])  # Your code here: added sparse=False
# Bundle preprocessing for numerical and categorical data with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0) # Your code here
# Check your answer
step_1.a.check()
Part B
Run the code cell below without changes.
To pass this step, you need to have defined a pipeline in Part A that achieves lower MAE than the code above. You're encouraged to take your time here and try out many different approaches, to see how low you can get the MAE! (If your code does not pass, please amend the preprocessing steps and model in Part A.)
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
# Check your answer
step_1.b.check()
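If you want to experiment systematically, one approach (a sketch only, not part of the official exercise; the imputation strategies and n_estimators values below are arbitrary choices, and the variables reused here were defined in the cells above) is to loop over a few settings and compare the validation MAE:
# Try a few imputation strategies and forest sizes, reusing the pipeline pattern
for strategy in ['mean', 'median', 'constant']:
    for n_estimators in [100, 200]:
        candidate = Pipeline(steps=[
            ('preprocessor', ColumnTransformer(transformers=[
                ('num', SimpleImputer(strategy=strategy), numerical_cols),
                ('cat', categorical_transformer, categorical_cols)])),
            ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))])
        candidate.fit(X_train, y_train)
        mae = mean_absolute_error(y_valid, candidate.predict(X_valid))
        print(f'strategy={strategy}, n_estimators={n_estimators}, MAE={mae:.0f}')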
Step 2: Generate test predictions
Now, you'll use your trained model to generate predictions with the test data.
# Preprocessing of test data, get predictions
preds_test = my_pipeline.predict(X_test)  # Your code here: the pipeline applies exactly the same preprocessing to the test set as to the training set, with no duplicated code
# Check your answer
step_2.check()
Run the next code cell without changes to save your results to a CSV file that can be submitted directly to the competition.
# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
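To sanity-check the file before submitting (an optional step, not part of the exercise), you can read it back and inspect the first rows:
# Optional: verify the submission format (two columns, Id and SalePrice)
print(pd.read_csv('submission.csv').head())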