A standardized machine learning workflow from the official Kaggle website.
Now it's your turn to test your new knowledge of handling missing values. You'll probably find it makes a big difference.
Setup
The questions will give you feedback on your work. Run the following cell to set up the feedback system.
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex2 import *
print("Setup Complete")
In this exercise, you will work with data from the Housing Prices Competition for Kaggle Learn Users.
Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')
# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)
# To keep things simple, we'll use only numerical predictors
X = X_full.select_dtypes(exclude=['object'])
X_test = X_test_full.select_dtypes(exclude=['object'])
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)
You can already see a few missing values in the first several rows. In the next step, you'll obtain a more comprehensive understanding of the missing values in the dataset.
Step 1: Preliminary investigation
Run the code cell below without changes.
# Shape of training data (num_rows, num_columns)
print(X_train.shape)
# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])
(1168, 36)
LotFrontage 212
MasVnrArea 6
GarageYrBlt 58
dtype: int64
Part A
Use the above output to answer the questions below.
# Fill in the line below: How many rows are in the training data?
num_rows = 1168
# Fill in the line below: How many columns in the training data
# have missing values?
num_cols_with_missing = 3
# Fill in the line below: How many missing entries are contained in
# all of the training data?
tot_missing = 276
# Check your answers
step_1.a.check()
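The quantities in Part A can also be computed programmatically instead of read off the printed output. A minimal sketch, using a small hypothetical DataFrame as a stand-in for X_train (the real competition data is not bundled here):

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train; column names are illustrative only
X_train = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "MasVnrArea":  [0.0, 196.0, np.nan, 0.0],
    "YearBuilt":   [2003, 1976, 2001, 1998],
})

num_rows = X_train.shape[0]                          # rows in the training data
missing_by_col = X_train.isnull().sum()              # missing values per column
num_cols_with_missing = (missing_by_col > 0).sum()   # columns with any missing value
tot_missing = missing_by_col.sum()                   # total missing entries

print(num_rows, num_cols_with_missing, tot_missing)  # 4 2 3
```

Applied to the real X_train, the same three expressions reproduce the answers 1168, 3, and 276.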
Part B
Considering your answers above, what do you think is likely the best approach to dealing with the missing values?
Given the characteristics of the data, how should we choose a strategy for handling the missing values?
Does the dataset have many missing values, or only a small fraction? If we simply ignore the missing values, will we lose a large amount of useful information?
This dataset has 1168 rows and 36 columns; the missing values are spread across 3 columns, with 276 missing entries in total.
Since the data has relatively few missing values (even the column with the most missing entries is missing less than 20% of its values: 212 < 1168 * 20%), dropping the affected columns is unlikely to work well, because we would throw away a lot of valuable data. Imputation is therefore likely to be the better approach.
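The per-column missing fraction used in the reasoning above can be computed directly with isnull().mean(). A small sketch on a hypothetical frame (column names are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame: what fraction of each column is missing?
X_train = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0, np.nan],
    "GarageYrBlt": [2003.0, 1976.0, np.nan, 1998.0, 2005.0],
})

# isnull() yields booleans; mean() over them is the fraction of missing values
missing_frac = X_train.isnull().mean()
print(missing_frac)
```

Columns where this fraction is small (well under 20% in the dataset above) are better candidates for imputation than for dropping.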
To compare different approaches to dealing with missing values, you'll use the same score_dataset() function from the tutorial. This function reports the mean absolute error (MAE) from a random forest model.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
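To see the function in action before the real comparison, here is a self-contained sketch that calls score_dataset() on synthetic regression data (the random features and target below are invented stand-ins for the housing data):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Synthetic data standing in for the housing features/target
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 3), columns=["a", "b", "c"])
y = X["a"] * 100 + rng.rand(200)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
mae = score_dataset(X_train, X_valid, y_train, y_valid)
print(f"MAE: {mae:.3f}")
```

A lower MAE means the preprocessing approach fed into the model preserved more useful signal.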
Step 2: Drop columns with missing values
In this step, you'll preprocess the data in X_train and X_valid to remove columns with missing values. Set the preprocessed DataFrames to reduced_X_train and reduced_X_valid, respectively.
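One way to do this, following the drop-columns pattern from the tutorial: find the columns with missing values in the training data, then drop those same columns from both sets. A minimal sketch on hypothetical toy frames (the real X_train/X_valid come from the cells above):

```python
import numpy as np
import pandas as pd

# Toy training/validation frames; 'LotFrontage' has missing values in training
X_train = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "YearBuilt":   [2003, 1976, 2001],
})
X_valid = pd.DataFrame({
    "LotFrontage": [70.0, 60.0],
    "YearBuilt":   [1998, 2005],
})

# Columns with any missing value in the TRAINING data decide what is dropped,
# so the validation set keeps exactly the same columns as the training set
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print(reduced_X_train.columns.tolist())  # ['YearBuilt']
```

Note that the dropped columns are determined from the training data only; applying the same list to X_valid keeps the two sets aligned.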
To be continued