By encoding categorical variables, you'll obtain your best results thus far!
Setup
The questions below will give you feedback on your work. Run the following cell to set up the feedback system.
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex3 import *
print("Setup Complete")
In this exercise, you will work with data from the Housing Prices Competition for Kaggle Learn Users.
Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id')
X_test = pd.read_csv('../input/test.csv', index_col='Id')
# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)
# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)
Notice that the dataset contains both numerical and categorical variables. You'll need to encode the categorical data before training a model.
To compare different models, you'll use the same score_dataset() function from the tutorial. This function reports the mean absolute error (MAE) from a random forest model.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
Step 1: Drop columns with categorical data
You'll get started with the most straightforward approach. Use the code cell below to preprocess the data in X_train and X_valid to remove columns with categorical data. Set the preprocessed DataFrames to drop_X_train and drop_X_valid, respectively.
# Fill in the lines below: drop columns in training and validation data
drop_X_train = X_train.select_dtypes(exclude=["object"])
drop_X_valid = X_valid.select_dtypes(exclude=["object"])
# Check your answers
step_1.check()
Hint: Use the select_dtypes() method to drop all columns with the object dtype. The signature is DataFrame.select_dtypes(include=None, exclude=None).
Get the MAE for this approach:
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
Step 2: Label encoding
Before jumping into label encoding, we'll investigate the dataset. Specifically, we'll look at the 'Condition2' column. The code cell below prints the unique entries in both the training and validation sets.
print("Unique values in 'Condition2' column in training data:", X_train['Condition2'].unique())
print("\nUnique values in 'Condition2' column in validation data:", X_valid['Condition2'].unique())
# Output
Unique values in 'Condition2' column in training data: ['Norm' 'PosA' 'Feedr' 'PosN' 'Artery' 'RRAe']
Unique values in 'Condition2' column in validation data: ['Norm' 'RRAn' 'RRNn' 'Artery' 'Feedr' 'PosN']
Here .unique() is the pandas method for extracting unique values; much like turning a list into a set, it returns each distinct category contained in the variable, with no repeats.
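A quick illustration of the behavior (the values here are made up for the demo):
import pandas as pd
# .unique() returns each distinct value once, in order of first appearance
print(pd.Series(['Norm', 'Feedr', 'Norm']).unique())  # ['Norm' 'Feedr']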
If you now write code to:
- fit a label encoder to the training data, and then
- use it to transform both the training and validation data,
you'll get an error. Can you see why this is the case? (You'll need to use the above output to answer this question.)
If you fit and transform a label encoder right now, you'll get an error. The reason:
[Are there values that appear in the validation set but not in the training set?]
Fitting a label encoder to a column of the training data creates an integer label for every unique category appearing in that column. If the validation set contains categories that don't appear in the training data, the encoder will throw an error, because no integers were assigned to those categories. Notice that the 'Condition2' column in the validation data contains the values 'RRAn' and 'RRNn', which do not appear in the training data; this is why the code will error out if we try to use scikit-learn's label encoder here.
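To see the failure concretely, here is a minimal sketch (the fitted values are made-up stand-ins for the 'Condition2' column):
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(['Norm', 'PosA', 'Feedr'])  # categories seen during training
encoder.transform(['Norm', 'RRAn'])     # 'RRAn' was never seen -> ValueError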
This is a common problem that you'll encounter with real-world data, and there are many approaches to fixing this issue. For instance, you can write a custom label encoder to deal with new categories. The simplest approach, however, is to drop the problematic categorical columns.
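To illustrate the custom-encoder idea (a sketch under my own assumptions, not the course's solution), one option is to reserve an extra integer for categories never seen during fitting:
class SafeLabelEncoder:
    # Minimal sketch: maps unseen categories to a reserved "unknown" integer
    def fit(self, values):
        self.mapping_ = {v: i for i, v in enumerate(sorted(set(values)))}
        return self
    def transform(self, values):
        unknown = len(self.mapping_)  # reserved label for unseen categories
        return [self.mapping_.get(v, unknown) for v in values]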
Run the code cell below to save the problematic columns to a Python list bad_label_cols. Likewise, columns that can be safely label encoded are stored in good_label_cols.
# All categorical columns (the classic [for ... in ... if ...] list comprehension again)
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
# Columns that can be safely label encoded (we're comparing the actual entries, so think in terms of sets)
good_label_cols = [col for col in object_cols if
                   set(X_train[col]) == set(X_valid[col])]
# Problematic columns that will be dropped from the dataset (the complement: all categorical columns minus the good ones)
bad_label_cols = list(set(object_cols)-set(good_label_cols))
print('Categorical columns that will be label encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)
Use the next code cell to label encode the data in X_train and X_valid. Set the preprocessed DataFrames to label_X_train and label_X_valid, respectively.
- We have provided code below to drop the categorical columns in bad_label_cols from the dataset.
- You should label encode the categorical columns in good_label_cols.
Now for the label encoding itself. Here we simply drop the columns listed in bad_label_cols from the dataset.
from sklearn.preprocessing import LabelEncoder
# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)
# Apply label encoder (important: note that this involves a loop over the columns)
label_encoder = LabelEncoder()
for col in set(good_label_cols):
    label_X_train[col] = label_encoder.fit_transform(label_X_train[col])
    label_X_valid[col] = label_encoder.transform(label_X_valid[col])
# Check your answer
step_2.b.check()
Run the next code cell to get the MAE for this approach.
print("MAE from Approach 2 (Label Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
MAE from Approach 2 (Label Encoding):
17575.291883561644
Step 3: Investigating cardinality
So far, you've tried two different approaches to dealing with categorical variables. And, you've seen that encoding categorical data yields better results than removing columns from the dataset.
Soon, you'll try one-hot encoding. Before then, there's one additional topic we need to cover. Begin by running the next code cell without changes.
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))
# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])
.nunique() is the pandas method that counts the number of unique values, i.e., how many distinct categories a column contains.
The output:
[('Street', 2),
('Utilities', 2),
('CentralAir', 2),
('LandSlope', 3),
('PavedDrive', 3),
('LotShape', 4),
('LandContour', 4),
('ExterQual', 4),
('KitchenQual', 4),
('MSZoning', 5),
('LotConfig', 5),
('BldgType', 5),
('ExterCond', 5),
('HeatingQC', 5),
('Condition2', 6),
('RoofStyle', 6),
('Foundation', 6),
('Heating', 6),
('Functional', 6),
('SaleCondition', 6),
('RoofMatl', 7),
('HouseStyle', 8),
('Condition1', 9),
('SaleType', 9),
('Exterior1st', 15),
('Exterior2nd', 16),
('Neighborhood', 25)]
The output above shows, for each column with categorical data, the number of unique values in the column. For instance, the 'Street' column in the training data has two unique values: 'Grvl' and 'Pave', corresponding to a gravel road and a paved road, respectively.
We refer to the number of unique entries of a categorical variable as the cardinality of that categorical variable. For instance, the 'Street' variable has cardinality 2.
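You can also read a single column's cardinality directly; a quick check, consistent with the output above:
X_train['Street'].nunique()  # 2 -- the two categories are 'Grvl' and 'Pave'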
Use the output above to answer the questions below.
# Fill in the line below: How many categorical variables in the training data
# have cardinality greater than 10?
high_cardinality_numcols = 3
# Fill in the line below: How many columns are needed to one-hot encode the
# 'Neighborhood' variable in the training data?
num_cols_neighborhood = 25
# Check your answers
step_3.a.check()
To one-hot encode a variable, we need one column for each unique entry.
For large datasets with many rows, one-hot encoding can greatly expand the size of the dataset. For this reason, we typically will only one-hot encode columns with relatively low cardinality. Then, high cardinality columns can either be dropped from the dataset, or we can use label encoding.
As an example, consider a dataset with 10,000 rows, and containing one categorical column with 100 unique entries.
- If this column is replaced with the corresponding one-hot encoding, how many entries are added to the dataset?
- If we instead replace the column with the label encoding, how many entries are added?
Use your answers to fill in the lines below.
Hint: To calculate how many entries are added to the dataset through the one-hot encoding, begin by calculating how many entries are needed to encode the categorical variable (by multiplying the number of rows by the number of columns in the one-hot encoding). Then, to obtain how many entries are added to the dataset, subtract the number of entries in the original column.
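Worked out for this example: the one-hot encoding needs 10,000 rows × 100 columns = 1,000,000 entries, and removing the original column takes away its 10,000 entries, for a net gain of 990,000. A label encoding replaces each entry in place, so it adds nothing.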
# Fill in the line below: How many entries are added to the dataset by
# replacing the column with a one-hot encoding?
OH_entries_added = 1e4*100-1e4
# Fill in the line below: How many entries are added to the dataset by
# replacing the column with a label encoding?
label_entries_added = 0
# Check your answers
step_3.b.check()
Step 4: One-hot encoding
In this step, you'll experiment with one-hot encoding. But, instead of encoding all of the categorical variables in the dataset, you'll only create a one-hot encoding for columns with cardinality less than 10.
Run the code cell below without changes to set low_cardinality_cols to a Python list containing the columns that will be one-hot encoded. Likewise, high_cardinality_cols contains a list of categorical columns that will be dropped from the dataset.
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]
# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)
Categorical columns that will be one-hot encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']
Categorical columns that will be dropped from the dataset: ['Neighborhood', 'Exterior2nd', 'Exterior1st']
Use the next code cell to one-hot encode the data in X_train and X_valid. Set the preprocessed DataFrames to OH_X_train and OH_X_valid, respectively.
- The full list of categorical columns in the dataset can be found in the Python list object_cols.
- You should only one-hot encode the categorical columns in low_cardinality_cols. All other categorical columns should be dropped from the dataset.
The next code cell is VERY IMPORTANT!
from sklearn.preprocessing import OneHotEncoder
# Use as many lines of code as you need!
# Apply one-hot encoder to each column with categorical data
onehotencoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
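# Note: in scikit-learn 1.2+ this argument is named sparse_output; the old
# `sparse` name was removed in 1.4, so adjust if you're on a newer version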
OH_cols_train = pd.DataFrame(onehotencoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(onehotencoder.transform(X_valid[low_cardinality_cols]))
# One-hot encoding removed the index; be sure to put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Get the numerical columns by dropping the categorical ones (they will be replaced with the one-hot encoding)
num_cols_train = X_train.drop(object_cols,axis=1)
num_cols_valid = X_valid.drop(object_cols,axis=1)
# Add one-hot encoded columns to numerical features (note how pd.concat is used)
OH_X_train = pd.concat([OH_cols_train, num_cols_train], axis=1)
OH_X_valid = pd.concat([OH_cols_valid, num_cols_valid], axis=1)
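# Not part of the original exercise: newer scikit-learn versions require all
# column names to be strings, so cast the mixed integer/string names here
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)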
# Check your answer
step_4.check()
The code cell above is VERY IMPORTANT!
Run the next code cell to get the MAE for this approach.
print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
Step 5: Generate test predictions and submit your results
After you complete Step 4, if you'd like to use what you've learned to submit your results to the leaderboard, you'll need to preprocess the test data before generating predictions.
This step is completely optional, and you do not need to submit results to the leaderboard to successfully complete the exercise.
Check out the previous exercise if you need help with remembering how to join the competition or save your results to CSV. Once you have generated a file with your results, follow the instructions below:
Test-set preprocessing code: to be continued. (A possible sketch follows.)
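One way to fill in that preprocessing, mirroring Step 4 (a sketch: the 'Missing' placeholder and the median imputation are assumptions of mine, not prescribed by the exercise; with handle_unknown='ignore', any category unseen during fitting encodes to all zeros):
X_test_cat = X_test[low_cardinality_cols].fillna('Missing')  # unseen values one-hot to all zeros
OH_cols_test = pd.DataFrame(onehotencoder.transform(X_test_cat))
OH_cols_test.index = X_test.index
num_cols_test = X_test.drop(object_cols, axis=1)
OH_X_test = pd.concat([OH_cols_test, num_cols_test], axis=1)
OH_X_test.columns = OH_X_test.columns.astype(str)
# Numeric columns in the test set may still contain NaNs; fill with training medians
OH_X_test = OH_X_test.fillna(OH_X_train.median())
# Fit on the training data, predict, and save a submission file
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(OH_X_train, y_train)
preds_test = model.predict(OH_X_test)
output = pd.DataFrame({'Id': X_test.index, 'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)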
- Begin by clicking on the blue Save Version button in the top right corner of this window. This will generate a pop-up window.
- Ensure that the Save and Run All option is selected, and then click on the blue Save button.
- This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the Save Version button. This pulls up a list of versions on the right of the screen. Click on the ellipsis (...) to the right of the most recent version, and select Open in Viewer. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
- Click on the Output tab on the right of the screen. Then, click on the Submit to Competition button to submit your results to the leaderboard.
You have now successfully submitted to the competition!
- If you want to keep working to improve your performance, select the blue Edit button in the top right of the screen. Then you can change your model and repeat the process. There's a lot of room to improve your model, and you will climb up the leaderboard as you work.
Keep going
With missing value handling and categorical encoding, your modeling process is getting complex. This complexity gets worse when you want to save your model to use in the future. The key to managing this complexity is something called pipelines.
Learn to use pipelines to preprocess datasets with categorical variables, missing values and any other messiness your data throws at you.