Kaggle | Exercise 7: Categorical Variables [Very Important]

By encoding categorical variables, you'll obtain your best results thus far!

Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex3 import *
print("Setup Complete")

In this exercise, you will work with data from the Housing Prices Competition for Kaggle Learn Users.

Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id') 
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()] 
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)

Notice that the dataset contains both numerical and categorical variables. You'll need to encode the categorical data before training a model.

To compare different models, you'll use the same score_dataset() function from the tutorial. This function reports the mean absolute error (MAE) from a random forest model.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

Step 1: Drop columns with categorical data

You'll get started with the most straightforward approach. Use the code cell below to preprocess the data in X_train and X_valid to remove columns with categorical data. Set the preprocessed DataFrames to drop_X_train and drop_X_valid, respectively.

# Fill in the lines below: drop columns in training and validation data
drop_X_train = X_train.select_dtypes(exclude=["object"])
drop_X_valid = X_valid.select_dtypes(exclude=["object"])

# Check your answers
step_1.check()

Hint: Use the select_dtypes() method to drop all columns with the object dtype.

DataFrame.select_dtypes(include=None, exclude=None)

Run the next code cell to get the MAE for this approach.

print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

Step 2: Label encoding

Before jumping into label encoding, we'll investigate the dataset. Specifically, we'll look at the 'Condition2' column. The code cell below prints the unique entries in both the training and validation sets.

print("Unique values in 'Condition2' column in training data:", X_train['Condition2'].unique())
print("\nUnique values in 'Condition2' column in validation data:", X_valid['Condition2'].unique())

# Output
Unique values in 'Condition2' column in training data: ['Norm' 'PosA' 'Feedr' 'PosN' 'Artery' 'RRAe']
Unique values in 'Condition2' column in validation data: ['Norm' 'RRAn' 'RRNn' 'Artery' 'Feedr' 'PosN']

Here .unique() is a pandas method that returns the distinct values in a column; much like converting a list to a set, it outputs the unique categories the variable contains.

If you now write code to:

  • fit a label encoder to the training data, and then
  • use it to transform both the training and validation data,

you'll get an error. Can you see why this is the case? (You'll need to use the above output to answer this question.)
If you fit and transform the label encoder now, the code will raise an error. The reason:
[Are there categories that appear in the validation set but not in the training set?]
Fitting a label encoder to a column in the training set creates one integer label for each unique category that appears in that column. If the validation set contains categories that do not appear in the training set, the encoder will throw an error, because those categories have not been assigned an integer. Notice that the 'Condition2' column in the validation data contains the values 'RRAn' and 'RRNn', which do not appear in the training data. Thus, if we try to use scikit-learn's label encoder, the code will raise an error.
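
A tiny standalone reproduction of the problem (illustrative values, not from the competition data):

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['Norm', 'PosA', 'Feedr'])   # categories seen during training
encoder.transform(['Norm', 'RRAn'])      # 'RRAn' was never seen by fit()
# Raises ValueError: y contains previously unseen labels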

This is a common problem that you'll encounter with real-world data, and there are many approaches to fixing this issue. For instance, you can write a custom label encoder to deal with new categories. The simplest approach, however, is to drop the problematic categorical columns.
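
For instance, a hypothetical custom encoder (a sketch of the idea mentioned above, not code from the course) could map unseen categories to a reserved value such as -1:

class SafeLabelEncoder:
    """Label encoder that maps categories unseen during fit to -1."""

    def fit(self, values):
        # Assign a stable integer to each unique category in the training data
        self.mapping_ = {cat: i for i, cat in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # Unseen categories fall back to the sentinel value -1
        return [self.mapping_.get(v, -1) for v in values]

encoder = SafeLabelEncoder().fit(['Norm', 'PosA', 'Feedr'])
print(encoder.transform(['Norm', 'RRAn']))  # [1, -1]: unseen 'RRAn' becomes -1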

Run the code cell below to save the problematic columns to a Python list bad_label_cols. Likewise, columns that can be safely label encoded are stored in good_label_cols.

# All categorical columns (the classic [for ... in ... if] list comprehension)
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely label encoded (compare the actual values, so use sets)
good_label_cols = [col for col in object_cols if 
                   set(X_train[col]) == set(X_valid[col])]
        
# Problematic columns that will be dropped from the dataset (all categorical columns minus the good ones: a set difference)
bad_label_cols = list(set(object_cols)-set(good_label_cols))
        
print('Categorical columns that will be label encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)


Use the next code cell to label encode the data in X_train and X_valid. Set the preprocessed DataFrames to label_X_train and label_X_valid, respectively.

  • We have provided code below to drop the categorical columns in bad_label_cols from the dataset.
  • You should label encode the categorical columns in good_label_cols.
from sklearn.preprocessing import LabelEncoder

# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Apply label encoder (note that this loops over each column to encode)
label_encoder = LabelEncoder()
for col in good_label_cols:
    label_X_train[col] = label_encoder.fit_transform(label_X_train[col])
    label_X_valid[col] = label_encoder.transform(label_X_valid[col])
    
# Check your answer
step_2.b.check()

Run the next code cell to get the MAE for this approach.

print("MAE from Approach 2 (Label Encoding):") 
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

# Output
MAE from Approach 2 (Label Encoding):
17575.291883561644

Step 3: Investigating cardinality

So far, you've tried two different approaches to dealing with categorical variables. And, you've seen that encoding categorical data yields better results than removing columns from the dataset.

Soon, you'll try one-hot encoding. Before then, there's one additional topic we need to cover. Begin by running the next code cell without changes.

# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])

.nunique() is a pandas method that returns the count of unique values in a column, i.e., how many distinct categories there are.

The output is:

[('Street', 2),
 ('Utilities', 2),
 ('CentralAir', 2),
 ('LandSlope', 3),
 ('PavedDrive', 3),
 ('LotShape', 4),
 ('LandContour', 4),
 ('ExterQual', 4),
 ('KitchenQual', 4),
 ('MSZoning', 5),
 ('LotConfig', 5),
 ('BldgType', 5),
 ('ExterCond', 5),
 ('HeatingQC', 5),
 ('Condition2', 6),
 ('RoofStyle', 6),
 ('Foundation', 6),
 ('Heating', 6),
 ('Functional', 6),
 ('SaleCondition', 6),
 ('RoofMatl', 7),
 ('HouseStyle', 8),
 ('Condition1', 9),
 ('SaleType', 9),
 ('Exterior1st', 15),
 ('Exterior2nd', 16),
 ('Neighborhood', 25)]

The output above shows, for each column with categorical data, the number of unique values in the column. For instance, the 'Street' column in the training data has two unique values: 'Grvl' and 'Pave', corresponding to a gravel road and a paved road, respectively.

We refer to the number of unique entries of a categorical variable as the cardinality of that categorical variable. For instance, the 'Street' variable has cardinality 2.
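
A toy example (separate from the competition data) showing both methods side by side:

import pandas as pd

street = pd.Series(['Grvl', 'Pave', 'Pave', 'Grvl', 'Pave'])
print(street.unique())    # ['Grvl' 'Pave']: the distinct categories
print(street.nunique())   # 2: the cardinality of the variable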

Use the output above to answer the questions below.

# Fill in the line below: How many categorical variables in the training data
# have cardinality greater than 10?
high_cardinality_numcols = 3

# Fill in the line below: How many columns are needed to one-hot encode the 
# 'Neighborhood' variable in the training data?
num_cols_neighborhood = 25

# Check your answers
step_3.a.check()

To one-hot encode a variable, we need one column for each unique entry.
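
A quick toy illustration (using pandas' get_dummies, one convenient way to produce a one-hot encoding; the values here are made up):

import pandas as pd

colors = pd.Series(['Red', 'Green', 'Blue', 'Green'])
print(pd.get_dummies(colors))
# One column per unique entry; each row flags only its own category
# (shown as booleans in recent pandas, 0/1 in older versions):
#     Blue  Green    Red
# 0  False  False   True
# 1  False   True  False
# 2   True  False  False
# 3  False   True  False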

For large datasets with many rows, one-hot encoding can greatly expand the size of the dataset. For this reason, we typically will only one-hot encode columns with relatively low cardinality. Then, high cardinality columns can either be dropped from the dataset, or we can use label encoding.

As an example, consider a dataset with 10,000 rows, and containing one categorical column with 100 unique entries.

  • If this column is replaced with the corresponding one-hot encoding, how many entries are added to the dataset?
  • If we instead replace the column with the label encoding, how many entries are added?

Use your answers to fill in the lines below.

Hint: To calculate how many entries are added to the dataset through the one-hot encoding, begin by calculating how many entries are needed to encode the categorical variable (by multiplying the number of rows by the number of columns in the one-hot encoding). Then, to obtain how many entries are added to the dataset, subtract the number of entries in the original column.

# Fill in the line below: How many entries are added to the dataset by 
# replacing the column with a one-hot encoding?
OH_entries_added = 1e4*100-1e4

# Fill in the line below: How many entries are added to the dataset by
# replacing the column with a label encoding?
label_entries_added = 0

# Check your answers
step_3.b.check()

Step 4: One-hot encoding

In this step, you'll experiment with one-hot encoding. But, instead of encoding all of the categorical variables in the dataset, you'll only create a one-hot encoding for columns with cardinality less than 10.

Run the code cell below without changes to set low_cardinality_cols to a Python list containing the columns that will be one-hot encoded. Likewise, high_cardinality_cols contains a list of categorical columns that will be dropped from the dataset.

# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))

print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)

# Output
Categorical columns that will be one-hot encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']

Categorical columns that will be dropped from the dataset: ['Neighborhood', 'Exterior2nd', 'Exterior1st']

Use the next code cell to one-hot encode the data in X_train and X_valid. Set the preprocessed DataFrames to OH_X_train and OH_X_valid, respectively.

  • The full list of categorical columns in the dataset can be found in the Python list object_cols.
  • You should only one-hot encode the categorical columns in low_cardinality_cols. All other categorical columns should be dropped from the dataset.

The next code cell is VERY IMPORTANT!

from sklearn.preprocessing import OneHotEncoder

# Use as many lines of code as you need!
# Apply one-hot encoder to each column with categorical data
onehotencoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(onehotencoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(onehotencoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removes the index; don't forget to put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns to get the numerical columns (the one-hot encoding will replace them)
num_cols_train = X_train.drop(object_cols, axis=1)
num_cols_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features (note the usage of pd.concat)
OH_X_train = pd.concat([OH_cols_train, num_cols_train], axis=1)
OH_X_valid = pd.concat([OH_cols_valid, num_cols_valid], axis=1)

# Check your answer
step_4.check()

The code cell above is VERY IMPORTANT!
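
One caveat, added here and not part of the original exercise: in scikit-learn 1.2 and later, OneHotEncoder's sparse argument is renamed sparse_output, and fitting a model on a DataFrame whose column names mix integers (the one-hot columns above) with strings raises an error. If you hit the latter, casting the names to strings should fix it:

# Make all column names strings for compatibility with newer scikit-learn
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)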

Run the next code cell to get the MAE for this approach.

print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

Step 5: Generate test predictions and submit your results

After you complete Step 4, if you'd like to use what you've learned to submit your results to the leaderboard, you'll need to preprocess the test data before generating predictions.

This step is completely optional, and you do not need to submit results to the leaderboard to successfully complete the exercise.

Check out the previous exercise if you need help with remembering how to join the competition or save your results to CSV. Once you have generated a file with your results, follow the instructions below:

Test set preprocessing code: to be continued.
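
The author left this cell unfinished; below is a minimal sketch of one possible preprocessing path, reusing the variables defined above (onehotencoder, low_cardinality_cols, object_cols, OH_X_train) and assuming the course-era scikit-learn. The 'missing' placeholder and median imputation are assumptions, not the course's official solution:

from sklearn.impute import SimpleImputer

# One-hot encode the test data's low-cardinality categorical columns.
# Missing categorical entries are filled with a placeholder; because the
# encoder was fit with handle_unknown='ignore', the unseen placeholder
# simply encodes as all zeros.
OH_cols_test = pd.DataFrame(
    onehotencoder.transform(X_test[low_cardinality_cols].fillna('missing')))
OH_cols_test.index = X_test.index

# Keep the numerical columns and attach the one-hot columns
num_cols_test = X_test.drop(object_cols, axis=1)
OH_X_test = pd.concat([OH_cols_test, num_cols_test], axis=1)

# The test set still has missing numerical values (the training set does not,
# since those columns were dropped earlier), so impute them here
imputer = SimpleImputer(strategy='median')
final_X_test = pd.DataFrame(imputer.fit_transform(OH_X_test),
                            columns=OH_X_test.columns, index=OH_X_test.index)

# Fit on the preprocessed training data, predict, and write a submission file
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(OH_X_train, y_train)
preds_test = model.predict(final_X_test)
output = pd.DataFrame({'Id': final_X_test.index, 'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)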

  1. Begin by clicking on the blue Save Version button in the top right corner of this window. This will generate a pop-up window.
  2. Ensure that the Save and Run All option is selected, and then click on the blue Save button.
  3. This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the Save Version button. This pulls up a list of versions on the right of the screen. Click on the ellipsis (...) to the right of the most recent version, and select Open in Viewer. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
  4. Click on the Output tab on the right of the screen. Then, click on the Submit to Competition button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

If you want to keep working to improve your performance, select the blue Edit button in the top right of the screen. Then you can change your model and repeat the process. There's a lot of room to improve your model, and you will climb up the leaderboard as you work.

Keep going

With missing value handling and categorical encoding, your modeling process is getting complex. This complexity gets worse when you want to save your model to use in the future. The key to managing this complexity is something called pipelines.

Learn to use pipelines to preprocess datasets with categorical variables, missing values and any other messiness your data throws at you.
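
As a preview (a minimal sketch under the assumptions of this exercise, reusing low_cardinality_cols and high_cardinality_cols from Step 4, not code from the next lesson), scikit-learn's Pipeline and ColumnTransformer can bundle the encoding and the model into a single object:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode low-cardinality columns, drop high-cardinality ones,
# and pass the numerical columns through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), low_cardinality_cols),
        ('drop_high_card', 'drop', high_cardinality_cols),
    ],
    remainder='passthrough')

pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=0)),
])

# One fit/predict call handles preprocessing and modeling together
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_valid)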
