Kaggle | Exercise 6: Missing Values [To be continued]

A standardized machine learning workflow from the official Kaggle website.
Now it's your turn to test your new knowledge of missing values handling. You'll probably find it makes a big difference.

Setup

The questions will give you feedback on your work. Run the following cell to set up the feedback system.

# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex2 import *
print("Setup Complete")

In this exercise, you will work with data from the Housing Prices Competition for Kaggle Learn Users.

Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll use only numerical predictors
X = X_full.select_dtypes(exclude=['object'])
X_test = X_test_full.select_dtypes(exclude=['object'])

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

You can already see a few missing values in the first several rows. In the next step, you'll obtain a more comprehensive understanding of the missing values in the dataset.

Step 1: Preliminary investigation

Run the code cell below without changes.

# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

Output:

(1168, 36)
LotFrontage    212
MasVnrArea       6
GarageYrBlt     58
dtype: int64

Part A

Use the above output to answer the questions below.

# Fill in the line below: How many rows are in the training data?
num_rows = 1168

# Fill in the line below: How many columns in the training data
# have missing values?
num_cols_with_missing = 3

# Fill in the line below: How many missing entries are contained in 
# all of the training data?
tot_missing = 276

# Check your answers
step_1.a.check()
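
The three answers can also be computed from the data rather than read off the printed output. The following is a minimal sketch, assuming a small made-up DataFrame named toy that stands in for X_train (the column names mirror the real data, but the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train; the values are made up for illustration
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "MasVnrArea": [196.0, 0.0, np.nan, 350.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Missing entries per column, same expression as in the cell above
missing_by_column = toy.isnull().sum()

num_rows = toy.shape[0]                                     # rows in the data
num_cols_with_missing = int((missing_by_column > 0).sum())  # columns with at least one NaN
tot_missing = int(missing_by_column.sum())                  # all missing entries

print(num_rows, num_cols_with_missing, tot_missing)
```

Applied to the real X_train, the same expressions should reproduce the answers above: 1168 rows, 3 columns with missing values, and 212 + 6 + 58 = 276 missing entries.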

Part B

Considering your answers above, what do you think is likely the best approach to dealing with the missing values?
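Given the state of the data, how should we choose a strategy for handling missing values?
Does the dataset have many missing values, or only a few? If we ignore the missing values, will we lose a lot of useful information?

This dataset has 1168 rows and 36 columns. Missing values occur in 3 columns, and there are 276 missing entries in total.

Because the data has relatively few missing values (even the column with the most missing entries is missing fewer than 20% of its values: 212 < 1168 * 20%), we can expect that dropping columns will not work well. Dropping them would throw away a lot of valuable data, so imputation will likely perform better.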
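
The 20% comparison can be made explicit by turning the counts into percentages. This is a small sketch that only reuses the numbers printed in Step 1; the variable names (missing_counts, missing_share, overall_share) are chosen here for illustration:

```python
import pandas as pd

# Counts and shape copied from the Step 1 output
missing_counts = pd.Series({"LotFrontage": 212, "MasVnrArea": 6, "GarageYrBlt": 58})
num_rows, num_cols = 1168, 36

# Percentage of rows missing in each affected column
missing_share = (missing_counts / num_rows * 100).round(1)
print(missing_share)

# Percentage of all cells in the training data that are missing
overall_share = missing_counts.sum() / (num_rows * num_cols) * 100
print(round(overall_share, 2))
```

Every affected column stays below the 20% mark, and missing cells make up well under 1% of the whole table, which supports keeping the columns and filling in the gaps.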

To compare different approaches to dealing with missing values, you'll use the same score_dataset() function from the tutorial. This function reports the mean absolute error (MAE) from a random forest model.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

Step 2: Drop columns with missing values

In this step, you'll preprocess the data in X_train and X_valid to remove columns with missing values. Set the preprocessed DataFrames to reduced_X_train and reduced_X_valid, respectively.
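
One possible way to approach this step is sketched below on made-up data. The frames toy_train and toy_valid are hypothetical stand-ins for X_train and X_valid, and this is not necessarily the solution the exercise checker expects. The idea is to pick the columns to drop from the training data, then drop the same columns from both sets so their columns stay aligned:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_train and X_valid; values are made up
toy_train = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "LotArea": [8450, 9600, 11250],
    "GarageYrBlt": [2003.0, 1976.0, np.nan],
})
toy_valid = pd.DataFrame({
    "LotFrontage": [60.0, 84.0],
    "LotArea": [9550, 14260],
    "GarageYrBlt": [1998.0, 2000.0],
})

# Names of training columns that contain at least one missing value
cols_with_missing = [col for col in toy_train.columns
                     if toy_train[col].isnull().any()]

# Drop the same columns from both sets
reduced_toy_train = toy_train.drop(columns=cols_with_missing)
reduced_toy_valid = toy_valid.drop(columns=cols_with_missing)

print(list(reduced_toy_train.columns))
```

With the real data, the same pattern would assign the results to reduced_X_train and reduced_X_valid, which can then be passed to score_dataset() to get the MAE for this approach.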

To be continued
