Machine Learning: Linear Regression Tricks Explained by Example - V1.2

This article walks through the code of the kernel A study on Regression applied to the Ames dataset in detail.

The kernel introduces itself as follows:
Introduction
This kernel uses a variety of tricks to show what Linear Regression can do, including preprocessing and regularization (a process of introducing additional information in order to prevent overfitting).

Workflow in detail

1. Importing the data

(If you need the dataset, it can be downloaded from the competition page.)

  1. Import the toolkits.
    matplotlib is the best-known Python plotting library. It can output figures in many formats and display them interactively through several GUI (Graphical User Interface) toolkits. The %matplotlib magic command either embeds matplotlib figures directly in the Notebook or displays them with a specified GUI backend; its argument selects how figures are shown. inline means the figures are embedded in the Notebook.

A note on magics: IPython ships with a very powerful built-in command system, the so-called magic commands, which make working in IPython much more convenient. Magic commands start with % or %%: those starting with % are line magics, those starting with %% are cell magics. A line magic only affects the line it is on, while a cell magic must appear on the first line of a cell and processes the whole cell. MORE TO SEE...
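For reference, a minimal set of imports that the snippets below assume (module paths follow current scikit-learn; the original kernel's import cell may differ slightly):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import skew
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
# In a notebook, also run the magic: %matplotlib inline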

  2. Read in the data.
    It is worth opening the .csv file in Excel; later you can tweak the data by hand and observe how different tweaks change the results of the code.
    train = pd.read_csv("../input/train.csv")
    Set the display format:
    The following limits the number of decimal places shown for floats to three.
    pd.set_option('display.float_format', lambda x: '%.3f' % x)

  3. Check for duplicate rows and drop the ID column.
    Check for duplicates:

     idsUnique = len(set(train.Id))
     idsTotal = train.shape[0]
     idsDupli = idsTotal - idsUnique
     print("There are " + str(idsDupli) + " duplicate IDs for " + str(idsTotal) + " total entries")
    

    Here train.shape returns (1460, 80), i.e. the numbers of rows and columns; train.shape[0] returns the number of rows.

    This dataset contains no duplicate rows. If another dataset does need duplicates removed, a good tool is DataFrame.drop_duplicates(subset=None, keep='first', inplace=False), which returns a DataFrame with duplicate rows removed, optionally considering only certain columns.

    Its parameters are:
    subset: column label or sequence of labels, optional. Only consider certain columns for identifying duplicates; by default all columns are used.
    keep: {'first', 'last', False}, default 'first'. 'first': drop duplicates except for the first occurrence. 'last': drop duplicates except for the last occurrence. False: drop all duplicates.
    inplace: boolean, default False. Whether to drop duplicates in place or to return a copy. If True, the original DataFrame is modified directly; otherwise a pruned copy is returned.
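    A hypothetical usage sketch (this dataset happens to have no duplicate Ids, so the call below would drop nothing):

    # Drop rows with a duplicated "Id", keeping the first occurrence, modifying train in place
    train.drop_duplicates(subset=["Id"], keep="first", inplace=True)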

    Drop the ID column:
    train.drop("Id", axis = 1, inplace = True)

    The parameters mean:
    "Id" is the column name.
    axis = 1 means a column is dropped; 0 would mean a row.
    inplace = True: operations that modify an object and return a new one usually accept an inplace option. If it is set to True (the default is False), the original object is modified directly instead of a copy being returned.

2. Pre-processing

2.1 Removing outliers (Potential Pitfalls/Outliers)

train = train[train.GrLivArea < 4000] # drop the outlying points on the right
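The scatter plot behind this filter is not reproduced here; a sketch of how such a plot can be drawn before dropping the points:

# Plot SalePrice against GrLivArea; the outliers sit at the far right (GrLivArea >= 4000)
plt.scatter(train.GrLivArea, train.SalePrice, c="blue", marker="s")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()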

2.2 Taking the log to damp the error

Taking the logarithm balances the influence of prediction errors across cheap and expensive houses.

train.SalePrice = np.log1p(train.SalePrice)
y = train.SalePrice

The log is taken with log1p, i.e. log(1 + x).

Two points:

  1. What taking the log does:
    Small values that are close together are spread further out.
    Large values that are spread out are brought closer together.
[Figure: natural log]
  2. What taking log(1 + x) does:
    As with Laplace smoothing in Naive Bayes, where the +1 prevents a probability of zero (and the resulting numerical error) for values never seen before, log1p also stays well defined at x = 0, where a plain log would fail. A small numeric sketch follows below.
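A tiny numpy illustration of both effects (the values are arbitrary):

import numpy as np

small = np.array([1.0, 2.0, 3.0])            # neighbouring gaps of 1
large = np.array([100.0, 1000.0, 10000.0])   # neighbouring gaps of 900 and 9000
print(np.log1p(small))   # [0.69 1.10 1.39]
print(np.log1p(large))   # [4.62 6.91 9.21]  -> the huge gaps shrink to about 2.3 each
print(np.log1p(0.0))     # 0.0               -> log1p is still defined at 0, unlike log(0)
# On the raw scale the small values occupy a tiny slice of the overall range;
# after log1p they take up a much larger share, while the spread-out large
# values are pulled much closer together.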

2.3 Handle missing values

Handle the missing values of features that cannot be filled with the median, the mean, or the most common value.

The rule for choosing the replacement:
based on the feature's documented labels, decide what a missing value most likely represents, and fill with that.

Concretely, go back to the raw data description:

  • If the feature's values form a grading (quality grades; 2/1/0 or Y/N), the lowest grade is usually chosen as the fill value.

  • If the values are plain categories, the most frequent value of that feature is chosen as the fill value.

    train.loc[:, "Alley"] = train.loc[:, "Alley"].fillna("None")
    

Here train.loc[:, "Alley"] means select every row of the column "Alley", and .fillna(XX) means fill NA cells with XX.
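A sketch of the two rules above (the feature names come from this dataset, but the exact fill values used in the original kernel may differ):

# Graded feature: a missing basement quality most likely means "no basement",
# so fill with the lowest grade.
train.loc[:, "BsmtQual"] = train.loc[:, "BsmtQual"].fillna("No")
# Purely categorical feature: fill with the most frequent value (the mode).
train.loc[:, "Electrical"] = train.loc[:, "Electrical"].fillna(train["Electrical"].mode()[0])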

2.4 Handling special data

2.4.1 Converting numerical features to categories

Some numerical features are in fact categorical and are converted to categories. For example, the month number carries no numerical meaning by itself, so it is replaced by the English abbreviation.

train = train.replace({"MSSubClass" : {20 : "SC20", 30 : "SC30", 40 : "SC40", 45 : "SC45",  50 : "SC50", 60 : "SC60", 70 : "SC70", 75 : "SC75",  80 : "SC80", 85 : "SC85", 90 : "SC90", 120 : "SC120",  150 : "SC150", 160 : "SC160", 180 : "SC180", 190 : "SC190"}, 
"MoSold" : {1 : "Jan", 2 : "Feb", 3 : "Mar", 4 : "Apr", 5 : "May", 6 : "Jun", 7 : "Jul", 8 : "Aug", 9 : "Sep", 10 : "Oct", 11 : "Nov", 12 : "Dec"}})
2.4.2 Converting categorical features to ordered numbers

Some categorical features are converted to ordered numbers when:

  1. the feature has an explicit grading and the order of the numbers is itself informative, e.g. "BsmtQual" : {"No" : 0, "Po" : 1, "Fa" : 2, "TA": 3, "Gd" : 4, "Ex" : 5};
  2. the categories can be ranked fairly unambiguously: for Alley, most people prefer a paved surface to gravel; for LotShape, most people prefer a regular lot. A counter-example is Neighborhood: it does carry some ranking, since most people like similar neighbourhoods, but that ranking is hard to pin down.
    train = train.replace({"Alley" : {"Grvl" : 1, "Pave" : 2},
                   "BsmtCond" : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
                ……)

3. Create new features

Then we will create new features, in 3 ways :

  1. Simplifications of existing features
  2. Combinations of existing features
  3. Polynomials on the top 10 existing features

The reason for processing the features further, in my view, is to simplify the later computation and focus on the core features.

3.1 Simplifying features, method 1: simplifications of existing features

The first way to simplify features is to coarsen the levels of existing ones; for example, a 9-level scale (1-9) can be collapsed to 3 levels (1-3) as below.

train["SimplOverallQual"] = train.OverallQual.replace(
                                                  {1 : 1, 2 : 1, 3 : 1, # bad
                                                   4 : 2, 5 : 2, 6 : 2, # average
                                                   7 : 3, 8 : 3, 9 : 3, 10 : 3 # good
                                                  })
3.2 Simplifying features, method 2: combinations of existing features

The second way to simplify features is to merge several closely related features into one.

A lecture that may help here: Multivariate Linear Regression - Features and Polynomial Regressions - Housing Prices Predicting, given by Andrew Ng, Stanford University. Note that this approach requires particular care with scaling.

[Figure: Features Choice]

Example syntax:

train["OverallGrade"] = train["OverallQual"] * train["OverallCond"]
3.3 Simplifying features, method 3: polynomials on the top 10 existing features
3.3.1 Finding the most important features

Find the most important features relative to the target: sort the other features by their correlation with SalePrice in descending order.

corr = train.corr()
corr.sort_values(["SalePrice"], ascending = False, inplace = True)
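To see the ranking, one can print the first entries of the SalePrice column of the sorted correlation matrix (SalePrice itself comes first with correlation 1):

print(corr.SalePrice.head(10))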
3.3.2 Features and Polynomial Regressions

My understanding of this step: first pick out the important features, then, to fit the model better, apply a polynomial expansion to those important features.

There are three situations in which polynomial terms are worth considering:

  • A theoretical requirement, i.e. the author hypothesizes that the relationship is curved.
  • Inspection of the variables. Before running the regression, look at univariate or bivariate views; a simple scatter plot can reveal a curved relationship.
  • Inspection of the residuals. If you fit a linear model to data with a curved relationship, the residual plot will show a block of positive residuals in the middle of the predictor's range and a block of negative residuals at one end of the X axis (or the other way round). That indicates a linear model is inadequate. Personally, though, I would note that this is only one of several symptoms of a poor linear fit, not a necessary and sufficient condition.

Code example:

train["OverallQual-s2"] = train["OverallQual"] ** 2
train["OverallQual-s3"] = train["OverallQual"] ** 3
train["OverallQual-Sq"] = np.sqrt(train["OverallQual"])

Again, the relevant lecture is Multivariate Linear Regression - Features and Polynomial Regressions - Housing Prices Predicting, given by Andrew Ng, Stanford University, and again this approach requires particular care with scaling.

Interestingly, after running the code above and re-ranking the features that influence SalePrice, the newly created features enter the top 10 most influential features, which suggests the polynomial terms are worthwhile.

[Figure: Features and Polynomial Regressions]

4. Further processing after feature creation

4.1 Handling the remaining missing data
4.1.1 Separating numerical and categorical features

Apart from the target feature SalePrice, split the columns into numerical and categorical features.

categorical_features = train.select_dtypes(include = ["object"]).columns
numerical_features = train.select_dtypes(exclude = ["object"]).columns
numerical_features = numerical_features.drop("SalePrice")

Here object is the dtype of the categorical features.

4.1.2 Filling missing data

For missing values in the numerical features, the median is used as the fill value.

train_num = train_num.fillna(train_num.median())
4.2 Take Log

Taking the log of skewed numerical features weakens the influence of outliers.

Inspired by Alexandru Papiu's script.
As a general rule of thumb, a skewness with an absolute value > 0.5 is considered at least moderately skewed.

skewness = train_num.apply(lambda x: skew(x))
skewness = skewness[abs(skewness) > 0.5]
skewed_features = skewness.index
train_num[skewed_features] = np.log1p(train_num[skewed_features])
4.3 Create dummy features for categorical values

Create dummy features for categorical values via one-hot encoding.

[Figure: OneHotEncoder]

In regression analysis, a dummy variable (also known as an indicator variable, design variable, Boolean indicator, categorical variable, binary variable, or qualitative variable) takes the value 0 or 1 to indicate whether some categorical effect that may shift the outcome is present. Dummy variables are commonly used for mutually exclusive categories (e.g. smoker/non-smoker).

train_cat = pd.get_dummies(train_cat)

get_dummies acts on the categorical features. An example of what it does:

  • Before get_dummies, each column is one feature whose cells hold the raw (or preprocessed) values; for the feature MSSubClass the possible values are SC20/SC60/SC70...SC120..., and, say, row 23 holds SC120.
  • After get_dummies, each possible value of the original feature becomes its own column, e.g. the MSSubClass column expands into SC20/SC60/SC70...SC120... columns. Taking SC120 as the example: since row 23 originally held SC120, the new SC120 column has a 1 in row 23, while rows that did not hold SC120 get 0. A tiny runnable demo follows below.
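A toy demonstration of this behaviour (made-up data, not the actual dataset; by default the new columns are prefixed with the original column name):

import pandas as pd

toy = pd.DataFrame({"MSSubClass": ["SC20", "SC120", "SC20"]})
print(pd.get_dummies(toy))
# Produces the columns MSSubClass_SC120 and MSSubClass_SC20; row 1 has a 1 (or True,
# depending on the pandas version) under MSSubClass_SC120 and 0 elsewhere.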

5. Modeling

5.1 Data integration and validation

5.1.1 Merging the data

First merge the data: the separately processed numerical and categorical features are concatenated back together.

train = pd.concat([train_num, train_cat], axis = 1)
5.1.2 Splitting the dataset

Split the data into a train set and a validation set. In fact, as the author notes afterwards, this split is unnecessary for cross-validation, because cross-validation does the splitting itself.

X_train, X_test, y_train, y_test = train_test_split(train, y, test_size = 0.3, random_state = 0)

train_test_split(train, y, test_size = 0.3, random_state = 0) randomly splits arrays or matrices into train and test subsets.

5.1.3 standardize numerical features

Reason why we need standardization for numerical features: many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance of the same order. If a feature has a variance that is orders of magnitude larger than the others, it may dominate the objective function and make the estimator unable to learn from the other features correctly, as it otherwise would.

stdSc = StandardScaler()    
X_train.loc[:, numerical_features] = stdSc.fit_transform(X_train.loc[:, numerical_features])
X_test.loc[:, numerical_features] = stdSc.transform(X_test.loc[:, numerical_features])

Here StandardScaler() standardizes features by removing the mean and scaling to unit variance; it returns the standardized data.

[Figures: fit / fit_transform / transform]

In short, fit_transform = fit + transform. fit computes the mean and variance of X_train; based on that result, transform performs the standardization. Because the mean and variance are stored, the second call, on X_test, does not recompute them: it reuses the mean and variance of X_train and simply calls transform to standardize.

Note that standardization (the fit call) must not be applied before the data is split into training and test/validation sets: we do not want the StandardScaler to include the test set when computing the mean and variance; we want the test set to be scaled with the same mean and variance as the train set.

5.1.4 Define error measure

Define error measure for official scoring : RMSE

scorer = make_scorer(mean_squared_error, greater_is_better = False)

Here make_scorer makes a scorer from a performance metric or loss function. The first argument is the metric (in this case a loss function), and greater_is_better says whether that first argument is a score function (default True) or a loss function (False).

The author defines two functions, one for the train-set RMSE and one for the test-set RMSE. Personally I consider this a slip: with cross-validation, the data passed in plays the role of both train and test set across the different iterations, so there is no need to compute a train-set RMSE and a test-set RMSE separately.

def rmse_cv_train(model):
    rmse = np.sqrt(-cross_val_score(model, X_train, y_train, scoring = scorer, cv = 10))
    return(rmse)

def rmse_cv_test(model):
    rmse = np.sqrt(-cross_val_score(model, X_test, y_test, scoring = scorer, cv = 10))
    return(rmse)

Because the scorer passed to cross_val_score here is a loss function, the returned values are negative, so a minus sign is needed before taking the square root, to sign-flip the outcome of the scorer.

If the scorer were a score function instead, the return value would be scores: array of float, shape=(len(list(cv)),), i.e. the array of scores of the estimator for each run of the cross validation.

[Figure: Cross-Validation K-Fold Instructions]

BTW, Trust your CV score, and not LB score. The leaderboard score is scored only on a small percentage of the full test set. In some cases, it’s only a few hundred test cases. Your cross-validation score will be much more reliable in general.

5.2 Model 1: Linear Regression without regularization

lr = LinearRegression()
lr.fit(X_train, y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

Here LinearRegression() is ordinary least squares linear regression.

The predictions can be inspected with plots: the residuals, and the predicted values against the actual ones.

plt.scatter(y_train_pred, y_train_pred - y_train, c = "blue", marker = "s", label = "Training data")  # residuals (scatter plot)
plt.scatter(y_test_pred, y_test_pred - y_test, c = "lightgreen", marker = "s", label = "Validation data")

plt.scatter(y_train_pred, y_train, c = "blue", marker = "s", label = "Training data")  # predictions vs. actual values
plt.scatter(y_test_pred, y_test, c = "lightgreen", marker = "s", label = "Validation data")

5.3 Model 2: Linear Regression with Ridge regularization (L2 penalty)

Regularization is a very useful method to handle collinearity, filter out noise from data, and eventually prevent overfitting. The concept behind regularization is to introduce additional information (bias) to penalize extreme parameter weights. The goal of this learning problem is to find a function that fits or predicts the outcome (label) and minimizes the expected error over all possible inputs and labels.


L1 penalty: the absolute sum of the weights is added to the cost function.

L2 penalty: Ridge regression is an L2 penalized model where we simply add the squared sum of the weights to our cost function.

For the difference between the L1 and L2 penalties, see the "As Regularization/loss function" part of the article referenced above; 林軒田 (Hsuan-Tien Lin)'s Machine Learning Foundations, lecture 14-4 General Regularizers (13-28), also covers it in detail.
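In symbols (standard textbook form with regularization strength $\lambda$; these formulas are not copied from the kernel), the two penalized least-squares costs are:

$$J_{\text{ridge}}(w)=\sum_{i=1}^{n}\bigl(y_i-w^{T}x_i\bigr)^2+\lambda\sum_{j=1}^{m}w_j^{2},\qquad J_{\text{lasso}}(w)=\sum_{i=1}^{n}\bigl(y_i-w^{T}x_i\bigr)^2+\lambda\sum_{j=1}^{m}\lvert w_j\rvert$$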

5.3.1 Finding a suitable Ridge regression model

First search

ridge = RidgeCV(alphas = [0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6, 10, 30, 60])     
ridge.fit(X_train, y_train)
alpha = ridge.alpha_ # Estimated regularization parameter.

Here RidgeCV() is ridge regression with built-in cross-validation.

Second search
One point to note: after obtaining alpha from the first search, search again with that alpha as the midpoint of a new grid of candidate alphas.

ridge = RidgeCV(
alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85,  alpha * .9, alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15, alpha * 1.25, alpha * 1.3, alpha * 1.35, alpha * 1.4],  cv = 10)
ridge.fit(X_train, y_train)
alpha = ridge.alpha_

Question: why does the second RidgeCV call add the parameter cv = 10 when the first one does not?

5.3.2 Computing predictions with the chosen Ridge model

First check the RMSE. In my view, if RMSE.mean() is unsatisfactory here, one should go back to the earlier steps and fine-tune them. What counts as unsatisfactory depends on the range of the raw data: for a datum which ranges from 0 to 1000, an RMSE of 0.7 is small, but if the range goes from 0 to 1, it is not that small any more; so there is no absolute threshold.

If the RMSE is not acceptable, I would first go back and adjust alpha and the other hyperparameters; if it is still not acceptable, re-examine the data, e.g. the polynomial terms and the merged/simplified features. While tuning, change only one parameter (or one tightly related group of parameters) at a time.

print("Ridge RMSE on Training set :", rmse_cv_train(ridge).mean())
print("Ridge RMSE on Test set :", rmse_cv_test(ridge).mean())

If the RMSE is acceptable, compute the predictions.

y_train_rdg = ridge.predict(X_train)
y_test_rdg = ridge.predict(X_test)
5.3.3 Checking the Ridge model results with plots

Plot the residuals (predicted value minus actual value), for both y_train_rdg and y_test_rdg.

plt.scatter(y_train_rdg, y_train_rdg - y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(y_test_rdg, y_test_rdg - y_test, c = "lightgreen", marker = "s", label = "Validation data")

To read a residual plot properly: a good residual plot has the following characteristics:

(1) the points are pretty symmetrically distributed, tending to cluster towards the middle of the plot;
(2) they are clustered around the lower single digits of the y-axis (e.g., 0.5 or 1.5, not 30 or 150);
(3) in general there are no clear patterns.

[Figure: residuals]

The article Interpreting residual plots to improve your regression describes in some detail what residual plots mean and gives suggestions on how to fix the problems they reveal.

To express how well the data fit the chosen model, a commonly used quantity is R-squared. R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. 0% indicates that the model explains none of the variability of the response data around its mean. In general, the higher the R-squared (its range is [0, 1]), the better the model fits your data, although there are important caveats to that rule of thumb.
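A minimal sketch of computing R-squared for the ridge predictions with scikit-learn (assuming y_test and y_test_rdg from the previous steps are in scope):

from sklearn.metrics import r2_score

# R^2 on the validation part of the split; closer to 1 means the model
# explains more of the variance of log(SalePrice).
print("R2 on validation set:", r2_score(y_test, y_test_rdg))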

Returning to residual plots, the common problem patterns are:

  • Y-axis unbalanced:
[Figure: Y-axis Unbalanced]

Possible fixes:
The solution to this is almost always to transform your data (typically by taking the log), usually the response variable.
It's also possible that your model lacks a variable.

  • A nonlinear pattern:
[Figure: non-linear]

Note that a somewhat better, yet still suboptimal, fit may look like this:

[Figure: suboptimal]

For such a model: if you're getting a quick understanding of the relationship, your straight line is a pretty decent approximation. If you're going to use this model for prediction and not explanation, the most accurate possible model would probably account for that curve.

Possible fixes:
Sometimes patterns like this indicate that a variable needs to be transformed.
If the pattern is actually as clear as these examples, you probably need to create a nonlinear model (it’s not as hard as that sounds).
Or, as always, it’s possible that the issue is a missing variable.

  • outliers
[Figure: outliers]

Possible fixes:

It’s possible that this is a measurement or data entry error, where the outlier is just wrong, in which case you should delete it.
It’s possible that what appears to be just a couple outliers is in fact a power distribution. Consider transforming the variable if one of your variables has an asymmetric distribution (that is, it’s not remotely bell-shaped).
If it is indeed a legitimate outlier, you should assess the impact of the outlier.

  • Large Y-axis Datapoints
[Figure: Large Y-axis Datapoints]

Possible fixes:
Even though this approach wouldn’t work in the specific example above, it’s almost always worth looking around to see if there’s an opportunity to usefully transform a variable.
If that doesn’t work, though, you probably need to deal with your missing variable problem.

  • X-axis Unbalanced
[Figure: X-axis Unbalanced]

This pattern does not necessarily mean the model predicts poorly; look at the Predicted vs Actual plot. The fit may still be fine (residuals are unbalanced but predictions are accurate), and it is also possible that after some tuning the predictions actually get worse.

Possible fixes:
The solution to this is almost always to transform your data, typically an explanatory variable. (Note that the example shown below will reference transforming your reponse variable, but the same process will be helpful here.)
It’s also possible that your model lacks a variable.

Plot the predictions directly, for both y_train_rdg and y_test_rdg.

plt.scatter(y_train_rdg, y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(y_test_rdg, y_test, c = "lightgreen", marker = "s", label = "Validation data")

Plot the most important coefficients: as with other linear models, Ridge takes arrays X, y in its fit method and stores the coefficients w of the linear model in its coef_ member:

coefs = pd.Series(ridge.coef_, index = X_train.columns)  # ridge.coef_ is the weight vector w
imp_coefs = pd.concat([coefs.sort_values().head(10),
                 coefs.sort_values().tail(10)])
imp_coefs.plot(kind = "barh")

5.4 Model 3: Linear Regression with Lasso regularization (L1 penalty)

LASSO stands for Least Absolute Shrinkage and Selection Operator. It is an alternative regularization method in which the sum of squared weights used by Ridge is replaced by the sum of the absolute values of the weights. Unlike L2 regularization, L1 regularization yields sparse feature vectors, i.e. most feature weights are exactly zero. Sparsity is useful in practice, especially for datasets with many dimensions and many mutually irrelevant features.

5.4.1 Finding a suitable Lasso regression model

As with the Ridge model, finding a suitable Lasso regression model also takes two searches.

First search

lasso = LassoCV(alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 
                      0.3, 0.6, 1], 
            max_iter = 50000, cv = 10)
lasso.fit(X_train, y_train)
alpha = lasso.alpha_

Second search

Question: as with the Ridge model, the second search adds the parameter cv = 10; unlike Ridge, the Lasso searches additionally pass max_iter = 50000.

lasso = LassoCV(alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8,  alpha * .85, alpha * .9, alpha * .95, alpha, alpha * 1.05,  alpha * 1.1, alpha * 1.15, alpha * 1.25, alpha * 1.3, alpha * 1.35, alpha * 1.4], max_iter = 50000, cv = 10)
lasso.fit(X_train, y_train)
alpha = lasso.alpha_
5.4.2 Computing predictions with the chosen Lasso model

As with the Ridge model, first print the RMSE.

print("Lasso RMSE on Training set :", rmse_cv_train(lasso).mean())
print("Lasso RMSE on Test set :", rmse_cv_test(lasso).mean())

Compute the predictions

y_train_las = lasso.predict(X_train)
y_test_las = lasso.predict(X_test)
5.4.3 Checking the Lasso model results with plots

As with the Ridge model, there are three steps:

  1. Plot the residuals (predicted minus actual values), for both y_train_las and y_test_las.
  2. Plot the predictions directly, for both y_train_las and y_test_las.
  3. Plot the most important coefficients.

Comparing Lasso with the Ridge model: Lasso's RMSE is better on both the training and the test sets. Notably, Lasso used only one third of the available features; also notable is that Lasso seems to give the neighborhood categories a larger weight, and intuitively the neighborhood does play a crucial role in house prices. The feature count can be checked directly, as sketched below.
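Because L1 regularization drives most coefficients to exactly zero, the number of selected features can be read off coef_ (a small sketch, reusing the coefs idea from the Ridge section):

# Count how many features Lasso actually uses (non-zero coefficients)
coefs = pd.Series(lasso.coef_, index = X_train.columns)
print("Lasso picked " + str(sum(coefs != 0)) + " features and eliminated the other " +
      str(sum(coefs == 0)) + " features")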

5.5 Model 4: Linear Regression with ElasticNet regularization (L1 and L2 penalty)

ElasticNet is a compromise between Ridge and Lasso regression. It has an L1 penalty to generate sparsity and an L2 penalty to overcome some of Lasso's limitations, such as the limit on the number of selected variables (Lasso can't select more features than it has observations, but that is not an issue here anyway).

5.5.1 Finding a suitable ElasticNet model

First search

elasticNet = ElasticNetCV(l1_ratio = [0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1],
                      alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 
                                0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6], 
                      max_iter = 50000, cv = 10)
elasticNet.fit(X_train, y_train)
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_

Here l1_ratio is a float between 0 and 1 passed to ElasticNet (scaling between the L1 and L2 penalties).
For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
This parameter can be a list, in which case the different values are tested by cross-validation and the one giving the best prediction score is used. Note that a good choice of list of values for l1_ratio is often to put more values close to 1 (i.e. Lasso) and less close to 0 (i.e. Ridge), as in [.1, .5, .7, .9, .95, .99, 1]

Second search (for the ratio)

Keeping the alpha found previously for now, take ratio values in a window around the ratio found in the first search and search again.

elasticNet = ElasticNetCV(l1_ratio = [ratio * .85, ratio * .9, ratio * .95, ratio, ratio * 1.05, ratio * 1.1, ratio * 1.15],
                      alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6], 
                      max_iter = 50000, cv = 10)
elasticNet.fit(X_train, y_train)
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_

One detail to watch: elasticNet.l1_ratio_ must lie in [0, 1], so the candidate values must not leave that range. If one does, fold it back to the nearest boundary, i.e. set it to 0 or 1.
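One way to enforce that bound in code (my own sketch, not taken from the kernel):

# Fold the ratio back into the valid [0, 1] range if the grid around the
# previous value pushed it outside.
ratio = min(max(ratio, 0.0), 1.0)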

Third search (for alpha)

Using the ratio just obtained, take alpha values in a window around the alpha found in the first search and search a third time.

elasticNet = ElasticNetCV(l1_ratio = ratio,
                     alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85, alpha * .9, 
                                alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15, alpha * 1.25, alpha * 1.3, 
                                alpha * 1.35, alpha * 1.4], 
                      max_iter = 50000, cv = 10)
elasticNet.fit(X_train, y_train)
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_

As in the previous step, remember that elasticNet.l1_ratio_ must lie in [0, 1]; if a candidate falls outside that range, fold it back to the nearest boundary, i.e. 0 or 1.

5.5.2 Computing predictions with the chosen ElasticNetCV model

As with the Ridge and Lasso models, first print the RMSE of the ElasticNetCV model.

print("ElasticNet RMSE on Training set :", rmse_cv_train(elasticNet).mean())
print("ElasticNet RMSE on Test set :", rmse_cv_test(elasticNet).mean())

Compute the predictions

y_train_ela = elasticNet.predict(X_train)
y_test_ela = elasticNet.predict(X_test)
5.5.3 Checking the ElasticNetCV model results with plots

As with the Ridge and Lasso models, there are three steps:

  1. Plot the residuals (predicted minus actual values), for both y_train_ela and y_test_ela.
  2. Plot the predictions directly, for both y_train_ela and y_test_ela.
  3. Plot the most important coefficients.

Summary: the best L1 ratio found by ElasticNetCV is 1, i.e. it simply uses the Lasso regressor. In effect the model needs no L2 regularization to compensate for L1's drawbacks.

Conclusion

Linear Regression, applied to a carefully prepared dataset with tuned regularization, gives good predictions, considerably better than what the algorithms that performed well in earlier Kaggle competitions achieve here.

Appendix

Description of the raw training data:

File descriptions

train.csv - the training set
test.csv - the test set
data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms
Data fields

Here's a brief version of what you'll find in the data description file.

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class (graded; the order is meaningful)
MSZoning: The general zoning classification (categorical; no inherent order)
LotFrontage: Linear feet of street connected to property (numerical)
LotArea: Lot size in square feet
Street: Type of road access (categorical)
Alley: Type of alley access (categorical)
LotShape: General shape of property (categorical)
LandContour: Flatness of the property (categorical)
Utilities: Type of utilities available (categorical)
LotConfig: Lot configuration (categorical)
LandSlope: Slope of property (categorical; the order is meaningful)
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale
