> library(pacman)
> p_load(dplyr, readr, caret)
Using the previous section's results as the baseline — without outlier removal the model had a training MSE of 3619.029 and an adjusted R-squared of 0.8603, and with the outlier removed a training MSE of 2690.545 and an adjusted R-squared of 0.8706; the two models' test-set MSEs were 2914.014 and 1672.859, respectively — we now try to improve on these models.
> results <- tribble(~ model, ~ mse, ~ r_square, ~ test_mse,
+ "original", 3619.029, 0.8603, 2914.014,
+ "remove_out", 2690.545, 0.8706, 1672.859)
> results
## # A tibble: 2 x 4
## model mse r_square test_mse
## <chr> <dbl> <dbl> <dbl>
## 1 original 3619. 0.860 2914.
## 2 remove_out 2691. 0.871 1673.
1锭汛、數(shù)據(jù)預(yù)處理
> machine <- read_csv("data_set/machine.data", col_names = F)
> names(machine) <- c("vendor", "model", "myct", "mmin", "mmax",
+ "cach", "chmin", "chmax", "prp", "erp")
> machine <- machine[, 3:9]
>
> set.seed(123)
> ind <- createDataPartition(machine$prp, p = 0.85, list = F)
>
> dtrain <- machine[ind, ]
> dtest <- machine[-ind, ]
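createDataPartition draws a stratified sample based on the outcome, so the distribution of prp should be similar in both subsets. A quick sanity check (a minimal sketch; output omitted — the 85/15 split leaves 179 training rows, consistent with the 173 residual degrees of freedom plus 6 coefficients in the model summaries below):

dim(dtrain)           # expect roughly 85% of the rows
dim(dtest)
summary(dtrain$prp)   # outcome distributions should look similar
summary(dtest$prp)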
2. Reducing the feature set
> ct <- trainControl(preProcOptions = list(cutoff = 0.75))
> set.seed(123)
> fit.step <- train(prp ~ ., data = dtrain, method = "lmStepAIC",
+ trControl = ct, preProcess = c("corr"), trace = F)
>
> summary(fit.step$finalModel)
##
## Call:
## lm(formula = .outcome ~ myct + mmin + mmax + cach + chmax, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -163.94 -29.68 3.25 28.52 355.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.024e+01 8.909e+00 -6.762 2.01e-10 ***
## myct 5.550e-02 1.998e-02 2.777 0.006084 **
## mmin 1.476e-02 2.006e-03 7.358 7.20e-12 ***
## mmax 5.725e-03 6.919e-04 8.275 3.33e-14 ***
## cach 5.693e-01 1.443e-01 3.944 0.000116 ***
## chmax 1.683e+00 2.301e-01 7.313 9.33e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.33 on 173 degrees of freedom
## Multiple R-squared: 0.8644, Adjusted R-squared: 0.8605
## F-statistic: 220.6 on 5 and 173 DF, p-value: < 2.2e-16
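The preProcess = c("corr") step removes predictors whose pairwise correlation exceeds the cutoff (0.75 here) before the stepwise AIC search runs. To see which predictor the filter flags — presumably a channel variable, since chmin is absent from the final model above — caret's findCorrelation can be called directly. A minimal sketch (output omitted):

# Correlation filter over the six predictors, as caret applies internally
cor_mat <- cor(dtrain[, setdiff(names(dtrain), "prp")])
findCorrelation(cor_mat, cutoff = 0.75, names = TRUE)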
> compute_mse <- function(prediction, actual) {
+ mean((prediction - actual) ^ 2)
+ }
>
> compute_mse(fit.step$finalModel$fitted.values, dtrain$prp)
## [1] 3634.847
> compute_mse(predict(fit.step, newdata = dtest), dtest$prp)
## [1] 2785.94
The results of the stepwise regression model:
> results <- bind_rows(results,
+ tibble(model = "step",
+ mse = 3634.847,
+ r_square = 0.8605,
+ test_mse = 2785.94))
> results
## # A tibble: 3 x 4
## model mse r_square test_mse
## <chr> <dbl> <dbl> <dbl>
## 1 original 3619. 0.860 2914.
## 2 remove_out 2691. 0.871 1673.
## 3 step 3635. 0.860 2786.
去掉離群值后管呵,再次逐步回歸:
> dtrain.new <- dtrain[!(rownames(dtrain)) %in% c(173), ]
> set.seed(123)
> fit.step.out <- train(prp ~ ., data = dtrain.new, method = "lmStepAIC",
+ trControl = ct, preProcess = c("corr"), trace = F)
>
> summary(fit.step.out$finalModel)
##
## Call:
## lm(formula = .outcome ~ myct + mmin + mmax + cach + chmax, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -168.560 -23.668 2.268 21.691 271.120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.474e+01 7.930e+00 -5.643 6.78e-08 ***
## myct 4.193e-02 1.731e-02 2.422 0.0165 *
## mmin 1.697e-02 1.752e-03 9.690 < 2e-16 ***
## mmax 4.629e-03 6.125e-04 7.557 2.35e-12 ***
## cach 5.968e-01 1.244e-01 4.797 3.48e-06 ***
## chmax 1.168e+00 2.090e-01 5.588 8.84e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.85 on 172 degrees of freedom
## Multiple R-squared: 0.8702, Adjusted R-squared: 0.8664
## F-statistic: 230.6 on 5 and 172 DF, p-value: < 2.2e-16
> compute_mse(fit.step.out$finalModel$fitted.values, dtrain.new$prp)
## [1] 2698.78
> compute_mse(predict(fit.step.out, newdata = dtest), dtest$prp)
## [1] 1812.763
> results <- bind_rows(results,
+ tibble(model = "step_out",
+ mse = 2698.78,
+ r_square = 0.8664,
+ test_mse = 1812.763))
> results
## # A tibble: 4 x 4
## model mse r_square test_mse
## <chr> <dbl> <dbl> <dbl>
## 1 original 3619. 0.860 2914.
## 2 remove_out 2691. 0.871 1673.
## 3 step 3635. 0.860 2786.
## 4 step_out 2699. 0.866 1813.
刪減特征后梳毙,模型在訓(xùn)練集上的mse都增大了,但在測(cè)試集上的mse卻減小了捐下。去除離群值后账锹,mse都有所增大。
3坷襟、正則化
Both ridge regression and the lasso are worth trying. When the true model depends on only a subset of the input features, the lasso tends to perform better; when the signal is spread across many different variables with coefficients of comparable size, ridge regression tends to perform better.
3.1 Ridge regression
When a dataset's dimensionality is high, especially relative to the number of available observations, linear regression tends to exhibit very high variance.
Ridge regression deliberately introduces bias through its constraint on the coefficients, but in exchange it can effectively reduce the model's variance.
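Concretely, ridge regression minimizes the residual sum of squares plus an L2 penalty on the coefficients (the standard formulation; larger λ shrinks the coefficients more aggressively, trading extra bias for lower variance):

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$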
> set.seed(123)
> fit.ridge <- train(prp ~ ., data = dtrain, method = "ridge",
+ trControl = ct, preProcess = c("corr"))
>
> fit.ridge$bestTune
## lambda
## 3 0.1
> fit.ridge$results$Rsquared[3]
## [1] 0.8058767
> compute_mse(predict(fit.ridge, newdata = dtrain), dtrain$prp)
## [1] 3730.474
> compute_mse(predict(fit.ridge, newdata = dtest), dtest$prp)
## [1] 2958.191
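By default caret evaluates only a small grid of lambda values here (the bestTune row index of 3 suggests three candidates). The resampling profile and full tuning table can be inspected directly (a sketch; output omitted):

plot(fit.ridge)      # resampled RMSE across the lambda grid
fit.ridge$results    # full tuning results table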
> results <- bind_rows(results,
+ tibble(model = "ridge",
+ mse = 3730.474,
+ r_square = 0.8059,
+ test_mse = 2958.191))
> results
## # A tibble: 5 x 4
## model mse r_square test_mse
## <chr> <dbl> <dbl> <dbl>
## 1 original 3619. 0.860 2914.
## 2 remove_out 2691. 0.871 1673.
## 3 step 3635. 0.860 2786.
## 4 step_out 2699. 0.866 1813.
## 5 ridge 3730. 0.806 2958.
3.2 The lasso
The lasso is an alternative regularization method to ridge regression. The difference lies in the penalty term: ridge regression effectively shrinks the coefficients toward smaller values by penalizing their sum of squares, whereas the lasso penalizes the sum of their absolute values. Because the lasso can shrink some coefficients exactly to zero, it performs variable selection as well as shrinkage, which ridge regression does not.
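In the same notation as above, the lasso replaces the squared-coefficient penalty with an L1 penalty; the kink of the absolute value at zero is what allows entire coefficients to be set exactly to zero:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$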
(In the elastic net formulation used by, e.g., glmnet, the mixing parameter alpha = 0 gives ridge regression and alpha = 1 gives the lasso. caret's method = "lasso" instead wraps the elasticnet package and tunes fraction, the ratio of the L1 norm of the shrunken coefficients to that of the full least-squares fit.)
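For reference, a minimal glmnet sketch of the two penalties (assumes the glmnet package is installed; note this is a different backend from the elasticnet one caret uses below):

library(glmnet)
x <- as.matrix(dtrain[, setdiff(names(dtrain), "prp")])
y <- dtrain$prp
fit_ridge_glmnet <- glmnet(x, y, alpha = 0)   # alpha = 0: pure L2 penalty (ridge)
fit_lasso_glmnet <- glmnet(x, y, alpha = 1)   # alpha = 1: pure L1 penalty (lasso)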
> set.seed(123)
> fit.lasso <- train(prp ~ ., data = dtrain, method = "lasso",
+ trControl = ct, preProcess = c("corr"))
>
> fit.lasso$bestTune
## fraction
## 3 0.9
> fit.lasso$results$Rsquared[3]
## [1] 0.7996164
> compute_mse(predict(fit.lasso, newdata = dtrain), dtrain$prp)
## [1] 3664.031
> compute_mse(predict(fit.lasso, newdata = dtest), dtest$prp)
## [1] 2628.372
The selected tuning value is fraction = 0.9: the coefficients are shrunk to 90% of the L1 norm of the full least-squares solution, i.e. the chosen model applies only mild shrinkage.
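To inspect the coefficients at the selected fraction, the elasticnet fit stored in finalModel can be queried with its predict method (a sketch using the package's documented arguments; output omitted):

# Coefficients at the chosen L1-norm fraction; exact zeros would mark
# predictors the lasso has dropped
predict(fit.lasso$finalModel, type = "coefficients",
        s = fit.lasso$bestTune$fraction, mode = "fraction")$coefficients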
> results <- bind_rows(results,
+ tibble(model = "lasso",
+ mse = 3664.031,
+ r_square = 0.7996,
+ test_mse = 2628.372))
> results
## # A tibble: 6 x 4
## model mse r_square test_mse
## <chr> <dbl> <dbl> <dbl>
## 1 original 3619. 0.860 2914.
## 2 remove_out 2691. 0.871 1673.
## 3 step 3635. 0.860 2786.
## 4 step_out 2699. 0.866 1813.
## 5 ridge 3730. 0.806 2958.
## 6 lasso 3664. 0.800 2628.
Comparing all six models, the plain linear regression fit on the data with the outlier removed (remove_out) performs best, with the lowest test-set MSE (1672.859) and the highest adjusted R-squared (0.8706).
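Sorting the comparison table by test-set MSE makes this ranking explicit (output omitted):

results %>% arrange(test_mse)   # remove_out comes out on top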