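Regression analysis is at the core of statistics. It is a general term for methods that use one or more predictor variables (independent variables) to predict a response variable (dependent variable).
OLS regression, short for ordinary least squares regression, mainly covers simple linear regression, polynomial regression, and multiple linear regression.
The theory behind OLS regression is not covered here; it is easy to find online.
This post focuses on how to build regression models in R.
Fitting regression models with lm()
In R, the most basic function for fitting a linear model is lm(), which is called as: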
lm(formula,data=)
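Here formula describes the model to fit, and data is a data frame containing the data used to fit it.
The formula has the following form:
Y ~ X1 + X2 + X3 + ...
Y is the dependent variable and X1, X2, X3 are the independent variables; the formula says to predict Y from X1, X2, and X3.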
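The symbols most commonly used in formulas are:
- ~ Separator: the dependent variable goes on the left, the independent variables on the right
- + Separates the independent variables
- : Denotes an interaction between predictors; for example, predicting y from x, z, and the interaction of x and z is written y~x+z+x:z
- ^ Denotes interactions up to the given degree; y~(x+z+w)^2 expands to y~x+z+w+x:z+x:w+z:w
- I() Interprets the contents of the parentheses arithmetically; y~x+I((z+w)^2) expands to y~x+h, where h is a new variable created by squaring the sum of z and w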
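One way to check how R expands a formula is to ask for its term labels, which needs no data at all. The sketch below does this for the ^ example and then uses a small made-up data frame (the values in d are arbitrary and only for illustration) to confirm what I() produces.

```r
# Expand a formula into its individual terms; no data or model fit is needed.
f <- y ~ (x + z + w)^2
term_labels <- attr(terms(f), "term.labels")
print(term_labels)  # main effects first, then the two-way interactions

# I() makes R treat its contents as ordinary arithmetic.
# The data frame d is made up purely for illustration.
d <- data.frame(y = c(1, 2, 3, 4),
                x = c(1, 0, 1, 0),
                z = c(1, 2, 3, 4),
                w = c(2, 1, 0, 3))
mm <- model.matrix(y ~ x + I((z + w)^2), data = d)

# Column 1 is the intercept, column 2 is x, column 3 is h = (z + w)^2.
h <- unname(mm[, 3])
print(h)
```

Without I(), the ^ inside a formula would be read as the interaction operator rather than as a power, which is why the wrapper is needed for arithmetic.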
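Simple linear regression
When a regression model contains just one independent variable and one dependent variable, it is called simple linear regression.
The example data come from R's built-in women dataset, which contains the heights and weights of 15 women aged 30 to 39.
Here we want to predict weight from height.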
> women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
> fit<-lm(weight~height,data = women)
> fit
Call:
lm(formula = weight ~ height, data = women)
Coefficients:  # the intercept is -87.52 and the regression coefficient (slope) is 3.45
(Intercept) height
-87.52 3.45
> summary(fit)  # show detailed results for the fitted model
Call:
lm(formula = weight ~ height, data = women)
Residuals:
Min 1Q Median 3Q Max
-1.7333 -1.1333 -0.3833 0.7417 3.1167
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
height 3.45000 0.09114 37.85 1.09e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.525 on 13 degrees of freedom  # the residual standard error
Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
> fitted(fit)  # show the fitted (predicted) values
1 2 3 4 5 6 7 8 9 10 11
112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333 140.1833 143.6333 147.0833
12 13 14 15
150.5333 153.9833 157.4333 160.8833
> residuals(fit)  # show the residuals; residual = actual value - fitted value
1 2 3 4 5 6 7 8
2.41666667 0.96666667 0.51666667 0.06666667 -0.38333333 -0.83333333 -1.28333333 -1.73333333
9 10 11 12 13 14 15
-1.18333333 -1.63333333 -1.08333333 -0.53333333 0.01666667 1.56666667 3.11666667
> plot(women$height,women$weight)  # scatter plot of the data
> abline(fit)  # add the fitted regression line
From the steps above we obtain the prediction equation
weight = -87.52 + 3.45*height
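The prediction equation can be applied either by hand or through predict(). The following is a minimal sketch using the same women dataset; the height of 65 is just an illustrative input. It also checks that the residuals reported by R are the observed values minus the fitted values.

```r
# Refit the simple linear model on the women dataset (ships with R).
fit <- lm(weight ~ height, data = women)

# Predict the weight for a height of 65 in two ways.
new_heights <- data.frame(height = 65)
pred_predict <- unname(predict(fit, newdata = new_heights))
pred_manual <- unname(coef(fit)["(Intercept)"] + coef(fit)["height"] * 65)
print(c(predict = pred_predict, manual = pred_manual))

# Residuals are the observed values minus the fitted values.
res_manual <- women$weight - fitted(fit)
res_match <- all.equal(unname(res_manual), unname(residuals(fit)))
print(res_match)
```

Using predict() with a newdata data frame is usually the safer route, because it uses the unrounded coefficients and the column name must match the predictor used in the formula.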
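Polynomial regression
The plot above suggests that adding a quadratic term, which lets the fitted line curve, could improve the accuracy of the predictions.
When there is only one independent variable but the model also includes powers of it (x^2, x^3), it is called polynomial regression.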
> fit<-lm(weight~height+I(height^2),data = women)
> fit
Call:
lm(formula = weight ~ height + I(height^2), data = women)
Coefficients:
(Intercept) height I(height^2)
261.87818 -7.34832 0.08306
> summary(fit)
Call:
lm(formula = weight ~ height + I(height^2), data = women)
Residuals:
Min 1Q Median 3Q Max
-0.50941 -0.29611 -0.00941 0.28615 0.59706
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 261.87818 25.19677 10.393 2.36e-07 ***
height -7.34832 0.77769 -9.449 6.58e-07 ***
I(height^2) 0.08306 0.00598 13.891 9.32e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3841 on 12 degrees of freedom
Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
> plot(women$height,women$weight)
> lines(women$height,fitted(fit))
The new model follows the observed values more closely than the earlier one.
This gives the prediction equation
weight = 261.87818 - 7.34832*height + 0.08306*height^2
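To back up the visual impression, the linear and quadratic models can be compared with an F test for nested models. The sketch below is one way to do that with anova(); it also shows that poly(height, 2, raw = TRUE) is an alternative way of writing the same quadratic model, and evaluates the prediction equation by hand at an illustrative height of 65.

```r
# Fit the straight-line and quadratic models on the women dataset.
fit_linear <- lm(weight ~ height, data = women)
fit_quad <- lm(weight ~ height + I(height^2), data = women)

# poly(..., raw = TRUE) builds the same raw powers of height,
# so this is the same model written differently.
fit_poly <- lm(weight ~ poly(height, 2, raw = TRUE), data = women)
same_fit <- all.equal(unname(fitted(fit_quad)), unname(fitted(fit_poly)))
print(same_fit)

# F test for nested models: does the quadratic term improve the fit?
cmp <- anova(fit_linear, fit_quad)
print(cmp)

# Evaluate the prediction equation by hand and compare with predict().
b <- unname(coef(fit_quad))
manual_65 <- b[1] + b[2] * 65 + b[3] * 65^2
predict_65 <- unname(predict(fit_quad, newdata = data.frame(height = 65)))
print(c(manual = manual_65, predict = predict_65))
```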
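Multiple linear regression
When there is more than one predictor (independent variable), simple linear regression becomes multiple linear regression.
Using the state.x77 dataset from the base installation as an example, we examine how a state's murder rate relates to other factors: population, illiteracy rate, income, and number of frost days.
lm() expects its data as a data frame, so the original dataset, which is a matrix, has to be converted first.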
> class(state.x77)
[1] "matrix"
> state<-as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> class(state)
[1] "data.frame"
> state
Murder Population Illiteracy Income Frost
Alabama 15.1 3615 2.1 3624 20
Alaska 11.3 365 1.5 6315 152
Arizona 7.8 2212 1.8 4530 15
Arkansas 10.1 2110 1.9 3378 65
California 10.3 21198 1.1 5114 20
Colorado 6.8 2541 0.7 4884 166
Connecticut 3.1 3100 1.1 5348 139
Delaware 6.2 579 0.9 4809 103
Florida 10.7 8277 1.3 4815 11
Georgia 13.9 4931 2.0 4091 60
Hawaii 6.2 868 1.9 4963 0
Idaho 5.3 813 0.6 4119 126
Illinois 10.3 11197 0.9 5107 127
Indiana 7.1 5313 0.7 4458 122
Iowa 2.3 2861 0.5 4628 140
Kansas 4.5 2280 0.6 4669 114
Kentucky 10.6 3387 1.6 3712 95
Louisiana 13.2 3806 2.8 3545 12
Maine 2.7 1058 0.7 3694 161
Maryland 8.5 4122 0.9 5299 101
Massachusetts 3.3 5814 1.1 4755 103
Michigan 11.1 9111 0.9 4751 125
Minnesota 2.3 3921 0.6 4675 160
Mississippi 12.5 2341 2.4 3098 50
Missouri 9.3 4767 0.8 4254 108
Montana 5.0 746 0.6 4347 155
Nebraska 2.9 1544 0.6 4508 139
Nevada 11.5 590 0.5 5149 188
New Hampshire 3.3 812 0.7 4281 174
New Jersey 5.2 7333 1.1 5237 115
New Mexico 9.7 1144 2.2 3601 120
New York 10.9 18076 1.4 4903 82
North Carolina 11.1 5441 1.8 3875 80
North Dakota 1.4 637 0.8 5087 186
Ohio 7.4 10735 0.8 4561 124
Oklahoma 6.4 2715 1.1 3983 82
Oregon 4.2 2284 0.6 4660 44
Pennsylvania 6.1 11860 1.0 4449 126
Rhode Island 2.4 931 1.3 4558 127
South Carolina 11.6 2816 2.3 3635 65
South Dakota 1.7 681 0.5 4167 172
Tennessee 11.0 4173 1.7 3821 70
Texas 12.2 12237 2.2 4188 35
Utah 4.5 1203 0.6 4022 137
Vermont 5.5 472 0.6 3907 168
Virginia 9.5 4981 1.4 4701 85
Washington 4.3 3559 0.6 4864 32
West Virginia 6.7 1799 1.4 3617 100
Wisconsin 3.0 4589 0.7 4468 149
Wyoming 6.9 376 0.6 4566 173
In multiple regression, it is a good idea to check the correlations among the variables first.
> cor(state)
Murder Population Illiteracy Income Frost
Murder 1.0000000 0.3436428 0.7029752 -0.2300776 -0.5388834
Population 0.3436428 1.0000000 0.1076224 0.2082276 -0.3321525
Illiteracy 0.7029752 0.1076224 1.0000000 -0.4370752 -0.6719470
Income -0.2300776 0.2082276 -0.4370752 1.0000000 0.2262822
Frost -0.5388834 -0.3321525 -0.6719470 0.2262822 1.0000000
The correlations show that murder rate rises with population and illiteracy rate, and falls as income and the number of frost days increase.
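When the correlation matrix gets larger, it can help to pull out just the row for the response and rank the predictors by strength. The following is a small sketch along those lines; the variable names murder_cor and ranked are only illustrative.

```r
# Rebuild the data frame used in this section from the built-in matrix.
state <- as.data.frame(
  state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")]
)

# Correlation of Murder with each predictor (drop Murder's correlation with itself).
murder_cor <- cor(state)["Murder", -1]

# Rank the predictors by the absolute size of their correlation with Murder.
ranked <- murder_cor[order(abs(murder_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

These are pairwise correlations, so each one ignores the other predictors; a variable that correlates strongly with the response can still turn out to be non-significant once the others are in the model.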
The scatterplotMatrix() function in the car package draws a scatter plot matrix, which makes it easy to plot the pairwise relationships.
> library(car)
Loading required package: carData
> scatterplotMatrix(state)
With the correlations examined, we can fit the multiple linear regression model with lm().
> fit<-lm(Murder~Population+Illiteracy+Income+Frost,data = state)
> fit
Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost,
data = state)
Coefficients:
(Intercept) Population Illiteracy Income Frost
1.235e+00 2.237e-04 4.143e+00 6.442e-05 5.813e-04
> summary(fit)
Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost,
data = state)
Residuals:
Min 1Q Median 3Q Max
-4.7960 -1.6495 -0.0811 1.4815 7.6210
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.235e+00 3.866e+00 0.319 0.7510
Population 2.237e-04 9.052e-05 2.471 0.0173 *
Illiteracy 4.143e+00 8.744e-01 4.738 2.19e-05 ***
Income 6.442e-05 6.837e-04 0.094 0.9253
Frost 5.813e-04 1.005e-02 0.058 0.9541
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08
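The output shows that illiteracy rate has a highly significant linear relationship with murder rate (p < 0.001). Population is also significant at the 0.05 level, while income and frost are not significant once the other predictors are in the model.
With more than one independent variable, a regression coefficient is the expected change in the response variable when that predictor increases by one unit and the other predictors are held constant.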
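This interpretation can be checked directly: predict the murder rate for two inputs that differ only by one unit of illiteracy, and the difference should equal the Illiteracy coefficient. The sketch below does so with two hypothetical states (state_a and state_b are made-up illustration values, not real data) and also prints confidence intervals for the coefficients with confint().

```r
# Rebuild the data frame and refit the model from this section.
state <- as.data.frame(
  state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")]
)
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = state)

# Two hypothetical states that differ only in illiteracy rate, by one unit.
# These values are arbitrary and chosen only for illustration.
state_a <- data.frame(Population = 4000, Illiteracy = 1.0,
                      Income = 4500, Frost = 100)
state_b <- transform(state_a, Illiteracy = Illiteracy + 1)

# The difference in predictions equals the Illiteracy coefficient.
diff_pred <- unname(predict(fit, state_b) - predict(fit, state_a))
illiteracy_coef <- unname(coef(fit)["Illiteracy"])
print(c(difference = diff_pred, coefficient = illiteracy_coef))

# 95% confidence intervals for all coefficients.
ci <- confint(fit)
print(ci)
```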
Multiple linear regression with interaction terms
The example data come from the mtcars dataset: we predict a car's miles per gallon (mpg) from its weight (wt) and horsepower (hp).
> class(mtcars)
[1] "data.frame"
> fit<-lm(mpg~wt+hp+wt:hp,data = mtcars)
> fit
Call:
lm(formula = mpg ~ wt + hp + wt:hp, data = mtcars)
Coefficients:
(Intercept) wt hp wt:hp
49.80842 -8.21662 -0.12010 0.02785
> summary(fit)
Call:
lm(formula = mpg ~ wt + hp + wt:hp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.0632 -1.6491 -0.7362 1.4211 4.5513
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.80842 3.60516 13.816 5.01e-14 ***
wt -8.21662 1.26971 -6.471 5.20e-07 ***
hp -0.12010 0.02470 -4.863 4.04e-05 ***
wt:hp 0.02785 0.00742 3.753 0.000811 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.153 on 28 degrees of freedom
Multiple R-squared: 0.8848, Adjusted R-squared: 0.8724
F-statistic: 71.66 on 3 and 28 DF, p-value: 2.981e-13
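The output shows that the interaction between horsepower and car weight is significant.
This means that the relationship between mpg and one of these predictors depends on the value of the other: for example, the effect of horsepower on mpg changes with the car's weight.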
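One way to see what the interaction means is to compute the slope of mpg with respect to hp at a few different weights, using slope = b_hp + b_wt:hp * wt. The sketch below does this for three illustrative weights (the values in wt_values are chosen for illustration; wt in mtcars is measured in units of 1000 lbs). It also shows that mpg ~ wt * hp is shorthand for the same model.

```r
# Refit the interaction model on the built-in mtcars dataset.
fit <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# wt * hp expands to wt + hp + wt:hp, so this is the same model.
fit_star <- lm(mpg ~ wt * hp, data = mtcars)
same_model <- all.equal(coef(fit), coef(fit_star))
print(same_model)

# Slope of mpg with respect to hp at a given weight: b_hp + b_wt:hp * wt.
b <- coef(fit)
wt_values <- c(2.2, 3.2, 4.2)  # illustrative weights
hp_slope <- unname(b["hp"] + b["wt:hp"] * wt_values)
print(data.frame(wt = wt_values, hp_slope = hp_slope))
```

The slope moves toward zero as wt grows, so under this model an extra unit of horsepower costs fewer miles per gallon in a heavy car than in a light one.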