Learning R: OLS Regression

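Regression analysis is at the core of statistics. It is a general term for methods that use one or more predictor variables (independent variables) to predict a response variable (dependent variable).
OLS regression, short for ordinary least squares regression, mainly covers simple linear regression, polynomial regression and multiple linear regression.
For more on the theory behind OLS regression, you can look it up online.
This post focuses on how to build regression models in R.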

Fitting regression models with lm()

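In R, the most basic function for fitting a linear model is lm(). It is called as:
lm(formula, data = )
where formula is a formula describing the model to fit, and data is a data frame containing the data used to fit it.
A formula has the following form:
Y ~ X1 + X2 + X3 + ...
Y is the dependent variable and X1, X2, X3 are the independent variables; the formula says to predict Y from X1, X2 and X3.
The symbols most commonly used in formulas are: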

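~    Separator: the dependent variable goes on the left, the independent variables on the right
+    Separates independent variables
:    Denotes an interaction between predictors; for example, predicting y from x, z and the interaction of x and z is y ~ x + z + x:z
^    Denotes all interactions up to the given degree; y ~ (x + z + w)^2 expands to y ~ x + z + w + x:z + x:w + z:w
I()  Interprets the contents of the parentheses arithmetically; y ~ x + I((z + w)^2) expands to y ~ x + h, where h is a new variable created by squaring the sum of z and w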
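To see how these operators expand, one option is to ask R for the term labels of a formula. The sketch below uses terms() from the stats package; the helper name expand_terms and the variables x, z, w and y are placeholders introduced here for illustration, and no data is needed.

```r
# Hypothetical helper: list the terms R derives from a formula.
expand_terms <- function(f) attr(terms(f), "term.labels")

# ^2 requests all main effects plus all two-way interactions.
crossed <- expand_terms(y ~ (x + z + w)^2)
print(crossed)

# I() protects the arithmetic, so (z + w)^2 stays a single term.
arith <- expand_terms(y ~ x + I((z + w)^2))
print(arith)
```

The first call lists three main effects and three two-way interactions, while the second keeps the I() expression as one term alongside x.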

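Simple linear regression

When a regression model contains exactly one independent variable and one dependent variable, it is called simple linear regression.
The example data come from R's built-in women dataset, which holds the heights and weights of 15 women aged 30 to 39.
Here we want to predict weight from height.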

> women
   height weight
1      58    115
2      59    117
3      60    120
4      61    123
5      62    126
6      63    129
7      64    132
8      65    135
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164
> fit<-lm(weight~height,data = women)
> fit

Call:
lm(formula = weight ~ height, data = women)

Coefficients:  # the intercept is -87.52 and the regression coefficient (slope) is 3.45
(Intercept)       height  
     -87.52         3.45  
> summary(fit)  # show detailed results for the fitted model

Call:
lm(formula = weight ~ height, data = women)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7333 -1.1333 -0.3833  0.7417  3.1167 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.525 on 13 degrees of freedom  # the residual standard error
Multiple R-squared:  0.991, Adjusted R-squared:  0.9903 
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
> fitted(fit)  # show the fitted (predicted) values
       1        2        3        4        5        6        7        8        9       10       11 
112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333 140.1833 143.6333 147.0833 
      12       13       14       15 
150.5333 153.9833 157.4333 160.8833 
> residuals(fit)  # show the residuals; residual = observed value - fitted value
          1           2           3           4           5           6           7           8 
 2.41666667  0.96666667  0.51666667  0.06666667 -0.38333333 -0.83333333 -1.28333333 -1.73333333 
          9          10          11          12          13          14          15 
-1.18333333 -1.63333333 -1.08333333 -0.53333333  0.01666667  1.56666667  3.11666667
> plot(women$height,women$weight)  # scatter plot of the data
> abline(fit)  # add the fitted regression line

[Figure: scatter plot of weight against height with the fitted regression line]

From the steps above we obtain the prediction equation:
weight = -87.52 + 3.45 * height
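As a rough check of the relationship between observed values, fitted values and residuals, the sketch below refits the same model, recomputes the residuals by hand, and uses predict() on new heights. The heights 64.5 and 70 are arbitrary values chosen here for illustration.

```r
# Refit the simple linear model from the transcript above.
fit <- lm(weight ~ height, data = women)

# Residuals are observed minus fitted values.
manual_resid <- women$weight - fitted(fit)
same_resid <- isTRUE(all.equal(unname(manual_resid), unname(residuals(fit))))
print(same_resid)

# Predict weight for heights that are not in the data (illustrative values).
new_heights <- data.frame(height = c(64.5, 70))
pred <- predict(fit, newdata = new_heights)
print(pred)
```

predict() needs a data frame whose column name matches the predictor in the formula, here height.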

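Polynomial regression

The plot above suggests that adding a quadratic term, which gives a curved fit, could improve the accuracy of the predictions.
When there is only one independent variable but the model also includes powers of it (x^2, x^3), it is called polynomial regression.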

> fit<-lm(weight~height+I(height^2),data = women)
> fit

Call:
lm(formula = weight ~ height + I(height^2), data = women)

Coefficients:
(Intercept)       height  I(height^2)  
  261.87818     -7.34832      0.08306  
> summary(fit)

Call:
lm(formula = weight ~ height + I(height^2), data = women)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50941 -0.29611 -0.00941  0.28615  0.59706 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
height       -7.34832    0.77769  -9.449 6.58e-07 ***
I(height^2)   0.08306    0.00598  13.891 9.32e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3841 on 12 degrees of freedom
Multiple R-squared:  0.9995,    Adjusted R-squared:  0.9994 
F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16
> plot(women$height,women$weight)
> lines(women$height,fitted(fit))

[Figure: scatter plot of weight against height with the fitted quadratic curve]

The new model follows the observed values more closely than the previous one.
The prediction equation is therefore:
weight = 261.87818 - 7.34832 * height + 0.08306 * height^2
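One way to back up the visual impression is to compare the two nested models with an F test via anova(). The sketch below also fits the same quadratic using poly() with raw = TRUE, an alternative spelling that is not used in the original post; the object names are chosen here for illustration.

```r
# Linear and quadratic fits on the women data.
fit_linear <- lm(weight ~ height, data = women)
fit_quad   <- lm(weight ~ height + I(height^2), data = women)

# F test for whether the quadratic term improves on the linear model.
comparison <- anova(fit_linear, fit_quad)
print(comparison)
p_value <- comparison[2, "Pr(>F)"]

# Same quadratic model written with raw polynomial terms.
fit_poly <- lm(weight ~ poly(height, 2, raw = TRUE), data = women)
same_coef <- isTRUE(all.equal(unname(coef(fit_quad)), unname(coef(fit_poly))))
print(same_coef)
```

A small p-value in the anova() table indicates that the quadratic term explains variation the straight line misses.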

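Multiple linear regression

When there is more than one predictor (independent variable), simple linear regression becomes multiple linear regression.
Using the state.x77 dataset that ships with R as an example, we explore how the murder rate relates to other factors: population, illiteracy rate, income and number of frost days.
lm() expects a data frame, so we first convert the original dataset, which is a matrix.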

> class(state.x77)
[1] "matrix"
> state<-as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> class(state)
[1] "data.frame"
> state
               Murder Population Illiteracy Income Frost
Alabama          15.1       3615        2.1   3624    20
Alaska           11.3        365        1.5   6315   152
Arizona           7.8       2212        1.8   4530    15
Arkansas         10.1       2110        1.9   3378    65
California       10.3      21198        1.1   5114    20
Colorado          6.8       2541        0.7   4884   166
Connecticut       3.1       3100        1.1   5348   139
Delaware          6.2        579        0.9   4809   103
Florida          10.7       8277        1.3   4815    11
Georgia          13.9       4931        2.0   4091    60
Hawaii            6.2        868        1.9   4963     0
Idaho             5.3        813        0.6   4119   126
Illinois         10.3      11197        0.9   5107   127
Indiana           7.1       5313        0.7   4458   122
Iowa              2.3       2861        0.5   4628   140
Kansas            4.5       2280        0.6   4669   114
Kentucky         10.6       3387        1.6   3712    95
Louisiana        13.2       3806        2.8   3545    12
Maine             2.7       1058        0.7   3694   161
Maryland          8.5       4122        0.9   5299   101
Massachusetts     3.3       5814        1.1   4755   103
Michigan         11.1       9111        0.9   4751   125
Minnesota         2.3       3921        0.6   4675   160
Mississippi      12.5       2341        2.4   3098    50
Missouri          9.3       4767        0.8   4254   108
Montana           5.0        746        0.6   4347   155
Nebraska          2.9       1544        0.6   4508   139
Nevada           11.5        590        0.5   5149   188
New Hampshire     3.3        812        0.7   4281   174
New Jersey        5.2       7333        1.1   5237   115
New Mexico        9.7       1144        2.2   3601   120
New York         10.9      18076        1.4   4903    82
North Carolina   11.1       5441        1.8   3875    80
North Dakota      1.4        637        0.8   5087   186
Ohio              7.4      10735        0.8   4561   124
Oklahoma          6.4       2715        1.1   3983    82
Oregon            4.2       2284        0.6   4660    44
Pennsylvania      6.1      11860        1.0   4449   126
Rhode Island      2.4        931        1.3   4558   127
South Carolina   11.6       2816        2.3   3635    65
South Dakota      1.7        681        0.5   4167   172
Tennessee        11.0       4173        1.7   3821    70
Texas            12.2      12237        2.2   4188    35
Utah              4.5       1203        0.6   4022   137
Vermont           5.5        472        0.6   3907   168
Virginia          9.5       4981        1.4   4701    85
Washington        4.3       3559        0.6   4864    32
West Virginia     6.7       1799        1.4   3617   100
Wisconsin         3.0       4589        0.7   4468   149
Wyoming           6.9        376        0.6   4566   173

In multiple regression analysis, it is a good idea to check the correlations among the variables first.

> cor(state)
               Murder Population Illiteracy     Income      Frost
Murder      1.0000000  0.3436428  0.7029752 -0.2300776 -0.5388834
Population  0.3436428  1.0000000  0.1076224  0.2082276 -0.3321525
Illiteracy  0.7029752  0.1076224  1.0000000 -0.4370752 -0.6719470
Income     -0.2300776  0.2082276 -0.4370752  1.0000000  0.2262822
Frost      -0.5388834 -0.3321525 -0.6719470  0.2262822  1.0000000

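The murder rate rises with population and illiteracy rate, and falls as income and the number of frost days increase.
The scatterplotMatrix() function in the car package produces a scatter plot matrix, which makes it easy to plot the pairwise relationships.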

> library(car)
Loading required package: carData
> scatterplotMatrix(state)

[Figure: scatter plot matrix of Murder, Population, Illiteracy, Income and Frost]

After checking the correlations, we can fit a multiple linear regression model with lm().

> fit<-lm(Murder~Population+Illiteracy+Income+Frost,data = state)
> fit

Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost, 
    data = state)

Coefficients:
(Intercept)   Population   Illiteracy       Income        Frost  
  1.235e+00    2.237e-04    4.143e+00    6.442e-05    5.813e-04  

> summary(fit)

Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost, 
    data = state)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7960 -1.6495 -0.0811  1.4815  7.6210 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.235e+00  3.866e+00   0.319   0.7510    
Population  2.237e-04  9.052e-05   2.471   0.0173 *  
Illiteracy  4.143e+00  8.744e-01   4.738 2.19e-05 ***
Income      6.442e-05  6.837e-04   0.094   0.9253    
Frost       5.813e-04  1.005e-02   0.058   0.9541    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared:  0.567, Adjusted R-squared:  0.5285 
F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08

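The output shows a highly significant linear relationship between the murder rate and the illiteracy rate (p < 0.001). Population is also significant at the 0.05 level, while income and frost days are not.
With more than one predictor, a regression coefficient is the expected change in the response variable when that predictor increases by one unit and the other predictors are held constant.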
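The "held constant" interpretation can be illustrated numerically: raise one predictor by one unit, leave the rest unchanged, and the prediction moves by exactly that predictor's coefficient. In the sketch below the baseline values (Population = 4000, Illiteracy = 1.0, Income = 4500, Frost = 100) are made up for illustration and do not correspond to any particular state.

```r
# Rebuild the data frame and model from the transcript above.
state <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = state)

# A hypothetical state, and the same state with Illiteracy one unit higher.
base   <- data.frame(Population = 4000, Illiteracy = 1.0, Income = 4500, Frost = 100)
bumped <- transform(base, Illiteracy = Illiteracy + 1)

# The change in predicted murder rate equals the Illiteracy coefficient.
change <- unname(predict(fit, bumped) - predict(fit, base))
illit_coef <- unname(coef(fit)["Illiteracy"])
print(c(change = change, coefficient = illit_coef))
```

The two printed numbers agree because the model is linear and contains no interaction terms.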

Multiple linear regression with an interaction term

The example data come from the mtcars dataset; we use car weight (wt) and horsepower (hp) to predict miles per gallon (mpg).

> class(mtcars)
[1] "data.frame"
> fit<-lm(mpg~wt+hp+wt:hp,data = mtcars)
> fit

Call:
lm(formula = mpg ~ wt + hp + wt:hp, data = mtcars)

Coefficients:
(Intercept)           wt           hp        wt:hp  
   49.80842     -8.21662     -0.12010      0.02785  

> summary(fit)

Call:
lm(formula = mpg ~ wt + hp + wt:hp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0632 -1.6491 -0.7362  1.4211  4.5513 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
wt          -8.21662    1.26971  -6.471 5.20e-07 ***
hp          -0.12010    0.02470  -4.863 4.04e-05 ***
wt:hp        0.02785    0.00742   3.753 0.000811 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.153 on 28 degrees of freedom
Multiple R-squared:  0.8848,    Adjusted R-squared:  0.8724 
F-statistic: 71.66 on 3 and 28 DF,  p-value: 2.981e-13

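The output shows that the interaction between horsepower and weight is significant (p = 0.000811).
This means the relationship between mpg and one predictor depends on the level of the other; for example, the effect of weight on mpg changes with horsepower.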
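To make the interaction concrete, the sketch below computes the slope of mpg on wt at a few horsepower levels, using slope = b_wt + b_wt:hp * hp. The levels 100, 150 and 200 are illustrative choices, not values from the original post. It also checks that the shorthand wt * hp fits the same model.

```r
# Refit the interaction model from the transcript above.
fit <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)
b <- coef(fit)

# Effect of one extra unit of wt on mpg at different horsepower levels.
hp_levels <- c(low = 100, mid = 150, high = 200)  # illustrative values
wt_slope <- unname(b["wt"]) + unname(b["wt:hp"]) * hp_levels
print(wt_slope)

# wt * hp is shorthand for wt + hp + wt:hp.
fit_star <- lm(mpg ~ wt * hp, data = mtcars)
same_model <- isTRUE(all.equal(unname(coef(fit)), unname(coef(fit_star))))
print(same_model)
```

Because the interaction coefficient is positive, the negative effect of weight on mpg shrinks in magnitude as horsepower goes up.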