在客觀世界中普遍存在著變量之間的關(guān)系崎弃,一般說來可以分為確定性的因果關(guān)系以及非確定性的相關(guān)關(guān)系州邢≡颍回歸分析是研究相關(guān)關(guān)系的一種數(shù)學(xué)工具,它能幫助我們從一個(gè)變量取得的值去估計(jì)另一個(gè)變量所取得的值惧浴。
回歸的典故
(達(dá)爾文的表兄)高爾頓最著名的發(fā)現(xiàn)之一是他發(fā)現(xiàn)了父親的身高和兒子的身高之間存在著某種給定的關(guān)系,他通過進(jìn)一步的研究發(fā)現(xiàn)了:事實(shí)上子輩的平均身高是其父輩平均身高以及他們所處族群平均身高的加權(quán)平均和锡溯。
他把這種趨勢平均化的現(xiàn)象寫到了自己1886年的論文中赶舆。論文的全名叫:Regression towards Mediocrity in Hereditary Stature. 這篇論文當(dāng)年被發(fā)在了大不列顛以及愛爾蘭人類研究學(xué)院期刊上。我們現(xiàn)今把論文中的這種回歸現(xiàn)象稱為:均值回歸或者平庸回歸(reversion to the mean/reversion to mediocrity)祭饭。背后的意義是說:哪怕單看一組父親和孩子的身高芜茵,兩個(gè)人的身高可能差異很大,但是從整個(gè)人群上來看倡蝙,父親和孩子的身高分布應(yīng)該是很相近的九串。
什么叫線性模型
在統(tǒng)計(jì)學(xué)中,線性回歸(Linear regression)是利用稱為線性回歸方程的最小二乘函數(shù)對(duì)一個(gè)或多個(gè)自變量和因變量之間關(guān)系進(jìn)行建模的一種回歸分析寺鸥。這種函數(shù)是一個(gè)或多個(gè)稱為回歸系數(shù)的模型參數(shù)的線性組合猪钮。只有一個(gè)自變量的情況稱為簡單回歸,大于一個(gè)自變量情況的叫做多元回歸胆建。
在線性回歸中烤低,數(shù)據(jù)使用線性預(yù)測函數(shù)來建模,并且未知的模型參數(shù)也是通過數(shù)據(jù)來估計(jì)笆载。這些模型被叫做線性模型扑馁。最常用的線性回歸建模是給定X值的y的條件均值是X的仿射函數(shù)涯呻。不太一般的情況,線性回歸模型可以是一個(gè)中位數(shù)或一些其他的給定X的條件下y的條件分布的分位數(shù)作為X的線性函數(shù)表示腻要。像所有形式的回歸分析一樣复罐,線性回歸也把焦點(diǎn)放在給定X值的y的條件概率分布,而不是X和y的聯(lián)合概率分布(多元分析領(lǐng)域)雄家。
給一個(gè)隨機(jī)樣本效诅,一個(gè)線性回歸模型假設(shè)回歸子
和回歸量
之間的關(guān)系是除了
的影響以外,還有其他的變量存在趟济。我們加入一個(gè)誤差項(xiàng)
(也是一個(gè)隨機(jī)變量)來捕獲除了
之外任何對(duì)
的影響乱投。所以一個(gè)多變量線性回歸模型表示為以下的形式:
其他的模型可能被認(rèn)定成非線性模型。一個(gè)線性回歸模型不需要是自變量的線性函數(shù)咙好。線性在這里表示的條件均值在參數(shù)
里是線性的篡腌。例如:模型
在
和
里是線性的,但在
里是非線性的勾效,它是
的非線性函數(shù)嘹悼。
t檢驗(yàn)與一元線性回歸的等價(jià)性
還拿我們總磷(TP)濃度來比較兩個(gè)總體的t檢驗(yàn)。
source("D:\\Users\\Administrator\\Desktop\\RStudio\\Environmental_and_Ecological_Statistics_with_R\\DataCode\\Code\\FrontMatter.R")
### linear models ####
##################################
## Chapter 5 linear regression ##
##################################
plotDIRch5 <- paste(plotDIR, "chapter5", "figures", sep="/")
##t.test and linear models
TP.reference<-read.csv("..//Data//WCA2TP.csv",header = T)
TP.reference$Month<-sapply(as.character(TP.reference$Date),function(x){strsplit(x,"/")[[1]][1]})
subI<-(TP.reference$SITE=='E4'|TP.reference$SITE=='F4')&TP.reference$Year==1994
subR<-TP.reference$Year==1994&TP.reference$SITE!='E5'&TP.reference$SITE!='F5'
x<-log(TP.reference$RESULT[subI])
y<-log(TP.reference$RESULT[subR])
t.test(x,y,alternative = "greater",var.equal =T)
Two Sample t-test
data: x and y
t = -2.599, df = 157, p-value = 0.9949
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-1.102025 Inf
sample estimates:
mean of x mean of y
-4.016832 -3.343483
同樣地层宫,我們可以用公式界面來表示t檢驗(yàn):
two.sample<-data.frame(TP=c(TP.reference$RESULT[subI],TP.reference$RESULT[subR]),
x=c(rep("I",sum(subI)),
rep("R",sum(subR))
))
t.test(log(TP)~x,data = two.sample,,var.equal =T)
Two Sample t-test
data: log(TP) by x
t = -2.599, df = 157, p-value = 0.01024
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.1850781 -0.1616198
sample estimates:
mean in group I mean in group R
-4.016832 -3.343483
公式在R中被約定為一個(gè)模型杨伙,符號(hào)
將響應(yīng)變量與預(yù)測變量(解釋變量)分開。在t檢驗(yàn)中萌腿,函數(shù)
知道預(yù)測量是用來將數(shù)據(jù)分為兩組的限匣。但是,我們也會(huì)把公式的應(yīng)用解釋為把估計(jì)出的均值看做某種預(yù)測毁菱。例如我們可以利用估計(jì)出的均值預(yù)測參考站點(diǎn)未來的一個(gè)樣本米死。因此雙樣本t檢驗(yàn)問題可以看作是一個(gè)一元線性回歸模型,其形式為:
模型系數(shù)就是估計(jì)出的參照站點(diǎn)TP濃度的對(duì)數(shù)均值(-3.343483 )贮庞,也就是說峦筒,
。假設(shè)
(參考站點(diǎn))就等價(jià)于
窗慎。當(dāng)
(受影響的站點(diǎn))物喷,模型等價(jià)于
。因此遮斥,
是受影響站點(diǎn)與參考站點(diǎn)對(duì)數(shù)均值的差值(-3.343483-(-4.0168)=0.673317)峦失。
two.sample<-data.frame(y=c(TP.reference$RESULT[subI],TP.reference$RESULT[subR]),
x=c(rep(0,sum(subI)),
rep(1,sum(subR))
))
t2lm<-lm(log(y)~x,data = two.sample)
summary(t2lm)
Call:
lm(formula = log(y) ~ x, data = two.sample)
Residuals:
Min 1Q Median 3Q Max
-2.1780 -0.8562 0.0466 0.7262 3.5690
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.0168 0.2414 -16.642 <2e-16 ***
x 0.6733 0.2591 2.599 0.0102 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.106 on 157 degrees of freedom
Multiple R-squared: 0.04125, Adjusted R-squared: 0.03514
F-statistic: 6.755 on 1 and 157 DF, p-value: 0.01024
正如我們預(yù)料的那樣截距(Intercept,)為 -4.0168术吗,
為0.6733尉辑。雙樣本t檢驗(yàn)問題只是一個(gè)特例,當(dāng)預(yù)測變量為連續(xù)變量時(shí)较屿,我們就看到了熟悉的線性回歸問題隧魄。
作為線性模型的ANOVA
ANOVA模型中的計(jì)算工作包括各組均值的計(jì)算实幕。假設(shè)我們對(duì)Everglandes濕地中的TP濃度的時(shí)間變化感興趣,想看看采樣6年中每年的濃度均值是否存在差異堤器。為方便起見,我們假定每年的濃度是相互獨(dú)立的末贾。于是我們可以這樣來構(gòu)建模型:
anvoa.data<-data.frame(y=TP.reference$RESULT,
x2=ifelse(TP.reference$Year==1995,1,0),
x3=ifelse(TP.reference$Year==1996,1,0),
x4=ifelse(TP.reference$Year==1997,1,0),
x5=ifelse(TP.reference$Year==1998,1,0),
x6=ifelse(TP.reference$Year==1999,1,0))
anova.lm<-lm(log(y)~x2+x3+x4+x5+x6,data=anvoa.data)
summary(anova.lm)
Call:
lm(formula = log(y) ~ x2 + x3 + x4 + x5 + x6, data = anvoa.data)
Residuals:
Min 1Q Median 3Q Max
-2.0171 -0.9545 -0.0989 0.8610 4.5633
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.50436 0.08894 -39.399 < 2e-16 ***
x2 -0.36943 0.11170 -3.307 0.000968 ***
x3 -0.16319 0.11478 -1.422 0.155345
x4 -0.17488 0.11896 -1.470 0.141811
x5 -0.27044 0.11323 -2.389 0.017062 *
x6 -0.15415 0.12500 -1.233 0.217751
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.114 on 1272 degrees of freedom
Multiple R-squared: 0.01004, Adjusted R-squared: 0.00615
F-statistic: 2.58 on 5 and 1272 DF, p-value: 0.0248
標(biāo)記為Intercept的系數(shù)就是的估計(jì)值闸溃。換個(gè)做法我們可以將變量Year換成一個(gè)因子變量,R將自動(dòng)創(chuàng)建這些預(yù)測變量拱撵。
anova.lm<-lm(log(RESULT)~factor(Year),data=TP.reference)
summary(anova.lm)
Call:
lm(formula = log(RESULT) ~ factor(Year), data = TP.reference)
Residuals:
Min 1Q Median 3Q Max
-2.0171 -0.9545 -0.0989 0.8610 4.5633
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.50436 0.08894 -39.399 < 2e-16 ***
factor(Year)1995 -0.36943 0.11170 -3.307 0.000968 ***
factor(Year)1996 -0.16319 0.11478 -1.422 0.155345
factor(Year)1997 -0.17488 0.11896 -1.470 0.141811
factor(Year)1998 -0.27044 0.11323 -2.389 0.017062 *
factor(Year)1999 -0.15415 0.12500 -1.233 0.217751
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.114 on 1272 degrees of freedom
Multiple R-squared: 0.01004, Adjusted R-squared: 0.00615
F-statistic: 2.58 on 5 and 1272 DF, p-value: 0.0248
要查看檢驗(yàn)的結(jié)果辉川,可以:
anova(anova.lm)
Analysis of Variance Table
Response: log(RESULT)
Df Sum Sq Mean Sq F value Pr(>F)
factor(Year) 5 16.03 3.2051 2.5805 0.0248 *
Residuals 1272 1579.88 1.2420
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
以上R輸出只給出了1994年的均值,即因子預(yù)測量的第一個(gè)等級(jí)拴测。要強(qiáng)制輸出各個(gè)年份的均值乓旗,我們可以通過在線性模型公式右側(cè)加一個(gè)“-1”來不讓R去擬合intercept。
> anova.lm<-lm(log(RESULT)~factor(Year)-1,data=TP.reference)
> summary(anova.lm)
Call:
lm(formula = log(RESULT) ~ factor(Year) - 1, data = TP.reference)
Residuals:
Min 1Q Median 3Q Max
-2.0171 -0.9545 -0.0989 0.8610 4.5633
Coefficients:
Estimate Std. Error t value Pr(>|t|)
factor(Year)1994 -3.50436 0.08894 -39.40 <2e-16 ***
factor(Year)1995 -3.87379 0.06757 -57.33 <2e-16 ***
factor(Year)1996 -3.66754 0.07255 -50.55 <2e-16 ***
factor(Year)1997 -3.67923 0.07900 -46.57 <2e-16 ***
factor(Year)1998 -3.77480 0.07007 -53.88 <2e-16 ***
factor(Year)1999 -3.65850 0.08783 -41.65 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.114 on 1272 degrees of freedom
Multiple R-squared: 0.9178, Adjusted R-squared: 0.9174
F-statistic: 2367 on 6 and 1272 DF, p-value: < 2.2e-16
> anova(anova.lm)
Analysis of Variance Table
Response: log(RESULT)
Df Sum Sq Mean Sq F value Pr(>F)
factor(Year) 6 17637.9 2939.65 2366.8 < 2.2e-16 ***
Residuals 1272 1579.9 1.24
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
可見集索,公式把第一個(gè)變量作為一個(gè)基準(zhǔn)來對(duì)待了屿愚。
簡單和多元線性回歸模型
模型已定,其實(shí)主要的工作就是估計(jì)模型系數(shù)务荆,最小平方法和最大似然法是兩種最常用的方法妆距。我們通過密歇根胡中魚體內(nèi)多氯聯(lián)苯(Polychlorinated biphenyls; Polychlorodiphenyls別名:氯化聯(lián)苯,PCB)濃度趨勢的例子來闡述函匕。分析魚體內(nèi)PCB的目的在于:A評(píng)估魚體內(nèi)的PCB濃度隨時(shí)間的變化趨勢娱据,以確定有意義的濃度降低是否還在發(fā)生;B為魚類消費(fèi)顧問提供依據(jù)以警示公眾食用被污染的魚可能存在的風(fēng)險(xiǎn)盅惜。
為了實(shí)現(xiàn)這目的中剩,采用了線性(對(duì)數(shù)線性)模型。不同年份間鮭魚體內(nèi)的PCB濃度逐年下降抒寂。
## PCB in fish
##### Lake trout ####
laketrout <- read.csv(paste(dataDIR,"laketrout2.csv", sep="/"),
header=T)
head(laketrout)
## dividing by 2.54 converts 60 cm to inches
## single predictor
laketrout$length <- laketrout$length*2.54
laketrout$size<- "small"
laketrout$size[laketrout$length>60] <- "large"
laketrout$large<- 0
laketrout$large[laketrout$length>60] <- 1
laketrout$len.c <- laketrout$length-mean(laketrout$length, na.rm=T)
laketrout$inches<-laketrout$length/2.54
laketrout <- laketrout[laketrout$pcb>exp(-2)&laketrout$length>0,]
head(laketrout)
length pcb n lgpcb year y size large len.c inches
1 75.946 31.3 1 3.443618 1974 0 large 1 13.432539 29.9
2 74.930 7.9 1 2.066863 1974 0 large 1 12.416539 29.5
3 68.580 26.7 1 3.284664 1974 0 large 1 6.066539 27.0
4 66.802 8.3 1 2.116256 1974 0 large 1 4.288539 26.3
5 69.596 11.3 1 2.424803 1974 0 large 1 7.082539 27.4
6 69.088 12.6 1 2.533697 1974 0 large 1 6.574539 27.2
## 60 cm is a threshold,
##
(pcb.loess <- loess.smooth (y=log(laketrout$pcb), x=laketrout$year, degree=1, span=0.75))
par(mar=c(3,3,1,1), mgp=c(1.75,0.125,0), tck=0.01,las=1)
plot(pcb ~ year, data=laketrout, log="y", ylab="PCB Concentrations (mg/kg)",
xlab="Year")
lines(exp(pcb.loess$y)~pcb.loess$x, lwd=2, col="gray")
dev.off()
因?yàn)槌叽绱笠恍┑聂~往往年齡也比較大结啼,所以PCB濃度與魚的尺寸的關(guān)系如下:
(pcb.loess <- loess.smooth(y=log(laketrout$pcb), x=laketrout$length, degree=1, span=0.75))
par(mar=c(3,3,1,1), mgp=c(1.75,0.125,0), tck=0.01,las=1)
plot(pcb ~ length, data=laketrout, log="y", ylab="PCB Concentrations (mg/kg)",
xlab="Fish Length (cm)")
lines(exp(pcb.loess$y)~pcb.loess$x, lwd=2, col="gray")
dev.off()
用一個(gè)預(yù)測變量來回歸
用于評(píng)價(jià)時(shí)間變化趨勢的簡單線性回歸模型是對(duì)數(shù)線性模型:
可以用R的函數(shù)lm()來實(shí)現(xiàn):
> ## simple linear regression
> lake.lm1 <- lm(log(pcb) ~ year, data=laketrout)
> display(lake.lm1, 4)
lm(formula = log(pcb) ~ year, data = laketrout)
coef.est coef.se
(Intercept) 119.8467 10.9689
year -0.0599 0.0055
---
n = 631, k = 2
residual sd = 0.8784, R-Squared = 0.16
> summary(lake.lm1)
Call:
lm(formula = log(pcb) ~ year, data = laketrout)
Residuals:
Min 1Q Median 3Q Max
-2.60801 -0.65501 -0.01038 0.57926 2.69862
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 119.846680 10.968944 10.93 <2e-16 ***
year -0.059903 0.005525 -10.84 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8784 on 629 degrees of freedom
(15 observations deleted due to missingness)
Multiple R-squared: 0.1575, Adjusted R-squared: 0.1561
F-statistic: 117.6 on 1 and 629 DF, p-value: < 2.2e-16
> (lm1.coef<-coef(lake.lm1))
(Intercept) year
119.84667967 -0.05990328
> anova(lake.lm1)
Analysis of Variance Table
Response: log(pcb)
Df Sum Sq Mean Sq F value Pr(>F)
year 1 90.71 90.712 117.57 < 2.2e-16 ***
Residuals 629 485.30 0.772
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
估計(jì)出的(截距,Intercept)是119.846680為預(yù)測變量為零時(shí)響應(yīng)變量的期望值蓬推,
(斜率)是-0.059903 為每年單位變化對(duì)應(yīng)的PCB對(duì)數(shù)濃度變化妆棒。擬合后的模型分為兩部分,確定的部分是
沸伏,是任意給定年份的PCB對(duì)數(shù)期望或均值糕珊。隨機(jī)部分是
描述的是波動(dòng)或者不確定性,描述了個(gè)體的變異毅糟。讀者朋友可以嘗試plot(lake.lm1)來查看擬合結(jié)果殘差的響應(yīng)红选。
給出擬合線的置信區(qū)間值
par(mar=c(3,3,0.25,0.25), mgp=c(1.25,.125,0), las=1, tck=0.01)
plot(pcb ~ year, data=laketrout, log="y",
ylab="PCB (mg/kg)", xlab="Year", cex=0.5, col=grey(0.5))
curve(exp(lm1.coef[1] + lm1.coef[2]*(x) +0.87^2/2), add=T, lwd=2)
curve(qlnorm(0.025, lm1.coef[1] + lm1.coef[2]*(x), 0.87 ), add=T, lty=2)
curve(qlnorm(0.975, lm1.coef[1] + lm1.coef[2]*(x), 0.87 ), add=T, lty=2)
dev.off()
多元回歸
> lake.lm2 <- lm(log(pcb) ~ length+I(year-1974), data=laketrout)
> display(lake.lm2, 3)
lm(formula = log(pcb) ~ length + I(year - 1974), data = laketrout)
coef.est coef.se
(Intercept) -1.834 0.120
length 0.060 0.002
I(year - 1974) -0.086 0.004
---
n = 631, k = 3
residual sd = 0.555, R-Squared = 0.66
> summary(lake.lm2)
Call:
lm(formula = log(pcb) ~ length + I(year - 1974), data = laketrout)
Residuals:
Min 1Q Median 3Q Max
-2.57511 -0.33308 0.01208 0.34307 1.97533
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.834251 0.120335 -15.24 <2e-16 ***
length 0.059608 0.001934 30.83 <2e-16 ***
I(year - 1974) -0.086193 0.003590 -24.01 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5545 on 628 degrees of freedom
(15 observations deleted due to missingness)
Multiple R-squared: 0.6648, Adjusted R-squared: 0.6637
F-statistic: 622.7 on 2 and 628 DF, p-value: < 2.2e-16
> (lm2.coef<-coef(lake.lm2))
(Intercept) length I(year - 1974)
-1.83425076 0.05960843 -0.08619336
> anova(lake.lm2)
Analysis of Variance Table
Response: log(pcb)
Df Sum Sq Mean Sq F value Pr(>F)
length 1 205.70 205.703 668.98 < 2.2e-16 ***
I(year - 1974) 1 177.21 177.210 576.32 < 2.2e-16 ***
Residuals 628 193.10 0.307
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> plot(lake.lm2)
為了是魚的長度為零時(shí)公式依然是有生態(tài)學(xué)意義的,需要對(duì)長度做一下線性變化姆另。
(mn.length <- mean(laketrout$length, na.rm=T))
(laketrout$len.c <- laketrout$length-mean(laketrout$length, na.rm=T))
lake.lm3 <- lm(log(pcb) ~ len.c+I(year-1974), data=laketrout)
display(lake.lm3, 3)
(lm3.coef<-coef(lake.lm3))
summary(lake.lm3)
anova(lake.lm3)
plot(lake.lm3)
par(mar=c(3,3,0.25,0.25), mgp=c(1.25,.125,0), las=1, tck=0.01)
plot(pcb ~ year, data=laketrout,log="y",
ylab="PCB (mg/kg)", xlab="Year", cex=0.5, col=grey(0.5))
curve(exp( lm3.coef[1] + lm3.coef[2]*0 + lm3.coef[3]*(x-1974) +0.543^2/2),
add=T, lwd=2)
curve(exp( lm3.coef[1] + lm3.coef[2]*(-2.6)+lm3.coef[3]*(x-1974) +0.543^2/2),
add=T, lty=2, lwd=2)
curve(exp( lm3.coef[1] + lm3.coef[2]*(3.4) + lm3.coef[3]*(x-1974) +0.543^2/2),
add=T, lty="13", lwd=2)
legend(x=1990, y=40, legend=c("large fish","average fish","small fish"),
lty=c(3, 1, 2), lwd=rep(2, 3), cex=0.75, bty="n")
dev.off()
交互作用
當(dāng)我們用Year和len.c作為預(yù)測變量擬合多元回歸模型時(shí)喇肋,一個(gè)非常重要的假設(shè)就是年份的影響(年份的斜率)不受魚大小的影響坟乾,并且魚大小的影響在研究時(shí)段內(nèi)是相同的。也就是說這兩個(gè)預(yù)測變量是獨(dú)立的蝶防。這對(duì)于多元回歸模型是一個(gè)加和效應(yīng)的假設(shè)甚侣。我們知道魚的大小會(huì)隨著時(shí)間的變化而變化(因?yàn)轸~會(huì)長大的嘛),也就是魚的大小會(huì)隨著時(shí)間的變化而變化间学,兩者并不是獨(dú)立的變量殷费,這時(shí)就要用到兩者的相互作用來擬合了。在模型中加入第三個(gè)變量低葫,Year和len.c的乘積详羡。
lake.lm4 <- lm(log(pcb) ~ len.c*I(year-1974), data=laketrout)
> display(lake.lm4, 4)
lm(formula = log(pcb) ~ len.c * I(year - 1974), data = laketrout)
coef.est coef.se
(Intercept) 1.8907 0.0465
len.c 0.0510 0.0038
I(year - 1974) -0.0874 0.0036
len.c:I(year - 1974) 0.0008 0.0003
---
n = 631, k = 4
residual sd = 0.5520, R-Squared = 0.67
> (lm4.coef<-coef(lake.lm4))
(Intercept) len.c I(year - 1974) len.c:I(year - 1974)
1.8907181316 0.0510365746 -0.0873925026 0.0008475978
> anova(lake.lm4)
Analysis of Variance Table
Response: log(pcb)
Df Sum Sq Mean Sq F value Pr(>F)
len.c 1 205.703 205.703 675.0042 < 2e-16 ***
I(year - 1974) 1 177.210 177.210 581.5055 < 2e-16 ***
len.c:I(year - 1974) 1 2.027 2.027 6.6523 0.01013 *
Residuals 627 191.074 0.305
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> plot(lake.lm4)
加入交互作用項(xiàng)len.c:I(year - 1974) 后,模型可以描述為:
(全模型公式)
由于乘積項(xiàng)的加入嘿悬,模型不再是線性的实柠。經(jīng)居中調(diào)整后的長度(len.c)的斜率和年份(Year)的斜率不再是常數(shù)。盡管如此善涨,人們還是把有交互作用的回歸分析放在線性回歸中去講解窒盐。
殘差和模型評(píng)估
對(duì)模型的解釋是很容易的,但是就模型的形式提出問題(如钢拧,線性模型準(zhǔn)確嗎登钥?)殘差分析,即模型預(yù)測值與觀測值之間差異娶靡,是回答模型合適數(shù)據(jù)這個(gè)問題的最有效的方法牧牢。在對(duì)數(shù)據(jù)和可能影響魚體內(nèi)PCB含量的因素進(jìn)行初步檢查后擬合出了全模型公式。使用了這個(gè)對(duì)數(shù)對(duì)數(shù)線性模型就意味著我們假定PCB濃度隨時(shí)間降低是指數(shù)式的姿锭,并且認(rèn)為湖中魚尺寸增加一個(gè)單位時(shí)體內(nèi)PCB濃度增加值相同塔鳍。但是這些假設(shè)并不具備理論的支撐,如何在數(shù)據(jù)的基礎(chǔ)上對(duì)這些進(jìn)行假設(shè)檢驗(yàn)?zāi)厣氪耍恳卮疬@個(gè)問題就要明白我們建模型的目的:因果推斷和預(yù)測轮纫。
好的預(yù)測模型應(yīng)該是簡單而準(zhǔn)確的。用R函數(shù)lm()擬合好一個(gè)模型后焚鲜,所有必須的模型總結(jié)和診斷信息都在R的結(jié)果對(duì)象中掌唾。要對(duì)模型做總體評(píng)估,我們常常用和一個(gè)假設(shè)檢驗(yàn)(F檢驗(yàn))來比較擬合后的模型和一個(gè)沒有預(yù)測變量的模型(
)忿磅。要評(píng)估每個(gè)單獨(dú)的預(yù)測變量是否有存在的必要糯彬,就用t檢驗(yàn)來評(píng)估模型變量的斜率是否與零有差異。如以上我們見到的summary()函數(shù)可以完成這項(xiàng)工作葱她。
> summary(lake.lm4)
Call:
lm(formula = log(pcb) ~ len.c * I(year - 1974), data = laketrout)
Residuals:
Min 1Q Median 3Q Max
-2.47961 -0.34105 0.01972 0.33870 1.97111
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.8907181 0.0464646 40.692 <2e-16 ***
len.c 0.0510366 0.0038407 13.288 <2e-16 ***
I(year - 1974) -0.0873925 0.0036045 -24.246 <2e-16 ***
len.c:I(year - 1974) 0.0008476 0.0003286 2.579 0.0101 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.552 on 627 degrees of freedom
(15 observations deleted due to missingness)
Multiple R-squared: 0.6683, Adjusted R-squared: 0.6667
F-statistic: 421.1 on 3 and 627 DF, p-value: < 2.2e-16
其中撩扒,n為樣本數(shù),p為預(yù)測變量的個(gè)數(shù)吨些。因?yàn)樵谀P椭写曜唬黾佣鄠€(gè)變量炒辉,即使事實(shí)上無關(guān)的變量,也會(huì)小幅度條R平方的值泉手,當(dāng)時(shí)其是無意義黔寇,所以我們調(diào)整了下,降低R平方的值斩萌。用r square的時(shí)候啡氢,不斷添加變量能讓模型的效果提升,而這種提升可能是虛假的术裸。利用adjusted r square,能對(duì)添加的非顯著變量給出懲罰亭枷,也就是說隨意添加一個(gè)變量不一定能讓模型擬合度上升袭艺。那么被解釋的變化量
是否具有統(tǒng)計(jì)顯著性呢?就看下面一行的F-statistic或者 p-value叨粘。如果
是一個(gè)負(fù)值猾编,表明預(yù)測變量的解釋能力還不如零模型(比如隨機(jī)生成的正態(tài)分布)的解釋變量。
進(jìn)行模型比較時(shí)升敲,ANOVA是一種分離總方差的重要技術(shù)答倡。它可以用來比較具有少量預(yù)測變量的模型(簡化模型)和擁有大量預(yù)測變量的模型(全模型)。不同模型比較用于推斷一個(gè)預(yù)測變量是否應(yīng)該包含在模型中驴党。例如我們可以僅靠統(tǒng)計(jì)學(xué)來確定是否要把length加入第二個(gè)預(yù)測變量瘪撇。那就是比較模型:
和模型:
簡單的方差分析為:
> summary.aov(lake.lm1)
Df Sum Sq Mean Sq F value Pr(>F)
year 1 90.7 90.71 117.6 <2e-16 ***
Residuals 629 485.3 0.77
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
15 observations deleted due to missingness
> summary.aov(lake.lm2)
Df Sum Sq Mean Sq F value Pr(>F)
length 1 205.7 205.70 669.0 <2e-16 ***
I(year - 1974) 1 177.2 177.21 576.3 <2e-16 ***
Residuals 628 193.1 0.31
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
15 observations deleted due to missingness
在模型lake.lm1中,有485.3 的殘差平方和沒有被預(yù)測變量year所解釋港庄,模型lake.lm2加入第二個(gè)預(yù)測變量len.c后倔既,len.c解釋了其中的292,從而獲得很大的F值和很小的p值鹏氧,說明加入len.c可以很大程度改善模型渤涌。
查看每個(gè)預(yù)測值的解釋量(承擔(dān)的方差)
library(hier.part)
mypch1.lm<-lm(formula=log(pcb) ~ len.c+year+ large, data=laketrout,na.action=na.omit)
summary(mypch1.lm)
Call:
lm(formula = log(pcb) ~ len.c + year + large, data = laketrout,
na.action = na.omit)
Residuals:
Min 1Q Median 3Q Max
-2.5819 -0.3416 0.0097 0.3409 1.9737
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 172.100794 7.134927 24.121 <2e-16 ***
len.c 0.060671 0.003212 18.888 <2e-16 ***
year -0.086215 0.003593 -23.994 <2e-16 ***
large -0.032229 0.077795 -0.414 0.679
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5549 on 627 degrees of freedom
(15 observations deleted due to missingness)
Multiple R-squared: 0.6649, Adjusted R-squared: 0.6633
F-statistic: 414.6 on 3 and 627 DF, p-value: < 2.2e-16
env1<-laketrout[, c("len.c","year","large"), drop = F]
colnames(env1)
hier.p<-hier.part(log(laketrout$pcb),env1,fam = "gaussian", gof = "Rsqu")
hier.p$IJ
I J Total
len.c 0.2881845 0.06892997 0.3571145
year 0.2468852 -0.08940342 0.1574817
large 0.1297850 0.10114237 0.2309274
hier.p$IJ對(duì)應(yīng)的方差之和等于mypch1.lm的r.squared
> sum(hier.p$IJ[,1])
[1] 0.6648547
> summary(mypch1.lm)$r.squared
[1] 0.6648547
殘差的圖形分析
對(duì)于匯總統(tǒng)計(jì)(尤其是假設(shè)檢驗(yàn)的結(jié)果),殘差分析應(yīng)該是獨(dú)立的把还,并且近似服從均值為0实蓬,標(biāo)準(zhǔn)差為常數(shù)的正態(tài)分布:。殘差的圖形分布不可避免地應(yīng)該加入到三個(gè)階段:正態(tài)性吊履、獨(dú)立性安皱、常數(shù)標(biāo)準(zhǔn)差的評(píng)估中。在R中很容易實(shí)現(xiàn)艇炎,就是我們剛才提到的:plot(lake.lm1)练俐。
順時(shí)針來看。第一個(gè)圖是獨(dú)立性檢驗(yàn)的模型殘差和擬合值做的圖冕臭,暗示預(yù)測值均偏低腺晾;第二個(gè)是正態(tài)性檢驗(yàn)的殘差Q-Q圖燕锥;第三個(gè)是檢測模型可能存在潛在影響的杠桿點(diǎn)圖。數(shù)據(jù)點(diǎn)的Cook距離用來度量數(shù)據(jù)點(diǎn)的影響悯蝉。Cook距離大于1就認(rèn)為有影響的归形;第四個(gè)是檢驗(yàn)標(biāo)準(zhǔn)差為常數(shù)的S-L圖,即對(duì)殘差絕對(duì)值的平方根和擬合值做散點(diǎn)圖鼻由。
另外通過殘差擬合展形圖(residual - fitted - spread)我們可以看出m模型擬合值的覆蓋范圍與隨機(jī)誤差(殘差)的覆蓋范圍重合程度暇榴。測量的方差,rfs給出的是由模型所給出的相對(duì)展形蕉世。盡管
表明模型解釋了一定的總變化蔼紧,但該圖形表明擬合值與殘差的覆蓋范圍重疊情況。
rfs(lake.lm1, aspect=1)
非線性嘗試:
(lake.lm5 <- lm(log(pcb) ~ I(year-1974)*len.c + I(len.c^2),
+ data=laketrout))
Call:
lm(formula = log(pcb) ~ I(year - 1974) * len.c + I(len.c^2),
data = laketrout)
Coefficients:
(Intercept) I(year - 1974) len.c I(len.c^2) I(year - 1974):len.c
1.8133187 -0.0862623 0.0590159 0.0005200 0.0004097
> display(lake.lm5, 4)
lm(formula = log(pcb) ~ I(year - 1974) * len.c + I(len.c^2),
data = laketrout)
coef.est coef.se
(Intercept) 1.8133 0.0496
I(year - 1974) -0.0863 0.0036
len.c 0.0590 0.0043
I(len.c^2) 0.0005 0.0001
I(year - 1974):len.c 0.0004 0.0003
---
n = 631, k = 5
residual sd = 0.5452, R-Squared = 0.68
類型預(yù)測變量
很多情況下我們的預(yù)測變量并不是連續(xù)變量狠轻,而是離散的奸例,比如分類變量。在我們的案例中向楼,魚類的大小雖然是連續(xù)的但是在擬合的過程中我們發(fā)現(xiàn)魚的長度大小在60cm兩側(cè)趨勢是不同的查吊。于是我們構(gòu)造一個(gè)類型預(yù)測變量size,作為因子變量,size被轉(zhuǎn)化成取值為0或者1 的名義變量(Dummy)湖蜕。
(lake.lm6 <- lm(log(pcb) ~ I(year-1974)*factor(size) +
+ len.c * factor(size),
+ data=laketrout))
Call:
lm(formula = log(pcb) ~ I(year - 1974) * factor(size) + len.c *
factor(size), data = laketrout)
Coefficients:
(Intercept) I(year - 1974) factor(size)small len.c
1.7394281 -0.0846170 -0.0646686 0.0775689
I(year - 1974):factor(size)small factor(size)small:len.c
0.0001259 -0.0344983
> display(lake.lm6, 4)
lm(formula = log(pcb) ~ I(year - 1974) * factor(size) + len.c *
factor(size), data = laketrout)
coef.est coef.se
(Intercept) 1.7394 0.0667
I(year - 1974) -0.0846 0.0044
factor(size)small -0.0647 0.1197
len.c 0.0776 0.0044
I(year - 1974):factor(size)small 0.0001 0.0074
factor(size)small:len.c -0.0345 0.0063
---
n = 631, k = 6
residual sd = 0.5426, R-Squared = 0.68
簡化模型
> lake.lm7 <- lm(log(pcb) ~ I(year-1974) + len.c * factor(size),
+ data=laketrout)
> display(lake.lm7, 4)
lm(formula = log(pcb) ~ I(year - 1974) + len.c * factor(size),
data = laketrout)
coef.est coef.se
(Intercept) 1.7389 0.0588
I(year - 1974) -0.0846 0.0035
len.c 0.0776 0.0044
factor(size)small -0.0631 0.0779
len.c:factor(size)small -0.0345 0.0062
---
n = 631, k = 5
residual sd = 0.5422, R-Squared = 0.68
估計(jì)出的len.c的斜率和截距并未發(fā)生變化逻卖,要直接給出兩類尺寸類別下的len.c的截距和斜率:
> lake.lm8 <- lm(log(pcb) ~ I(year-1974) +
+ len.c * factor(size)-1-len.c,
+ data=laketrout)
> display(lake.lm8, 4)
lm(formula = log(pcb) ~ I(year - 1974) + len.c * factor(size) -
1 - len.c, data = laketrout)
coef.est coef.se
I(year - 1974) -0.0846 0.0035
factor(size)large 1.7389 0.0588
factor(size)small 1.6758 0.0795
len.c:factor(size)large 0.0776 0.0044
len.c:factor(size)small 0.0431 0.0045
---
n = 631, k = 5
residual sd = 0.5422, R-Squared = 0.83
共線性
為了量化胡泊的營養(yǎng)物質(zhì)輸入和湖內(nèi)浮游植物生長的關(guān)系,線性回歸模型是最常用的方法昭抒。由于磷和氮是藻類生長所需要的兩種營養(yǎng)物質(zhì)评也,他們也常常是藻類生長的限制因子。藻類的生長常用湖泊中葉綠素a的濃度來表示灭返。
這關(guān)系通常表示為雙對(duì)數(shù)線性回歸模型:
載入數(shù)據(jù)看看變量之間的關(guān)系仇参。
### The Finnish Lakes example ###
summer.All <- read.table(paste(dataDIR, "summerAll.csv", sep="/"), sep=",",header=T)
#names(summer.All)
# [1] "totp" "chla" "type" "lake" "year" "totn" "month" "depth" "surfa"
#[10] "color"
summer.All <- summer.All[log(summer.All$chla) > -20 ,]
#> names(summer.All)
# [1] "totp" "chla" "type" "lake" "year" "totn" "month" "depth" "surfa"
#[10] "color"
summer.All$y <- log(summer.All$chla)
summer.All$lxp<- scale(log(summer.All$totp), scale=F)
summer.All$lxn<- scale(log(summer.All$totn), scale=F)
summer.All$type.lake <- paste(summer.All$type, summer.All$lake)
summer.All$npr <- scale(log(summer.All$totn/summer.All$totp), scale=F)
### collinearity example ####
lake1 <- summer.All[summer.All$lake==19600 & summer.All$type==2,]
lake1$lxn <- scale(log(lake1$totn), scale=F)
lake1$lxp <- scale(log(lake1$totp), scale=F)
lake1$npr <- scale(lake1$npr, scale=F)
lake2 <- summer.All[summer.All$lake==1070,]
lake2$lxn <- scale(log(lake2$totn), scale=F)
lake2$lxp <- scale(log(lake2$totp), scale=F)
lake2$npr <- scale(lake2$npr, scale=F)
## lake 2 pairs
pairs(log(chla)~log(totn)+log(totp), data=lake2)
par( mgp=c(1.25,.125,0), las=1, tck=0.01)
pairs(y~lxn+lxp+npr, data=lake2, cex=0.5,
labels=c("log Chla","log TN","log TP","log N:P"))
> Finn.lm1 <- lm(y ~ lxp + lxn, data=lake1)
> display(Finn.lm1)
lm(formula = y ~ lxp + lxn, data = lake1)
coef.est coef.se
(Intercept) 2.35 0.02
lxp 1.01 0.06
lxn -0.09 0.13
---
n = 321, k = 3
residual sd = 0.37, R-Squared = 0.48
>
湖泊內(nèi)TP和TN并不是完全獨(dú)立的就要考慮他們的相互作用,換句話說要考慮兩者的共線性問題婆殿。條件圖是在一個(gè)變量固定的條件下考察另一個(gè)預(yù)測變量與響應(yīng)變量之間的關(guān)系诈乒。
par(mar=c(3,3,3,.25), mgp=c(1.25,0.125,0), tck=0.02)
given.tn <- co.intervals(lake1$lxn, number=4, overlap=.1)
coplot(y ~ lxp | lxn, data = lake1, given.v=given.tn, rows=1,
xlab=c("log TP", "Given: Centered log TN"), ylab="log Chla",
panel=panel.smooth, col="gray", col.smooth=1)
以TN為條件,對(duì)葉綠素和TP作圖婆芦,在TN取值范圍內(nèi)怕磨,繪制葉綠素對(duì)數(shù)和TP對(duì)數(shù)之間的關(guān)系圖。
給定TP條件:
par(mar=c(3,3,3,.25), mgp=c(1.25,0.125,0), tck=0.02)
given.tp <- co.intervals(lake1$lxp, number=4, overlap=.1)
coplot(y ~ lxn | lxp, data = lake1, given.v=given.tp, rows=1,
xlab=c("log TN", "Given: Centered log TP"), ylab="log Chla",
panel=panel.smooth, col="gray", col.smooth=1)
比較以上兩個(gè)條件圖可以看出消约,葉綠素a和TP的關(guān)系是相對(duì)穩(wěn)定的肠鲫,在不同TN條件下,有著相似的斜率或粮,即暗示這是一個(gè)磷限制的湖泊导饲。
> Finn.lm3 <- lm(y ~ lxp * lxn, data=lake1)
> display(Finn.lm3)
lm(formula = y ~ lxp * lxn, data = lake1)
coef.est coef.se
(Intercept) 2.34 0.02
lxp 1.01 0.06
lxn -0.09 0.13
lxp:lxn 0.28 0.31
---
n = 321, k = 4
residual sd = 0.37, R-Squared = 0.48
par(mfrow=c(1,2), mgp=c(1.25, 0.125,0), mar=c(3,3,1, 0.25), las=1, tck=0.02)
plot(y~lxp, data=lake1, axes=F, xlab="TP", ylab="Chla", cex=0.5)
axis(1, at=log(c(1, 2, 5, 10,20,50,100))-attr(lake1$lxp, which="scaled:center"),
labels=c(1, 2, 5, 10,20,50,100))
axis(2, at=log(c(1,2,5,10,20,50)),
labels=c(1,2,5,10,20,50))
box()
curve(coef.lm3[1]+coef.lm3[2]*x+coef.lm3[3]*quant.n[1] +
coef.lm3[4]*x*quant.n[1], add=T, lty=1)
curve(coef.lm3[1]+coef.lm3[2]*x+coef.lm3[3]*quant.n[2] +
coef.lm3[4]*x*quant.n[2], add=T, lty=2)
curve(coef.lm3[1]+coef.lm3[2]*x+coef.lm3[3]*quant.n[3] +
coef.lm3[4]*x*quant.n[3], add=T, lty=3)
curve(coef.lm3[1]+coef.lm3[2]*x+coef.lm3[3]*quant.n[4] +
coef.lm3[4]*x*quant.n[4], add=T, lty=4)
curve(coef.lm3[1]+coef.lm3[2]*x+coef.lm3[3]*quant.n[5] +
coef.lm3[4]*x*quant.n[5], add=T, lty=5)
legend(x=log(30)-attr(lake1$lxp, which="scaled:center"),
y=log(5),
legend=c("2.5\\%","25\\%","50\\%","75\\%","97.5\\%"),
cex=0.5, lty=1:5, bty="n")
plot(y~lxn, data=lake1, axes=F, xlab="TN", ylab="Chla", cex=0.5)
axis(1, at=log(c(500,1000))-attr(lake1$lxn, which="scaled:center"),
labels=c(500,1000))
axis(2, at=log(c(1,2,5,10,20,50)),
labels=c(1,2,5,10,20,50))
box()
curve(coef.lm3[1]+coef.lm3[3]*x+coef.lm3[2]*quant.p[1] +
coef.lm3[4]*x*quant.p[1], add=T, lty=1)
curve(coef.lm3[1]+coef.lm3[3]*x+coef.lm3[2]*quant.p[2] +
coef.lm3[4]*x*quant.p[2], add=T, lty=2)
curve(coef.lm3[1]+coef.lm3[3]*x+coef.lm3[2]*quant.p[3] +
coef.lm3[4]*x*quant.p[3], add=T, lty=3)
curve(coef.lm3[1]+coef.lm3[3]*x+coef.lm3[2]*quant.p[4] +
coef.lm3[4]*x*quant.p[4], add=T, lty=4)
curve(coef.lm3[1]+coef.lm3[3]*x+coef.lm3[2]*quant.p[5] +
coef.lm3[4]*x*quant.p[5], add=T, lty=5)
參考:
一元線性回歸的細(xì)節(jié)
機(jī)器學(xué)習(xí)之線性回歸(linear regression)
恢復(fù)更新】寫給大家看的機(jī)器學(xué)習(xí)書(第七篇)——線性回歸(上)
一元線性回歸分析
回歸分析中的“回歸”是什么意思?
機(jī)器學(xué)習(xí)Chapter-1(線性模型)
【統(tǒng)計(jì)學(xué)習(xí)3】線性回歸:R方(R-squared)及調(diào)整R方(Adjusted R-Square)
加權(quán)最小二乘法與局部加權(quán)線性回歸
Why You Need to Check Your Residual Plots for Regression Analysis: Or, To Err is Human, To Err Randomly is Statistically Divine