系統(tǒng)發(fā)育樹是研究物種進(jìn)化歷史必不可少的信息衰伯,我們可以利用它得到一些重要?dú)v史線索绳锅,如:
- 生物多樣性(物種形成或滅絕)窟她;
- 物種性狀進(jìn)化與多樣化(Character evolution and diversification)码泞;
- 生物地理學(xué)(研究動植物的地理分布);
-
測試復(fù)雜進(jìn)化模型刹泄;
Introduction to phylogenies in R
首先,安裝系統(tǒng)發(fā)育分析所需的軟件包
install CRAN Task View for phylogenetics to run, remove the comment character '#'
# install.packages("ctv")
# library("ctv")
# install.views("Phylogenetics")
# update.views("Phylogenetics")
# install.packages("ape")
# install.packages("geiger")
# install.packages("phytools")
# citation("geiger")
如何使用R進(jìn)行系統(tǒng)發(fā)育樹編輯
tree = "(((((((cow, pig),whale),(bat,(lemur,human))),(robin,iguana)),coelacanth),gold_fish),shark);" # 從字符串中讀取樹文件
vert.tree <- read.tree(text=tree)
plot(vert.tree) # 普通可視化
plot(vert.tree,type="cladogram")
plot(unroot(vert.tree),type="unrooted")
plot(vert.tree,type="fan")
其實(shí)怎爵,此處的樹文件就是一個字符串列表(列表還可以是數(shù)字型)特石。
> str(vert.tree)
List of 3
$ edge : int [1:20, 1:2] 12 13 14 15 16 17 18 18 17 16 ...
$ Nnode : int 10
$ tip.label: chr [1:11] "cow" "pig" "whale" "bat" ...
- attr(*, "class")= chr "phylo"
- attr(*, "order")= chr "cladewise"
接下來,主要是看一下這些對象是如何存儲在變量中的:
tree<-read.tree(text="(((A,B),(C,D)),E);")
plot(tree,type="cladogram",edge.width=2)
> tree$edge
[,1] [,2]
[1,] 6 7
[2,] 7 8
[3,] 8 1
[4,] 8 2
[5,] 7 9
[6,] 9 3
[7,] 9 4
[8,] 6 5
> tree$tip.label
[1] "A" "B" "C" "D" "E"
> tree$Nnode
[1] 4
plot(tree,edge.width=2,label.offset=0.1,type="cladogram")
nodelabels()
tiplabels()
我們可以看到鳖链,樹中所有分類單元之間的關(guān)系信息都包含在每條邊的起始節(jié)點(diǎn)和結(jié)束節(jié)點(diǎn)中姆蘸,共享共同起始節(jié)點(diǎn)編號的邊是直接共同祖先的后代。
我們還可以測試樹文件芙委,編輯以及去除一些末端枝:
## (I'm going to first set the seed for repeatability)
set.seed(1)
## simulate a birth-death tree using phytools
tree<-pbtree(b=1,d=0.2,n=40)
## stopping criterion is 40 extant species, in this case
plotTree(tree)
nodelabels()
## ok, now extract the clade descended from node #62
tt62<-extract.clade(tree,62)
plotTree(tt62)
## now drop 10 tips from the tree (I'm going to pick them at random)
dtips<-sample(tree$tip.label,10)
dt<-drop.tip(tree,dtips)
plotTree(dt)
## we could also, say, drop all tips that go extinct before the present
## this is a fun way, but not the only way to do this:
plotTree(et<-drop.tip(tree,getExtinct(tree)),cex=0.7)
## rotating nodes
plotTree(tree,node.numbers=T)
## first, rotate about node #50
rt.50<-rotate(tree,50)
plotTree(rt.50)
## now rotate all nodes
rt.all<-rotateNodes(tree,"all")
plotTree(rt.all)
## ok, now let's re-root the tree at node #67
rr.67<-root(tree,node=67)
plotTree(rr.67)
## this creates a trifurcation at the root
## we could instead re-root at along an edge
rr.62<-reroot(tree,62,position=0.5*tree$edge.length[which(tree$edge[,2]==62)])
plotTree(rr.62)
布朗模型(Brownian Motion)
布朗模型:
(1). 連續(xù)型性狀進(jìn)化的一個模型逞敷;
(2). 性狀隨著時間不斷變化;
(3). 經(jīng)過一段時間后灌侣,預(yù)期的狀態(tài)將服從正態(tài)分布推捐;
1. 什么是布朗運(yùn)動?
(1). 有時稱為維納過程侧啼;
(2). 連續(xù)時間隨機(jī)過程牛柒;
(3). 描述連續(xù)型性狀的“隨機(jī)演化”;
2. 布朗運(yùn)動在何時才會發(fā)生痊乾?
(1). BM可以用來描述由大量獨(dú)立的弱力組合而產(chǎn)生的運(yùn)動皮壁;
(2). 加入許多小的自變量后,無論原始分布如何(中心極限定理)哪审,都會產(chǎn)生正態(tài)分布蛾魄;
演化過程近似布朗運(yùn)動
Ⅰ. 遺傳漂變
Ⅱ. 隨機(jī)改變
Ⅲ. 相對于考慮的時間間隔弱的自然選擇
Ⅳ. 隨時間隨機(jī)變化的自然選擇
3. 模擬布朗運(yùn)動
- 模擬布朗運(yùn)動需要從正態(tài)分布中提取值
- 分布的方差取決于σ2和t
- 沿著相鄰分支的值從根添加到末端
系統(tǒng)發(fā)育獨(dú)立差(Phylogenetic Independent Contrasts)
- PIC主要用來分析來自系統(tǒng)發(fā)育樹的數(shù)據(jù)
- 檢測性狀之間進(jìn)化相關(guān)性
- 我們可以將PIC看作是一種創(chuàng)建獨(dú)立數(shù)據(jù)點(diǎn)的統(tǒng)計轉(zhuǎn)換
系統(tǒng)發(fā)育獨(dú)立差(Phylogenetic Independent Contrast, PIC)是去除性狀分析中物種系統(tǒng)發(fā)育關(guān)系的一種方法,是美國進(jìn)化生物學(xué)家 J. Felsenstein于1985年提出的协饲。參考鏈接
當(dāng)時畏腕,有動物學(xué)家檢驗了若干哺乳動物腦容量和體重的相關(guān)性。然而茉稠,所用統(tǒng)計方法都假設(shè)各個種的性狀是相互獨(dú)立的描馅,采集自同一個正態(tài)分布的總體。由于不同種之間存在系統(tǒng)發(fā)育上的聯(lián)系而线,因此不能直接套用常規(guī)統(tǒng)計模型铭污,所以所得結(jié)果也就不可信了。在Felsenstein之前膀篮,一些學(xué)者已經(jīng)意識到這個問題嘹狞,但是都沒能給出理想的解決方案。
Felsenstain認(rèn)真思考了這個問題誓竿,并且認(rèn)為:研究多個不同性狀的相關(guān)性時磅网,必須考慮系統(tǒng)發(fā)育關(guān)系。他提出:假設(shè)物種性狀的進(jìn)化服從布朗運(yùn)動筷屡,那么用系統(tǒng)發(fā)育關(guān)系上最近的分類單元性狀的差值涧偷,再經(jīng)過枝長加權(quán)簸喂,就可以得到一組新的數(shù)據(jù)。這組新的數(shù)據(jù)去掉了進(jìn)化信息的干擾燎潮,因此可以用常規(guī)統(tǒng)計方法檢驗喻鳄。上述方法新獲得的數(shù)據(jù)就稱為系統(tǒng)發(fā)育獨(dú)立差 (The Phylogenetic Independent Contrast, PIC)。
舉例來說确封,如果要研究哺乳動物腦容量和體重的關(guān)系除呵,就要先獲得這些種之間的進(jìn)化關(guān)系,也就是經(jīng)過分子鐘校訂的進(jìn)化樹爪喘。然后再計算腦容量的PIC和體重的PIC颜曾,再用腦容量的PIC和體重的PIC建立線性模型,進(jìn)行相關(guān)性檢驗腥放。
現(xiàn)在讓我們來模擬一下布朗運(yùn)動在樹枝上的移動泛啸。堅持離散時間,我們首先需要一個離散時間的系統(tǒng)演化秃症,我們可以用pbtree在phytools中進(jìn)行模擬。
library(phytools)
t<-100 # total time
n<-30 # total taxa
b<-(log(n)-log(2))/t
tree<-pbtree(b=b,n=n,t=t,type="discrete")
## simulating with both taxa-stop (n) and time-stop (t) is
## performed via rejection sampling & may be slow
##
## 28 trees rejected before finding a tree
plotTree(tree)
現(xiàn)在吕粹,為了在樹的所有分支上進(jìn)行模擬种柑,我們只需要分別在每個分支上模擬,然后根據(jù)每個分支的父分支的結(jié)束狀態(tài)“移動”每個子分支匹耕。我們可以這樣做聚请,記住,因為布朗演化的結(jié)果在每個時間步驟是獨(dú)立于所有其他的時間步驟稳其。
## simulate evolution along each edge
X<-lapply(tree$edge.length,function(x) c(0,cumsum(rnorm(n=x,sd=sqrt(sig2)))))
## reorder the edges of the tree for pre-order traversal
cw<-reorder(tree)
## now simulate on the tree
ll<-tree$edge.length+1
for(i in 1:nrow(cw$edge)){
pp<-which(cw$edge[,2]==cw$edge[i,1])
if(length(pp)>0) X[[i]]<-X[[i]]+X[[pp]][ll[pp]]
else X[[i]]<-X[[i]]+X[[1]][1]
}
## get the starting and ending points of each edge for plotting
H<-nodeHeights(tree)
## plot the simulation
plot(H[1,1],X[[1]][1],ylim=range(X),xlim=range(H),xlab="time",ylab="phenotype")
for(i in 1:length(X)) lines(H[i,1]:H[i,2],X[[i]])
## add tip labels if desired
yy<-sapply(1:length(tree$tip.label),function(x,y) which(x==y),y=tree$edge[,2])
yy<-sapply(yy,function(x,y) y[[x]][length(y[[x]])],y=X)
text(x=max(H),y=yy,tree$tip.label)
## simulate Brownian evolution on a tree with fastBM
x<-fastBM(tree,sig2=sig2,internal=TRUE)
## visualize Brownian evolution on a tree
phenogram(tree,x,spread.labels=TRUE,spread.cost=c(1,0))
x<-sim.rates(tree,setNames(c(1,20),1:2),internal=TRUE)
phenogram(tree,x,colors=setNames(c("blue","red"),1:2),spread.labels=TRUE,spread.cost=c(1,0))
set.seed(100)
## transition matrix for the discrete trait simulation
Q<-matrix(c(-1,1,1,-1),2,2)
## simulated tree
tree<-pbtree(n=30,scale=1)
## simulate discrete character history
tree<-sim.history(tree,Q,anc="1")
## plot discrete character history on the tree
plotSimmap(tree,setNames(c("blue","red"),1:2),pts=F)
## simulate 5 taxon tree
tree<-pbtree(n=5,tip.label=LETTERS[1:5])
## 500 BM simulations
X<-fastBM(tree,nsim=500)
## plot tree
plot(rotateNodes(tree,"all"),edge.width=2,no.margin=TRUE)
## plot distribution across tips from 500 simulations
library(car)
scatterplotMatrix(t(X))
t <- 0:100 # time
sig2 <- 0.01
nsim <- 1000
## we'll simulate the steps from a uniform distribution
## with limits set to have the same variance (0.01) as before
X<-matrix(runif(n=nsim*(length(t)-1),min=-sqrt(3*sig2),max=sqrt(3*sig2)),nsim,length(t)-1)
X <- cbind(rep(0, nsim), t(apply(X, 1, cumsum)))
plot(t, X[1, ], xlab = "time", ylab = "phenotype", ylim = c(-2, 2), type = "l")
apply(X[2:nsim, ], 1, function(x, t) lines(t, x), t = t)
var(X[, length(t)])
hist(X[,length(t)])
plot(density(X[,length(t)]))
nsim=100
X <- matrix(rnorm(mean=0.02,n = nsim * (length(t) - 1), sd = sqrt(sig2/4)), nsim, length(t) - 1)
X <- cbind(rep(0, nsim), t(apply(X, 1, cumsum)))
plot(t, X[1, ], xlab = "time", ylab = "phenotype", ylim = c(-1, 3), type = "l")
apply(X[2:nsim, ], 1, function(x, t) lines(t, x), t = t)
在本教程中驶赏,我們將學(xué)習(xí)應(yīng)用Felsenstein的獨(dú)立對比方法來估計性狀之間的進(jìn)化相關(guān)性。Felsenstein發(fā)現(xiàn)一個以前已經(jīng)認(rèn)識到的問題既鞠,但這一問題沒有得到充分的認(rèn)識:即從統(tǒng)計分析的角度來看煤傍,物種數(shù)據(jù)不能被視為獨(dú)立的數(shù)據(jù)點(diǎn)。
為了學(xué)習(xí)對比方法嘱蛋,我們首先需要學(xué)習(xí)一些用r語言擬合線性模型的基礎(chǔ)知識蚯姆。
## linear regression model is y = b0 + b*x + e
b0<-5.3
b<-0.75
x<-rnorm(n=100,sd=1)
y<-b0+b*x+rnorm(n=100,sd=0.2)
plot(x,y)
fit<-lm(y~x)
abline(fit)
> fit
Call: lm(formula = y ~ x)
Coefficients:
(Intercept) x
5.3123 0.7412
> summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52591 -0.12877 -0.01271 0.12295 0.57436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.33205 0.02129 250.39 <2e-16 ***
## x 0.73766 0.02103 35.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2129 on 98 degrees of freedom
## Multiple R-squared: 0.9263, Adjusted R-squared: 0.9255
## F-statistic: 1231 on 1 and 98 DF, p-value: < 2.2e-16
anova(fit)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 55.804 55.804 1230.9 < 2.2e-16 ***
## Residuals 98 4.443 0.045
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## coefficients of the fitted model
fit$coefficients
## (Intercept) x
## 5.3320548 0.7376578
## p-value of the fitted model
anova(fit)[["Pr(>F)"]][1]
## [1] 2.758908e-57
foo<-function(){
x<-rnorm(n=100)
y<-rnorm(n=100)
fit<-lm(y~x)
setNames(c(fit$coefficients[2],anova(fit)[["Pr(>F)"]][1]),c("beta","p"))
}
X<-t(replicate(1000,foo()))
hist(X[,1],xlab="fitted beta",main="fitted beta")
hist(X[,2],xlab="p-value",main="p-value")
## expected value is the nominal type I error rate, 0.05
mean(X[,2]<=0.05)
## [1] 0.056
這個模擬非常簡單地表明,當(dāng)滿足標(biāo)準(zhǔn)統(tǒng)計假設(shè)時洒敏,我們的參數(shù)估計是無偏的龄恋,并且第一類誤差在標(biāo)稱水平或附近。
系統(tǒng)發(fā)育數(shù)據(jù)的困難在于凶伙,對于y的觀測不再是獨(dú)立的和相同的分布:
## load phytools
library(phytools)
## simulate a coalescent shaped tree
tree<-rcoal(n=100)
plotTree(tree,ftype="off")
## simulate uncorrelated Brownian evolution
x<-fastBM(tree)
y<-fastBM(tree)
par(mar=c(5.1,4.1,2.1,2.1))
plot(x,y)
fit<-lm(y~x)
abline(fit)
## this is a projection of the tree into morphospace
phylomorphospace(tree,cbind(x,y),label="off",node.size=c(0.5,0.7))
abline(fit)
從這個例子中我們可以看出郭毕,對于系統(tǒng)發(fā)育來說披粟,誘發(fā)I類型的錯誤并不難淹仑。這是因為緊密相關(guān)的分類群具有高度相似的表型卿操。換句話說排霉,它們并不是關(guān)于樹上x和y的演化過程的獨(dú)立數(shù)據(jù)點(diǎn).
PGLS系統(tǒng)發(fā)育廣義最小二乘法
library(ape)
library(geiger)
library(nlme)
library(phytools)
anoleData <- read.csv("anolisDataAppended.csv", row.names = 1)
anoleTree <- read.tree("anolis.phy")
plot(anoleTree)
name.check(anoleTree, anoleData)
# Extract columns
host <- anoleData[, "hostility"]
awe <- anoleData[, "awesomeness"]
# Give them names
names(host) <- names(awe) <- rownames(anoleData)
# Calculate PICs
hPic <- pic(host, anoleTree)
aPic <- pic(awe, anoleTree)
# Make a model
picModel <- lm(hPic ~ aPic - 1)
# Yes, significant
summary(picModel)
##
## Call:
## lm(formula = hPic ~ aPic - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.105 -0.419 0.010 0.314 4.999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## aPic -0.9776 0.0452 -21.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.897 on 98 degrees of freedom
## Multiple R-squared: 0.827, Adjusted R-squared: 0.825
## F-statistic: 469 on 1 and 98 DF, p-value: <2e-16
# plot results
plot(hPic ~ aPic)
abline(a = 0, b = coef(picModel))
# 我們可以直接使用PGLS
pglsModel <- gls(hostility ~ awesomeness, correlation = corBrownian(phy = anoleTree),
data = anoleData, method = "ML")
summary(pglsModel)
## Generalized least squares fit by maximum likelihood
## Model: hostility ~ awesomeness
## Data: anoleData
## AIC BIC logLik
## 191 198.8 -92.49
##
## Correlation Structure: corBrownian
## Formula: ~1
## Parameter estimate(s):
## numeric(0)
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 0.1506 0.26263 0.573 0.5678
## awesomeness -0.9776 0.04516 -21.648 0.0000
##
## Correlation:
## (Intr)
## awesomeness -0.042
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -0.76020 -0.39057 -0.04942 0.19597 1.07374
##
## Residual standard error: 0.8877
## Degrees of freedom: 100 total; 98 residual
coef(pglsModel)
## (Intercept) awesomeness
## 0.1506 -0.9776
plot(host ~ awe)
abline(a = coef(pglsModel)[1], b = coef(pglsModel)[2])
PGLS還可以使用多個預(yù)測變量:
pglsModel2 <- gls(hostility ~ ecomorph, correlation = corBrownian(phy = anoleTree),
data = anoleData, method = "ML")
anova(pglsModel2)
## Denom. DF: 93
## numDF F-value p-value
## (Intercept) 1 0.01847 0.8922
## ecomorph 6 0.21838 0.9700
coef(pglsModel2)
## (Intercept) ecomorphGB ecomorphT ecomorphTC ecomorphTG ecomorphTW
## 0.4844 -0.6316 -1.0585 -0.8558 -0.4086 -0.4039
pglsModel3 <- gls(hostility ~ ecomorph * awesomeness, correlation = corBrownian(phy = anoleTree),
data = anoleData, method = "ML")
anova(pglsModel3)
## Denom. DF: 86
## numDF F-value p-value
## (Intercept) 1 0.1 0.7280
## ecomorph 6 1.4 0.2090
## awesomeness 1 472.9 <.0001
## ecomorph:awesomeness 6 3.9 0.0017
##還可以固定不同模型
# Does not converge - and this is difficult to fix!
pglsModelLambda <- gls(hostility ~ awesomeness, correlation = corPagel(1, phy = anoleTree,
fixed = FALSE), data = anoleData, method = "ML")
# this is a problem with scale. We can do a quick fix by making the branch
# lengths longer. This will not affect the analysis other than rescaling a
# nuisance parameter
tempTree <- anoleTree
tempTree$edge.length <- tempTree$edge.length * 100
pglsModelLambda <- gls(hostility ~ awesomeness, correlation = corPagel(1, phy = tempTree,
fixed = FALSE), data = anoleData, method = "ML")
summary(pglsModelLambda)
## Generalized least squares fit by maximum likelihood
## Model: hostility ~ awesomeness
## Data: anoleData
## AIC BIC logLik
## 72.56 82.98 -32.28
##
## Correlation Structure: corPagel
## Formula: ~1
## Parameter estimate(s):
## lambda
## -0.1586
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 0.0612 0.01582 3.872 2e-04
## awesomeness -0.8777 0.03104 -28.273 0e+00
##
## Correlation:
## (Intr)
## awesomeness -1
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -1.789463 -0.714775 0.003095 0.785093 2.232151
##
## Residual standard error: 0.371
## Degrees of freedom: 100 total; 98 residual
pglsModelOU <- gls(hostility ~ awesomeness, correlation = corMartins(1, phy = tempTree),
data = anoleData, method = "ML")
summary(pglsModelOU)
## Generalized least squares fit by maximum likelihood
## Model: hostility ~ awesomeness
## Data: anoleData
## AIC BIC logLik
## 96.63 107.1 -44.32
##
## Correlation Structure: corMartins
## Formula: ~1
## Parameter estimate(s):
## alpha
## 4.442
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 0.1084 0.03953 2.743 0.0072
## awesomeness -0.8812 0.03658 -24.091 0.0000
##
## Correlation:
## (Intr)
## awesomeness -0.269
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -1.8665 -0.8133 -0.1104 0.6475 2.0919
##
## Residual standard error: 0.3769
## Degrees of freedom: 100 total; 98 residual
系統(tǒng)發(fā)生信號
- Blomberg’s K statistic
- K >1系統(tǒng)發(fā)生信號很強(qiáng)
- K <1系統(tǒng)發(fā)生信號很弱
- Pagel’s lambda
- lambda =1 進(jìn)化方向按照布朗模型
- lambda =0 進(jìn)化方向沒有相關(guān)性
library(geiger)
library(picante)
library(phytools)
anoleData <- read.csv("anolisDataAppended.csv", row.names = 1)
anoleTree <- read.tree("anolis.phy")
plot(anoleTree)
name.check(anoleTree, anoleData)
用體型大小來做兩個主要的系統(tǒng)發(fā)育信號測試。第一個檢驗是Blomberg‘s K煮落,它將PIC的方差與布朗運(yùn)動模型下的方差進(jìn)行比較敞峭。K=1表示親屬在BM下的相似程度,K<1表示在BM下存在較少的“系統(tǒng)發(fā)育信號”蝉仇,而K>1表示存在更多的“系統(tǒng)發(fā)育信號”旋讹。從系統(tǒng)信號返回的顯著p值告訴您存在顯著的系統(tǒng)發(fā)育信號。也就是說轿衔,近親比隨機(jī)配對的物種更相似沉迹。
anoleSize <- anoleData[, 1]
names(anoleSize) <- rownames(anoleData)
phylosignal(anoleSize, anoleTree)
## K PIC.variance.obs PIC.variance.rnd.mean PIC.variance.P
## 1 1.554 0.1389 0.773 0.001
## PIC.variance.Z
## 1 -3.913
phylosig(anoleTree, anoleSize, method = "K", test = T)
## $K
## [1] 1.554
##
## $P
## [1] 0.001
另一種檢測系統(tǒng)發(fā)育信號的方法是Pagel的lambda。Lambda是一種相對于內(nèi)部分支延伸尖端分支的樹變換害驹,使樹越來越像一個完整的多段切除法鞭呕。如果我們估計的λ=0,則推斷這些性狀沒有系統(tǒng)發(fā)育信號宛官。λ=1對應(yīng)于布朗運(yùn)動模型葫松,0<lambda<1介于兩者之間。
# First let's look at what lambda does
anoleTreeLambda0 <- rescale(anoleTree, model = "lambda", 0)
anoleTreeLambda5 <- rescale(anoleTree, model = "lambda", 0.5)
par(mfcol = c(1, 3))
plot(anoleTree)
plot(anoleTreeLambda5)
plot(anoleTreeLambda0)
phylosig(anoleTree, anoleSize, method = "lambda", test = T)
## $lambda
## [1] 1.017
##
## $logL
## [1] -3.81
##
## $logL0
## [1] -60.02
##
## $P
## [1] 2.893e-26
lambdaModel <- fitContinuous(anoleTree, anoleSize, model = "lambda")
brownianModel <- fitContinuous(anoleTree, anoleSize)
nosigModel <- fitContinuous(anoleTreeLambda0, anoleSize)
lambdaModel$opt$aicc
## [1] 15.65
brownianModel$opt$aicc
## [1] 13.52
nosigModel$opt$aicc
## [1] 124.2
# Conclusion: Brownian model is best, no signal model is terrible
brownianModel <- fitContinuous(anoleTree, anoleSize)
OUModel <- fitContinuous(anoleTree, anoleSize, model = "OU")
EBModel <- fitContinuous(anoleTree, anoleSize, model = "EB")
# inspect results
brownianModel
## GEIGER-fitted comparative model of continuous data
## fitted 'BM' model parameters:
## sigsq = 0.136160
## z0 = 4.065918
##
## model summary:
## log-likelihood = -4.700404
## AIC = 13.400807
## AICc = 13.524519
## free parameters = 2
##
## Convergence diagnostics:
## optimization iterations = 100
## failed iterations = 0
## frequency of best fit = 1.00
##
## object summary:
## 'lik' -- likelihood function
## 'bnd' -- bounds for likelihood search
## 'res' -- optimization iteration summary
## 'opt' -- maximum likelihood parameter estimates
OUModel
## GEIGER-fitted comparative model of continuous data
## fitted 'OU' model parameters:
## alpha = 0.000000
## sigsq = 0.136160
## z0 = 4.065918
##
## model summary:
## log-likelihood = -4.700404
## AIC = 15.400807
## AICc = 15.650807
## free parameters = 3
##
## Convergence diagnostics:
## optimization iterations = 100
## failed iterations = 0
## frequency of best fit = 0.76
##
## object summary:
## 'lik' -- likelihood function
## 'bnd' -- bounds for likelihood search
## 'res' -- optimization iteration summary
## 'opt' -- maximum likelihood parameter estimates
EBModel
## GEIGER-fitted comparative model of continuous data
## fitted 'EB' model parameters:
## a = -0.736271
## sigsq = 0.233528
## z0 = 4.066519
##
## model summary:
## log-likelihood = -4.285970
## AIC = 14.571939
## AICc = 14.821939
## free parameters = 3
##
## Convergence diagnostics:
## optimization iterations = 100
## failed iterations = 0
## frequency of best fit = 0.41
##
## object summary:
## 'lik' -- likelihood function
## 'bnd' -- bounds for likelihood search
## 'res' -- optimization iteration summary
## 'opt' -- maximum likelihood parameter estimates
# calculate AIC weights
bmAICC <- brownianModel$opt$aicc
ouAICC <- OUModel$opt$aicc
ebAICC <- EBModel$opt$aicc
aicc <- c(bmAICC, ouAICC, ebAICC)
aiccD <- aicc - min(aicc)
aw <- exp(-0.5 * aiccD)
aiccW <- aw/sum(aw)
aiccW
## [1] 0.5353 0.1849 0.2798
It is important to realize that measurement error can bias your inferences with fitting these models towards OU. Fortunately, we can easily account for that in fitContinuous.
# We measured 20 anoles per species, and the standard deviation within each
# species was, on average, 0.05
seSize <- 0.05/sqrt(20)
# redo with measurement error
brownianModel <- fitContinuous(anoleTree, anoleSize, SE = seSize)
OUModel <- fitContinuous(anoleTree, anoleSize, model = "OU", SE = seSize)
EBModel <- fitContinuous(anoleTree, anoleSize, model = "EB", SE = seSize)
# calculate AIC weights
bmAICC <- brownianModel$opt$aicc
ouAICC <- OUModel$opt$aicc
ebAICC <- EBModel$opt$aicc
aicc <- c(bmAICC, ouAICC, ebAICC)
aiccD <- aicc - min(aicc)
aw <- exp(-0.5 * aiccD)
aiccW <- aw/sum(aw)
aiccW
## [1] 0.5346 0.1846 0.2808
重構(gòu)祖先狀態(tài)
- 在祖先特征重建中底洗,目標(biāo)是估算表型性狀的祖先狀態(tài)-通常在內(nèi)部節(jié)點(diǎn)
- 理想情況下腋么,還應(yīng)該獲得與祖先狀態(tài)估計有關(guān)的不確定性的測量結(jié)果
- 祖先特征估計的第一步是確定我們對分析感興趣的數(shù)據(jù)類型。例如亥揖,我們可以連續(xù)測量數(shù)據(jù)或離散型數(shù)據(jù)珊擂。
-
不僅需要考慮數(shù)據(jù)類型,還要考慮適合數(shù)據(jù)的模型费变。不管是連續(xù)測量數(shù)據(jù)或離散數(shù)據(jù)摧扇,大多數(shù)都是遵循布朗運(yùn)動模型(隨著時間上下波動);或者是通過瞬間跳躍在狀態(tài)之間(Brownian motion is a continuous-time stochastic process)挚歧。
推斷祖先狀態(tài)一直是系統(tǒng)發(fā)育比較生物學(xué)中一個重要的目標(biāo)扛稽,本教程主要介紹如何在BM以及OU模型下重構(gòu)祖先性狀。第一個例子就是蜥蜴祖先體型大小的狀態(tài)重建昼激,這是一個連續(xù)型變量庇绽。
## load libraries
library(phytools)
## read tree from file
tree<-read.tree("anole.tre") ## or
tree<-read.tree("http://www.phytools.org/Ilhabela2015/data/anole.tre")
## plot tree
plotTree(tree,type="fan",ftype="i")
## setEnv=TRUE for this type is experimental. please be patient with bugs
## read data
svl<-read.csv("svl.csv",row.names=1) ## or
svl<-read.csv("http://www.phytools.org/Ilhabela2015/data/svl.csv",
row.names=1)
## change this into a vector
svl<-as.matrix(svl)[,1]
svl
## ahli alayoni alfaroi aliniger
## 4.039125 3.815705 3.526655 4.036557
## allisoni allogus altitudinalis alumina
## 4.375390 4.040138 3.842994 3.588941
## alutaceus angusticeps argenteolus argillaceus
## 3.554891 3.788595 3.971307 3.757869
## armouri bahorucoensis baleatus baracoae
## 4.121684 3.827445 5.053056 5.042780
## barahonae barbatus barbouri bartschi
## 5.076958 5.003946 3.663932 4.280547
## bremeri breslini brevirostris caudalis
## 4.113371 4.051111 3.874155 3.911743
## centralis chamaeleonides chlorocyanus christophei
## 3.697941 5.042349 4.275448 3.884652
## clivicola coelestinus confusus cooki
## 3.758726 4.297965 3.938442 4.091535
## cristatellus cupeyalensis cuvieri cyanopleurus
## 4.189820 3.462014 4.875012 3.630161
## cybotes darlingtoni distichus dolichocephalus
## 4.210982 4.302036 3.928796 3.908550
## equestris etheridgei eugenegrahami evermanni
## 5.113994 3.657991 4.128504 4.165605
## fowleri garmani grahami guafe
## 4.288780 4.769473 4.154274 3.877457
## guamuhaya guazuma gundlachi haetianus
## 5.036953 3.763884 4.188105 4.316542
## hendersoni homolechis imias inexpectatus
## 3.859835 4.032806 4.099687 3.537439
## insolitus isolepis jubar krugi
## 3.800471 3.657088 3.952605 3.886500
## lineatopus longitibialis loysiana lucius
## 4.128612 4.242103 3.701240 4.198915
## luteogularis macilentus marcanoi marron
## 5.101085 3.715765 4.079485 3.831810
## mestrei monticola noblei occultus
## 3.987147 3.770613 5.083473 3.663049
## olssoni opalinus ophiolepis oporinus
## 3.793899 3.838376 3.637962 3.845670
## paternus placidus poncensis porcatus
## 3.802961 3.773967 3.820378 4.258991
## porcus pulchellus pumilis quadriocellifer
## 5.038034 3.799022 3.466860 3.901619
## reconditus ricordii rubribarbus sagrei
## 4.482607 5.013963 4.078469 4.067162
## semilineatus sheplani shrevei singularis
## 3.696631 3.682924 3.983003 4.057997
## smallwoodi strahmi stratulus valencienni
## 5.035096 4.274271 3.869881 4.321524
## vanidicus vermiculatus websteri whitemani
## 3.626206 4.802849 3.916546 4.097479
現(xiàn)在我們可以估計祖先的狀態(tài),還將計算方差&每個節(jié)點(diǎn)的95%置信區(qū)間:
fit<-fastAnc(tree,svl,vars=TRUE,CI=TRUE)
fit
## $ace
## 101 102 103 104 105 106 107 108
## 4.065918 4.045601 4.047993 4.066923 4.036743 4.049514 4.054528 4.045218
## 109 110 111 112 113 114 115 116
## 3.979493 3.952440 3.926814 3.964414 3.995835 3.948034 3.955203 3.977842
## 117 118 119 120 121 122 123 124
## 4.213995 4.240867 4.248450 4.257574 4.239061 4.004120 4.007024 4.015249
## 125 126 127 128 129 130 131 132
## 4.006720 3.992617 3.925848 3.995759 4.049619 3.930793 3.908343 3.901518
## 133 134 135 136 137 138 139 140
## 3.885086 4.040356 4.060970 3.987789 3.951878 3.814014 3.707733 3.926147
## 141 142 143 144 145 146 147 148
## 3.988745 4.167983 4.157021 4.166146 4.146362 4.146132 4.159052 4.113267
## 149 150 151 152 153 154 155 156
## 4.133069 4.222223 4.461376 4.488496 4.437250 4.512071 4.953146 5.033253
## 157 158 159 160 161 162 163 164
## 4.962377 5.009025 5.018389 3.974900 3.880482 3.859268 3.859265 3.871895
## 165 166 167 168 169 170 171 172
## 3.899914 3.807943 3.826619 4.151598 3.760446 3.654324 3.676713 3.825559
## 173 174 175 176 177 178 179 180
## 3.747726 3.813974 3.803077 3.702349 3.566476 3.678236 3.675620 3.662452
## 181 182 183 184 185 186 187 188
## 3.563181 3.645426 4.017460 4.113689 4.229393 4.381609 4.243153 5.041281
## 189 190 191 192 193 194 195 196
## 5.051563 5.051719 5.057591 4.183653 4.151505 4.113279 3.969795 3.905122
## 197 198 199
## 4.191810 4.161419 4.092141
##
## $var
## 101 102 103 104 105 106
## 0.011775189 0.008409641 0.009452076 0.011870028 0.014459672 0.018112713
## 107 108 109 110 111 112
## 0.011685860 0.007268889 0.014487712 0.012814461 0.010852276 0.010048935
## 113 114 115 116 117 118
## 0.006240382 0.009816255 0.008377720 0.006250414 0.015406961 0.015306088
## 119 120 121 122 123 124
## 0.008914737 0.008485896 0.016998370 0.014850114 0.012880197 0.011964472
## 125 126 127 128 129 130
## 0.010525923 0.010528040 0.013248168 0.012789504 0.013568841 0.013971730
## 131 132 133 134 135 136
## 0.010048937 0.009535820 0.008387455 0.008359158 0.010126605 0.012608132
## 137 138 139 140 141 142
## 0.013951936 0.018502968 0.014105896 0.018182598 0.017745869 0.012861889
## 143 144 145 146 147 148
## 0.014828558 0.011891297 0.008732027 0.008604618 0.010009986 0.007660375
## 149 150 151 152 153 154
## 0.006817879 0.012378867 0.015540878 0.014154484 0.012761730 0.011952420
## 155 156 157 158 159 160
## 0.005064274 0.002463170 0.006924118 0.003953467 0.003509809 0.011102798
## 161 162 163 164 165 166
## 0.011378573 0.011045003 0.011682628 0.011560333 0.012242267 0.011462597
## 167 168 169 170 171 172
## 0.008888509 0.014206315 0.014405121 0.006290501 0.005434707 0.014714164
## 173 174 175 176 177 178
## 0.010884130 0.014842361 0.011333759 0.012932194 0.007449222 0.011094079
## 179 180 181 182 183 184
## 0.011015666 0.012850744 0.005326252 0.012876076 0.021615405 0.014608411
## 185 186 187 188 189 190
## 0.013155037 0.021043877 0.012658226 0.004540141 0.003540634 0.002698421
## 191 192 193 194 195 196
## 0.001277775 0.011837709 0.012449062 0.013536624 0.015848988 0.008788712
## 197 198 199
## 0.015789566 0.013529210 0.009534335
##
## $CI95
## [,1] [,2]
## 101 3.853231 4.278604
## 102 3.865861 4.225341
## 103 3.857439 4.238548
## 104 3.853382 4.280465
## 105 3.801056 4.272430
## 106 3.785731 4.313298
## 107 3.842649 4.266406
## 108 3.878112 4.212323
## 109 3.743577 4.215408
## 110 3.730566 4.174314
## 111 3.722632 4.130995
## 112 3.767935 4.160893
## 113 3.841003 4.150668
## 114 3.753844 4.142225
## 115 3.775804 4.134601
## 116 3.822885 4.132799
## 117 3.970710 4.457279
## 118 3.998380 4.483354
## 119 4.063391 4.433509
## 120 4.077021 4.438127
## 121 3.983520 4.494601
## 122 3.765272 4.242968
## 123 3.784582 4.229467
## 124 3.800860 4.229639
## 125 3.805632 4.207808
## 126 3.791509 4.193725
## 127 3.700250 4.151445
## 128 3.774102 4.217417
## 129 3.821307 4.277930
## 130 3.699117 4.162469
## 131 3.711864 4.104822
## 132 3.710121 4.092915
## 133 3.705584 4.064589
## 134 3.861156 4.219555
## 135 3.863733 4.258207
## 136 3.767708 4.207869
## 137 3.720366 4.183390
## 138 3.547403 4.080624
## 139 3.474947 3.940519
## 140 3.661855 4.190439
## 141 3.727646 4.249843
## 142 3.945699 4.390267
## 143 3.918347 4.395695
## 144 3.952414 4.379879
## 145 3.963210 4.329515
## 146 3.964320 4.327943
## 147 3.962955 4.355150
## 148 3.941721 4.284813
## 149 3.971230 4.294907
## 150 4.004152 4.440293
## 151 4.217036 4.705715
## 152 4.255309 4.721682
## 153 4.215833 4.658667
## 154 4.297789 4.726352
## 155 4.813665 5.092627
## 156 4.935977 5.130528
## 157 4.799283 5.125471
## 158 4.885787 5.132263
## 159 4.902272 5.134507
## 160 3.768375 4.181425
## 161 3.671408 4.089556
## 162 3.653282 4.065255
## 163 3.647416 4.071113
## 164 3.661158 4.082632
## 165 3.683050 4.116777
## 166 3.598098 4.017787
## 167 3.641833 4.011406
## 168 3.917985 4.385211
## 169 3.525204 3.995688
## 170 3.498871 3.809777
## 171 3.532221 3.821205
## 172 3.587807 4.063310
## 173 3.543245 3.952207
## 174 3.575189 4.052760
## 175 3.594415 4.011738
## 176 3.479458 3.925240
## 177 3.397311 3.735642
## 178 3.471792 3.884680
## 179 3.469907 3.881332
## 180 3.440264 3.884639
## 181 3.420138 3.706225
## 182 3.423020 3.867833
## 183 3.729297 4.305622
## 184 3.876793 4.350585
## 185 4.004590 4.454196
## 186 4.097282 4.665937
## 187 4.022636 4.463670
## 188 4.909215 5.173347
## 189 4.934937 5.168189
## 190 4.949904 5.153533
## 191 4.987529 5.127654
## 192 3.970403 4.396903
## 193 3.932817 4.370192
## 194 3.885239 4.341319
## 195 3.723046 4.216545
## 196 3.721376 4.088869
## 197 3.945523 4.438096
## 198 3.933441 4.389397
## 199 3.900759 4.283523
## as we discussed in class, 95% CIs can be broad
fit$CI[1,]
## [1] 3.853231 4.278604
range(svl)
## [1] 3.462014 5.113994
## projection of the reconstruction onto the edges of the tree
obj<-contMap(tree,svl,plot=FALSE)
plot(obj,type="fan",legend=0.7*max(nodeHeights(tree)))
## projection of the tree into phenotype space
phenogram(tree,svl,fsize=0.6,spread.costs=c(1,0))
## Optimizing the positions of the tip labels...
然后橙困,我們可以先根據(jù)布朗模型探討連續(xù)特征的祖先狀態(tài)重構(gòu)的一些性質(zhì):
## simulate a tree & some data
tree<-pbtree(n=26,scale=1,tip.label=LETTERS[26:1])
## simulate with ancestral states
x<-fastBM(tree,internal=TRUE)
## ancestral states
a<-x[as.character(1:tree$Nnode+Ntip(tree))]
## tip data
x<-x[tree$tip.label]
fit<-fastAnc(tree,x,CI=TRUE)
fit
## $ace
## 27 28 29 30 31 32
## -0.01383743 -0.43466511 -0.78496842 -0.92652372 -0.97992735 -1.06606765
## 33 34 35 36 37 38
## -1.19242145 -0.77435425 -1.33427168 -0.77148395 0.37519774 0.35486511
## 39 40 41 42 43 44
## 0.78011325 0.63488391 1.19427496 -0.20165383 -0.17792873 1.07419593
## 45 46 47 48 49 50
## 1.45924828 1.55175294 1.75289132 2.16445039 1.81192553 0.68319366
## 51
## 0.81241807
##
## $CI95
## [,1] [,2]
## 27 -0.8879801 0.8603052
## 28 -1.2581222 0.3887920
## 29 -1.4652988 -0.1046380
## 30 -1.5538351 -0.2992123
## 31 -1.4449545 -0.5149002
## 32 -1.4966267 -0.6355086
## 33 -1.4920621 -0.8927808
## 34 -1.1230122 -0.4256963
## 35 -1.4863921 -1.1821513
## 36 -1.2746445 -0.2683234
## 37 -0.3797317 1.1301272
## 38 -0.4031915 1.1129217
## 39 0.1541249 1.4061016
## 40 0.5162435 0.7535243
## 41 1.0781703 1.3103796
## 42 -0.7364614 0.3331537
## 43 -0.6011442 0.2452867
## 44 0.4157610 1.7326308
## 45 0.8942088 2.0242878
## 46 0.9951790 2.1083269
## 47 1.2382347 2.2675479
## 48 1.9117253 2.4171755
## 49 1.3260228 2.2978283
## 50 0.3463241 1.0200632
## 51 0.5937687 1.0310674
我們可以很容易地將這些估計與用于模擬數(shù)據(jù)的(正常unknown)生成值進(jìn)行比較:
plot(a,fit$ace,xlab="true states",ylab="estimated states")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red") ## 1:1 line
##
對祖先狀態(tài)估計最常見的批評之一是瞧掺,圍繞祖先狀態(tài)的不確定性可能很大;而且凡傅,這種不確定性常常被忽略辟狈。讓我們通過在祖先值上獲得95%的CI來解決這一問題,然后讓我們表明,95%的CI在大約95%的時間內(nèi)包含生成值:
##
## first, let's go back to our previous dataset
print(fit)
mean(((a>=fit$CI95[,1]) + (a<=fit$CI95[,2]))==2)
## [1] 0.92
## custom function that conducts a simulation, estimates ancestral
## states, & returns the fraction on 95% CI
foo<-function(){
tree<-pbtree(n=100)
x<-fastBM(tree,internal=TRUE)
fit<-fastAnc(tree,x[1:length(tree$tip.label)],CI=TRUE)
mean(((x[1:tree$Nnode+length(tree$tip.label)]>=fit$CI95[,1]) +
(x[1:tree$Nnode+length(tree$tip.label)]<=fit$CI95[,2]))==2)
}
## conduct 100 simulations
pp<-replicate(100,foo())
mean(pp)
## [1] 0.9448485
這表明哼转,雖然CI可以很大明未,但當(dāng)模型正確時,它們將包括生成值(1-α)%的時間
也可以使用布朗運(yùn)動以外的其他角色進(jìn)化模型來估計祖先的狀態(tài)壹蔓。例如趟妥,我們可以根據(jù)我們已經(jīng)學(xué)到的OU模型來估計祖先的狀態(tài)。需要注意的一點(diǎn)是:當(dāng)選擇參數(shù)(α)非常大時佣蓉,祖先狀態(tài)都會趨向于相同的值(擬合的最優(yōu)披摄,θ):
## simulate tree & data under the OU model
alpha<-2
tree<-pbtree(n=100,scale=1)
x<-fastBM(tree,internal=TRUE,model="OU",alpha=alpha,sig2=0.2)
## Warning: alpha but not theta specified in OU model, setting theta to a.
a<-x[as.character(1:tree$Nnode+Ntip(tree))]
x<-x[tree$tip.label]
phenogram(tree,c(x,a),ftype="off",spread.labels=FALSE)
title("Evolution under OU")
## fit the model (this may be slow)
fit<-anc.ML(tree,x,model="OU")
fit
## $sig2
## [1] 0.05573428
##
## $alpha
## [1] 1e-08
##
## $ace
## 101 102 103 104 105
## 9.963462e-03 2.244757e-02 2.240573e-02 -5.931035e-02 -4.329498e-02
## 106 107 108 109 110
## -3.480964e-02 -4.081294e-02 -1.143122e-01 -2.263273e-02 4.155330e-02
## 111 112 113 114 115
## 4.264017e-02 -1.810736e-02 -9.407912e-02 -1.363360e-01 -6.611306e-02
## 116 117 118 119 120
## -1.392603e-02 -3.488464e-02 -9.974915e-02 -1.471439e-01 -1.667432e-01
## 121 122 123 124 125
## -1.680456e-01 2.899822e-02 4.833528e-02 3.506143e-02 1.868645e-02
## 126 127 128 129 130
## 4.748435e-02 3.448893e-02 3.250233e-02 -1.538064e-03 5.460770e-02
## 131 132 133 134 135
## 1.287799e-01 1.331156e-01 1.469378e-01 8.486024e-03 6.439102e-03
## 136 137 138 139 140
## -5.997895e-02 -2.624823e-02 1.156061e-01 1.450449e-01 1.643474e-01
## 141 142 143 144 145
## 8.871516e-02 -1.500705e-01 -2.088446e-01 -1.485247e-01 -9.546454e-02
## 146 147 148 149 150
## -1.648921e-01 -1.003483e-01 -1.122847e-01 -9.953868e-02 2.319007e-01
## 151 152 153 154 155
## 4.114431e-02 3.261996e-02 5.558858e-02 -1.093121e-03 -2.343046e-03
## 156 157 158 159 160
## -9.063816e-05 -3.395186e-02 -7.084372e-02 -2.180441e-03 -9.891307e-02
## 161 162 163 164 165
## -3.951237e-02 -3.874053e-02 -2.391682e-02 9.737577e-02 -3.801571e-02
## 166 167 168 169 170
## -3.578902e-02 -2.162303e-02 -5.260689e-03 -1.308687e-02 -2.051804e-02
## 171 172 173 174 175
## 1.033658e-02 2.185436e-03 4.877736e-02 9.016846e-02 -6.244681e-03
## 176 177 178 179 180
## 2.600085e-03 -2.081362e-02 -6.850276e-02 -1.012200e-01 -9.815220e-02
## 181 182 183 184 185
## -9.432228e-02 -9.482774e-02 -1.266490e-01 -3.097925e-01 -1.008618e-01
## 186 187 188 189 190
## -1.202127e-02 -1.335903e-02 -3.088787e-04 -1.270292e-01 7.848848e-03
## 191 192 193 194 195
## 2.054886e-01 2.441548e-01 -4.855077e-02 -1.614209e-02 -3.053233e-02
## 196 197 198 199
## -5.844232e-02 2.570525e-02 2.987429e-02 -1.579888e-01
##
## $logLik
## [1] 263.5128
##
## $counts
## function gradient
## 8 8
##
## $convergence
## [1] 0
##
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
##
## $model
## [1] "OU"
##
## attr(,"class")
## [1] "anc.ML"
plot(a,fit$ace,xlab="true values",ylab="estimated under OU")
lines(range(x),range(x),lty="dashed",col="red")
title("ancestral states estimated under OU")
祖先狀態(tài)估計當(dāng)某些節(jié)點(diǎn)已知時。我們可以在祖先狀態(tài)的MLE過程中從理論上確定任何內(nèi)部節(jié)點(diǎn)的狀態(tài)勇凭。在實(shí)踐中疚膊,我們可以通過將長度為零的終端邊緣附加到要修復(fù)的內(nèi)部節(jié)點(diǎn),然后將其作為分析中的另一個端值來處理虾标。另一種可能也考慮到祖先狀態(tài)不確定的可能性寓盗,即在內(nèi)部節(jié)點(diǎn)上使用一個或多個狀態(tài)上的信息先驗概率密度,然后使用貝葉斯MCMC來估計祖先狀態(tài)璧函。讓我們嘗試一下傀蚌,用一個非常糟糕的例子來估計祖先的特征,來自同一時期的尖端數(shù)據(jù):有趨勢的布朗進(jìn)化蘸吓。請注意喳张,雖然理論上我們可以這樣做,但我們并不適合這里的趨勢模型美澳。
tree<-pbtree(n=100,scale=1)
## simulate data with a trend
x<-fastBM(tree,internal=TRUE,mu=3)
phenogram(tree,x,ftype="off",spread.labels=FALSE)
a<-x[as.character(1:tree$Nnode+Ntip(tree))]
x<-x[tree$tip.label]
## let's see how bad we do if we ignore the trend
plot(a,fastAnc(tree,x),xlab="true values",
ylab="estimated states under BM")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red")
title("estimated without prior information")
## incorporate prior knowledge
pm<-setNames(c(1000,rep(0,tree$Nnode)),
c("sig2",1:tree$Nnode+length(tree$tip.label)))
## the root & two randomly chosen nodes
nn<-as.character(c(length(tree$tip.label)+1,
sample(2:tree$Nnode+length(tree$tip.label),2)))
pm[nn]<-a[as.character(nn)]
## prior variance
pv<-setNames(c(1000^2,rep(1000,length(pm)-1)),names(pm))
pv[as.character(nn)]<-1e-100
## run MCMC
mcmc<-anc.Bayes(tree,x,ngen=100000,
control=list(pr.mean=pm,pr.var=pv,
a=pm[as.character(length(tree$tip.label)+1)],
y=pm[as.character(2:tree$Nnode+length(tree$tip.label))]))
anc.est<-colMeans(mcmc[201:1001,
as.character(1:tree$Nnode+length(tree$tip.label))])
plot(a[names(anc.est)],anc.est,xlab="true values",
ylab="estimated states using informative prior")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red")
title("estimated using informative prior")
接下來,重點(diǎn)介紹如何使用通常稱為Mk模型的連續(xù)時間馬爾可夫鏈模型來估計離散值性狀的祖先特征狀態(tài)摸航。讓我們從阿諾利樹中提取數(shù)據(jù)制跟,以便在這些分析中使用:
require(phytools)
data(anoletree)
## this is just to pull out the tip states from the
## data object - normally we would read this from file
x<-getStates(anoletree,"tips")
tree<-anoletree
rm(anoletree)
tree
##
## Phylogenetic tree with 82 tips and 81 internal nodes.
##
## Tip labels:
## Anolis_ahli, Anolis_allogus, Anolis_rubribarbus, Anolis_imias, Anolis_sagrei, Anolis_bremeri, ...
##
## Rooted; includes branch lengths.
plotTree(tree,type="fan",fsize=0.8,ftype="i")
## estimate ancestral states under a ER model
fitER<-ace(x,tree,model="ER",type="discrete")
fitER
##
## Ancestral Character Estimation
##
## Call: ace(x = x, phy = tree, type = "discrete", model = "ER")
##
## Log-likelihood: -78.04604
##
## Rate index matrix:
## CG GB TC TG Tr Tw
## CG . 1 1 1 1 1
## GB 1 . 1 1 1 1
## TC 1 1 . 1 1 1
## TG 1 1 1 . 1 1
## Tr 1 1 1 1 . 1
## Tw 1 1 1 1 1 .
##
## Parameter estimates:
## rate index estimate std-err
## 1 0.0231 0.004
##
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
## CG GB TC TG Tr Tw
## 0.018197565 0.202238628 0.042841575 0.428609607 0.004383532 0.303729093
round(fitER$lik.anc,3)
$lik.anc變量主要包含臨界祖先狀態(tài),也被稱為“經(jīng)驗貝葉斯后驗概率”:
plotTree(tree,type="fan",fsize=0.8,ftype="i")
nodelabels(node=1:tree$Nnode+Ntip(tree),
pie=fitER$lik.anc,piecol=cols,cex=0.5)
tiplabels(pie=to.matrix(x,sort(unique(x))),piecol=cols,cex=0.3)
add.simmap.legend(colors=cols,prompt=FALSE,x=0.9*par()$usr[1],
y=-max(nodeHeights(tree)),fsize=0.8)
以上概述的方法是使用MCMC從它們的后驗概率分布中對性狀歷史進(jìn)行采樣酱虎,這叫做隨機(jī)性狀映射雨膨。模型是相同的,但在這種情況下读串,我們得到一個明確的歷史樣本聊记,我們的離散型性狀在樹上的演變-而不是一個概率分布的性狀在節(jié)點(diǎn)。例如恢暖,考慮到上面模擬的數(shù)據(jù)排监,我們可以生成如下的隨機(jī)圖:
## simulate single stochastic character map using empirical Bayes method
mtree<-make.simmap(tree,x,model="ER")
plotSimmap(mtree,cols,type="fan",fsize=0.8,ftype="i")
add.simmap.legend(colors=cols,prompt=FALSE,x=0.9*par()$usr[1],
y=-max(nodeHeights(tree)),fsize=0.8)
單個隨機(jī)字符映射并不意味著完全孤立-我們需要從隨機(jī)映射的樣本中查看整個分布。這可能有點(diǎn)令人難以抗拒杰捂。例如舆床,以下代碼從我們的數(shù)據(jù)集中生成100個隨機(jī)字符映射,并在網(wǎng)格中繪制它們:
mtrees<-make.simmap(tree,x,model="ER",nsim=100)
par(mfrow=c(10,10))
null<-sapply(mtrees,plotSimmap,colors=cols,lwd=1,ftype="off")
用一種更有意義的方式概括一組隨機(jī)地圖是可能的。例如挨队,在我們的模型下谷暮,我們可以估計每種類型的變化數(shù)、在每種狀態(tài)下花費(fèi)的時間比例以及每個內(nèi)部節(jié)點(diǎn)在每個狀態(tài)下的后驗概率盛垦。例如:
pd<-describe.simmap(mtrees,plot=FALSE)
pd
plot(pd,fsize=0.6,ftype="i")
## now let's plot a random map, and overlay the posterior probabilities
plotSimmap(mtrees[[1]],cols,type="fan",fsize=0.8,ftype="i")
nodelabels(pie=pd$ace,piecol=cols,cex=0.5)
add.simmap.legend(colors=cols,prompt=FALSE,x=0.9*par()$usr[1],
y=-max(nodeHeights(tree)),fsize=0.8)
最后湿弦,由于我們在完全相同的模型下獲得了這些推論,讓我們將隨機(jī)映射的后驗概率與我們的邊緣祖先狀態(tài)進(jìn)行比較腾夯。在前一種情況下颊埃,我們的概率是通過對祖先的聯(lián)合(而不是邊緣)概率分布的抽樣來獲得的。
plot(fitER$lik.anc,pd$ace,xlab="marginal ancestral states",
ylab="posterior probabilities from stochastic mapping")
lines(c(0,1),c(0,1),lty="dashed",col="red",lwd=2)
這告訴我們俯在,雖然聯(lián)合和邊緣祖先狀態(tài)重建是不一樣的竟秫,但聯(lián)合隨機(jī)抽樣的邊緣概率和邊緣祖先狀態(tài)是高度相關(guān)的-至少在這個案例研究中是這樣的。
實(shí)操演練
- 數(shù)據(jù)集中性狀1和性狀2之間是否存在相關(guān)性(使用標(biāo)準(zhǔn)相關(guān)性GLS)跷乐,然后再使用PGLS來確定數(shù)據(jù)集中的性狀1和2之間是否存在進(jìn)化相關(guān)性肥败,使用AIC來確定對PGLS使用OU還是Pagel的λ。
library(geiger)
gruntTree <- read.tree("grunts.phy")
gruntData <- read.csv("grunts.csv", row.names=1)
name.check(gruntTree, gruntData)
library(nlme)
plot(trait2~trait1, data=gruntData)
###run PGLS
pglsResult<-gls(trait1~trait2, cor=corBrownian(phy=gruntTree), data=gruntData, method="ML")
pglsResultOU<-gls(trait1~trait2, cor=corMartins(1, phy=gruntTree, fixed=F), data=gruntData, method="ML")
pglsResultPagel<-gls(trait1~trait2, cor=corPagel(1, phy=gruntTree, fixed=F), data=gruntData, method="ML")
anova(pglsResult, pglsResultOU, pglsResultPagel)
## Model df AIC BIC logLik Test L.Ratio
## pglsResult 1 3 59.81018 65.54625 -26.905088
## pglsResultOU 2 4 30.42696 38.07506 -11.213482 1 vs 2 31.38321
## pglsResultPagel 3 4 22.83320 30.48129 -7.416599
## p-value
## pglsResult
## pglsResultOU <.0001
## pglsResultPagel
# the main one is the lowest AIC, choose that
anova(pglsResultPagel)
## Denom. DF: 48
## numDF F-value p-value
## (Intercept) 1 0.004714 0.9455
## trait2 1 6.053490 0.0175
# yes there is a relationship.
- 測試連續(xù)型性狀進(jìn)化模型愕提,將BM馒稍、OU以及EB三種模型應(yīng)用于性狀1、性狀2以及性狀3的數(shù)據(jù)集中浅侨。
results<-matrix(nrow=3, ncol=3)
for(i in 3:5) {
xx <- gruntData[,i]
names(xx) <- rownames(gruntData)
bm<-fitContinuous(gruntTree, xx)
ou<-fitContinuous(gruntTree, xx, model="OU")
eb<-fitContinuous(gruntTree, xx, model="EB")
aicc<-c(bm$opt$aicc, ou$opt$aicc, eb$opt$aicc)
aiccD <- aicc - min(aicc)
aw <- exp(-0.5 * aiccD)
aiccW <- aw/sum(aw)
results[i-2,]<-aiccW
}
rownames(results)<-colnames(gruntData)[3:5]
colnames(results)<-c("bm", "ou", "eb")
# Table of AICc weights
results
## bm ou eb
## trait1 0.4583683 0.3940379 0.1475937
## trait2 0.5292929 0.3002759 0.1704312
## trait3 0.5230986 0.3084647 0.1684367
# 所以EB模型較合適
- 重構(gòu)性狀1的祖先狀態(tài)纽谒。
library(phytools)
habitat<-gruntData[,2]
names(habitat)<-rownames(gruntData)
fitER<-ace(habitat,gruntTree,model="ER",type="discrete")
cols<-setNames(palette()[1:length(unique(habitat))],sort(unique(habitat)))
plotTree(gruntTree,type="fan",fsize=0.8,ftype="i")
nodelabels(node=1:gruntTree$Nnode+Ntip(gruntTree),
pie=fitER$lik.anc,piecol=cols,cex=0.5)
tiplabels(pie=to.matrix(habitat,sort(unique(habitat))),piecol=cols,cex=0.3)
add.simmap.legend(colors=cols,prompt=FALSE,x=0.9*par()$usr[1],
y=-max(nodeHeights(gruntTree)),fsize=0.8)
t1 <- gruntData[,"trait1"]
names(t1) <- rownames(gruntData)
fit <- fastAnc(gruntTree, t1, vars=T, CI=T)
obj<-contMap(gruntTree,t1,plot=FALSE)
plot(obj,type="fan",legend=0.7*max(nodeHeights(gruntTree)))
物種多樣化
使用RPANDA與BAMM軟件分析物種多樣性
BAMM(Bayesian Analysis of Macroevolutionary Mixtures)是一個利用rjMCMC模型對物種形成、滅絕和性狀進(jìn)化的復(fù)雜動態(tài)進(jìn)行建模的軟件如输。然而鼓黔,本教程將只集中于物種的形成和滅絕率的估計。運(yùn)行BAMM需要兩個主要輸入文件:控制文件和系統(tǒng)發(fā)生樹不见。
控制文件以及樹文件內(nèi)容等詳細(xì)信息可閱讀說明書澳化,下面是運(yùn)行分析結(jié)果:
install.packages("BAMMtools")
library(BAMMtools)
tree <- read.tree("whaletree.tre")
plot(tree, cex = 0.35)
events <- read.csv("event_data.txt")
data(whales, events.whales)
ed <- getEventData(whales, events.whales, burnin=0.25)
head(ed$eventData)
bamm.whales <- plot.bammdata(ed, lwd=2, labels = T, cex = 0.5)
addBAMMshifts(ed, cex=2)
addBAMMlegend(bamm.whales, location = c(-1, -0.5, 40, 80), nTicks = 4, side = 4, las = 1)
#We can also plot the rate through time estimated by BAMM
plot.new()
plotRateThroughTime(ed, intervalCol="red", avgCol="red", ylim=c(0,1), cex.axis=2)
text(x=30, y= 0.8, label="All whales", font=4, cex=2.0, pos=4)
#plot the rate through time estimated for a given clade
plotRateThroughTime(ed, intervalCol="blue", avgCol="blue", node=141, ylim=c(0,1),cex.axis=1.5)
text(x=30, y= 0.8, label="Dolphins only", font=4, cex=2.0, pos=4)
#exclude this given clade and thus plot the rate through time for the “background clade”
plotRateThroughTime(ed, intervalCol="darkgreen", avgCol="darkgreen", node=141, nodetype = "exclude", ylim=c(0,1), cex.axis=1.5)
text(x=30, y= 0.8, label="Non-dolphins", font=4, cex=2.0, pos=4)
##Bamm estimates speciation and extinction rates for each branch on the tree.
##Here we will extract the rates estimated for each tip and plot in different ways.
tip.rates <- getTipRates(ed)
str(tip.rates)
hist(tip.rates$lambda.avg,xlab="average lambda"
hist(tip.rates$mu.avg,xlab="average mu")
##you can separate different groups of taxa to compare their tip rates
boxplot(tip.rates$lambda.avg[53:87], tip.rates$lambda.avg[1:52], col = c("red", "blue"), names = c("dolphins", "other cetaceans"))
RPANDA利用最大似然分析,可以擬合不同的速率隨時間變化的模型稳吮,并選擇最優(yōu)的擬合模型缎谷。RPANDA與BAMM主要有兩個區(qū)別:
- RPANDA軟件必須指定測試的模型,而BAMM會計算每一分支的速率灶似;
- RPANDA軟件列林,用戶必須事先知道哪一類可能會顯示特定的多樣化,而BAMM將使用rjmcmc算法估計速率偏移的位置酪惭;