代謝組學(xué)中差異代謝物的識(shí)別仍然是一個(gè)巨大的挑戰(zhàn),并在代謝組學(xué)數(shù)據(jù)分析中發(fā)揮著突出的作用。由于分析悔雹、實(shí)驗(yàn)和生物的模糊性,代謝組學(xué)數(shù)據(jù)集經(jīng)常包含異常值欣喧,但目前可用的差異代謝物識(shí)別技術(shù)對(duì)異常值很敏感荠商。作者這里提出了一種基于權(quán)重的具有穩(wěn)健性火山圖方法,助于從含有離群值的代謝組數(shù)據(jù)中更加準(zhǔn)確鑒定差異代謝物续誉。
作者將提出的方法與其它9種代謝組中常用的方法(包含t-test)進(jìn)行比較莱没,發(fā)現(xiàn)有較低的misclassification error rates(MERs)和較高的AUC,證明了該方法的優(yōu)越性酷鸦。
安裝加載數(shù)據(jù)集
Github:https://github.com/nishithkumarpaul/Rvolcano
library(devtools)
install_github("nishithkumarpaul/Rvolcano")
library(Rvolcano)
data("DummyFullData")
dim(DummyFullData)
[1] 40 40
數(shù)據(jù)集由40行代謝物和40個(gè)樣本組成饰躲,其中(癌癥和健康組各20生物學(xué)重復(fù))
計(jì)算foldchange
fcval <- foldChngCalc(DummyFullData,nSampG1 = 20,nSampG2 = 20)
head(fcval)
## 查看該函數(shù)
edit(foldChngCalc)
function (data, nSampG1, nSampG2)
{
g1fold <- apply((data[, 1:nSampG1]), 1, weightedMean)
g2fold <- apply((data[, (nSampG1 + 1):(nSampG1 + nSampG2)]),
1, weightedMean)
foldChange <- g2fold - g1fold
return(foldChange)
}
我們查看下foldChngCalc該函數(shù),可以看到主要是通過(guò)apply函數(shù)計(jì)算了weightedMean值,即加權(quán)平均數(shù), 而不是我們常用的mean(算術(shù)平均數(shù))臼隔。
算術(shù)平均數(shù)又稱均值嘹裂,是統(tǒng)計(jì)學(xué)中基本的平均指標(biāo),就是簡(jiǎn)單的把所有數(shù)加起來(lái)然后除以個(gè)數(shù)摔握。
加權(quán)平均數(shù)即加權(quán)平均值寄狼,是將各數(shù)值乘以相應(yīng)的權(quán)數(shù),然后加總求和得到總體值氨淌,再除以總的單位數(shù)泊愧。
通過(guò)一個(gè)例子計(jì)算下,觀察weightedMean函數(shù)和mean有啥區(qū)別盛正,,我們創(chuàng)建一個(gè)正態(tài)分布的向量集x和一個(gè)含有較高離群值的向量集删咱,分布計(jì)算一下結(jié)果。
> set.seed(123)
> x<-c(rnorm(200,3,1))
> x1<-c(rnorm(180,3,1),rnorm(20,60,3)) #contain 20 outliers#
> weightedMean(x)
[1] 2.957601
> mean(x)
[1] 2.99143
> weightedMean(x1)
[1] 3.062994
> mean(x1)
[1] 8.712792
## 查看該函數(shù)的計(jì)算方法
function (a, b = 0.2)
{
w <- NULL
pw <- NULL
for (i in 1:length(a)) {
w[i] <- exp(-(b/2) * (mad(a)^-2) * (a[i] - median(a))^2)
pw[i] <- w[i] * a[i]
}
tpw <- sum(pw)
tw <- sum(w)
wm <- tpw/tw
dd <- c(tpw, tw, wm)
return(wm)
}
結(jié)果發(fā)現(xiàn)豪筝,不含離群值時(shí)痰滋,加權(quán)與算術(shù)平均數(shù)都一樣,而含有離群值時(shí)候续崖,結(jié)果容易形成誤差敲街,算術(shù)平均數(shù)易受極端數(shù)據(jù)的影響,極端值的出現(xiàn)严望,會(huì)使平均數(shù)的真實(shí)性受到干擾多艇,加權(quán)差異有助于在進(jìn)行計(jì)算時(shí)考慮更多數(shù)據(jù),以便獲得最準(zhǔn)確的結(jié)果著蟹。
計(jì)算Pvalue
使用p.valcalc函數(shù)進(jìn)行差異分析墩蔓,使用robust vesion的t-test獲取p值梢莽,公式可以查看函數(shù)源碼,用到了weightedVar加權(quán)方差奸披。
## 添加Pvalue
pval <- NULL
for (i in 1:dim(DummyFullData)[1]){
pval[i] <- p.valcalc(DummyFullData[i,1:20],DummyFullData[i,21:40])
}
##查看該源碼
edit(p.valcalc)
function (x, y)
{
t.stat <- (weightedMean(x) - weightedMean(y))/sqrt(((length(x) -
1) * weightedVar(x) + (length(y) - 1) * weightedVar(y)) *
(1/(length(x) + length(y) - 2)) * ((1/length(x)) + 1/length(y)))
pval <- 2 * pt(abs(t.stat), (length(x) + length(y) - 2),
lower.tail = FALSE)
return(pval)
}
繪制火山圖
log2foldchange和Pvalue都有了昏名,直接使用RobVolPlot函數(shù)繪圖
##Robust volcano plot
RobVolPlot(fcval,pval,cexcutoff = 0.7,cexlab = 0.8,
main = 'Volcano Plot',
xlab = 'log2fc',
ylab = '-log(Pvalue)',
plimit = 0.05,fclimit = 2)
edit(RobVolPlot)
ggplot2繪圖
已經(jīng)得到了log2foldchange和Pvalue數(shù)據(jù)了,我們就可以自己拿著數(shù)據(jù)繪制圖形了
library(ggrepel)
## 獲取數(shù)據(jù)
result <- data.frame(log2foldchange = fcval,p_value = pval)
result$threshold = factor(ifelse(result$p_value < 0.05 & abs(result$log2foldchange) >= 1,
ifelse(result$log2foldchange>= 1 ,'Up','Down'),'NoSignificance'),
levels=c('Up','Down','NoSignificance'))
##差異代謝物
re <- subset(result,p_value < 0.05)
ggplot(result,aes(log2foldchange, -log10(p_value)))+
geom_point(aes(color = threshold), size = 5) +
scale_colour_manual(values = c('#fb81f5', '#6868ed', 'gray40'), labels = c('Enriched metabolites', 'Depleted metabolites', 'No diff')) +
# 添加x軸標(biāo)簽
xlab(bquote(Log[2] ~ '(fold change)')) +
# 添加y軸標(biāo)簽
ylab(bquote(-Log[10] ~ italic('Pvalue')))+
# 添加標(biāo)簽:
geom_text_repel(data = re, aes(color = threshold,label = rownames(re)),max.overlaps = 20,
size = 4.8 ,fontface = "bold",arrow = arrow(length=unit(0.01, "npc")),force = 1, max.iter = 3e3,
box.padding = unit(0.6, 'lines'), segment.color = 'black', show.legend = FALSE)+
theme_bw()+
# 調(diào)整主題和圖例位置:
theme(panel.grid = element_blank(),
legend.position = c(0.01,0.25),
legend.justification = c(0,1)
)+
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "grey",size = 1)+
geom_vline(xintercept = c(-1.2,1.2), linetype = "dashed", color = "grey",size = 1)
參考資料