一恩敌、什么是批次效應(yīng)
批次效應(yīng)(batch effect)瞬测,表示樣品在不同批次中處理和測量產(chǎn)生的與試驗期間記錄的任何生物變異無關(guān)的技術(shù)差異。批次效應(yīng)是高通量試驗中常見的變異來源纠炮,受日期月趟、環(huán)境、處理組恢口、實驗人員孝宗、試劑、平臺等一些非生物因素的影響耕肩。
合并分析不同批次的數(shù)據(jù)時因妇,平常的標(biāo)準(zhǔn)化方法不足以調(diào)整批次之間的差異。如果批次效應(yīng)比較嚴(yán)重猿诸,這些差異就會干擾實驗結(jié)果婚被,我們就不能夠判斷得到的差異表達(dá)的基因是來源于想要研究的因素,還是和批次相關(guān)梳虽。
批次效應(yīng)不能被消除摔寨,只有盡可能的降低。校正批次效應(yīng)的目的是怖辆,減少批次之間的差異是复,盡量讓多個批次的數(shù)據(jù)重新組合在一起删顶,這樣下游分析就可以只考慮生物學(xué)差異因素。
Batch effects are sub-groups of measurements that have qualitatively different behaviour across conditions and are unrelated to the biological or scientific variables in a study. For example, batch effects may occur if a subset of experiments was run on Monday and another set on Tuesday, if two technicians were responsible for different subsets of the experiments or if two different lots of reagents, chips or instruments were used.
Normalization is a data analysis technique that adjusts global properties of measurements for individual samples so that they can be more appropriately compared. Including a normalization step is now standard in data analysis of gene expression experiments. But normalization does not remove batch effects, which affect specific subsets of genes and may affect different genes in different ways. [1]
二淑廊、處理方法
目前已經(jīng)有多種處理批次差異的方法逗余。[2]
批次效應(yīng)處理方法
三、哪種方法比較好
一項研究比較了6種去除批次效應(yīng)的方法季惩,其中包括ComBat方法(parametric prior method录粱,ComBat_p和non-parametric method,ComBat_n)画拾、代理變量法(Surrogate variable analysis啥繁,SVA)、基于比值的方法(Geometric ratio-based method青抛,Ratio_G)旗闽、平均中心方法(Mean-centering,PAMR)和距離加權(quán)判別(Distance-weighted discrimination蜜另,DWD)方法适室,綜合多種指標(biāo)認(rèn)為ComBat在精確性、準(zhǔn)確性和整體性能方面(precision, accuracy and overall performance)總體優(yōu)于其他方法举瑰。
A number of approaches have been used or developed for identifying and removing batch effects from microarray data, of which we have chosen six for evaluation.
We systematically evaluated six of these programs using multiple measures of precision, accuracy and overall performance. ComBat, an Empirical Bayes method, outperformed the other five programs by most metrics.
Its parametric and non-parametric algorithms both worked well in both kinds of data sets, controlling the variation attributable to batch effects, increasing the correlation among replicates, and producing the largest AUC in our assessment of overall performance. [3]
四捣辆、ComBat方法的算法
模型的假設(shè)是基于位置和尺度(Location and scale,L/S)的調(diào)整此迅。L/S調(diào)整可以定義為一系列廣泛的調(diào)整汽畴,其中為數(shù)據(jù)在批次內(nèi)的位置(均值)和/或規(guī)模(方差)假設(shè)了一個模型,然后調(diào)整批次以滿足假設(shè)模型的規(guī)范耸序。因此忍些,L/S批次調(diào)整假設(shè)批次效應(yīng)可以通過標(biāo)準(zhǔn)化批次之間的均值和方差來建模。這些調(diào)整可以從簡單的基因范圍的均值和方差標(biāo)準(zhǔn)化佑吝,到復(fù)雜的基因間線性或非線性調(diào)整坐昙。
模型:Yijg = αg + Xβg + γig + δigεijg
Yijg表示來自批次i的樣品j的基因g的表達(dá)值。其中αg是基因g的平均表達(dá)值芋忿,X是樣本條件的設(shè)計矩陣炸客,βg是對應(yīng)于X的回歸系數(shù)向量。誤差項εijg服從期望值為0和方差為σg的正態(tài)分布N(0戈钢,σg )痹仙,γig和δig表示批次i中基因g加法和乘法的批次效應(yīng)。
算法分為3步: [4]
算法
Location and scale (L/S) adjustments can be defined as a wide family of adjustments in which one assumes a model for the location (mean) and/or scale (variance) of the data within batches and then adjusts the batches to meet assumed model specifications. Therefore, L/S batch adjustments assume that the batch effects can be modeled out by standardizing means and variances across batches. These adjustments can range from simple gene-wise mean and variance standardization to complex linear or non-linear adjustments across the genes. [5]
五殉了、推薦的解決方法
(1)根據(jù)分析目的確定批次效應(yīng)處理方法:差異表達(dá)分析开仰,在模型中添加批次因素;可視化,先對原數(shù)據(jù)進(jìn)行校正众弓,再使用校正后的數(shù)據(jù)進(jìn)行分析恩溅。
(2)已知的批次,removeBatchEffect或ComBat谓娃;未知的批次脚乡,sva。
(3)removeBatchEffect和ComBat滨达、sva輸入的數(shù)據(jù)需要進(jìn)行轉(zhuǎn)換奶稠,例如取對數(shù)(rlog或logCPM)。
(4)read counts數(shù)據(jù)可使用ComBat-Seq或svaseq捡遍。
六锌订、差異表達(dá)分析的批次校正
很多人以為去除批次效應(yīng)是要改變你的表達(dá)矩陣,新的表達(dá)矩陣然后去走差異分析流程画株,其實大部分的差異分析流程包里面辆飘,人家內(nèi)置好了考慮你的批次效應(yīng)這樣的混雜因素的函數(shù)用法設(shè)計。例如:
構(gòu)建DESeq2對象時的設(shè)計公式:design = ~ batch + conditions
DESeq2
因為我們把批次這個因素污秆,寫在了DESeq2里面劈猪,所以它在幫我們做差異分析的時候昧甘,其實就考慮了良拼,得到的差異分析結(jié)果,就是去除了批次效應(yīng)的充边。
如果要合并不同批次的數(shù)據(jù)進(jìn)行差異表達(dá)分析庸推,建議直接把批次信息加入到構(gòu)建模型里面。但是這種方法并沒有改變原數(shù)據(jù)浇冰。如果你確實一定要親眼看看批次效應(yīng)到底是如何影響這個表達(dá)矩陣的贬媒,就需要對表達(dá)矩陣做另外的處理,例如removeBatchEffect或ComBat肘习。但是處理之后會改變counts矩陣际乘,之后就沒辦法走DESeq2差異分析流程啦,僅僅是為了拿到去除批次效應(yīng)前后對比的表達(dá)矩陣而已漂佩。
PCA
七脖含、使用limma的removeBatchEffect處理批次效應(yīng)
removeBatchEffect這個函數(shù)用于進(jìn)行聚類或無監(jiān)督分析之前,移除與雜交時間或其他技術(shù)變異相關(guān)的批次效應(yīng)投蝉。它是針對芯片設(shè)計的养葵,因此不要直接使用read counts,數(shù)據(jù)需要經(jīng)過一定的標(biāo)準(zhǔn)化操作瘩缆,如log轉(zhuǎn)化关拒。
removeBatchEffect只用于銜接聚類、PCA等可視化展示,不要在線性建模之前使用着绊。因為用矯正后的數(shù)據(jù)進(jìn)行差異表達(dá)分析谐算,有兩個缺陷:一是批次因素和分組因素可能重疊,所以直接對原數(shù)據(jù)矯正批次可能會抵消一部分真實生物學(xué)因素归露;二是低估了誤差氯夷。所以,如果想做差異表達(dá)分析靶擦,但數(shù)據(jù)中又有已知的批次問題腮考,則最好把批次效應(yīng)納入線性模型中。
This function is useful for removing batch effects, associated with hybridization time or other technical variables, prior to clustering or unsupervised analysis such as PCA, MDS or heatmaps. The design matrix is used to describe comparisons between the samples, for example treatment effects, which should not be removed. The function (in effect) fits a linear model to the data, including both batches and regular treatments, then removes the component due to the batch effects.
This function is intended for use with clustering or PCA, not for use prior to linear modelling. If linear modelling is intended, it is better to include the batch effect as part of the linear model.
removeBatchEffect用法
removeBatchEffect
八玄捕、使用SVA的ComBat處理批次效應(yīng)
sva這個R包可以處理已知的和未知的批次效應(yīng)踩蔚,sva函數(shù)可以通過構(gòu)建高維數(shù)據(jù)集的代理變量,移除批次效應(yīng)和其它所有不需要的變異枚粘。如果是芯片數(shù)據(jù)用sva馅闽,高通量測序數(shù)據(jù)使用svaseq。ComBat函數(shù)可以處理已知的批次效應(yīng)馍迄。
sva has functionality to estimate and remove artifacts from high dimensional data the sva function can be used to estimate artifacts from microarray data. the svaseq function can be used to estimate artifacts from count-based RNA-sequencing (and other sequencing) data. The ComBat function can be used to remove known batch effecs from microarray data. The fsva function can be used to remove batch effects for prediction problems.
ComBat allows users to adjust for batch effects in datasets where the batch covariate is known, using methodology described in Johnson et al. 2007. It uses either parametric or non-parametric
empirical Bayes frameworks for adjusting data for batch effects. Users are returned an expression matrix that has been corrected for batch effects. The input data are assumed to be cleaned and normalized before batch effect removal.
ComBat用法
ComBat
九福也、使用sva處理未知的批次
SVA具有移除批次效應(yīng)和高通量測序中其它不需要的變異的功能。使用sva識別和構(gòu)建高維數(shù)據(jù)集代理變量(surrogate variables)攀圈,代理變量是由高維數(shù)據(jù)直接構(gòu)建的協(xié)變量(covariates)暴凑,可用于后續(xù)分析,以調(diào)整未知赘来、未建南衷或潛在的噪聲源。
sva函數(shù)的輸出本身就是代理變量犬辰。它們可以包含在全模型矩陣和空模型矩陣中嗦篱,然后與數(shù)據(jù)矩陣一起傳遞給SVA包中的f.pvalue函數(shù),以計算參數(shù)F檢驗p值幌缝,從而調(diào)整代理變量灸促。
根據(jù)經(jīng)驗,當(dāng)存在大量已知或未知的潛在混雜因素時涵卵,代理變量調(diào)整(sva)可能更合適浴栽。當(dāng)一個或多個生物學(xué)分組已知是異質(zhì)的,并且有已知的批次變量時缘厢,直接調(diào)整(ComBat)可能更合適吃度。
sva考慮了兩種類型的變量:調(diào)整變量和感興趣的變量。例如贴硫,感興趣變量(variables of interest)可能是癌癥組與對照組椿每;調(diào)整變量(adjustment variables)可能是病人的年齡伊者、性別、測序時間间护。
建立兩個模型矩陣:全模型(full model)和空模型(null model)亦渗。空模型包括所有調(diào)整變量汁尺,而不包含感興趣的變量法精;全模型包括所有調(diào)整變量和感興趣的變量。我們將試圖分析感興趣的變量與基因表達(dá)之間的關(guān)聯(lián)痴突,調(diào)整調(diào)整變量搂蜓。模型矩陣可以使用model.matrix創(chuàng)建。sva的目標(biāo)是消除所有不需要的變異來源辽装,同時通過mod中包含的主要變量來檢測對比帮碰。
注意:在我們最初的工作中,使用識別函數(shù)來測量在近似對稱和連續(xù)尺度上的數(shù)據(jù)拾积。對于通常表示為counts的測序數(shù)據(jù)殉挽,更合適的模型可能涉及使用適度的對數(shù)函數(shù)。例如拓巧,我們首先用log(counts+1)轉(zhuǎn)換基因表達(dá)數(shù)據(jù)斯碌。
The goal of sva is to remove all unwanted sources of variation while protecting the contrasts due to the primary variables specified in the function call. This leads to the identification of features that are consistently different between groups, removing all common sources of latent variation.
As a rule of thumb, when there are a large number of known or unknown potential confounders, surrogate variable adjustment may be more appropriate. Alternatively, when one or more biological groups is known to be heterogeneous, and there are known batch variables, direct adjustment may be more appropriate. [6]
用法:
(1)使用sva得到代理變量
##準(zhǔn)備數(shù)據(jù)
data(bladderdata)
pheno = pData(bladderEset)
head(pheno) #分組信息
edata = exprs(bladderEset)
head(edata) #表達(dá)量信息
##建立兩個模型
pheno$cancer
mod = model.matrix(~as.factor(cancer), data=pheno)
mod #全模型包括所有調(diào)整變量和感興趣的變量,這里只有感興趣的變量
mod0 = model.matrix(~1,data=pheno)
mod0 #空模型只有調(diào)整變量,這里沒有調(diào)整變量,所以只有一個截距intercept
##進(jìn)行SVA
n.sv = num.sv(edata,mod,method="leek") #估計潛在因素的數(shù)量
n.sv
svobj = sva(edata,mod,mod0,n.sv=n.sv) #應(yīng)用SVA函數(shù)來估計代理變量
str(svobj) #包含4項的列表
head(svobj$sv)
#SV是一個矩陣,其列對應(yīng)于估計的代理變量,它們可以用于下游分析
#pprob.gam是每個基因與一個或多個潛在變量相關(guān)的后驗概率
#pprob.b是每個基因與感興趣的變量相關(guān)的后驗概率
#n.sv是SVA估計的代理變量數(shù)
(2)使用f.pvalue函數(shù)調(diào)整代理變量(calculate parametric F-test P-values adjusted for surrogate variables)
#f.pvalue可以用來計算數(shù)據(jù)矩陣每一行(基因)的參數(shù)F檢驗p值
#F檢驗比較了模型mod和mod0
#沒有調(diào)整代理變量的原始矩陣
pValues = f.pvalue(edata,mod,mod0)
qValues = p.adjust(pValues,method="BH") #adjust them for multiple testing
head(qValues)
sum(qValues < 0.05)/length(qValues) #68.2%的基因都是差異表達(dá)的,This number seems artificially high
#Now we can perform the same analysis,but adjusting for surrogate variables.
#首先在空模型和全模型中包含代理變量
#原因是我們想要調(diào)整代理變量,所以我們把它們作為調(diào)整變量,必須包含在這兩個模型中
modSv = cbind(mod,svobj$sv)
mod0Sv = cbind(mod0,svobj$sv)
pValuesSv = f.pvalue(edata,modSv,mod0Sv)
qValuesSv = p.adjust(pValuesSv,method="BH")
head(qValuesSv)
sum(qValuesSv < 0.05)/length(qValuesSv)#66.5%的基因是差異表達(dá)的
(3)sva可以與差異表達(dá)分析程序一起使用,如limma肛度、DESeq2傻唾。
##sva+DESeq2
dds <- DESeqDataSetFromMatrix(countData = countdata,
colData = colData,
design = ~ Batch + Group)
dds <- dds[rowSums(counts(dds))>1,]
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized=TRUE)
dat <- dat[rowMeans(dat) > 1,]
head(dat)
mod <- model.matrix(~ Group, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
n.sv = num.sv(dat,mod,method="leek")
n.sv
svseq <- svaseq(dat, mod, mod0, n.sv=2)
svseq$sv
par(mfrow=c(2,1),mar=c(3,5,3,1))
stripchart(svseq$sv[,1] ~ dds$Group,vertical=TRUE,main="SV1")
abline(h=0)
stripchart(svseq$sv[,2] ~ dds$Group,vertical=TRUE,main="SV2")
abline(h=0)
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + Group
colData(ddssva)
ddssva <- DESeq(ddssva)
References
[1] Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L, et al. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods[J]. PloS One. 2011;6(2):e17238.
[2] 李颯,趙毅強(qiáng).基因表達(dá)數(shù)據(jù)批次效應(yīng)去除方法的研究進(jìn)展[J].南京農(nóng)業(yè)大學(xué)學(xué)報,2019,42(03):389-397.
[3] Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L, et al. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods[J]. PloS One. 2011;6(2):e17238.
[4]陳天成,侯艷,李康.基因組學(xué)數(shù)據(jù)整合中的批次效應(yīng)移除算法[J].中國衛(wèi)生統(tǒng)計,2016,33(03):527-529+533.
[5] Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods[J]. Biostatistics. 2007;8(1):118-27.
[6] Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments[J]. Bioinformatics. 2012;28(6):882-3.