Welcome to follow the "生信学习者" (Bioinformatics Learner) series across platforms:
- WeChat Official Account: 生信学习者
- Xiaohongshu: 生信学习者
- Zhihu: 生信学习者
- CSDN: 生信学习者2
Introduction
Microbiome data have the following characteristics, which means extra issues have to be considered when performing differential abundance analysis:
- Sparsity
- Compositional
- Overdispersion
The paper by Nearing, Douglas et al., "Microbiome differential abundance methods produce different results across 38 datasets" (Nature Communications), benchmarked the commonly used differential abundance methods; this post records the core code of each method.
The goal of that work was to compare how a range of common differential abundance (DA) methods behave on 16S rRNA gene datasets. For most of the analyses the authors did not know the true answer, so they could mainly report which tools perform most similarly to which others. Several additional analyses were included to help assess the false positive rate and the consistency of these tools on data drawn from the same dataset under different settings.
In short, the study evaluated how different microbiome differential abundance methods perform across many datasets, compared their similarity and consistency, and estimated how often these tools produce false positive results.
Sparsity
Even within the same environment, the occurrence probability and abundance of a given microbe differ between samples, and most microbes are present at very low abundance. Because of the detection limit of the sequencer, the chance of observing a zero (relative or absolute) abundance increases further. In addition, the size of the reference database used for read assignment, i.e. its species coverage, also has a large effect on the final abundance profile. The profile we end up with therefore inevitably contains a large number of zeros, of two kinds: true zeros and zeros introduced by technical error. Many algorithms build dedicated zero-handling strategies around these two cases.
The fraction of zeros determines the sparsity of the abundance profile; at the OTU level of a typical 16S dataset the zero proportion can exceed 80%. This sparsity means that standard hypothesis tests such as the t-test or Wilcoxon test are not directly appropriate. To deal with it, many R packages handle zeros explicitly, e.g. the zero classification in ANCOM or the ZIP/ZILN (zero-inflated Poisson / zero-inflated log-normal) models in metagenomeSeq; the processed matrix is then transformed, for example with CLR, which in turn addresses the compositional nature of microbiome data (covered in the next part). After such transformations the data are closer to familiar distributions and can be tested with standard two-group methods such as the Wilcoxon or t-test. Note that ANCOM additionally calls a taxon differential based on the proportion of significant pairwise tests, which is one reason for its robustness.
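As a minimal illustration (the count matrix otu_mat below is simulated purely for demonstration), the overall and per-taxon zero fractions of a taxa-by-samples table can be checked like this:
# Toy example: quantify sparsity of a count matrix (taxa in rows, samples in columns)
set.seed(1)
otu_mat <- matrix(rnbinom(20 * 6, mu = 2, size = 0.1), nrow = 20,
                  dimnames = list(paste0("ASV", 1:20), paste0("S", 1:6)))
mean(otu_mat == 0)                  # overall proportion of zero cells
zero_frac <- rowMeans(otu_mat == 0) # per-taxon zero proportion, often used as a prevalence filter
summary(zero_frac)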
Compositional
Compositional data live on a simplex: the abundances of all microbes within a sample sum to a fixed constant (1, 10, 100, and so on). Under this constraint the components are not independent of each other; if the proportion of one component changes, the proportions of the others must change as well, even though such dependence may not exist in the real microbial environment. To deal with compositionality, various normalization/transformation methods have been proposed (for example the CLR transform mentioned above, which is the one I am most familiar with for now).
Compositional data do not follow a Euclidean geometry; after a log-ratio transformation they can be mapped one-to-one into ordinary multivariate real space, which makes standard downstream analysis methods applicable.
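A minimal sketch of the CLR transform on one sample (the counts and the pseudocount of 0.5 are made up for illustration; the pseudocount avoids log(0)):
# CLR: log(x_i / geometric_mean(x)) for each component of one sample
counts <- c(ASV1 = 120, ASV2 = 30, ASV3 = 0, ASV4 = 850)
x <- counts + 0.5                 # pseudocount so that log(0) does not occur
clr_vals <- log(x) - mean(log(x)) # identical to log(x / geometric mean of x)
clr_vals
sum(clr_vals)                     # CLR values are centered, so they sum to ~0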
Overdispersion
Overdispersion means Variance >> mean, i.e. the variance of the data is far larger than its mean. The Poisson distribution commonly used for count matrices cannot accommodate this, which is why many methods instead fit the data with a negative binomial distribution.
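A quick toy check of this point (the simulated counts and the use of MASS::glm.nb are only for illustration):
# Overdispersed toy counts: the sample variance greatly exceeds the mean
library(MASS)
set.seed(42)
y <- rnbinom(50, mu = 10, size = 0.5)  # negative binomial counts
mean(y); var(y)                        # var >> mean, which a Poisson model cannot capture
fit_pois <- glm(y ~ 1, family = poisson)
fit_nb   <- glm.nb(y ~ 1)
AIC(fit_pois, fit_nb)                  # the negative binomial fit is usually strongly preferred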
Summary
To wrap up, here is a slide from one of my own talks that summarizes the three characteristics above.
Differential abundance methods
Different differential abundance methods can identify quite different sets of differential taxa, because their underlying principles differ. Whatever the method, it should respect the characteristics of microbiome data described above.
ALDEx2
ALDEx2 (ANOVA-Like Differential Expression 2) is a method for differential abundance analysis of microbiome data. It is designed for compositional data, in which the parts of each sample sum to a fixed value, e.g. the relative abundances of all taxa in a community summing to 1. Its core workflow consists of the following steps (a toy sketch of the first two steps follows this list):
- Posterior sampling: ALDEx2 uses a Dirichlet distribution to generate Monte Carlo instances of the posterior probability of the read counts of each feature (OTU or ASV). This reflects the compositional nature of the data, i.e. the data points lie on a low-dimensional simplex inside a high-dimensional space.
- Centered log-ratio (CLR) transformation: each Monte Carlo instance is CLR-transformed. This transform is appropriate for compositional data and removes much of the effect of the sum constraint, making the data amenable to conventional statistical analysis.
- Univariate tests: the transformed data are tested with univariate statistics such as Welch's t-test or the Wilcoxon rank-sum test to decide whether each feature differs between groups.
- Effect size estimation: ALDEx2 also computes an effect size, the between-group difference relative to the within-group variation, which helps judge the biological relevance of a difference.
- Multiple testing correction: the resulting p-values are corrected with the Benjamini-Hochberg method to control the false discovery rate.
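A toy sketch of the first two steps for a single sample (the counts, the pseudocount of 0.5 and the number of Monte Carlo instances are made up for illustration; this shows the idea, not the package internals):
# Draw Monte Carlo instances of the underlying proportions from a Dirichlet posterior,
# then CLR-transform each instance (Dirichlet draws via normalized independent gammas).
counts <- c(ASV1 = 10, ASV2 = 0, ASV3 = 250, ASV4 = 40)
n_mc <- 128
mc <- sapply(counts + 0.5, function(a) rgamma(n_mc, shape = a, rate = 1))
mc <- mc / rowSums(mc)                # each row is one Monte Carlo composition
clr_mc <- log(mc) - rowMeans(log(mc)) # CLR of every instance
dim(clr_mc)                           # 128 instances x 4 features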
#### Script to Run ALDEX2 differential abundance
deps = c("ALDEx2")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ALDEx2")
}
library(dep, character.only = TRUE)
}
library(ALDEx2)
args <- commandArgs(trailingOnly = TRUE)
#test if there are enough arguments supplied
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
#check if the same number of samples are being input.
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
#check if order of samples match up.
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arranged to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " samples were removed from the ASV table")
message("A total of ", grouping_num-length(rownames(groupings)), " samples were removed from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
results <- aldex(reads = ASV_table,
conditions = groupings[, 1],
mc.samples = 128,
test = "t",
effect = TRUE,
include.sample.summary = FALSE,
verbose = T,
denom = "all")
write.table(results, file=args[3], quote=FALSE, sep='\t', col.names = NA)
message("Results table saved to ", args[3])
ANCOM-II
ANCOM (Analysis of Composition of Microbiomes) is a statistical method designed to identify taxa whose abundance differs between samples or treatment groups. Its core workflow consists of the following steps:
- Data filtering: the data are first pre-processed to remove low-abundance / low-prevalence taxa (OTUs/ASVs), and counts are normalized or converted to relative abundances.
- Pseudocount addition: ANCOM relies on log transformations, and taxa with zero abundance cannot be log-transformed, so a small positive pseudocount is added.
- Computing the W statistic: for each taxon, ANCOM tests the log-ratio of that taxon to every other taxon in the table for a difference between groups (see the toy sketch after this section). The W statistic of a taxon is the number of these pairwise log-ratio tests that are significant after correction; taxa whose W exceeds a user-chosen threshold (commonly a fixed fraction, e.g. 0.6-0.7, of the total number of taxa) are declared differentially abundant.
- Interpreting the output: ANCOM typically reports a volcano-style plot of W against the CLR-transformed effect, together with a table listing the significant taxa and related information.
ANCOM's strengths include its handling of sparse data, a low false positive rate, and robustness to outliers. Its limitations include sensitivity to distributional assumptions and relatively high demands on sample size and feature dimensionality.
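A toy sketch of the W statistic under these assumptions (the simulated counts and the choice of a Wilcoxon test are illustrative; this is not the exact ANCOM implementation):
# For taxon i, count how many log-ratio tests log(taxon_i / taxon_j) ~ group are significant
set.seed(1)
mat <- matrix(rpois(8 * 10, lambda = 20) + 1, nrow = 8)  # 8 taxa x 10 samples, shifted to avoid zeros
group <- factor(rep(c("A", "B"), each = 5))
W <- sapply(seq_len(nrow(mat)), function(i) {
  p <- sapply(setdiff(seq_len(nrow(mat)), i), function(j) {
    wilcox.test(log(mat[i, ] / mat[j, ]) ~ group)$p.value
  })
  sum(p.adjust(p, method = "BH") < 0.05)
})
W  # taxa with W above a chosen fraction of the taxon count would be called differential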
deps = c("exactRankTests", "nlme", "dplyr", "ggplot2", "compositions")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
install.packages(dep)
}
library(dep, character.only = TRUE)
}
#args[4] will contain path for the ancom code
args <- commandArgs(trailingOnly = TRUE)
if (length(args) <= 3) {
stop("At least three arguments must be supplied", call.=FALSE)
}
source(args[[4]])
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arranged to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " samples were removed from the ASV table")
message("A total of ", grouping_num-length(rownames(groupings)), " samples were removed from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
groupings$Sample <- rownames(groupings)
prepro <- feature_table_pre_process(feature_table = ASV_table,
meta_data = groupings,
sample_var = 'Sample',
group_var = NULL,
out_cut = 0.05,
zero_cut = 0.90,
lib_cut = 1000,
neg_lb=FALSE)
feature_table <- prepro$feature_table
metadata <- prepro$meta_data
struc_zero <- prepro$structure_zeros
#run ancom
main_var <- colnames(groupings)[1]
p_adj_method = "BH"
alpha=0.05
adj_formula=NULL
rand_formula=NULL
res <- ANCOM(feature_table = feature_table,
meta_data = metadata,
struc_zero = struc_zero,
main_var = main_var,
p_adj_method = p_adj_method,
alpha=alpha,
adj_formula = adj_formula,
rand_formula = rand_formula)
write.table(res$out, file=args[3], quote=FALSE, sep="\t", col.names = NA)
Corncob
Corncob is an R package for microbiome data analysis that models microbial relative abundances and tests the effect of covariates on them. Its core ideas are:
- Relative abundance modelling: Corncob models relative abundance data with an explicit statistical model, taking into account the compositional nature of the data, i.e. that the relative abundances in a sample sum to 1.
- Beta-binomial distribution: Corncob assumes the counts of a taxon follow a beta-binomial distribution, which captures the discreteness and overdispersion of microbiome data (see the toy simulation after this list).
- Covariate effect testing: Corncob lets the user test the effect of one or more covariates on relative abundance, for example with a Wald test.
- Multiple testing correction: multiple comparisons are corrected to control the type I error rate, commonly with the Benjamini-Hochberg (BH) method.
- Model fitting and hypothesis testing: Corncob fits the model, estimates its parameters, and then uses hypothesis tests to decide whether the relative abundance of a given taxon differs significantly.
- Sparsity and zero handling: Corncob also accounts for the sparsity of microbiome data, i.e. that many taxa are undetected in most samples and the data contain a large number of zero counts.
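A toy simulation of the beta-binomial idea (library size, Beta parameters and sample number are made up for illustration): letting the success probability itself vary between samples inflates the variance relative to a plain binomial with the same mean.
# Beta-binomial vs binomial counts at the same mean relative abundance (~1%)
set.seed(1)
depth <- 10000                                   # library size per sample
p <- rbeta(100, shape1 = 2, shape2 = 198)        # per-sample relative abundance, mean 0.01
y_bb  <- rbinom(100, size = depth, prob = p)     # beta-binomial counts
y_bin <- rbinom(100, size = depth, prob = 0.01)  # plain binomial counts
c(var_beta_binomial = var(y_bb), var_binomial = var(y_bin))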
#### Run Corncob
#install corncob and phyloseq if they are not installed, then load them
deps = c("corncob", "phyloseq")
for (dep in deps){
  if (dep %in% installed.packages()[,"Package"] == FALSE){
    if (dep == "corncob"){
      # corncob is installed from GitHub (requires the devtools package)
      devtools::install_github("bryandmartin/corncob")
    }else{
      if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
      BiocManager::install(dep)
    }
  }
  library(dep, character.only = TRUE)
}
args <- commandArgs(trailingOnly = TRUE)
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
#check if the same number of samples are being input.
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
#check if order of samples match up.
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arranged to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " samples were removed from the ASV table")
message("A total of ", grouping_num-length(rownames(groupings)), " samples were removed from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
#run corncob
#put data into phyloseq object.
colnames(groupings)
colnames(groupings)[1] <- "places"
OTU <- phyloseq::otu_table(ASV_table, taxa_are_rows = T)
sampledata <- phyloseq::sample_data(groupings, errorIfNULL = T)
phylo <- phyloseq::merge_phyloseq(OTU, sampledata)
my_formula <- as.formula(paste("~","places",sep=" ", collapse = ""))
my_formula
results <- corncob::differentialTest(formula= my_formula,
phi.formula = my_formula,
phi.formula_null = my_formula,
formula_null = ~ 1,
test="Wald", data=phylo,
boot=F,
fdr_cutoff = 0.05)
write.table(results$p_fdr, file=args[[3]], sep="\t", col.names = NA, quote=F)
write.table(results$p, file=paste0(args[[3]], "_uncor", sep=""), sep="\t", col.names = NA, quote=F)
DESeq2
DESeq2 is a statistical method for differential analysis of count data such as RNA-seq or microbiome OTU (operational taxonomic unit) counts. Its core steps are:
- Normalization: DESeq2 first normalizes the raw counts to correct for differences in sequencing depth between samples. This is done by estimating a size factor for each sample with the median-of-ratios method and dividing that sample's counts by its size factor (see the toy sketch after this list).
- Dispersion estimation: DESeq2 estimates a dispersion for every feature (OTU or gene), i.e. how variable the data are. The dispersion is a parameter of the negative binomial distribution that describes overdispersion.
- Empirical Bayes shrinkage: DESeq2 shrinks the per-feature dispersion estimates towards a trend fitted across all features, which stabilizes the estimates, especially with few replicates.
- Negative binomial model: DESeq2 assumes the counts follow a negative binomial distribution, a standard choice for overdispersed count data.
- Design formula: the experimental design, including treatment effects, batch effects and other covariates, is encoded in a design formula, allowing the differential abundance of interest to be estimated while adjusting for the other terms.
- Hypothesis testing: DESeq2 tests whether the abundance of each feature differs significantly between groups, comparing the null hypothesis (no difference) against the alternative (a difference exists).
- Multiple testing correction: because many features are tested at once, DESeq2 applies the Benjamini-Hochberg correction to control the false discovery rate (FDR).
- Output: DESeq2 reports p-values, adjusted p-values, log2 fold changes and more, which help identify and interpret the biologically meaningful differences.
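A toy sketch of the median-of-ratios size factors described above (the small count matrix is made up; note that the script below instead uses sfType = "poscounts", a variant better suited to sparse microbiome tables):
# Median-of-ratios size factors for a taxa-by-samples count matrix
counts <- matrix(c(10, 20, 0, 40,
                   15, 30, 5, 60,
                   20, 45, 0, 90), nrow = 4)
log_geo_mean <- rowMeans(log(counts))     # per-taxon log geometric mean (-Inf where a zero occurs)
ok <- is.finite(log_geo_mean)             # taxa with any zero are excluded, as in the default method
size_factors <- apply(counts[ok, ], 2, function(cnts)
  exp(median(log(cnts) - log_geo_mean[ok])))
size_factors                              # counts are divided by these factors before testing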
#Run_DeSeq2
deps = c("DESeq2")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
}
library(dep, character.only = TRUE)
}
library(DESeq2)
args <- commandArgs(trailingOnly = TRUE)
#test if there are enough arguments supplied
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
#check if the same number of samples are being input.
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
#check if order of samples match up.
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arranged to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " samples were removed from the ASV table")
message("A total of ", grouping_num-length(rownames(groupings)), " samples were removed from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
colnames(groupings)[1] <- "Groupings"
#Run Deseq2
dds <- DESeq2::DESeqDataSetFromMatrix(countData = ASV_table,
colData=groupings,
design = ~ Groupings)
dds_res <- DESeq2::DESeq(dds, sfType = "poscounts")
res <- results(dds_res, tidy=T, format="DataFrame")
rownames(res) <- res$row
res <- res[,-1]
write.table(res, file=args[3], quote=FALSE, sep="\t", col.names = NA)
message("Results written to ", args[3])
edgeR
edgeR (Empirical Analysis of Digital Gene Expression in R) is a statistical method for count data from high-throughput sequencing, widely used in transcriptomics and applicable to microbiome profiles as well. Its core steps are as follows (a minimal toy workflow is sketched right after this list, before the full script):
- Normalization: edgeR computes normalization factors to adjust for differences in library size / sequencing depth between samples, so that comparisons are fair.
- Dispersion estimation: edgeR estimates a dispersion for each gene or OTU, a parameter describing how variable the counts are, which is essential for the subsequent tests.
- Negative binomial model: edgeR assumes the counts follow a negative binomial distribution, a standard model for overdispersed count data.
- Empirical Bayes shrinkage: per-feature dispersions are shrunk towards a shared trend, borrowing information across features to make the estimates more accurate.
- Design matrix: the experimental design, including treatment, time and batch effects, is encoded in a design matrix, allowing expression or abundance differences to be assessed under different conditions.
- Statistical testing: edgeR uses the likelihood ratio test (LRT) or the exact test to decide whether a gene or OTU differs significantly between groups.
- Multiple testing correction: corrections such as the Benjamini-Hochberg (BH) method are applied to control the false discovery rate (FDR).
- Output: edgeR reports p-values, adjusted p-values, log fold changes (LFC) and more, helping to identify and interpret the biological changes in the data.
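A minimal toy workflow mirroring the steps above, without the phyloseq wrapper used in the full script (the simulated counts and group labels are made up for illustration):
# DGEList -> TMM normalization -> dispersion estimation -> exact test
library(edgeR)
set.seed(1)
toy_counts <- matrix(rnbinom(40 * 6, mu = 20, size = 0.5), nrow = 40,
                     dimnames = list(paste0("ASV", 1:40), paste0("S", 1:6)))
grp <- factor(rep(c("A", "B"), each = 3))
d <- DGEList(counts = toy_counts, group = grp)
d <- calcNormFactors(d, method = "TMM")
d <- estimateDisp(d)
et <- exactTest(d)
topTags(et, n = 5)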
deps = c("edgeR", "phyloseq")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(dep)
}
library(dep, character.only = TRUE)
}
### Taken from phyloseq authors at: https://joey711.github.io/phyloseq-extensions/edgeR.html
phyloseq_to_edgeR = function(physeq, group, method="RLE", ...){
require("edgeR")
require("phyloseq")
# Enforce orientation.
if( !taxa_are_rows(physeq) ){ physeq <- t(physeq) }
x = as(otu_table(physeq), "matrix")
# Add one to protect against overflow, log(0) issues.
x = x + 1
# Check `group` argument
if( identical(all.equal(length(group), 1), TRUE) & nsamples(physeq) > 1 ){
# Assume that group was a sample variable name (must be categorical)
group = get_variable(physeq, group)
}
# Define gene annotations (`genes`) as tax_table
taxonomy = tax_table(physeq, errorIfNULL=FALSE)
if( !is.null(taxonomy) ){
taxonomy = data.frame(as(taxonomy, "matrix"))
}
# Now turn into a DGEList
y = DGEList(counts=x, group=group, genes=taxonomy, remove.zeros = TRUE, ...)
# Calculate the normalization factors
z = calcNormFactors(y, method=method)
# Check for division by zero inside `calcNormFactors`
if( !all(is.finite(z$samples$norm.factors)) ){
stop("Something wrong with edgeR::calcNormFactors on this data,
non-finite $norm.factors, consider changing `method` argument")
}
# Estimate dispersions
return(estimateTagwiseDisp(estimateCommonDisp(z)))
}
args <- commandArgs(trailingOnly = TRUE)
#test if there are enough arguments supplied
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
#check if the same number of samples are being input.
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
#check if order of samples match up.
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arranged to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " samples were removed from the ASV table")
message("A total of ", grouping_num-length(rownames(groupings)), " samples were removed from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
OTU <- phyloseq::otu_table(ASV_table, taxa_are_rows = T)
sampledata <- phyloseq::sample_data(groupings, errorIfNULL = T)
phylo <- phyloseq::merge_phyloseq(OTU, sampledata)
test <- phyloseq_to_edgeR(physeq = phylo, group=colnames(groupings)[1])
et = exactTest(test)
tt = topTags(et, n=nrow(test$table), adjust.method="fdr", sort.by="PValue")
res <- tt@.Data[[1]]
write.table(res, file=args[3], quote=F, sep="\t", col.names = NA)
Limma-Voom-TMM
Limma-Voom-TMM (TMM = Trimmed Mean of M-values) is a differential abundance workflow that combines limma, voom and TMM normalization to analyze count data from high-throughput sequencing (e.g. RNA-seq) and, here, microbiome profiles. Its basic steps are:
- Preprocessing: raw counts are assembled and quality-controlled before analysis.
- TMM normalization: TMM computes one scaling factor per sample from the trimmed mean of the log fold changes (M values) relative to a reference sample, after trimming away features with extreme log-ratios or extreme average expression; this corrects for library-size and composition differences between samples.
- voom transformation: voom converts the counts to log2-CPM and estimates the mean-variance relationship of the data, from which it derives a precision weight for every observation so that the data can be analyzed with limma's linear models.
- Linear models: limma fits a linear model to each feature; the model can include several covariates and batch effects, allowing group differences to be assessed.
- Empirical Bayes moderation: limma uses empirical Bayes statistics to moderate the per-feature variance estimates, improving the sensitivity of the tests.
- Statistical testing: after model fitting, differences are assessed with moderated t-statistics (and moderated F-statistics for multi-level factors).
- Multiple testing correction: because many features are tested simultaneously, corrections such as the Benjamini-Hochberg method are used to control the false discovery rate (FDR).
- Output: the workflow reports p-values, adjusted p-values, log fold changes (LFC) and more, which help identify differential genes or taxa.
deps = c("edgeR")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(dep)
}
library(dep, character.only = TRUE)
}
args <- commandArgs(trailingOnly = TRUE)
#test if there are enough arguments supplied
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
#check if the same number of samples are being input.
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
#check if order of samples match up.
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arranged to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " samples were removed from the ASV table")
message("A total of ", grouping_num-length(rownames(groupings)), " samples were removed from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
DGE_LIST <- DGEList(ASV_table)
### do normalization
### Reference sample will be the sample with the highest read depth
### check if upper quartile method works for selecting reference
Upper_Quartile_norm_test <- calcNormFactors(DGE_LIST, method="upperquartile")
summary_upper_quartile <- summary(Upper_Quartile_norm_test$samples$norm.factors)[3]
if(is.na(summary_upper_quartile) | is.infinite(summary_upper_quartile)){
message("Upper Quartile reference selection failed; will use the sample with the largest sqrt(read depth) as the reference")
Ref_col <- which.max(colSums(sqrt(ASV_table)))
DGE_LIST_Norm <- calcNormFactors(DGE_LIST, method = "TMM", refColumn = Ref_col)
fileConn<-file(args[[4]])
writeLines(c("Used max square root read depth to determine reference sample"), fileConn)
close(fileConn)
}else{
DGE_LIST_Norm <- calcNormFactors(DGE_LIST, method="TMM")
}
## make matrix for testing
colnames(groupings) <- c("comparison")
mm <- model.matrix(~comparison, groupings)
voomvoom <- voom(DGE_LIST_Norm, mm, plot=F)
fit <- lmFit(voomvoom,mm)
fit <- eBayes(fit)
res <- topTable(fit, coef=2, n=nrow(DGE_LIST_Norm), sort.by="none")
write.table(res, file=args[3], quote=F, sep="\t", col.names = NA)
Limma-Voom-TMMwsp
Limma-Voom-TMMwsp (TMMwsp = TMM with singleton pairing) is the same limma-voom workflow as above, but with a TMM variant designed for very sparse data. Its basic steps are:
- Preprocessing: raw counts are quality-controlled and assembled before analysis.
- TMMwsp normalization: TMMwsp corrects for sequencing-depth differences like TMM, but handles zero-heavy samples by pairing up positive singleton counts ("singleton pairing") when computing the scaling factors, which reduces the bias that sparsity introduces into the normalization.
- voom transformation: voom converts the counts to log2-CPM and estimates the mean-variance relationship, producing a precision weight for every observation so that limma's linear models can be applied.
- Linear models: limma fits a linear model to each gene or OTU; the model can include several covariates and batch effects, allowing group differences to be assessed.
- Empirical Bayes moderation: empirical Bayes statistics are used to moderate the per-feature variance estimates, improving sensitivity.
- Statistical testing: after model fitting, differences are assessed with moderated t-statistics (and moderated F-statistics for multi-level factors).
- Multiple testing correction: corrections such as the Benjamini-Hochberg method control the false discovery rate (FDR) across the many features tested.
- Output: the workflow reports p-values, adjusted p-values, log fold changes (LFC) and more, which help identify differential genes or taxa.
deps = c("edgeR")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(dep)
}
library(dep, character.only = TRUE)
}
args <- commandArgs(trailingOnly = TRUE)
#test if there are enough arguments supplied
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
#check if the same number of samples are being input.
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
#check if order of samples match up.
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arranged to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " samples were removed from the ASV table")
message("A total of ", grouping_num-length(rownames(groupings)), " samples were removed from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
DGE_LIST <- DGEList(ASV_table)
### do normalization
### Reference sample will be the sample with the highest read depth
### check if upper quartile method works for selecting reference
Upper_Quartile_norm_test <- calcNormFactors(DGE_LIST, method="upperquartile")
summary_upper_quartile <- summary(Upper_Quartile_norm_test$samples$norm.factors)[3]
if(is.na(summary_upper_quartile) | is.infinite(summary_upper_quartile)){
message("Upper Quartile reference selection failed; will use the sample with the largest sqrt(read depth) as the reference")
Ref_col <- which.max(colSums(sqrt(ASV_table)))
DGE_LIST_Norm <- calcNormFactors(DGE_LIST, method = "TMMwsp", refColumn = Ref_col)
fileConn<-file(args[[4]])
writeLines(c("Used max square root read depth to determine reference sample"), fileConn)
close(fileConn)
}else{
DGE_LIST_Norm <- calcNormFactors(DGE_LIST, method="TMMwsp")
}
## make matrix for testing
colnames(groupings) <- c("comparison")
mm <- model.matrix(~comparison, groupings)
voomvoom <- voom(DGE_LIST_Norm, mm, plot=F)
fit <- lmFit(voomvoom,mm)
fit <- eBayes(fit)
res <- topTable(fit, coef=2, n=nrow(DGE_LIST_Norm), sort.by="none")
write.table(res, file=args[3], quote=F, sep="\t", col.names = NA)
Maaslin2
MaAsLin2 is a statistical method for microbiome differential abundance and association analysis. It is designed for the complexities of microbiome data, including noise, sparsity (zero inflation), high dimensionality, strongly non-normal distributions, and measurements that are counts or compositional. Its main ideas are:
- Multivariable statistical framework: MaAsLin2 (Microbiome Multivariable Associations with Linear Models) uses generalized linear and mixed models, which accommodate a variety of modern epidemiological study designs, including cross-sectional and longitudinal studies.
- Preprocessing: MaAsLin2 provides preprocessing for missing values, unknown values and outliers in both the metadata and the microbial features.
- Normalization and transformation: microbial measurements can be normalized and transformed to account for variable coverage depth across samples (the script below uses the arcsine square-root, "AST", transform; see the toy sketch after this list).
- Feature standardization: features can optionally be standardized, and models can use a subset or the full set of metadata.
- Multivariable models: MaAsLin2 fits one of several possible multivariable models to obtain a p-value for every feature-metadata association.
- Repeated measures: when repeated measures are present, MaAsLin2 models the within-subject (or within-environment) correlation in a mixed-model framework, using subject-level random effects to account for between-individual variation.
- Association testing: MaAsLin2 identifies associations between individual features and metadata variables, supports feature-wise covariate adjustment, and improves interpretability for translational and basic-biology applications.
- Multiple testing correction: MaAsLin2 also corrects for multiple testing to control the type I error rate.
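A toy sketch of what the settings used below amount to for a single feature (the relative abundances and group labels are made up; MaAsLin2 itself handles normalization, all features and the corrections):
# Arcsine square-root (AST) transform of one feature's relative abundance, then a linear model on the group
rel_ab <- c(0.01, 0.03, 0.02, 0.10, 0.12, 0.08)  # hypothetical relative abundances of one ASV
grp    <- factor(rep(c("ctrl", "case"), each = 3))
ast    <- asin(sqrt(rel_ab))                     # AST transform
summary(lm(ast ~ grp))$coefficients              # per-feature association test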
if(!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if(!requireNamespace("Maaslin2", quietly = TRUE))
  BiocManager::install("Maaslin2")
args <- commandArgs(trailingOnly = TRUE)
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arranged to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " samples were removed from the ASV table")
message("A total of ", grouping_num-length(rownames(groupings)), " samples were removed from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
library(Maaslin2)
ASV_table <- data.frame(t(ASV_table), check.rows = F, check.names = F, stringsAsFactors = F)
fit_data <- Maaslin2(
ASV_table, groupings,args[3], transform = "AST",
fixed_effects = c(colnames(groupings[1])),
standardize = FALSE, plot_heatmap = F, plot_scatter = F)
metagenomeSeq
metagenomeSeq is a statistical method for analyzing microbiome sequencing data that helps identify taxa whose abundance differs between conditions. Its basic steps are:
- Preprocessing: raw counts are quality-controlled and assembled to ensure the data are reliable.
- Normalization: metagenomeSeq uses cumulative sum scaling (CSS) to correct for differences in sequencing depth between samples, so that samples can be compared fairly.
- Statistical model: the data are modelled with distributions suited to sparse count data, in particular zero-inflated Gaussian / log-normal models, which handle the discreteness and zero inflation of microbiome profiles.
- Differential abundance analysis: features (OTUs, species, etc.) whose abundance differs between two or more groups are identified by comparing their normalized abundances.
- Multiple testing correction: because many features are tested simultaneously, corrections such as the Benjamini-Hochberg method are used to control the false discovery rate (FDR).
- Output: metagenomeSeq reports p-values, adjusted p-values, log fold changes (LFC) and more, which help identify and interpret the biologically meaningful differences.
deps = c("metagenomeSeq")
for (dep in deps){
if (dep %in% installed.packages()[,"Package"] == FALSE){
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(dep)
}
library(dep, character.only = TRUE)
}
args <- commandArgs(trailingOnly = TRUE)
#test if there are enough arguments supplied
if (length(args) <= 2) {
stop("At least three arguments must be supplied", call.=FALSE)
}
con <- file(args[1])
file_1_line1 <- readLines(con,n=1)
close(con)
if(grepl("Constructed from biom file", file_1_line1)){
ASV_table <- read.table(args[1], sep="\t", skip=1, header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}else{
ASV_table <- read.table(args[1], sep="\t", header=T, row.names = 1,
comment.char = "", quote="", check.names = F)
}
groupings <- read.table(args[2], sep="\t", row.names = 1, header=T, comment.char = "", quote="", check.names = F)
#number of samples
sample_num <- length(colnames(ASV_table))
grouping_num <- length(rownames(groupings))
#check if the same number of samples are being input.
if(sample_num != grouping_num){
message("The number of samples in the ASV table and the groupings table are unequal")
message("Will remove any samples that are not found in either the ASV table or the groupings table")
}
#check if order of samples match up.
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings and ASV table are in the same order")
}else{
rows_to_keep <- intersect(colnames(ASV_table), rownames(groupings))
groupings <- groupings[rows_to_keep,,drop=F]
ASV_table <- ASV_table[,rows_to_keep]
if(identical(colnames(ASV_table), rownames(groupings))==T){
message("Groupings table was re-arranged to be in the same order as the ASV table")
message("A total of ", sample_num-length(colnames(ASV_table)), " samples were removed from the ASV table")
message("A total of ", grouping_num-length(rownames(groupings)), " samples were removed from the groupings table")
}else{
stop("Unable to match samples between the ASV table and groupings table")
}
}
data_list <- list()
data_list[["counts"]] <- ASV_table
data_list[["taxa"]] <- rownames(ASV_table)
pheno <- AnnotatedDataFrame(groupings)
pheno
counts <- AnnotatedDataFrame(ASV_table)
feature_data <- data.frame("ASV"=rownames(ASV_table),
"ASV2"=rownames(ASV_table))
feature_data <- AnnotatedDataFrame(feature_data)
rownames(feature_data) <- feature_data@data$ASV
test_obj <- newMRexperiment(counts = data_list$counts, phenoData = pheno, featureData = feature_data)
p <- cumNormStat(test_obj, pFlag = T)
p
test_obj_norm <- cumNorm(test_obj, p=p)
mod_formula <- as.formula(paste("~", colnames(groupings)[1]))
pd <- pData(test_obj_norm)
mod <- model.matrix(mod_formula, data=pd)
regres <- fitFeatureModel(test_obj_norm, mod)
res_table <- MRfulltable(regres, number = length(rownames(ASV_table)))
write.table(res_table, file=args[3], quote=F, sep="\t", col.names = NA)
For more content, see the post "識別差異微生物的方法匯總" (a round-up of methods for identifying differentially abundant microbes).
Summary
The study confirmed that the differential abundance methods above identify markedly different numbers and sets of significant ASVs (amplicon sequence variants), and that the results depend on data preprocessing. For many methods, the number of identified features correlated with aspects of the data such as sample size, sequencing depth and the effect size of the between-community difference. ALDEx2 and ANCOM-II produced the most consistent results across studies and agreed best with the intersection of results from the different methods.