原文鏈接Chapter 13 Integrating Datasets
1、目的
- 在實(shí)際試驗(yàn)中抖甘,不可避免的可能generate data across multiple batches热鞍;
- 正如天下沒有兩片完全的葉子一樣,分批實(shí)驗(yàn)總會(huì)對(duì)數(shù)據(jù)產(chǎn)生不可避免且不同的影響衔彻。different batches is often subject to uncontrollable differences
-Hence薇宠,computational correction of these effects is critical for eliminating batch-to-batch variation
2、準(zhǔn)備實(shí)驗(yàn)示例數(shù)據(jù)
使用TENxPBMCData()
包提供的多種 10X PBMC data中的兩批做示范
BiocManager::install("TENxPBMCData")
library(TENxPBMCData)
all.sce <- list(
pbmc3k=TENxPBMCData('pbmc3k'),
pbmc4k=TENxPBMCData('pbmc4k'))
colnames(rowData(all.sce$pbmc3k))
#將gene的不同ID儲(chǔ)存到rowData是一種值得借鑒的方法
head((rowData(all.sce$pbmc3k)))
table(is.na(rowData(all.sce$pbmc3k)$Symbol))
#也有1w+個(gè)symbol為NA值
- Separate processing prior to the batch correction step is more convenient, scalable and (on occasion) more reliable.
- 因此分別對(duì)兩批數(shù)據(jù)進(jìn)行前期的處理
注意下面流程中的R語言技巧值得借鑒學(xué)習(xí)艰额,以及復(fù)習(xí)下截止目前學(xué)習(xí)的pipelines
(1)質(zhì)控
以線粒體基因?yàn)槭纠?/p>
library(scater)
stats <- high.mito <- list()
for (n in names(all.sce)) {
current <- all.sce[[n]]
is.mito <- grep("MT", rowData(current)$Symbol_TENx)
stats[[n]] <- perCellQCMetrics(current, subsets=list(Mito=is.mito))
high.mito[[n]] <- isOutlier(stats[[n]]$subsets_Mito_percent, type="higher")
all.sce[[n]] <- current[,!high.mito[[n]]]
}
(2)標(biāo)準(zhǔn)化
all.sce <- lapply(all.sce, logNormCounts)
(3)高變基因
library(scran)
all.dec <- lapply(all.sce, modelGeneVar)
all.hvgs <- lapply(all.dec, getTopHVGs, prop=0.1)
(4)降維
library(BiocSingular)
set.seed(10000)
all.sce <- mapply(FUN=runPCA, x=all.sce, subset_row=all.hvgs,
MoreArgs=list(ncomponents=25, BSPARAM=RandomParam()),
SIMPLIFY=FALSE)
set.seed(100000)
all.sce <- lapply(all.sce, runTSNE, dimred="PCA")
set.seed(1000000)
all.sce <- lapply(all.sce, runUMAP, dimred="PCA")
(5)聚類
for (n in names(all.sce)) {
g <- buildSNNGraph(all.sce[[n]], k=10, use.dimred='PCA')
clust <- igraph::cluster_walktrap(g)$membership
all.sce[[n]]$clust <- factor(clust)
}
(6)整理
pbmc3k <- all.sce$pbmc3k
dec3k <- all.dec$pbmc3k
pbmc3k
pbmc4k <- all.sce$pbmc4k
dec4k <- all.dec$pbmc4k
pbmc4k
#挑選兩個(gè)批次相同的基因
universe <- intersect(rownames(pbmc3k), rownames(pbmc4k))
length(universe)
pbmc3k <- all.sce$pbmc3k
dec3k <- all.dec$pbmc3k
pbmc3k
pbmc4k <- all.sce$pbmc4k
dec4k <- all.dec$pbmc4k
pbmc4k
(7)消除測(cè)序深度差異
- rescale each batch to adjust for differences in sequencing depth between batches(adjusting the size factors)
- 可以認(rèn)為先去除批次效應(yīng)的一個(gè)方面
library(batchelor)
rescaled <- multiBatchNorm(pbmc3k, pbmc4k)
pbmc3k <- rescaled[[1]]
pbmc4k <- rescaled[[2]]
(8)新的hvg
library(scran)
combined.dec <- combineVar(dec3k, dec4k)
chosen.hvgs <- combined.dec$bio > 0
sum(chosen.hvgs)
最終得到的數(shù)據(jù)是
pbmc3k
澄港、pbmc4k
兩個(gè)processed and batch-normalized sce 以及chosen.hvgs
。供下面的批次效應(yīng)流程處理柄沮。
3慢睡、檢查是否存在批次效應(yīng)
- 簡(jiǎn)單思路:直接合并兩個(gè)sce,進(jìn)行整體聚類铡溪;再與原始sce的分類結(jié)果比較觀察漂辐。
- 理想的情況是整體數(shù)據(jù)的細(xì)胞分類平均分到每個(gè)batch中,而不是單一棕硫、異常的分布髓涯。
(1)合并sce對(duì)象
rowData(pbmc3k) <- rowData(pbmc4k) #使用4k的rowData
#因?yàn)閮蓚€(gè)原始sce對(duì)象的rowData(feature annotation)不完全一致(無法合并)
pbmc3k$batch <- "3k" #增加批次colData annotation
pbmc4k$batch <- "4k"
#直接合并兩個(gè)sce對(duì)象(未校正)
uncorrected <- cbind(pbmc3k, pbmc4k)
#基于直接合并的sce進(jìn)行聚類
# Using RandomParam() as it is more efficient for file-backed matrices.
library(scater)
set.seed(0010101010)
uncorrected <- runPCA(uncorrected, subset_row=chosen.hvgs,
BSPARAM=BiocSingular::RandomParam())
library(scran)
snn.gr <- buildSNNGraph(uncorrected, use.dimred="PCA")
clusters <- igraph::cluster_walktrap(snn.gr)$membership
(2)觀察直接合并的聚類與原始sce的聚類有無差異
- Ideally, each cluster should ideally consist of cells from both batches.
tab <- table(Cluster=clusters, Batch=uncorrected$batch)
tab
- 但是從上面結(jié)果來看, cells of the same type are artificially separated due to technical differences between batches.
- 可視化看一看哈扮,的確批次效應(yīng)造成的分離影響很大纬纪,需要去除該效應(yīng)
set.seed(1111001)
uncorrected <- runTSNE(uncorrected, dimred="PCA")
plotTSNE(uncorrected, colour_by="batch")
4蚓再、去除方法
法一:Linear regression
(1)前提假設(shè)
- assume that the composition of cell subpopulations is the same across batches.
- assume that the batch effect is additive.
(2)實(shí)現(xiàn)方法
- batchelor包的
rescaleBatches()
函數(shù) - roughly equivalent to applying a linear regression to the log-expression values per gene.
set.seed(1010101010) # To ensure reproducibility of IRLBA.
#降維
rescaled <- runPCA(rescaled, subset_row=chosen.hvgs, exprs_values="corrected")
#聚類
snn.gr <- buildSNNGraph(rescaled, use.dimred="PCA")
clusters.resc <- igraph::cluster_walktrap(snn.gr)$membership
#聚類結(jié)果在不同batch的分布情況
tab.resc <- table(Cluster=clusters.resc, Batch=rescaled$batch)
tab.resc
如下圖結(jié)果,可以看到相當(dāng)部分的批次效應(yīng)導(dǎo)致的分類偏離已經(jīng)消除了
-
However, at least one batch-specific cluster is still present, indicating that the correction is not entirely complete.(batch effect 未完全消除)
再tsne可視化看一看
rescaled <- runTSNE(rescaled, dimred="PCA")
rescaled$batch <- factor(rescaled$batch)
plotTSNE(rescaled, colour_by="batch")
法二:MNN(Mutual nearest neighbors)【推薦】
- Mutual nearest neighbors are pairs of cells from different batches that belong in each other’s set of nearest neighbors.
- 簡(jiǎn)單來說尋找兩個(gè)批次中理想情況下不受影響包各,互相距離最近的點(diǎn)(pairs)摘仅。利用這些pairs將兩個(gè)batch“拉到一起”
- batchelor包的
fastMNN()
函數(shù)
set.seed(1000101001)
mnn.out <- fastMNN(pbmc3k, pbmc4k, d=50, k=20, subset.row=chosen.hvgs,
BSPARAM=BiocSingular::RandomParam(deferred=TRUE))
mnn.out
assay slot 的 reconstructed 矩陣為兩個(gè)batch 的調(diào)整后的融合矩陣;
reducedDim slot 的 corrected 是融合矩陣的主成分信息问畅,即不需要再進(jìn)行降維處理娃属。
關(guān)于fastMNN()
函數(shù)的參數(shù)--
-
d=
:To reduce computational work and technical noise, all cells in all batches are projected into the low-dimensional space defined by the top d principal components. -
k=
:specifies the number of nearest neighbors to consider when defining MNN pairs. -
subset.row=
一般僅挑選矩陣的hvg代表基因
library(scran)
#聚類
snn.gr <- buildSNNGraph(mnn.out, use.dimred="corrected")
clusters.mnn <- igraph::cluster_walktrap(snn.gr)$membership
tab.mnn <- table(Cluster=clusters.mnn, Batch=mnn.out$batch)
tab.mnn
如下圖結(jié)果
- We see that all clusters contain contributions from each batch after correction, consistent with our expectation that the two batches are replicates of each other.
library(scater)
set.seed(0010101010)
mnn.out <- runTSNE(mnn.out, dimred="corrected")
mnn.out$batch <- factor(mnn.out$batch)
plotTSNE(mnn.out, colour_by="batch")
-
與第一種方法相比,兩個(gè)batch也的確更加緊密的結(jié)合在一起了
以上是第十三章Integrating Datasets部分的簡(jiǎn)單流程筆記护姆,主要學(xué)習(xí)了linear regression與MNN兩種整合數(shù)據(jù)并去除批次效應(yīng)的兩種方法矾端。但整合的同時(shí)也要考慮是否去除了數(shù)據(jù)本身的 biological difference,因此可進(jìn)一步對(duì)整合數(shù)據(jù)進(jìn)行診斷卵皂,這里暫時(shí)不做記錄了秩铆,詳見Chapter 13 Integrating Datasets
本系列筆記基于OSCA全流程的大致流程梳理,詳細(xì)原理可參考原文灯变。如有錯(cuò)誤殴玛,懇請(qǐng)指正!
此外還有劉小澤老師整理的全文翻譯筆記添祸,詳見目錄族阅。