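Recently I have been working through the bioinformatics course videos from the CSC-IT Center for Science in Finland (https://www.csc.fi/web/training/-/scrnaseq). The course site also provides exercise material and detailed code, so today I will practice the single-cell data integration workflow by following the official code: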
https://github.com/NBISweden/excelerate-scRNAseq/blob/master/session-integration/Data_Integration.md
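The exercise uses two methods to integrate several single-cell sequencing datasets, then removes batch effects and quantitatively assesses the quality of the integrated data. The datasets come from CelSeq (GSE81076), CelSeq2 (GSE85241), Fluidigm C1 (GSE86469), and SMART-Seq2 (E-MTAB-5061). The raw matrix and its metadata can be downloaded via the links on the exercise page. (Note that the matrix uploaded by the authors is already combined, but batch effects have not been removed. The code below splits this matrix into 4 datasets and then integrates them again.)
Before starting, load the R packages: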
> library("Seurat")
> library("ggplot2")
> library("cowplot")
> library("scater")
> library("scran")
> library("BiocParallel")
> library("BiocNeighbors")
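(1) Data integration and batch-effect correction with Seurat (anchors and CCA)
Load the expression matrix and the metadata. The metadata contains one column for the sequencing platform and one column for the cell-type annotation (used below as tech and celltype):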
> pancreas.data <- readRDS(file = "pancreas_expression_matrix.rds")
> metadata <- readRDS(file = "pancreas_metadata.rds")
Take a look at the metadata (for example with head(metadata)).
Create the Seurat object:
> pancreas <- CreateSeuratObject(pancreas.data, meta.data = metadata)
Before doing any batch-effect correction, always inspect the dataset first. We run the standard preprocessing (log-normalization), identify variable features ("vst"), scale the combined data, run PCA, and visualize the result to see how the combined cells group:
# Normalize and find variable features
> pancreas <- NormalizeData(pancreas, verbose = FALSE)
> pancreas <- FindVariableFeatures(pancreas, selection.method = "vst", nfeatures = 2000, verbose = FALSE)
# Run the standard workflow (visualization and clustering)
> pancreas <- ScaleData(pancreas, verbose = FALSE)
> pancreas <- RunPCA(pancreas, npcs = 30, verbose = FALSE)
> pancreas <- RunUMAP(pancreas, reduction = "pca", dims = 1:30)
> p1 <- DimPlot(pancreas, reduction = "umap", group.by = "tech")
> p2 <- DimPlot(pancreas, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) +
NoLegend()
> plot_grid(p1, p2)
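As a rough illustration of the log-normalization step used above, the following base-R sketch scales each cell to a common total and applies log(1 + x). It is a minimal sketch, assuming a scale factor of 10000; the toy matrix and the variable names are made up for illustration and are not part of the exercise data.

```r
# Toy count matrix: 3 genes (rows) x 2 cells (columns); values are made up.
counts <- matrix(c(1, 0, 9,
                   5, 5, 0),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2")))

# Assumed scale factor for this sketch.
scale_factor <- 10000

# Total counts per cell (library size).
lib_size <- colSums(counts)

# Divide each column by its library size, rescale, then take log(1 + x).
norm <- log1p(sweep(counts, 2, lib_size, "/") * scale_factor)

print(round(norm, 3))
```

Zero counts stay at zero after the transformation, and back-transforming with expm1() gives columns that all sum to the scale factor.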
Next, the authors split the combined data into a list of 4 datasets, one element per dataset. Each dataset then goes through the standard preprocessing (log-normalization) and variable-feature identification ("vst"):
> pancreas.list <- SplitObject(pancreas, split.by = "tech")
> for (i in 1:length(pancreas.list)) {
pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000,
verbose = FALSE)
}
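Conceptually, splitting by technology just groups the columns (cells) of the matrix by a metadata column. A minimal base-R sketch on a toy matrix follows; the objects expr, meta, and expr_list are hypothetical and only mimic what the split does to the cell dimension.

```r
# Toy expression matrix: 2 genes x 6 cells (values are made up).
expr <- matrix(1:12, nrow = 2,
               dimnames = list(c("g1", "g2"), paste0("c", 1:6)))

# Toy metadata with one row per cell and a 'tech' column.
meta <- data.frame(
  tech = c("celseq", "celseq2", "celseq", "smartseq2", "celseq2", "celseq"),
  row.names = colnames(expr),
  stringsAsFactors = FALSE
)

# Group the cell names by technology, then subset the matrix per group.
cells_by_tech <- split(colnames(expr), meta$tech)
expr_list <- lapply(cells_by_tech, function(cells) expr[, cells, drop = FALSE])

# Number of cells per technology.
print(sapply(expr_list, ncol))
```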
Integrating the 4 pancreatic islet cell datasets
Use the FindIntegrationAnchors function to identify anchors, with the list of Seurat objects as input:
> reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2", "fluidigmc1")]
> pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
Computing 2000 integration features
Scaling features for provided objects
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
Finding all pairwise anchors
| | 0 % ~calculating
Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 3499 anchors
Filtering anchors
Retained 2821 anchors
Extracting within-dataset neighbors
|+++++++++ | 17% ~01m 01s
Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 3515 anchors
Filtering anchors
Retained 2701 anchors
Extracting within-dataset neighbors
|+++++++++++++++++ | 33% ~49s
Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 6173 anchors
Filtering anchors
Retained 4634 anchors
Extracting within-dataset neighbors
|+++++++++++++++++++++++++ | 50% ~50s
Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 2176 anchors
Filtering anchors
Retained 1841 anchors
Extracting within-dataset neighbors
|++++++++++++++++++++++++++++++++++ | 67% ~27s
Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 2774 anchors
Filtering anchors
Retained 2478 anchors
Extracting within-dataset neighbors
|++++++++++++++++++++++++++++++++++++++++++ | 83% ~12s
Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 2723 anchors
Filtering anchors
Retained 2410 anchors
Extracting within-dataset neighbors
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01m 10s
Then pass these anchors to the IntegrateData function, which returns a Seurat object:
> pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
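After running IntegrateData, the Seurat object contains a new Assay that holds the integrated (or "batch-corrected") expression matrix. Note that the original, uncorrected values are still stored in the RNA Assay of the Seurat object, so you can switch back and forth.
We can then use this new integrated matrix for downstream analysis and visualization. Here we scale the integrated data, run PCA, and visualize the result with UMAP. The integrated datasets now cluster by cell type rather than by technology.
# Switch to the integrated assay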
> DefaultAssay(pancreas.integrated) <- "integrated"
Run the standard workflow (visualization and clustering):
> pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
> pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
> pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
> p3 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
> p4 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) +
NoLegend()
> plot_grid(p3, p4)
(2) Data integration with the Mutual Nearest Neighbor (MNN) method
You can create a SingleCellExperiment (SCE) object from a count matrix, or convert a Seurat object into an SCE object:
> celseq.data <- as.SingleCellExperiment(pancreas.list$celseq)
> celseq2.data <- as.SingleCellExperiment(pancreas.list$celseq2)
> fluidigmc1.data <- as.SingleCellExperiment(pancreas.list$fluidigmc1)
> smartseq2.data <- as.SingleCellExperiment(pancreas.list$smartseq2)
Find the genes shared by all datasets and reduce each dataset to those shared genes:
> keep_genes <- Reduce(intersect, list(rownames(celseq.data),rownames(celseq2.data),
+ rownames(fluidigmc1.data),rownames(smartseq2.data)))
> celseq.data <- celseq.data[match(keep_genes, rownames(celseq.data)), ]
> celseq2.data <- celseq2.data[match(keep_genes, rownames(celseq2.data)), ]
> fluidigmc1.data <- fluidigmc1.data[match(keep_genes, rownames(fluidigmc1.data)), ]
> smartseq2.data <- smartseq2.data[match(keep_genes, rownames(smartseq2.data)), ]
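The Reduce(intersect, ...) and match() pattern above can be tried on toy data. This is a minimal base-R sketch; the gene vectors and mat_b are made-up examples, not the exercise objects.

```r
# Toy gene lists for three hypothetical datasets.
genes_a <- c("INS", "GCG", "SST", "PPY")
genes_b <- c("GCG", "INS", "KRT19")
genes_c <- c("SST", "INS", "GCG")

# Fold intersect() over the list: keeps only genes present in every dataset,
# in the order in which they appear in the first vector.
keep <- Reduce(intersect, list(genes_a, genes_b, genes_c))
print(keep)

# Toy matrix for dataset b; rows follow the order of genes_b.
mat_b <- matrix(1:6, nrow = 3, dimnames = list(genes_b, c("c1", "c2")))

# match() returns the row positions of the shared genes, so every dataset
# ends up with the same genes in the same order.
mat_b_sub <- mat_b[match(keep, rownames(mat_b)), , drop = FALSE]
print(mat_b_sub)
```

Having identical row order in every dataset is what later makes a plain cbind() of the matrices safe.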
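Next, use calculateQCMetrics() to compute quality-control metrics, and flag low-quality cells as outliers with an unusually low total count or an unusually low number of detected genes. (calculateQCMetrics() belongs to older scater releases; newer releases replace it with perCellQCMetrics() and addPerCellQC().)
# Process celseq.data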
> celseq.data <- calculateQCMetrics(celseq.data)
> low_lib_celseq.data <- isOutlier(celseq.data$log10_total_counts, type="lower", nmad=3)
> low_genes_celseq.data <- isOutlier(celseq.data$log10_total_features_by_counts, type="lower", nmad=3)
> celseq.data <- celseq.data[, !(low_lib_celseq.data | low_genes_celseq.data)]
# Process celseq2.data
> celseq2.data <- calculateQCMetrics(celseq2.data)
> low_lib_celseq2.data <- isOutlier(celseq2.data$log10_total_counts, type="lower", nmad=3)
> low_genes_celseq2.data <- isOutlier(celseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
> celseq2.data <- celseq2.data[, !(low_lib_celseq2.data | low_genes_celseq2.data)]
# Process fluidigmc1.data
> fluidigmc1.data <- calculateQCMetrics(fluidigmc1.data)
> low_lib_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_counts, type="lower", nmad=3)
> low_genes_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_features_by_counts, type="lower", nmad=3)
> fluidigmc1.data <- fluidigmc1.data[, !(low_lib_fluidigmc1.data | low_genes_fluidigmc1.data)]
# Process smartseq2.data
> smartseq2.data <- calculateQCMetrics(smartseq2.data)
> low_lib_smartseq2.data <- isOutlier(smartseq2.data$log10_total_counts, type="lower", nmad=3)
> low_genes_smartseq2.data <- isOutlier(smartseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
> smartseq2.data <- smartseq2.data[, !(low_lib_smartseq2.data | low_genes_smartseq2.data)]
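The outlier rule used above (type = "lower", nmad = 3) flags cells that lie more than 3 median absolute deviations below the median. A minimal base-R sketch of that rule on made-up values follows; log_total and the other names are hypothetical, and the sketch is not the scater implementation itself.

```r
# Made-up log10 total counts for six cells; the last one is suspiciously low.
log_total <- c(4.0, 4.1, 3.9, 4.2, 4.0, 2.0)

nmads <- 3

# Robust center and spread: median and MAD (stats::mad uses the usual
# 1.4826 consistency constant by default).
center <- median(log_total)
spread <- mad(log_total)

# Lower threshold: anything below it is treated as an outlier.
lower_limit <- center - nmads * spread
is_low <- log_total < lower_limit

print(lower_limit)
print(which(is_low))
```

Because median and MAD are robust, the single extreme cell barely moves the threshold, which is why this rule is preferred over mean and standard deviation for QC filtering.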
Then compute size factors with computeSumFactors() from the scran package and normalize the data with normalize() (in newer scater releases, logNormCounts() takes the place of normalize()):
# Compute sizefactors
> celseq.data <- computeSumFactors(celseq.data)
> celseq2.data <- computeSumFactors(celseq2.data)
> fluidigmc1.data <- computeSumFactors(fluidigmc1.data)
> smartseq2.data <- computeSumFactors(smartseq2.data)
# Normalize
> celseq.data <- normalize(celseq.data)
> celseq2.data <- normalize(celseq2.data)
> fluidigmc1.data <- normalize(fluidigmc1.data)
> smartseq2.data <- normalize(smartseq2.data)
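To make the role of size factors concrete, here is a minimal base-R sketch of size-factor normalization followed by a log2 transform with a pseudocount of 1. It uses simple library-size factors as a stand-in, which is an assumption of this sketch: computeSumFactors() estimates its factors by pooling and deconvolution, not by plain library size.

```r
# Toy counts: 3 genes x 2 cells (values are made up).
counts <- matrix(c(10, 0, 30,
                   20, 20, 40),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("cell1", "cell2")))

# Stand-in size factors: library size scaled to have mean 1 across cells.
lib_size <- colSums(counts)
size_factors <- lib_size / mean(lib_size)

# Divide each cell by its size factor, then log2-transform with pseudocount 1.
logcounts <- log2(sweep(counts, 2, size_factors, "/") + 1)

print(size_factors)
print(logcounts)
```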
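Feature (gene) selection: use trendVar() and decomposeVar() to compute the variance of each gene and decompose it into a technical and a biological component (newer scran releases provide modelGeneVar() for this step):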
# celseq.data
> fit_celseq.data <- trendVar(celseq.data, use.spikes=FALSE)
> dec_celseq.data <- decomposeVar(celseq.data, fit_celseq.data)
> dec_celseq.data$Symbol_TENx <- rowData(celseq.data)$Symbol_TENx
> dec_celseq.data <- dec_celseq.data[order(dec_celseq.data$bio, decreasing = TRUE), ]
# celseq2.data
> fit_celseq2.data <- trendVar(celseq2.data, use.spikes=FALSE)
> dec_celseq2.data <- decomposeVar(celseq2.data, fit_celseq2.data)
> dec_celseq2.data$Symbol_TENx <- rowData(celseq2.data)$Symbol_TENx
> dec_celseq2.data <- dec_celseq2.data[order(dec_celseq2.data$bio, decreasing = TRUE), ]
# fluidigmc1.data
> fit_fluidigmc1.data <- trendVar(fluidigmc1.data, use.spikes=FALSE)
> dec_fluidigmc1.data <- decomposeVar(fluidigmc1.data, fit_fluidigmc1.data)
> dec_fluidigmc1.data$Symbol_TENx <- rowData(fluidigmc1.data)$Symbol_TENx
> dec_fluidigmc1.data <- dec_fluidigmc1.data[order(dec_fluidigmc1.data$bio, decreasing = TRUE), ]
# smartseq2.data
> fit_smartseq2.data <- trendVar(smartseq2.data, use.spikes=FALSE)
> dec_smartseq2.data <- decomposeVar(smartseq2.data, fit_smartseq2.data)
> dec_smartseq2.data$Symbol_TENx <- rowData(smartseq2.data)$Symbol_TENx
> dec_smartseq2.data <- dec_smartseq2.data[order(dec_smartseq2.data$bio, decreasing = TRUE), ]
# Select the most informative genes that are expressed in all datasets
> universe <- Reduce(intersect, list(rownames(dec_celseq.data),rownames(dec_celseq2.data),
rownames(dec_fluidigmc1.data),rownames(dec_smartseq2.data)))
> mean.bio <- (dec_celseq.data[universe,"bio"] + dec_celseq2.data[universe,"bio"] +
dec_fluidigmc1.data[universe,"bio"] + dec_smartseq2.data[universe,"bio"])/4
> hvg_genes <- universe[mean.bio > 0]
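The selection rule above keeps genes whose biological variance component, averaged over the datasets, is positive. A minimal base-R sketch with two made-up result tables shows the same logic; dec_a and dec_b are hypothetical stand-ins for the decomposeVar() outputs.

```r
# Made-up biological variance components for two hypothetical datasets.
dec_a <- data.frame(bio = c(0.5, -0.2, 0.1),
                    row.names = c("g1", "g2", "g3"))
dec_b <- data.frame(bio = c(0.3, 0.1, -0.4, 0.9),
                    row.names = c("g1", "g2", "g3", "g4"))

# Genes present in both tables.
universe <- intersect(rownames(dec_a), rownames(dec_b))

# Average the biological component across datasets, gene by gene.
mean_bio <- (dec_a[universe, "bio"] + dec_b[universe, "bio"]) / 2

# Keep genes with a positive average biological component.
hvg <- universe[mean_bio > 0]
print(hvg)
```

Note that g3 is dropped even though it is positive in one dataset, because its average is negative, and g4 is dropped because it is missing from one dataset.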
Combine the datasets into a single SingleCellExperiment:
# Combine the raw counts
> counts_pancreas <- cbind(counts(celseq.data), counts(celseq2.data),
counts(fluidigmc1.data), counts(smartseq2.data))
# Combine the log-normalized counts (normalized per batch; multiBatchNorm() is applied later)
> logcounts_pancreas <- cbind(logcounts(celseq.data), logcounts(celseq2.data),
logcounts(fluidigmc1.data), logcounts(smartseq2.data))
# Build the SCE object for the combined data
> sce <- SingleCellExperiment(
assays = list(counts = counts_pancreas, logcounts = logcounts_pancreas),
rowData = rowData(celseq.data), # all datasets share the same genes after subsetting to keep_genes
colData = rbind(colData(celseq.data), colData(celseq2.data),
colData(fluidigmc1.data), colData(smartseq2.data))
)
# Store the hvg_genes from above in the metadata slot of the sce object
> metadata(sce)$hvg_genes <- hvg_genes
Before correcting batch effects with MNN, take a look at the combined datasets:
> sce <- runPCA(sce,
ncomponents = 20,
feature_set = hvg_genes,
method = "irlba")
>
> names(reducedDims(sce)) <- "PCA_naive"
>
> p5 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "tech") +
ggtitle("PCA Without batch correction")
> p6 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "celltype") +
ggtitle("PCA Without batch correction")
> plot_grid(p5, p6)
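The PCA above is computed on the highly variable genes only. A minimal base-R sketch of that idea uses prcomp() on a random toy matrix; the matrix, the gene names, and the choice of 2 components are assumptions for illustration, and prcomp() is a stand-in for the runPCA() call, not its implementation.

```r
set.seed(42)

# Toy log-expression matrix: 5 genes x 20 cells of random values.
logcounts <- matrix(rnorm(5 * 20), nrow = 5,
                    dimnames = list(paste0("g", 1:5), paste0("c", 1:20)))

# Hypothetical set of highly variable genes.
hvg <- c("g1", "g3", "g4")

# prcomp() expects observations (cells) in rows, so transpose the
# gene-by-cell matrix after subsetting to the selected genes.
pca <- prcomp(t(logcounts[hvg, ]), center = TRUE, scale. = FALSE, rank. = 2)

# One row per cell, one column per principal component.
print(dim(pca$x))
```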
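Use the fastMNN() function to correct the batch effects. Before running fastMNN(), each batch has to be rescaled to adjust for differences in sequencing depth between batches. The multiBatchNorm() function from the scran package adjusts the size factors and recomputes the log-normalized expression values so that systematic differences between the SingleCellExperiment objects are accounted for. The earlier size factors only remove biases between cells within a single batch. Removing the technical differences between batches as well improves the quality of the correction. (In newer Bioconductor releases, fastMNN() and multiBatchNorm() are provided by the batchelor package.)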
> rescaled <- multiBatchNorm(celseq.data, celseq2.data, fluidigmc1.data, smartseq2.data)
> celseq.data_rescaled <- rescaled[[1]]
> celseq2.data_rescaled <- rescaled[[2]]
> fluidigmc1.data_rescaled <- rescaled[[3]]
> smartseq2.data_rescaled <- rescaled[[4]]
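The idea of rescaling batches to a common depth can be sketched in base R. This is a deliberately simplified illustration under stated assumptions: two toy batches where the second is sequenced exactly 4 times deeper, library-size factors within each batch, and a median ratio of average normalized counts as the between-batch scaling. It is not the multiBatchNorm() algorithm itself.

```r
# Toy batches: same biology, batch 2 sequenced 4 times deeper.
batch1 <- matrix(c(10, 20, 30,
                   20, 40, 60), nrow = 3)
batch2 <- batch1 * 4

# Within-batch size factors (library size, centered at 1 within each batch).
within_sf <- function(m) colSums(m) / mean(colSums(m))
sf1 <- within_sf(batch1)
sf2 <- within_sf(batch2)

# Average normalized count per gene in each batch.
ave1 <- rowMeans(sweep(batch1, 2, sf1, "/"))
ave2 <- rowMeans(sweep(batch2, 2, sf2, "/"))

# Relative coverage of each batch, scaled so the shallowest batch is 1.
coverage <- c(b1 = 1, b2 = median(ave2 / ave1))
rescale <- coverage / min(coverage)

# Inflate the size factors of deeper batches, then recompute log-values.
logc_b1 <- log2(sweep(batch1, 2, sf1 * rescale[["b1"]], "/") + 1)
logc_b2 <- log2(sweep(batch2, 2, sf2 * rescale[["b2"]], "/") + 1)

print(rescale)
```

In this toy case the two batches become identical after rescaling, which shows why a pure depth difference should be removed before looking for nearest neighbors across batches.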
Run fastMNN and store the reduced-dimension MNN representation in the reducedDims slot of the sce object:
> mnn_out <- fastMNN(celseq.data_rescaled,
celseq2.data_rescaled,
fluidigmc1.data_rescaled,
smartseq2.data_rescaled,
subset.row = metadata(sce)$hvg_genes,
k = 20, d = 50, approximate = TRUE,
# BPPARAM = BiocParallel::MulticoreParam(8),
BNPARAM = BiocNeighbors::AnnoyParam())
> reducedDim(sce, "MNN") <- mnn_out$corrected
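Note that fastMNN() does not produce a batch-corrected expression matrix. Its result should therefore only be treated as a reduced-dimension representation, suitable for direct plotting, t-SNE/UMAP, clustering, and trajectory analysis.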
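The core idea behind MNN correction is the mutual nearest neighbor pair: a cell in one batch and a cell in another batch that are each other's nearest neighbor. A minimal base-R sketch with k = 1 on toy 2-D coordinates follows; the coordinates and names are made up, and the real function works with k = 20 neighbors in a PCA space and then uses the pairs to estimate correction vectors, which this sketch does not do.

```r
# Toy coordinates for two batches (rows are cells, columns are dimensions).
batch1 <- matrix(c(0, 0,
                   5, 5,
                   10, 0), ncol = 2, byrow = TRUE,
                 dimnames = list(c("a1", "a2", "a3"), NULL))
batch2 <- matrix(c(0.5, 0,
                   5, 6), ncol = 2, byrow = TRUE,
                 dimnames = list(c("b1", "b2"), NULL))

# Euclidean distances between every batch-1 cell and every batch-2 cell.
d <- as.matrix(dist(rbind(batch1, batch2)))[rownames(batch1), rownames(batch2)]

# Nearest neighbor in the other batch, in both directions.
nn_1to2 <- apply(d, 1, which.min)
nn_2to1 <- apply(d, 2, which.min)

# A pair is mutual if following 1 -> 2 -> 1 returns to the starting cell.
mutual <- which(nn_2to1[nn_1to2] == seq_len(nrow(batch1)))

pairs <- data.frame(batch1 = rownames(batch1)[mutual],
                    batch2 = rownames(batch2)[nn_1to2[mutual]],
                    stringsAsFactors = FALSE)
print(pairs)
```

Cell a3 has a nearest neighbor in batch 2, but that neighbor prefers a2, so a3 forms no pair. This mutual requirement is what keeps cell types that exist in only one batch from being forced onto unrelated cells.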
Plot the data after batch correction:
> p7 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "tech") + ggtitle("MNN Output Reduced Dimensions")
> p8 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "celltype") + ggtitle("MNN Output Reduced Dimensions")
> plot_grid(p7, p8)