Seurat3引入了用于多個(gè)單細(xì)胞測(cè)序數(shù)據(jù)集進(jìn)行整合分析的新方法学歧。這些方法可以對(duì)來(lái)自不同的個(gè)體勾效、實(shí)驗(yàn)條件、測(cè)序技術(shù)甚至物種中收集來(lái)的數(shù)據(jù)進(jìn)行整合嵌器,旨在識(shí)別出不同數(shù)據(jù)集之間的共享細(xì)胞狀態(tài)(shared cell states)肛真。
這些方法首先識(shí)別出不同數(shù)據(jù)集對(duì)之間的“錨(anchors)”,這些anchors代表了個(gè)體細(xì)胞之間成對(duì)的對(duì)應(yīng)關(guān)系(每個(gè)數(shù)據(jù)集中有一個(gè))爽航,并假設(shè)它們?cè)醋韵嗤纳餇顟B(tài)蚓让。然后,再利用這些識(shí)別出的anchors用于協(xié)調(diào)不同的數(shù)據(jù)集讥珍,或者將信息從一個(gè)數(shù)據(jù)集傳輸?shù)搅硪粋€(gè)數(shù)據(jù)集历极。
一、標(biāo)準(zhǔn)工作流程進(jìn)行整合分析
在本例教程中衷佃,我們選擇了通過四種不同測(cè)序技術(shù)(CelSeq (GSE81076)趟卸、 CelSeq2 (GSE85241)、Fluidigm C1 (GSE86469)和SMART-Seq2 (E-MTAB-5061)生成的人類胰島細(xì)胞數(shù)據(jù)集,我們通過SeuratData包來(lái)加載這個(gè)數(shù)據(jù)集锄列。
安裝并加載所需的R包
# 安裝并加載SeuratData包
devtools::install_github('satijalab/seurat-data')
library(SeuratData)
library(Seurat)
# 查看SeuratData包搜集的數(shù)據(jù)集
AvailableData()
Dataset Version Summary species system ncells tech notes Installed InstalledVersion
cbmc.SeuratData cbmc 3.0.0 scRNAseq and 13-antibody sequencing of CBMCs human CBMC (cord blood) 8617 CITE-seq <NA> TRUE 3.0.0
hcabm40k.SeuratData hcabm40k 3.0.0 40,000 Cells From the Human Cell Atlas ICA Bone Marrow Dataset human bone marrow 40000 10x v2 <NA> FALSE 3.0.0
ifnb.SeuratData ifnb 3.0.0 IFNB-Stimulated and Control PBMCs human PBMC 13999 10x v1 <NA> TRUE 3.0.0
panc8.SeuratData panc8 3.0.0 Eight Pancreas Datasets Across Five Technologies human Pancreatic Islets 14892 SMARTSeq2, Fluidigm C1, CelSeq, CelSeq2, inDrops <NA> TRUE 3.0.0
pbmc3k.SeuratData pbmc3k 3.0.0 3k PBMCs from 10X Genomics human PBMC 2700 10x v1 <NA> TRUE 3.0.0
pbmcsca.SeuratData pbmcsca 3.0.0 Broad Institute PBMC Systematic Comparative Analysis human PBMC 31021 10x v2, 10x v3, SMARTSeq2, Seq-Well, inDrops, Drop-seq, CelSeq2 HCA benchmark FALSE 3.0.0
# 下載安裝SeuratData包收集的特定數(shù)據(jù)集
InstallData("panc8")
# 加載數(shù)據(jù)集
library(panc8.SeuratData)
data("panc8")
panc8
An object of class Seurat
34363 features across 14890 samples within 1 assay
Active assay: RNA (34363 features)
分割對(duì)象新蟆,構(gòu)建不同的數(shù)據(jù)集
head(panc8@meta.data)
orig.ident nCount_RNA nFeature_RNA tech replicate assigned_cluster
D101_5 D101 4615.810 1986 celseq celseq <NA>
D101_7 D101 29001.563 4209 celseq celseq <NA>
D101_10 D101 6707.857 2408 celseq celseq <NA>
D101_13 D101 8797.224 2964 celseq celseq <NA>
D101_14 D101 5032.558 2264 celseq celseq <NA>
D101_17 D101 13474.866 3982 celseq celseq <NA>
celltype dataset
D101_5 gamma celseq
D101_7 acinar celseq
D101_10 alpha celseq
D101_13 delta celseq
D101_14 beta celseq
D101_17 ductal celseq
# 根據(jù)meta信息中不同的測(cè)序技術(shù)(tech)對(duì)Seurat對(duì)象進(jìn)行分割,構(gòu)建不同的數(shù)據(jù)集
pancreas.list <- SplitObject(panc8, split.by = "tech")
# 選擇出四種不同測(cè)序技術(shù)產(chǎn)生的數(shù)據(jù)
pancreas.list <- pancreas.list[c("celseq", "celseq2", "fluidigmc1", "smartseq2")]
pancreas.list
$celseq
An object of class Seurat
34363 features across 1004 samples within 1 assay
Active assay: RNA (34363 features, 0 variable features)
$celseq2
An object of class Seurat
34363 features across 2285 samples within 1 assay
Active assay: RNA (34363 features, 0 variable features)
$fluidigmc1
An object of class Seurat
34363 features across 638 samples within 1 assay
Active assay: RNA (34363 features, 0 variable features)
$smartseq2
An object of class Seurat
34363 features across 2394 samples within 1 assay
Active assay: RNA (34363 features, 0 variable features)
分別對(duì)每個(gè)數(shù)據(jù)集進(jìn)行標(biāo)準(zhǔn)的預(yù)處理
for (i in 1:length(pancreas.list)) {
pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000, verbose = FALSE)
}
將不同的數(shù)據(jù)集進(jìn)行整合
首先使用FindIntegrationAnchors
函數(shù)來(lái)識(shí)別anchors右蕊,該函數(shù)接受Seurat對(duì)象的列表(list)作為輸入,在這里我們將三個(gè)對(duì)象構(gòu)建成一個(gè)參考數(shù)據(jù)集吮螺。使用默認(rèn)參數(shù)來(lái)識(shí)別錨饶囚,如數(shù)據(jù)集的“維數(shù)”(30)
reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2")]
pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
Computing 2000 integration features
Scaling features for provided objects
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 02s
Finding all pairwise anchors
| | 0 % ~calculating Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 3514 anchors
Filtering anchors
Retained 2761 anchors
Extracting within-dataset neighbors
|+++++++++++++++++ | 33% ~25s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 3500 anchors
Filtering anchors
Retained 2728 anchors
Extracting within-dataset neighbors
|++++++++++++++++++++++++++++++++++ | 67% ~12s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 6174 anchors
Filtering anchors
Retained 4561 anchors
Extracting within-dataset neighbors
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 49s
pancreas.anchors
An AnchorSet object containing 20100 anchors between 3 Seurat objects
This can be used as input to IntegrateData or TransferData.
然后將這些識(shí)別好的anchors傳遞給IntegrateData
函數(shù),整合后的數(shù)據(jù)返回一個(gè)Seurat對(duì)象鸠补,該對(duì)象中將包含一個(gè)新的Assay(integrated)萝风,里面存儲(chǔ)了整合后表達(dá)矩陣,原始的表達(dá)矩陣存儲(chǔ)在RNA這個(gè)Assay中紫岩。
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
Merging dataset 1 into 2
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
Merging dataset 3 into 2 1
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
pancreas.integrated
An object of class Seurat
36363 features across 5683 samples within 2 assays
Active assay: integrated (2000 features, 2000 variable features)
1 other assay present: RNA
對(duì)整合后的數(shù)據(jù)集進(jìn)行常規(guī)的降維聚類可視化
library(ggplot2)
library(cowplot)
library(patchwork)
# switch to integrated assay. The variable features of this assay are automatically
# set during IntegrateData
DefaultAssay(pancreas.integrated) <- "integrated"
# Run the standard workflow for visualization and clustering
# 數(shù)據(jù)標(biāo)準(zhǔn)化
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
# PCA降維
pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
DimPlot(pancreas.integrated, reduction = "pca", group.by = "tech")
# UMAP降維可視化
pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
19:12:41 UMAP embedding parameters a = 0.9922 b = 1.112
19:12:41 Read 5683 rows and found 30 numeric columns
19:12:41 Using Annoy for neighbor search, n_neighbors = 30
19:12:41 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
19:12:43 Writing NN index file to temp file /tmp/RtmpwSZSgF/file2786475d27897
19:12:43 Searching Annoy index using 1 thread, search_k = 3000
19:12:45 Annoy recall = 100%
19:12:46 Commencing smooth kNN distance calibration using 1 thread
19:12:49 Found 2 connected components, falling back to 'spca' initialization with init_sdev = 1
19:12:49 Initializing from PCA
19:12:49 PCA: 2 components explained 44.22% variance
19:12:49 Commencing optimization for 500 epochs, with 252460 positive edges
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
19:13:07 Optimization finished
# 使用group.by函數(shù)根據(jù)不同的條件進(jìn)行分群
p1 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend()
p1 + p2
p3 <- DimPlot(pancreas.integrated, reduction = "umap", split.by = "tech")
p3
使用整合后的參考數(shù)據(jù)集對(duì)細(xì)胞類型進(jìn)行分類
Seurat3還支持將參考數(shù)據(jù)集(或元數(shù)據(jù))投影到查詢對(duì)象上规惰。雖然許多方法是一致的(這兩個(gè)過程都是從識(shí)別錨開始的),但數(shù)據(jù)映射(data transfer)和數(shù)據(jù)整合(data integration)之間有兩個(gè)重要的區(qū)別:
1)In data transfer, Seurat does not correct or modify the query expression data.
2)In data transfer, Seurat has an option (set by default) to project the PCA structure of a reference onto the query, instead of learning a joint structure with CCA. We generally suggest using this option when projecting data between scRNA-seq datasets.
識(shí)別到anchors之后泉蝌,我們使用TransferData
函數(shù)根據(jù)參考數(shù)據(jù)集中細(xì)胞類型標(biāo)簽向量對(duì)查詢數(shù)據(jù)集的細(xì)胞進(jìn)行分類歇万。TransferData函數(shù)返回一個(gè)帶有預(yù)測(cè)id和預(yù)測(cè)分?jǐn)?shù)的矩陣,我們可以將其添加到query metadata中勋陪。
# 構(gòu)建query數(shù)據(jù)集
pancreas.query <- pancreas.list[["fluidigmc1"]]
pancreas.query
An object of class Seurat
34363 features across 638 samples within 1 assay
Active assay: RNA (34363 features, 2000 variable features)
# 識(shí)別參考數(shù)據(jù)集的anchors
pancreas.anchors <- FindTransferAnchors(reference = pancreas.integrated, query = pancreas.query, dims = 1:30)
pancreas.anchors
An AnchorSet object containing 20100 anchors between 3 Seurat objects
This can be used as input to IntegrateData or TransferData.
# 將查詢數(shù)據(jù)集映射到參考數(shù)據(jù)集上
predictions <- TransferData(anchorset = pancreas.anchors, refdata = pancreas.integrated$celltype, dims = 1:30)
# 添加預(yù)測(cè)出的信息
pancreas.query <- AddMetaData(pancreas.query, metadata = predictions)
因?yàn)槲覀兙哂衼?lái)自整合后數(shù)據(jù)集中含有的原始注釋標(biāo)簽贪磺,所以我們可以評(píng)估預(yù)測(cè)的細(xì)胞類型注釋與完整參考的匹配程度。在此示例中诅愚,我們發(fā)現(xiàn)在細(xì)胞類型分類中具有很高的一致性寒锚,有超過97%的細(xì)胞被正確的標(biāo)記出。
pancreas.query$prediction.match <- pancreas.query$predicted.id == pancreas.query$celltype
table(pancreas.query$prediction.match)
## FALSE TRUE
## 18 620
為了進(jìn)一步驗(yàn)證這一點(diǎn)违孝,我們可以查看一些特定胰島細(xì)胞群中的典型細(xì)胞類型標(biāo)記基因(cell type markers)刹前。
table(pancreas.query$predicted.id)
##
## acinar activated_stellate alpha beta
## 21 17 248 258
## delta ductal endothelial epsilon
## 22 33 13 1
## gamma macrophage mast schwann
## 17 1 2 5
VlnPlot(pancreas.query, c("REG1A", "PPY", "SST", "GHRL", "VWF", "SOX10"), group.by = "predicted.id")
二、使用SCTransform方法進(jìn)行整合分析
Here, instead, we will harmonize the Pearson residuals that are output from SCTransform. As demonstrated below, the workflow consists of the following steps:
- Create a list of Seurat objects to integrate
- Perform SCTransform normalization separately for each dataset
- Run the PrepSCTIntegration function on the object list
- Integrate datasets, and proceed with joint analysis
首先雌桑,構(gòu)建Seurat對(duì)象列表喇喉,并分別對(duì)每個(gè)對(duì)象運(yùn)行SCTransform進(jìn)行數(shù)據(jù)標(biāo)準(zhǔn)化:
library(Seurat)
library(ggplot2)
library(patchwork)
options(future.globals.maxSize = 4000 * 1024^2)
data("panc8")
pancreas.list <- SplitObject(panc8, split.by = "tech")
pancreas.list <- pancreas.list[c("celseq", "celseq2", "fluidigmc1", "smartseq2")]
for (i in 1:length(pancreas.list)) {
pancreas.list[[i]] <- SCTransform(pancreas.list[[i]], verbose = FALSE)
}
接下來(lái),選擇用于數(shù)據(jù)整合的一些features筹燕,并運(yùn)行PrepSCTIntegration
pancreas.features <- SelectIntegrationFeatures(object.list = pancreas.list, nfeatures = 3000)
pancreas.list <- PrepSCTIntegration(object.list = pancreas.list, anchor.features = pancreas.features, verbose = FALSE)
然后使用FindIntegrationAnchors識(shí)別anchors轧飞,并運(yùn)行IntegrateData進(jìn)行數(shù)據(jù)集的整合,確保設(shè)置了normalization.method = 'SCT'撒踪。
pancreas.anchors <- FindIntegrationAnchors(object.list = pancreas.list, normalization.method = "SCT", anchor.features = pancreas.features, verbose = FALSE)
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, normalization.method = "SCT", verbose = FALSE)
對(duì)整合后的數(shù)據(jù)進(jìn)行下游的降維可視化
pancreas.integrated <- RunPCA(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, dims = 1:30)
plots <- DimPlot(pancreas.integrated, group.by = c("tech", "celltype"))
plots & theme(legend.position = "top") & guides(color = guide_legend(nrow = 3, byrow = TRUE, override.aes = list(size = 3)))
三过咬、基于Reference-based的方法進(jìn)行整合分析
As an alternative, we introduce here the possibility of specifying one or more of the datasets as the ‘reference’ for integrated analysis, with the remainder designated as ‘query’ datasets. In this workflow, we do not identify anchors between pairs of query datasets, reducing the number of comparisons. For example, when integrating 10 datasets with one specified as a reference, we perform only 9 comparisons. Reference-based integration can be applied to either log-normalized or SCTransform-normalized datasets.
library(Seurat)
library(SeuratData)
library(ggplot2)
library(patchwork)
InstallData("pbmcsca")
data("pbmcsca")
# 分割數(shù)據(jù)集構(gòu)建Seurat對(duì)象列表
pbmc.list <- SplitObject(pbmcsca, split.by = "Method")
# 分別對(duì)每個(gè)對(duì)象進(jìn)行SCTransform標(biāo)準(zhǔn)化處理
for (i in names(pbmc.list)) {
pbmc.list[[i]] <- SCTransform(pbmc.list[[i]], verbose = FALSE)
}
# 選擇用于數(shù)據(jù)集整合的features
pbmc.features <- SelectIntegrationFeatures(object.list = pbmc.list, nfeatures = 3000)
# 執(zhí)行PrepSCTIntegration處理
pbmc.list <- PrepSCTIntegration(object.list = pbmc.list, anchor.features = pbmc.features)
# 選擇參考數(shù)據(jù)集
reference_dataset <- which(names(pbmc.list) == "10x Chromium (v3)")
# 識(shí)別整合的anchors
pbmc.anchors <- FindIntegrationAnchors(object.list = pbmc.list, normalization.method = "SCT", anchor.features = pbmc.features, reference = reference_dataset)
# 進(jìn)行數(shù)據(jù)整合
pbmc.integrated <- IntegrateData(anchorset = pbmc.anchors, normalization.method = "SCT")
# 數(shù)據(jù)降維可視化
pbmc.integrated <- RunPCA(object = pbmc.integrated, verbose = FALSE)
pbmc.integrated <- RunUMAP(object = pbmc.integrated, dims = 1:30)
plots <- DimPlot(pbmc.integrated, group.by = c("Method", "CellType"))
plots & theme(legend.position = "top") & guides(color = guide_legend(nrow = 4, byrow = TRUE, override.aes = list(size = 2.5)))
DefaultAssay(pbmc.integrated) <- "RNA"
# Normalize RNA data for visualization purposes
pbmc.integrated <- NormalizeData(pbmc.integrated, verbose = FALSE)
FeaturePlot(pbmc.integrated, c("CCR7", "S100A4", "GZMB", "GZMK", "GZMH", "TCL1A"))
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] splines parallel stats4 grid stats graphics grDevices
[8] utils datasets methods base
other attached packages:
[1] loomR_0.2.1.9000 hdf5r_1.3.1
[3] R6_2.4.0 mclust_5.4.5
[5] garnett_0.1.16 org.Hs.eg.db_3.8.2
[7] AnnotationDbi_1.46.0 pagedown_0.9.1
[9] devtools_2.1.0 usethis_1.5.1
[11] celaref_1.2.0 scran_1.12.1
[13] scRNAseq_1.10.0 FLOWMAPR_1.2.0
[15] Seurat_3.1.4.9902 sctransform_0.2.0
[17] patchwork_0.0.1 cowplot_1.0.0
[19] DT_0.12 RColorBrewer_1.1-2
[21] shinydashboard_0.7.1 ggsci_2.9
[23] shiny_1.3.2 stxBrain.SeuratData_0.1.1
[25] pbmcsca.SeuratData_3.0.0 panc8.SeuratData_3.0.2
[27] SeuratData_0.2.1 doParallel_1.0.14
[29] iterators_1.0.12 binless_0.15.1
[31] RcppEigen_0.3.3.5.0 foreach_1.4.7
[33] scales_1.0.0 dplyr_0.8.3
[35] pagoda2_0.1.0 harmony_1.0
[37] Rcpp_1.0.2 conos_1.1.2
參考來(lái)源:https://satijalab.org/seurat/v3.1/integration.html