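As more and more scRNA-seq datasets become available, comparing them is key. A central application is to compare datasets of similar biological origin collected by different labs, to ensure that annotation and analysis are consistent. Moreover, as large reference datasets such as the Human Cell Atlas (HCA) become available, an important application will be to project cells from a new sample (e.g. from a disease tissue) onto the reference, in order to characterize differences in composition or to detect new cell types.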
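scmap is a method for projecting cells from a scRNA-seq experiment onto the cell types or individual cells identified in a different experiment. The scmap manuscript is available on bioRxiv.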
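scmap is built on top of Bioconductor's SingleCellExperiment class. Please read up on how to create a SingleCellExperiment from your own data. Here we show a small example of how to do that, but note that it is not a comprehensive guide.
If you already have a SingleCellExperiment object, proceed to the next section.
If you have an expression matrix, you first need to create a SingleCellExperiment object containing your data. For illustration we use the example expression matrix provided with scmap. The dataset (yan) contains FPKM gene expression for 90 cells from human embryos. The authors (Yan et al.) defined the developmental stage of every cell in the original publication (the ann data frame). We will use these stages later in the projection.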
library(SingleCellExperiment)
library(scmap)
head(ann)
## cell_type1
## Oocyte..1.RPKM. zygote
## Oocyte..2.RPKM. zygote
## Oocyte..3.RPKM. zygote
## Zygote..1.RPKM. zygote
## Zygote..2.RPKM. zygote
## Zygote..3.RPKM. zygote
yan[1:3, 1:3]
## Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## C9orf152 0.0 0.0 0.0
## RPS11 1219.9 1021.1 931.6
## ELMO2 7.0 12.2 9.3
Note that the cell type information has to be stored in the cell_type1 column of the colData slot of the SingleCellExperiment object.
sce <- SingleCellExperiment(assays = list(normcounts = as.matrix(yan)), colData = ann)
logcounts(sce) <- log2(normcounts(sce) + 1)
# use gene names as feature symbols
rowData(sce)$feature_symbol <- rownames(sce)
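# flag ERCC spike-ins (isSpike<- was deprecated in later SingleCellExperiment releases in favour of altExp())
isSpike(sce, "ERCC") <- grepl("^ERCC-", rownames(sce))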
# remove features with duplicated names
sce <- sce[!duplicated(rownames(sce)), ]
sce
## class: SingleCellExperiment
## dim: 20214 90
## metadata(0):
## assays(2): normcounts logcounts
## rownames(20214): C9orf152 RPS11 ... CTSC AQP7
## rowData names(1): feature_symbol
## colnames(90): Oocyte..1.RPKM. Oocyte..2.RPKM. ...
## Late.blastocyst..3..Cell.7.RPKM. Late.blastocyst..3..Cell.8.RPKM.
## colData names(1): cell_type1
## reducedDimNames(0):
## spikeNames(1): ERCC
Feature selection
Once we have a SingleCellExperiment object, we can run scmap. First, we need to select the most informative features (genes) from the input dataset:
sce <- selectFeatures(sce, suppress_plot = FALSE)
## Warning in linearModel(object, n_features): Your object does not contain
## counts() slot. Dropouts were calculated using logcounts() slot...
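The features highlighted in red will be used in the further analysis (projection).
The selected features are stored in the scmap_features column of the rowData slot of the input object. By default scmap selects 500 features (this can be controlled by setting the n_features parameter):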
table(rowData(sce)$scmap_features)
##
## FALSE TRUE
## 19714 500
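The warning above mentions a linear model and dropouts. As a rough illustration only, the following base-R sketch assumes that selection works by fitting log dropout rate against mean expression and keeping the genes with the largest residuals. The simulated data, the variable names and the value of n_features are all hypothetical, and this is not scmap's own code.

```r
# Toy sketch of dropout-based feature selection (an assumption, not scmap's code).
set.seed(1)
n_genes <- 200
n_cells <- 30
n_features <- 20  # hypothetical value; the scmap default quoted above is 500

# Simulated counts: each gene gets its own Poisson rate, so dropout rates vary.
rates <- runif(n_genes, min = 0.2, max = 3)
counts <- matrix(rpois(n_genes * n_cells, lambda = rep(rates, n_cells)),
                 nrow = n_genes)
logc <- log2(counts + 1)

# Dropout rate (% of cells with zero expression) per gene.
dropout_pct <- rowMeans(logc == 0) * 100
# Genes with 0% or 100% dropouts are excluded before fitting.
keep <- which(dropout_pct > 0 & dropout_pct < 100)
log_dropout <- log2(dropout_pct[keep])
mean_expr <- rowMeans(logc[keep, , drop = FALSE])

# Fit the trend and keep the genes with the largest positive residuals,
# i.e. more dropouts than their expression level would suggest.
fit <- lm(log_dropout ~ mean_expr)
top <- keep[order(residuals(fit), decreasing = TRUE)[seq_len(n_features)]]

selected <- logical(n_genes)
selected[top] <- TRUE
table(selected)
```

The resulting logical vector plays the same role as the scmap_features column: one TRUE/FALSE flag per gene.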
scmap-cluster
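The scmap-cluster index of a reference dataset is created by computing the median gene expression for each cluster. By default scmap uses the cell_type1 column of the colData slot of the reference to identify clusters. Other columns can be selected manually by adjusting the cluster_col parameter: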
sce <- indexCluster(sce)
The indexCluster function automatically writes the scmap_cluster_index item into the metadata slot of the reference dataset.
head(metadata(sce)$scmap_cluster_index)
## zygote 2cell 4cell 8cell 16cell blast
## ABCB4 5.788589 6.2258580 5.935134 0.6667119 0.000000 0.000000
## ABCC6P1 7.863625 7.7303559 8.322769 7.4303689 4.759867 0.000000
## ABT1 0.320773 0.1315172 0.000000 5.9787977 6.100671 4.627798
## ACCSL 7.922318 8.4274290 9.662611 4.5869260 1.768026 0.000000
## ACOT11 0.000000 0.0000000 0.000000 6.4677243 7.147798 4.057444
## ACOT9 4.877394 4.2196038 5.446969 4.0685468 3.827819 0.000000
heatmap(as.matrix(metadata(sce)$scmap_cluster_index))
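To make the "median expression per cluster" idea concrete, here is a minimal base-R sketch on a made-up matrix. The gene names, cell names and values are hypothetical; indexCluster itself works on the selected features of the SingleCellExperiment object.

```r
# Toy log-expression matrix: 3 genes x 6 cells (values are made up).
logc <- matrix(
  c(5, 6, 7, 0, 1, 0,
    8, 8, 9, 7, 6, 8,
    0, 0, 1, 6, 5, 7),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:6))
)
clusters <- c("zygote", "zygote", "zygote", "2cell", "2cell", "2cell")

# One column per cluster, holding the per-gene median across that cluster's cells.
cluster_index <- sapply(unique(clusters), function(cl) {
  apply(logc[, clusters == cl, drop = FALSE], 1, median)
})
cluster_index
```

The result has genes in rows and clusters in columns, the same layout as the scmap_cluster_index shown above.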
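Once the scmap-cluster index has been generated, we can use it to project the dataset onto itself (for illustration purposes only). This can be done with one index at a time, but scmap also allows simultaneous projection onto multiple indexes if they are provided as a list: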
scmapCluster_results <- scmapCluster(
projection = sce,
index_list = list(
yan = metadata(sce)$scmap_cluster_index
)
)
scmap-cluster projects the query dataset onto all indexes defined in index_list. The cell label assignments are merged into one matrix:
head(scmapCluster_results$scmap_cluster_labs)
## yan
## [1,] "zygote"
## [2,] "zygote"
## [3,] "zygote"
## [4,] "2cell"
## [5,] "2cell"
## [6,] "2cell"
The corresponding similarities are stored in the scmap_cluster_siml item:
head(scmapCluster_results$scmap_cluster_siml)
## yan
## [1,] 0.9947609
## [2,] 0.9951257
## [3,] 0.9955916
## [4,] 0.9934012
## [5,] 0.9953694
## [6,] 0.9871041
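The label and similarity matrices above can be read as "best-matching cluster, and how well it matched". The sketch below illustrates that idea in base R using cosine similarity only, which is a simplification. The toy index, the query cells and the threshold of 0.7 are illustrative assumptions, not values taken from scmapCluster.

```r
# Cosine similarity between two numeric vectors.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy cluster index (genes x clusters) and toy query cells (genes x cells).
index <- cbind(zygote = c(6, 8, 0), `2cell` = c(0, 1, 7))
query <- cbind(cellA = c(5, 7, 1), cellB = c(1, 0, 6), cellC = c(4, 0, 3))

# Similarity of every query cell to every cluster: clusters in rows, cells in columns.
sims <- sapply(colnames(query), function(cell) {
  apply(index, 2, cosine, b = query[, cell])
})

threshold <- 0.7  # illustrative cut-off
best_siml <- apply(sims, 2, max)
labs <- rownames(sims)[apply(sims, 2, which.max)]
labs[best_siml < threshold] <- "unassigned"

data.frame(cell = colnames(query), label = labs, similarity = round(best_siml, 3))
```

Here cellC does not resemble either cluster closely enough, so it stays unassigned rather than being forced into the nearest cluster.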
scmap also provides combined results across all reference datasets (choosing the label that corresponds to the largest similarity across the reference datasets):
head(scmapCluster_results$combined_labs)
## [1] "zygote" "zygote" "zygote" "2cell" "2cell" "2cell"
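With a single reference the combined labels simply repeat the yan column. The following sketch shows how the "largest similarity wins" rule could play out with two references. The reference names refA and refB and all values are hypothetical.

```r
# Hypothetical per-reference results for three query cells.
labs <- cbind(refA = c("zygote", "2cell", "unassigned"),
              refB = c("zygote", "4cell", "blast"))
siml <- cbind(refA = c(0.99, 0.80, NA),
              refB = c(0.95, 0.91, 0.75))

# For each cell, find the reference with the largest similarity (NAs are ignored).
best_ref <- apply(siml, 1, function(s) {
  if (all(is.na(s))) NA_integer_ else which.max(s)
})

# Pick the label from the winning reference; cells with no match stay unassigned.
combined_labs <- labs[cbind(seq_len(nrow(labs)), best_ref)]
combined_labs[is.na(best_ref)] <- "unassigned"
combined_labs
```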
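The results of scmap-cluster can be visualized as a Sankey diagram to show how cell clusters are matched (getSankey() function). Note that the Sankey diagram is only informative if both the query and the reference datasets have been clustered, but it is not necessary to assign meaningful labels to the query (cluster1, cluster2, etc. are sufficient):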
plot(
getSankey(
colData(sce)$cell_type1,
scmapCluster_results$scmap_cluster_labs[,'yan'],
plot_height = 400
)
)
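A Sankey diagram draws the flows between two label vectors. As a plain-text alternative, the same information can be inspected as a contingency table. The label vectors below are made up for illustration.

```r
# Hypothetical reference labels and projected labels for five cells.
reference_labels <- c("zygote", "zygote", "2cell", "2cell", "4cell")
projected_labels <- c("zygote", "zygote", "2cell", "unassigned", "4cell")

# Each non-zero entry corresponds to one band of the Sankey diagram.
flows <- table(reference = reference_labels, projected = projected_labels)
flows
```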
scmap-cell
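Unlike scmap-cluster, scmap-cell projects the cells of the input dataset onto the individual cells of the reference rather than onto clusters.
scmap-cell contains a k-means step, which makes it stochastic: running it multiple times gives slightly different results. We therefore fix a random seed so that users can reproduce our results exactly: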
set.seed(1)
In scmap-cell the index is created by a product quantiser algorithm, in which every cell in the reference is identified by a set of subcentroids found via k-means clustering on a subset of the features.
sce <- indexCell(sce)
Unlike the scmap-cluster index, the scmap-cell index contains information about each cell and therefore cannot be easily visualized. The scmap-cell index consists of two items:
names(metadata(sce)$scmap_cell_index)
## [1] "subcentroids" "subclusters"
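subcentroids contains the coordinates of the subcentroids of the low-dimensional subspaces defined by the selected features and by the k and M parameters of the product quantiser algorithm (see ?indexCell).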
length(metadata(sce)$scmap_cell_index$subcentroids)
## [1] 50
dim(metadata(sce)$scmap_cell_index$subcentroids[[1]])
## [1] 10 9
metadata(sce)$scmap_cell_index$subcentroids[[1]][,1:5]
## 1 2 3 4 5
## ZAR1L 0.072987697 0.2848353 0.33713297 0.26694708 0.3051086
## SERPINF1 0.179135680 0.3784345 0.35886481 0.39453521 0.4326297
## GRB2 0.439712934 0.4246024 0.23308320 0.43238208 0.3247221
## GSTP1 0.801498298 0.1464230 0.14880665 0.19900079 0.0000000
## ABCC6P1 0.005544482 0.4358565 0.46276591 0.40280401 0.3989602
## ARGFX 0.341212258 0.4284664 0.07629512 0.47961460 0.1296112
## DCT 0.004323311 0.1943568 0.32117489 0.21259776 0.3836451
## C15orf60 0.006681366 0.1862540 0.28346531 0.01123282 0.1096438
## SVOPL 0.003004345 0.1548237 0.33551596 0.12691677 0.2525819
## NLRP9 0.101524942 0.3223963 0.40624639 0.30465156 0.4640308
In the case of our yan dataset:
The yan dataset contains N = 90 cells.
We selected f = 500 features (scmap default).
M was calculated as f / 10 = 50 (scmap default for f ≤ 1000). M is the number of low-dimensional subspaces.
The number of features in any low-dimensional subspace equals f / M = 10.
k was calculated as k = √N ≈ 9 (scmap default).
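The arithmetic above can be checked directly. In this sketch, rounding √N down to an integer is an assumption. It is consistent with the output shown earlier: 50 subcentroid matrices, each with 10 rows and 9 columns.

```r
N <- 90    # number of cells in the yan dataset
f <- 500   # number of selected features (scmap default)

M <- f / 10                      # number of low-dimensional subspaces (for f <= 1000)
features_per_subspace <- f / M   # features in each subspace
k <- floor(sqrt(N))              # subcentroids per subspace; rounding down is assumed

c(M = M, features_per_subspace = features_per_subspace, k = k)
```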
subclusters contains, for every low-dimensional subspace, the index of the subcentroid to which each cell belongs:
dim(metadata(sce)$scmap_cell_index$subclusters)
## [1] 50 90
metadata(sce)$scmap_cell_index$subclusters[1:5,1:5]
## Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM. Zygote..1.RPKM.
## [1,] 6 6 6 6
## [2,] 5 5 5 5
## [3,] 5 5 5 5
## [4,] 3 3 3 3
## [5,] 6 6 6 6
## Zygote..2.RPKM.
## [1,] 6
## [2,] 5
## [3,] 5
## [4,] 3
## [5,] 6
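To show how subcentroids and subclusters relate, here is a toy product quantiser in base R. It is a sketch under simplifying assumptions: features are split into consecutive chunks, no per-cell normalization is applied, and the data are random. It is not the code behind indexCell.

```r
set.seed(1)
n_cells <- 16
n_features <- 6
expr <- matrix(runif(n_features * n_cells), nrow = n_features,
               dimnames = list(paste0("gene", 1:n_features),
                               paste0("cell", 1:n_cells)))

M <- 2                      # number of subspaces (hypothetical)
k <- floor(sqrt(n_cells))   # subcentroids per subspace

# Split the features into M consecutive chunks, one per subspace.
chunks <- split(seq_len(n_features), rep(seq_len(M), each = n_features / M))

# Run k-means on the cells separately within each subspace.
fits <- lapply(chunks, function(rows) {
  kmeans(t(expr[rows, , drop = FALSE]), centers = k, nstart = 5)
})

# subcentroids: one (features x k) matrix per subspace.
subcentroids <- lapply(fits, function(fit) t(fit$centers))
# subclusters: M x n_cells matrix of subcentroid indexes.
subclusters <- do.call(rbind, lapply(fits, function(fit) fit$cluster))

dim(subcentroids[[1]])
dim(subclusters)
```

Each cell is thus summarized by M small integers, one subcentroid index per subspace. That is why the subclusters matrix above has one row per subspace and one column per cell.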
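Once the scmap-cell index has been generated, we can use it to project the yan dataset onto itself. This can be done with one index at a time, but scmap also allows simultaneous projection onto multiple indexes if they are provided as a list: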
scmapCell_results <- scmapCell(
sce,
list(
yan = metadata(sce)$scmap_cell_index
)
)
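For each reference dataset the results contain two matrices. The cells matrix contains the IDs of the top 10 (scmap default) cells of the reference dataset that a given cell of the projection dataset is closest to: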
scmapCell_results$yan$cells[,1:3]
## Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## [1,] 1 1 1
## [2,] 2 2 2
## [3,] 3 3 3
## [4,] 11 11 11
## [5,] 5 5 5
## [6,] 6 6 6
## [7,] 7 7 7
## [8,] 12 8 12
## [9,] 9 9 9
## [10,] 10 10 10
The similarities matrix contains the corresponding cosine similarities:
scmapCell_results$yan$similarities[,1:3]
## Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## [1,] 0.9742737 0.9736593 0.9748542
## [2,] 0.9742274 0.9737083 0.9748995
## [3,] 0.9742274 0.9737083 0.9748995
## [4,] 0.9693955 0.9684169 0.9697731
## [5,] 0.9698173 0.9688538 0.9701976
## [6,] 0.9695394 0.9685904 0.9699759
## [7,] 0.9694336 0.9686058 0.9699198
## [8,] 0.9694091 0.9684312 0.9697699
## [9,] 0.9692544 0.9684312 0.9697358
## [10,] 0.9694336 0.9686058 0.9699198
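The cells and similarities matrices pair each query cell with its nearest reference cells. The sketch below computes such neighbours by brute force with exact cosine similarity, which is a stand-in for the approximate search that the product quantiser index enables. The helper name cosine_top_w and the toy data are hypothetical.

```r
# Return the w reference cells most similar to one query cell (exact search).
cosine_top_w <- function(query_cell, reference, w) {
  # reference is genes x cells; the query vector is recycled down each column.
  sims <- colSums(reference * query_cell) /
    (sqrt(colSums(reference^2)) * sqrt(sum(query_cell^2)))
  ord <- order(sims, decreasing = TRUE)[seq_len(w)]
  list(cells = ord, similarities = sims[ord])
}

# Toy reference with 3 genes and 4 cells, and one query cell.
reference <- cbind(c1 = c(1, 0, 0), c2 = c(0.9, 0.1, 0),
                   c3 = c(0, 1, 0), c4 = c(0, 0, 1))
query_cell <- c(1, 0.05, 0)

res <- cosine_top_w(query_cell, reference, w = 2)
res$cells
round(res$similarities, 4)
```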
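If cell cluster annotation is available for the reference dataset, then in addition to finding the top 10 nearest neighbours scmap-cell can also annotate the cells of the projection dataset using the labels of the reference. It does so by looking at the top 3 nearest neighbours (scmap default): if they all belong to the same cluster in the reference and their maximum similarity is higher than a threshold (0.5 is the scmap default), the projection cell is assigned to the corresponding reference cluster: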
scmapCell_clusters <- scmapCell2Cluster(
scmapCell_results,
list(
as.character(colData(sce)$cell_type1)
)
)
scmap-cell results are in the same format as the ones provided by scmap-cluster (see above):
head(scmapCell_clusters$scmap_cluster_labs)
## yan
## [1,] "zygote"
## [2,] "zygote"
## [3,] "zygote"
## [4,] "unassigned"
## [5,] "unassigned"
## [6,] "unassigned"
The corresponding similarities are stored in the scmap_cluster_siml item:
head(scmapCell_clusters$scmap_cluster_siml)
## yan
## [1,] 0.9742737
## [2,] 0.9737083
## [3,] 0.9748995
## [4,] NA
## [5,] NA
## [6,] NA
head(scmapCell_clusters$combined_labs)
## [1] "zygote" "zygote" "zygote" "unassigned" "unassigned"
## [6] "unassigned"
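The assignment rule described for scmapCell2Cluster (top 3 neighbours in the same cluster, maximum similarity above 0.5) can be sketched as a small base-R function. The function name and the toy inputs are hypothetical, and the neighbours are assumed to be ordered from most to least similar.

```r
# Assign a label from the w nearest neighbours, or "unassigned" if they disagree
# or if even the best similarity does not exceed the threshold.
assign_from_neighbours <- function(nn_cells, nn_sims, ref_labels,
                                   w = 3, threshold = 0.5) {
  top <- seq_len(w)
  labs <- ref_labels[nn_cells[top]]
  if (length(unique(labs)) == 1 && max(nn_sims[top]) > threshold) {
    labs[1]
  } else {
    "unassigned"
  }
}

ref_labels <- c("zygote", "zygote", "zygote", "2cell", "2cell")

# All three nearest neighbours are zygote cells: the label is transferred.
agree <- assign_from_neighbours(c(1, 2, 3, 4), c(0.97, 0.97, 0.96, 0.90), ref_labels)
# The nearest neighbours span two clusters: the cell stays unassigned.
mixed <- assign_from_neighbours(c(3, 4, 5, 1), c(0.97, 0.97, 0.96, 0.90), ref_labels)
# The neighbours agree but are too dissimilar: the cell stays unassigned.
weak <- assign_from_neighbours(c(1, 2, 3, 4), c(0.40, 0.35, 0.30, 0.20), ref_labels)

c(agree = agree, mixed = mixed, weak = weak)
```

The "mixed" case shows one way a cell can end up unassigned even when its similarities are high, as for the cells labelled unassigned in the output above.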
plot(
getSankey(
colData(sce)$cell_type1,
scmapCell_clusters$scmap_cluster_labs[,"yan"],
plot_height = 400
)
)
scmap: projection of single-cell RNA-seq data across data sets
http://bioconductor.org/packages/release/bioc/vignettes/scmap/inst/doc/scmap.html