對(duì)應(yīng)原版教程第12章http://bioconductor.org/books/release/OSCA/overview.html”
“物以類聚”的類是什么類竞慢?比如將一群水果分為不同的類群骑冗,則又紅又圓特征的可能是蘋果拙泽。對(duì)于單細(xì)胞聚類的結(jié)果窖张,類的最直接注釋就是細(xì)胞類型。本節(jié)將學(xué)習(xí)單細(xì)胞數(shù)據(jù)分析過程中注釋細(xì)胞類型的三種思路豹障。
筆記要點(diǎn)
1冯事、參考已注釋細(xì)胞類型的單細(xì)胞表達(dá)矩陣
2、觀察各種細(xì)胞類型marker gene sets在每個(gè)細(xì)胞里的表達(dá)“活性”
3血公、cluster差異基因富集分析結(jié)果聯(lián)系細(xì)胞類型
筆記要點(diǎn)
1昵仅、參考已注釋細(xì)胞類型的單細(xì)胞表達(dá)矩陣
2、觀察各種細(xì)胞類型marker gene sets在每個(gè)細(xì)胞里的表達(dá)“活性”
3坞笙、cluster差異基因富集分析結(jié)果聯(lián)系細(xì)胞類型
1岩饼、參考已注釋細(xì)胞類型的單細(xì)胞表達(dá)矩陣
1.1 背景知識(shí)
- 如果已知若干細(xì)胞類型的全部基因表達(dá)情況,對(duì)于未知的單細(xì)胞表達(dá)矩陣的每一個(gè)細(xì)胞薛夜,可以計(jì)算與前者的各個(gè)相似性程度籍茧,根據(jù)最優(yōu)的相似結(jié)果,進(jìn)而推斷出細(xì)胞類型梯澜。
- 在如上描述中寞冯,有兩個(gè)關(guān)鍵點(diǎn)。首先是已知細(xì)胞類型的表達(dá)矩陣(列名為細(xì)胞類型晚伙,行名為基因)吮龄。常用的
celldex
數(shù)據(jù)包提供了人/鼠多個(gè)參考表達(dá)矩陣,多來自于Bulk RNA-seq或芯片測(cè)序技術(shù)咆疗,詳見https://bioconductor.org/packages/3.12/data/experiment/vignettes/celldex/inst/doc/userguide.html - 第二的關(guān)鍵點(diǎn)是評(píng)價(jià)未知細(xì)胞的表達(dá)情況與已知細(xì)胞類型的表達(dá)圖譜的相似性方法漓帚。這里重點(diǎn)介紹
SingleR
包提供的評(píng)價(jià)以及預(yù)測(cè)方法
1.2 示例數(shù)據(jù)
- 測(cè)試數(shù)據(jù)集(未知細(xì)胞類型):人外周血組織單細(xì)胞測(cè)序結(jié)果
sce.pbmc #已完成質(zhì)控、標(biāo)準(zhǔn)化午磁、聚類尝抖;具體可參考原教程
#class: SingleCellExperiment
#dim: 33694 737280
- 參考數(shù)據(jù)集:
celldex
包提供的BlueprintEncodeData()
The Blueprint/ENCODE reference consists of bulk RNA-seq data for pure stroma and immune cells generated by Blueprint (Martens and Stunnenberg 2013) and ENCODE projects (The ENCODE Project Consortium 2012).
library(celldex)
ref <- BlueprintEncodeData()
ref
1.3 SingleR
細(xì)胞類型預(yù)測(cè)
(1)默認(rèn)直接預(yù)測(cè)每個(gè)細(xì)胞的類型
library(SingleR)
pred <- SingleR(test=sce.pbmc, ref=ref, labels=ref$label.main)
head(pred)
# DataFrame with 6 rows and 5 columns
# scores first.labels tuning.scores labels pruned.labels
# <matrix> <character> <DataFrame> <character> <character>
# AAACCTGAGAAGGCCT-1 0.251873:0.118890:0.287805:... Monocytes 0.413717:0.00491616 Monocytes Monocytes
# AAACCTGAGACAGACC-1 0.280080:0.133100:0.334590:... Monocytes 0.469530:0.40291032 Monocytes Monocytes
# AAACCTGAGGCATGGT-1 0.211503:0.153752:0.345878:... CD4+ T-cells 0.180486:0.06419868 CD4+ T-cells CD4+ T-cells
# AAACCTGCAAGGTTCT-1 0.218346:0.155343:0.367597:... CD8+ T-cells 0.307063:0.19412309 CD8+ T-cells CD8+ T-cells
# AAACCTGCAGGCGATA-1 0.323697:0.182205:0.483679:... Monocytes 0.558153:0.50413520 Monocytes Monocytes
# AAACCTGCATGAAGTA-1 0.312739:0.166972:0.461094:... Monocytes 0.531659:0.46923231 Monocytes Monocytes
table(pred$labels)
# B-cells CD4+ T-cells CD8+ T-cells DC Eosinophils Erythrocytes HSC Monocytes NK cells
# 549 773 1274 1 1 5 14 1117 251
celldex
包提供的參考數(shù)據(jù)集里一般提供label.main
、label.fine
兩種resolution的細(xì)胞類型注釋迅皇,前者的分辨率低昧辽,重點(diǎn)關(guān)注主要的細(xì)胞類型;后者則適合具體的細(xì)胞亞類研究登颓。
- 觀察具體每個(gè)位置細(xì)胞對(duì)所有參考細(xì)胞類型的評(píng)分搅荞。從圖中可以看出
B-cells
、Monocytes
的評(píng)分具有專一性,而CD4+
與CD8+
細(xì)胞可能會(huì)出現(xiàn)難以區(qū)分的情況咕痛。
plotScoreHeatmap(pred)
如下圖痢甘,每一列代表一個(gè)未知細(xì)胞對(duì)所有參考細(xì)胞類型的相似性評(píng)分情況。
- 進(jìn)一步觀察注釋的細(xì)胞類型的分類與之前聚類結(jié)果的混淆關(guān)系暇检。從圖中可以看出产阱,多個(gè)cluster對(duì)應(yīng)于一種細(xì)胞類型
Monocytes
,可能反應(yīng)可不同的細(xì)胞亞類情況块仆。而CD4+
與CD8+
細(xì)胞會(huì)出現(xiàn)對(duì)應(yīng)于同一個(gè)cluster的情況。
tab <- table(Assigned=pred$pruned.labels, Cluster=colLabels(sce.pbmc))
# Adding a pseudo-count of 10 to avoid strong color jumps with just 1 cell.
library(pheatmap)
pheatmap(log2(tab+10), color=colorRampPalette(c("white", "blue"))(101))
(2)統(tǒng)一預(yù)測(cè)每個(gè)cluster的細(xì)胞類型
pred2 <- SingleR(test=sce.pbmc, ref=ref, clusters = colLabels(sce.pbmc), labels=ref$label.main)
# B-cells CD4+ T-cells CD8+ T-cells HSC Monocytes NK cells
# 1 2 4 1 7 1
table(pred2$labels)
plotScoreHeatmap(pred2)
2王暗、觀察各種細(xì)胞類型marker gene sets在每個(gè)細(xì)胞里的表達(dá)“活性”
2.1 背景知識(shí)
- 首先得到每種細(xì)胞類型的marker gene set(一般指特異高表達(dá)基因集)悔据。
- 然后計(jì)算對(duì)應(yīng)于單細(xì)胞測(cè)序結(jié)果中,每個(gè)細(xì)胞的表達(dá)信息中俗壹,上述基因集的高表達(dá)程度科汗。
-
AUCell
包對(duì)此提供的計(jì)算方式是采用Wilcoxon rank sum test。簡(jiǎn)單來說先將一個(gè)細(xì)胞的基因按表達(dá)量從高到低排序绷雏,根據(jù)秩次比較在基因集中的基因表達(dá)水平是否區(qū)分于(高于)不在基因集中的基因表達(dá)水平头滔。最終展示的指標(biāo)為AUC值,值越接近1說明該細(xì)胞越特異表達(dá)該基因基涎显,越有可能為該基因集對(duì)應(yīng)的細(xì)胞類型坤检。
2.2 示例數(shù)據(jù)
- 測(cè)試數(shù)據(jù)集:鼠腦單細(xì)胞測(cè)序數(shù)據(jù)
library(scRNAseq)
sce.tasic <- TasicBrainData()
sce.tasic
# class: SingleCellExperiment
# dim: 24058 1809
- 細(xì)胞類型marker gene set:獲取方式很多,也有專門的數(shù)據(jù)庫整理期吓。教程中是根據(jù)已標(biāo)注細(xì)胞類型的單細(xì)胞表達(dá)矩陣計(jì)算出每個(gè)細(xì)胞類型的marker gene set
sce.zeisel
# class: SingleCellExperiment
# dim: 19839 2816
library(scran)
wilcox.z <- pairwiseWilcox(sce.zeisel, sce.zeisel$level1class,
lfc=1, direction="up")
markers.z <- getTopMarkers(wilcox.z$statistics, wilcox.z$pairs,
pairwise=FALSE, n=50)
#每個(gè)細(xì)胞類型的marker gene set為元素組成的list對(duì)象
2.3 AUCell
計(jì)算
- 首先需提前使用
GSEABase
包整理marker gene sets的對(duì)象為指定的GeneSetCollection
對(duì)象格式
library(GSEABase)
all.sets <- lapply(names(markers.z), function(x) {
GeneSet(markers.z[[x]], setName=x)
})
all.sets <- GeneSetCollection(all.sets)
- 然后計(jì)算測(cè)序數(shù)據(jù)中每個(gè)細(xì)胞的基因表達(dá)排名
rankings <- AUCell_buildRankings(counts(sce.tasic),
plotStats=FALSE, verbose=FALSE)
- 計(jì)算每個(gè)細(xì)胞對(duì)多種細(xì)胞mark基因集的AUC結(jié)果
cell.aucs <- AUCell_calcAUC(all.sets, rankings)
results <- t(assay(cell.aucs))
head(results)
# gene sets
# cells astrocytes_ependymal endothelial-mural interneurons microglia oligodendrocytes pyramidal CA1 pyramidal SS
# Calb2_tdTpositive_cell_1 0.1386528 0.04264310 0.5306093 0.04845394 0.1317958 0.2317668 0.3476778
# Calb2_tdTpositive_cell_2 0.1366283 0.04884635 0.4538357 0.02682648 0.1211293 0.2062570 0.2762466
# Calb2_tdTpositive_cell_3 0.1087323 0.07269892 0.3458998 0.03583482 0.1567204 0.3218726 0.5244492
# Calb2_tdTpositive_cell_4 0.1321658 0.04993108 0.5112665 0.05387632 0.1480893 0.2546714 0.3505890
# Calb2_tdTpositive_cell_5 0.1512892 0.07161420 0.4929771 0.06655747 0.1385766 0.2088478 0.3009582
# Calb2_tdTpositive_cell_6 0.1342012 0.09161375 0.3378310 0.03201310 0.1552946 0.4010508 0.5392798
- 對(duì)于每一個(gè)細(xì)胞來說早歇,一般使用AUC值最高的細(xì)胞類型標(biāo)簽
new.labels <- colnames(results)[max.col(results)]
# new.labels
# astrocytes_ependymal endothelial-mural interneurons microglia oligodendrocytes pyramidal SS
# 69 29 776 23 41 871
對(duì)于這種注釋細(xì)胞類型的方法比較適用于不同細(xì)胞類型能夠的marker基因集不重疊的情況,即對(duì)于細(xì)胞亞類的區(qū)分效果可能不會(huì)很好讨勤。因此箭跳,在選取細(xì)胞類型特異基因集時(shí)需要多加考慮。
3潭千、cluster差異基因GO/KEGG富集分析結(jié)果聯(lián)系細(xì)胞類型
3.1 背景知識(shí)
- 這是基于聚類結(jié)果谱姓,計(jì)算每個(gè)cluster相較于其它c(diǎn)luster的上調(diào)差異基因。然后對(duì)每個(gè)cluster的up DEG進(jìn)行富集分析刨晴,最后根據(jù)富集分析結(jié)果屉来,手動(dòng)注釋出細(xì)胞類型媳板。
- 富集背景通路基可以選擇GO(Gene Ontology)的
BP, biological
term set結(jié)果青团,因?yàn)榉奖闩c細(xì)胞類型聯(lián)系起來。
3.2 示例數(shù)據(jù)
- 測(cè)試數(shù)據(jù)集:小鼠乳腺組織測(cè)序數(shù)據(jù)
sce.mam
3.3 limma
包go富集分析goana()
#cluster差異分析
markers.mam <- findMarkers(sce.mam, direction="up", lfc=1)
#選擇cluster 2的DEG
chosen <- "2"
cur.markers <- markers.mam[[chosen]]
# 基因名格式轉(zhuǎn)換
library(org.Mm.eg.db)
entrez.ids <- mapIds(org.Mm.eg.db, keys=rownames(cur.markers),
column="ENTREZID", keytype="SYMBOL")
# 進(jìn)一步篩選具有顯著意義的差異基因用于接下來的富集分析
is.de <- cur.markers$FDR <= 0.05
summary(is.de)
library(limma)
go.out <- goana(unique(entrez.ids[is.de]), species="Mm",
universe=unique(entrez.ids))
# Only keeping biological process terms that are not overly general.
go.out <- go.out[order(go.out$P.DE),]
go.useful <- go.out[go.out$Ont=="BP" & go.out$N <= 200,]
head(go.useful, 10)
# Term Ont N DE P.DE
# GO:0006641 triglyceride metabolic process BP 105 11 2.799728e-10
# GO:0006639 acylglycerol metabolic process BP 135 11 4.173564e-09
# GO:0006638 neutral lipid metabolic process BP 137 11 4.876288e-09
# GO:0006119 oxidative phosphorylation BP 94 9 2.722992e-08
# GO:0042775 mitochondrial ATP synthesis coupled electron transport BP 51 7 8.032239e-08
# GO:0042773 ATP synthesis coupled electron transport BP 52 7 9.225569e-08
# GO:0046390 ribose phosphate biosynthetic process BP 147 10 1.203158e-07
# GO:0022408 negative regulation of cell-cell adhesion BP 187 11 1.225564e-07
# GO:0009152 purine ribonucleotide biosynthetic process BP 131 9 4.817966e-07
# GO:0035148 tube formation BP 174 10 5.775054e-07
We see an enrichment for genes involved in lipid synthesis, cell adhesion and tube formation. Given that this is a mammary gland experiment, we might guess that cluster 2 contains luminal epithelial cells responsible for milk production and secretion.
如上推理過程可以看出特漩,這種預(yù)測(cè)細(xì)胞類型需要有較強(qiáng)的生物背景知識(shí)亿驾。
- 如果想進(jìn)一步看具體富集到的某一term里所有基因在該cluster的DEG列表分布嘹黔,可如下操作:
# Extract symbols for each GO term; done once.
tab <- select(org.Mm.eg.db, keytype="SYMBOL",
keys=rownames(sce.mam), columns="GOALL")
by.go <- split(tab[,1], tab[,2])
# Identify genes associated with an interesting term.
adhesion <- unique(by.go[["GO:0022408"]])
head(cur.markers[rownames(cur.markers) %in% adhesion,1:4], 10)
# DataFrame with 10 rows and 4 columns
# Top p.value FDR summary.logFC
# <integer> <numeric> <numeric> <numeric>
# Spint2 11 3.28234e-34 1.37163e-31 2.39280
# Epcam 17 8.86978e-94 7.09531e-91 2.32968
# Cebpb 21 6.76957e-16 2.03800e-13 1.80192
# Cd24a 21 3.24195e-33 1.29669e-30 1.72318
# Btn1a1 24 2.16574e-13 6.12488e-11 1.26343
# Cd9 51 1.41373e-11 3.56592e-09 2.73785
# Ceacam1 52 1.66948e-38 7.79034e-36 1.56912
# Sdc4 59 9.15001e-07 1.75467e-04 1.84014
# Anxa1 68 2.58840e-06 4.76777e-04 1.29724
# Cdh1 69 1.73658e-07 3.54897e-05 1.31265
以上分別介紹了三種注釋細(xì)胞類型的方法,各有側(cè)重。細(xì)胞類型注釋是一個(gè)單細(xì)胞數(shù)據(jù)分析過程中的重要步驟儡蔓,還有其它一些注釋方法郭蕉,有機(jī)會(huì)再多多學(xué)習(xí)。