水滴石穿办素,非一日之功
本期內(nèi)容同時(shí)發(fā)布于單細(xì)胞天地
微信公眾號(hào)纯趋。
在進(jìn)行單細(xì)胞分析時(shí)姻成,非常重要的一步就是聚類分群勿决,而在這個(gè)過程當(dāng)中經(jīng)常令人困惑的是如何選擇一個(gè)合適的分辨率(resolution),因?yàn)榉直媛试O(shè)置過大或過小都不利于我們后續(xù)的分析:設(shè)置過大會(huì)導(dǎo)致分群“過度”琅束,理論上來講每個(gè)細(xì)胞都是不同的扭屁,但這樣過于細(xì)化的初步分群顯然是沒有意義的;相反設(shè)置過小會(huì)導(dǎo)致我們無法發(fā)現(xiàn)一些對我們可能比較重要的細(xì)胞亞群狰闪,這也喪失了單細(xì)胞數(shù)據(jù)本身的意義疯搅。
在這里分享3個(gè)可以供大家參考的單細(xì)胞數(shù)據(jù)降維聚類分群的resolution參數(shù)選定評估方法:
clustree
marker gene AUC值
ROGUE
上游分析
本次分析所使用的示例數(shù)據(jù)集為GSE197778濒生,包含四個(gè)樣本埋泵,首先使用Harmony進(jìn)行批次效應(yīng)的去除:
library(Seurat)
library(harmony)
library(dplyr)
library(ROGUE)
PATH = 'GSE197778'
samples = list.files(path = PATH)
sce_list = lapply(X = samples, FUN = function(sample){
sce <- CreateSeuratObject(counts = Read10X(data.dir = file.path(PATH, sample)),
project = sample,
min.cells = 10,
min.features = 10)
sce[['percent.mt']] <- PercentageFeatureSet(sce, pattern = '^MT-')
return(sce)
})
#data qc
sce_list[[1]] <- subset(sce_list[[1]], subset = nFeature_RNA > 500 & nFeature_RNA < 5000)
sce_list[[2]] <- subset(sce_list[[2]], subset = nFeature_RNA > 500 & nFeature_RNA < 3000)
sce_list[[3]] <- subset(sce_list[[3]], subset = nFeature_RNA > 500 & nFeature_RNA < 3000)
sce_list[[4]] <- subset(sce_list[[4]], subset = nFeature_RNA > 500 & nFeature_RNA < 3000)
#merge sample
sce_all <- merge(x = sce_list[[1]],
y = c(sce_list[[2]], sce_list[[3]], sce_list[[4]]),
add.cell.ids = samples)
sce_all <- subset(sce_all, subset = percent.mt < 10)
sce_all <- sce_all %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData()
sce_all <- RunPCA(sce_all, features = VariableFeatures(sce_all), npcs = 30)
#run harmony
sce_all <- RunHarmony(sce_all, group.by.vars = 'orig.ident')
sce_all <- sce_all %>%
RunUMAP(reduction = "harmony", dims = 1:20) %>%
FindNeighbors(reduction = "harmony", dims = 1:20)
seq <- seq(0.1, 1, by = 0.1)
for(res in seq){
sce_all <- FindClusters(sce_all, resolution = res)
}
clustree
clustree通過可視化不同分辨率下不同cluster之間的交互關(guān)系來幫助我們選擇合適的分辨率來進(jìn)行下游分析。
library(clustree)
library(patchwork)
p1 <- clustree(sce_all, prefix = 'RNA_snn_res.') + coord_flip()
p2 <- DimPlot(sce_all, group.by = 'RNA_snn_res.0.3', label = T)
p1 + p2 + plot_layout(widths = c(3, 1))
通過這個(gè)圖我們會(huì)發(fā)現(xiàn)當(dāng)resolution大于0.3后不同cluster在不同的分辨率下會(huì)存在越來越多的相互交織罪治,這暗示著我們可能存在分群過度的情況了丽声,所以在這里我們可以選擇0.3作為初步的分群resolution來進(jìn)行后面的分析。
marker gene AUC值
這個(gè)方法來自于今年6月發(fā)表于Nature上的一篇文章:Single-nucleus profiling of human dilated and hypertrophic cardiomyopathy觉义。在其Method部分提到了他們確定resolution的方法:Then, Leiden resolution was increased from 0.05 to 1.0 in increments of 0.05 until a cluster emerged with no genes having area under the receiver operating characteristic curve (AUC) > 0.60 in predicting that class.
這個(gè)AUC值使用FindAllMarkers()
函數(shù)就能實(shí)現(xiàn)雁社,只需要設(shè)置test.use = ‘roc’
即可,其實(shí)原理也很簡單晒骇,主要是通過ROC分析來對marker基因的表達(dá)特異性進(jìn)行評估霉撵,也就是將marker基因作為分類器來對細(xì)胞進(jìn)行分類,用AUC值來評估分類的好壞洪囤,顯然AUC值越接近于1或0就表示這個(gè)基因有非常好的細(xì)胞表達(dá)特異性徒坡,相反,如果鑒定出來的marker基因AUC值接近0.5就表明這個(gè)marker基因沒有很好的細(xì)胞分類能力瘤缩,這也提示我們這兩個(gè)細(xì)胞亞群之間可能并沒有非常顯著的生物學(xué)差異喇完,我們可能分群“分過度”了。
根據(jù)文章的描述剥啤,分群“恰到好處”的標(biāo)準(zhǔn)是锦溪,增加resolution直至出現(xiàn)一個(gè)細(xì)胞亞群不脯,其所有marker基因的AUC值均低于0.6,我們就認(rèn)為resolution再增加就會(huì)出現(xiàn)“過度分群”的狀況刻诊。我們看一下不同分辨率下每個(gè)細(xì)胞亞群marker基因的AUC值情況:
data("pbmc3k.final")
seq <- seq(0.1, 1.5, by = 0.1)
for (res in seq){
pbmc3k.final <- FindClusters(pbmc3k.final, resolution = res)
}
#res = 0.5
pbmc3k.final %>%
SetIdent(value = 'RNA_snn_res.0.5') %>%
FindAllMarkers(test.use = 'roc') %>%
filter(myAUC > 0.6) %>%
count(cluster, name = 'number')
## cluster number
## 1 0 45
## 2 1 6
## 3 2 87
## 4 3 31
## 5 4 19
## 6 5 147
## 7 6 62
## 8 7 111
## 9 8 140
#res = 1.0
pbmc3k.final %>%
SetIdent(value = 'RNA_snn_res.1') %>%
FindAllMarkers(test.use = 'roc') %>%
filter(myAUC > 0.6) %>%
count(cluster, name = 'number')
## cluster number
## 1 0 49
## 2 1 6
## 3 2 31
## 4 3 20
## 5 4 117
## 6 5 49
## 7 6 147
## 8 7 68
## 9 8 1
## 10 9 111
## 11 10 140
#res = 1.5
pbmc3k.final %>%
SetIdent(value = 'RNA_snn_res.1.5') %>%
FindAllMarkers(test.use = 'roc') %>%
filter(myAUC > 0.6) %>%
count(cluster, name = 'number')
## cluster number
## 1 0 6
## 2 1 38
## 3 2 31
## 4 3 19
## 5 4 117
## 6 5 35
## 7 6 49
## 8 7 147
## 9 8 2
## 10 9 68
## 11 10 111
## 12 11 140
ROGUE
這個(gè)方法是由張澤民老師團(tuán)隊(duì)開發(fā)的防楷,一個(gè)很好的、單純的細(xì)胞亞群應(yīng)該是其中的細(xì)胞基因表達(dá)具有相似的水平则涯,沒有具有顯著差異的差異表達(dá)基因域帐,ROGUE使用一個(gè)基于熵的方法來對一個(gè)細(xì)胞類群的“純度”來進(jìn)行評估。一個(gè)細(xì)胞類群的ROGUE值表征了該細(xì)胞類群的“純度”:越接近于1表示細(xì)胞類群越“純”是整,具體細(xì)節(jié)可以去看張澤民老師的文章:An entropy-based metric for assessing the purity of single cell populations肖揣。
首先使用ROGUE
內(nèi)置數(shù)據(jù)集進(jìn)行分析:
library(ROGUE)
# inner data of package
expr <- readRDS('DC.rds')
meta <- readRDS('info.rds')
# filter low quality data
expr <- matr.filter(expr, min.cells = 10, min.genes = 10)
p1 <- rogue(expr, labels = meta$ct, samples = meta$Patient,
platform = 'UMI', span = 0.6) %>%
rogue.boxplot() + ggtitle('4 clusters')
p2 <- rogue(expr, labels = meta$`Major cell type`, samples = meta$Patient,
platform = 'UMI', span = 0.6) %>%
rogue.boxplot() + ggtitle('2 clusters')
p1 + p2
可以看到這里的分群在不同的樣本中都具有比較好的均一性。
下面用我們自己的GSE197778數(shù)據(jù)集進(jìn)行測試:
sce_sub <- sce_all %>% subset(., downsample = 100)
expr <- GetAssayData(sce_sub, slot = 'counts') %>% as.matrix()
meta <- sce_sub@meta.data
rogue(expr = expr,
labels = meta$RNA_snn_res.0.3,
samples = meta$orig.ident,
platform = 'UMI',
span = 0.6) %>% rogue.boxplot()
最后浮入,我們就可以利用細(xì)胞類型的基因marker對細(xì)胞進(jìn)行注釋啦龙优,最終可以得到下面的結(jié)果:
sce_all <- SetIdent(object = sce_all, value = 'RNA_snn_res.0.3')
DotPlot(object = sce_all,
features = c('PTPRC', 'CD3D', #T cell
'MS4A1', 'CD79A', #B cell
'JCHAIN', 'IGHG1', #plasma cell
'LYZ', 'CD68', 'FCGR3A', 'CD14', #myeloid cell
'EPCAM', 'CDH1', 'KRT19', #epithelial cell
'PECAM1', 'VWF', #endothelial cell
'COL1A1', 'COL1A2', 'DCN' #fibroblast cell
)) +
coord_flip()
#add cell type
#T cell
T_cluster <- c(0, 1, 4, 8, 14)
#B cell
B_cluster <- 2
#plasma cell
Plasma_cluster <- 9
#myeloid cell
Myeloid_cluster <- c(5, 13)
#epithelial cell
Epithelial_cluster <- 11
#endothelial
Endothelial_cluster <- c(7, 10)
#fibroblast
Fibroblast_cluster <- c(3, 6, 12)
sce_all@meta.data <- sce_all@meta.data %>%
mutate(celltype = case_when(
RNA_snn_res.0.3 %in% T_cluster ~ 'T',
RNA_snn_res.0.3 %in% B_cluster ~ 'B',
RNA_snn_res.0.3 %in% Plasma_cluster ~ 'Plasma',
RNA_snn_res.0.3 %in% Myeloid_cluster ~ 'Myeloid',
RNA_snn_res.0.3 %in% Epithelial_cluster ~ 'Epithelial',
RNA_snn_res.0.3 %in% Endothelial_cluster ~ 'Endotheliall',
RNA_snn_res.0.3 %in% Fibroblast_cluster ~ 'Fibroblast'
))
sce_all@meta.data$celltype <- factor(sce_all@meta.data$celltype,
levels = c('T', 'B', 'Plasma',
'Myeloid', 'Epithelial',
'Endotheliall', 'Fibroblast'))
p3 <- DimPlot(sce_all, group.by = 'celltype', label = T) + theme(legend.position = 'none')
p4 <- DotPlot(sce_all,
features = c('PTPRC', 'CD3D', #T cell
'MS4A1', 'CD79A', #B cell
'JCHAIN', 'IGHG1', #plasma cell
'LYZ', 'CD68', 'FCGR3A', 'CD14', #myeloid cell
'EPCAM', 'CDH1', 'KRT19', #epithelial cell
'PECAM1', 'VWF', #endothelial cell
'COL1A1', 'COL1A2', 'DCN' #fibroblast cell
),
group.by = 'celltype') +
coord_flip() +
theme(axis.text.x = element_text(hjust = 1, angle = 45),
axis.title = element_blank()) +
ggtitle(label = 'GSE197778')
p3 + p4