hello华畏,國(guó)慶上班了谨敛,不知道大家過的怎么樣,今天我們來分享一個(gè)新的多樣本整合批次去除的好方法抗悍,RCA2驹饺,很好的方法,文章在RCA2: a scalable supervised clustering algorithm that reduces batch effects in scRNA-seq data,2021年9月份發(fā)表于Nucleic Acids Research檐春,影響因子17分逻淌,我們簡(jiǎn)單看看文獻(xiàn)么伯,最后看一下示例代碼疟暖。
說起樣本之間的批次效應(yīng),其實(shí)談了很多了,但是想很好的去除卻并非容易的事情俐巴,下面列舉了一些批次效應(yīng)出去的方法骨望,大家感興趣可以參考。
先看看文章的一些重點(diǎn)信息
ABSTRACT
1欣舵、using multiple benchmarks that supervised clustering, which uses reference transcriptomes as a guide, is robust to batch effects and data quality artifacts.(當(dāng)然之前我們都是無監(jiān)管聚類擎鸠,效果好不好很難評(píng)判。)
2缘圈、RCA2劣光,這是第一個(gè)將參考投影(批量效應(yīng)魯棒性)與基于圖的聚類(可擴(kuò)展性)相結(jié)合的算法。 當(dāng)然糟把,也包括很多下游的分析模塊绢涡。RCA2 also provides new reference panels for human and mouse and supports generation of custom panels.(這個(gè)panel需要我們注意)。
3遣疯、RCA2 facilitates cell type-specific QC, which is essential for accurate clustering of data from heterogeneous tissues(關(guān)于數(shù)據(jù)質(zhì)控雄可,多次強(qiáng)調(diào)過,質(zhì)控是基礎(chǔ)缠犀,必須要做好数苫,基礎(chǔ)不牢,后面分析的再好都是錯(cuò)的)辨液。
4虐急、Scalable supervised clustering methods such as RCA2 will facilitate unified analysis of cohort-scale SC datasets.(這也是單細(xì)胞數(shù)據(jù)追求的目標(biāo))。
INTRODUCTION
單細(xì)胞聚類的兩種方法滔迈,(i) unsupervised (de-novo) clustering, which is the most prevalent(Louvain graph-based clustering algorithm is the most prevalent), and (ii) supervised clustering, which exploits a panel of reference transcriptomes
單細(xì)胞聚類方法存在的挑戰(zhàn)戏仓,(i) cells may cluster by technical variation and batch effects rather than biological properties, (ii) scRNA-seq data tend to be noisy, primarily due to sampling noise and (iii) the gene expression matrix can be very large, since modern datasets commonly include > 100,000 cells.(相信這樣的問題大家都遇到過)。所以不同的聚類方法往往導(dǎo)致不同的分析結(jié)果亡鼠。此外赏殃,從頭聚類需要一個(gè)容易出錯(cuò)、耗時(shí)的手動(dòng)步驟间涵,根據(jù)對(duì)標(biāo)記基因表達(dá)的主觀評(píng)估將細(xì)胞簇分配給細(xì)胞類型(注釋)仁热。 已經(jīng)開發(fā)了監(jiān)督聚類和監(jiān)督細(xì)胞類型注釋算法來解決這些限制。
Unlike the above-mentioned methods, RCA was not primarily designed for cell type annotation. Rather, the objective of RCA is to cluster single cells in the space of reference transcriptome projections. This is fundamentally different from unsupervised clustering approaches, which cluster cells in the space defined by over a thousand feature genes (關(guān)于之前的版本RCA勾哩,大家可以網(wǎng)上找一找抗蠢,之間用過,效果一般)思劳。RCA is the only supervised clustering algorithm for scRNA-seq data. However, the original version of RCA could not scale to datasets larger than 20,000 cells on a high-end laptop, used only a single reference panel, did not implement methods to identify differential gene expression, did not offer KEGG and Gene Ontology (GO) enrichment analysis, was benchmarked on only a single Smart-seq dataset and could not be easily integrated into existing data analysis workflows.(看來新改進(jìn)的RCA2的方法是一個(gè)單細(xì)胞數(shù)據(jù)分析的pipieline)迅矛。
MATERIALS AND METHODS
這個(gè)部分需要很深的數(shù)學(xué)知識(shí)了。
Projection to a reference潜叛,這個(gè)地方是對(duì)參考和query數(shù)據(jù)的預(yù)處理
Clustering and interpreting the projection
Reference panels秽褒,提供了一些固有的參考數(shù)據(jù)集
當(dāng)然壶硅,也很其他的去除批次的聚類方法進(jìn)行了比較,包括SEURAT, SEURAT INTEGRATION, SCTRANSFORM, SCTRANSFORM INTEGRATION, SCRAN (30), SCANPY, MNNCORRECT (31) and SCANORAMA销斟,我們后續(xù)看看結(jié)果如何庐椒。
Silhouette Index for quantifying batch effect(這個(gè)地方用來衡量批次去除的效果如何)。提到一個(gè)SI指數(shù)蚂踊,大家可以查一下约谈。
Result
RCA2的work flow,基本就是一套單細(xì)胞的分析流程
包括RCA2和之前的版本RCA的區(qū)別
多方法比較
- 注:Note that a robust method should have low batch SI and high cell type SI, indicating that cells are separated by cell type rather than by batch.
RCA2聚類后的差異基因熱圖犁钟,區(qū)分的很明顯
當(dāng)然棱诱,后面還有一些其他數(shù)據(jù)的測(cè)試,我們就不一一減少了涝动,不過有一點(diǎn)很有意思军俊,如下圖,后續(xù)的代碼分享會(huì)介紹到
來吧捧存,看看代碼
library(RCAv2)
PBMCs<-RCAv2::createRCAObjectFrom10X("10xPBMCs/")##官方數(shù)據(jù)為例
Perform basic QC steps and data normalization
PBMCs<-RCAv2::dataFilter(PBMCs,
nGene.thresholds = c(300,5000),
nUMI.thresholds = c(400,30000),
percent.mito.thresholds = c(0.025,0.2),
min.cell.exp = 3,
plot=T,
filename = "PBMCs_filter_example.pdf")
PBMCs<-RCAv2::dataLogNormalise(PBMCs)
Compute a projection to a reference data set,里面的參數(shù)大家需要注意
PBMCs<-RCAv2::dataProject(PBMCs,
method = "GlobalPanel",
corMeth = "pearson")
In addition to the GlobalPanel, RCA now provides 12 reference panels:(當(dāng)然粪躬,也可以自己指定)
- GlobalPanel from the original RCA (Li et al., 2017) containing both primary cell types and tissues. Can be limited to only cell types with "GlobalPanel_CellTypes.
- ColonEpiPanel: 9 colon epithelial samples from Li et al. (Nature genetics, 2017).
- MonacoPanel: 29 PBMC cell types from Monaco, G., et al. (Cell reports, 2019).
- MonacoBCellPanel: 5 B cell sub-types from Monaco, G., et al. (Cell reports, 2019).
- MonacoMonoPanel: 5 Monocyte sub-types from Monaco, G., et al. (Cell reports, 2019).
- MonacoTCellPanel: 15 T cell sub-types from Monaco, G., et al. (Cell reports, 2019).
- CITESeqPanel based on Seurat 4.0 containing 34 cell types.
- ENCODEHumanPanel: 93 human cell types from ENCODE.
- NovershternPanel: 15 PBMC cell types from Novershtern et al. (Cell, 2011).
- NovershternTCellPanel: 6 T cell sub-types from Novershtern et al. (Cell, 2011).
- ENCODEMousePanel: 15 mouse cell types from ENCODE.
- ZhangMouseBrainPanel: 7 mouse brain cell types from Tasic et al. (Nature neuroscience, 2016).
To use the custom panel MyPanel.RDS use the following command:
PBMCs<-RCAv2::dataProject(PBMCs,
method = "Custom",
customPath = "MyPanel.RDS",
corMeth = "pearson")
To benefit from multiple panels at the same time, users can exploit the dataProjectMultiPanel function:
PBMCs<-RCAv2::dataProjectMultiPanel(PBMCs,method=list("NovershternPanel",
"MonacoPanel", "GlobalPanel_CellTypes"),corMeth="pearson")
Cluster the projection and visualize it
PBMCs<-RCAv2::dataClust(PBMCs)
RCAv2::plotRCAHeatmap(PBMCs,filename = "Heatmap_PBMCs.pdf",var.thrs=1)
PBMCs<-computeUMAP(PBMCs)
RCAv2::plotRCAUMAP(PBMCs,filename = "UMAP_PBMCs.pdf")
PBMCs<-computeUMAP(PBMCs, nDIMS = 3)
RCAv2::plotRCAUMAP3D(PBMCs,filename = "UMAP3D_PBMCs.html")
#Estimate the most probable cell type label for each cell
PBMCs<-estimateCellTypeFromProjection(PBMCs,confidence = NULL)
#Generate the cluster composition plot
RCAv2::plotRCAClusterComposition(PBMCs,filename="Cluster_Composition.pdf")
Based on the heatmap as well as the stacked bar plots we can relabel the clusters according to the major cell type annotations:
RCAcellTypes<-PBMCs$clustering.out$dynamicColorsList[[1]]
RCAcellTypes[which(RCAcellTypes=="blue")]<-"Monocytes"
RCAcellTypes[which(RCAcellTypes=="green")]<-"Dentritic cells"
RCAcellTypes[which(RCAcellTypes=="yellow")]<-"B cells"
RCAcellTypes[which(RCAcellTypes=="grey")]<-"B cells"
RCAcellTypes[which(RCAcellTypes=="brown")]<-"NK cells"
RCAcellTypes[which(RCAcellTypes=="turquoise")]<-"T cells"
RCAcellTypes[which(RCAcellTypes=="red")]<-"Myeloid cells"
RCAcellTypes[which(RCAcellTypes=="black")]<-"Progenitor cells"
范例
CD56Exp<-PBMCs$data[which(rownames(PBMCs$data)=="NCAM1"),]
RCAv2::plotRCAUMAP(PBMCs,cellPropertyList = list(CellTypes=RCAcellTypes,CD56=CD56Exp),filename = "UMAP_PBMCs.pdf")
To obtain cluster based cell type predictions, the user can run the function estimateCellTypeFromProjectionPerCluster:
PBMCs<-RCAv2::createRCAObjectFrom10X("../Documents/10xExample/")
PBMCs<-RCAv2::dataFilter(PBMCs,nGene.thresholds = c(300,4500),
percent.mito.thresholds = c(0.025,0.1),
min.cell.exp = 3)
PBMCs<-RCAv2::dataLogNormalise(PBMCs)
PBMCs<-RCAv2::dataProject(PBMCs,
method = "NovershternPanel",
corMeth = "pearson",
nPCs=0,
approx= FALSE)
PBMCs<-RCAv2::dataSClust(PBMCs,res = 0.15)
PBMCs<-estimateCellTypeFromProjectionPerCluster(PBMCs)
Graph based clustering as an alternative to hierarchical clustering
由于單細(xì)胞數(shù)據(jù)集的大小不斷增加,層次聚類需要具有大內(nèi)存的機(jī)器昔穴。 為了克服這個(gè)假定的限制镰官,RCAv2 還使用共享的最近鄰 (snn) 方法提供基于圖的聚類
PBMCs<-RCAv2::dataSNN(PBMCs,k=100,eps=25,minPts=30,dist.fun="All",corMeth="pearson")
This function has three main parameters: * k as the number of considered neighbours per cell, * eps as the minimum number of shared neighbours between to cells, * minPts minimum number of points that share eps neighbours such that a point is considered a core point.
The dist.fun parameter controls whether the distance matrix used for SNN clustering is based on the full correlation distance matrix or, if dist.fun is set to PCA, on a PC reduction of the reference projection. The corMeth sets which correlation function is used for the distance computation (pearson (default), spearman or kendal). As for the hierarchical clustering, heatmaps and umaps can be generated as well.
To help the user choosing the parameters for clustering, we provide a parameter space exploration feature leading to a 3D umap illustrating the number of clusters depending on the three parameters, as shown below.
上圖可用下面的代碼生成
parameterSpaceSNN(PBMCs,kL=c(30:50),epsL=c(5:20),minPtsL=c(5:10),folderpath=".",filename="Graph_based_Clustering_Parameter_Space.html")
where kL, epsL and minPtsL define the search space for k, eps and minPts respectively. Note that executing this function will take longer for large search spaces.
In addition to those clustering parameters, via the dist.fun parameter one can choose whether a PCA reduction of the projection matrix, or the entire projection matrix should be used to construct the snn graph.
Using Louvain graph clustering implemented in Seurat
PBMCs<-dataSClust(PBMCs,res=0.5)
其中 res 是解析到 Seurat 聚類函數(shù)的分辨率參數(shù)。
基于圖的聚類吗货,RCA 投影熱圖將在沒有列樹狀圖的情況下繪制泳唠。
此函數(shù)使用與原始 Seurat 函數(shù)相同的 PCA 近似值。 將 approx 設(shè)置為 FALSE 將計(jì)算準(zhǔn)確的 PCA宙搬。 如果 corMeth 參數(shù)設(shè)置為 pearson笨腥、spearman 或 kendal,則該函數(shù)將使用相關(guān)距離而不是默認(rèn)的歐幾里得距離來計(jì)算全距離矩陣勇垛。 這也可以與精確的 PCA 結(jié)合使用脖母。 但是,并不是這些選項(xiàng)需要比默認(rèn)值更多的可用內(nèi)存闲孤。 使用肘部繪圖功能谆级,可以生成指導(dǎo)選擇要考慮的 PC 數(shù)量的 ElbowPlot。
Clustering free analysis of the projection
特別是對(duì)于非常大的數(shù)據(jù)集讼积,對(duì)投影進(jìn)行聚類可能具有挑戰(zhàn)性肥照。 對(duì)于這些情況,RCA 包括由 SingleR 和 scMatch 驅(qū)動(dòng)的聚類獨(dú)立細(xì)胞類型分配方法勤众,該方法純粹基于基于參考投影的每個(gè)細(xì)胞 z 分?jǐn)?shù)分布舆绎。 對(duì)函數(shù)的調(diào)用
PBMCs<-estimateCellTypeFromProjection(PBMCs,confidence = NULL)
將為每個(gè)cell返回最可能的cell類型并將其保存在 PBMCs 對(duì)象中。 通過參數(shù)置信度们颜,可以對(duì)兩種最可能的細(xì)胞類型之間的比率施加閾值(0 和 1 之間)吕朵。 在不確定的情況下猎醇,cell將被標(biāo)記為未知。
table(unlist(PBMCs$cell.Type.Estimate))
BDCA4._DentriticCells CD14._Monocytes CD19._BCells.neg._sel..
69 108 307
CD33._Myeloid CD34. CD4._Tcells
1363 3 2266
CD56._NKCells CD8._Tcells L45_CMP_Bone.Marrow
458 364 4
L51_B.Cell_Bone.Marrow
#Retrieve annotation
SimplifiedAnnotation<-unlist(PBMCs$cell.Type.Estimate)
#Relabel it
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD33._Myeloid")]<-"Myeloid"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD4._Tcells")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD8._Tcells")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD14._Monocytes")]<- "Monocytes"
SimplifiedAnnotation[which(SimplifiedAnnotation=="BDCA4._DentriticCells")]<-"Dentritic cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L93_B.Cell_Plasma.Cell")]<- "B cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L52_Platelet")]<-"Myeloid"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L74_T.Cell_CD4.Centr..Memory")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L51_B.Cell_Bone.Marrow")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L75_T.Cell_CD4.Centr..Memory")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L85_NK.Cell_CD56Hi")]<-"NK cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD34.")]<-"Progenitor"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L45_CMP_Bone.Marrow")]<- "Progenitor"
SimplifiedAnnotation[which(SimplifiedAnnotation=="WholeBlood")]<- "Myeloid"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L69_Dendritic.Cell_Monocyte.derived")]<- "Myeloid"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L80_T.Cell_CD8.Eff..Memory")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L60_Monocyte_CD16")]<- "Monocytes"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L86_NK.Cell_CD56Lo")]<-"NK cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L73_T.Cell_CD4.Naive")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD56._NKCells")]<-"NK cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD19._BCells.neg._sel..")]<- "B cells"
umapFigures<-RCAv2::plotRCAUMAP(PBMCs,
cellPropertyList = list(`Cell Type`=SimplifiedAnnotation),
filename = "UMAP_PBMCs.pdf")
Compute DE genes for RCA clusters
PBMCs<-RCAv2::dataDE(PBMCs,
logFoldChange = 1.5,
method = "wilcox",
mean.Exp = 0.5,
deep.Split.Values = 1,
min.pct = 0.25,
min.diff.pct = -Inf,
random.seed = 1,
min.cells.group = 3,
pseudocount.use = 1,
p.adjust.methods = "BH",
top.genes.per.cluster = 10
)
Here, is the required logFoldChange to call a gene to be differentially expressed. The method parameter indicates which statistical test is used. Multiple test correction is perfomed using the method indicated in p.adjust.methods. The parameters mean.Exp and min.pct indicat the minimum expression value as well as the minimum percentage of cells expressing a gene. Furthermore, the pseudocount can be adjusted via the pseudocount.use parameter. The top.genes.per.cluster parameter indicats how many genes should be selected as top DE genes per pairwise comparison for each cluster. Both the entire set of DE genes as well as the top DE genes are stored in the PBMCs rca.obj. The topDE genes can be plotted in a heatmap via the plotDEHeatmap function:
RCAv2::plotDEHeatmap(PBMCs,scale=FALSE)
The scale parameter allows the user to plot either the normalized UMI counts or scaled count (z-transformed). An example is shown below.
Compute enrichment for GO terms and KEGG pathways
使用 clusterProfiler 包边锁,RCAv2 支持分別使用函數(shù) doEnrichGo 和 doEnrichKEGG 對(duì) GO 術(shù)語和 KEGG 通路進(jìn)行富集測(cè)試姑食。 兩者都需要設(shè)置參數(shù)注釋波岛,這對(duì)于 ID 映射和 GO-term 分配都需要茅坛。 人類注釋的一個(gè)例子是 Bioconductor 上的 org.Hs.eg.db。
GO分析
doEnrichGo<-function(rca.obj,
annotation=NULL,
ontology="BP",
p.Val=0.05,
q.Val=0.2,
p.Adjust.Method="BH",
gene.label.type="SYMBOL",
filename="GoEnrichment.pdf",
background.set="ALL",
background.set.threshold=NULL,
n.Cells.Expressed=NULL,
cluster.ID=NULL,
deep.split=NULL)
其中本體是 BP则拷、MF 或 CC贡蓖,p.Val 和 q.Val 是 clusterprofiler 使用的閾值,p.Adjust.Method 指示使用哪種方法來糾正多次測(cè)試煌茬。為了給用戶帶來更大的方便斥铺,該功能會(huì)自動(dòng)映射基因 ID。為此坛善,gene.label.type 保存原始標(biāo)簽的類型晾蜘。根據(jù) 10X 數(shù)據(jù),默認(rèn)設(shè)置為 SYMBOL眠屎。背景集要么基于所有集群剔交,要么僅基于調(diào)查的集群。通過 background.set.threshold 或 n.Cells.Expressed" 參數(shù)選擇單元格改衩。請(qǐng)注意岖常,前者是數(shù)值或以下值之一:Min、1stQ葫督、Mean竭鞍、Median、3thQ橄镜。計(jì)算這些閾值用于所有基因的所有平均表達(dá)值的分布偎快。如果只對(duì)一個(gè)特定的簇進(jìn)行分析,可以設(shè)置參數(shù)cluster.ID*洽胶,如果分層滨砍,可以指定deep.split的值來選擇自定義拆分已經(jīng)使用了聚類。
doEnrichGo 函數(shù)分別為每個(gè)集群生成條形圖妖异、goplots 和點(diǎn)圖惋戏。可以使用 filename 參數(shù)修改文件名他膳。顯示了 PBMC NK 細(xì)胞簇的示例條形圖和點(diǎn)圖响逢。描繪了 PBMC B 細(xì)胞的 Goplots。
KEGG分析
doEnrichKEGG<-function(rca.obj,
annotation=NULL,
org="hsa",
key="kegg",
p.Val=0.05,
q.Val=0.2,
p.Adjust.Method="BH",
gene.label.type="SYMBOL",
filename="KEGGEnrichment.pdf",
background.set="ALL",
background.set.threshold=NULL,
n.Cells.Expressed=NULL,
cluster.ID=NULL,
deep.split=NULL)
Cluster/Cell-type specific quality control RCAv2 offers straightforward ways to perform cluster-specific quality control. We illustrate this functionality using an inhouse dataset of 45926 cells obtained from five bone marrow samples. A link to download the data will be made available here at a later stage. First, we load the data, project it against the global panel and cluster it:
normalBoneMarow<-readRDS("../Documents/DUKE_Normal.RDS")
createRCAObject()
PBMCs<-RCAv2::createRCAObject(normalCML@assays$RNA@data,dataIsNormalized = T)
PBMCs<-RCAv2::dataProject(PBMCs,
method = "GlobalPanel",
corMeth = "pearson")
PBMCs<-RCAv2::dataSClust(PBMCs,res = 0.1)
RCAv2::plotRCAHeatmap(PBMCs,filename = "Control_HeatmapPostQC.pdf")
we identify the cluster IDs as:
cellTypes<-c("Progenitor B","CMP/MEP","CMP/GMP","GMP/Dendritic cells","CD8 T cells","NK cells","CD4 T cells", "B cells", "Erythroid Progenitor","Monocytes","BT")
clusterColors<-c("purple","black","blue","magenta","turquoise","yellow","green","pink","greenyellow","red","brown")
names(cellTypes)<-clusterColors
cellTypeLabels<-cellTypes[PBMCs$clustering.out$dynamicColorsList[[1]]]
and plot cluster quality scores using
plotClusterQuality(PBMCs,width = 15,height = 9,cluster.labels = cellTypeLabels)
Combining RCA with Seurat
Data processing can also be carried out with Seurat. Here is an example how you can combine a RCA analysis with data preprocessed in Seurat.
Load and preprocess data
Using the same 10x data as before, we generate a Seurat object and perform an initial analysis:
library(Seurat)
#Load the data
PBMCs.10x.data<-Seurat::Read10X('../Downloads/10xExample/')
#Generate a Seurat object
pbmc_Seurat <- CreateSeuratObject(counts = PBMCs.10x.data$`Gene Expression`,
min.cells = 3,
min.features = 200,
project = '10X_PBMC',
assay = 'RNA')
#Compute the percentage of mitochondrial rates
mito.genes<-grep(pattern='^MT-',x=rownames(pbmc_Seurat@assays[['RNA']]),value=T)
percent.mito <- Matrix::colSums(pbmc_Seurat@assays[['RNA']][mito.genes, ])/
Matrix::colSums(pbmc_Seurat@assays[['RNA']])
pbmc_Seurat <- AddMetaData(object = pbmc_Seurat, metadata = percent.mito, col.name = 'percent.mito')
#Perform QC using the same parameters as above
pbmc_Seurat <- subset(pbmc_Seurat, nFeature_RNA >300 & nFeature_RNA < 5000 &
nCount_RNA > 400 & nCount_RNA<30000 &
percent.mito > 0.025 & percent.mito < 0.2)
#Normalize the data
pbmc_Seurat <- NormalizeData(object = pbmc_Seurat, normalization.method = 'LogNormalize', scale.factor = 10000)
To run RCA, no further processing steps would be needed. However, we want to also compare the RCA result to the Seurat based clustering, therefore we first go on with a Seurat based analysis:
#Find HVGs
pbmc_Seurat <- FindVariableFeatures(object = pbmc_Seurat,
mean.function = ExpMean,
dispersion.function = LogVMR,
x.low.cutoff = 0.0125,
x.high.cutoff = 3,
y.cutoff = 0.5,
nfeatures = 2000)
#Center and scale the data
pbmc_Seurat <- ScaleData(object = pbmc_Seurat)
#Run PCA on the data
pbmc_Seurat <- RunPCA(object = pbmc_Seurat, npcs = 50, verbose = FALSE)
#Plot different aspsects of the pca
ElbowPlot(object = pbmc_Seurat,ndims = 50)
Based on the Elbowplot (not shown here), we use 20 PCs for further analysis.
#Find Neighbors
pbmc_Seurat <- FindNeighbors(pbmc_Seurat, reduction = 'pca', dims = 1:20)
#Find Clusters
pbmc_Seurat <- FindClusters(pbmc_Seurat, resolution = 0.2, algorithm = 1)
We generate a UMAP of the data stored in the Seurat object using the umap R package:
#Load required libraries
library(umap)
library(ggplot2)
library(randomcoloR)
#Compute Umap from first 20PCs
umap_resultS<- umap(pbmc_Seurat@reductions$pca@cell.embeddings[,c(1:20)])
umap_resultSL<-as.data.frame(umap_resultS$layout)
#Derive distinguishable colors for the seurat clusters
myColors<-distinctColorPalette(length(unique(pbmc_Seurat$seurat_clusters)))
#Generate a UMAP
umapAll_Seurat_RCA<-ggplot(umap_resultSL,aes(x=V1,y=V2,color=pbmc_Seurat$seurat_clusters))+theme_bw(30)+
geom_point(size=1.5)+labs(colour='ClusterID')+theme(legend.title = element_text(size=10))+
guides(colour = guide_legend(override.aes = list(size=4)))+theme(legend.position = 'right')+
theme(legend.text=element_text(size=10))+scale_color_manual(values=myColors)+xlab('UMAP1')+ylab('UMAP2')
umapAll_Seurat_RCA
We obtain the following UMAP: Generate a RCA object and perform RCA analysis
We use the RCA function createRCAObject to generate a RCA object from the raw and optionally also the normalized data stored in our Seurat object.
library(RCAv2)
RCA_from_Seurat<-RCAv2::createRCAObject(pbmc_Seurat@assays$RNA@counts, pbmc_Seurat@assays$RNA@data)
Next, we can compute the projection, cluster the data, and estimate the most likely cell type for each cell as above:
#Compute projection
RCA_from_Seurat<-RCAv2::dataProject(rca.obj = RCA_from_Seurat)
#Cluster the projection
RCA_from_Seurat<-RCAv2::dataClust(RCA_from_Seurat)
#Estimate most likely cell type
RCA_from_Seurat<-RCAv2::estimateCellTypeFromProjection(RCA_from_Seurat)
Using the RCA cell type labels, RCA and Seurat clusters, we generate two new UMAPs whose coordinates are based on the PCs derived from HVGs and that are colored according to RCA clusters and cell type labels.
#Simplify the cell type annotation
SimplifiedAnnotation<-unlist(RCA_from_Seurat$cell.Type.Estimate)
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD33._Myeloid')]<-'Myeloid'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD4._Tcells')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD8._Tcells')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD14._Monocytes')]<- 'Monocytes'
SimplifiedAnnotation[which(SimplifiedAnnotation=='BDCA4._DentriticCells')]<-'Dentritic cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L93_B.Cell_Plasma.Cell')]<- 'B cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L52_Platelet')]<-'Myeloid'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L74_T.Cell_CD4.Centr..Memory')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L51_B.Cell_Bone.Marrow')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L75_T.Cell_CD4.Centr..Memory')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L85_NK.Cell_CD56Hi')]<-'NK cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD34.')]<-'Progenitor'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L45_CMP_Bone.Marrow')]<- 'Progenitor'
SimplifiedAnnotation[which(SimplifiedAnnotation=='WholeBlood')]<- 'Myeloid'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L69_Dendritic.Cell_Monocyte.derived')]<- 'Myeloid'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L80_T.Cell_CD8.Eff..Memory')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L60_Monocyte_CD16')]<- 'Monocytes'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L86_NK.Cell_CD56Lo')]<-'NK cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L73_T.Cell_CD4.Naive')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD56._NKCells')]<-'NK cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD19._BCells.neg._sel..')]<- 'B cells'
#Plot a umap colored by the simplified cell type labels
myColors<-distinctColorPalette(length(unique(SimplifiedAnnotation)))
umapAll_Seurat_Estimated_CT<-ggplot(umap_resultSL,
aes(x=V1,y=V2,color=SimplifiedAnnotation))+
theme_bw(30)+
geom_point(size=1.5)+
theme(legend.position = 'bottom')+
labs(colour='Cell type')+
guides(colour = guide_legend(override.aes = list(size=4)))+
theme(legend.text=element_text(size=10))+
scale_color_manual(values=myColors)+
ggtitle('b)')+
xlab('UMAP1')+ylab('UMAP2')+
theme(legend.title = element_text(size=12))
#Plot a umap colored by the RCA cluster ID
umapAll_Seurat_RCA_Clusters<-ggplot(umap_resultSL,
aes(x=V1,y=V2,color=RCA_from_Seurat$clustering.out$dynamicColorsList[[1]]))+
theme_bw(30)+
geom_point(size=1.5)+
theme(legend.position = 'bottom')+
labs(colour='RCA Cluster ID')+
guides(colour = guide_legend(override.aes = list(size=4)))+
theme(legend.text=element_text(size=10))+
xlab('UMAP1')+ylab('UMAP2')+
scale_color_identity(guide=guides(color=RCA_from_Seurat$clustering.out$dynamicColorsList[[1]]))+
ggtitle('a)')+
theme(legend.title = element_text(size=12))
#Combine the Figures into one
library(gridExtra)
grid.arrange(umapAll_Seurat_RCA_Clusters,umapAll_Seurat_Estimated_CT,nrow=1)
The RCA clusters show a high concordance to the Seurat clusters shown in the previous UMAP.
Add projection and annotations to the Seurat object
For greater convenience the results of RCA can be saved within the Seurat object for further analysis.
pbmc_Seurat[['RCA.clusters']]<-RCA_from_Seurat$clustering.out$dynamicColorsList
pbmc_Seurat[['cellTypeLabel']]<-RCA_from_Seurat$cell.Type.Estimate
pbmc_Seurat[['Projection']]<-CreateAssayObject(data=RCA_from_Seurat$projection.data)
Add a UMAP based on the projection to the Seurat object
Also, a UMAP reduction based on the projection space can be added to the Seurat object:
RCA_from_Seurat<-computeUMAP(RCA_from_Seurat)
pbmc_Seurat[['RCA_umap']]<-CreateDimReducObject(embeddings=as.matrix(RCA_from_Seurat$umap.coordinates),key='RCA_umap_',assay=DefaultAssay(pbmc_Seurat))
Visualizing RNA velocity on RCA result
RNA velocity describes the rate of gene expression change for an individual gene at a given time point based on the ratio of its spliced and unspliced messenger RNA (mRNA). Here, we describe how one can use the scvelo package, in Python, to visualize RNA velocity on the RCA generated result.
To transfer spliced RNA counts to scvelo, first transpose the raw RCA data matrix to get a cells x genes matrix, and export it to a CSV file.
# R
raw.data.counts <- t(rca_obj$raw.data)
write.table(x = raw.data.counts, file = 'raw_counts.csv', append = FALSE, quote = FALSE, sep = ',')
In addition, export the RCA projection and UMAP embeddings to respective CSV files too.
# R
projection.data <- as.matrix(t(rca_obj$projection.data[, -doublet_index]))
write.table(x = projection.data, file = 'projection_data.csv', append = FALSE, quote = FALSE, col.names = F, row.names = F, sep = ',')
umap.data <- as.matrix(rca_obj$umap.coordinates)
write.table(x = umap.data, file = 'umap_data.csv', append = FALSE, quote = FALSE, col.names = F, row.names = F, sep = ',')
Create an iPython notebook in the same folder and import the required packages as below.
# Python
import scvelo as scv
import scanpy as sc
import numpy as np
import pandas as pd
scv.set_figure_params()
Then, create a Scanpy object using the raw counts from the CSV file.
# Python
adata = sc.read_csv(filename='raw_counts.csv')
Populate the PCA slot in the Scanpy object as the projection data from RCA.
# Python
projection_data = np.loadtxt('bm_input/projection_data.csv',delimiter=',')
projection_data.shape
adata.obsm['X_pca'] = projection_data
Populate the UMAP slot in the Scanpy object as the umap coordinates from RCA.
# Python
umap_data = np.loadtxt('bm_input/umap_data.csv',delimiter=',')
umap_data.shape
adata.obsm['X_umap'] = umap_data
Load the unspliced loom object generated by velocyto.
# Python
ldata = scv.read('merged.loom', cache=True)
Then, merge the spliced and unspliced objects together as described below:
# Python
merged_data = scv.utils.merge(adata, ldata)
As recommended by the scvelo tutorial, perform the following steps to compute RNA velocity:
# Python
scv.pp.filter_and_normalize(merged_data)
scv.pp.moments(merged_data)
scv.tl.velocity(merged_data, mode='stochastic')
scv.tl.velocity_graph(merged_data)
It is possible that not all barcodes had sufficient quality of both spliced and unspliced reads, and thus some cells may have been discarded during the merging process. To ensure your cell type labels are still maintained, export the merged data observations from the merged scvelo object to a CSV file.
# Python
merged_data.obs.to_csv('merged_data_obs.csv')
In R, load this CSV file in and extract the RCA labels and filter only those which were considered in the merged data by scvelo.
# R
merged_data_obs <- read.csv(file = 'merged_data_obs.csv', row.names = 1)
rca_clusters <- rca_obj$clustering.out$dynamicColorsList$Clusters
names(rca_clusters) <- colnames(rca_obj$raw.data)
rca_clusters <- rca_clusters[rownames(merged_data_obs)]
Note: If your cell names have underscores in them, scanpy will automatically split the cell name into barcode and sample_batch.
In this case, replace the last line of the above block of code with the following:
# R
merged_barcodes <- paste0(merged_data_obs$sample_batch, rownames(merged_data_obs))
rca_clusters <- rca_clusters[merged_barcodes]
Now export these cluster labels to a CSV file.
# R
rca_cluster_df <- data.frame(Clusters = rca_clusters)
write.table(x = rca_cluster_df, file = 'rca_cluster_df.csv', append = FALSE, quote = FALSE, col.names = T, row.names = F, sep = ',')
Back in the scvelo iPynb, load this RCA cluster annotation table and set it as the observation slot of your merged data.
# Python
rca_clusters = pd.read_csv('rca_cluster_df.csv')
merged_data.obs = rca_clusters
Now, it’s finally time to visualize the RNA velocity results. There are 3 visualization options provided by scvelo, namely velocity_embedding, velocity_embedding_grid and velocity_embedding_stream. Use them as demonstrated below
# Python
### Velocity embedding
scv.pl.velocity_embedding(merged_data, basis='umap', color = ['Clusters'], legend_loc = 'right margin', palette = 'tab20', figsize = (10,10), save = 'embedding.png')
Using RCA colors
Since the RCA clusters already have color annotations, you can use the RCA colors in the palette as described below:
# Python
### Velocity embedding
scv.pl.velocity_embedding(merged_data, basis='umap', color = ['Clusters'], legend_loc = 'right margin', palette = merged_data.obs['Clusters'].sort_values().unique().tolist(), figsize = (10,10), save = 'RCAColor_embedding.png')
生活很好棕孙,有你更好