10X單細(xì)胞(10X空間轉(zhuǎn)錄組)多樣本批次效應(yīng)去除分析之RCA2

hello华畏,國(guó)慶上班了谨敛,不知道大家過的怎么樣,今天我們來分享一個(gè)新的多樣本整合批次去除的好方法抗悍,RCA2驹饺,很好的方法,文章在RCA2: a scalable supervised clustering algorithm that reduces batch effects in scRNA-seq data,2021年9月份發(fā)表于Nucleic Acids Research檐春,影響因子17分逻淌,我們簡(jiǎn)單看看文獻(xiàn)么伯,最后看一下示例代碼疟暖。

說起樣本之間的批次效應(yīng),其實(shí)談了很多了,但是想很好的去除卻并非容易的事情俐巴,下面列舉了一些批次效應(yīng)出去的方法骨望,大家感興趣可以參考。

企業(yè)微信截圖_16337733838949.png

先看看文章的一些重點(diǎn)信息

ABSTRACT

1欣舵、using multiple benchmarks that supervised clustering, which uses reference transcriptomes as a guide, is robust to batch effects and data quality artifacts.(當(dāng)然之前我們都是無監(jiān)管聚類擎鸠,效果好不好很難評(píng)判。)
2缘圈、RCA2劣光,這是第一個(gè)將參考投影(批量效應(yīng)魯棒性)與基于圖的聚類(可擴(kuò)展性)相結(jié)合的算法。 當(dāng)然糟把,也包括很多下游的分析模塊绢涡。RCA2 also provides new reference panels for human and mouse and supports generation of custom panels.(這個(gè)panel需要我們注意)。
3遣疯、RCA2 facilitates cell type-specific QC, which is essential for accurate clustering of data from heterogeneous tissues(關(guān)于數(shù)據(jù)質(zhì)控雄可,多次強(qiáng)調(diào)過,質(zhì)控是基礎(chǔ)缠犀,必須要做好数苫,基礎(chǔ)不牢,后面分析的再好都是錯(cuò)的)辨液。
4虐急、Scalable supervised clustering methods such as RCA2 will facilitate unified analysis of cohort-scale SC datasets.(這也是單細(xì)胞數(shù)據(jù)追求的目標(biāo))。

INTRODUCTION

單細(xì)胞聚類的兩種方法滔迈,(i) unsupervised (de-novo) clustering, which is the most prevalent(Louvain graph-based clustering algorithm is the most prevalent), and (ii) supervised clustering, which exploits a panel of reference transcriptomes

單細(xì)胞聚類方法存在的挑戰(zhàn)戏仓,(i) cells may cluster by technical variation and batch effects rather than biological properties, (ii) scRNA-seq data tend to be noisy, primarily due to sampling noise and (iii) the gene expression matrix can be very large, since modern datasets commonly include > 100,000 cells.(相信這樣的問題大家都遇到過)。所以不同的聚類方法往往導(dǎo)致不同的分析結(jié)果亡鼠。此外赏殃,從頭聚類需要一個(gè)容易出錯(cuò)、耗時(shí)的手動(dòng)步驟间涵,根據(jù)對(duì)標(biāo)記基因表達(dá)的主觀評(píng)估將細(xì)胞簇分配給細(xì)胞類型(注釋)仁热。 已經(jīng)開發(fā)了監(jiān)督聚類和監(jiān)督細(xì)胞類型注釋算法來解決這些限制。

Unlike the above-mentioned methods, RCA was not primarily designed for cell type annotation. Rather, the objective of RCA is to cluster single cells in the space of reference transcriptome projections. This is fundamentally different from unsupervised clustering approaches, which cluster cells in the space defined by over a thousand feature genes (關(guān)于之前的版本RCA勾哩,大家可以網(wǎng)上找一找抗蠢,之間用過,效果一般)思劳。RCA is the only supervised clustering algorithm for scRNA-seq data. However, the original version of RCA could not scale to datasets larger than 20,000 cells on a high-end laptop, used only a single reference panel, did not implement methods to identify differential gene expression, did not offer KEGG and Gene Ontology (GO) enrichment analysis, was benchmarked on only a single Smart-seq dataset and could not be easily integrated into existing data analysis workflows.(看來新改進(jìn)的RCA2的方法是一個(gè)單細(xì)胞數(shù)據(jù)分析的pipieline)迅矛。

MATERIALS AND METHODS

這個(gè)部分需要很深的數(shù)學(xué)知識(shí)了。

Projection to a reference潜叛,這個(gè)地方是對(duì)參考和query數(shù)據(jù)的預(yù)處理

圖片.png

Clustering and interpreting the projection

圖片.png

Reference panels秽褒,提供了一些固有的參考數(shù)據(jù)集

圖片.png

當(dāng)然壶硅,也很其他的去除批次的聚類方法進(jìn)行了比較,包括SEURAT, SEURAT INTEGRATION, SCTRANSFORM, SCTRANSFORM INTEGRATION, SCRAN (30), SCANPY, MNNCORRECT (31) and SCANORAMA销斟,我們后續(xù)看看結(jié)果如何庐椒。

Silhouette Index for quantifying batch effect(這個(gè)地方用來衡量批次去除的效果如何)。提到一個(gè)SI指數(shù)蚂踊,大家可以查一下约谈。

圖片.png

Result

RCA2的work flow,基本就是一套單細(xì)胞的分析流程

圖片.png

包括RCA2和之前的版本RCA的區(qū)別

圖片.png

多方法比較

圖片.png
  • 注:Note that a robust method should have low batch SI and high cell type SI, indicating that cells are separated by cell type rather than by batch.
RCA2聚類后的差異基因熱圖犁钟,區(qū)分的很明顯
圖片.png

當(dāng)然棱诱,后面還有一些其他數(shù)據(jù)的測(cè)試,我們就不一一減少了涝动,不過有一點(diǎn)很有意思军俊,如下圖,后續(xù)的代碼分享會(huì)介紹到

圖片.png

來吧捧存,看看代碼

library(RCAv2)
PBMCs<-RCAv2::createRCAObjectFrom10X("10xPBMCs/")##官方數(shù)據(jù)為例

Perform basic QC steps and data normalization

PBMCs<-RCAv2::dataFilter(PBMCs,
                  nGene.thresholds = c(300,5000), 
                  nUMI.thresholds = c(400,30000),
                  percent.mito.thresholds = c(0.025,0.2),
                  min.cell.exp = 3,
                  plot=T,
                  filename = "PBMCs_filter_example.pdf")

PBMCs<-RCAv2::dataLogNormalise(PBMCs)
圖片.png

Compute a projection to a reference data set,里面的參數(shù)大家需要注意

PBMCs<-RCAv2::dataProject(PBMCs,
                     method = "GlobalPanel",
                     corMeth = "pearson")
In addition to the GlobalPanel, RCA now provides 12 reference panels:(當(dāng)然粪躬,也可以自己指定)
  • GlobalPanel from the original RCA (Li et al., 2017) containing both primary cell types and tissues. Can be limited to only cell types with "GlobalPanel_CellTypes.
  • ColonEpiPanel: 9 colon epithelial samples from Li et al. (Nature genetics, 2017).
  • MonacoPanel: 29 PBMC cell types from Monaco, G., et al. (Cell reports, 2019).
  • MonacoBCellPanel: 5 B cell sub-types from Monaco, G., et al. (Cell reports, 2019).
  • MonacoMonoPanel: 5 Monocyte sub-types from Monaco, G., et al. (Cell reports, 2019).
  • MonacoTCellPanel: 15 T cell sub-types from Monaco, G., et al. (Cell reports, 2019).
  • CITESeqPanel based on Seurat 4.0 containing 34 cell types.
  • ENCODEHumanPanel: 93 human cell types from ENCODE.
  • NovershternPanel: 15 PBMC cell types from Novershtern et al. (Cell, 2011).
  • NovershternTCellPanel: 6 T cell sub-types from Novershtern et al. (Cell, 2011).
  • ENCODEMousePanel: 15 mouse cell types from ENCODE.
  • ZhangMouseBrainPanel: 7 mouse brain cell types from Tasic et al. (Nature neuroscience, 2016).

To use the custom panel MyPanel.RDS use the following command:

PBMCs<-RCAv2::dataProject(PBMCs,
                     method = "Custom",
             customPath = "MyPanel.RDS",
                     corMeth = "pearson")

To benefit from multiple panels at the same time, users can exploit the dataProjectMultiPanel function:

PBMCs<-RCAv2::dataProjectMultiPanel(PBMCs,method=list("NovershternPanel", 
"MonacoPanel", "GlobalPanel_CellTypes"),corMeth="pearson")

Cluster the projection and visualize it

PBMCs<-RCAv2::dataClust(PBMCs)
RCAv2::plotRCAHeatmap(PBMCs,filename = "Heatmap_PBMCs.pdf",var.thrs=1)
圖片.png
PBMCs<-computeUMAP(PBMCs)
RCAv2::plotRCAUMAP(PBMCs,filename = "UMAP_PBMCs.pdf")
圖片.png
PBMCs<-computeUMAP(PBMCs, nDIMS = 3)
RCAv2::plotRCAUMAP3D(PBMCs,filename = "UMAP3D_PBMCs.html")
圖片.png
#Estimate the most probable cell type label for each cell
PBMCs<-estimateCellTypeFromProjection(PBMCs,confidence = NULL)
#Generate the cluster composition plot
RCAv2::plotRCAClusterComposition(PBMCs,filename="Cluster_Composition.pdf")
圖片.png
Based on the heatmap as well as the stacked bar plots we can relabel the clusters according to the major cell type annotations:
RCAcellTypes<-PBMCs$clustering.out$dynamicColorsList[[1]]
RCAcellTypes[which(RCAcellTypes=="blue")]<-"Monocytes"
RCAcellTypes[which(RCAcellTypes=="green")]<-"Dentritic cells"
RCAcellTypes[which(RCAcellTypes=="yellow")]<-"B cells"
RCAcellTypes[which(RCAcellTypes=="grey")]<-"B cells"
RCAcellTypes[which(RCAcellTypes=="brown")]<-"NK cells"
RCAcellTypes[which(RCAcellTypes=="turquoise")]<-"T cells"
RCAcellTypes[which(RCAcellTypes=="red")]<-"Myeloid cells"
RCAcellTypes[which(RCAcellTypes=="black")]<-"Progenitor cells"
范例
CD56Exp<-PBMCs$data[which(rownames(PBMCs$data)=="NCAM1"),]
RCAv2::plotRCAUMAP(PBMCs,cellPropertyList = list(CellTypes=RCAcellTypes,CD56=CD56Exp),filename = "UMAP_PBMCs.pdf")
圖片.png

圖片.png
To obtain cluster based cell type predictions, the user can run the function estimateCellTypeFromProjectionPerCluster:
PBMCs<-RCAv2::createRCAObjectFrom10X("../Documents/10xExample/")
PBMCs<-RCAv2::dataFilter(PBMCs,nGene.thresholds = c(300,4500),
                  percent.mito.thresholds = c(0.025,0.1),
                  min.cell.exp = 3)
PBMCs<-RCAv2::dataLogNormalise(PBMCs)
PBMCs<-RCAv2::dataProject(PBMCs,
                          method = "NovershternPanel",
                          corMeth = "pearson", 
              nPCs=0,
              approx= FALSE)
PBMCs<-RCAv2::dataSClust(PBMCs,res = 0.15)
PBMCs<-estimateCellTypeFromProjectionPerCluster(PBMCs)
圖片.png

Graph based clustering as an alternative to hierarchical clustering

由于單細(xì)胞數(shù)據(jù)集的大小不斷增加,層次聚類需要具有大內(nèi)存的機(jī)器昔穴。 為了克服這個(gè)假定的限制镰官,RCAv2 還使用共享的最近鄰 (snn) 方法提供基于圖的聚類

PBMCs<-RCAv2::dataSNN(PBMCs,k=100,eps=25,minPts=30,dist.fun="All",corMeth="pearson")
This function has three main parameters: * k as the number of considered neighbours per cell, * eps as the minimum number of shared neighbours between to cells, * minPts minimum number of points that share eps neighbours such that a point is considered a core point.
The dist.fun parameter controls whether the distance matrix used for SNN clustering is based on the full correlation distance matrix or, if dist.fun is set to PCA, on a PC reduction of the reference projection. The corMeth sets which correlation function is used for the distance computation (pearson (default), spearman or kendal). As for the hierarchical clustering, heatmaps and umaps can be generated as well.
To help the user choosing the parameters for clustering, we provide a parameter space exploration feature leading to a 3D umap illustrating the number of clusters depending on the three parameters, as shown below.
圖片.png

上圖可用下面的代碼生成

parameterSpaceSNN(PBMCs,kL=c(30:50),epsL=c(5:20),minPtsL=c(5:10),folderpath=".",filename="Graph_based_Clustering_Parameter_Space.html") 

where kL, epsL and minPtsL define the search space for k, eps and minPts respectively. Note that executing this function will take longer for large search spaces.

In addition to those clustering parameters, via the dist.fun parameter one can choose whether a PCA reduction of the projection matrix, or the entire projection matrix should be used to construct the snn graph.

Using Louvain graph clustering implemented in Seurat

PBMCs<-dataSClust(PBMCs,res=0.5)
其中 res 是解析到 Seurat 聚類函數(shù)的分辨率參數(shù)。
基于圖的聚類吗货,RCA 投影熱圖將在沒有列樹狀圖的情況下繪制泳唠。
圖片.png
此函數(shù)使用與原始 Seurat 函數(shù)相同的 PCA 近似值。 將 approx 設(shè)置為 FALSE 將計(jì)算準(zhǔn)確的 PCA宙搬。 如果 corMeth 參數(shù)設(shè)置為 pearson笨腥、spearman 或 kendal,則該函數(shù)將使用相關(guān)距離而不是默認(rèn)的歐幾里得距離來計(jì)算全距離矩陣勇垛。 這也可以與精確的 PCA 結(jié)合使用脖母。 但是,并不是這些選項(xiàng)需要比默認(rèn)值更多的可用內(nèi)存闲孤。 使用肘部繪圖功能谆级,可以生成指導(dǎo)選擇要考慮的 PC 數(shù)量的 ElbowPlot。

Clustering free analysis of the projection

特別是對(duì)于非常大的數(shù)據(jù)集讼积,對(duì)投影進(jìn)行聚類可能具有挑戰(zhàn)性肥照。 對(duì)于這些情況,RCA 包括由 SingleR 和 scMatch 驅(qū)動(dòng)的聚類獨(dú)立細(xì)胞類型分配方法勤众,該方法純粹基于基于參考投影的每個(gè)細(xì)胞 z 分?jǐn)?shù)分布舆绎。 對(duì)函數(shù)的調(diào)用
PBMCs<-estimateCellTypeFromProjection(PBMCs,confidence = NULL)
將為每個(gè)cell返回最可能的cell類型并將其保存在 PBMCs 對(duì)象中。 通過參數(shù)置信度们颜,可以對(duì)兩種最可能的細(xì)胞類型之間的比率施加閾值(0 和 1 之間)吕朵。 在不確定的情況下猎醇,cell將被標(biāo)記為未知。
table(unlist(PBMCs$cell.Type.Estimate))
BDCA4._DentriticCells                     CD14._Monocytes             CD19._BCells.neg._sel.. 
                                 69                                 108                                 307 
                      CD33._Myeloid                               CD34.                         CD4._Tcells 
                               1363                                   3                                2266 
                      CD56._NKCells                         CD8._Tcells                 L45_CMP_Bone.Marrow 
                                458                                 364                                   4 
             L51_B.Cell_Bone.Marrow 

#Retrieve annotation
SimplifiedAnnotation<-unlist(PBMCs$cell.Type.Estimate)
#Relabel it
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD33._Myeloid")]<-"Myeloid"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD4._Tcells")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD8._Tcells")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD14._Monocytes")]<- "Monocytes"
SimplifiedAnnotation[which(SimplifiedAnnotation=="BDCA4._DentriticCells")]<-"Dentritic cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L93_B.Cell_Plasma.Cell")]<- "B cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L52_Platelet")]<-"Myeloid"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L74_T.Cell_CD4.Centr..Memory")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L51_B.Cell_Bone.Marrow")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L75_T.Cell_CD4.Centr..Memory")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L85_NK.Cell_CD56Hi")]<-"NK cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD34.")]<-"Progenitor"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L45_CMP_Bone.Marrow")]<- "Progenitor"
SimplifiedAnnotation[which(SimplifiedAnnotation=="WholeBlood")]<- "Myeloid"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L69_Dendritic.Cell_Monocyte.derived")]<- "Myeloid"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L80_T.Cell_CD8.Eff..Memory")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L60_Monocyte_CD16")]<- "Monocytes"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L86_NK.Cell_CD56Lo")]<-"NK cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L73_T.Cell_CD4.Naive")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD56._NKCells")]<-"NK cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD19._BCells.neg._sel..")]<- "B cells"

umapFigures<-RCAv2::plotRCAUMAP(PBMCs,
                      cellPropertyList = list(`Cell Type`=SimplifiedAnnotation),
                      filename = "UMAP_PBMCs.pdf")
圖片.png

Compute DE genes for RCA clusters

PBMCs<-RCAv2::dataDE(PBMCs,
  logFoldChange = 1.5,
  method = "wilcox",
  mean.Exp = 0.5,
  deep.Split.Values = 1,
  min.pct = 0.25,
  min.diff.pct = -Inf,
  random.seed = 1,
  min.cells.group = 3,
  pseudocount.use = 1,
  p.adjust.methods = "BH",
  top.genes.per.cluster = 10
)
Here, logFoldChange is the required logFoldChange to call a gene to be differentially expressed. The method parameter indicates which statistical test is used. Multiple test correction is perfomed using the method indicated in p.adjust.methods. The parameters mean.Exp and min.pct indicat the minimum expression value as well as the minimum percentage of cells expressing a gene. Furthermore, the pseudocount can be adjusted via the pseudocount.use parameter. The top.genes.per.cluster parameter indicats how many genes should be selected as top DE genes per pairwise comparison for each cluster. Both the entire set of DE genes as well as the top DE genes are stored in the PBMCs rca.obj. The topDE genes can be plotted in a heatmap via the plotDEHeatmap function:
RCAv2::plotDEHeatmap(PBMCs,scale=FALSE)
The scale parameter allows the user to plot either the normalized UMI counts or scaled count (z-transformed). An example is shown below.

Compute enrichment for GO terms and KEGG pathways

使用 clusterProfiler 包边锁,RCAv2 支持分別使用函數(shù) doEnrichGo 和 doEnrichKEGG 對(duì) GO 術(shù)語和 KEGG 通路進(jìn)行富集測(cè)試姑食。 兩者都需要設(shè)置參數(shù)注釋波岛,這對(duì)于 ID 映射和 GO-term 分配都需要茅坛。 人類注釋的一個(gè)例子是 Bioconductor 上的 org.Hs.eg.db。

GO分析

doEnrichGo<-function(rca.obj,
                    annotation=NULL,
                    ontology="BP",
                    p.Val=0.05,
                    q.Val=0.2,
                    p.Adjust.Method="BH",
                    gene.label.type="SYMBOL",
                    filename="GoEnrichment.pdf",
            background.set="ALL",
                    background.set.threshold=NULL,
                    n.Cells.Expressed=NULL,
                    cluster.ID=NULL,
                    deep.split=NULL)
其中本體是 BP则拷、MF 或 CC贡蓖,p.Val 和 q.Val 是 clusterprofiler 使用的閾值,p.Adjust.Method 指示使用哪種方法來糾正多次測(cè)試煌茬。為了給用戶帶來更大的方便斥铺,該功能會(huì)自動(dòng)映射基因 ID。為此坛善,gene.label.type 保存原始標(biāo)簽的類型晾蜘。根據(jù) 10X 數(shù)據(jù),默認(rèn)設(shè)置為 SYMBOL眠屎。背景集要么基于所有集群剔交,要么僅基于調(diào)查的集群。通過 background.set.threshold 或 n.Cells.Expressed" 參數(shù)選擇單元格改衩。請(qǐng)注意岖常,前者是數(shù)值或以下值之一:Min、1stQ葫督、Mean竭鞍、Median、3thQ橄镜。計(jì)算這些閾值用于所有基因的所有平均表達(dá)值的分布偎快。如果只對(duì)一個(gè)特定的簇進(jìn)行分析,可以設(shè)置參數(shù)cluster.ID*洽胶,如果分層滨砍,可以指定deep.split的值來選擇自定義拆分已經(jīng)使用了聚類。
doEnrichGo 函數(shù)分別為每個(gè)集群生成條形圖妖异、goplots 和點(diǎn)圖惋戏。可以使用 filename 參數(shù)修改文件名他膳。顯示了 PBMC NK 細(xì)胞簇的示例條形圖和點(diǎn)圖响逢。描繪了 PBMC B 細(xì)胞的 Goplots。
圖片.png

KEGG分析

doEnrichKEGG<-function(rca.obj,
                     annotation=NULL,
                     org="hsa",
                     key="kegg",
                     p.Val=0.05,
                     q.Val=0.2,
                     p.Adjust.Method="BH",
                     gene.label.type="SYMBOL",
                     filename="KEGGEnrichment.pdf",
             background.set="ALL",
                     background.set.threshold=NULL,
                     n.Cells.Expressed=NULL,
                     cluster.ID=NULL,
                     deep.split=NULL)
圖片.png

Cluster/Cell-type specific quality control RCAv2 offers straightforward ways to perform cluster-specific quality control. We illustrate this functionality using an inhouse dataset of 45926 cells obtained from five bone marrow samples. A link to download the data will be made available here at a later stage. First, we load the data, project it against the global panel and cluster it:

normalBoneMarow<-readRDS("../Documents/DUKE_Normal.RDS")
createRCAObject()
PBMCs<-RCAv2::createRCAObject(normalCML@assays$RNA@data,dataIsNormalized = T)

PBMCs<-RCAv2::dataProject(PBMCs,
                          method = "GlobalPanel",
                          corMeth = "pearson")

PBMCs<-RCAv2::dataSClust(PBMCs,res = 0.1)
RCAv2::plotRCAHeatmap(PBMCs,filename = "Control_HeatmapPostQC.pdf")
圖片.png

we identify the cluster IDs as:

cellTypes<-c("Progenitor B","CMP/MEP","CMP/GMP","GMP/Dendritic cells","CD8 T cells","NK cells","CD4 T cells", "B cells", "Erythroid Progenitor","Monocytes","BT")
clusterColors<-c("purple","black","blue","magenta","turquoise","yellow","green","pink","greenyellow","red","brown")
names(cellTypes)<-clusterColors
cellTypeLabels<-cellTypes[PBMCs$clustering.out$dynamicColorsList[[1]]]

and plot cluster quality scores using

plotClusterQuality(PBMCs,width = 15,height = 9,cluster.labels = cellTypeLabels)
圖片.png

Combining RCA with Seurat

Data processing can also be carried out with Seurat. Here is an example how you can combine a RCA analysis with data preprocessed in Seurat.

Load and preprocess data

Using the same 10x data as before, we generate a Seurat object and perform an initial analysis:

library(Seurat)

#Load the data
PBMCs.10x.data<-Seurat::Read10X('../Downloads/10xExample/')

#Generate a Seurat object
pbmc_Seurat <- CreateSeuratObject(counts = PBMCs.10x.data$`Gene Expression`, 
                  min.cells = 3, 
                  min.features  = 200, 
                  project = '10X_PBMC', 
                  assay = 'RNA')

#Compute the percentage of mitochondrial rates
mito.genes<-grep(pattern='^MT-',x=rownames(pbmc_Seurat@assays[['RNA']]),value=T)
percent.mito <- Matrix::colSums(pbmc_Seurat@assays[['RNA']][mito.genes, ])/
                                Matrix::colSums(pbmc_Seurat@assays[['RNA']])
pbmc_Seurat <- AddMetaData(object = pbmc_Seurat, metadata = percent.mito, col.name = 'percent.mito')

#Perform QC using the same parameters as above
pbmc_Seurat <- subset(pbmc_Seurat, nFeature_RNA >300 & nFeature_RNA < 5000 &
                        nCount_RNA > 400 & nCount_RNA<30000 &
                        percent.mito > 0.025 & percent.mito < 0.2)

#Normalize the data
pbmc_Seurat <- NormalizeData(object = pbmc_Seurat, normalization.method = 'LogNormalize', scale.factor = 10000)

To run RCA, no further processing steps would be needed. However, we want to also compare the RCA result to the Seurat based clustering, therefore we first go on with a Seurat based analysis:


#Find HVGs
pbmc_Seurat <- FindVariableFeatures(object = pbmc_Seurat, 
                   mean.function = ExpMean, 
                   dispersion.function = LogVMR, 
                   x.low.cutoff = 0.0125, 
                   x.high.cutoff = 3, 
                   y.cutoff = 0.5, 
                   nfeatures = 2000)

#Center and scale the data
pbmc_Seurat <- ScaleData(object = pbmc_Seurat)

#Run PCA on the data
pbmc_Seurat <- RunPCA(object = pbmc_Seurat,  npcs = 50, verbose = FALSE)

#Plot different aspsects of the pca
ElbowPlot(object = pbmc_Seurat,ndims = 50)

Based on the Elbowplot (not shown here), we use 20 PCs for further analysis.

#Find Neighbors
pbmc_Seurat <- FindNeighbors(pbmc_Seurat, reduction = 'pca', dims = 1:20)

#Find Clusters
pbmc_Seurat <- FindClusters(pbmc_Seurat, resolution = 0.2, algorithm = 1)

We generate a UMAP of the data stored in the Seurat object using the umap R package:

#Load required libraries
library(umap)
library(ggplot2)
library(randomcoloR)

#Compute Umap from first 20PCs
umap_resultS<- umap(pbmc_Seurat@reductions$pca@cell.embeddings[,c(1:20)])
umap_resultSL<-as.data.frame(umap_resultS$layout)

#Derive distinguishable colors for the seurat clusters
myColors<-distinctColorPalette(length(unique(pbmc_Seurat$seurat_clusters)))

#Generate a UMAP
umapAll_Seurat_RCA<-ggplot(umap_resultSL,aes(x=V1,y=V2,color=pbmc_Seurat$seurat_clusters))+theme_bw(30)+
  geom_point(size=1.5)+labs(colour='ClusterID')+theme(legend.title = element_text(size=10))+
  guides(colour = guide_legend(override.aes = list(size=4)))+theme(legend.position = 'right')+
  theme(legend.text=element_text(size=10))+scale_color_manual(values=myColors)+xlab('UMAP1')+ylab('UMAP2')
umapAll_Seurat_RCA

We obtain the following UMAP:
image.png

Generate a RCA object and perform RCA analysis

We use the RCA function createRCAObject to generate a RCA object from the raw and optionally also the normalized data stored in our Seurat object.

library(RCAv2)
RCA_from_Seurat<-RCAv2::createRCAObject(pbmc_Seurat@assays$RNA@counts, pbmc_Seurat@assays$RNA@data)

Next, we can compute the projection, cluster the data, and estimate the most likely cell type for each cell as above:

#Compute projection
RCA_from_Seurat<-RCAv2::dataProject(rca.obj = RCA_from_Seurat)

#Cluster the projection
RCA_from_Seurat<-RCAv2::dataClust(RCA_from_Seurat)

#Estimate most likely cell type
RCA_from_Seurat<-RCAv2::estimateCellTypeFromProjection(RCA_from_Seurat)

Using the RCA cell type labels, RCA and Seurat clusters, we generate two new UMAPs whose coordinates are based on the PCs derived from HVGs and that are colored according to RCA clusters and cell type labels.

#Simplify the cell type annotation
SimplifiedAnnotation<-unlist(RCA_from_Seurat$cell.Type.Estimate)
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD33._Myeloid')]<-'Myeloid'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD4._Tcells')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD8._Tcells')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD14._Monocytes')]<- 'Monocytes'
SimplifiedAnnotation[which(SimplifiedAnnotation=='BDCA4._DentriticCells')]<-'Dentritic cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L93_B.Cell_Plasma.Cell')]<- 'B cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L52_Platelet')]<-'Myeloid'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L74_T.Cell_CD4.Centr..Memory')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L51_B.Cell_Bone.Marrow')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L75_T.Cell_CD4.Centr..Memory')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L85_NK.Cell_CD56Hi')]<-'NK cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD34.')]<-'Progenitor'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L45_CMP_Bone.Marrow')]<- 'Progenitor'
SimplifiedAnnotation[which(SimplifiedAnnotation=='WholeBlood')]<- 'Myeloid'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L69_Dendritic.Cell_Monocyte.derived')]<- 'Myeloid'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L80_T.Cell_CD8.Eff..Memory')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L60_Monocyte_CD16')]<- 'Monocytes'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L86_NK.Cell_CD56Lo')]<-'NK cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L73_T.Cell_CD4.Naive')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD56._NKCells')]<-'NK cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD19._BCells.neg._sel..')]<- 'B cells'

#Plot a umap colored by the simplified cell type labels
myColors<-distinctColorPalette(length(unique(SimplifiedAnnotation)))
umapAll_Seurat_Estimated_CT<-ggplot(umap_resultSL,
aes(x=V1,y=V2,color=SimplifiedAnnotation))+
theme_bw(30)+
geom_point(size=1.5)+
theme(legend.position = 'bottom')+
labs(colour='Cell type')+
guides(colour = guide_legend(override.aes = list(size=4)))+
theme(legend.text=element_text(size=10))+
scale_color_manual(values=myColors)+
ggtitle('b)')+
xlab('UMAP1')+ylab('UMAP2')+
theme(legend.title = element_text(size=12))

#Plot a umap colored by the RCA cluster ID
umapAll_Seurat_RCA_Clusters<-ggplot(umap_resultSL,
aes(x=V1,y=V2,color=RCA_from_Seurat$clustering.out$dynamicColorsList[[1]]))+
theme_bw(30)+
geom_point(size=1.5)+
theme(legend.position = 'bottom')+
labs(colour='RCA Cluster ID')+
guides(colour = guide_legend(override.aes = list(size=4)))+
theme(legend.text=element_text(size=10))+
xlab('UMAP1')+ylab('UMAP2')+
scale_color_identity(guide=guides(color=RCA_from_Seurat$clustering.out$dynamicColorsList[[1]]))+
ggtitle('a)')+
theme(legend.title = element_text(size=12))

#Combine the Figures into one
library(gridExtra)
grid.arrange(umapAll_Seurat_RCA_Clusters,umapAll_Seurat_Estimated_CT,nrow=1)
image.png

The RCA clusters show a high concordance to the Seurat clusters shown in the previous UMAP.

Add projection and annotations to the Seurat object

For greater convenience the results of RCA can be saved within the Seurat object for further analysis.

pbmc_Seurat[['RCA.clusters']]<-RCA_from_Seurat$clustering.out$dynamicColorsList
pbmc_Seurat[['cellTypeLabel']]<-RCA_from_Seurat$cell.Type.Estimate
pbmc_Seurat[['Projection']]<-CreateAssayObject(data=RCA_from_Seurat$projection.data)

Add a UMAP based on the projection to the Seurat object

Also, a UMAP reduction based on the projection space can be added to the Seurat object:

RCA_from_Seurat<-computeUMAP(RCA_from_Seurat)
pbmc_Seurat[['RCA_umap']]<-CreateDimReducObject(embeddings=as.matrix(RCA_from_Seurat$umap.coordinates),key='RCA_umap_',assay=DefaultAssay(pbmc_Seurat))

Visualizing RNA velocity on RCA result

RNA velocity describes the rate of gene expression change for an individual gene at a given time point based on the ratio of its spliced and unspliced messenger RNA (mRNA). Here, we describe how one can use the scvelo package, in Python, to visualize RNA velocity on the RCA generated result.

To transfer spliced RNA counts to scvelo, first transpose the raw RCA data matrix to get a cells x genes matrix, and export it to a CSV file.

# R
raw.data.counts <- t(rca_obj$raw.data)
write.table(x = raw.data.counts, file = 'raw_counts.csv', append = FALSE, quote = FALSE, sep = ',')

In addition, export the RCA projection and UMAP embeddings to respective CSV files too.

# R
projection.data <- as.matrix(t(rca_obj$projection.data[, -doublet_index]))
write.table(x = projection.data, file = 'projection_data.csv', append = FALSE, quote = FALSE, col.names = F, row.names = F, sep = ',')

umap.data <- as.matrix(rca_obj$umap.coordinates)
write.table(x = umap.data, file = 'umap_data.csv', append = FALSE, quote = FALSE, col.names = F, row.names = F, sep = ',')

Create an iPython notebook in the same folder and import the required packages as below.

# Python
import scvelo as scv
import scanpy as sc
import numpy as np
import pandas as pd
scv.set_figure_params()

Then, create a Scanpy object using the raw counts from the CSV file.

# Python
adata = sc.read_csv(filename='raw_counts.csv')

Populate the PCA slot in the Scanpy object as the projection data from RCA.


# Python
projection_data = np.loadtxt('bm_input/projection_data.csv',delimiter=',')
projection_data.shape

adata.obsm['X_pca'] = projection_data

Populate the UMAP slot in the Scanpy object as the umap coordinates from RCA.

# Python
umap_data = np.loadtxt('bm_input/umap_data.csv',delimiter=',')
umap_data.shape

adata.obsm['X_umap'] = umap_data

Load the unspliced loom object generated by velocyto.

# Python
ldata = scv.read('merged.loom', cache=True)

Then, merge the spliced and unspliced objects together as described below:

# Python
merged_data = scv.utils.merge(adata, ldata)

As recommended by the scvelo tutorial, perform the following steps to compute RNA velocity:

# Python
scv.pp.filter_and_normalize(merged_data)
scv.pp.moments(merged_data)
scv.tl.velocity(merged_data, mode='stochastic')
scv.tl.velocity_graph(merged_data)

It is possible that not all barcodes had sufficient quality of both spliced and unspliced reads, and thus some cells may have been discarded during the merging process. To ensure your cell type labels are still maintained, export the merged data observations from the merged scvelo object to a CSV file.

# Python
merged_data.obs.to_csv('merged_data_obs.csv')

In R, load this CSV file in and extract the RCA labels and filter only those which were considered in the merged data by scvelo.

# R
merged_data_obs <- read.csv(file = 'merged_data_obs.csv', row.names = 1)
rca_clusters <- rca_obj$clustering.out$dynamicColorsList$Clusters
names(rca_clusters) <- colnames(rca_obj$raw.data)
rca_clusters <- rca_clusters[rownames(merged_data_obs)]

Note: If your cell names have underscores in them, scanpy will automatically split the cell name into barcode and sample_batch.

In this case, replace the last line of the above block of code with the following:

# R
merged_barcodes <- paste0(merged_data_obs$sample_batch, rownames(merged_data_obs))
rca_clusters <- rca_clusters[merged_barcodes]

Now export these cluster labels to a CSV file.

# R
rca_cluster_df <- data.frame(Clusters = rca_clusters)
write.table(x = rca_cluster_df, file = 'rca_cluster_df.csv', append = FALSE, quote = FALSE, col.names = T, row.names = F, sep = ',')

Back in the scvelo iPynb, load this RCA cluster annotation table and set it as the observation slot of your merged data.

# Python
rca_clusters = pd.read_csv('rca_cluster_df.csv')
merged_data.obs = rca_clusters

Now, it’s finally time to visualize the RNA velocity results. There are 3 visualization options provided by scvelo, namely velocity_embedding, velocity_embedding_grid and velocity_embedding_stream. Use them as demonstrated below

# Python
### Velocity embedding
scv.pl.velocity_embedding(merged_data, basis='umap', color = ['Clusters'], legend_loc = 'right margin', palette = 'tab20', figsize = (10,10), save = 'embedding.png')
image.png

Using RCA colors

Since the RCA clusters already have color annotations, you can use the RCA colors in the palette as described below:

# Python
### Velocity embedding
scv.pl.velocity_embedding(merged_data, basis='umap', color = ['Clusters'], legend_loc = 'right margin', palette = merged_data.obs['Clusters'].sort_values().unique().tolist(), figsize = (10,10), save = 'RCAColor_embedding.png')

生活很好棕孙,有你更好

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
禁止轉(zhuǎn)載舔亭,如需轉(zhuǎn)載請(qǐng)通過簡(jiǎn)信或評(píng)論聯(lián)系作者些膨。
  • 序言:七十年代末,一起剝皮案震驚了整個(gè)濱河市钦铺,隨后出現(xiàn)的幾起案子订雾,更是在濱河造成了極大的恐慌,老刑警劉巖矛洞,帶你破解...
    沈念sama閱讀 206,126評(píng)論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件洼哎,死亡現(xiàn)場(chǎng)離奇詭異,居然都是意外死亡沼本,警方通過查閱死者的電腦和手機(jī)噩峦,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,254評(píng)論 2 382
  • 文/潘曉璐 我一進(jìn)店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來抽兆,“玉大人识补,你說我怎么就攤上這事”韬欤” “怎么了凭涂?”我有些...
    開封第一講書人閱讀 152,445評(píng)論 0 341
  • 文/不壞的土叔 我叫張陵,是天一觀的道長(zhǎng)贴妻。 經(jīng)常有香客問我切油,道長(zhǎng),這世上最難降的妖魔是什么揍瑟? 我笑而不...
    開封第一講書人閱讀 55,185評(píng)論 1 278
  • 正文 為了忘掉前任白翻,我火速辦了婚禮,結(jié)果婚禮上绢片,老公的妹妹穿的比我還像新娘滤馍。我一直安慰自己,他們只是感情好底循,可當(dāng)我...
    茶點(diǎn)故事閱讀 64,178評(píng)論 5 371
  • 文/花漫 我一把揭開白布巢株。 她就那樣靜靜地躺著,像睡著了一般熙涤。 火紅的嫁衣襯著肌膚如雪阁苞。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 48,970評(píng)論 1 284
  • 那天祠挫,我揣著相機(jī)與錄音那槽,去河邊找鬼。 笑死等舔,一個(gè)胖子當(dāng)著我的面吹牛骚灸,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播慌植,決...
    沈念sama閱讀 38,276評(píng)論 3 399
  • 文/蒼蘭香墨 我猛地睜開眼甚牲,長(zhǎng)吁一口氣:“原來是場(chǎng)噩夢(mèng)啊……” “哼义郑!你這毒婦竟也來了?” 一聲冷哼從身側(cè)響起丈钙,我...
    開封第一講書人閱讀 36,927評(píng)論 0 259
  • 序言:老撾萬榮一對(duì)情侶失蹤非驮,失蹤者是張志新(化名)和其女友劉穎,沒想到半個(gè)月后雏赦,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體劫笙,經(jīng)...
    沈念sama閱讀 43,400評(píng)論 1 300
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 35,883評(píng)論 2 323
  • 正文 我和宋清朗相戀三年喉誊,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了邀摆。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片纵顾。...
    茶點(diǎn)故事閱讀 37,997評(píng)論 1 333
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡伍茄,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出施逾,到底是詐尸還是另有隱情敷矫,我是刑警寧澤,帶...
    沈念sama閱讀 33,646評(píng)論 4 322
  • 正文 年R本政府宣布汉额,位于F島的核電站曹仗,受9級(jí)特大地震影響,放射性物質(zhì)發(fā)生泄漏蠕搜。R本人自食惡果不足惜怎茫,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,213評(píng)論 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望妓灌。 院中可真熱鬧轨蛤,春花似錦、人聲如沸虫埂。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,204評(píng)論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽掉伏。三九已至缝呕,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間斧散,已是汗流浹背供常。 一陣腳步聲響...
    開封第一講書人閱讀 31,423評(píng)論 1 260
  • 我被黑心中介騙來泰國(guó)打工, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留鸡捐,地道東北人栈暇。 一個(gè)月前我還...
    沈念sama閱讀 45,423評(píng)論 2 352
  • 正文 我出身青樓,卻偏偏與公主長(zhǎng)得像闯参,于是被迫代替她去往敵國(guó)和親瞻鹏。 傳聞我的和親對(duì)象是個(gè)殘疾皇子悲立,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 42,722評(píng)論 2 345

推薦閱讀更多精彩內(nèi)容