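These are my study notes on "Single-cell transcriptome basic analysis, part 6: pseudotime analysis". I may have changed some details as I worked through it.
The core technique behind Monocle's pseudotime analysis is a machine-learning algorithm called Reversed Graph Embedding. It starts from a plot that captures how similar the cells are in their transcriptional profiles: Monocle2 uses a DDRTree embedding, and Monocle3 uses a UMAP embedding. From that embedding, the algorithm learns a trajectory describing how cells move from one state to another. Monocle assumes the trajectory is a tree, with a "root" at one end and "leaves" at the other. A cell starts at the root at the beginning of the biological process and travels along the trunk until it reaches the first branch point. There it must choose a path, and it keeps moving farther along the tree until it reaches a leaf. A cell's pseudotime value is the distance it would have to travel to get back to the root. The dimensionality-reduction workflow in monocle is much like the one in seurat: first normalize the data, then select a subset of genes to represent the cells' transcriptional profiles, and finally reduce dimensions with a suitable algorithm. If you want to know more about how Monocle works, see the official site: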
http://cole-trapnell-lab.github.io/monocle-release/
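To make "pseudotime is the distance back to the root" concrete, here is a toy sketch in base R. It is not Monocle's implementation: the tree, the node names and the helper `tree_pseudotime()` are all made up for illustration, and a real trajectory is learned from the data rather than typed in.

```r
# Hypothetical toy tree: each row is one edge (parent, child, edge length).
edges <- data.frame(
  from   = c("root", "A", "A", "B"),
  to     = c("A",    "B", "C", "D"),
  length = c(1.0,    2.0, 0.5, 1.5),
  stringsAsFactors = FALSE
)

# Walk the tree breadth-first from the root and accumulate path length.
# The accumulated length of a node plays the role of its pseudotime.
tree_pseudotime <- function(edges, root) {
  dist  <- setNames(0, root)   # the root is at distance 0 by definition
  queue <- root
  while (length(queue) > 0) {
    node  <- queue[1]
    queue <- queue[-1]
    out   <- edges[edges$from == node, , drop = FALSE]
    for (i in seq_len(nrow(out))) {
      child       <- out$to[i]
      dist[child] <- dist[[node]] + out$length[i]
      queue       <- c(queue, child)
    }
  }
  dist
}

pt <- tree_pseudotime(edges, "root")
print(pt)
```

Nodes C and D are leaves on different branches, so they get different pseudotime values even though both descend from A.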
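Data import and processing
Trajectory analysis assumes that the cells being analyzed are closely related developmentally. PBMCs as a whole are not a good example dataset, so we use the T cell population for the demonstration. Monocle recommends importing the raw expression matrix and letting Monocle handle normalization and other preprocessing.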
dir.create("pseudotime")
The expressionFamily argument specifies the data type of the expression matrix. There are several options:
negbinomial.size() for sparse count matrices,
tobit() for FPKM values,
gaussianff() for log FPKM values
mycds is the object Monocle creates for our data, the counterpart of the scRNA object we used in seurat. After import, the data needs normalization and other preprocessing:
mycds <- estimateSizeFactors(mycds)
Unlike seurat, which stores the normalized expression matrix in the object, monocle stores only some intermediate results and converts them when they are needed. After the three functions above have run, the mycds object has gained SizeFactors, Dispersions, num_cells_expressed, num_genes_expressed and similar information.
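The idea of keeping a per-cell size factor and normalizing on demand can be sketched in base R. This toy assumes a simple library-size scheme (each cell's total count divided by the geometric mean of all totals); it is an illustration only, not a claim about exactly what estimateSizeFactors() computes. The matrix and all names are made up.

```r
# Hypothetical raw counts: rows are genes, columns are cells.
# matrix() fills column by column, so each line below is one cell.
counts <- matrix(
  c(10, 0,  5,    # cell c1
    20, 0, 10,    # cell c2
    40, 0, 20),   # cell c3
  nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("c1", "c2", "c3"))
)

# One size factor per cell: total count relative to the geometric mean total.
totals       <- colSums(counts)
size_factors <- totals / exp(mean(log(totals)))

# Normalized values are derived only when needed, from counts + size factors.
normalized <- sweep(counts, 2, size_factors, "/")

print(size_factors)
print(normalized)
```

The three cells differ only in sequencing depth, so after dividing by the size factors their columns become identical.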
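Selecting representative genes
Once the data is imported and preprocessed, we can decide which genes should represent the developmental state of the cells. The official Monocle tutorial offers four ways to choose:
Genes differentially expressed over development
Genes differentially expressed between clusters
Genes with high dispersion
User-defined developmental marker genes
The first three are unsupervised: the trajectory is built with no manual intervention. The last is semi-supervised and lets prior knowledge guide the analysis. The first method requires an experimental design with several time points: run a differential expression analysis between the start and end samples, then keep the significantly different genes for the downstream analysis. For samples without a time-series design, use the second or third method. The second method first reduces dimensions and clusters the cells, then uses the genes that are differentially expressed between clusters. Monocle has its own dimensionality reduction and clustering procedure, which is much like seurat's, and many tutorials simply use seurat's differential expression results. The third method uses highly dispersed genes; seurat has a method for picking highly variable genes, and monocle has its own selection algorithm as well. This dataset does not meet the conditions for the first or fourth method, so only the second and third are demonstrated here.
## Use cluster differentially expressed genes
Different gene sets give different pseudotime results, so in practice it is worth trying several of these methods.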
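The dispersion-based filter used in the session log at the end of these notes can be tried on a toy table without running Monocle. The sketch below assumes a data frame with the same column names that the log uses on the output of dispersionTable() (gene_id, mean_expression, dispersion_fit, dispersion_empirical); the values are invented.

```r
# Hypothetical stand-in for the table returned by dispersionTable(mycds).
disp_table_toy <- data.frame(
  gene_id              = c("g1", "g2", "g3", "g4"),
  mean_expression      = c(0.05, 0.5, 1.2, 3.0),
  dispersion_fit       = c(2.0,  1.0, 0.8, 0.5),
  dispersion_empirical = c(5.0,  0.7, 1.6, 0.5),
  stringsAsFactors = FALSE
)

# Keep genes that are expressed above a minimum level AND whose observed
# dispersion is at least as large as the fitted (expected) dispersion.
disp_genes_toy <- subset(
  disp_table_toy,
  mean_expression >= 0.1 & dispersion_empirical >= 1 * dispersion_fit
)$gene_id

print(disp_genes_toy)
```

Here g1 is dropped for low expression and g2 because it is less dispersed than expected. Raising the mean_expression cutoff or the multiplier on dispersion_fit gives a smaller, stricter ordering gene set.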
Dimensionality reduction and cell ordering
Using disp.genes for the downstream analysis
# Dimensionality reduction
Results using diff.genes
Faceted trajectory plots
p1 <- plot_cell_trajectory(mycds, color_by = "State") + facet_wrap(~State, nrow = 1)
Monocle gene visualization
s.genes <- c("ITGB1","CCR7","KLRB1","GNLY")
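Clustered heatmap of pseudotime-associated genes
Monocle's differentialGeneTest() function runs a differential analysis under a given model. With fullModelFormulaStr = "~sm.ns(Pseudotime)", it finds genes whose expression changes with pseudotime. It is better to filter the genes first, because running every gene takes a long time; I suggest using cluster differentially expressed genes or highly variable genes as input. Significance is judged mainly by qval, and the genes that pass the filter can then be drawn as a heatmap with plot_pseudotime_heatmap.
# Cluster differentially expressed genes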
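The logic of testing a gene against pseudotime can be sketched in base R. This is a simplified stand-in, not what differentialGeneTest() does internally: it swaps in ordinary least squares with splines::ns() and an F test in place of Monocle's VGAM-based fit, and it assumes qval is a Benjamini-Hochberg adjusted p-value. The data are deterministic and invented.

```r
library(splines)  # ships with R; provides ns() for natural splines

# Hypothetical data: 50 cells ordered along pseudotime, two genes.
pseudotime <- seq(0, 10, length.out = 50)
idx        <- seq_along(pseudotime)
expr <- list(
  gene_up   = 1 + 0.5 * pseudotime + 0.1 * sin(idx),  # rises with pseudotime
  gene_flat = 2 + 0.1 * sin(1.7 * idx)                # no trend, only wiggle
)

# Compare a smooth curve over pseudotime (full model) with a constant
# (reduced model); a small p-value means the curve explains the gene better.
pseudotime_pval <- function(y, pt) {
  full    <- lm(y ~ ns(pt, df = 3))
  reduced <- lm(y ~ 1)
  anova(reduced, full)[["Pr(>F)"]][2]
}

pval <- sapply(expr, pseudotime_pval, pt = pseudotime)
qval <- p.adjust(pval, method = "BH")   # multiple-testing correction
sig_gene_names_toy <- names(qval)[qval < 0.01]

print(data.frame(pval = pval, qval = qval))
```

The filtering step at the end mirrors the `subset(diff_test, qval < 0.01)` line in the session log: only genes whose adjusted p-value passes the cutoff go on to the heatmap.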
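BEAM analysis
Single-cell trajectories often contain branches, which arise because cells follow different expression programs. Cells show different gene expression patterns when they make a fate decision, or when they are subjected to genetic, chemical or environmental perturbation. BEAM (Branched expression analysis modeling) is a statistical method for finding genes that are regulated in a branch-dependent manner.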
disp_table <- dispersionTable(mycds)
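What "branch-dependent" means statistically can be shown with a toy in base R. The sketch assumes the test compares a model in which the pseudotime curve may differ per branch against one shared curve; it uses ordinary least squares and an F test as a stand-in, so it illustrates the idea and is not BEAM itself. All data and names are invented.

```r
library(splines)  # ships with R; provides ns()

# Hypothetical data: two branches after a branch point, 30 cells each.
pt     <- rep(seq(0, 10, length.out = 30), times = 2)
branch <- factor(rep(c("B1", "B2"), each = 30))
idx    <- seq_along(pt)
expr <- list(
  # goes up in branch B1 and down in branch B2
  gene_branch = 1 + ifelse(branch == "B1", 0.5, -0.5) * pt + 0.1 * sin(idx),
  # same trend in both branches
  gene_shared = 1 + 0.5 * pt + 0.1 * cos(1.3 * idx)
)

# Full model: separate pseudotime curve per branch (interaction term).
# Reduced model: one pseudotime curve shared by both branches.
branch_pval <- function(y, pt, branch) {
  full    <- lm(y ~ ns(pt, df = 3) * branch)
  reduced <- lm(y ~ ns(pt, df = 3))
  anova(reduced, full)[["Pr(>F)"]][2]
}

pval <- sapply(expr, branch_pval, pt = pt, branch = branch)
qval <- p.adjust(pval, method = "BH")

print(data.frame(pval = pval, qval = qval))
```

A gene like gene_shared changes strongly with pseudotime but is not branch-dependent, which is why a pseudotime test alone and a branch test can select different genes.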
> dir.create("pseudotime")
> scRNAsub <- readRDS("scRNAsub.rds") # scRNAsub is the T cell subset seurat object saved in the previous section
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
cannot open compressed file 'scRNAsub.rds', probable reason 'No such file or directory'
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936 LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936
attached base packages:
[1] splines parallel stats4 stats graphics grDevices utils datasets
[9] methods base
other attached packages:
[1] monocle_2.18.0 DDRTree_0.1.5 irlba_2.3.3
[4] VGAM_1.1-5 Matrix_1.2-18 patchwork_1.1.1
[7] celldex_1.0.0 SingleR_1.4.1 SummarizedExperiment_1.20.0
[10] Biobase_2.50.0 GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
[13] IRanges_2.24.1 S4Vectors_0.28.1 BiocGenerics_0.36.0
[16] MatrixGenerics_1.2.0 matrixStats_0.57.0 forcats_0.5.0
[19] stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4
[22] readr_1.4.0 tidyr_1.1.2 tibble_3.0.4
[25] ggplot2_3.3.3 tidyverse_1.3.0 Seurat_3.2.3
loaded via a namespace (and not attached):
[1] reticulate_1.18 tidyselect_1.1.0
[3] RSQLite_2.2.2 AnnotationDbi_1.52.0
[5] htmlwidgets_1.5.3 docopt_0.7.1
[7] grid_4.0.2 combinat_0.0-8
[9] BiocParallel_1.24.1 Rtsne_0.15
[11] munsell_0.5.0 codetools_0.2-18
[13] ica_1.0-2 future_1.21.0
[15] miniUI_0.1.1.1 withr_2.3.0
[17] fastICA_1.2-2 colorspace_2.0-0
[19] rstudioapi_0.13 ROCR_1.0-11
[21] tensor_1.5 listenv_0.8.0
[23] labeling_0.4.2 slam_0.1-48
[25] GenomeInfoDbData_1.2.4 polyclip_1.10-0
[27] bit64_4.0.5 farver_2.0.3
[29] pheatmap_1.0.12 parallelly_1.23.0
[31] vctrs_0.3.6 generics_0.1.0
[33] BiocFileCache_1.14.0 R6_2.5.0
[35] rsvd_1.0.3 bitops_1.0-6
[37] spatstat.utils_1.20-2 DelayedArray_0.16.0
[39] assertthat_0.2.1 promises_1.1.1
[41] scales_1.1.1 gtable_0.3.0
[43] beachmat_2.6.4 globals_0.14.0
[45] goftest_1.2-2 rlang_0.4.9
[47] lazyeval_0.2.2 broom_0.7.3
[49] BiocManager_1.30.10 yaml_2.2.1
[51] reshape2_1.4.4 abind_1.4-5
[53] modelr_0.1.8 backports_1.2.0
[55] httpuv_1.5.4 tools_4.0.2
[57] ellipsis_0.3.1 RColorBrewer_1.1-2
[59] sessioninfo_1.1.1 ggridges_0.5.3
[61] Rcpp_1.0.5 plyr_1.8.6
[63] sparseMatrixStats_1.2.1 zlibbioc_1.36.0
[65] RCurl_1.98-1.2 densityClust_0.3
[67] rpart_4.1-15 deldir_0.2-3
[69] viridis_0.5.1 pbapply_1.4-3
[71] cowplot_1.1.1 zoo_1.8-8
[73] haven_2.3.1 ggrepel_0.9.0
[75] cluster_2.1.0 fs_1.5.0
[77] magrittr_2.0.1 RSpectra_0.16-0
[79] data.table_1.13.6 scattermore_0.7
[81] lmtest_0.9-38 reprex_0.3.0
[83] RANN_2.6.1 fitdistrplus_1.1-3
[85] hms_0.5.3 mime_0.9
[87] xtable_1.8-4 sparsesvd_0.2
[89] readxl_1.3.1 gridExtra_2.3
[91] HSMMSingleCell_1.10.0 compiler_4.0.2
[93] KernSmooth_2.23-18 crayon_1.3.4
[95] htmltools_0.5.1.1 mgcv_1.8-33
[97] later_1.1.0.1 lubridate_1.7.9.2
[99] DBI_1.1.0 ExperimentHub_1.16.0
[101] dbplyr_2.0.0 MASS_7.3-53
[103] rappdirs_0.3.1 cli_2.2.0
[105] igraph_1.2.6 pkgconfig_2.0.3
[107] plotly_4.9.3 xml2_1.3.2
[109] XVector_0.30.0 rvest_0.3.6
[111] digest_0.6.27 sctransform_0.3.2
[113] RcppAnnoy_0.0.18 spatstat.data_1.7-0
[115] cellranger_1.1.0 leiden_0.3.6
[117] uwot_0.1.10 DelayedMatrixStats_1.12.3
[119] curl_4.3 shiny_1.5.0
[121] lifecycle_0.2.0 nlme_3.1-151
[123] jsonlite_1.7.2 BiocNeighbors_1.8.2
[125] viridisLite_0.3.0 limma_3.46.0
[127] fansi_0.4.1 pillar_1.4.7
[129] lattice_0.20-41 fastmap_1.0.1
[131] httr_1.4.2 survival_3.2-7
[133] interactiveDisplayBase_1.28.0 glue_1.4.2
[135] qlcMatrix_0.9.7 FNN_1.1.3
[137] spatstat_1.64-1 png_0.1-7
[139] BiocVersion_3.12.0 bit_4.0.4
[141] stringi_1.5.3 blob_1.2.1
[143] BiocSingular_1.6.0 AnnotationHub_2.22.0
[145] memoise_1.1.0 future.apply_1.7.0
> # Figures
> ## Save the data
> saveRDS(scRNAsub, file="scRNAsub.rds")
> scRNAsub <- readRDS("scRNAsub.rds") # scRNAsub is the T cell subset seurat object saved in the previous section
> data <- as(as.matrix(scRNAsub@assays$RNA@counts), 'sparseMatrix')
> fd <- new('AnnotatedDataFrame', data = fData)
Error in value[[3L]](cond) :
AnnotatedDataFrame 'data' is class 'standardGeneric' but should be or extend 'data.frame'
AnnotatedDataFrame 'initialize' could not update varMetadata:
perhaps pData and varMetadata are inconsistent?
> data <- as(as.matrix(scRNAsub@assays$RNA@counts), 'sparseMatrix')
> pd <- new('AnnotatedDataFrame', data = scRNAsub@meta.data)
> scRNAsub <- readRDS("scRNAsub.rds") # scRNAsub is the T cell subset seurat object saved in the previous section
> data <- as(as.matrix(scRNAsub@assays$RNA@counts), 'sparseMatrix')
> pd <- new('AnnotatedDataFrame', data = scRNAsub@meta.data)
> fData <- data.frame(gene_short_name = row.names(data), row.names = row.names(data))
> fd <- new('AnnotatedDataFrame', data = fData)
> mycds <- newCellDataSet(data,
+ phenoData = pd,
+ featureData = fd,
+ expressionFamily = negbinomial.size())
> mycds <- estimateSizeFactors(mycds)
> mycds <- estimateDispersions(mycds, cores=4, relative_expr = TRUE)
Removing 276 outliers
Warning messages:
1: `group_by_()` is deprecated as of dplyr 0.7.0.
Please use `group_by()` instead.
See vignette('programming') for more help
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
2: `select_()` is deprecated as of dplyr 0.7.0.
Please use `select()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
> ## Use cluster differentially expressed genes
> diff.genes <- read.csv('subcluster/diff_genes_wilcox.csv')
> diff.genes <- subset(diff.genes,p_val_adj<0.01)$gene
> mycds <- setOrderingFilter(mycds, diff.genes)
> p1 <- plot_ordering_genes(mycds)
> ## Use the highly variable genes selected by seurat
> var.genes <- VariableFeatures(scRNAsub)
> mycds <- setOrderingFilter(mycds, var.genes)
> p2 <- plot_ordering_genes(mycds)
> ## Use the highly variable genes selected by monocle
> disp_table <- dispersionTable(mycds)
> disp.genes <- subset(disp_table, mean_expression >= 0.1 & dispersion_empirical >= 1 * dispersion_fit)$gene_id
> mycds <- setOrderingFilter(mycds, disp.genes)
> p3 <- plot_ordering_genes(mycds)
> ## Compare the results
> p1|p2|p3
Warning messages:
1: Transformation introduced infinite values in continuous y-axis
2: Transformation introduced infinite values in continuous y-axis
3: Transformation introduced infinite values in continuous y-axis
4: Transformation introduced infinite values in continuous y-axis
5: Transformation introduced infinite values in continuous y-axis
> # Dimensionality reduction
> mycds <- reduceDimension(mycds, max_components = 2, method = 'DDRTree')
> # Order the cells
> mycds <- orderCells(mycds)
There were 50 or more warnings (use warnings() to see the first 50)
> # Trajectory plot colored by State
> plot1 <- plot_cell_trajectory(mycds, color_by = "State")
> plot1
> ggsave("pseudotime/State.pdf", plot = plot1, width = 6, height = 5)
> ggsave("pseudotime/State.png", plot = plot1, width = 6, height = 5)
> ## Trajectory plot colored by cluster
> plot2 <- plot_cell_trajectory(mycds, color_by = "seurat_clusters")
> ggsave("pseudotime/Cluster.pdf", plot = plot2, width = 6, height = 5)
> ggsave("pseudotime/Cluster.png", plot = plot2, width = 6, height = 5)
> plot2
> ## Trajectory plot colored by Pseudotime
> plot3 <- plot_cell_trajectory(mycds, color_by = "Pseudotime")
> plot3
> ggsave("pseudotime/Pseudotime.pdf", plot = plot3, width = 6, height = 5)
> ggsave("pseudotime/Pseudotime.png", plot = plot3, width = 6, height = 5)
> ## Combined plot
> plotc <- plot1|plot2|plot3
> plotc
> ggsave("pseudotime/Combination.pdf", plot = plotc, width = 10, height = 3.5)
> ggsave("pseudotime/Combination.png", plot = plotc, width = 10, height = 3.5)
> ## Save the results
> write.csv(pData(mycds), "pseudotime/pseudotime.csv")
> p1 <- plot_cell_trajectory(mycds, color_by = "State") + facet_wrap(~State, nrow = 1)
> p2 <- plot_cell_trajectory(mycds, color_by = "seurat_clusters") + facet_wrap(~seurat_clusters, nrow = 1)
> plotc <- p1/p2
> p1
> p2
> plotc
> ggsave("pseudotime/trajectory_facet.png", plot = plotc, width = 6, height = 5)
> # Cluster differentially expressed genes
> diff.genes <- read.csv('subcluster/diff_genes_wilcox.csv')
> sig_diff.genes <- subset(diff.genes,p_val_adj<0.0001&abs(avg_logFC)>0.75)$gene
> sig_diff.genes <- unique(as.character(sig_diff.genes))
> diff_test <- differentialGeneTest(mycds[sig_diff.genes,], cores = 1,
+ fullModelFormulaStr = "~sm.ns(Pseudotime)")
> sig_gene_names <- row.names(subset(diff_test, qval < 0.01))
> p1 = plot_pseudotime_heatmap(mycds[sig_gene_names,], num_clusters=3,
+ show_rownames=T, return_heatmap=T)
> p1
> ggsave("pseudotime/pseudotime_heatmap1.png", plot = p1, width = 5, height = 8)
> # Highly variable genes
> disp_table <- dispersionTable(mycds)
> disp.genes <- subset(disp_table, mean_expression >= 0.5&dispersion_empirical >= 1*dispersion_fit)
> disp.genes <- as.character(disp.genes$gene_id)
> diff_test <- differentialGeneTest(mycds[disp.genes,], cores = 4,
+ fullModelFormulaStr = "~sm.ns(Pseudotime)")
> sig_gene_names <- row.names(subset(diff_test, qval < 1e-04))
> p2 = plot_pseudotime_heatmap(mycds[sig_gene_names,], num_clusters=5,
+ show_rownames=T, return_heatmap=T)
Warning messages:
1: In slot(family, "validparams") :
closing unused connection 7 (<-DESKTOP-2F2KC96:11566)
2: In slot(family, "validparams") :
closing unused connection 6 (<-DESKTOP-2F2KC96:11566)
3: In slot(family, "validparams") :
closing unused connection 5 (<-DESKTOP-2F2KC96:11566)
4: In slot(family, "validparams") :
closing unused connection 4 (<-DESKTOP-2F2KC96:11566)
> ggsave("pseudotime/pseudotime_heatmap2.png", plot = p2, width = 5, height = 10)
> disp_table <- dispersionTable(mycds)
> disp.genes <- subset(disp_table, mean_expression >= 0.5&dispersion_empirical >= 1*dispersion_fit)
> disp.genes <- as.character(disp.genes$gene_id)
> mycds_sub <- mycds[disp.genes,]
> plot_cell_trajectory(mycds_sub, color_by = "State")
> beam_res <- BEAM(mycds_sub, branch_point = 1, cores = 8)
Warning messages:
1: In if (progenitor_method == "duplicate") { :
the condition has length > 1 and only the first element will be used
2: In if (progenitor_method == "sequential_split") { :
the condition has length > 1 and only the first element will be used
> beam_res <- beam_res[order(beam_res$qval),]
> beam_res <- beam_res[,c("gene_short_name", "pval", "qval")]
> mycds_sub_beam <- mycds_sub[row.names(subset(beam_res, qval < 1e-4)),]
> plot_genes_
Error: object 'plot_genes_' not found