一供常、寫在前面
最近有粉絲提問,收到了如下的審稿人意見:
審稿人認(rèn)為在單細(xì)胞測(cè)序過程中凰浮,利用findMarker
通過Wilcox
獲得的差異基因雖然考慮到了不同組別細(xì)胞數(shù)量的不同官疲,但是未能考慮到每組樣本數(shù)量的不同泳叠。因此作者希望納入樣本水平的pseudo-bulk
分析能夠有助于確認(rèn)兩種條件下的差異基因耀怜。
首先我個(gè)人覺得審稿人自己的話有些矛盾恢着,使用pseudo-bulk
計(jì)算差異基因,豈不是無法考慮到不同組別中樣本數(shù)量的差異财破?另外這有些吹毛求疵掰派,在Seurat V5
之前,作者甚至沒有在包里集成pseudo-bulk
的函數(shù)與算法(當(dāng)然也可以自己提取矩陣計(jì)算)左痢。難道能說作者發(fā)表的這幾篇Cell和Nature給大家推薦的流程不好嗎:
Hao, Hao, et al., Cell 2021 [Seurat v4]
Stuart, Butler, et al., Cell 2019 [Seurat v3]
Butler, et al., Nat Biotechnol 2018 [Seurat v2] Satija, Farrell, et al., Nat Biotechnol 2015 [Seurat v1]
再者說靡羡,scRNA-seq
比Bulk RNA-Seq
更加的稀疏系洛,將前者模擬為后者參與差異計(jì)算,其實(shí)也沒那么科學(xué)略步。當(dāng)然描扯,審稿人的觀點(diǎn)也不是全無道理,若能夠通過不同的算法得到相同的差異基因結(jié)果趟薄,的確有較高的說服力荆烈。
二、pseudo-bulk
差異分析走起
測(cè)試文件可以自行下載:
鏈接:https://pan.baidu.com/s/12dEGTJy4DnQ7gH2mbxCf-A?pwd=7qfm
提取碼:7qfm
2.1 數(shù)據(jù)載入
# 加載R包library(Seurat)
## 載入需要的程序包:SeuratObject
## 載入需要的程序包:sp
## ## 載入程序包:'SeuratObject'
## The following objects are masked from 'package:base':## ## intersect, t
# 讀取數(shù)據(jù):scRNA <- readRDS('test_data/T1D_scRNA.rds')# 這個(gè)數(shù)據(jù)包含24個(gè)樣本:unique(scRNA$sample)
## [1] "D_503" "H_120" "H_630" "H_3060" "D_609" "H_727" "H_4579" "D_504" ## [9] "H_3128" "H_7108" "D_502" "D_497" "D_506" "H_409" "H_6625" "D_610" ## [17] "D_501" "D_500" "H_4119" "H_1334" "D_498" "H_2928" "D_644" "D_505"
# 包含兩個(gè)組別的數(shù)據(jù):DimPlot(scRNA,split.by = 'Group')
2.2 差異計(jì)算
(1) pseudo-bulk差異計(jì)算
### 生成擬bulk 數(shù)據(jù) ###bulk <- AggregateExpression(scRNA, return.seurat = T, slot = "counts", assays = "RNA", group.by = c("cell_type", "sample", "Group")# 分別填寫細(xì)胞類型竟趾、樣本變量、分組變量的slot名稱 )
## Names of identity class contain underscores ('_'), replacing with dashes ('-')## Centering and scaling data matrix## ## This message is displayed once every 8 hours.
# 生成的是一個(gè)新的Seurat對(duì)象bulk
## An object of class Seurat ## 41056 features across 345 samples within 1 assay ## Active assay: RNA (41056 features, 0 variable features)## 3 layers present: counts, data, scale.data
我們可以像普通scRNA-seq
的Seurat
對(duì)象一樣宫峦,利用FindMarkers()
進(jìn)行差異分析岔帽,我們這里用celltype10
做演示。
# 取出celltype10對(duì)應(yīng)的對(duì)象:ct10.bulk <- subset(bulk, cell_type == "celltype10")# 改變默認(rèn)分類變量:Idents(ct10.bulk) <- "Group"# 下面的額計(jì)算依賴DESeq2导绷,做過Bulk RNA-Seq的同學(xué)都知道:if(!require(DESeq2))BiocManager::install('DESeq2')
## 載入需要的程序包:DESeq2
## 載入需要的程序包:S4Vectors
## 載入需要的程序包:stats4
## 載入需要的程序包:BiocGenerics
## ## 載入程序包:'BiocGenerics'
## The following object is masked from 'package:SeuratObject':## ## intersect
## The following objects are masked from 'package:stats':## ## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':## ## anyDuplicated, aperm, append, as.data.frame, basename, cbind,## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,## Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,## tapply, union, unique, unsplit, which.max, which.min
## ## 載入程序包:'S4Vectors'
## The following object is masked from 'package:utils':## ## findMatches
## The following objects are masked from 'package:base':## ## expand.grid, I, unname
## 載入需要的程序包:IRanges
## ## 載入程序包:'IRanges'
## The following object is masked from 'package:sp':## ## %over%
## The following object is masked from 'package:grDevices':## ## windows
## 載入需要的程序包:GenomicRanges
## 載入需要的程序包:GenomeInfoDb
## 載入需要的程序包:SummarizedExperiment
## 載入需要的程序包:MatrixGenerics
## 載入需要的程序包:matrixStats
## ## 載入程序包:'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':## ## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,## colWeightedMeans, colWeightedMedians, colWeightedSds,## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,## rowWeightedSds, rowWeightedVars
## 載入需要的程序包:Biobase
## Welcome to Bioconductor## ## Vignettes contain introductory material; view with## 'browseVignettes()'. To cite Bioconductor, see## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## ## 載入程序包:'Biobase'
## The following object is masked from 'package:MatrixGenerics':## ## rowMedians
## The following objects are masked from 'package:matrixStats':## ## anyMissing, rowMedians
## ## 載入程序包:'SummarizedExperiment'
## The following object is masked from 'package:Seurat':## ## Assays
## The following object is masked from 'package:SeuratObject':## ## Assays
# 差異計(jì)算:bulk_deg <- FindMarkers(ct10.bulk, ident.1 = "D", ident.2 = "H", # 這樣算出來的Fold Change就是D/H slot = "counts", test.use = "DESeq2",# 這里可以選擇其它算法 verbose = F# 關(guān)閉進(jìn)度提示 )
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
head(bulk_deg)# 看一下差異列表
## p_val avg_log2FC pct.1 pct.2 p_val_adj## ENSG00000047346 1.900955e-05 -1.1201537 0.917 1.000 0.780456## ENSG00000168685 7.606182e-05 -0.5817112 1.000 1.000 1.000000## ENSG00000131759 9.696591e-05 -2.1668759 0.667 1.000 1.000000## ENSG00000166750 2.136829e-04 -1.5473545 0.750 0.917 1.000000## ENSG00000163947 3.127113e-04 -0.9136378 0.917 1.000 1.000000## ENSG00000239713 4.090397e-04 -2.2337008 0.250 0.917 1.000000
如何寫循環(huán)計(jì)算所有細(xì)胞類型的差異基因犀勒,就留在這里當(dāng)習(xí)題啦。
(2)細(xì)胞水平的差異計(jì)算
# 整理分組變量:scRNA$CT_Group <- paste(scRNA$cell_type,scRNA$Group,sep = '_')# 查看新的分組變量:unique(scRNA$CT_Group)
## [1] "celltype12_D" "celltype3_H" "celltype13_H" "celltype6_H" "celltype0_D" ## [6] "celltype12_H" "celltype4_H" "celltype11_D" "celltype14_H" "celltype9_H" ## [11] "celltype11_H" "celltype2_H" "celltype0_H" "celltype7_H" "celltype14_D"## [16] "celltype1_D" "celltype4_D" "celltype1_H" "celltype8_H" "celltype3_D" ## [21] "celltype13_D" "celltype8_D" "celltype7_D" "celltype5_H" "celltype6_D" ## [26] "celltype15_H" "celltype2_D" "celltype5_D" "celltype10_H" "celltype9_D" ## [31] "celltype10_D" "celltype15_D"
# 差異計(jì)算:cell_deg <- FindMarkers(scRNA,ident.1 = 'celltype10_D',ident.2 = 'celltype10_H' ,group.by = 'CT_Group')# 同樣得到的是celltype10在D組 vs H組的結(jié)果
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,## (default method for FindMarkers) please install the presto package## --------------------------------------------## install.packages('devtools')## devtools::install_github('immunogenomics/presto')## --------------------------------------------## After installation of presto, Seurat will automatically use the more ## efficient implementation (no further action necessary).## This message will be shown once per session
(3)兩種算法的對(duì)比
library(dplyr)
## ## 載入程序包:'dplyr'
## The following object is masked from 'package:Biobase':## ## combine
## The following object is masked from 'package:matrixStats':## ## count
## The following objects are masked from 'package:GenomicRanges':## ## intersect, setdiff, union
## The following object is masked from 'package:GenomeInfoDb':## ## intersect
## The following objects are masked from 'package:IRanges':## ## collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':## ## first, intersect, rename, setdiff, setequal, union
## The following objects are masked from 'package:BiocGenerics':## ## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':## ## filter, lag
## The following objects are masked from 'package:base':## ## intersect, setdiff, setequal, union
# 先看一下兩種算法的顯著差異基因數(shù)量 bulk_sig <- filter(bulk_deg,p_val < 0.05)nrow(bulk_sig)
## [1] 308
cell_sig <- filter(cell_deg,p_val < 0.05)nrow(cell_sig)
## [1] 1494
可以看出妥曲,pseudo-bulk得到的差異基因數(shù)量要少很多贾费,畫一個(gè)韋恩圖看看二者交集
if(!require(VennDiagram))install.packages("VennDiagram")
## 載入需要的程序包:VennDiagram
## 載入需要的程序包:grid
## 載入需要的程序包:futile.logger
venn.plot <- venn.diagram( x = list(Bulk = rownames(bulk_sig), Cell = rownames(cell_sig)), category.names = c("Bulk DEG", "Single-Cell DEG"), filename = NULL, output = TRUE, main = "Venn Diagram of Significant Genes")grid.draw(venn.plot)
可以看出包含關(guān)系還是挺明顯的,那我們?cè)儆媒患虻?code>avg_log2FC做一個(gè)線性回歸看看兩次差異分析的相關(guān)性如何:
# 獲得兩次差異分析共同出現(xiàn)的基因:inter_gene <- intersect(rownames(bulk_sig),rownames(cell_sig))# 取出avg_log2FC整理為數(shù)據(jù)框data4plot <- data.frame(Bulk = bulk_sig[inter_gene,'avg_log2FC'], Cell = cell_sig[inter_gene,'avg_log2FC'] )# 線性回歸分析:lm.model <- lm(Bulk ~ Cell,data = data4plot)summary(lm.model)#看一下統(tǒng)計(jì)學(xué)參數(shù)
## ## Call:## lm(formula = Bulk ~ Cell, data = data4plot)## ## Residuals:## Min 1Q Median 3Q Max ## -1.01613 -0.24964 -0.04723 0.17148 2.19351 ## ## Coefficients:## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.17249 0.02757 -6.257 1.58e-09 ***## Cell 0.70081 0.01959 35.767 < 2e-16 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## Residual standard error: 0.4107 on 263 degrees of freedom## Multiple R-squared: 0.8295, Adjusted R-squared: 0.8288 ## F-statistic: 1279 on 1 and 263 DF, p-value: < 2.2e-16
mypara <- coefficients(lm.model)#得到截距和斜率a <- mypara[2]#斜率b <- mypara[1]#截距a <- round(a,2)#取兩位有效數(shù)字b <- round(b,2)library(ggplot2)library(ggpubr)
## ## 載入程序包:'ggpubr'
## The following object is masked from 'package:VennDiagram':## ## rotate
# 來個(gè)散點(diǎn)圖吧~lmplot <- ggplot( data4plot,aes(x=Bulk, y=Cell))+ geom_point(color="black")+ stat_smooth(method="lm",se=TRUE)+stat_cor(data=data4plot, method = "pearson")+#加上置信區(qū)間檐盟、R值褂萧、P值 ggtitle(label = paste(": y = ", a, " * x + ", b, sep = ""))+geom_rug()+#加上線性回歸方程 labs(x='Bulk DEG', y= 'single-cell DEG')lmplot
## `geom_smooth()` using formula = 'y ~ x'
R=0.91,那么R^2就是0.83葵萎,可以看出二者的相關(guān)性還是不錯(cuò)的导犹,就看能不能過審稿人這關(guān)啦。
大家有什么新的想法羡忘,歡迎在評(píng)論區(qū)留言~
環(huán)境信息
sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)## Platform: x86_64-w64-mingw32/x64## Running under: Windows 11 x64 (build 22631)## ## Matrix products: default## ## ## locale:## [1] LC_COLLATE=Chinese (Simplified)_China.utf8 ## [2] LC_CTYPE=Chinese (Simplified)_China.utf8 ## [3] LC_MONETARY=Chinese (Simplified)_China.utf8## [4] LC_NUMERIC=C ## [5] LC_TIME=Chinese (Simplified)_China.utf8 ## ## time zone: Asia/Shanghai## tzcode source: internal## ## attached base packages:## [1] grid stats4 stats graphics grDevices utils datasets ## [8] methods base ## ## other attached packages:## [1] ggpubr_0.6.0 ggplot2_3.5.1 ## [3] VennDiagram_1.7.3 futile.logger_1.4.3 ## [5] dplyr_1.1.4 DESeq2_1.44.0 ## [7] SummarizedExperiment_1.34.0 Biobase_2.64.0 ## [9] MatrixGenerics_1.16.0 matrixStats_1.4.1 ## [11] GenomicRanges_1.56.1 GenomeInfoDb_1.40.1 ## [13] IRanges_2.38.1 S4Vectors_0.42.1 ## [15] BiocGenerics_0.50.0 Seurat_5.1.0 ## [17] SeuratObject_5.0.2 sp_2.1-4 ## ## loaded via a namespace (and not attached):## [1] RcppAnnoy_0.0.22 splines_4.4.1 later_1.3.2 ## [4] tibble_3.2.1 polyclip_1.10-7 fastDummies_1.7.4 ## [7] lifecycle_1.0.4 rstatix_0.7.2 globals_0.16.3 ## [10] lattice_0.22-6 MASS_7.3-60.2 backports_1.5.0 ## [13] magrittr_2.0.3 plotly_4.10.4 sass_0.4.9 ## [16] rmarkdown_2.28 jquerylib_0.1.4 yaml_2.3.10 ## [19] httpuv_1.6.15 sctransform_0.4.1 spam_2.10-0 ## [22] spatstat.sparse_3.1-0 reticulate_1.39.0 cowplot_1.1.3 ## [25] pbapply_1.7-2 RColorBrewer_1.1-3 abind_1.4-5 ## [28] zlibbioc_1.50.0 Rtsne_0.17 purrr_1.0.2 ## [31] GenomeInfoDbData_1.2.12 ggrepel_0.9.6 irlba_2.3.5.1 ## [34] listenv_0.9.1 spatstat.utils_3.1-0 openintro_2.5.0 ## [37] airports_0.1.0 goftest_1.2-3 RSpectra_0.16-2 ## [40] spatstat.random_3.3-1 fitdistrplus_1.2-1 parallelly_1.38.0 ## [43] leiden_0.4.3.1 codetools_0.2-20 DelayedArray_0.30.1 ## [46] tidyselect_1.2.1 UCSC.utils_1.0.0 farver_2.1.2 ## [49] spatstat.explore_3.3-2 jsonlite_1.8.8 progressr_0.14.0 ## [52] ggridges_0.5.6 survival_3.6-4 tools_4.4.1 ## [55] ica_1.0-3 Rcpp_1.0.13 glue_1.7.0 ## [58] gridExtra_2.3 SparseArray_1.4.8 mgcv_1.9-1 ## [61] xfun_0.47 withr_3.0.1 formatR_1.14 ## [64] fastmap_1.2.0 fansi_1.0.6 digest_0.6.37 ## [67] R6_2.5.1 mime_0.12 colorspace_2.1-1 ## [70] scattermore_1.2 tensor_1.5 spatstat.data_3.1-2 ## [73] utf8_1.2.4 tidyr_1.3.1 generics_0.1.3 ## [76] data.table_1.16.0 usdata_0.3.1 httr_1.4.7 ## [79] htmlwidgets_1.6.4 S4Arrays_1.4.1 uwot_0.2.2 ## [82] pkgconfig_2.0.3 gtable_0.3.5 lmtest_0.9-40 ## [85] XVector_0.44.0 htmltools_0.5.8.1 carData_3.0-5 ## [88] dotCall64_1.1-1 scales_1.3.0 png_0.1-8 ## [91] spatstat.univar_3.0-1 knitr_1.48 lambda.r_1.2.4 ## [94] rstudioapi_0.16.0 tzdb_0.4.0 reshape2_1.4.4 ## [97] nlme_3.1-164 cachem_1.1.0 zoo_1.8-12 ## [100] stringr_1.5.1 KernSmooth_2.23-24 parallel_4.4.1 ## [103] miniUI_0.1.1.1 pillar_1.9.0 vctrs_0.6.5 ## [106] RANN_2.6.2 promises_1.3.0 car_3.1-2 ## [109] xtable_1.8-4 cluster_2.1.6 evaluate_0.24.0 ## [112] readr_2.1.5 cli_3.6.3 locfit_1.5-9.10 ## [115] compiler_4.4.1 futile.options_1.0.1 rlang_1.1.4 ## [118] crayon_1.5.3 future.apply_1.11.2 ggsignif_0.6.4 ## [121] labeling_0.4.3 plyr_1.8.9 stringi_1.8.4 ## [124] viridisLite_0.4.2 deldir_2.0-4 BiocParallel_1.38.0 ## [127] munsell_0.5.1 lazyeval_0.2.2 spatstat.geom_3.3-2 ## [130] Matrix_1.7-0 RcppHNSW_0.6.0 hms_1.1.3 ## [133] patchwork_1.2.0 future_1.34.0 shiny_1.9.1 ## [136] highr_0.11 ROCR_1.0-11 broom_1.0.6 ## [139] igraph_2.0.3 bslib_0.8.0 cherryblossom_0.1.0
歡迎致謝
如果以上內(nèi)容對(duì)你有幫助谎痢,歡迎在文章的Acknowledgement中加上這一段,聯(lián)系客服微信可以發(fā)放獎(jiǎng)勵(lì):
Since Biomamba and his wechat public account team produce bioinformatics tutorials and share code with annotation, we thank Biomamba for their guidance in bioinformatics and data analysis for the current study.
歡迎在發(fā)文/畢業(yè)時(shí)向我們分享你的喜悅~