一呼猪、背景
許多Functional Class Scoring (FCS)方法逃贝,如GSEA
, GSVA
,PLAGE
, addModuleScore
, SCSE
, Vision
, VAM
, gficf
, pagoda2
和Sargent
,都會受數(shù)據(jù)集組成的影響程剥。數(shù)據(jù)集組成的輕微變化將改變細(xì)胞的基因集富集分?jǐn)?shù)劝枣。假如將新的單細(xì)胞數(shù)據(jù)集整合到現(xiàn)有數(shù)據(jù)中,使用這些FCS方法需要重新計(jì)算每個(gè)細(xì)胞的基因集富集分?jǐn)?shù)织鲸。這個(gè)步驟可能是繁瑣且資源密集的
舔腾。相反,基于單個(gè)細(xì)胞表達(dá)等級的FCS昙沦,如AUCell
琢唾、UCell
载荔、singscore
盾饮、ssGSEA
、JASMINE
和Viper
懒熙,只需要計(jì)算新添加的單細(xì)胞數(shù)據(jù)集的富集分?jǐn)?shù)丘损,而無需重新計(jì)算所有細(xì)胞的基因集富集分?jǐn)?shù)。原因是這些方法生成的富集分?jǐn)?shù)僅依賴于單個(gè)細(xì)胞水平上的相對基因表達(dá)工扎,與數(shù)據(jù)集組成無關(guān)徘钥。因此,這些方法可以節(jié)省大量的時(shí)間肢娘。
二呈础、審視結(jié)果
在這里,我們審視了17種
常見的FCS方法:
-
GSEA
檢測排序基因列表頂部或底部的基因集富集程度橱健,該列表是分組后計(jì)算排序基因信噪比或排序基因倍數(shù)變化得到的而钞; -
GSVA
估計(jì)所有細(xì)胞之間每個(gè)基因的累積密度函數(shù)的核。 這個(gè)過程中需要考慮所有樣本拘荡,容易受到樣本背景信息的影響臼节; -
PLAGE
對跨細(xì)胞的基因表達(dá)矩陣進(jìn)行標(biāo)準(zhǔn)化,并提取奇異值分解作為基因集富集分?jǐn)?shù)珊皿; -
Zscore
聚合了基因集中所有基因的表達(dá)网缝,通過細(xì)胞間的平均值和標(biāo)準(zhǔn)差縮放表達(dá); -
AddModuleScore
需要先計(jì)算基因集中所有基因的平均值蟋定,再根據(jù)平均值把表達(dá)矩陣切割成若干份粉臊,然后從切割后的每一份中隨機(jī)抽取對照基因(基因集外的基因)作為背景值。因此驶兜,在整合不同樣本的情況下扼仲,即使使用相同基因集為相同細(xì)胞打分果元,也會產(chǎn)生不同的富集評分; -
SCSE
使用基因集所有基因的歸一化的總和來量化基因集富集分?jǐn)?shù)犀盟; -
Vision
使用隨機(jī)簽名的預(yù)期均值和方差對基因集富集分?jǐn)?shù)進(jìn)行 z 歸一化從而校正基因集富集分?jǐn)?shù)而晒; -
VAM
根據(jù)經(jīng)典Mahalanobis多元距離從單細(xì)胞 RNA 測序數(shù)據(jù)生成基因集富集分?jǐn)?shù); -
Gficf
利用通過非負(fù)矩陣分解獲得的基因表達(dá)值的潛在因子的信息生物信號阅畴; -
Pagoda2
擬合每個(gè)細(xì)胞的誤差模型倡怎,并使用其第一個(gè)加權(quán)主成分量化基因集富集分?jǐn)?shù); -
AUCell
基于單個(gè)樣本中的基因表達(dá)排名,使用曲線下面積來評估輸入基因集是否在單個(gè)樣本的前5%表達(dá)基因內(nèi)富集贱枣; -
UCell
基于單個(gè)樣本的基因表達(dá)排名监署,使用Mann-Whitney U統(tǒng)計(jì)量計(jì)算單個(gè)樣本的基因集富集分?jǐn)?shù); -
Singscore
根據(jù)基因表達(dá)等級評估距單個(gè)細(xì)胞中心的距離纽哥。 基因集中的基因根據(jù)單個(gè)細(xì)胞中的轉(zhuǎn)錄本豐度進(jìn)行排序钠乏。 平均等級相對于理論最小值和最大值單獨(dú)標(biāo)準(zhǔn)化,以零為中心春塌,然后聚合晓避,所得分?jǐn)?shù)代表基因集的富集分?jǐn)?shù); -
ssGSEA
根據(jù)每個(gè)細(xì)胞的基因表達(dá)等級計(jì)算內(nèi)部和外部基因集之間的經(jīng)驗(yàn)累積分布的差異分?jǐn)?shù)只壳。 使用全局表達(dá)譜對差異分?jǐn)?shù)進(jìn)行標(biāo)準(zhǔn)化俏拱。 標(biāo)準(zhǔn)化這一步容易受樣本構(gòu)成的影響。 -
JASMINE
根據(jù)在單個(gè)細(xì)胞中表達(dá)基因中的基因排名和表達(dá)基因中基因集的富集度計(jì)算近似平均值吼句。 這兩個(gè)值均標(biāo)準(zhǔn)化為 0-1 范圍锅必,并通過平均進(jìn)行組合,得出基因集的最終富集分?jǐn)?shù)惕艳。 -
Viper
通過根據(jù)細(xì)胞間基因表達(dá)的排名執(zhí)行three-tailed計(jì)算來估計(jì)基因集的富集分?jǐn)?shù)搞隐。 -
Sargent
將給定細(xì)胞的非零表達(dá)基因從高表達(dá)到低表達(dá)進(jìn)行排序,并將輸入的基因逐細(xì)胞表達(dá)矩陣轉(zhuǎn)換為相應(yīng)的gene-set-by-cell assignment score matrix远搪。 但Sargent
需要計(jì)算細(xì)胞間的gini-index后劣纲,將按gene-set-by-cell assignment score matrix轉(zhuǎn)換為distribution of indexes。
三终娃、工作流程
使用
AUCell味廊、UCell、singscore棠耕、ssgsea余佛、JASMINE 和 viper
分別對各個(gè)細(xì)胞進(jìn)行評分,得到不同的富集評分矩陣窍荧。通過wilcoxon檢驗(yàn)
計(jì)算不同的富集評分矩陣中每個(gè)細(xì)胞亞群差異表達(dá)的基因集辉巡。up或down
表示該細(xì)胞簇內(nèi)差異基因集的富集程度高于或低于其他簇。 單一的基因集富集分析方法不僅只能反映有限的信息蕊退,而且也容易帶來誤差
郊楣。我們期待從多個(gè)角度解釋復(fù)雜的生物學(xué)問題憔恳,并找到生物學(xué)問題中的共性部分。簡單地為多種基因集富集分析方法的結(jié)果取共同交集净蚤,不僅容易得到少而保守的結(jié)果钥组,而且忽略了富集分析方法中很多的其他信息,例如不同基因集的相對富集程度信息今瀑。我們希望目標(biāo)基因集在大部分富集分析方法中都是富集且富集程度沒有明顯差異
程梦。因此,我們通過RobustRankAggreg包中的秩聚合算法(robust rank aggregation, RRA)
對差異分析的結(jié)果進(jìn)行評估橘荠,篩選出在6種方法中表現(xiàn)出相似的富集程度的差異基因集屿附。
四、安裝
1.irGSEA安裝(基礎(chǔ)配置)
僅使用 AUCell, UCell, singscore, ssGSEA, JASMINE和viper
# install packages from CRAN
cran.packages <- c("aplot", "BiocManager", "data.table", "devtools",
"doParallel", "doRNG", "dplyr", "ggfun", "gghalves",
"ggplot2", "ggplotify", "ggridges", "ggsci", "irlba",
"magrittr", "Matrix", "msigdbr", "pagoda2", "pointr",
"purrr", "RcppML", "readr", "reshape2", "reticulate",
"rlang", "RMTstat", "RobustRankAggreg", "roxygen2",
"Seurat", "SeuratObject", "stringr", "tibble", "tidyr",
"tidyselect", "tidytree", "VAM")
for (i in cran.packages) {
if (!requireNamespace(i, quietly = TRUE)) {
install.packages(i, ask = F, update = F)
}
}
# install packages from Bioconductor
bioconductor.packages <- c("AUCell", "BiocParallel", "ComplexHeatmap",
"decoupleR", "fgsea", "ggtree", "GSEABase",
"GSVA", "Nebulosa", "scde", "singscore",
"SummarizedExperiment", "UCell",
"viper","sparseMatrixStats")
for (i in bioconductor.packages) {
if (!requireNamespace(i, quietly = TRUE)) {
BiocManager::install(i, ask = F, update = F)
}
}
# install packages from Github
if (!requireNamespace("irGSEA", quietly = TRUE)) {
devtools::install_github("chuiqin/irGSEA", force =T)
}
2.irGSEA安裝(進(jìn)階配置)
想使用VISION, gficf, Sargent, ssGSEApy, GSVApy等其他方法(這部分是可選安裝)
# VISION
if (!requireNamespace("VISION", quietly = TRUE)) {
devtools::install_github("YosefLab/VISION", force =T)
}
# mdt need ranger
if (!requireNamespace("ranger", quietly = TRUE)) {
devtools::install_github("imbs-hl/ranger", force =T)
}
# gficf need RcppML (version > 0.3.7) package
if (!utils::packageVersion("RcppML") > "0.3.7") {
message("The version of RcppML should greater than 0.3.7 and install RcppML package from Github")
devtools::install_github("zdebruine/RcppML", force =T)
}
# please first `library(RcppML)` if you want to perform gficf
if (!requireNamespace("gficf", quietly = TRUE)) {
devtools::install_github("gambalab/gficf", force =T)
}
# GSVApy and ssGSEApy need SeuratDisk package
if (!requireNamespace("SeuratDisk", quietly = TRUE)) {
devtools::install_github("mojaveazure/seurat-disk", force =T)
}
# sargent
if (!requireNamespace("sargent", quietly = TRUE)) {
devtools::install_github("Sanofi-Public/PMCB-Sargent", force =T)
}
# pagoda2 need scde package
if (!requireNamespace("scde", quietly = TRUE)) {
devtools::install_github("hms-dbmi/scde", force =T)
}
# if error1 (functio 'sexp_as_cholmod_sparse' not provided by package 'Matrix')
# or error2 (functio 'as_cholmod_sparse' not provided by package 'Matrix') occurs
# when you perform pagoda2, please check the version of irlba and Matrix
# It's ok when I test as follow:
# R 4.2.2 irlba(v 2.3.5.1) Matrix(1.5-3)
# R 4.3.1 irlba(v 2.3.5.1) Matrix(1.6-1.1)
# R 4.3.2 irlba(v 2.3.5.1) Matrix(1.6-3)
#### create conda env
# If error (Unable to find conda binary. Is Anaconda installed) occurs,
# please perform `reticulate::install_miniconda()`
if (! "irGSEA" %in% reticulate::conda_list()$name) {
reticulate::conda_create("irGSEA")
}
# if python package exist
python.package <- reticulate::py_list_packages(envname = "irGSEA")$package
require.package <- c("anndata", "scanpy", "argparse", "gseapy", "decoupler")
for (i in seq_along(require.package)) {
if (i %in% python.package) {
reticulate::conda_install(envname = "irGSEA", packages = i, pip = T)
}
}
3.國內(nèi)鏡像加速安裝
安裝github包和pip包有困難的參考這里哥童,我把github包克隆到了gitee上
options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor/")
options("repos" = c(CRAN="http://mirrors.cloud.tencent.com/CRAN/"))
# install packages from CRAN
cran.packages <- c("aplot", "BiocManager", "data.table", "devtools",
"doParallel", "doRNG", "dplyr", "ggfun", "gghalves",
"ggplot2", "ggplotify", "ggridges", "ggsci", "irlba",
"magrittr", "Matrix", "msigdbr", "pagoda2", "pointr",
"purrr", "RcppML", "readr", "reshape2", "reticulate",
"rlang", "RMTstat", "RobustRankAggreg", "roxygen2",
"Seurat", "SeuratObject", "stringr", "tibble", "tidyr",
"tidyselect", "tidytree", "VAM")
for (i in cran.packages) {
if (!requireNamespace(i, quietly = TRUE)) {
BiocManager::install(i, ask = F, update = F)
}
}
# install packages from Bioconductor
bioconductor.packages <- c("AUCell", "BiocParallel", "ComplexHeatmap",
"decoupleR", "fgsea", "ggtree", "GSEABase",
"GSVA", "Nebulosa", "scde", "singscore",
"SummarizedExperiment", "UCell", "viper")
for (i in bioconductor.packages) {
if (!requireNamespace(i, quietly = TRUE)) {
install.packages(i, ask = F, update = F)
}
}
# install packages from git
if (!requireNamespace("irGSEA", quietly = TRUE)) {
devtools::install_git("https://gitee.com/fan_chuiqin/irGSEA.git", force =T)
}
# VISION
if (!requireNamespace("VISION", quietly = TRUE)) {
devtools::install_git("https://gitee.com/fan_chuiqin/VISION.git", force =T)
}
# mdt need ranger
if (!requireNamespace("ranger", quietly = TRUE)) {
devtools::install_git("https://gitee.com/fan_chuiqin/ranger.git", force =T)
}
# gficf need RcppML (version > 0.3.7) package
if (!utils::packageVersion("RcppML") > "0.3.7") {
message("The version of RcppML should greater than 0.3.7 and install RcppML package from Git")
devtools::install_git("https://gitee.com/fan_chuiqin/RcppML.git", force =T)
}
# please first `library(RcppML)` if you want to perform gficf
if (!requireNamespace("gficf", quietly = TRUE)) {
devtools::install_git("https://gitee.com/fan_chuiqin/gficf.git", force =T)
}
# GSVApy and ssGSEApy need SeuratDisk package
if (!requireNamespace("SeuratDisk", quietly = TRUE)) {
devtools::install_git("https://gitee.com/fan_chuiqin/seurat-disk.git",
force =T)}
# sargent
if (!requireNamespace("sargent", quietly = TRUE)) {
devtools::install_git("https://gitee.com/fan_chuiqin/PMCB-Sargent.git",
force =T)}
# pagoda2 need scde package
if (!requireNamespace("scde", quietly = TRUE)) {
devtools::install_git("https://gitee.com/fan_chuiqin/scde.git", force =T)
}
#### create conda env
# If error (Unable to find conda binary. Is Anaconda installed) occurs,
# please perform `reticulate::install_miniconda()`
if (! "irGSEA" %in% reticulate::conda_list()$name) {
reticulate::conda_create("irGSEA")
}
# if python package exist
python.package <- reticulate::py_list_packages(envname = "irGSEA")$package
require.package <- c("anndata", "scanpy", "argparse", "gseapy", "decoupler")
for (i in require.package) {
if (! i %in% python.package) {
reticulate::conda_install(envname = "irGSEA", packages = i, pip = T,
pip_options = "-i https://pypi.tuna.tsinghua.edu.cn/simple")
}
}
五挺份、使用教程
1.irGSEA支持Seurat 對象(V5或V4),Assay對象(V5或V4)
# 我們通過SeuratData包加載示例數(shù)據(jù)集(注釋好的PBMC數(shù)據(jù)集)作為演示
#### Seurat V4對象 ####
library(Seurat)
library(SeuratData)
library(RcppML)
library(irGSEA)
data("pbmc3k.final")
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA",
slot = "data", seeds = 123, ncores = 1,
min.cells = 3, min.feature = 0,
custom = F, geneset = NULL, msigdb = T,
species = "Homo sapiens", category = "H",
subcategory = NULL, geneid = "symbol",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
aucell.MaxRank = NULL, ucell.MaxRank = NULL,
kcdf = 'Gaussian')
Assays(pbmc3k.final)
>[1] "RNA" "AUCell" "UCell" "singscore" "ssgsea" "JASMINE" "viper"
#### Seurat V5對象 ####
data("pbmc3k.final")
pbmc3k.final <- SeuratObject::UpdateSeuratObject(pbmc3k.final)
pbmc3k.final2 <- CreateSeuratObject(counts = CreateAssay5Object(GetAssayData(pbmc3k.final, assay = "RNA", slot = "counts")),
meta.data = pbmc3k.final[[]])
pbmc3k.final2 <- NormalizeData(pbmc3k.final2)
pbmc3k.final2 <- irGSEA.score(object = pbmc3k.final2,assay = "RNA",
slot = "data", seeds = 123, ncores = 1,
min.cells = 3, min.feature = 0,
custom = F, geneset = NULL, msigdb = T,
species = "Homo sapiens", category = "H",
subcategory = NULL, geneid = "symbol",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
aucell.MaxRank = NULL, ucell.MaxRank = NULL,
kcdf = 'Gaussian')
Assays(pbmc3k.final2)
[1] "RNA" "AUCell" "UCell" "singscore" "ssgsea" "JASMINE" "viper"
#### Assay5 對象 ####
data("pbmc3k.final")
pbmc3k.final <- SeuratObject::UpdateSeuratObject(pbmc3k.final)
pbmc3k.final3 <- CreateAssay5Object(counts = GetAssayData(pbmc3k.final, assay = "RNA", slot = "counts"))
pbmc3k.final3 <- NormalizeData(pbmc3k.final3)
pbmc3k.final3 <- irGSEA.score(object = pbmc3k.final3,assay = "RNA",
slot = "counts", seeds = 123, ncores = 1,
min.cells = 3, min.feature = 0,
custom = F, geneset = NULL, msigdb = T,
species = "Homo sapiens", category = "H",
subcategory = NULL, geneid = "symbol",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
aucell.MaxRank = NULL, ucell.MaxRank = NULL,
kcdf = 'Gaussian')
Assays(pbmc3k.final3)
# Assay5對象存放在RNA assay中
> cat(object.size(pbmc3k.final) / (1024^2), "M")
339.955 M
> cat(object.size(pbmc3k.final2) / (1024^2), "M")
69.33521 M
> cat(object.size(pbmc3k.final3) / (1024^2), "M")
69.27851 M
# 看起來Seurat V5和Assay 5確實(shí)存放數(shù)據(jù)會比較小
2.irGSEA支持的基因集打分方法
測試了不同數(shù)據(jù)大小下各種評分方法使用50個(gè)Hallmark基因集進(jìn)行打分所需的時(shí)間和內(nèi)存峰值贮懈, 大家根據(jù)自己的電腦和時(shí)間進(jìn)行酌情選擇匀泊;
GSVApy、ssGSEApy 和 viperpy 分別代表 GSVA错邦、ssGSEA 和 viper 的 Python 版本探赫。Python版本的GSVA比R版本的GSVA節(jié)約太多時(shí)間了型宙。
我們對singscore撬呢、ssGSEA、JASMINE妆兑、viper的內(nèi)存峰值進(jìn)行了優(yōu)化
魂拦。 對于超過 50000 個(gè)細(xì)胞的數(shù)據(jù)集,我們實(shí)施了一種策略搁嗓,將它們劃分為5000 個(gè)細(xì)胞/單元進(jìn)行評分芯勘。 雖然這可以緩解內(nèi)存峰值問題,但確實(shí)會延長處理時(shí)間腺逛。
3.irGSEA支持的基因集打分方法
為了方便用戶獲取MSigDB數(shù)據(jù)庫中預(yù)先定義好的基因集荷愕,我們內(nèi)置了msigdbr包
進(jìn)行MSigDB的基因集數(shù)據(jù)的獲取。msigdbr包
支持多個(gè)物種的基因集獲取棍矛,以及多種基因格式
的表達(dá)矩陣的輸入安疗。
①.irGSEA通過內(nèi)置的msigdbr包進(jìn)行打分
library(Seurat)
library(SeuratData)
library(RcppML)
library(irGSEA)
data("pbmc3k.final")
#### Hallmark基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA",
slot = "data", seeds = 123, ncores = 1,
min.cells = 3, min.feature = 0,
custom = F, geneset = NULL, msigdb = T,
species = "Homo sapiens", category = "H",
subcategory = NULL, geneid = "symbol",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
aucell.MaxRank = NULL, ucell.MaxRank = NULL,
kcdf = 'Gaussian')
#### KEGG基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA",
slot = "data", seeds = 123, ncores = 1,
min.cells = 3, min.feature = 0,
custom = F, geneset = NULL, msigdb = T,
species = "Homo sapiens", category = "C2",
subcategory = "CP:KEGG", geneid = "symbol",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
aucell.MaxRank = NULL, ucell.MaxRank = NULL,
kcdf = 'Gaussian')
#### GO-BP基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA",
slot = "data", seeds = 123, ncores = 1,
min.cells = 3, min.feature = 0,
custom = F, geneset = NULL, msigdb = T,
species = "Homo sapiens", category = "C5",
subcategory = "GO:BP", geneid = "symbol",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
aucell.MaxRank = NULL, ucell.MaxRank = NULL,
kcdf = 'Gaussian')
②.irGSEA使用最新版的MSigDB基因集進(jìn)行打分
遺憾的是,
msigdbr包
內(nèi)置的MSigDB的版本是MSigDB 2022.1
够委。然而荐类,目前MSigDB數(shù)據(jù)庫已經(jīng)更新到了2023.2
,包括2023.2.Hs
和2023.2.Mm
茁帽。我們可以從這個(gè)鏈接(https://data.broadinstitute.org/gsea-msigdb/msigdb/release/)下載2023.2.Hs
的gmt文件
或者db.zip文件
玉罐。相比gmt文件
屈嗤,db.zip文件
包含了基因集的描述,可以用來篩選XX功能相關(guān)基因吊输。下面的例子中饶号,我將介紹如何篩選血管生成相關(guān)的基因集
。
#### work with newest Msigdb ####
# https://data.broadinstitute.org/gsea-msigdb/msigdb/release/
# In this page, you can download human/mouse gmt file or db.zip file
# The db.zip file contains metadata information for the gene set
# load library
library(clusterProfiler)
library(tidyverse)
library(DBI)
library(RSQLite)
### db.zip ###
# download zip file and unzip zip file
zip_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.2.Hs/msigdb_v2023.2.Hs.db.zip"
local_zip_path <- "./msigdb_v2023.2.Hs.db.zip"
download.file(zip_url, local_zip_path)
unzip(local_zip_path, exdir = "./")
# code modified by https://rdrr.io/github/cashoes/sear/src/data-raw/1_parse_msigdb_sqlite.r
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = './msigdb_v2023.2.Hs.db')
DBI::dbListTables(con)
# define tables we want to combine
geneset_db <- dplyr::tbl(con, 'gene_set') # standard_name, collection_name
details_db <- dplyr::tbl(con, 'gene_set_details') # description_brief, description_full
geneset_genesymbol_db <- dplyr::tbl(con, 'gene_set_gene_symbol') # meat and potatoes
genesymbol_db <- dplyr::tbl(con, 'gene_symbol') # mapping from ids to gene symbols
collection_db <- dplyr::tbl(con, 'collection') %>% dplyr::select(collection_name, full_name) # collection metadata
# join tables
msigdb <- geneset_db %>%
dplyr::left_join(details_db, by = c('id' = 'gene_set_id')) %>%
dplyr::left_join(collection_db, by = 'collection_name') %>%
dplyr::left_join(geneset_genesymbol_db, by = c('id' = 'gene_set_id')) %>%
dplyr::left_join(genesymbol_db, by = c('gene_symbol_id' = 'id')) %>%
dplyr::select(collection = collection_name, subcollection = full_name, geneset = standard_name, description = description_brief, symbol) %>%
dplyr::as_tibble()
# clean up
DBI::dbDisconnect(con)
unique(msigdb$collection)
# [1] "C1" "C2:CGP" "C2:CP:BIOCARTA"
# [4] "C2:CP:KEGG_LEGACY" "C2:CP:PID" "C3:MIR:MIRDB"
# [7] "C3:MIR:MIR_LEGACY" "C3:TFT:GTRD" "C3:TFT:TFT_LEGACY"
# [10] "C4:3CA" "C4:CGN" "C4:CM"
# [13] "C6" "C7:IMMUNESIGDB" "C7:VAX"
# [16] "C8" "C5:GO:BP" "C5:GO:CC"
# [19] "C5:GO:MF" "H" "C5:HPO"
# [22] "C2:CP:KEGG_MEDICUS" "C2:CP:REACTOME" "C2:CP:WIKIPATHWAYS"
# [25] "C2:CP"
unique(msigdb$subcollection)
# [1] "C1" "C2:CGP" "C2:CP:BIOCARTA"
# [4] "C2:CP:KEGG_LEGACY" "C2:CP:PID" "C3:MIR:MIRDB"
# [7] "C3:MIR:MIR_LEGACY" "C3:TFT:GTRD" "C3:TFT:TFT_LEGACY"
# [10] "C4:3CA" "C4:CGN" "C4:CM"
# [13] "C6" "C7:IMMUNESIGDB" "C7:VAX"
# [16] "C8" "C5:GO:BP" "C5:GO:CC"
# [19] "C5:GO:MF" "H" "C5:HPO"
# [22] "C2:CP:KEGG_MEDICUS" "C2:CP:REACTOME" "C2:CP:WIKIPATHWAYS"
# [25] "C2:CP"
# convert to list[Hallmark] required by irGSEA package
msigdb.h <- msigdb %>%
dplyr::filter(collection=="H") %>%
dplyr::select(c("geneset", "symbol"))
msigdb.h$geneset <- factor(msigdb.h$geneset)
msigdb.h <- msigdb.h %>%
dplyr::group_split(geneset, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>%
purrr::set_names(levels(msigdb.h$geneset))
#### Hallmark基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = msigdb.h,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
# convert to list[go bp] required by irGSEA package
msigdb.go.bp <- msigdb %>%
dplyr::filter(collection=="C5:GO:BP") %>%
dplyr::select(c("geneset", "symbol"))
msigdb.go.bp$geneset <- factor(msigdb.go.bp$geneset)
msigdb.go.bp <- msigdb.go.bp %>%
dplyr::group_split(geneset, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>%
purrr::set_names(levels(msigdb.go.bp$geneset))
#### GO-BP基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = msigdb.go.bp,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
# convert to list[KEGG] required by irGSEA package
msigdb.kegg <- msigdb %>%
dplyr::filter(collection=="C2:CP:KEGG_MEDICUS") %>%
dplyr::select(c("geneset", "symbol"))
msigdb.kegg$geneset <- factor(msigdb.kegg$geneset)
msigdb.kegg <- msigdb.kegg %>%
dplyr::group_split(geneset, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>%
purrr::set_names(levels(msigdb.kegg$geneset))
#### KEGG基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = msigdb.kegg,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
# Look for the gene sets associated with angiogenesis from gene sets names and
# gene sets descriptions
category <- c("angiogenesis", "vessel")
msigdb.vessel <- list()
for (i in category) {
# Ignore case matching
find.index.description <- stringr::str_detect(msigdb$description, pattern = regex(all_of(i), ignore_case=TRUE))
find.index.name <- stringr::str_detect(msigdb$geneset, pattern = regex(all_of(i), ignore_case=TRUE))
msigdb.vessel[[i]] <- msigdb[find.index.description | find.index.name, ] %>% mutate(category = i)
}
msigdb.vessel <- do.call(rbind, msigdb.vessel)
head(msigdb.vessel)
# # A tibble: 6 × 6
# collection subcollection geneset description symbol category
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … HECW1 angioge…
# 2 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … JADE2 angioge…
# 3 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … SEMA3C angioge…
# 4 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … STUB1 angioge…
# 5 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … FAH angioge…
# 6 C2:CGP Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … COL7A1 angioge…
length(unique(msigdb.vessel$geneset))
# [1] 112
# convert gene sets associated with angiogenesis to list
# required by irGSEA package
msigdb.vessel <- msigdb.vessel %>%
dplyr::select(c("geneset", "symbol"))
msigdb.vessel$geneset <- factor(msigdb.vessel$geneset)
msigdb.vessel <- msigdb.vessel %>%
dplyr::group_split(geneset, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>%
purrr::set_names(levels(msigdb.vessel$geneset))
#### 血管生成相關(guān)基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = msigdb.vessel,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
### gmt file ###
# download gmt file
gmt_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.2.Hs/msigdb.v2023.2.Hs.symbols.gmt"
local_gmt <- "./msigdb.v2023.2.Hs.symbols.gmt"
download.file(gmt_url , local_gmt)
msigdb <- clusterProfiler::read.gmt("./msigdb.v2023.2.Hs.symbols.gmt")
# convert to list[hallmarker] required by irGSEA package
msigdb.h <- msigdb %>%
dplyr::filter(str_detect(term, pattern = regex("HALLMARK_", ignore_case=TRUE)))
msigdb.h$term <- factor(msigdb.h$term)
msigdb.h <- msigdb.h %>%
dplyr::group_split(term, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(gene) %>% unique(.)) %>%
purrr::set_names(levels(msigdb.h$term))
#### Hallmark基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = msigdb.h,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
# convert to list[go bp] required by irGSEA package
msigdb.go.bp <- msigdb %>%
dplyr::filter(str_detect(term, pattern = regex("GOBP_", ignore_case=TRUE)))
msigdb.go.bp$term <- factor(msigdb.go.bp$term)
msigdb.go.bp <- msigdb.go.bp %>%
dplyr::group_split(term, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(gene) %>% unique(.)) %>%
purrr::set_names(levels(msigdb.go.bp$term))
#### GO-BP基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = msigdb.go.bp,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
# convert to list[KEGG] required by irGSEA package
msigdb.kegg <- msigdb %>%
dplyr::filter(str_detect(term, pattern = regex("KEGG_", ignore_case=TRUE)))
msigdb.kegg$term <- factor(msigdb.kegg$term)
msigdb.kegg <- msigdb.kegg %>%
dplyr::group_split(term, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(gene) %>% unique(.)) %>%
purrr::set_names(levels(msigdb.kegg$term))
#### KEGG基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = msigdb.kegg,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
③.irGSEA使用clusterProfiler包同款基因集進(jìn)行打分
#### work with clusterProfiler package ####
# load library
library(clusterProfiler)
library(tidyverse)
### kegg ###
# download kegg pathway (human) and write as gson file
kk <- clusterProfiler::gson_KEGG(species = "hsa")
gson::write.gson(kk, file = "./KEGG_20231128.gson")
# read gson file
kk2 <- gson::read.gson("./KEGG_20231123.gson")
# Convert to a data frame
kegg.list <- dplyr::left_join(kk2@gsid2name,
kk2@gsid2gene,
by = "gsid")
head(kegg.list)
# gsid name gene
# 1 hsa01100 Metabolic pathways 10
# 2 hsa01100 Metabolic pathways 100
# 3 hsa01100 Metabolic pathways 10005
# 4 hsa01100 Metabolic pathways 10007
# 5 hsa01100 Metabolic pathways 100137049
# 6 hsa01100 Metabolic pathways 10020
# Convert gene ID to gene symbol
gene_name <- clusterProfiler::bitr(kegg.list$gene,
fromType = "ENTREZID",
toType = "SYMBOL",
OrgDb = "org.Hs.eg.db")
kegg.list <- dplyr::full_join(kegg.list,
gene_name,
by = c("gene"="ENTREZID"))
# remove NA value if exist
kegg.list <- kegg.list[complete.cases(kegg.list[, c("gene", "SYMBOL")]), ]
head(kegg.list)
# gsid name gene SYMBOL
# 1 hsa01100 Metabolic pathways 10 NAT2
# 2 hsa01100 Metabolic pathways 100 ADA
# 3 hsa01100 Metabolic pathways 10005 ACOT8
# 4 hsa01100 Metabolic pathways 10007 GNPDA1
# 5 hsa01100 Metabolic pathways 100137049 PLA2G4B
# 6 hsa01100 Metabolic pathways 10020 GNE
# convert to list required by irGSEA package
kegg.list$name <- factor(kegg.list$name)
kegg.list <- kegg.list %>%
dplyr::group_split(name, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(SYMBOL) %>% unique(.)) %>%
purrr::set_names(levels(kegg.list$name))
head(kegg.list)
### 整理好之后就可以進(jìn)行KEGG打分
#### KEGG基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = kegg.list,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
### go bp ###
# download go bp (human) and write as gson file
go <- clusterProfiler::gson_GO(OrgDb = "org.Hs.eg.db", ont = "BP")
gson::write.gson(go, file = "./go_20231128.gson")
# read gson file
go2 <- gson::read.gson("./go_20231128.gson")
# Convert to a data frame
go.list <- dplyr::left_join(go2@gsid2name,
go2@gsid2gene,
by = "gsid")
head(go.list)
# gsid name gene
# 1 GO:0000001 mitochondrion inheritance <NA>
# 2 GO:0000002 mitochondrial genome maintenance 142
# 3 GO:0000002 mitochondrial genome maintenance 291
# 4 GO:0000002 mitochondrial genome maintenance 1763
# 5 GO:0000002 mitochondrial genome maintenance 1890
# 6 GO:0000002 mitochondrial genome maintenance 2021
# Convert gene ID to gene symbol
go.list <- dplyr::full_join(go.list,
go2@gene2name,
by = c("gene"="ENTREZID"))
# remove NA value if exist
go.list <- go.list[complete.cases(go.list[, c("gene", "SYMBOL")]), ]
head(go.list)
# gsid name gene SYMBOL
# 2 GO:0000002 mitochondrial genome maintenance 142 PARP1
# 3 GO:0000002 mitochondrial genome maintenance 291 SLC25A4
# 4 GO:0000002 mitochondrial genome maintenance 1763 DNA2
# 5 GO:0000002 mitochondrial genome maintenance 1890 TYMP
# 6 GO:0000002 mitochondrial genome maintenance 2021 ENDOG
# 7 GO:0000002 mitochondrial genome maintenance 3980 LIG3
# convert to list required by irGSEA package
go.list$name <- factor(go.list$name)
go.list <- go.list %>%
dplyr::group_split(name, .keep = F) %>%
purrr::map( ~.x %>% dplyr::pull(SYMBOL) %>% unique(.)) %>%
purrr::set_names(levels(go.list$name))
head(go.list)
#### GO-BP基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
custom = T, geneset = go.list,
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
kcdf = 'Gaussian')
3.irGSEA的完整流程
library(Seurat)
library(SeuratData)
library(RcppML)
library(irGSEA)
data("pbmc3k.final")
# 基因集打分
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA",
slot = "data", seeds = 123, ncores = 1,
min.cells = 3, min.feature = 0,
custom = F, geneset = NULL, msigdb = T,
species = "Homo sapiens", category = "H",
subcategory = NULL, geneid = "symbol",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
aucell.MaxRank = NULL, ucell.MaxRank = NULL,
kcdf = 'Gaussian')
# 計(jì)算差異基因集季蚂,并進(jìn)行RRA
# 如果報(bào)錯讨韭,考慮加句代碼options(future.globals.maxSize = 100000 * 1024^5)
result.dge <- irGSEA.integrate(object = pbmc3k.final,
group.by = "seurat_annotations",
method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"))
# 查看RRA識別的在多種打分方法中都普遍認(rèn)可的差異基因集
geneset.show <- result.dge$RRA %>%
dplyr::filter(pvalue <= 0.05) %>%
dplyr::pull(Name) %>% unique(.)
4.可視化展示
1). 全局展示
①.熱圖
你還可以把method從'RRA"換成“ssgsea”,展示特定基因集富集分析方法中差異上調(diào)或差異下調(diào)的基因集癣蟋;
irGSEA.heatmap.plot <- irGSEA.heatmap(object = result.dge,
method = "RRA",
top = 50,
show.geneset = NULL)
irGSEA.heatmap.plot
默認(rèn)展示前50透硝,你也可以展示你想展示的基因集。
例如疯搅,我想展示RRA識別的差異基因集濒生。
irGSEA.heatmap.plot1 <- irGSEA.heatmap(object = result.dge,
method = "RRA",
show.geneset = geneset.show)
irGSEA.heatmap.plot1
②.氣泡圖
irGSEA.bubble.plot <- irGSEA.bubble(object = result.dge,
method = "RRA",
show.geneset = geneset.show)
irGSEA.bubble.plot
③.upset plot
upset圖展示了綜合評估中每個(gè)細(xì)胞亞群具有統(tǒng)計(jì)學(xué)意義差異的基因集的數(shù)目,以及不同細(xì)胞亞群之間具有交集的差異基因集數(shù)目幔欧;左邊不同顏色的條形圖代表不同的細(xì)胞亞群罪治;上方的條形圖代表具有交集的差異基因集的數(shù)目;中間的氣泡圖單個(gè)點(diǎn)代表單個(gè)細(xì)胞亞群礁蔗,多個(gè)點(diǎn)連線代表多個(gè)細(xì)胞亞群取交集()這里只展示兩兩取交集觉义;
irGSEA.upset.plot <- irGSEA.upset(object = result.dge,
method = "RRA",
mode = "intersect",
upset.width = 20,
upset.height = 10,
set.degree = 2,
pt_size = grid::unit(2, "mm"))
irGSEA.upset.plot
④.堆疊條形圖
堆疊柱狀圖具體展示每種基因集富集分析方法中每種細(xì)胞亞群中上調(diào)、下調(diào)和沒有統(tǒng)計(jì)學(xué)差異的基因集數(shù)目浴井;上方的條形代表每個(gè)亞群中不同方法中差異的基因數(shù)目晒骇,紅色代表上調(diào)的差異基因集,藍(lán)色代表下調(diào)的差異基因集磺浙;中間的柱形圖代表每個(gè)亞群中不同方法中上調(diào)洪囤、下調(diào)和沒有統(tǒng)計(jì)學(xué)意義的基因集的比例;
irGSEA.barplot.plot <- irGSEA.barplot(object = result.dge,
method = c("AUCell", "UCell", "singscore",
"ssgsea", "JASMINE", "viper", "RRA"))
irGSEA.barplot.plot
2). 局部展示
①.密度散點(diǎn)圖
密度散點(diǎn)圖將基因集的富集分?jǐn)?shù)和細(xì)胞亞群在低維空間的投影結(jié)合起來撕氧,展示了特定基因集在空間上的表達(dá)水平瘤缩。其中,顏色越黃伦泥,密度分?jǐn)?shù)越高剥啤,代表富集分?jǐn)?shù)越高;
scatterplot <- irGSEA.density.scatterplot(object = pbmc3k.final,
method = "UCell",
show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE",
reduction = "umap")
scatterplot
②.半小提琴圖
半小提琴圖同時(shí)以小提琴圖(左邊)和箱線圖(右邊)進(jìn)行展示不脯。不同顏色代表不同的細(xì)胞亞群府怯;
halfvlnplot <- irGSEA.halfvlnplot(object = pbmc3k.final,
method = "UCell",
show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")
halfvlnplot
除此之外,還可以
vlnplot <- irGSEA.vlnplot(object = pbmc3k.final,
method = c("AUCell", "UCell", "singscore", "ssgsea",
"JASMINE", "viper"),
show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")
vlnplot
③.山巒圖
山巒圖中上方的核密度曲線展示了數(shù)據(jù)的主要分布跨新,下方的條形編碼圖展示了細(xì)胞亞群具體的數(shù)量富腊。不同顏色代表不同的細(xì)胞亞群,而橫坐標(biāo)代表不同的表達(dá)水平域帐;
ridgeplot <- irGSEA.ridgeplot(object = pbmc3k.final,
method = "UCell",
show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")
ridgeplot
④.密度熱圖
密度熱圖展示了具體差異基因在不同細(xì)胞亞群中的表達(dá)和分布水平赘被。顏色越紅是整,代表富集分?jǐn)?shù)越高;
densityheatmap <- irGSEA.densityheatmap(object = pbmc3k.final,
method = "UCell",
show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")
densityheatmap
六民假、參考文獻(xiàn)
'AUCell (https://doi.org/10.1038/nmeth.4463)'浮入;
'UCell (https://doi.org/10.1016/j.csbj.2021.06.043)':;
'singscore (https://doi.org/10.1093/nar/gkaa802)'羊异;
'ssgsea (https://doi.org/10.1038/nature08460)'事秀;
'JASMINE (https://doi.org/10.7554/eLife.71994)'
'VAM (https://doi.org/10.1093/nar/gkaa582)';
'scSE (https://doi.org/10.1093/nar/gkz601)'野舶;
'VISION (https://doi.org/10.1038/s41467-019-12235-0)'易迹;
'wsum (https://doi.org/10.1093/bioadv/vbac016)';
'wmean (https://doi.org/10.1093/bioadv/vbac016)'平道;
'mdt (https://doi.org/10.1093/bioadv/vbac016)'睹欲;
'viper (https://doi.org/10.1038/ng.3593)';
'GSVApy (https://doi.org/10.1038/ng.3593)': 一屋;
'gficf (https://doi.org/10.1093/nargab/lqad024)'窘疮;
'GSVA (https://doi.org/10.1186/1471-2105-14-7)';
'zscore (https://doi.org/10.1371/journal.pcbi.1000217)'冀墨;
'plage (https://doi.org/10.1186/1471-2105-6-225)'闸衫;
'ssGSEApy (https://doi.org/10.1093/bioinformatics/btac757)';
'viperpy (https://doi.org/10.1093/bioinformatics/btac757)'诽嘉;
'AddModuleScore (https://doi.org/10.1126/science.aad0501)'蔚出;
'pagoda2 (https://doi.org/10.1038/nbt.4038)':;
'Sargent (https://doi.org/10.1016/j.mex.2023.102196)'含懊;