說在前面
在國內(nèi)做生信的小伙伴或多或少都聽過一個大名鼎鼎的名字---Y叔笨腥,想當(dāng)初小編初入生信大門時還專門百度查了一下拓哺,他就是南方醫(yī)科大學(xué)生物信息學(xué)系主任余光創(chuàng)教授。Y叔在國際也有很高的知名度脖母,以至于ClusterProfiler 4在發(fā)出后有很多國際知名生信學(xué)家評論士鸥。
生物信息學(xué)軟件很多都死在故紙堆里,發(fā)表文章之后谆级,就走向毀滅烤礁,這叫publish and perish。clusterProfiler之所以能夠成功肥照,因為Y叔對它充滿感情脚仔,10年如一日地維護,但是自2012年發(fā)表一篇文章以來舆绎,雖然軟件一直在更新鲤脏,但相應(yīng)的文章一直沒有再發(fā)一版。時隔9年,Y叔終于再出一篇文章猎醇,而且是把論文寫在祖國大地上窥突,clusterProfiler 4.0發(fā)表在中科院青促會主辦的期刊The Innovation。下面附上文章信息:T Wu#, E Hu#, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141. doi: 10.1016/j.xinn.2021.100141硫嘶。
截至目前阻问,第一版的ClusterProfiler已經(jīng)被引用了5000次,文章在G Yu, LG Wang, Y Han and QY He*. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287. doi: 10.1089/omi.2011.0118沦疾,國內(nèi)很難再有一個單獨R包可以達到這個高度称近。那么,新版的ClusterProfiler 4.0又給我們帶來什么樣的驚喜你額曹鸠,今天小編就來簡單介紹一下這個偉大R包的新版本。
GO富集分析
ClusterProfiler最經(jīng)典斥铺、最重要的一個功能就是GO富集分析了彻桃,新版的GO分析有g(shù)roupGO(),enricogo()和gseGO()三個函數(shù)晾蜘,而且支持OrgDb收錄的所有物種邻眷。下面使用DOSE包中的geneList作為示例進行演示
library(clusterProfiler)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
head(gene)
## [1] "4312" "8318" "10874" "55143" "55388" "991"
ego <- enrichGO(gene = gene, universe = names(geneList), OrgDb = org.Hs.eg.db, ont = "CC", pAdjustMethod = "BH", pvalueCutoff = 0.01, qvalueCutoff = 0.05, readable = TRUE)
#將輸入基因轉(zhuǎn)為Symbolhead(ego)#因為有一些Symbol對應(yīng)多個ENSEMBL,可以直接用ID進行富集分析gene.df <- bitr(gene, fromType = "ENTREZID", toType = c("ENSEMBL", "SYMBOL"), OrgDb = org.Hs.eg.db)
ego2 <- enrichGO(gene = gene.df$ENSEMBL, OrgDb = org.Hs.eg.db, keyType = 'ENSEMBL', ont = "CC", pAdjustMethod = "BH", pvalueCutoff = 0.01, qvalueCutoff = 0.05)head(ego2, 3)
#下面就是新增的GSEA富集分析了剔交,需要同時輸入基因名和變化倍數(shù)ego3 <- gseGO(geneList = geneList, OrgDb = org.Hs.eg.db, ont = "CC", minGSSize = 100,#限制最少基因數(shù) maxGSSize = 500,#限制最多基因數(shù) pvalueCutoff = 0.05, verbose = FALSE)
#最后就是將富集的通路做一個有向無環(huán)圖進行可視化了
goplot(ego)
從這個圖中我們可以去除冗余通路的信息肆饶,找到關(guān)鍵通路的上下游,從而更準確的定位到關(guān)鍵信號通路岖常。
KEGG富集分析
KEGG相對于GO更加準確囊括了最關(guān)鍵的通路信息驯镊,在很多情況下都是我們最想關(guān)注的通路信號,下面就演示一下KEGG的通路富集分析竭鞍。
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
kk <- enrichKEGG(gene = gene, organism = 'hsa', pvalueCutoff = 0.05)
head(kk)
kk2 <- gseKEGG(geneList = geneList, organism = 'hsa', minGSSize = 120, pvalueCutoff = 0.05, verbose = FALSE)
head(kk2)
#上面兩個是和GO富集分析一樣的常規(guī)操作板惑,對于KEGG,還增添了對通路的模塊分析偎快,這對揭示一些功能單元有更直接的效果冯乘。
mkk <- enrichMKEGG(gene = gene, organism = 'hsa', pvalueCutoff = 1, qvalueCutoff = 1)
head(mkk)
mkk2 <- gseMKEGG(geneList = geneList, organism = 'hsa', pvalueCutoff = 1)
head(mkk2)
#KEGG富集分析的可視化
browseKEGG(kk, 'hsa04110')
#BiocManager::install("pathview")
library("pathview")
hsa04110 <- pathview(gene.data = geneList, pathway.id = "hsa04110", species = "hsa", limit = list(gene=max(abs(geneList)), cpd=1))
比如我們想研究hsa04110(Cell cycle)這個通路,通過對關(guān)鍵通路進行可視化我們可以找到差異變化的基因在通路中的位置和作用晒夹。
其它通路富集分析
GO和KEGG是我們最常用的基因集裆馒,但是還有一些其它揭示細胞某些特征的基因集,如:WikiPathways丐怯,Reactome,Mesh和Disease等喷好,在新版本的clusterProfiler 4都可以基于以上基因集進行富集分析。
#------------------WikiPathways----------------
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
enrichWP(gene, organism = "Homo sapiens")
gseWP(geneList, organism = "Homo sapiens")
#---------------------Reactome------------------
library(ReactomePA)
data(geneList, package="DOSE")
de <- names(geneList)[abs(geneList) > 1.5]
head(de)
x <- enrichPathway(gene=de, pvalueCutoff = 0.05, readable=TRUE)
y <- gsePathway(geneList, pvalueCutoff = 0.2, pAdjustMethod = "BH", verbose = FALSE)
#可視化
viewPathway("E2F mediated regulation of DNA replication", readable = TRUE, foldChange = geneList)
#---------------------Mesh------------------
library(meshes)
data(geneList, package="DOSE")
de <- names(geneList)[1:100]
x <- enrichMeSH(de, MeSHDb = db, database='gendoo', category = 'C')
y <- gseMeSH(geneList, MeSHDb = db, database = 'gene2pubmed', category = "G")
#---------------------Disease------------------
library(DOSE)
data(geneList)gene <- names(geneList)[abs(geneList) > 1.5]
head(gene)
x <- enrichDO(gene = gene, ont = "DO", pvalueCutoff = 0.05, pAdjustMethod = "BH", universe = names(geneList), minGSSize = 5, maxGSSize = 500, qvalueCutoff = 0.05, readable = FALSE)
y <- gseDO(geneList, minGSSize = 120, pvalueCutoff = 0.2, pAdjustMethod = "BH", verbose = FALSE)
#專門對癌基因進行富集分析
gene2 <- names(geneList)[abs(geneList) < 3]
ncg <- enrichNCG(gene2) head(ncg)
EnrichR富集分析
上述提到的基因集都是非常經(jīng)典的基因集读跷,而在實際應(yīng)用中我們還需要對基于自己數(shù)據(jù)構(gòu)建的基因集進行富集分析绒窑,clusterProfiler 4的EnrichR函數(shù)就能實現(xiàn)這個目的。
data(geneList, package="DOSE")
head(geneList)
## 4312 8318 10874 55143 55388 991
## 4.572613 4.514594 4.418218 4.144075 3.876258 3.677857
gene <- names(geneList)[abs(geneList) > 2]
head(gene)
## [1] "4312" "8318" "10874" "55143" "55388" "991"
#cell marker
cell_marker_data <- vroom::vroom('http://bio-bigdata.hrbmu.edu.cn/CellMarker/download/Human_cell_markers.txt')
## instead of `cellName`, users can use other features (e.g. `cancerType`)
cells <- cell_marker_data %>% dplyr::select(cellName, geneID) %>% dplyr::mutate(geneID = strsplit(geneID, ', ')) %>% tidyr::unnest()x <- enricher(gene, TERM2GENE = cells)
y <- GSEA(geneList, TERM2GENE = cells)
多組平行樣本間的比較
要說這次更新最大的亮點舔亭,就是接下來要說的多組平行樣本間的富集分析些膨。對于一般只有兩組實驗(實驗組和對照組)蟀俊,我們只需要計算出差異基因直接進行富集分析即可。但是實際實驗中订雾,如藥物處理的樣本肢预,我們往往會設(shè)置多個梯度的平行分組,如何同時對所有組直接進行比較洼哎,clusterProfiler 4給出了完美的解決方式烫映。
data(gcSample)
str(gcSample)
#這里有8組數(shù)據(jù)## List of 8
## $ X1: chr [1:216] "4597" "7111" "5266" "2175" ...## $ X2: chr [1:805] "23450" "5160" "7126" "26118" ...## $ X3: chr [1:392] "894" "7057" "22906" "3339" ...## $ X4: chr [1:838] "5573" "7453" "5245" "23450" ...## $ X5: chr [1:929] "5982" "7318" "6352" "2101" ...## $ X6: chr [1:585] "5337" "9295" "4035" "811" ...## $ X7: chr [1:582] "2621" "2665" "5690" "3608" ...## $ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
ck <- compareCluster(geneCluster = gcSample, fun = enrichKEGG)
ck <- setReadable(ck, OrgDb = org.Hs.eg.db, keyType="ENTREZID")
mydf <- data.frame(Entrez=names(geneList), FC=geneList)
mydf <- mydf[abs(mydf$FC) > 1,]
mydf$group <- "upregulated"
mydf$group[mydf$FC < 0] <- "downregulated"
mydf$othergroup <- "A"
mydf$othergroup[abs(mydf$FC) > 2] <- "B"
formula_res <- compareCluster(Entrez~group+othergroup, data=mydf, fun="enrichKEGG")
head(formula_res)
#對結(jié)果進行可視化
dotplot(ck)
dotplot(formula_res)
如果你覺得這樣不方便觀察,還可以分開作圖
dotplot(formula_res, x="group") + facet_grid(~othergroup)
#我們還可以對所有基因做一個網(wǎng)狀圖
cnetplot(ck)
可視化分析匯總
上面對clusterProfiler 4所有的功能做了個簡要的介紹噩峦,大家還可以通過閱讀原文去深入學(xué)習(xí)锭沟。最后我們對新版的clusterProfiler的可視化功能做一個總結(jié)。
library(DOSE)
data(geneList)
de <- names(geneList)[abs(geneList) > 2]
edo <- enrichDGN(de)
library(enrichplot)
barplot(edo, showCategory=20) mutate(edo, qscore = -log(p.adjust, base=10)) %>% barplot(x="qscore")
## convert gene ID to Symbol
edox <- setReadable(edo, 'org.Hs.eg.db', 'ENTREZID')
p1 <- cnetplot(edox, foldChange=geneList)## categorySize can be scaled by 'pvalue' or 'geneNum'
p2 <- cnetplot(edox, categorySize="pvalue", foldChange=geneList)
p3 <- cnetplot(edox, foldChange=geneList, circular = TRUE, colorEdge = TRUE)
p1|p2|p3
p1 <- cnetplot(edox, node_label="category", cex_label_category = 1.2)
p2 <- cnetplot(edox, node_label="gene", cex_label_gene = 0.8)
p3 <- cnetplot(edox, node_label="all")
p4 <- cnetplot(edox, node_label="none", color_category='firebrick', color_gene='steelblue')
cowplot::plot_grid(p1, p2, p3, p4, ncol=2, labels=LETTERS[1:4])
set.seed(123)
x <- list(A = letters[1:10], B=letters[5:12], C=letters[sample(1:26, 15)])
p1 <- cnetplot(x)set.seed(123)d <- setNames(rnorm(26), letters)
p2 <- cnetplot(x, foldChange=d) + scale_color_gradient2(name='associated data', low='darkgreen', high='firebrick')
cowplot::plot_grid(p1, p2, ncol=2, labels=LETTERS[1:2])
p1 <- heatplot(edox, showCategory=5)
p2 <- heatplot(edox, foldChange=geneList, showCategory=5)
cowplot::plot_grid(p1, p2, ncol=1, labels=LETTERS[1:2])
edox2 <- pairwise_termsim(edox)
p1 <- treeplot(edox2)
p2 <- treeplot(edox2, hclust_method = "average")
aplot::plot_list(p1, p2, tag_levels='A')
edo <- pairwise_termsim(edo)
p1 <- emapplot(edo)
p2 <- emapplot(edo, cex_category=1.5)
p3 <- emapplot(edo, layout="kk")
p4 <- emapplot(edo, cex_category=1.5,layout="kk")
cowplot::plot_grid(p1, p2, p3, p4, ncol=2, labels=LETTERS[1:4])
library(clusterProfiler)
data(gcSample)
xx <- compareCluster(gcSample, fun="enrichKEGG", organism="hsa", pvalueCutoff=0.05)xx <- pairwise_termsim(xx)
p1 <- emapplot(xx)
p2 <- emapplot(xx, legend_n=2)
p3 <- emapplot(xx, pie="count")
p4 <- emapplot(xx, pie="count", cex_category=1.5, layout="kk")
cowplot::plot_grid(p1, p2, p3, p4, ncol=2, labels=LETTERS[1:4])
upsetplot(edo)
upsetplot(kk2)
ridgeplot(edo2)
p1 <- gseaplot(edo2, geneSetID = 1, by = "runningScore", title = edo2$Description[1])
p2 <- gseaplot(edo2, geneSetID = 1, by = "preranked", title = edo2$Description[1])
p3 <- gseaplot(edo2, geneSetID = 1, title = edo2$Description[1])
cowplot::plot_grid(p1, p2, p3, ncol=1, labels=LETTERS[1:3])
gseaplot2(edo2, geneSetID = 1, title = edo2$Description[1])
gseaplot2(edo2, geneSetID = 1:3)
p1 <- gseaplot2(edo2, geneSetID = 1:3, subplots = 1)
p2 <- gseaplot2(edo2, geneSetID = 1:3, subplots = 1:2)
cowplot::plot_grid(p1, p2, ncol=1, labels=LETTERS[1:2])
library(ggplot2)
library(cowplot)
pp <- lapply(1:3, function(i) { anno <- edo2[i, c("NES", "pvalue", "p.adjust")] lab <- paste0(names(anno), "=", round(anno, 3), collapse="\n") gsearank(edo2, i, edo2[i, 2]) + xlab(NULL) +ylab(NULL) + annotate("text", 10000, edo2[i, "enrichmentScore"] * .75, label = lab, hjust=0, vjust=0)})plot_grid(plotlist=pp, ncol=1)
terms <- edo$Description[1:5]
p <- pmcplot(terms, 2010:2020)
p2 <- pmcplot(terms, 2010:2020, proportion=FALSE)
plot_grid(p, p2, ncol=2)
小結(jié)
高通量組學(xué)數(shù)據(jù)功能解讀中识补,功能富集分析是至關(guān)重要的一步族淮,相關(guān)軟件繁多但大多數(shù)僅針對極少量的模式生物開發(fā),無法支持大量非模式生物的分析訴求凭涂。功能分析依賴準確的功能注釋祝辣,但許多軟件在發(fā)表文章之后并未及時更新內(nèi)置的功能注釋。2016年切油,Nature Methods文章指出蝙斜,高達42%的相關(guān)工具內(nèi)置注釋超過五年未更新,用戶基于此類工具的數(shù)據(jù)挖掘澎胡,結(jié)論反應(yīng)的僅是學(xué)界五年前的生物學(xué)知識積累孕荠,頗有時光倒流的感覺。尤為重要的是攻谁,基于舊有注釋岛琼,大約只能捕獲到最新數(shù)據(jù)庫中26%的生物學(xué)過程或通路。
ClusterProfiler 4相對于前一版不僅更新了參考基因集巢株,而且新加入了很多功能槐瑞,如GSEA分析、DO富集分析以及多組間樣本的平行比較等阁苞。新版本尤其實現(xiàn)多組數(shù)據(jù)間自由比較困檩,如不同條件、處理等那槽,并內(nèi)置系列流行輔助工具悼沿,如數(shù)據(jù)處理包dplyr、可視化包ggplot2等骚灸,方便分析人員用熟悉的方式自由探索糟趾,實現(xiàn)數(shù)據(jù)高效解讀。目前,clusterProfiler已被整合進超過30個的同行分析軟件中义郑,致力于解決不同場景下的功能分析蝶柿,相信clusterProfiler4.0未來將發(fā)揮更大的作用,助力研究者更高效地解讀生物醫(yī)學(xué)數(shù)據(jù)及建立更可靠的機制假說非驮。