clusterProfiler是業(yè)界大神Y叔寫的R包允瞧,看過了它的繪圖就會使人沉迷于其顏值無法自拔皱蹦。本文很多內(nèi)容直接翻譯自官方文檔掰担,加上了我的測試數(shù)據(jù)(還是以果蠅為例)续捂。
一垦垂、術(shù)語(Terminology)
(1)基因集(gene set)和通路(pathway):基因集是功能相關(guān)的基因的無序集合宦搬。忽略基因間的功能關(guān)系,可以將通路解釋為一個基因集劫拗。
(2)Gene Ontology(GO):GO定義了描述基因功能的概念/類(concepts/classes)间校。它將功能分為三個方面:
- MF: Molecular Function - molecular activities of gene products
- CC: Cellular Component - where gene products are active
- BP: Biological Process - pathways and larger processes made up of the activities of multiple gene products
GO terms組織在一個有向無環(huán)圖(directed acyclic graph)中,terms之間的邊表示父子關(guān)系(parent-child relationship)页慷。
(3)Kyoto Encyclopedia of Genes and Genomes(KEGG):KEGG是手工繪制的代表分子相互作用和反應(yīng)網(wǎng)絡(luò)的路徑圖(pathway maps)的集合憔足。這些途徑涵蓋了廣泛的生化過程,可分為7大類:新陳代謝酒繁、遺傳和環(huán)境信息處理滓彰、細(xì)胞過程、機(jī)體系統(tǒng)州袒、人類疾病和藥物開發(fā)(metabolism, genetic and environmental information processing, cellular processes, organismal systems, human diseases, and drug development)揭绑。
二、功能富集分析方法(Functional Enrichment Analysis Methods)
(1)Over Representation Analysis(ORA)
過表征分析(ORA)(Boyle et al. 2004)是一種廣泛使用的方法郎哭,用于確定已知的生物學(xué)功能或過程是否在實(shí)驗(yàn)得到的基因列表中被過度表達(dá)(=富集)他匪,例如差異表達(dá)基因列表(DEGs)。
p值可以通過超幾何分布來計(jì)算彰居。N是背景基因總數(shù)诚纸,M是被注釋到感興趣基因集(GO term)的基因數(shù)量撰筷,n是感興趣的基因列表的大小陈惰,k是該列表中被標(biāo)注到基因集的基因數(shù)。默認(rèn)情況下毕籽,背景基因是所有具有注釋的基因抬闯。為進(jìn)行多重比較,應(yīng)調(diào)整p值关筒。
舉個栗子——one-sided version of Fisher’s exact test
(2)Gene Set Enrichment Analysis——Functional Class Scoring(FCS)
通常尋找感興趣的差異表達(dá)基因并進(jìn)行富集分析溶握,這種方法能夠檢測差異很大的基因,但無法檢測差異很小的基因蒸播,而這些差異比較小的基因也有可能參與了共同調(diào)節(jié)睡榆。基因集富集分析(GSEA)(Subramanian et al. 2005)直接解決了這一局限性袍榆。所有基因均可用于GSEA胀屿;GSEA匯總了一個基因集中每個基因的統(tǒng)計(jì)數(shù)據(jù),因此包雀,它可以檢測預(yù)先定義的基因集中所有基因以一種較小但協(xié)調(diào)(small but coordinated)的方式發(fā)生變化的情況宿崭。因?yàn)楹芏嘞嚓P(guān)的表型差異很可能是通過一組基因中微小但一致的變化(small but consistent changes)來表現(xiàn)的。
基因根據(jù)表型進(jìn)行排序才写。給定先驗(yàn)定義的基因集S(例如擁有相同DO類別的基因)葡兑,GSEA的目標(biāo)是確定S的成員是隨機(jī)分布在排序的基因列表(L)中奖蔓,還是主要分布在頂部或底部。
GSEA方法有三個關(guān)鍵元素:
-
Calculation of an Enrichment Score
enrichment score(ES)代表集合S在排序列表L的頂部或底部被過表達(dá)的程度讹堤。這個分?jǐn)?shù)是通過遍歷列表L來計(jì)算的吆鹤,當(dāng)我們遇到一個在S中的基因時增加一個running-sum statistic,當(dāng)遇到的基因不在S中時減少蜕劝。增量的大小取決于基因統(tǒng)計(jì)(例如基因與表型的相關(guān)性)檀头。ES為random walk中遇到的與零的最大偏差(maximum deviation from zero);它對應(yīng)一個加權(quán)的類Kolmogorov-Smirnovlike統(tǒng)計(jì)量(Subramanian et al. 2005)岖沛。 -
Esimation of Significance Level of ES
利用置換檢驗(yàn)(permutation test)計(jì)算ES的p值暑始。具體地說,我們對基因列表L的gene labels進(jìn)行重新排列(permute)婴削,并為排列后的數(shù)據(jù)重新計(jì)算基因集的ES廊镜,從而為ES生成一個null distribution。然后相對于這個零分布計(jì)算觀察到的ES的p值唉俗。 -
Adjustment for Multiple Hypothesis Testing
當(dāng)整個基因集被評估時嗤朴,為多重檢驗(yàn)調(diào)整顯著性水平,q值是通過FDR調(diào)整計(jì)算的虫溜。
(3)Leading edge analysis and core enriched genes
Leading edge分析報告了:Tags說明對富集分?jǐn)?shù)有貢獻(xiàn)的基因的百分比雹姊,List指出在列表中獲得富集分?jǐn)?shù)的位置,Signal是富集信號的強(qiáng)度衡楞。獲得有助于富集的核心富集基因也將是非常有趣的吱雏。
三、通用的富集分析(Universal enrichment analysis)
clusterProfiler支持對許多ontology/pathway的hypergeometric test和gene set enrichment analyses瘾境,但是還有很多用戶想分析他們自己的數(shù)據(jù)歧杏,包括不支持的物種、不支持的ontologies/pathways或自定義注釋等迷守。clusterProfiler提供了用于hypergeometric test的enricher函數(shù)和用于基因集富集分析的GSEA函數(shù)犬绒,用于接受用戶定義的注釋。它們接受另外兩個參數(shù)TERM2GENE和TERM2NAME兑凿。從參數(shù)名可以看出凯力,TERM2GENE是一個data.frame,第一列為term ID礼华,第二列為對應(yīng)映射基因咐鹤;TERM2NAME是一個data.frame,第一列為term ID卓嫂,第二列為對應(yīng)term name慷暂。TERM2NAME是可選的。
(1)Input data
對于ORA,我們所需要的是一個gene vector行瑞,這是一個基因ID的向量奸腺。這些基因ID可以通過差異表達(dá)分析(如DESeq2 package)獲得。
對于GSEA血久,我們需要一個基因排序列表(a ranked list of genes)突照。geneList有3個特點(diǎn):
- numeric vector: fold change or other type of numerical variable
- named vector: every number was named by the corresponding gene ID
- sorted vector: number should be sorted in decreasing order
可以這樣獲得geneList:
d <- read.csv(your_csv_file)
## assume that 1st column is ID (no duplicated allowed)
## 2nd column is fold change
## feature 1: numeric vector
geneList <- d[,2]
## feature 2: named vector
names(geneList) <- as.character(d[,1])
## feature 3: decreasing order
geneList <- sort(geneList, decreasing = TRUE)
我的測試數(shù)據(jù):
> head(mydata,3)
gene_name female male logFC
1 CG32548 0.02310383 72.43205 11.61428
2 CG15892 0.02624160 57.22716 11.09063
3 CR43803 0.02474626 34.09726 10.42823
> ## 基因列表
> #ORA的gene vector
> DEGdata <- subset(mydata,mydata$logFC>2)
> DEgenelist <- as.character(DEGdata$gene_name)
> head(DEgenelist)
[1] "CG32548" "CG15892" "CR43803" "CG15136" "CG4983" "CG13989"
> #GSEA的基因排序列表
> FCgenelist <- mydata$logFC #numeric vector
> names(FCgenelist) <- as.character(mydata$gene_name) #named vector
> FCgenelist <- sort(FCgenelist,decreasing=T) #decreasing order
> head(FCgenelist)
CG32548 CG15892 CR43803 CG15136 CG4983 CG13989
11.61428 11.09063 10.42823 10.34305 10.29130 10.00569
(2)MSigDb analysis
Molecular Signatures Database包含了8種預(yù)定義的基因集合。
可以下載GMT文件氧吐,然后使用read.gmt來解析這些文件讹蘑,并用于enricher()和GSEA()。R包msigdbr已經(jīng)將MSigDB基因集打包成整齊的數(shù)據(jù)格式筑舅,可以直接在clusterProfiler中使用座慰。
> ## Molecular Signatures Database
> #library(msigdbr)
> msigdbr_show_species() #支持的物種
[1] "Bos taurus" "Caenorhabditis elegans"
[3] "Canis lupus familiaris" "Danio rerio"
[5] "Drosophila melanogaster" "Gallus gallus"
[7] "Homo sapiens" "Mus musculus"
[9] "Rattus norvegicus" "Saccharomyces cerevisiae"
[11] "Sus scrofa"
> Dm_msigdbr <- msigdbr(species="Drosophila melanogaster")
> head(Dm_msigdbr, 2) %>% as.data.frame
gs_id gs_name gs_cat gs_subcat human_gene_symbol
1 M12609 AAACCAC_MIR140 C3 MIR:MIR_Legacy ABCC4
2 M12609 AAACCAC_MIR140 C3 MIR:MIR_Legacy ABCC4
species_name entrez_gene gene_symbol
1 Drosophila melanogaster 47905 l(2)03659
2 Drosophila melanogaster 35366 CG9270
sources
1 HomoloGene,PhylomeDB,Ensembl,Panther,OrthoDB
2 HomoloGene,Inparanoid,Ensembl,Panther,OrthoDB
> DmGO <- msigdbr(species="Drosophila melanogaster",category="C5") %>%
+ dplyr::select(gs_name, entrez_gene, gene_symbol)
> head(DmGO)
# A tibble: 6 x 3
gs_name entrez_gene gene_symbol
<chr> <int> <chr>
1 GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACT~ 39097 CG3552
2 GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACT~ 36955 Mtap
3 GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACT~ 33386 GlyP
4 GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACT~ 33386 GlyP
5 GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACT~ 33386 GlyP
6 GO_1_ACYLGLYCEROPHOSPHOCHOLINE_O_ACYLTRANS~ 31899 LPCAT
> ## 通用的富集分析
> #TERM2GENE=gmt,TERM2NAME=NA 都是兩列的數(shù)據(jù)框
> #GENE是基因名,TERM表示GO term編號,NAME表示Description
> em <- enricher(DEgenelist,TERM2GENE=DmGO[,c(1,3)])
> head(em,1)
ID Description GeneRatio
GO_CILIUM_MOVEMENT GO_CILIUM_MOVEMENT GO_CILIUM_MOVEMENT 8/57
BgRatio pvalue p.adjust qvalue
GO_CILIUM_MOVEMENT 46/6055 7.342408e-09 5.903188e-06 5.76089e-06
geneID
GO_CILIUM_MOVEMENT CG17564/CG10750/gudu/CG15144/CG15143/Dhc36C/dtr/CG17450
Count
GO_CILIUM_MOVEMENT 8
> em1 <- GSEA(FCgenelist,TERM2GENE=DmGO[,c(1,3)])
> head(em1,1)
ID Description setSize enrichmentScore NES
GO_CILIUM GO_CILIUM GO_CILIUM 16 0.9049122 1.54287
pvalue p.adjust qvalues rank
GO_CILIUM 0.001055966 0.001126126 NA 498
leading_edge
GO_CILIUM tags=100%, list=10%, signal=90%
core_enrichment
GO_CILIUM CG17564/CG10750/CG9222/TrxT/gudu/CG15144/CG15143/Pkd2/Mks1/Dhc36C/CG6614/dtr/CG17450/B9d2/Erk7/Dhc16F
四、Gene Ontology Analysis
(1)支持的物種
GO分析groupGO()翠拣、richer GO()和gseGO()支持擁有可用OrgDb的物種版仔。Bioconductor已經(jīng)為大約20個物種提供了OrgDb,例如:人類org.Hs.eg.db误墓,果蠅org.Dm.eg.db蛮粮,擬南芥org.At.tair.db,小鼠org.Mm.eg.db谜慌。
如果用戶擁有GO注釋數(shù)據(jù)(data.frame格式然想,第一列為gene ID,第二列為GO ID)欣范,則可以使用enricher()和gseGO()函數(shù)執(zhí)行ORA和GSEA变泄。如果基因被直接注釋(direction annotation),它也應(yīng)該被其祖先GO節(jié)點(diǎn)(ancestor GO nodes)間接注釋(indirect annation)熙卡。如果用戶只有直接注釋杖刷,則可以將其注釋傳遞給buildGOmap函數(shù)励饵,該函數(shù)將推斷間接注釋并生成適合enricher()和gseGO()的data.frame驳癌。
> columns(org.Dm.eg.db) #keytypes(org.Dm.eg.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
[5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
[9] "EVIDENCEALL" "FLYBASE" "FLYBASECG" "FLYBASEPROT"
[13] "GENENAME" "GO" "GOALL" "MAP"
[17] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PMID"
[21] "REFSEQ" "SYMBOL" "UNIGENE" "UNIPROT"
> Dm <- org.Dm.eg.db
> head(keys(org.Dm.eg.db)) #默認(rèn)是ENTREZID
[1] "30970" "30971" "30972" "30973" "30975" "30976"
> length(keys(Dm)) #25009
[1] 25009
> length(keys(Dm,"SYMBOL")) #25009
[1] 25009
> select(org.Dm.eg.db,keys="mle",keytype="SYMBOL", columns=c("ENTREZID"))
'select()' returned 1:1 mapping between keys and columns
SYMBOL ENTREZID
1 mle 35523
(2)GO classification
在clusterProfiler中,groupGO是根據(jù)GO在特定水平上的分布來進(jìn)行基因分類的役听。
groupGO(gene, OrgDb, keyType = "ENTREZID", ont = "CC", level = 2, readable = FALSE)
> ggo <- groupGO(gene=DEgenelist, OrgDb=org.Dm.eg.db,
+ ont="BP", level=2, readable=F)
> head(ggo,3)
ID Description Count GeneRatio geneID
GO:0000003 GO:0000003 reproduction 0 0/929
GO:0008152 GO:0008152 metabolic process 0 0/929
GO:0001906 GO:0001906 cell killing 0 0/929
(3)GO over-representation test
一個基因集的GO富集分析颓鲜。給定一個基因向量,enrichGO函數(shù)將返回FDR控制后的GO富集類別典予。使用參數(shù)readable=TRUE或setReadable函數(shù)將基因ID映射到基因Symbol甜滨。例如ego2 <- setReadable(ego2, OrgDb = org.Hs.eg.db)
。
enrichGO(gene, OrgDb, keyType = "ENTREZID", ont = "MF", pvalueCutoff = 0.05, pAdjustMethod = "BH", universe, qvalueCutoff = 0.2, minGSSize = 10, maxGSSize = 500, readable = FALSE, pool = FALSE)
- universe:background genes. If missing, the all genes listed in the database (eg TERM2GENE table) will be used as background.
- qvalueCutoff:qvalue cutoff on enrichment tests to report as significant. Tests must pass i) pvalueCutoff on unadjusted pvalues, ii) pvalueCutoff on adjusted pvalues and iii) qvalueCutoff on qvalues to be reported.
- readable :whether mapping gene ID to gene Name
- pool :If ont='ALL', whether pool 3 GO sub-ontologies
> ## enrichGO
> egoMF <- enrichGO(DEgenelist, OrgDb=org.Dm.eg.db, ont='MF',
+ pAdjustMethod='BH', pvalueCutoff=0.05,
+ qvalueCutoff=0.2, keyType='SYMBOL')
> head(egoMF,1);dim(egoMF)
ID Description GeneRatio BgRatio
GO:0045503 GO:0045503 dynein light chain binding 7/487 25/12127
pvalue p.adjust qvalue
GO:0045503 4.128344e-05 0.01080618 0.01041095
geneID Count
GO:0045503 CG1571/Dhc36C/Dhc16F/Sdic4/btv/Sdic2/CG10859 7
[1] 11 9
> egoall <- enrichGO(DEgenelist, OrgDb=org.Dm.eg.db, ont='ALL',
+ pAdjustMethod='BH', pvalueCutoff=0.05,
+ qvalueCutoff=0.2, keyType='SYMBOL')
> head(egoall,1);dim(egoall)
ONTOLOGY ID Description GeneRatio BgRatio
GO:0005929 CC GO:0005929 cilium 20/492 144/11938
pvalue p.adjust qvalue
GO:0005929 1.780602e-06 0.0003404754 0.0003124169
geneID
GO:0005929 CG1571/CG12395/Mks1/Dhc36C/dtr/B9d2/Dhc16F/Gas8/TbCMF46/CG14020/IFT57/Sdic4/tilB/unc/CG10958/btv/CG13999/CG31803/Sdic2/CG10859
Count
GO:0005929 20
[1] 21 10
> sum(egoall$ONTOLOGY=="BP") #Biological process基因產(chǎn)物參與的生物路徑或機(jī)制
[1] 0
> sum(egoall$ONTOLOGY=="CC") #Cellular component基因產(chǎn)物在細(xì)胞內(nèi)外的位置
[1] 10
> sum(egoall$ONTOLOGY=="MF") #Molecular function基因產(chǎn)物分子層次的功能
[1] 11
(4)reduce redundancy of enriched GO terms
GO以parent-child結(jié)構(gòu)組織瘤袖,因此父術(shù)語與所有子術(shù)語有很大比例的重疊衣摩。這可能導(dǎo)致冗余的結(jié)果摧扇。為了解決這一問題,clusterProfiler實(shí)現(xiàn)了簡化方法simplify惩坑,以減少enrichGO 和gseGO產(chǎn)生的冗余的GO術(shù)語筐高。函數(shù)內(nèi)部稱為GOSemSim (Yu et al. 2010),用于計(jì)算GO項(xiàng)之間的語義相似度泡嘴,并通過保留一個代表性項(xiàng)來去除高度相似的項(xiàng)甫恩。
> egosimp <- simplify(egoMF,cutoff=0.7,by="p.adjust",
+ select_fun=min,measure="Wang")
> head(egosimp);dim(egosimp)
> #方法1:基于它們的共有父條目的注釋統(tǒng)計(jì),計(jì)算語義相似性得分
> #包含Resnik、Lin酌予、Jiang 和Schlicker四種方法
> #方法2:基于GO圖形結(jié)構(gòu),Wang
> #進(jìn)行GO terms集的相似性分析時一般采取基于Resnik和Lin兩種方法的綜合方法,簡稱為simRel方法
(5)GO Gene Set Enrichment Analysis
gseGO( geneList, ont = "BP", OrgDb, keyType = "ENTREZID", exponent = 1, nPerm = 1000, minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05, pAdjustMethod = "BH", verbose = TRUE, seed = FALSE, by = "fgsea")
- exponent:weight of each step
- nPerm:permutation numbers
- verbose:print message or not
- seed:logical
- by:one of 'fgsea' or 'DOSE'
> egseGO <- gseGO(FCgenelist, OrgDb=org.Dm.eg.db,
+ ont='MF',keyType="SYMBOL",
+ nPerm=1000, minGSSize=100, maxGSSize=500,
+ pvalueCutoff=0.05, verbose=FALSE, by="fgsea")
> head(egseGO,1);dim(egseGO)
ID Description setSize enrichmentScore
GO:0003674 GO:0003674 molecular_function 323 0.9462411
NES pvalue p.adjust qvalues rank
GO:0003674 1.716837 0.000999001 0.000999001 NA 579
leading_edge
GO:0003674 tags=100%, list=11%, signal=95%
core_enrichment
GO:0003674 CG15892/CG15136/CG4983/CG43851/…
[1] 53 11
> head(data.frame(egseGO$ID,egseGO$Description))
egseGO.ID egseGO.Description
1 GO:0003674 molecular_function
2 GO:0005488 binding
五磺箕、KEGG analysis
注釋包KEGG.db從2012年就沒有更新了。它現(xiàn)在相當(dāng)老了抛虫,在clusterProfiler中松靡,enrichKEGG(KEGG通路)和enrichMKEGG(KEGG模塊)支持下載最新的在線版本的KEGG數(shù)據(jù)進(jìn)行富集分析。通過將use_internal_data參數(shù)設(shè)置為TRUE建椰,也支持使用KEGG.db击困,但是不建議這樣做。有了這個新特性广凸,物種不再局限于以前版本中支持的物種阅茶,可以是KEGG數(shù)據(jù)庫中有KEGG注釋數(shù)據(jù)的任何物種。使用organism參數(shù)提供物種的學(xué)名縮寫谅海。clusterProfiler提供search_kegg_organism()函數(shù)脸哀,幫助搜索支持的物種。
(1)ID轉(zhuǎn)換
> gene.df <- bitr(DEgenelist,fromType="SYMBOL",toType=c("ENTREZID","ENSEMBL"),
+ OrgDb = org.Dm.eg.db)
'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(DEgenelist, fromType = "SYMBOL", toType = c("ENTREZID", :
17.18% of input gene IDs are fail to map...
> head(gene.df)
SYMBOL ENTREZID ENSEMBL
1 CG32548 32826 FBgn0052548
2 CG15892 3772603 FBgn0029859
4 CG15136 35033 FBgn0032625
> gene.kegg <- bitr_kegg(gene.df$ENTREZID,fromType="ncbi-geneid",
+ toType="kegg",organism='dme')
Warning message:
In bitr_kegg(gene.df$ENTREZID, fromType = "ncbi-geneid", toType = "kegg", :
98.13% of input gene IDs are fail to map...
> head(gene.kegg)
ncbi-geneid kegg
1 35106 Dmel_CG10211
2 35132 Dmel_CG10348
3 35183 Dmel_CG10700
(2)KEGG over-representation test
enrichKEGG(gene, organism = "dme", keyType = "kegg", pvalueCutoff = 0.05,
pAdjustMethod = "BH", universe, minGSSize = 10, maxGSSize = 500,
qvalueCutoff = 0.2, use_internal_data = FALSE)
- keyType:one of "kegg",’ncbi-geneid’,’ncib-proteinid’ and ’uniprot’
- use_internal_data=FALSE:logical, use KEGG.db or latest online KEGG data
> #使用在線數(shù)據(jù)
> ekegg <- enrichKEGG(gene.df$ENTREZID, organism='dme',keyType="ncbi-geneid",
+ pvalueCutoff=0.05,pAdjustMethod='BH',qvalueCutoff=0.2,
+ minGSSize=10,maxGSSize=500,use_internal_data=F)
> ekeggx <- setReadable(ekegg,'org.Dm.eg.db','ENTREZID')
enrichKEGG使用在線數(shù)據(jù)速度實(shí)在是太慢了扭吁,所以可以先使用createKEGGdb生成本地KEGG.db包撞蜂。果蠅的kegg ID看上去是FLYBASECG前面加上"Dmel_"。
remotes::install_github("YuLab-SMU/createKEGGdb")
library(createKEGGdb)
create_kegg_db("dme")
install.packages("KEGG.db_1.0.tar.gz",repos=NULL,type="source")
library(KEGG.db)
> ekegg <- enrichKEGG(gene.kegg$kegg, organism='dme',keyType="kegg",
+ pvalueCutoff=0.5,pAdjustMethod='BH',qvalueCutoff=0.5,
+ minGSSize=10,maxGSSize=500,use_internal_data=T)
> head(ekegg,1);dim(ekegg)
ID Description GeneRatio BgRatio pvalue p.adjust qvalue
dme04145 dme04145 Phagosome 6/43 83/3236 0.0006770752 0.02234348 0.01781777
geneID Count
dme04145 Dmel_CG12403/Dmel_CG15148/Dmel_CG15719/Dmel_CG1924/Dmel_CG33497/Dmel_CG33499 6
[1] 12 9
> ekegg@gene <- gsub("Dmel_","",ekegg@gene)
> ekegg@result$geneID <- gsub("Dmel_","",ekegg@result$geneID)
> ekeggx <- setReadable(ekegg,'org.Dm.eg.db','FLYBASECG')
> head(ekeggx,1);dim(ekegg)
(3)KEGG Gene Set Enrichment Analysis
gene.tx <- bitr(names(FCgenelist),fromType="SYMBOL",toType=c("ENTREZID"),
OrgDb = org.Dm.eg.db)
colnames(gene.tx)[1] <- "gene_name"
gene.tx <- merge(gene.tx,mydata,by="gene_name")
FCgenelist <- mydata$logFC #numeric vector
names(FCgenelist) <- as.character(gene.tx$ENTREZID) #named vector
FCgenelist <- sort(FCgenelist,decreasing=T) #decreasing order
egseKEGG <- gseKEGG(FClist2,organism='dme',keyType="ncbi-geneid",
nPerm=1000, minGSSize=10, maxGSSize=500,
pvalueCutoff=0.05, pAdjustMethod = "BH")
head(egseKEGG,1);dim(egseKEGG)
(4)KEGG Module over-representation test
例如:mkk <- enrichMKEGG(gene = gene, organism = 'hsa')
(5)KEGG Module Gene Set Enrichment Analysis
例如:mkk2 <- gseMKEGG(geneList = geneList, organism = 'hsa')
六侥袜、Visualization of Functional Enrichment Result
(1)Bar plot 條形圖
#library(enrichplot)
#橫軸為基因個數(shù)蝌诡,縱軸為富集到的GO Terms的描述信息
#顏色對應(yīng)p.adjust值,紅色p值小枫吧,藍(lán)色p值大
#showCategory指定展示的GO Terms的個數(shù)浦旱,默認(rèn)為10,即p.adjust最小的10個
barplot(egoMF,showCategory=10)
(2)Dot plot 氣泡圖
#dotplot(object,x="GeneRatio",color="p.adjust",showCategory=10,
# size=NULL,split=NULL,font.size=12,title="",...)
#橫軸為GeneRatio九杂,代表該GO term下富集到的基因個數(shù)占列表基因總數(shù)的比例
#縱軸為富集到的GO Terms的描述信息颁湖,showCategory指定展示的GO Terms的個數(shù)
dotplot(egoall,showCategory=10)
dotplot(egoall,title='Top5 GO terms of each sub-class',
showCategory=5,split='ONTOLOGY')+
facet_grid(ONTOLOGY~.,scale="free")
(3)GO terms關(guān)系網(wǎng)絡(luò)圖 Enrichment Map
#對于富集到的GO terms之間的基因重疊關(guān)系進(jìn)行展示
#每個節(jié)點(diǎn)是一個富集到的GO term,默認(rèn)畫top30個富集到的GO terms
#節(jié)點(diǎn)大小對應(yīng)該GO terms下富集到的基因個數(shù)例隆,節(jié)點(diǎn)的顏色對應(yīng)p.adjust的值甥捺,紅色小藍(lán)色大
#如果兩個GO terms的差異基因存在重疊,說明這兩個節(jié)點(diǎn)存在overlap關(guān)系镀层,用線條連接起來
emapplot(egoMF,showCategory=10)
#文檔例子
data(gcSample)
xx <- compareCluster(gcSample, fun="enrichKEGG",
organism="hsa", pvalueCutoff=0.05)
p1 <- emapplot(xx)
p2 <- emapplot(xx,legend_n=2)
p3 <- emapplot(xx,pie="count")
p4 <- emapplot(xx,pie="count", pie_scale=1.5, layout="kk")
cowplot::plot_grid(p1, p2, p3, p4, ncol=2, labels=LETTERS[1:4])
(4)GO term與差異基因關(guān)系網(wǎng)絡(luò)圖 Gene-Concept Network
#對于基因和富集的GO terms之間的對應(yīng)關(guān)系進(jìn)行展示
#圖中灰色的點(diǎn)代表基因镰禾,黃色的點(diǎn)代表富集到的GO terms
#如果一個基因位于一個GO Terms下,則將該基因與GO連線
#黃色節(jié)點(diǎn)的大小對應(yīng)富集到的基因個數(shù),默認(rèn)畫top5富集到的GO terms
cnetplot(egoMF,showCategory=5)
#圓形布局吴侦,給線條上色
cnetplot(egoall,showCategory=10,foldChange=FClist,circular=TRUE,colorEdge=TRUE)
p1 <- cnetplot(egoMF,showCategory=3,node_label="category")
p2 <- cnetplot(egoMF,showCategory=3,node_label="gene")
p3 <- cnetplot(egoMF,showCategory=3,node_label="all")
p4 <- cnetplot(egoMF,showCategory=3,node_label="none")
cowplot::plot_grid(p1, p2, p3, p4, ncol=2, labels=LETTERS[1:4])
(5)UpSet Plot
upsetplot(egoMF) #著重于不同基因集間基因的重疊情況
upsetplot(egseGO) #對于GSEA結(jié)果將繪制不同類別的fold change分布
(6)Heatmap-like functional classification
heatplot(egoMF)
heatplot(egoall,foldChange=FCgenelist)
(7)有向無環(huán)圖 GO DAG graph
#investigate how the significant GO terms are distributed over the GO graph.
#The goplot function shows subgraph induced by most significant GO terms.
goplot(egoMF,showCategory=5)
(8)山脊線圖 ridgeline plot for expression distribution of GSEA result
ridgeplot(egseGO)
(9)running score and preranked list of GSEA result
p1 <- gseaplot(egseGO,geneSetID=1,by="runningScore",title=egseGO$Description[1])
p2 <- gseaplot(egseGO,geneSetID=1,by="preranked",title=egseGO$Description[1])
p3 <- gseaplot(egseGO,geneSetID=4,title=egseGO$Description[4])
cowplot::plot_grid(p1, p2, p3, ncol=2, labels=LETTERS[1:3])
p4 <- gseaplot2(egseGO,3,title=egseGO$Description[3])
p5 <- gseaplot2(egseGO,2:4,subplots = 1)
p6 <- gseaplot2(egseGO,geneSetID=2:4, pvalue_table=TRUE,
color = c("#E495A5", "#86B875", "#7DB0DD"),
ES_geom = "dot")
cowplot::plot_grid(p4, p5, p6, ncol=1, labels=LETTERS[1:3])