Following tutorial from 2019 Nature Protocol (https://www.nature.com/articles/s41596-018-0103-9)
Prerequisite
Project directory: Pathway_Enrichment_Turorial/
下載后解壓于Pathway_Enrichment_Tutorial/data/
,壓縮包內(nèi)包括以下文件:
- Supplementary_Methods_Supplementary_protocols.pdf
- Supplementary_Table1_Cancer_drivers.txt
- Supplementary_Table2_MesenvsImmuno_RNASeq_ranks.rnk
- Supplementary_Table3_Human_GOBP_AllPathways_no_GO_iea_July_01_2017_symbol.gmt
- Supplementary_Table4_gprofiler_results.txt
- Supplementary_Table5_hsapiens.pathways.NAME.gmt
- Supplementary_Table6_TCGA_OV_RNAseq_expression.txt
- Supplementary_Table7_TCGA_OV_RNAseq_classes.cls
- Supplementary_Table8_gsea_report_for_na_pos.xls
- Supplementary_Table9_gsea_report_for_na_neg.xls
- Supplementary_Table10_TCGA_RNASeq_rawcounts.txt
- Supplementary_Table11_RNASeq_classdefinitions.txt
- Supplementary_Table12_TCGA_Microarray_rmanormalized.txt
- Supplementary_Table13_Microarray_classdefinitions.txt
Software:
- g:Profiler (R client)
- Java Standard Edition (http://www.oracle.com/technetwork/java/javase/downloads/index.html) is required to run GSEA and Cytoscape.
- GSEA (optional)
- The Cytoscape desktop application (http://www.cytoscape.org/download.php)
- Cytoscape plugins:"EnrichmentMap Pipeline Collection" (http://apps.cytoscape.org/apps/enrichmentmappipelinecollection ); 可以在Cytoscape app store中搜到
R | Pathway enrichment analysis
using g:Profiler
這里使用g:Profiler的R包進(jìn)行富集分析癣猾,也可以使用它的網(wǎng)頁(yè)版(https://biit.cs.ut.ee/gprofiler/gost進(jìn)行分析
- 首先料滥,使用Supplementary_Table1_Cancer_drivers.txt中的基因列表進(jìn)行常規(guī)的通路富集类垫。
In R
if (!requireNamespace("gprofiler2")) install.packages("gprofiler2")
library(gprofiler2)
# read in query gene list
queryG <- read.delim('data/supplementary_files/Supplementary_Table1_Cancer_drivers.txt', header=F)
gores1 <- gost(queryG$V1,
organism='hsapiens',
exclude_iea=TRUE,
ordered_query=TRUE,
evcodes=TRUE,
sources=c('GO:BP', 'REAC'))
這里選擇物種為"hsapiens" - Homo sapiens.
exclude_iea
: 排除沒(méi)有經(jīng)過(guò)人工審核的不靠譜GO注釋 (less reliable GO annotations, IEAs)
ordered_query
: 把輸入的基因認(rèn)為是排序過(guò)的基因列表榄攀,以一種GSEA的方式進(jìn)行通路富集
evcodes
: 返回輸入基因與相應(yīng)通路產(chǎn)生交集的基因ID
sources
: 富集的數(shù)據(jù)庫(kù)谍倦,這里選擇GO的BP項(xiàng)和Reactome的通路娜扇。
Available data sources and their abbreviations are:
- Gene Ontology (GO or by branch GO:MF, GO:BP, GO:CC)
- KEGG (KEGG)
- Reactome (REAC): an open-source, open access, manually curated and peer-reviewed pathway database.
- WikiPathways (WP): an open, collaborative platform dedicated to the curation of biological pathways.
- TRANSFAC (TF): is the database of eukaryotic transcription factors, their genomic binding sites and DNA-binding profiles.
- miRTarBase (MIRNA): miRTarBase has accumulated more than three hundred and sixty thousand miRNA-target interactions (MTIs), which are collected by manually surveying pertinent literature after NLP of the text systematically to filter research articles related to functional studies of miRNAs.
- Human Protein Atlas (HPA): including all the human proteins in cells, tissues, and organs using an integration of various omics technologies.
- CORUM (CORUM): provides a resource of manually annotated protein complexes from mammalian organisms.
- Human phenotype ontology (HP): provides a standardized vocabulary of phenotypic abnormalities encountered in human disease.
同時(shí),我們對(duì)富集到的通路進(jìn)行篩選季俩,排除包含過(guò)多基因的通路(大于350個(gè))慈缔,過(guò)少基因的通路(小于5個(gè)),以及與輸入基因交集過(guò)小的通路(小于3個(gè))
# filtered by term_size and intersection_size
# term_size - number of genes that are annotated to the term
# intersection_size - the number of genes in the input query that are annotated to the term
gores1_filtered <- list("result" = subset(gores1$result,
term_size > 5 &
term_size < 350 &
intersection_size > 3),
"meta" = gores1$meta)
理論上种玛,我們應(yīng)當(dāng)在進(jìn)行通路富集之前進(jìn)行排除,這樣才可以通過(guò)排除這類通路來(lái)通過(guò)統(tǒng)計(jì)檢驗(yàn)的效力(power)瓤檐,但是gprofiler2::gost()
并沒(méi)有提供這樣的參數(shù)以供我們?cè)O(shè)定相應(yīng)的cutoff
富集的結(jié)果保存在$table
中
# pathway enrichment results table
head(gores1_filtered$result, n=3)
## query significant p_value term_size query_size intersection_size
## 95 query_1 TRUE 3.82e-13 142 81 15
## 124 query_1 TRUE 6.31e-12 197 120 18
## 125 query_1 TRUE 6.62e-12 230 120 19
## precision recall term_id source term_name
## 95 0.185 0.1056 GO:0048732 GO:BP gland development
## 124 0.150 0.0914 GO:0048762 GO:BP mesenchymal cell differentiation
## 125 0.158 0.0826 GO:0060485 GO:BP mesenchyme development
## effective_domain_size source_order parents
## 95 16592 15077 GO:0048513
## 124 16592 15103 GO:0030154, GO:0060485
## 125 16592 17301 GO:0009888, GO:0048513
## evidence_codes
## 95 ISS,IDA,ISS,ISS,IMP,ISS,TAS,IEP ISS,IGI ISS,IDA,IMP ISS,TAS,ISS,ISS,IDA
## 124 IMP,IMP,IMP ISS,IMP IGI TAS,IMP,ISS,ISS,TAS,ISS,IMP,IMP,ISS,ISS,IMP,IDA ISS,ISS,IMP,IDA
## 125 IMP,IMP,IMP ISS,IMP IGI TAS,IMP,ISS,ISS,ISS,TAS,ISS,IMP,IMP,ISS,ISS,IMP,IDA ISS,IMP ISS,IMP,IDA
## intersection
## 95 NF1,GATA3,FBXW7,ARHGAP35,TBX3,CEBPA,AKT1,SOX9,WT1,HGF,ERBB4,ARID5B,ELF3,FGFR2,BRCA2
## 124 PTEN,GATA3,NOTCH1,CTNNB1,TBX3,EPHA3,SOX9,HGF,ERBB4,MTOR,FOXA1,FGFR2,SMAD4,FOXA2,TGFBR2,SMAD2,AXIN2,EZH2
## 125 PTEN,GATA3,NOTCH1,CTNNB1,TBX3,EPHA3,SOX9,WT1,HGF,ERBB4,MTOR,FOXA1,FGFR2,SMAD4,FOXA2,TGFBR2,SMAD2,AXIN2,EZH2
The result
data.frame
contains the following columns:
- term_id - unique term identifier (e.g GO:0005005)
- p_values - hypergeometric p-values after correction for multiple testing in the order of input queries
- significant - indicators in the order of input queries for statistically significant results
- term_size - number of genes that are annotated to the term
- query_sizes - number of genes that were included in the query in the order of input queries
- intersection_sizes - the number of genes in the input query that are annotated to the corresponding term in the order of input queries
- source - the abbreviation of the data source for the term (e.g. GO:BP)
- term_name - the short name of the function
- effective_domain_size - the total number of genes "in the universe" used for the hypergeometric test
- source_order - numeric order for the term within its data source (this is important for drawing the results)
- source_order - numeric order for the term within its data source (this is important for drawing the results)
- parents - list of term IDs that are hierarchically directly above the term. For non-hierarchical data sources this points to an artificial root node.
gProfiler2
提供Manhattan plot進(jìn)行通路富集結(jié)果的可視化
gostplot(gores1_filtered, capped = TRUE, interactive = TRUE)
Preparing data for EnrichmentMap
Creating a Generic Enrichment Map (GEM) file
GEM格式文件是Cytoscape EnrichmentMap app中需要的通路富集結(jié)果文件赂韵。其需要包括GO ID, Description, p-value, FDR, Phenotype and Genes.
由于gost
函數(shù)輸出的p-value已經(jīng)是多重檢驗(yàn)校正后的值,所以這里直接復(fù)制兩列一樣的p-value挠蛉,其實(shí)都是FDR
gem <- gores1_filtered$result[,c("term_id", "term_name", "p_value", "intersection")]
colnames(gem) <- c("GO.ID", "Description", "p.Val", "Genes")
gem$FDR <- gem$p.Val
gem$Phenotype = "+1"
gem <- gem[,c("GO.ID", "Description", "p.Val", "FDR", "Phenotype", "Genes")]
head(gem, n=3)
## GO.ID Description p.Val FDR Phenotype
## 95 GO:0048732 gland development 3.82e-13 3.82e-13 +1
## 124 GO:0048762 mesenchymal cell differentiation 6.31e-12 6.31e-12 +1
## 125 GO:0060485 mesenchyme development 6.62e-12 6.62e-12 +1
## Genes
## 95 NF1,GATA3,FBXW7,ARHGAP35,TBX3,CEBPA,AKT1,SOX9,WT1,HGF,ERBB4,ARID5B,ELF3,FGFR2,BRCA2
## 124 PTEN,GATA3,NOTCH1,CTNNB1,TBX3,EPHA3,SOX9,HGF,ERBB4,MTOR,FOXA1,FGFR2,SMAD4,FOXA2,TGFBR2,SMAD2,AXIN2,EZH2
## 125 PTEN,GATA3,NOTCH1,CTNNB1,TBX3,EPHA3,SOX9,WT1,HGF,ERBB4,MTOR,FOXA1,FGFR2,SMAD4,FOXA2,TGFBR2,SMAD2,AXIN2,EZH2
保存該文件祭示,在Cytoscape中使用
write.table(gem, file = "extdata/gProfiler_gem.txt", sep = "\t", quote = F, row.names = F)
the Gene Matrix Transposed file format (GMT)
GMT格式文件是一種描述基因集的文件格式;其中每一行代表一個(gè)基因集 谴古,GMT示例如下圖:
其中第一列為基因集的ID质涛,第二列是基因集的描述,后續(xù)的列為該基因集中的基因ID
如果是使用gprofiler
進(jìn)行分析掰担,可以在其網(wǎng)站上下載對(duì)應(yīng)輸入基因ID的GMT文件(https://biit.cs.ut.ee/gprofiler/gost)
首先汇陆,在Organism的下拉菜單中選擇相應(yīng)物種,我們這里選擇Homo sapiens
然后带饱,打開(kāi)[Data sources]菜單毡代,下載相應(yīng)基因ID的GMT
我們這里使用symbol作為基因ID,下載name.gmt.zip
這個(gè)文件
我們這里只使用GO:BP
和REAC
的注釋勺疼,故合并這兩個(gè)文件為hsapiens.GOBP_REAC.name.gmt
作為后續(xù)分析使用的文件 教寂,文件的前幾行如下所示:
P.S. combined.name.gmt.zip
提供的是各個(gè)數(shù)據(jù)庫(kù)合并后的GMT
另外,gprofiler提供的gmt不包含KEGG和TRANSFAC的數(shù)據(jù)
"Note that this file does not include annotations from KEGG and Transfac as we are restricted by data source licenses that do not allow us to share these two data sources with our users."
Cytoscape | Visualization of enrichment results with EnrichmentMap
接下來(lái)的操作需要在Cytoscape中進(jìn)行执庐,確保你的電腦安裝了Cytoscape酪耕,以及其中的EnrichmentMap app
- 打開(kāi)Cytoscape后,點(diǎn)擊上方的[Apps]選項(xiàng)卡
- 打開(kāi)[EnrichmentMap]
在彈出的Enrichment Map窗口中轨淌,
- 點(diǎn)擊"+"新建分析
- 修改分析的名字
- 選擇富集分析的類型迂烁,這里我們選擇"Generic/gProfiler/Enrichr"
- 輸入GEM格式的富集分析結(jié)果,這里輸入我們前面保存的
extdata/gProfiler_gem.txt
- 輸入GMT格式的基因集注釋文件猿诸,這里輸入合并的
hsapiens.GOBP_REAC.name.gmt
- (可選)輸入基因表達(dá)的排序文件婚被,這里我們用文獻(xiàn)提供的
Supplementary_Table2_MesenvsImmuno_RNASeq_ranks.rnk
- (可選)輸入基因表達(dá)矩陣文件,這里我們用文獻(xiàn)提供的
Supplementary_Table6_TCGA_OV_RNAseq_expression.txt
- (可選)根據(jù)表達(dá)文件進(jìn)行基因篩選梳虽,勾選該選項(xiàng)會(huì)對(duì)GMT文件的基因進(jìn)行篩選址芯,排除不在表達(dá)文件中的基因。
- (可選)設(shè)置每個(gè)富集條目的FDR cutoff,高于該cutoff的富集項(xiàng)將會(huì)被排除谷炸,默認(rèn)是0.05北专,這里設(shè)置為0.01
- (可選)對(duì)網(wǎng)絡(luò)連接度進(jìn)行調(diào)整。越往左生成的網(wǎng)絡(luò)越稀疏旬陡,越往右生成的網(wǎng)絡(luò)越密集拓颓。本質(zhì)上是對(duì)富集項(xiàng)之間的相關(guān)性閾值進(jìn)行調(diào)整,從而減少/增加顯示的邊描孟。點(diǎn)擊[Show advanced options]會(huì)顯示更多關(guān)于網(wǎng)絡(luò)的調(diào)整項(xiàng)目
最后驶睦,點(diǎn)擊[Build]生成網(wǎng)絡(luò)
在[Layout]菜單中,可以嘗試更換為不同的網(wǎng)絡(luò)布局匿醒,看哪一種布局展示的網(wǎng)絡(luò)效果最佳场航,默認(rèn)為"the unweighted Prefuse Force Directed layout".
Cytoscape | Auto annotate clustered enriched terms
Enrichment maps typically include clusters of similar pathways representing major biological themes.
富集譜圖(Enrichment Map)常常包括了代表主要的生物學(xué)主題的
這一節(jié),我們將使用Cytoscape中的"AutoAnnotate" app進(jìn)行通路clusters的自動(dòng)注釋廉羔。
在進(jìn)行AutoAnnotate前溉痢,要確認(rèn)網(wǎng)絡(luò)的layout是較優(yōu)的方案。這是由于它會(huì)根據(jù)當(dāng)前的layout進(jìn)行l(wèi)abel憋他。
- 點(diǎn)擊[Apps] --> [AutoAnnotate] --> [New Annotation Set...]
- 在"Quick Start"頁(yè)面中孩饼,點(diǎn)擊[Create Annotations]
運(yùn)行后,網(wǎng)絡(luò)上功能相近的通路節(jié)點(diǎn)會(huì)被圈為一個(gè)cluster竹挡,并標(biāo)注上相應(yīng)的biological themes科盛。標(biāo)注的字體越大說(shuō)明該cluster包含的節(jié)點(diǎn)越多娱俺。另外,在Control panel中會(huì)多了"AutoAnnotations"這一個(gè)頁(yè)面。
這樣生成的注釋結(jié)果一般是沒(méi)法看的壶运,我們需要將重疊的圈和標(biāo)注錯(cuò)開(kāi)坏匪,將"無(wú)關(guān)的"cluster注釋去除喳资,最終達(dá)到發(fā)表級(jí)別的樣子
最后檀头,我們可以將運(yùn)行的cytoscape session保存為.cys
文件,方便之后再分析坎怪。
在生成富集譜圖(Enrichment Map)后更為關(guān)鍵的是對(duì)網(wǎng)絡(luò)的解讀罢坝!
通路信息天然的具有冗余的特質(zhì),因此通路富集分析往往會(huì)給出對(duì)同一個(gè)通路不同描述的注釋結(jié)果搅窿。富集譜圖 (Enrichment Map) 簡(jiǎn)化了冗余的通路信息嘁酿,將其展示為有代表性的生物學(xué)主題 (biological themes),更有利于我們對(duì)基因通路富集結(jié)果的解讀男应。但這只是相當(dāng)于幫助我們凝練了通路富集的信息闹司,而對(duì)于信息的解讀還是依賴于我們的生物學(xué)知識(shí)。我們需要依據(jù)實(shí)驗(yàn)的假設(shè)沐飘、課題的背景做出合理的解釋游桩。例如在我們這篇文章的例子中對(duì)癌癥常見(jiàn)的驅(qū)動(dòng)突變基因進(jìn)行分析牲迫,就發(fā)現(xiàn)了許多與腫瘤生長(zhǎng)相關(guān)的通路。當(dāng)全面展現(xiàn)出基因集的通路信息后借卧,我們還可以根據(jù)感興趣的通路來(lái)尋找相關(guān)的基因進(jìn)行后續(xù)分析盹憎。總而言之铐刘,進(jìn)行數(shù)據(jù)分析更為關(guān)鍵的是對(duì)數(shù)據(jù)結(jié)果的解讀陪每。
Ref
2019 Nature Protocol: https://doi.org/10.1038/s41596-018-0103-9
gprofiler2 vignettes: https://cran.r-project.org/web/packages/gprofiler2/vignettes/gprofiler2.html
Cytoscape: https://cytoscape.org/