基因通路富集分析

Following tutorial from 2019 Nature Protocol (https://www.nature.com/articles/s41596-018-0103-9)

Prerequisite

Project directory: Pathway_Enrichment_Turorial/

Data: https://static-content.springer.com/esm/art%3A10.1038%2Fs41596-018-0103-9/MediaObjects/41596_2018_103_MOESM1_ESM.zip

下載后解壓于Pathway_Enrichment_Tutorial/data/,壓縮包內(nèi)包括以下文件:

  • Supplementary_Methods_Supplementary_protocols.pdf
  • Supplementary_Table1_Cancer_drivers.txt
  • Supplementary_Table2_MesenvsImmuno_RNASeq_ranks.rnk
  • Supplementary_Table3_Human_GOBP_AllPathways_no_GO_iea_July_01_2017_symbol.gmt
  • Supplementary_Table4_gprofiler_results.txt
  • Supplementary_Table5_hsapiens.pathways.NAME.gmt
  • Supplementary_Table6_TCGA_OV_RNAseq_expression.txt
  • Supplementary_Table7_TCGA_OV_RNAseq_classes.cls
  • Supplementary_Table8_gsea_report_for_na_pos.xls
  • Supplementary_Table9_gsea_report_for_na_neg.xls
  • Supplementary_Table10_TCGA_RNASeq_rawcounts.txt
  • Supplementary_Table11_RNASeq_classdefinitions.txt
  • Supplementary_Table12_TCGA_Microarray_rmanormalized.txt
  • Supplementary_Table13_Microarray_classdefinitions.txt

Software:

R | Pathway enrichment analysis

using g:Profiler

這里使用g:Profiler的R包進(jìn)行富集分析癣猾,也可以使用它的網(wǎng)頁(yè)版(https://biit.cs.ut.ee/gprofiler/gost進(jìn)行分析

  1. 首先料滥,使用Supplementary_Table1_Cancer_drivers.txt中的基因列表進(jìn)行常規(guī)的通路富集类垫。

In R

if (!requireNamespace("gprofiler2")) install.packages("gprofiler2")
library(gprofiler2)
# read in query gene list
queryG <- read.delim('data/supplementary_files/Supplementary_Table1_Cancer_drivers.txt', header=F)

gores1 <- gost(queryG$V1, 
               organism='hsapiens',
               exclude_iea=TRUE, 
               ordered_query=TRUE,
               evcodes=TRUE,
               sources=c('GO:BP', 'REAC')) 

這里選擇物種為"hsapiens" - Homo sapiens.

exclude_iea: 排除沒(méi)有經(jīng)過(guò)人工審核的不靠譜GO注釋 (less reliable GO annotations, IEAs)

ordered_query: 把輸入的基因認(rèn)為是排序過(guò)的基因列表榄攀,以一種GSEA的方式進(jìn)行通路富集

evcodes: 返回輸入基因與相應(yīng)通路產(chǎn)生交集的基因ID

sources: 富集的數(shù)據(jù)庫(kù)谍倦,這里選擇GO的BP項(xiàng)和Reactome的通路娜扇。

Available data sources and their abbreviations are:

  • Gene Ontology (GO or by branch GO:MF, GO:BP, GO:CC)
  • KEGG (KEGG)
  • Reactome (REAC): an open-source, open access, manually curated and peer-reviewed pathway database.
  • WikiPathways (WP): an open, collaborative platform dedicated to the curation of biological pathways.
  • TRANSFAC (TF): is the database of eukaryotic transcription factors, their genomic binding sites and DNA-binding profiles.
  • miRTarBase (MIRNA): miRTarBase has accumulated more than three hundred and sixty thousand miRNA-target interactions (MTIs), which are collected by manually surveying pertinent literature after NLP of the text systematically to filter research articles related to functional studies of miRNAs.
  • Human Protein Atlas (HPA): including all the human proteins in cells, tissues, and organs using an integration of various omics technologies.
  • CORUM (CORUM): provides a resource of manually annotated protein complexes from mammalian organisms.
  • Human phenotype ontology (HP): provides a standardized vocabulary of phenotypic abnormalities encountered in human disease.

同時(shí),我們對(duì)富集到的通路進(jìn)行篩選季俩,排除包含過(guò)多基因的通路(大于350個(gè))慈缔,過(guò)少基因的通路(小于5個(gè)),以及與輸入基因交集過(guò)小的通路(小于3個(gè))

# filtered by term_size and intersection_size
# term_size - number of genes that are annotated to the term
# intersection_size - the number of genes in the input query that are annotated to the term
gores1_filtered <- list("result" = subset(gores1$result, 
                                          term_size > 5 & 
                                            term_size < 350 & 
                                            intersection_size > 3),
                        "meta" = gores1$meta)

理論上种玛,我們應(yīng)當(dāng)在進(jìn)行通路富集之前進(jìn)行排除,這樣才可以通過(guò)排除這類通路來(lái)通過(guò)統(tǒng)計(jì)檢驗(yàn)的效力(power)瓤檐,但是gprofiler2::gost()并沒(méi)有提供這樣的參數(shù)以供我們?cè)O(shè)定相應(yīng)的cutoff

富集的結(jié)果保存在$table

# pathway enrichment results table
head(gores1_filtered$result, n=3)
##       query significant  p_value term_size query_size intersection_size
## 95  query_1        TRUE 3.82e-13       142         81                15
## 124 query_1        TRUE 6.31e-12       197        120                18
## 125 query_1        TRUE 6.62e-12       230        120                19
##     precision recall    term_id source                        term_name
## 95      0.185 0.1056 GO:0048732  GO:BP                gland development
## 124     0.150 0.0914 GO:0048762  GO:BP mesenchymal cell differentiation
## 125     0.158 0.0826 GO:0060485  GO:BP           mesenchyme development
##     effective_domain_size source_order                parents
## 95                  16592        15077             GO:0048513
## 124                 16592        15103 GO:0030154, GO:0060485
## 125                 16592        17301 GO:0009888, GO:0048513
##                                                                                      evidence_codes
## 95                          ISS,IDA,ISS,ISS,IMP,ISS,TAS,IEP ISS,IGI ISS,IDA,IMP ISS,TAS,ISS,ISS,IDA
## 124         IMP,IMP,IMP ISS,IMP IGI TAS,IMP,ISS,ISS,TAS,ISS,IMP,IMP,ISS,ISS,IMP,IDA ISS,ISS,IMP,IDA
## 125 IMP,IMP,IMP ISS,IMP IGI TAS,IMP,ISS,ISS,ISS,TAS,ISS,IMP,IMP,ISS,ISS,IMP,IDA ISS,IMP ISS,IMP,IDA
##                                                                                                    intersection
## 95                          NF1,GATA3,FBXW7,ARHGAP35,TBX3,CEBPA,AKT1,SOX9,WT1,HGF,ERBB4,ARID5B,ELF3,FGFR2,BRCA2
## 124     PTEN,GATA3,NOTCH1,CTNNB1,TBX3,EPHA3,SOX9,HGF,ERBB4,MTOR,FOXA1,FGFR2,SMAD4,FOXA2,TGFBR2,SMAD2,AXIN2,EZH2
## 125 PTEN,GATA3,NOTCH1,CTNNB1,TBX3,EPHA3,SOX9,WT1,HGF,ERBB4,MTOR,FOXA1,FGFR2,SMAD4,FOXA2,TGFBR2,SMAD2,AXIN2,EZH2

The result data.frame contains the following columns:

  • term_id - unique term identifier (e.g GO:0005005)
  • p_values - hypergeometric p-values after correction for multiple testing in the order of input queries
  • significant - indicators in the order of input queries for statistically significant results
  • term_size - number of genes that are annotated to the term
  • query_sizes - number of genes that were included in the query in the order of input queries
  • intersection_sizes - the number of genes in the input query that are annotated to the corresponding term in the order of input queries
  • source - the abbreviation of the data source for the term (e.g. GO:BP)
  • term_name - the short name of the function
  • effective_domain_size - the total number of genes "in the universe" used for the hypergeometric test
  • source_order - numeric order for the term within its data source (this is important for drawing the results)
  • source_order - numeric order for the term within its data source (this is important for drawing the results)
  • parents - list of term IDs that are hierarchically directly above the term. For non-hierarchical data sources this points to an artificial root node.

gProfiler2提供Manhattan plot進(jìn)行通路富集結(jié)果的可視化

gostplot(gores1_filtered, capped = TRUE, interactive = TRUE)

Preparing data for EnrichmentMap

Creating a Generic Enrichment Map (GEM) file

GEM格式文件是Cytoscape EnrichmentMap app中需要的通路富集結(jié)果文件赂韵。其需要包括GO ID, Description, p-value, FDR, Phenotype and Genes.

由于gost函數(shù)輸出的p-value已經(jīng)是多重檢驗(yàn)校正后的值,所以這里直接復(fù)制兩列一樣的p-value挠蛉,其實(shí)都是FDR

gem <- gores1_filtered$result[,c("term_id", "term_name", "p_value", "intersection")]
colnames(gem) <- c("GO.ID", "Description", "p.Val", "Genes")
gem$FDR <- gem$p.Val
gem$Phenotype = "+1"
gem <- gem[,c("GO.ID", "Description", "p.Val", "FDR", "Phenotype", "Genes")]
head(gem, n=3)
##          GO.ID                      Description    p.Val      FDR Phenotype
## 95  GO:0048732                gland development 3.82e-13 3.82e-13        +1
## 124 GO:0048762 mesenchymal cell differentiation 6.31e-12 6.31e-12        +1
## 125 GO:0060485           mesenchyme development 6.62e-12 6.62e-12        +1
##                                                                                                           Genes
## 95                          NF1,GATA3,FBXW7,ARHGAP35,TBX3,CEBPA,AKT1,SOX9,WT1,HGF,ERBB4,ARID5B,ELF3,FGFR2,BRCA2
## 124     PTEN,GATA3,NOTCH1,CTNNB1,TBX3,EPHA3,SOX9,HGF,ERBB4,MTOR,FOXA1,FGFR2,SMAD4,FOXA2,TGFBR2,SMAD2,AXIN2,EZH2
## 125 PTEN,GATA3,NOTCH1,CTNNB1,TBX3,EPHA3,SOX9,WT1,HGF,ERBB4,MTOR,FOXA1,FGFR2,SMAD4,FOXA2,TGFBR2,SMAD2,AXIN2,EZH2

保存該文件祭示,在Cytoscape中使用

write.table(gem, file = "extdata/gProfiler_gem.txt", sep = "\t", quote = F, row.names = F)

the Gene Matrix Transposed file format (GMT)

GMT格式文件是一種描述基因集的文件格式;其中每一行代表一個(gè)基因集 谴古,GMT示例如下圖:


其中第一列為基因集的ID质涛,第二列是基因集的描述,后續(xù)的列為該基因集中的基因ID

如果是使用gprofiler進(jìn)行分析掰担,可以在其網(wǎng)站上下載對(duì)應(yīng)輸入基因ID的GMT文件(https://biit.cs.ut.ee/gprofiler/gost

首先汇陆,在Organism的下拉菜單中選擇相應(yīng)物種,我們這里選擇Homo sapiens

然后带饱,打開(kāi)[Data sources]菜單毡代,下載相應(yīng)基因ID的GMT

我們這里使用symbol作為基因ID,下載name.gmt.zip這個(gè)文件

我們這里只使用GO:BPREAC的注釋勺疼,故合并這兩個(gè)文件為hsapiens.GOBP_REAC.name.gmt作為后續(xù)分析使用的文件 教寂,文件的前幾行如下所示:

P.S. combined.name.gmt.zip提供的是各個(gè)數(shù)據(jù)庫(kù)合并后的GMT

另外,gprofiler提供的gmt不包含KEGG和TRANSFAC的數(shù)據(jù)

"Note that this file does not include annotations from KEGG and Transfac as we are restricted by data source licenses that do not allow us to share these two data sources with our users."

Cytoscape | Visualization of enrichment results with EnrichmentMap

接下來(lái)的操作需要在Cytoscape中進(jìn)行执庐,確保你的電腦安裝了Cytoscape酪耕,以及其中的EnrichmentMap app

  1. 打開(kāi)Cytoscape后,點(diǎn)擊上方的[Apps]選項(xiàng)卡
  2. 打開(kāi)[EnrichmentMap]

在彈出的Enrichment Map窗口中轨淌,

  1. 點(diǎn)擊"+"新建分析
  2. 修改分析的名字
  3. 選擇富集分析的類型迂烁,這里我們選擇"Generic/gProfiler/Enrichr"
  4. 輸入GEM格式的富集分析結(jié)果,這里輸入我們前面保存的extdata/gProfiler_gem.txt
  5. 輸入GMT格式的基因集注釋文件猿诸,這里輸入合并的hsapiens.GOBP_REAC.name.gmt
  6. (可選)輸入基因表達(dá)的排序文件婚被,這里我們用文獻(xiàn)提供的Supplementary_Table2_MesenvsImmuno_RNASeq_ranks.rnk
  7. (可選)輸入基因表達(dá)矩陣文件,這里我們用文獻(xiàn)提供的Supplementary_Table6_TCGA_OV_RNAseq_expression.txt
  8. (可選)根據(jù)表達(dá)文件進(jìn)行基因篩選梳虽,勾選該選項(xiàng)會(huì)對(duì)GMT文件的基因進(jìn)行篩選址芯,排除不在表達(dá)文件中的基因。
  9. (可選)設(shè)置每個(gè)富集條目的FDR cutoff,高于該cutoff的富集項(xiàng)將會(huì)被排除谷炸,默認(rèn)是0.05北专,這里設(shè)置為0.01
  10. (可選)對(duì)網(wǎng)絡(luò)連接度進(jìn)行調(diào)整。越往左生成的網(wǎng)絡(luò)越稀疏旬陡,越往右生成的網(wǎng)絡(luò)越密集拓颓。本質(zhì)上是對(duì)富集項(xiàng)之間的相關(guān)性閾值進(jìn)行調(diào)整,從而減少/增加顯示的邊描孟。點(diǎn)擊[Show advanced options]會(huì)顯示更多關(guān)于網(wǎng)絡(luò)的調(diào)整項(xiàng)目

最后驶睦,點(diǎn)擊[Build]生成網(wǎng)絡(luò)

在[Layout]菜單中,可以嘗試更換為不同的網(wǎng)絡(luò)布局匿醒,看哪一種布局展示的網(wǎng)絡(luò)效果最佳场航,默認(rèn)為"the unweighted Prefuse Force Directed layout".

Cytoscape | Auto annotate clustered enriched terms

Enrichment maps typically include clusters of similar pathways representing major biological themes.

富集譜圖(Enrichment Map)常常包括了代表主要的生物學(xué)主題的

這一節(jié),我們將使用Cytoscape中的"AutoAnnotate" app進(jìn)行通路clusters的自動(dòng)注釋廉羔。

在進(jìn)行AutoAnnotate前溉痢,要確認(rèn)網(wǎng)絡(luò)的layout是較優(yōu)的方案。這是由于它會(huì)根據(jù)當(dāng)前的layout進(jìn)行l(wèi)abel憋他。

  1. 點(diǎn)擊[Apps] --> [AutoAnnotate] --> [New Annotation Set...]
  1. 在"Quick Start"頁(yè)面中孩饼,點(diǎn)擊[Create Annotations]

運(yùn)行后,網(wǎng)絡(luò)上功能相近的通路節(jié)點(diǎn)會(huì)被圈為一個(gè)cluster竹挡,并標(biāo)注上相應(yīng)的biological themes科盛。標(biāo)注的字體越大說(shuō)明該cluster包含的節(jié)點(diǎn)越多娱俺。另外,在Control panel中會(huì)多了"AutoAnnotations"這一個(gè)頁(yè)面。

這樣生成的注釋結(jié)果一般是沒(méi)法看的壶运,我們需要將重疊的圈和標(biāo)注錯(cuò)開(kāi)坏匪,將"無(wú)關(guān)的"cluster注釋去除喳资,最終達(dá)到發(fā)表級(jí)別的樣子

最后檀头,我們可以將運(yùn)行的cytoscape session保存為.cys文件,方便之后再分析坎怪。


在生成富集譜圖(Enrichment Map)后更為關(guān)鍵的是對(duì)網(wǎng)絡(luò)的解讀罢坝!

通路信息天然的具有冗余的特質(zhì),因此通路富集分析往往會(huì)給出對(duì)同一個(gè)通路不同描述的注釋結(jié)果搅窿。富集譜圖 (Enrichment Map) 簡(jiǎn)化了冗余的通路信息嘁酿,將其展示為有代表性的生物學(xué)主題 (biological themes),更有利于我們對(duì)基因通路富集結(jié)果的解讀男应。但這只是相當(dāng)于幫助我們凝練了通路富集的信息闹司,而對(duì)于信息的解讀還是依賴于我們的生物學(xué)知識(shí)。我們需要依據(jù)實(shí)驗(yàn)的假設(shè)沐飘、課題的背景做出合理的解釋游桩。例如在我們這篇文章的例子中對(duì)癌癥常見(jiàn)的驅(qū)動(dòng)突變基因進(jìn)行分析牲迫,就發(fā)現(xiàn)了許多與腫瘤生長(zhǎng)相關(guān)的通路。當(dāng)全面展現(xiàn)出基因集的通路信息后借卧,我們還可以根據(jù)感興趣的通路來(lái)尋找相關(guān)的基因進(jìn)行后續(xù)分析盹憎。總而言之铐刘,進(jìn)行數(shù)據(jù)分析更為關(guān)鍵的是對(duì)數(shù)據(jù)結(jié)果的解讀陪每。

Ref

2019 Nature Protocol: https://doi.org/10.1038/s41596-018-0103-9

gprofiler2 vignettes: https://cran.r-project.org/web/packages/gprofiler2/vignettes/gprofiler2.html

Cytoscape: https://cytoscape.org/

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個(gè)濱河市镰吵,隨后出現(xiàn)的幾起案子檩禾,更是在濱河造成了極大的恐慌,老刑警劉巖疤祭,帶你破解...
    沈念sama閱讀 206,013評(píng)論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件锌订,死亡現(xiàn)場(chǎng)離奇詭異,居然都是意外死亡画株,警方通過(guò)查閱死者的電腦和手機(jī),發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,205評(píng)論 2 382
  • 文/潘曉璐 我一進(jìn)店門(mén)啦辐,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)谓传,“玉大人,你說(shuō)我怎么就攤上這事芹关⌒” “怎么了?”我有些...
    開(kāi)封第一講書(shū)人閱讀 152,370評(píng)論 0 342
  • 文/不壞的土叔 我叫張陵侥衬,是天一觀的道長(zhǎng)诗祸。 經(jīng)常有香客問(wèn)我,道長(zhǎng)轴总,這世上最難降的妖魔是什么直颅? 我笑而不...
    開(kāi)封第一講書(shū)人閱讀 55,168評(píng)論 1 278
  • 正文 為了忘掉前任,我火速辦了婚禮怀樟,結(jié)果婚禮上功偿,老公的妹妹穿的比我還像新娘。我一直安慰自己往堡,他們只是感情好械荷,可當(dāng)我...
    茶點(diǎn)故事閱讀 64,153評(píng)論 5 371
  • 文/花漫 我一把揭開(kāi)白布。 她就那樣靜靜地躺著虑灰,像睡著了一般吨瞎。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上穆咐,一...
    開(kāi)封第一講書(shū)人閱讀 48,954評(píng)論 1 283
  • 那天颤诀,我揣著相機(jī)與錄音字旭,去河邊找鬼。 笑死着绊,一個(gè)胖子當(dāng)著我的面吹牛谐算,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播归露,決...
    沈念sama閱讀 38,271評(píng)論 3 399
  • 文/蒼蘭香墨 我猛地睜開(kāi)眼洲脂,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼!你這毒婦竟也來(lái)了剧包?” 一聲冷哼從身側(cè)響起恐锦,我...
    開(kāi)封第一講書(shū)人閱讀 36,916評(píng)論 0 259
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤,失蹤者是張志新(化名)和其女友劉穎疆液,沒(méi)想到半個(gè)月后一铅,有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 43,382評(píng)論 1 300
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡堕油,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 35,877評(píng)論 2 323
  • 正文 我和宋清朗相戀三年潘飘,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片掉缺。...
    茶點(diǎn)故事閱讀 37,989評(píng)論 1 333
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡卜录,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出眶明,到底是詐尸還是另有隱情艰毒,我是刑警寧澤,帶...
    沈念sama閱讀 33,624評(píng)論 4 322
  • 正文 年R本政府宣布搜囱,位于F島的核電站丑瞧,受9級(jí)特大地震影響,放射性物質(zhì)發(fā)生泄漏蜀肘。R本人自食惡果不足惜绊汹,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,209評(píng)論 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望扮宠。 院中可真熱鬧灸促,春花似錦、人聲如沸涵卵。這莊子的主人今日做“春日...
    開(kāi)封第一講書(shū)人閱讀 30,199評(píng)論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)轿偎。三九已至典鸡,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間坏晦,已是汗流浹背萝玷。 一陣腳步聲響...
    開(kāi)封第一講書(shū)人閱讀 31,418評(píng)論 1 260
  • 我被黑心中介騙來(lái)泰國(guó)打工嫁乘, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留,地道東北人球碉。 一個(gè)月前我還...
    沈念sama閱讀 45,401評(píng)論 2 352
  • 正文 我出身青樓蜓斧,卻偏偏與公主長(zhǎng)得像,于是被迫代替她去往敵國(guó)和親睁冬。 傳聞我的和親對(duì)象是個(gè)殘疾皇子挎春,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 42,700評(píng)論 2 345

推薦閱讀更多精彩內(nèi)容