Paper: REVIGO Summarizes and Visualizes Long Lists of Gene Ontology Terms
Publication dates: Received March 2, 2011; Accepted June 7, 2011; Published July 18, 2011
Journal: PLoS ONE
Citations: 1980
DOI: https://doi.org/10.1371/journal.pone.0021800
Background and abstract
Original text:
Outcomes of high-throughput biological experiments are typically interpreted by statistical testing for enriched gene functional categories defined by the Gene Ontology (GO). The resulting lists of GO terms may be large and highly redundant, and thus difficult to interpret. REVIGO is a Web server that summarizes long, unintelligible lists of GO terms by finding a representative subset of the terms using a simple clustering algorithm that relies on semantic similarity measures. Furthermore, REVIGO visualizes this non-redundant GO term set in multiple ways to assist in interpretation: multidimensional scaling and graph-based visualizations accurately render the subdivisions and the semantic relationships in the data, while treemaps and tag clouds are also offered as alternative views. REVIGO is freely available at http://revigo.irb.hr/.
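Key points:
GO enrichment results are often highly redundant, which makes them hard to interpret.
REVIGO picks a representative subset of the GO terms using a simple clustering algorithm that relies on semantic similarity measures, which removes redundancy from the GO enrichment results.
REVIGO visualizes the non-redundant results to help interpret the semantic relationships and the subdivisions (hierarchy) in the data.
Treemaps and tag clouds are offered as alternative views.
Introduction
Para 1: Covers GO enrichment analysis in general; background only, skipped.
Para 2: How semantic redundancy affects the results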
As high-throughput techniques become cheaper and more accurate, they detect even slight changes in gene expression or other measured properties. The lists of relevant genes will grow in size, and so will the derived lists of GO terms. Additionally, the redundancy in the resulting set of GO terms confounds interpretation and inflates the perceived number of biologically relevant results. This is frequently the case when analyzing terms in a parent-child relationship, e.g. the parent term ‘‘GO:0009058 biosynthetic process’’ fully encompasses its child term ‘‘GO:0008610 lipid biosynthetic process’’. In a list of terms enriched with overexpressed genes, if the child term has highly statistically significant enrichment, the parent term might appear significantly enriched purely as a consequence of including all the genes from the child term.
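The most common example: GO:0009058 biosynthetic process is the parent of GO:0008610 lipid biosynthetic process. If the genes in lipid biosynthetic process are very significantly enriched, biosynthetic process may appear significantly enriched purely because it contains all the genes of its child term.
Para 3: Introduces some existing redundancy-reduction tools: GOrilla, RedundancyMiner. Not covered here.
Para 4: GO slims, their pros and cons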
In the same vein, researchers may attempt to simplify long GO term lists by replacing the full Gene Ontology with ‘‘GO Slims’’, cut-down versions of the Gene Ontology. The GO slims are, however, limited to general (high-level) GO terms which are typically less interesting than the more fine-grained terms – the ones that have been removed from the GO slims. Thus, the problem of weeding out the redundant GO terms is not easily solved by removing the GO terms’ descendants (or ancestors) in this manner. The complex structure of the GO warrants a solution that takes into account the terms’ proximity in the GO graph, quantified by the GO term ‘semantic similarity’ measures [8].
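- GO slims can be described as cut-down (stripped) versions of the Gene Ontology. For details see: GO slim quickgo geneontology-go-subset
A GO slim does remove redundancy to some extent, but taking only a subset of the full GO discards many fine-grained terms, and those terms are often more meaningful than the high-level terms that are kept.
Overall, GO slims look like a big reduction in workload but leave out a lot of important information. CJ has also said in the group chat that GO slims should not be recommended blindly.
The authors argue that the complex structure of GO calls for a more principled solution than GO slims, and they propose using semantic similarity, which accounts for how close terms are in the GO graph.
Para 5: What REVIGO does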
We have implemented a computational approach that (a) summarizes long GO lists by reducing functional redundancies, and (b) visualizes the remaining GO terms in two-dimensional plots, interactive graphs, treemaps or tag clouds. Both the summarization and the visualization step draw on the concept of GO term semantic similarity, reviewed in [8]. In particular, several common measures of semantic similarity [9] that employ the ‘most informative common ancestor’ approach are supported. The implementation is freely available as the REVIGO Web server at http://revigo.irb.hr/.
First, it simplifies long GO term lists by reducing functional redundancy.
Second, it visualizes the remaining GO terms in several ways.
Website: http://revigo.irb.hr/
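The quoted paragraph names semantic similarity measures based on the "most informative common ancestor" but does not spell one out. The sketch below shows SimRel as it is commonly defined, assuming p(t) is the fraction of annotations that fall under term t and that the common ancestor has already been chosen. The function name and the toy frequencies are made up for illustration and are not taken from the paper.

```python
import math

def sim_rel(p_term1, p_term2, p_ancestor):
    """SimRel-style similarity from annotation frequencies, each in (0, 1].

    p_ancestor is the frequency of the most informative common ancestor.
    """
    # An ancestor that annotates everything (the root) carries no information.
    if p_ancestor >= 1.0:
        return 0.0
    shared = 2.0 * math.log(p_ancestor)
    total = math.log(p_term1) + math.log(p_term2)
    # The (1 - p_ancestor) factor down-weights very general ancestors.
    return (shared / total) * (1.0 - p_ancestor)

# Hypothetical frequencies: two specific terms under a fairly specific ancestor.
print(sim_rel(0.001, 0.002, 0.01))
# The same two terms when their only common ancestor is the root.
print(sim_rel(0.001, 0.002, 1.0))
```

The closer the common ancestor is to the two terms in information content, the closer the value gets to 1.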
Results and discussion
A simple algorithm to reduce redundancy within lists of GO terms
To mitigate the problem of large and redundant lists, we aim to find a single representative GO term for each of these clusters. REVIGO performs a simple clustering procedure which is in concept similar to the hierarchical (agglomerative) clustering methods such as the neighbor joining approach [10]. A flowchart of the steps in the algorithm is given in Fig. 1.
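REVIGO first clusters the redundant GO term list and then picks one representative GO term for each cluster. The simple clustering procedure is similar in concept to hierarchical (agglomerative) methods such as the neighbor joining approach.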
The intuition behind this procedure is to form groups of highly similar GO terms, where the choice of the groups’ representatives is guided by the p-values, enrichments or similar values that the user supplies alongside the GO terms (Fig. 1).
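In other words, highly similar GO terms are grouped together, and the choice of each group's representative is guided by the p-values, enrichments or similar values supplied by the user.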
If the p-values are quite close and one term is a child node of the other, REVIGO will tend to choose the parent term, with a possible exception when the terms are deemed to be de facto equivalent (Fig. 1, see caption). Note that REVIGO generally does not prioritize higher-level or lower-level GO terms as cluster representatives – instead, the user-supplied p-values/enrichments are used to guide the selection, if possible.
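If the p-values are close and one term is a child node of the other, REVIGO tends to keep the parent term. The exception is when the two terms are deemed de facto equivalent.
REVIGO does not generally prioritize higher-level or lower-level GO terms as cluster representatives; where possible, the user-supplied p-values or enrichments guide the choice.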
Very general GO terms, however, are always avoided as cluster representatives (Fig. 1) as they tend to be uninformative. It is also possible to manually override the choice of the representative GO term using the ‘pin’ option in case the default solution is not satisfactory for the user e.g. when a more general, higher-level term is desired to represent the group.
The user does not necessarily need to provide previously determined p-values or another numerical value alongside the GO terms. In that case, REVIGO will prioritize the terms with higher ‘uniqueness’, the negative of average similarity of a term to all other terms.
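REVIGO always avoids very general GO terms as cluster representatives because they are uninformative, for example a term as broad as "catalytic reaction". The 'pin' option lets you override the chosen representative, for instance when you want a higher-level GO term to represent the cluster.
p-values are not required either: without them, REVIGO prioritizes the terms with higher uniqueness.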
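To make "uniqueness" concrete, here is a minimal sketch under stated assumptions. The text only says it is the negative of a term's average similarity to all other terms; the shift by 1 (so that scores stay between 0 and 1) is my assumption and does not change the ranking. The term names and similarity values are a toy example.

```python
def uniqueness(terms, sim):
    """Score each term by 1 minus its mean similarity to every other term.

    sim maps frozenset({a, b}) to a similarity in [0, 1]; missing pairs count as 0.
    """
    scores = {}
    for term in terms:
        others = [sim.get(frozenset((term, other)), 0.0)
                  for other in terms if other != term]
        # A term with no companions is treated as maximally unique.
        scores[term] = (1.0 - sum(others) / len(others)) if others else 1.0
    return scores

# Toy similarities (hypothetical): A and B are close, C is the outlier.
toy_sim = {
    frozenset(("A", "B")): 0.8,
    frozenset(("A", "C")): 0.2,
    frozenset(("B", "C")): 0.4,
}
scores = uniqueness(["A", "B", "C"], toy_sim)
print(max(scores, key=scores.get))
```

When no p-values are supplied, a ranking like this could stand in for them when choosing which term of a redundant pair to keep.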
The terms that remain in the list after the algorithm has finished are the cluster representatives, where it is guaranteed that no two representatives will be more similar than a user-provided cutoff value C. In other words, a lower (more stringent) value of C will result in a shorter, but also a more semantically diverse list. To offer some bearing on the relationship of C to statistical significance, we conducted a simulation where we drew random pairs of GO terms and recorded the distribution of the SimRel semantic similarity measure [11] (default in REVIGO). One percent of randomly generated GO term pairs have SimRel > 0.53. Therefore, at C = 0.53 there is a 99% chance an above-background similarity exists between each pair of terms in a cluster. REVIGO offers four pre-defined values of C (0.9, 0.7, 0.5 and 0.4) to the user. The lowest value of C = 0.4 – corresponding to the ‘‘tiny’’ list size – should be used with caution, as many GO terms might be removed from the list without strong statistical support for their redundancy with respect to other terms. The values of C = 0.7 (default) and 0.9 are much more conservative in this respect, but may not shorten the list enough.
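- The user supplies a cutoff C. When the algorithm finishes, no two representative GO terms are more similar to each other than C. In other words, the lower (more stringent) C is, the shorter and the more semantically diverse the list.
- In a simulation with random GO term pairs, only 1% of the pairs had SimRel > 0.53, so at C = 0.53 there is a 99% chance that each pair of terms in a cluster has above-background similarity.
- C = 0.4 (the "tiny" list size) should be used with caution, because many GO terms may be removed without strong statistical support for their redundancy.
- C = 0.9 or C = 0.7 (default) are the conservative choices, but they may not shorten the list enough.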
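The reasoning behind the 0.53 figure can be sketched as an empirical tail cutoff: sample similarities of random term pairs, then find the value that only 1% of them exceed. The helper name and the evenly spaced sample below are placeholders of mine; the paper's real simulation used SimRel on random GO term pairs.

```python
import math

def background_cutoff(similarities, tail=0.01):
    """Return the sampled value that at most a `tail` fraction of samples exceed."""
    ordered = sorted(similarities)
    if not ordered:
        raise ValueError("need at least one sampled similarity")
    # Number of samples allowed to lie strictly above the cutoff.
    allowed_above = math.floor(len(ordered) * tail + 1e-9)
    index = max(0, len(ordered) - allowed_above - 1)
    return ordered[index]

# Placeholder sample: 100 evenly spaced values, NOT real SimRel data.
toy_sample = [i / 100 for i in range(100)]
print(background_cutoff(toy_sample))
```

A pair whose similarity exceeds this cutoff is unlikely to be that similar by chance, which is the sense in which C = 0.53 corresponds to 99% confidence.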
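Figure 1: A flowchart describing the REVIGO algorithm to remove redundant GO terms from the provided GO term list.
- Compute the pairwise semantic similarity for all GO terms in the list.
- Find the most similar pair of terms, ti and tj.
- If the similarity of ti and tj is below the cutoff C, stop.
- If the similarity of ti and tj is above the cutoff C, one of them has to be dropped, according to the rules below:
- If one of the terms has a very broad interpretation (frequency > 5%), reject that very general term, then go back to the second step and find the next most similar pair among the remaining terms.
- Otherwise (neither is a general term), compare the p-values of the two terms and reject the one with the less significant p-value. The other term stays in the list for the next round.
- If the p-values are close, check whether ti and tj are in a parent-child relationship. If they are, reject the child (the lower-level term) and keep the parent for the next round, unless the two are de facto equivalent (see the Fig. 1 caption).
- If they are not, reject one at random and keep the other for the next round.
- The terms left in the list when the loop stops are the cluster representatives.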
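The flowchart steps above can be sketched roughly as follows. This is my reading of the notes, not REVIGO's actual code: the "close p-values" tolerance (`p_log_tol`), the data layout, and all names are assumptions, and the "de facto equivalent" exception is left out.

```python
import math
import random
from itertools import combinations

def pick_loser(a, b, pvalue, frequency, ancestors, rng,
               general_freq=0.05, p_log_tol=0.5):
    """Return the term to drop from a redundant pair (a, b)."""
    # 1. A very general term is rejected first.
    a_general = frequency[a] > general_freq
    b_general = frequency[b] > general_freq
    if a_general != b_general:
        return a if a_general else b
    # 2. Otherwise drop the term with the clearly weaker (larger) p-value.
    gap = math.log10(pvalue[a]) - math.log10(pvalue[b])
    if abs(gap) > p_log_tol:
        return a if gap > 0 else b
    # 3. p-values are close: keep the parent, drop the child.
    if a in ancestors.get(b, set()):
        return b
    if b in ancestors.get(a, set()):
        return a
    # 4. No rule applies: drop one at random.
    return rng.choice([a, b])

def reduce_redundancy(terms, sim, pvalue, frequency, ancestors,
                      cutoff=0.7, seed=0):
    """Greedy loop: drop one term of the most similar pair until none exceeds cutoff."""
    rng = random.Random(seed)
    remaining = list(terms)
    merged_into = {}
    while len(remaining) > 1:
        a, b = max(combinations(remaining, 2),
                   key=lambda pair: sim.get(frozenset(pair), 0.0))
        if sim.get(frozenset((a, b)), 0.0) <= cutoff:
            break
        loser = pick_loser(a, b, pvalue, frequency, ancestors, rng)
        winner = b if loser == a else a
        remaining.remove(loser)
        merged_into[loser] = winner
    return remaining, merged_into

# Toy input (hypothetical values): A and B are redundant, C stands alone.
toy_sim = {frozenset(("A", "B")): 0.8,
           frozenset(("A", "C")): 0.3,
           frozenset(("B", "C")): 0.2}
toy_p = {"A": 1e-6, "B": 1e-3, "C": 1e-4}
toy_freq = {"A": 0.01, "B": 0.01, "C": 0.01}
kept, merged = reduce_redundancy(["A", "B", "C"], toy_sim, toy_p, toy_freq,
                                 ancestors={}, cutoff=0.5)
print(kept, merged)
```

In the toy run, B is dropped in favor of A because A has the more significant p-value, and the loop stops once the highest remaining similarity (0.3) is no longer above the cutoff.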
Visualization in scatterplots and interactive graphs
In drawing scatterplots (Fig. 2), the challenge lies in assigning x and y coordinates to each term so that more semantically similar GO terms are also closer in the plot. Here, we employ a multidimensional scaling procedure which initially places the terms using an eigenvalue decomposition of the terms’ pairwise distance matrix. This is followed by a stress minimization step which iteratively improves the agreement between the GO terms’ semantic similarities and their closeness in the displayed two-dimensional space. The GO terms and associated data (term descriptions, p-values/enrichments, uniqueness, etc.) can be exported to a convenient text table and downloaded.
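Scatterplots first: each GO term is assigned x and y coordinates so that semantically similar GO terms end up closer together in the plot. This is done by multidimensional scaling: an initial placement from an eigenvalue decomposition of the pairwise distance matrix, followed by iterative stress minimization. The GO terms and their associated values can be exported from the site as a text table and downloaded.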
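For readers who want to see what the eigenvalue-based initial placement might look like, here is a dependency-free sketch of classical multidimensional scaling. It covers only the initial placement; the stress minimization step is omitted. Using power iteration to get the eigenvectors is my simplification, and turning similarities into distances (for example 1 - similarity) is an assumption the caller has to make.

```python
import math
import random

def classical_mds(dist, dims=2, iters=500, seed=0):
    """Place n items in `dims` dimensions from an n x n symmetric distance matrix."""
    n = len(dist)
    # Double-center the squared distances to obtain an inner-product matrix.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the dominant eigenvector of the current matrix.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        if lam > 0:
            for i in range(n):
                coords[i][k] = math.sqrt(lam) * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords

# Toy distances: the corners of a 3 x 4 rectangle (diagonals of length 5).
toy_dist = [[0, 3, 5, 4],
            [3, 0, 4, 5],
            [5, 4, 0, 3],
            [4, 5, 3, 0]]
for point in classical_mds(toy_dist):
    print([round(c, 3) for c in point])
```

The recovered layout is only defined up to rotation and reflection, so what matters is that the distances between the plotted points match the input distances.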
REVIGO also allows the user to make a graph-based visualization (Fig. 3). Each of the GO terms is a node in the graph, and 3% of the strongest GO term pairwise similarities are designated as edges in the graph. The threshold value of 3% was derived empirically; we found it strikes a good balance between over-connected graphs with no visible subgroups on the one hand, and very fragmented graphs with too many small groups on the other hand. The placement of the nodes is determined by the ForceDirected layout algorithm as implemented in Cytoscape Web [12]. In addition to being viewed in the Web browser, the graph may be exported to an XGMML file, or opened in the standalone Cytoscape program [13] via Java Web Start to produce high resolution, publication-quality images. Both visualizations indicate the generality of the GO terms by the bubble radius, where smaller bubbles imply more specific terms; the user-supplied p-values/enrichments are shown using color shading.
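- Nodes: all the GO terms.
- Edges: the strongest 3% of the GO term pairwise similarities; the 3% threshold is empirical.
- Bubble size reflects the generality of the GO term: the more specific the term, the smaller the bubble; the more general, the larger.
- Color reflects the p-values or enrichments.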
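The edge rule above can be sketched in a few lines. Rounding up and always keeping at least one edge are my assumptions for small lists (the paper only states the 3% figure), and the names and similarity values are illustrative.

```python
import math
from itertools import combinations

def top_similarity_edges(terms, sim, fraction=0.03):
    """Return the strongest `fraction` of term pairs as graph edges."""
    pairs = sorted(combinations(terms, 2),
                   key=lambda pair: sim.get(frozenset(pair), 0.0),
                   reverse=True)
    if not pairs:
        return []
    # Assumption: round up, and keep at least one edge for very short lists.
    keep = max(1, math.ceil(len(pairs) * fraction))
    return pairs[:keep]

# Toy similarities (hypothetical) for four terms, i.e. six pairs.
toy_sim = {
    frozenset(("A", "B")): 0.9, frozenset(("A", "C")): 0.1,
    frozenset(("A", "D")): 0.2, frozenset(("B", "C")): 0.7,
    frozenset(("B", "D")): 0.3, frozenset(("C", "D")): 0.8,
}
print(top_similarity_edges(["A", "B", "C", "D"], toy_sim, fraction=0.5))
```

With a larger fraction the graph becomes over-connected, and with a smaller one it fragments, which is the trade-off the 3% value is meant to balance.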
Two additional views of the user’s data are supported in REVIGO. Treemaps (Fig. 4) show a two-level hierarchy of GO terms – the cluster representatives from the scatterplot and the graph are here joined into several very high-level groups. Tag clouds show (a) keywords which are overrepresented in the GO terms’ descriptions in the GO term list provided by the user (Fig. 5)
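Treemaps show a two-level hierarchy of GO terms: the cluster representatives from the scatterplot and the graph are joined into several very high-level groups.
Tag clouds work like ordinary word clouds: they highlight keywords that are overrepresented in the descriptions of the GO terms in the user's list.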
An example use-case: summarizing the putative targets of a transcription factor
To illustrate how REVIGO’s redundancy elimination algorithm (Fig. 1) works, we turn to a ‘toy example’ which has seven GO categories with associated p-values (Fig. 6). This dataset [14] lists gene functional categories co-expressed with the human gene coding for the transcription factor ZNF417, but not with the highly related protein ZNF587, measured using Affymetrix U133plus2 microarrays. The ZNF417 is an evolutionarily recent, great ape-specific transcription factor of which the ZNF587 is a more ancient homolog [14]; gene functions associated specifically to ZNF417 were found to be associated with brain development.
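About the example data: this gene set is co-expressed with the transcription factor ZNF417 but not with ZNF587. ZNF587 is the more ancient homolog, while ZNF417 appeared recently and is specific to great apes. Gene functions associated specifically with ZNF417 were found to be related to brain development.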
A casual inspection reveals subgroups of redundant gene functions. For instance, the GO term ‘‘cerebral cortex neuron differentiation’’ has a high semantic similarity (SimRel = 0.72) to ‘‘telencephalon development’’ and is therefore removed by merging it into the cluster represented by the term having a more significant p-value (Fig. 6). The removed term is assigned a ‘dispensability’ value of 0.72, a relatively high value reflecting the removed term’s strong redundancy with respect to the chosen representative. In the next group of terms, ‘‘astrocyte differentiation’’ and ‘‘negative regulation of neuron differentiation’’ are similar (0.74 and 0.62, respectively) to ‘‘negative regulation of glial cell differentiation’’. Due to a weaker p-value, the first two terms are merged into a cluster represented by the last term (Fig. 6). Note how the choice of cluster representatives is unaffected by whether terms are more general or more specific. The highest remaining pairwise similarity (here, 0.40) is below the user-defined threshold C, here set to 0.5, and the clustering algorithm stops. In other words, after having removed the redundant terms, the ones that remain as the cluster representatives are those terms having dispensability values below C. The example list of seven GO terms has been reduced to four clusters, of which two are singletons
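cerebral cortex neuron differentiation has a semantic similarity of 0.72 (SimRel = 0.72) to telencephalon development, so it is removed from the list and merged into the cluster represented by telencephalon development, which has the more significant p-value. The removed term is assigned a "dispensability" value of 0.72, a relatively high value that reflects its strong redundancy with respect to the chosen representative. That cluster then contains telencephalon development itself plus cerebral cortex neuron differentiation.
Similarly, the two terms that are semantically similar to negative regulation of glial cell differentiation (astrocyte differentiation at 0.74 and negative regulation of neuron differentiation at 0.62) are merged into its cluster because their p-values are weaker. After these merges the highest remaining pairwise similarity is 0.40, which is below the cutoff of 0.5, so the loop stops.
In short, the terms are clustered by following the Figure 1 procedure on the SimRel values, which removes the semantic redundancy; in this example the 7 GO terms are reduced to 4.
If C is set higher, for example 0.7 or 0.9, the redundancy is not removed as effectively.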
A possible alternative for REVIGO’s summarization procedure are the frequently used ‘‘GO slims’’. Here, the seven terms are quite specific and consequently none of them is in the ‘‘generic’’ or ‘‘PIR’’ GO slims (http://www.geneontology.org/GO.slims.shtml). Therefore, the GO slim approach would not apply to this dataset, illustrating the general principle of how summarizing the list by filtering out the more specific (or equivalently, higher information content) GO terms results in a loss of the potentially more interesting results.
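Here the paper calls out GO slims: because these 7 terms are so specific, none of them is in the GO slims, so the GO slim approach cannot be applied to this example.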
In addition to the ‘dispensability’ values, REVIGO provides ‘uniqueness’ values. These two values are anticorrelated, though not perfectly, since ‘uniqueness’ measures whether the term is an outlier when compared semantically to the whole list (without regard for the p-values), while the ‘dispensability’ compares a term to other semantically close terms and is assigned based both on the semantic distance and the supplied p-values.
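The paper also introduces uniqueness. Dispensability and uniqueness are anticorrelated, though not perfectly: uniqueness measures whether a GO term is an outlier relative to the whole list (ignoring p-values), while dispensability compares a term to its semantically close terms using both the semantic distance and the p-values.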
To demonstrate the multidimensional scaling-based visualization in REVIGO, we visualize these terms in Fig. 7; for illustrative purposes, all seven terms are visible in this instance, instead of only the four cluster representatives. Here, it can be seen how two terms are quite distinct from the rest and also from each other: ‘‘regulation of dopamine metabolism’’ and ‘‘sensory perception of chemical stimulus’’ – these terms were not assigned to any of the clusters in the redundancy elimination procedure described above. The remaining five terms are more closely related, where the ‘‘telencephalon development’’ and ‘‘negative regulation of glial cell differentiation’’ have more significant p-values than the three other terms and were thus chosen as cluster representatives.
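Reading the result figure (Fig. 7): all seven terms are shown here rather than only the four cluster representatives. Two terms, "regulation of dopamine metabolism" and "sensory perception of chemical stimulus", sit far from the rest and from each other, and were not assigned to any cluster. The other five are closely related, and "telencephalon development" and "negative regulation of glial cell differentiation" were chosen as cluster representatives because their p-values are more significant.
The last part compares REVIGO with other software. I have not used those tools, so it is not discussed here.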
=========END===========