2019-08-20 [Third-generation sequencing] MatchAnnot, software for evaluating annotation of reference-based transcriptomes

matchAnnot: Iso-Seq annotation software

MatchAnnot is a python script which accepts a SAM file of IsoSeq transcripts aligned to a genomic reference and matches them to an annotation database in GTF format.
The aligner used must be splice-aware. MatchAnnot has been developed using the STAR aligner (http://code.google.com/p/rna-star).

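Installation

Download the zip file from GitHub and extract it, for example by running:
unzip MatchAnnot-master.zip
Then go into the folder and run the relevant .py file.
matchAnnot.py and clusterView.py can be run directly from the command line, either by prefixing them with their path or, more simply, by adding the folder to your PATH.
Check that the genome and the annotation match; they must correspond to each other.
The input SAM file must be sorted.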
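Since the SAM file has to be sorted, it can be worth checking before a long run. The following is a minimal sketch, assuming "sorted" means coordinate order (reference sequence, then position). The function name `sam_is_coordinate_sorted` is hypothetical and is not part of MatchAnnot.

```python
import io


def sam_is_coordinate_sorted(handle):
    """Return True if the alignment records appear in coordinate order.

    Reference order is taken from the @SQ header lines when present,
    otherwise from the order in which references first appear.
    """
    ref_rank = {}
    last_key = None
    for line in handle:
        if line.startswith("@"):
            if line.startswith("@SQ"):
                for field in line.rstrip("\n").split("\t")[1:]:
                    if field.startswith("SN:"):
                        ref_rank.setdefault(field[3:], len(ref_rank))
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[2] == "*":
            continue  # skip blank lines and unmapped records
        rname, pos = fields[2], int(fields[3])
        key = (ref_rank.setdefault(rname, len(ref_rank)), pos)
        if last_key is not None and key < last_key:
            return False
        last_key = key
    return True


# Tiny made-up SAM content, for illustration only.
header = "@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chr2\tLN:1000\n"
rec = "{name}\t0\t{chrom}\t{pos}\t255\t5M\t*\t0\t0\tACGTA\t*\n"
good = header + rec.format(name="a", chrom="chr1", pos=10) \
              + rec.format(name="b", chrom="chr1", pos=50) \
              + rec.format(name="c", chrom="chr2", pos=5)
bad = header + rec.format(name="b", chrom="chr1", pos=50) \
             + rec.format(name="a", chrom="chr1", pos=10)

print(sam_is_coordinate_sorted(io.StringIO(good)))
print(sam_is_coordinate_sorted(io.StringIO(bad)))
```

For a real file, pass an open file handle instead of the `io.StringIO` objects.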

Running

Running it is simple.

Usage: matchAnnot.py [options] <SAM_file> ...

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --gtf=GTF             annotations file, in format specified by --format
  --format=FORMAT       annotations in alternate gtf format (def: standard)
  --clusters=CLUSTERS   cluster_report.csv file name (optional)
  --vars                print variants for each cluster (def: no)
  --outpickle=OUTPICKLE
                        matches in pickle format (optional)

Example command line:
matchAnnot.py --gtf gencode.v19.annotation.gtf myData.sam > annotations.out

Input file requirements:

MatchAnnot expects the following inputs:

    --gtf          Annotation file, in format as described by --format option (Mandatory).
    --format       Format of annotation file: 'standard', 'alt' or 'pickle' (default: standard).
    --clusters     cluster_report.csv as produced by IsoSeq (Optional).
    (pipe or arg)  SAM file of IsoSeq transcripts aligned to genomic reference (Mandatory).

Output file format:

The output of the gencode_isoseq.pl script contains several types of line:

isoform:     A mapped isoform, output of IsoSeq. Line shows isoform name,
             and start and end genomic coordinates of alignment.

cigar:       The cigar string from the SAM file entry for the isoform. (Read from the SAM file.)

cl:          *A list of the reads-of-insert which were clustered to create the
             isoform*. This information is printed only if a cluster report file
             is supplied via the --clusters parameter. Each line lists one or
             more reads from a single SMRTcell, labelled as either full-length
             or non-FL. The mapping from SMRTcell number to full SMRTcell name
             is in the summary at the end of the output.

polyA:       A list of the positions where polyadenylation motifs were found
             near the 3' end of the isoform. (Useful for tallying where polyA
             signals occur, which motifs are used, and so on.)

gene:        A gene in the annotation file whose position overlaps the
             aligned isoform. Line shows gene name, its start and end
             coordinates, and the differences between those and the
             isoform start and end.
***
tr:          An annotated transcript of the gene under consideration. Line
             shows transcript name, a score, and the exon-to-exon
             mapping. Each [] grouping in the exon mapping is
             a list of transcript exons which match the isoform exons
             (see example below). Scores are as follows:

             5: IsoSeq exons match annotation exons one-for-one. Sizes agree
                except for leading and trailing edges.

             4: Like 5, but leading and trailing edge sizes differ by a 
                larger amount than the score-5 transcript found for this gene.

             3: One-for-one exon match, but sizes of internal exons disagree.

             2: The best match among all score=1 transcripts.

             1: Some exons overlap, overlaps are 1-for-1 where they exist.

             0: Everything else: isoform overlaps gene, but little or
                no exon congruence.

exon:        Details of a single exon match. Shown only for transcripts
             with score >= 3. Line shows isoform and transcript start and
             stop coordinates and the delta between them, plus the
             number of indels found in the alignment (per the cigar
             string).

result:      A one-line summary for the isoform, showing the best gene and
             transcript found, and the resulting score.

summary:     Bookkeeping information at the end.
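The exon and cigar lines above both derive from the SAM CIGAR string: with a splice-aware aligner, `N` operations mark introns, and `I`/`D` operations mark indels. The sketch below shows one way to recover exon blocks and an indel count from a SAM position and CIGAR. It is an illustration, not MatchAnnot's own code; the function name `exons_and_indels` is hypothetical, and counting indels per operation rather than per base is an assumption.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def exons_and_indels(pos, cigar):
    """Split an alignment into exon blocks and count indel operations.

    pos   -- 1-based leftmost reference position (SAM column 4)
    cigar -- CIGAR string (SAM column 6)
    Returns (exons, indels), where exons is a list of 1-based inclusive
    (start, end) pairs, one per block between N operations, and indels
    is the number of I and D operations.
    """
    exons = []
    indels = 0
    start = pos
    cursor = pos  # next reference base to be consumed
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=XD":
            cursor += length          # consumes reference inside an exon
        elif op == "N":
            exons.append((start, cursor - 1))
            cursor += length          # skip the intron
            start = cursor
        if op in "ID":
            indels += 1
    exons.append((start, cursor - 1))
    return exons, indels


print(exons_and_indels(100, "50M1I49M200N30M2D20M"))
# prints ([(100, 198), (399, 450)], 2)
```

Soft clips (`S`), hard clips (`H`) and padding (`P`) do not consume reference bases, so they are ignored here.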


An example of an exon mapping (exons are numbered from 0):

                       1           2             3          4             5
   isoform:        ==========    ======    ==============  ===         =======
   transcript      =====        =========    ====    =========  =====    ========
                     1              2         3          4        5         6

   maps as follows:

   [1] [2] [3,4] [4] [6]


   An ideal mapping is one-for-one:

   [1] [2] [3] [4] [5]


   To make it *really* ideal, the exon coordinates should be equal as well (or nearly so).
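The mapping in this example can be reproduced with a simple interval-overlap test. The sketch below is only an illustration of the idea, not MatchAnnot's implementation: the coordinates are made up (they are the character columns of the diagram), the function names `map_exons` and `is_one_for_one` are hypothetical, and exons are numbered from 1 to match the diagram.

```python
def map_exons(isoform, transcript):
    """For each isoform exon, list the transcript exons it overlaps.

    Both arguments are lists of inclusive (start, end) pairs sorted by
    position. Transcript exons are reported by 1-based number.
    """
    mapping = []
    for i_start, i_end in isoform:
        hits = [
            number
            for number, (t_start, t_end) in enumerate(transcript, start=1)
            if t_start <= i_end and i_start <= t_end  # intervals overlap
        ]
        mapping.append(hits)
    return mapping


def is_one_for_one(mapping):
    """True if every isoform exon hits exactly one transcript exon and
    no transcript exon is hit by more than one isoform exon."""
    single = [hits[0] for hits in mapping if len(hits) == 1]
    return len(single) == len(mapping) and len(set(single)) == len(single)


# Made-up coordinates taken from the columns of the diagram above.
isoform = [(19, 28), (33, 38), (43, 56), (59, 61), (71, 77)]
transcript = [(19, 23), (32, 40), (45, 48), (53, 61), (64, 68), (73, 80)]

mapping = map_exons(isoform, transcript)
print(" ".join("[" + ",".join(map(str, hits)) + "]" for hits in mapping))
# prints [1] [2] [3,4] [4] [6]
print(is_one_for_one(mapping))
# prints False
```

Transcript exon 5 falls entirely inside an isoform intron, so it appears in no group, and transcript exon 4 is hit by two isoform exons, which is why the mapping is not one-for-one.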

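The annotation score levels look fairly important, so here is my reading of them. One complaint about a certain company's translation: their rendering of score 1 is
"the transcript's exons correspond one-to-one with the exons of the annotated known transcript, but only some of the exons overlap;"
I could never work out what that meant until I read the original text, which made it clear; the translation basically has it twisted around.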

https://github.com/TomSkelly/MatchAnnot/wiki/How-to-Interpret-matchAnnot-Output

tr:          An annotated transcript of the gene under consideration. Line shows transcript name, a score, and the exon-to-exon mapping. Each [] grouping in the exon mapping is a list of transcript exons which match the isoform exons (see example below). Scores are as follows:

             5: IsoSeq exons match annotation exons one-for-one. Sizes agree except for leading and trailing edges.
                      The PB transcript's exons correspond exactly one-to-one with those of the annotated known transcript; they differ only at the outer ends of the transcript's start and end regions.
             4: Like 5, but leading and trailing edge sizes differ by a larger amount than the score-5 transcript found for this gene.
                      The PB transcript's exons correspond exactly one-to-one with those of the annotated known transcript, as for score 5, but the differences at the outer ends of the start and end regions are larger.
             3: One-for-one exon match, but sizes of internal exons disagree.
                      The transcript's exons correspond one-to-one with those of the annotated known transcript (same structure?), but the sizes of the internal exons differ.
             2: The best match among all score=1 transcripts.
                      The best-matching transcript among all score=1 transcripts? (Rather hard to interpret.)
             1: Some exons overlap, overlaps are 1-for-1 where they exist.
                      The PB transcript matches only some of the exons, but where it does match, the exons correspond one-to-one with those of the known transcript.
             0: Everything else: isoform overlaps gene, but little or no exon congruence.
                      The transcript lies within the gene's interval, but has essentially no exon overlap with the known transcripts.


None of these descriptions states outright that a given score strictly requires 1-vs-1 exon correspondence, so judge by the actual results.
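One possible reading of the difference between scores 5/4 and score 3 is sketched below: once exons pair up one-for-one, check whether every internal exon boundary agrees, so that only the leading edge of the first exon and the trailing edge of the last exon are free to differ. This is my interpretation of the wording above, not MatchAnnot's actual scoring code. The function names and the `tolerance` parameter are hypothetical, and the coordinates are made up.

```python
def internal_edges_agree(isoform, transcript, tolerance=0):
    """True if paired exons differ only at the leading and trailing edges.

    isoform, transcript -- equal-length lists of (start, end) pairs that
    have already been matched one-for-one.
    tolerance -- hypothetical allowance, in bases, for boundary wobble.
    """
    if len(isoform) != len(transcript):
        raise ValueError("exon lists must pair up one-for-one")
    last = len(isoform) - 1
    pairs = zip(isoform, transcript)
    for k, ((i_start, i_end), (t_start, t_end)) in enumerate(pairs):
        # The start of the first exon is the leading edge: not checked.
        if k > 0 and abs(i_start - t_start) > tolerance:
            return False
        # The end of the last exon is the trailing edge: not checked.
        if k < last and abs(i_end - t_end) > tolerance:
            return False
    return True


def edge_delta(isoform, transcript):
    """Total disagreement at the leading and trailing edges, in bases."""
    return (abs(isoform[0][0] - transcript[0][0])
            + abs(isoform[-1][1] - transcript[-1][1]))


annotated = [(100, 200), (300, 400), (500, 600)]
iso_a = [(90, 200), (300, 400), (500, 650)]    # only the outer edges differ
iso_b = [(100, 200), (300, 410), (500, 600)]   # an internal boundary differs

print(internal_edges_agree(iso_a, annotated), edge_delta(iso_a, annotated))
# prints True 60
print(internal_edges_agree(iso_b, annotated))
# prints False
```

Under this reading, `iso_a` would be a candidate for score 5 or 4 (the transcript with the smaller edge delta for the gene taking 5), while `iso_b` would be a candidate for score 3.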

References
https://github.com/TomSkelly/MatchAnnot/wiki/How-to-Run-matchAnnot
https://github.com/TomSkelly/MatchAnnot/wiki/How-to-Interpret-matchAnnot-Output
https://github.com/TomSkelly/MatchAnnot/wiki

Last edited: 2019.08.20 11:37:03
