matchAnnot: Iso-Seq annotation software
https://github.com/TomSkelly/MatchAnnot
MatchAnnot is a Python script that accepts a SAM file of IsoSeq transcripts aligned to a genomic reference and matches them to an annotation database in GTF format.
The aligner used must be splice-aware. MatchAnnot has been developed using the STAR aligner (http://code.google.com/p/rna-star).
Installation
Download the zip file from GitHub and extract it, for example by running:
unzip MatchAnnot-master.zip
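Then go into the folder and run the .py file you need.
matchAnnot.py and clusterView.py can be run directly from the command line, either by prefixing the path to the script or, more simply, by putting them on your PATH.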
Check that the genome and the annotation match each other. They must match.
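One cheap sanity check for the point above is to compare sequence names between the alignment and the annotation. The sketch below is only an illustration: the function names and sample data are hypothetical, and agreeing names do not prove that the genome build is the same. It collects the `SN:` names from the SAM `@SQ` header lines and the first column of the GTF, then reports GTF names missing from the SAM header.

```python
def sam_reference_names(sam_lines):
    """Collect reference names from the SN: field of @SQ header lines."""
    names = set()
    for line in sam_lines:
        if not line.startswith("@"):
            break  # the header ends where alignment records begin
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def gtf_sequence_names(gtf_lines):
    """Collect the seqname (first column) of every GTF feature line."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def unmatched_gtf_names(sam_lines, gtf_lines):
    """Return GTF seqnames that do not appear in the SAM header."""
    return sorted(gtf_sequence_names(gtf_lines) - sam_reference_names(sam_lines))


# Hypothetical data: the second GTF record uses "2" where the SAM header has "chr2".
sam_header = [
    "@HD\tVN:1.4\tSO:coordinate\n",
    "@SQ\tSN:chr1\tLN:1000\n",
    "@SQ\tSN:chr2\tLN:2000\n",
]
gtf = [
    "##description: example\n",
    'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G1";\n',
    '2\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G2";\n',
]
print(unmatched_gtf_names(sam_header, gtf))
```

A non-empty result (here the name `2`) suggests a naming mismatch such as `chr2` versus `2`, which is worth resolving before running matchAnnot.py.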
If the input is a SAM file, it must be sorted.
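The note above does not say which sort order is meant. The sketch below assumes coordinate order (records grouped by reference, with non-decreasing positions inside each reference) and checks for it in plain Python. The function name and sample records are hypothetical.

```python
def is_coordinate_sorted(sam_lines):
    """Return True if records are grouped by reference and positions never decrease."""
    seen_refs = set()
    current_ref = None
    last_pos = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.split("\t")
        ref, pos = fields[2], int(fields[3])  # RNAME and POS columns
        if ref != current_ref:
            if ref in seen_refs:
                return False  # a reference reappears after another one
            seen_refs.add(ref)
            current_ref, last_pos = ref, pos
        elif pos < last_pos:
            return False  # position went backwards within a reference
        else:
            last_pos = pos
    return True


# Hypothetical minimal records (11 mandatory SAM columns).
sorted_sam = [
    "@HD\tVN:1.4\n",
    "iso1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*\n",
    "iso2\t0\tchr1\t250\t255\t50M\t*\t0\t0\t*\t*\n",
    "iso3\t0\tchr2\t10\t255\t50M\t*\t0\t0\t*\t*\n",
]
unsorted_sam = [sorted_sam[0], sorted_sam[2], sorted_sam[1], sorted_sam[3]]

print(is_coordinate_sorted(sorted_sam))
print(is_coordinate_sorted(unsorted_sam))
```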
Running
Running it is simple:
Usage: matchAnnot.py [options] <SAM_file> ...
Options:
  --version            show program's version number and exit
  -h, --help           show this help message and exit
  --gtf=GTF            annotations file, in format specified by --format
  --format=FORMAT      annotations in alternate gtf format (def: standard)
  --clusters=CLUSTERS  cluster_report.csv file name (optional)
  --vars               print variants for each cluster (def: no)
  --outpickle=OUTPICKLE
                       matches in pickle format (optional)
Example command line:
matchAnnot.py --gtf gencode.v19.annotation.gtf myData.sam > annotations.out
The input file requirements are as follows:
MatchAnnot expects the following inputs:
--gtf Annotation file, in format as described by --format option (Mandatory).
--format Format of annotation file: 'standard', 'alt' or 'pickle' (default: standard).
--clusters cluster_report.csv as produced by IsoSeq (Optional).
(pipe or arg) SAM file of IsoSeq transcripts aligned to genomic reference (Mandatory).
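If you drive the script from Python rather than from the shell, the inputs listed above can be assembled into an argument list. This is a minimal sketch: `build_matchannot_command` and `run_matchannot` are hypothetical helper names, only options that appear in the usage text are used, and it assumes matchAnnot.py is on your PATH.

```python
import subprocess


def build_matchannot_command(gtf, sam_files, clusters=None, gtf_format=None,
                             script="matchAnnot.py"):
    """Build the argument list for matchAnnot.py from the documented options."""
    cmd = [script, "--gtf", gtf]
    if gtf_format is not None:
        cmd += ["--format", gtf_format]  # 'standard', 'alt' or 'pickle'
    if clusters is not None:
        cmd += ["--clusters", clusters]  # cluster_report.csv from IsoSeq
    cmd += list(sam_files)
    return cmd


def run_matchannot(cmd, out_path):
    """Run the command and write its standard output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


cmd = build_matchannot_command(
    "gencode.v19.annotation.gtf",
    ["myData.sam"],
    clusters="cluster_report.csv",
)
print(" ".join(cmd))
```

Calling `run_matchannot(cmd, "annotations.out")` would then correspond to the shell redirection `> annotations.out`. It is not called here because it needs the real script and input files.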
The output file format is as follows:
The output of the gencode_isoseq.pl script contains several types of line:
isoform: A mapped isoform, output of IsoSeq. Line shows isoform name,
and start and end genomic coordinates of alignment.
cigar: The cigar string from the SAM file entry for the isoform (read directly from the SAM file).
cl: *A list of the reads-of-insert which were clustered to create the
isoform*. This information is printed only if a cluster report file
is supplied via the --clusters parameter. Each line lists one or
more reads from a single SMRTcell, labelled as either full-length
or non-FL. The mapping from SMRTcell number to full SMRTcell name
is in the summary at the end of the output.
polyA: A list of the positions where polyadenylation motifs were found
near the 3' end of the isoform. Useful for tallying where polyA signals occur, which motifs are found, and so on.
gene: A gene in the annotation file whose position overlaps the
aligned isoform. Line shows gene name, its start and end
coordinates, and the differences between those and the
isoform start and end.
***
tr: An annotated transcript of the gene under consideration. Line
shows transcript name, a score, and the exon-to-exon
mapping. Each [] grouping in the exon mapping is
a list of transcript exons which match the isoform exons
(see example below). Scores are as follows:
5: IsoSeq exons match annotation exons one-for-one. Sizes agree
except for leading and trailing edges.
4: Like 5, but leading and trailing edge sizes differ by a
larger amount than the score-5 transcript found for this gene.
3: One-for-one exon match, but sizes of internal exons disagree.
2: The best match among all score=1 transcripts.
1: Some exons overlap, overlaps are 1-for-1 where they exist.
0: Everything else: isoform overlaps gene, but little or
no exon congruence.
exon: Details of a single exon match. Shown only for transcripts
with score >= 3. Line shows isoform and transcript start and
stop coordinates and the delta between them, plus the
number of indels found in the alignment (per the cigar
string).
result: A one-line summary for the isoform, showing the best gene and
transcript found, and the resulting score.
summary: Bookkeeping information at the end.
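The exon lines described above report a number of indels taken from the cigar string. The sketch below shows one way such a count could be derived. It is not matchAnnot's own code, and whether matchAnnot counts indel events or indel bases is an assumption left open here, so both are returned. `N` operations (introns in a spliced alignment) are deliberately not counted as deletions.

```python
import re

# One CIGAR operation: a length followed by an operation letter from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_indels(cigar):
    """Return (insertion events, deletion events, inserted bases, deleted bases)."""
    ins_events = del_events = ins_bases = del_bases = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op == "I":
            ins_events += 1
            ins_bases += int(length)
        elif op == "D":
            del_events += 1
            del_bases += int(length)
    return ins_events, del_events, ins_bases, del_bases


# Hypothetical cigar string: two insertions (2 + 3 bases), one 1-base deletion,
# and a 1000-base intron (N) that is ignored.
print(count_indels("10M2I5M1D8M1000N20M3I4M"))
```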
An example of an exon mapping (exons are numbered from 0):
                1          2            3           4             5
isoform:    ==========   ======   ==============   ===         =======
transcript    =====    =========   ====      =========  =====    ========
                1          2        3            4        5         6
maps as follows:
[1] [2] [3,4] [4] [6]
An ideal mapping is one-for-one:
[1] [2] [3] [4] [5]
To make it *really* ideal, the exon coordinates should be equal as well (or nearly so).
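The mapping in the example above can be reproduced by a simple interval-overlap test. The sketch below is an illustration rather than matchAnnot's implementation: the coordinates are hypothetical, chosen so that the overlaps agree with the example, and transcript exons are numbered from 1 to match the labels in the diagram.

```python
def map_exons(isoform_exons, transcript_exons):
    """For each isoform exon, list the 1-based transcript exons that overlap it.

    Exons are (start, end) pairs with inclusive coordinates.
    """
    mapping = []
    for iso_start, iso_end in isoform_exons:
        hits = [
            number
            for number, (tr_start, tr_end) in enumerate(transcript_exons, start=1)
            if tr_start <= iso_end and iso_start <= tr_end  # intervals overlap
        ]
        mapping.append(hits)
    return mapping


# Hypothetical coordinates laid out like the diagram.
isoform = [(12, 21), (25, 30), (34, 47), (51, 53), (63, 69)]
transcript = [(14, 18), (23, 31), (35, 38), (45, 53), (56, 60), (65, 72)]

print(map_exons(isoform, transcript))  # [[1], [2], [3, 4], [4], [6]]
```

Transcript exon 5 falls in the gap between isoform exons 4 and 5, so it appears in no group, and transcript exon 4 appears twice because it spans two isoform exons.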
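The score levels look like the most important part of the annotation, so here is my reading of them. A complaint about one company's translation first: their rendering of score 1 was
"The transcript's exons correspond one-to-one with those of the annotated known transcript, but only some of the exons overlap;"
I could never work out what that meant. It only became clear once I read the original text: the translation has it more or less backwards.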
https://github.com/TomSkelly/MatchAnnot/wiki/How-to-Interpret-matchAnnot-Output
tr: An annotated transcript of the gene under consideration. Line shows transcript name, a score, and the exon-to-exon mapping. Each [] grouping in the exon mapping is a list of transcript exons which match the isoform exons (see example below). Scores are as follows:
5: IsoSeq exons match annotation exons one-for-one. Sizes agree except for leading and trailing edges.
The PB transcript's exons correspond exactly one-to-one with those of the annotated known transcript; the only differences are at the outer ends of the first and last exons.
4: Like 5, but leading and trailing edge sizes differ by a larger amount than the score-5 transcript found for this gene.
The PB transcript's exons correspond exactly one-to-one with those of the annotated known transcript, as for score 5, but the differences at the outer ends of the first and last exons are larger.
3: One-for-one exon match, but sizes of internal exons disagree.
The transcript's exons correspond one-to-one with those of the annotated known transcript (the same structure?), but the sizes of the internal exons differ.
2: The best match among all score=1 transcripts.
The best-matching transcript among all the score=1 transcripts? (This one is rather puzzling.)
1: Some exons overlap, overlaps are 1-for-1 where they exist.
The PB transcript matches only some of the exons, but where it does match, the exons correspond one-to-one with those of the known transcript.
0: Everything else: isoform overlaps gene, but little or no exon congruence.
The transcript lies within the gene interval, but has essentially no overlap with the exons of the known transcript.
None of the descriptions says that a given score strictly requires a 1-vs-1 exon correspondence, so the actual results will have to tell.
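To make the difference between "one-for-one" (scores 3 to 5) and "1-for-1 where they exist" (score 1) concrete, here is a small sketch that works on an exon mapping written as a list of groups, one group per isoform exon. The two predicates are my own reading of the score descriptions, not matchAnnot's scoring code. In particular, whether unmatched transcript exons affect the score is not covered.

```python
def _matched_exons_are_distinct(mapping):
    """True if no transcript exon is listed under more than one isoform exon."""
    used = [exon for group in mapping for exon in group]
    return len(used) == len(set(used))


def is_one_for_one(mapping):
    """Every isoform exon matches exactly one transcript exon, none reused."""
    return (
        len(mapping) > 0
        and all(len(group) == 1 for group in mapping)
        and _matched_exons_are_distinct(mapping)
    )


def is_partial_one_for_one(mapping):
    """Some isoform exons match; each match is to a single, distinct transcript exon."""
    return (
        any(len(group) == 1 for group in mapping)
        and all(len(group) <= 1 for group in mapping)
        and _matched_exons_are_distinct(mapping)
    )


ideal = [[1], [2], [3], [4], [5]]        # the ideal mapping from the example
messy = [[1], [2], [3, 4], [4], [6]]     # the non-ideal mapping from the example
partial = [[1], [], [3], []]             # hypothetical: only some exons overlap

print(is_one_for_one(ideal), is_one_for_one(messy), is_one_for_one(partial))
print(is_partial_one_for_one(partial))
```

Under this reading, the partial mapping fails the strict test but passes the relaxed one, which is the distinction the mistranslated score 1 description blurred.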
References
https://github.com/TomSkelly/MatchAnnot/wiki/How-to-Run-matchAnnot
https://github.com/TomSkelly/MatchAnnot/wiki/How-to-Interpret-matchAnnot-Output
https://github.com/TomSkelly/MatchAnnot/wiki