ANNOVAR Test Notes

table_annovar.pl

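When this program runs, it splits the work into several steps as needed: it first uses convert2annovar to convert the VCF into an avinput file, then launches several annotate_variation.pl tasks according to the entries in protocol.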

  • Example:
/mnt/fvg01vol8/software/biosoft/annovar/table_annovar.pl /mnt/fvg01vol7/project/180601_E00599_0078_AH52NWCCXY/MCGHUM1805002E/result//vari/XWJ_180428/XWJ_180428.vcf /mnt/fvg01vol8/database/humandb/annovardb -out /mnt/fvg01vol7/project/180601_E00599_0078_AH52NWCCXY/MCGHUM1805002E/result//Annotation/XWJ_180428/XWJ_180428.vari -buildver hg19 -remove -protocol refGene,cytoBand,genomicSuperDups,esp6500siv2_all,1000g2015aug_all,1000g2015aug_afr,1000g2015aug_eas,1000g2015aug_eur,snp138,ljb26_all -operation g,r,r,f,f,f,f,f,f,f -nastring . -vcfinput
  • Common arguments:
Usage:
     table_annovar.pl [arguments] <query-file> <database-location>

optional arguments:
        --protocol <string>         comma-delimited string specifying database protocol
        --operation <string>        comma-delimited string specifying type of operation
        --outfile <string>          output file name prefix
        --buildver <string>         genome build version (default: hg18)
        --remove                    remove all temporary files
        --nastring <string>         string to display when a score is not available (default: null)
        --csvout                    generate comma-delimited CSV file (default: tab-delimited txt file)
        --gff3dbfile <files>        specify comma-delimited GFF3 files
        --vcfinput                  specify that input is in VCF format and output will be in VCF format
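
In the example command, -protocol lists ten databases and -operation lists ten operation codes, one per database; the two lists have to be the same length. A minimal sketch of a pre-flight check (check_protocol_ops is a hypothetical helper, not something shipped with ANNOVAR):

```shell
# Hypothetical helper: verify that the comma-delimited -protocol and
# -operation lists contain the same number of entries.
check_protocol_ops() {
    protocol=$1
    operation=$2
    # Count comma-separated fields with awk.
    n_protocol=$(printf '%s\n' "$protocol" | awk -F',' '{ print NF }')
    n_operation=$(printf '%s\n' "$operation" | awk -F',' '{ print NF }')
    if [ "$n_protocol" -eq "$n_operation" ]; then
        echo "ok: $n_protocol protocols, $n_operation operations"
    else
        echo "mismatch: $n_protocol protocols, $n_operation operations"
        return 1
    fi
}
```

Running it on the two lists before launching table_annovar.pl catches a dropped or extra entry early.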

Input data preparation

The convert2annovar.pl script can convert other "genotype calling" formats into ANNOVAR format. Currently, the program can handle Samtools genotype-calling pileup format, Illumina CASAVA format, SOLiD GFF genotype-calling format, Complete Genomics variant format, SOAPsnp format, MAQ format and VCF format. Additionally, the program can generate ANNOVAR input files from a list of dbSNP identifiers, from transcript identifiers, or from a genomic region.

  • Common arguments:
    USAGE: convert2annovar.pl [arguments] <variantfile>
    --format
            the format of the input files. Currently supported formats
            include pileup, cg, cgmastervar, gff3-solid, soap, maq, casava,
            vcf4, vcf4old, rsid. In August 2013, the VCF file processing
            subroutine is changed (multiple samples in VCF file can be
            processed in genotype-aware manner), but users can use vcf4old
            to have identical results as the old behavior. (Input file
            format; vcf4 is the most commonly used.)
    --outfile
            specify the output file name. By default, output is written to
            STDOUT. (Output file; otherwise the result is printed to the screen, or can be redirected to a file with '>'.)
    --allsample
            for multi-sample VCF4 file, the --allsample argument will
            process all samples in the file and generate separate output
            files for each sample. By default, only the first sample in VCF4
            file will be processed. (One avinput file is generated per sample.)
    --withzyg
            for VCF4 format, print out zygosity information, coverage
            information and genotype quality information when -includeinfo
            is used. By default, these information are printed out if
            -includeinfo is not used. (Outputs zygosity, i.e. homozygous or heterozygous status, plus coverage and genotype quality.)
    --snpqual
            quality score threshold in the pileup file, such that variant
            calls with lower quality scores will not be printed out in the
            output file. (Filter threshold applied when the input is a pileup file.)
  • avinput format:
  1. chromosome
  2. start
  3. end
  4. reference allele
  5. alternative allele
  6. annotation (optional)

    eg.
    20 1110696 1110696 A G het 67 6
    *The 3 extra columns are zygosity status, genotype quality and read depth.

  • ps:

    • In some cases, users may want to specify only positions but not the actual nucleotides. In that case, "0" can be used to fill in the 4th and 5th column.
    • If ANNOVAR encounters an invalid input line, it will write the invalid line into a file called {outfile}.invalid_input
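
Before handing a file to ANNOVAR, the five mandatory avinput columns can be sanity-checked with awk. The sketch below is a simplification and not ANNOVAR's own validation: find_bad_avinput_lines is a hypothetical helper that only checks the column count and that start and end are integers with start <= end.

```shell
# Hypothetical pre-check (a simplification, not ANNOVAR's own validation):
# print every line that lacks the five mandatory avinput columns or whose
# start/end are not integers with start <= end.
find_bad_avinput_lines() {
    awk '
        NF < 5 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || ($2 + 0) > ($3 + 0) {
            print "line " NR ": " $0
        }
    ' "$@"
}
```

It reads the files given as arguments, or standard input when none are given, for example `find_bad_avinput_lines example/ex1.avinput`.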

Gene-based annotation

  • Downloading and preparing the database:

Before working on gene-based annotation, a gene definition file and associated FASTA file must be downloaded into a directory if they are not already downloaded.
Obtain the gene definition file:

annotate_variation.pl -downdb -buildver hg19 -webfrom annovar refGene humandb/

humandb/ is the local directory where the databases are stored. For refGene, see the RefGene and UCSC documentation.

For other gene definition systems (such as GENCODE, CCDS) or for other species (such as mouse/fly/worm/yeast), users need to build the FASTA file themselves.

  • Annotation procedure:

Arguments:

 Arguments to download databases or perform annotations
        --downdb        download annotation database
        --geneanno      annotate variants by gene-based annotation (infer functional consequence on genes)
        --regionanno    annotate variants by region-based annotation (find overlapped regions in database)
        --filter        annotate variants by filter-based annotation (find identical variants in database)
 Arguments to control input and output
        --outfile <file>          output file prefix
        --webfrom <string>        specify the source of database (ucsc or annovar or URL) (downdb operation)
        --dbtype <string>         specify database type
        --buildver <string>       specify genome build version (default: hg18 for human)

Basic syntax for refGene annotation: -geneanno and -dbtype refGene are the default arguments. Note that refGene contains no mitochondrial (Mt) information.

annotate_variation.pl -out ex1 -build hg19 example/ex1.avinput humandb/

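Annotation produces two result files, ex1.refGene.exonic_variant_function and ex1.refGene.variant_function. The variant_function file holds the annotation of each variant; the exonic_variant_function file describes the changes in exonic regions.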

UCSC annotation: transcript names look like uc002eg1.1, etc.

annotate_variation.pl -out ex1 -build hg19 example/ex1.avinput humandb/ -dbtype knownGene

Ensembl annotation: ensemblToGeneName.txt.gz can translate Ensembl identifiers to gene synonyms.

annotate_variation.pl -out ex1 -build hg19 ex1.hg19.avinput humandb/ -dbtype ensGene

Technical Notes: Technically, RefSeq Gene and UCSC Gene are transcript-based gene definitions. They build gene models from transcript data and then map the gene models back to the human genome. In comparison, Ensembl Gene and Gencode Gene are assembly-based gene definitions that attempt to build gene models directly from the reference human genome. They come from different angles, trying to do the same thing: define genes in the human genome.

  • Other species:

A GFF3 or GTF file downloaded from Ensembl or compiled by the user needs to be converted to GenePred format with gff3ToGenePred or gtfToGenePred.

  1. Decompress both files (the GTF/GFF file and the genome FASTA file for this plant):
gunzip Arabidopsis_thaliana.TAIR10.27.dna.genome.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.27.gtf.gz
  2. Use the gtfToGenePred tool to convert the GTF file to a GenePred file in refGene format:
gtfToGenePred -genePredExt Arabidopsis_thaliana.TAIR10.27.gtf AT_refGene.txt
  3. Generate a transcript FASTA file with the retrieve_seq_from_fasta.pl script provided with ANNOVAR:
perl retrieve_seq_from_fasta.pl --format refGene --seqfile Arabidopsis_thaliana.TAIR10.27.dna.genome.fa AT_refGene.txt --out AT_refGeneMrna.fa

After this step, the annotation database files needed for gene-based annotation are ready. Now you can annotate a given VCF file. Please note that the --buildver argument should be set to AT.


Region-based annotation

Filter-based annotation annotates the variant site itself, whereas region-based annotation is concerned with the region the variant falls in. Usable databases: UCSC databases, BED files and GFF files.

Filter-based annotation looks for exact matches between a query variant and a record in a database; two items are identical only if they have identical chromosome, start position, end position, ref allele and alternative allele. Region-based annotation looks for overlap of a query variant with a region (this region could be a single position) in a database; it does not care about exact match of positions, and it does not care about nucleotide identity at all.
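
The two matching rules can be written down as two small shell predicates. This is purely illustrative and says nothing about how ANNOVAR implements the lookup internally; filter_match and region_overlap are hypothetical names, and coordinates are treated as 1-based closed intervals.

```shell
# Illustration only: the two matching rules applied to one query and one
# database record.

# filter_match qchr qstart qend qref qalt  rchr rstart rend rref ralt
# True only if chromosome, start, end, ref and alt are all identical.
filter_match() {
    [ "$1" = "$6" ] && [ "$2" -eq "$7" ] && [ "$3" -eq "$8" ] &&
    [ "$4" = "$9" ] && [ "$5" = "${10}" ]
}

# region_overlap qchr qstart qend  rchr rstart rend
# True if both lie on the same chromosome and the intervals intersect;
# alleles play no role.
region_overlap() {
    [ "$1" = "$4" ] && [ "$2" -le "$6" ] && [ "$3" -ge "$5" ]
}
```

With these, `filter_match 20 1110696 1110696 A G 20 1110696 1110696 A T` fails because the alternative alleles differ, while `region_overlap 20 1110696 1110696 20 1110000 1120000` succeeds because the position falls inside the region.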

  • Annotation with files downloaded from UCSC
    UCSC data can be downloaded with the tool bundled with ANNOVAR
annotate_variation.pl -buildver hg19 -downdb targetScanS ~

Annotate:

annotate_variation.pl -region ./test.avinput humandb/ -buildver hg19 -dbtype targetScanS -out test
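  • GFF annotation
    See the GFF3 format specification. Annotation produces a file named ex1.hg19_gff3, in which the text after "Name=" is the ID of the matching record in the GFF file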
annotate_variation.pl -regionanno -dbtype gff3 -gff3dbfile hg18_example_db_gff3.txt ex1.hg18.avinput humandb/ -out ex1 
  • BED file annotation
annotate_variation.pl ex1.hg18.avinput humandb/ -bedfile hg18_SureSelect_All_Exon_G3362_with_names.bed -dbtype bed -regionanno -out ex1

Reference URLs:


Test log:

  • 2018-06-06: testing annotation of the Miscanthus genome:
/mnt/fvg01vol8/software/biosoft/gff3ToGenePred  /mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Msinensis_497_v7.1.gene_exons.gff3 /mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Mis_refGene.txt

Error messages:

/mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Msinensis_497_v7.1.gene_exons.gff3:2: invalid meta line: ##annot-version v7.1
/mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Msinensis_497_v7.1.gene_exons.gff3:3: expected "##species NCBI_Taxonomy_URI", got "##species Miscanthus sinensis"
GFF3: 2 parser errors

Solution:
Delete lines 2 and 3, but keep the first line, ##gff-version 3.
If the first line is not kept, the following errors are reported:

/mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Msinensis_497_v7.1.gene.gff3:1: invalid GFF3 header
Can't find annotation record "Misin01G000100.v7.1" referenced by "Misin01G000100.1.v7.1" Parent attribute
GFF3: 2 parser errors
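
The header fix can be scripted with sed. The sketch below removes the two rejected meta lines by pattern rather than by line number, which has the same effect on this file and leaves ##gff-version 3 in place; strip_gff3_meta is a hypothetical helper name.

```shell
# Hypothetical helper: drop the two meta lines that gff3ToGenePred rejected
# (##annot-version and ##species) and keep everything else, including the
# mandatory first line "##gff-version 3".
strip_gff3_meta() {
    sed -e '/^##annot-version/d' -e '/^##species/d' "$@"
}
```

Typical use would be `strip_gff3_meta input.gff3 > cleaned.gff3`, followed by gff3ToGenePred on the cleaned file (both file names here are placeholders).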

ERROR:

The program ran to completion with no error messages, but every variant was annotated as intergenic, and the Gene.refGene field showed NONE;NONE.

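Solution:
The chromosome column of the VCF contains bare numbers only (1, 2 ... 19), whereas the GFF file used for annotation names the chromosomes Chr01 .... After the VCF file was modified to match, the SNP sites were annotated correctly.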
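
One way to make the VCF match the GFF naming is an awk pass over the first column. This is a sketch under the assumption that the GFF uses zero-padded two-digit names (Chr01 ... Chr19); rename_vcf_chroms is a hypothetical helper, and header lines, including any ##contig entries, are passed through unchanged.

```shell
# Hypothetical helper: rewrite purely numeric chromosome names in a VCF
# (1, 2 ... 19) as Chr01, Chr02 ... Chr19. Assumes tab-delimited input.
rename_vcf_chroms() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }                             # headers pass through
         $1 ~ /^[0-9]+$/ { $1 = sprintf("Chr%02d", $1) }  # 1 -> Chr01
         { print }' "$@"
}
```

Typical use would be `rename_vcf_chroms sample.vcf > sample.renamed.vcf` (placeholder file names) before running the annotation again.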


Other annotation software

snpEff
