ANNOVAR Test Notes

table_annovar.pl

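When run, this program splits the work into several steps as the code requires: for example, it first uses convert2annovar to convert the VCF into an avinput file, then splits the entries given in -protocol into several annotate_variation.pl tasks.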

Example:

/mnt/fvg01vol8/software/biosoft/annovar/table_annovar.pl /mnt/fvg01vol7/project/180601_E00599_0078_AH52NWCCXY/MCGHUM1805002E/result//vari/XWJ_180428/XWJ_180428.vcf /mnt/fvg01vol8/database/humandb/annovardb -out /mnt/fvg01vol7/project/180601_E00599_0078_AH52NWCCXY/MCGHUM1805002E/result//Annotation/XWJ_180428/XWJ_180428.vari -buildver hg19 -remove -protocol refGene,cytoBand,genomicSuperDups,esp6500siv2_all,1000g2015aug_all,1000g2015aug_afr,1000g2015aug_eas,1000g2015aug_eur,snp138,ljb26_all -operation g,r,r,f,f,f,f,f,f,f -nastring . -vcfinput
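
The pairing of -protocol and -operation entries can be sketched as follows. This is only an illustration of the idea, not the actual table_annovar.pl code: the helper name `plan_tasks`, the file names in the usage lines, and the exact layout of the generated command lines are assumptions. The mapping of g, r and f to -geneanno, -regionanno and -filter follows the annotate_variation.pl arguments listed later in these notes.

```python
# Hypothetical helper: pair each -protocol entry with its -operation code
# and build one annotate_variation.pl command line per pair.
OPERATION_FLAGS = {"g": "-geneanno", "r": "-regionanno", "f": "-filter"}


def plan_tasks(protocol, operation, buildver, avinput, dbdir, out_prefix):
    protocols = protocol.split(",")
    operations = operation.split(",")
    # The two lists must line up one-to-one.
    if len(protocols) != len(operations):
        raise ValueError("protocol and operation must have the same number of entries")
    tasks = []
    for name, op in zip(protocols, operations):
        if op not in OPERATION_FLAGS:
            raise ValueError(f"unknown operation code: {op}")
        tasks.append([
            "annotate_variation.pl", OPERATION_FLAGS[op],
            "-dbtype", name,
            "-buildver", buildver,
            "-outfile", f"{out_prefix}.{name}",  # assumed naming, for illustration
            avinput, dbdir,
        ])
    return tasks


# Made-up file names; only three of the protocols from the example are used.
tasks = plan_tasks("refGene,cytoBand,snp138", "g,r,f",
                   "hg19", "sample.avinput", "humandb/", "sample")
for task in tasks:
    print(" ".join(task))
```

A mismatch between the number of protocols and operations is a common mistake, which is why the sketch checks the counts before building anything.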

Common arguments:

Usage:
     table_annovar.pl [arguments] <query-file> <database-location>

optional arguments:
        --protocol <string>         comma-delimited string specifying database protocol
        --operation <string>        comma-delimited string specifying type of operation
        --outfile <string>          output file name prefix
        --buildver <string>         genome build version (default: hg18)
        --remove                    remove all temporary files
        --nastring <string>         string to display when a score is not available (default: null)
        --csvout                    generate comma-delimited CSV file (default: tab-delimited txt file)
        --gff3dbfile <files>        specify comma-delimited GFF3 files
        --vcfinput                  specify that input is in VCF format and output will be in VCF format

Input data preparation

The convert2annovar.pl script can convert other "genotype calling" formats into ANNOVAR format. Currently, the program can handle Samtools genotype-calling pileup format, Illumina CASAVA format, SOLiD GFF genotype-calling format, Complete Genomics variant format, SOAPsnp format, MAQ format and VCF format. Additionally, the program can generate ANNOVAR input files from a list of dbSNP identifiers, from transcript identifiers, or from a genomic region.

Common arguments:

    USAGE: convert2annovar.pl [arguments] <variantfile>
    --format
            the format of the input files. Currently supported formats
            include pileup, cg, cgmastervar, gff3-solid, soap, maq, casava,
            vcf4, vcf4old, rsid. In August 2013, the VCF file processing
            subroutine is changed (multiple samples in VCF file can be
            processed in genotype-aware manner), but users can use vcf4old
            to have identical results as the old behavior. (Input file
            format; vcf4 is the one most commonly used.)
    --outfile
            specify the output file name. By default, output is written to
            STDOUT. (Output file; otherwise output is printed to the screen, or can be redirected to a file with '>'.)
    --allsample
            for multi-sample **VCF4** file, the --allsample argument will
            process all samples in the file and generate separate output
            files for each sample. By default, only the first sample in VCF4
            file will be processed. (One avinput file is generated per sample.)
    --withzyg
            for VCF4 format, print out zygosity information, coverage
            information and genotype quality information when -includeinfo
            is used. By default, these information are printed out if
            -includeinfo is not used. (Outputs zygosity (hom/het), coverage and genotype quality.)
    --snpqual
            quality score threshold in the pileup file, such that variant
            calls with lower quality scores will not be printed out in the
            output file. (Filter threshold applied when the input is a pileup file.)

avinput format:

chromosome
start
end
reference allele
alternative allele
annotation [optional]

eg.
20 1110696 1110696 A G het 67 6
*The 3 extra columns are zygosity status, genotype quality and read depth.

ps:
- In some cases, users may want to specify only positions but not the actual nucleotides. In that case, "0" can be used to fill in the 4th and 5th column.
- If ANNOVAR encounters an invalid input line, it will write the invalid line into a file called {outfile}.invalid_input
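
As a rough illustration of the column layout described above, the following sketch splits one avinput line into its five mandatory fields plus any optional extra columns. The names `AvinputRecord` and `parse_avinput_line` are made up for this example, and splitting on arbitrary whitespace is an assumption that would break if an extra column itself contained spaces.

```python
from typing import NamedTuple, Tuple


class AvinputRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str
    alt: str
    extra: Tuple[str, ...]  # optional trailing columns, kept as raw strings


def parse_avinput_line(line):
    """Split one avinput line into the five mandatory fields and the rest."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("an avinput line needs at least 5 columns")
    chrom, start, end, ref, alt = fields[:5]
    return AvinputRecord(chrom, int(start), int(end), ref, alt, tuple(fields[5:]))


# The example line from above: zygosity, genotype quality and read depth
# end up in the `extra` tuple.
record = parse_avinput_line("20 1110696 1110696 A G het 67 6")
print(record)
```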

Gene-based annotation

Downloading and preparing the database:

Before working on gene-based annotation, a gene definition file and associated FASTA file must be downloaded into a directory if they are not already downloaded.
Obtain the gene definition file:

annotate_variation.pl -downdb -buildver hg19 -webfrom annovar refGene humandb/

humandb/ is the local directory where the databases are stored. For refGene, see the RefGene and UCSC documentation.

[Image: RefSeq.PNG]

For other gene definition systems (such as GENCODE, CCDS) or for other species (such as mouse/fly/worm/yeast), users need to build the FASTA file themselves.

Annotation procedure:

Arguments:

 Arguments to download databases or perform annotations
        --downdb        download annotation database
        --geneanno      annotate variants by gene-based annotation (infer functional consequence on genes)
        --regionanno    annotate variants by region-based annotation (find overlapped regions in database)
        --filter        annotate variants by filter-based annotation (find identical variants in database)
 Arguments to control input and output
        --outfile <file>          output file prefix
        --webfrom <string>        specify the source of database (ucsc or annovar or URL) (downdb operation)
        --dbtype <string>         specify database type
        --buildver <string>       specify genome build version (default: hg18 for human)

Basic syntax, refGene annotation: -geneanno and -dbtype refGene are the defaults. Note that refGene contains no mitochondrial (Mt) information.

annotate_variation.pl -out ex1 -build hg19 example/ex1.avinput humandb/

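Annotation produces two result files: ex1.refGene.variant_function, which holds the gene-level annotation for every variant, and ex1.refGene.exonic_variant_function, which describes the changes caused by variants in exonic regions.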

UCSC annotation: transcript names look like uc002eg1.1, etc.

annotate_variation.pl -out ex1 -build hg19 example/ex1.avinput humandb/ -dbtype knownGene

Ensembl annotation: ensemblToGeneName.txt.gz can translate Ensembl identifiers to gene synonyms.

annotate_variation.pl -out ex1 -build hg19 ex1.hg19.avinput humandb/ -dbtype ensGene

Technical Notes: Technically, RefSeq Gene and UCSC Gene are transcript-based gene definitions. They build gene models from transcript data and then map the gene models back to the human genome. In comparison, Ensembl Gene and GENCODE Gene are assembly-based gene definitions that attempt to build gene models directly from the human reference genome. They come from different angles but try to do the same thing: define genes in the human genome.

Other species:

The GFF3 or GTF file downloaded from Ensembl or compiled by the user needs to be converted to GenePred format with gff3ToGenePred or gtfToGenePred.

Please decompress both files (the GTF/GFF file and the genome FASTA file for this plant):

gunzip Arabidopsis_thaliana.TAIR10.27.dna.genome.fa.gz 
gunzip Arabidopsis_thaliana.TAIR10.27.gtf.gz

Please use the gtfToGenePred tool to convert the GTF file to a GenePred file in refGene format:

gtfToGenePred -genePredExt Arabidopsis_thaliana.TAIR10.27.gtf AT_refGene.txt

Please generate a transcript FASTA file with the retrieve_seq_from_fasta.pl script provided by ANNOVAR:

perl retrieve_seq_from_fasta.pl --format refGene --seqfile Arabidopsis_thaliana.TAIR10.27.dna.genome.fa AT_refGene.txt --out AT_refGeneMrna.fa

After this step, the annotation database files needed for gene-based annotation are ready. Now you can annotate a given VCF file. Please note that the --buildver argument should be set to AT.

Region Based Annotation

Filter-based annotation annotates the variant site itself, whereas region-based annotation is mainly concerned with the region the variant falls in. Databases that can be used: UCSC databases, BED files and GFF files.

Filter-based annotation looks for exact matches between a query variant and a record in a database; two items are identical only if they have identical chromosome, start position, end position, ref allele and alternative allele. Region-based annotation looks for overlap of a query variant with a region (this region could be a single position) in a database; it does not care about exact match of positions, and it does not care about nucleotide identity at all.
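
The difference between the two matching rules can be sketched in a few lines. This is a conceptual illustration only, not how ANNOVAR implements its lookups: the function names are made up, variants are represented as plain (chrom, start, end, ref, alt) tuples, and both variant and region coordinates are assumed to be 1-based closed intervals.

```python
def filter_match(variant, record):
    """Filter-based: all five fields must be identical."""
    return variant == record


def region_overlap(variant, region):
    """Region-based: same chromosome and overlapping intervals; alleles are ignored."""
    chrom, start, end = variant[:3]
    region_chrom, region_start, region_end = region
    return chrom == region_chrom and start <= region_end and region_start <= end


variant = ("20", 1110696, 1110696, "A", "G")

# Same position but a different alternative allele: not a filter-based hit.
print(filter_match(variant, ("20", 1110696, 1110696, "A", "T")))

# Any region covering the position is a region-based hit.
print(region_overlap(variant, ("20", 1110000, 1120000)))
```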

Annotation with files downloaded from UCSC
UCSC data can be downloaded with the tool bundled with ANNOVAR.

annotate_variation.pl -buildver hg19 -downdb targetScanS ~

Annotation:

annotate_variation.pl -region ./test.avinput humandb/ -buildver hg19 -dbtype targetScanS -out test

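GFF annotation
See the GFF3 format specification. Annotation produces a file named ex1.hg19_gff3, in which the text after "Name=" is the ID of the corresponding record in the GFF file.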

annotate_variation.pl -regionanno -dbtype gff3 -gff3dbfile hg18_example_db_gff3.txt ex1.hg18.avinput humandb/ -out ex1

BED file annotation

annotate_variation.pl ex1.hg18.avinput humandb/ -bedfile hg18_SureSelect_All_Exon_G3362_with_names.bed -dbtype bed -regionanno -out ex1

Test log:

2018-06-06: testing annotation of the Miscanthus genome:

/mnt/fvg01vol8/software/biosoft/gff3ToGenePred  /mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Msinensis_497_v7.1.gene_exons.gff3 /mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Mis_refGene.txt

Error messages:

/mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Msinensis_497_v7.1.gene_exons.gff3:2: invalid meta line: ##annot-version v7.1
/mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Msinensis_497_v7.1.gene_exons.gff3:3: expected "##species NCBI_Taxonomy_URI", got "##species Miscanthus sinensis"
GFF3: 2 parser errors

Solution:
Delete lines 2 and 3, but keep the first line, ##gff-version 3.
If the first line is not kept, the following errors are reported:

/mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Msinensis_497_v7.1.gene.gff3:1: invalid GFF3 header
Can't find annotation record "Misin01G000100.v7.1" referenced by "Misin01G000100.1.v7.1" Parent attribute
GFF3: 2 parser errors
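
The header clean-up described above could be scripted roughly as follows. This is a sketch, not the procedure actually used in the test: the function name and the sample feature line are made up, and the filter drops every ##annot-version and ##species directive, including a ##species line that might have been valid.

```python
# Directives that gff3ToGenePred rejected in the error messages above.
REJECTED_PREFIXES = ("##annot-version", "##species")


def drop_rejected_directives(lines):
    """Keep every line except the rejected directives; ##gff-version 3 stays."""
    return [line for line in lines if not line.startswith(REJECTED_PREFIXES)]


header = [
    "##gff-version 3",
    "##annot-version v7.1",
    "##species Miscanthus sinensis",
    "Chr01\texample\tgene\t1\t100\t.\t+\t.\tID=example_gene",  # made-up feature line
]
cleaned = drop_rejected_directives(header)
print("\n".join(cleaned))
```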

ERROR:

The program ran to completion with no error messages, but every variant was annotated as intergenic and the Gene.refGene field showed NONE;NONE.

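Solution:
The chromosome column of the VCF contained bare numbers only (1, 2, ... 19), whereas the GFF annotation file names the chromosomes Chr01, and so on. After the VCF file was modified to match, the SNP sites were annotated correctly.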
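
One way to make such a change is sketched below. The function name is made up, and the two-digit zero padding is inferred from the Chr01 naming mentioned above. Header lines are passed through untouched, so any ##contig entries would still carry the old names and may need separate handling.

```python
def rename_chromosome(line):
    """Rewrite a purely numeric CHROM value such as '1' as 'Chr01'."""
    if line.startswith("#"):
        return line  # header and meta lines are left unchanged
    chrom, sep, rest = line.partition("\t")
    if chrom.isdigit():
        chrom = f"Chr{int(chrom):02d}"
    return chrom + sep + rest


# Made-up VCF lines for illustration.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t100\t.\tA\tG",
    "19\t2500\t.\tC\tT",
]
for line in vcf_lines:
    print(rename_chromosome(line))
```

Names that are not purely numeric are left as they are, so running the function on an already converted file changes nothing.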

Other annotation software

snpEff

Last edited: 2018.11.22 10:42:57
