table_annovar.pl
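When this program runs, it splits the work into several steps as needed: for example, it first uses convert2annovar to convert the VCF into an avinput file, and then breaks the annotation into several annotate_variation.pl tasks according to the contents of -protocol.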
- Example:
/mnt/fvg01vol8/software/biosoft/annovar/table_annovar.pl /mnt/fvg01vol7/project/180601_E00599_0078_AH52NWCCXY/MCGHUM1805002E/result//vari/XWJ_180428/XWJ_180428.vcf /mnt/fvg01vol8/database/humandb/annovardb -out /mnt/fvg01vol7/project/180601_E00599_0078_AH52NWCCXY/MCGHUM1805002E/result//Annotation/XWJ_180428/XWJ_180428.vari -buildver hg19 -remove -protocol refGene,cytoBand,genomicSuperDups,esp6500siv2_all,1000g2015aug_all,1000g2015aug_afr,1000g2015aug_eas,1000g2015aug_eur,snp138,ljb26_all -operation g,r,r,f,f,f,f,f,f,f -nastring . -vcfinput
- Common arguments:
Usage:
table_annovar.pl [arguments] <query-file> <database-location>
optional arguments:
--protocol <string> comma-delimited string specifying database protocol
--operation <string> comma-delimited string specifying type of operation
--outfile <string> output file name prefix
--buildver <string> genome build version (default: hg18)
--remove remove all temporary files
--nastring <string> string to display when a score is not available (default: null)
--csvout generate comma-delimited CSV file (default: tab-delimited txt file)
--gff3dbfile <files> specify comma-delimited GFF3 files
--vcfinput specify that input is in VCF format and output will be in VCF format
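The split into per-database tasks can be sketched in shell. The helper below, list_annotation_tasks, is a hypothetical name and not part of ANNOVAR; it pairs each -protocol entry with its -operation code and prints the annotate_variation.pl mode that the code stands for. It is a minimal sketch that assumes only the g, r and f codes used in the example above, and it does not claim to reproduce how table_annovar.pl is implemented internally.

```shell
# Hypothetical helper (not part of ANNOVAR): pair -protocol entries with
# -operation codes and show the matching annotate_variation.pl mode.
# Assumption: only the g, r and f operation codes are handled here.
list_annotation_tasks() {
    awk -v p="$1" -v o="$2" 'BEGIN {
        np = split(p, prot, ",")
        no = split(o, oper, ",")
        # Both lists must have the same number of comma-separated entries.
        if (np != no) {
            print "protocol/operation count mismatch: " np " vs " no > "/dev/stderr"
            exit 1
        }
        mode["g"] = "-geneanno"
        mode["r"] = "-regionanno"
        mode["f"] = "-filter"
        for (i = 1; i <= np; i++) {
            if (!(oper[i] in mode)) {
                print "unsupported operation code: " oper[i] > "/dev/stderr"
                exit 1
            }
            print prot[i] "\t" mode[oper[i]]
        }
    }'
}

# Demo with three of the protocols from the example command.
list_annotation_tasks "refGene,cytoBand,snp138" "g,r,f"
```

Running this check before a long annotation job is a cheap way to catch a -protocol list and an -operation list of different lengths.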
Input data preparation
The convert2annovar.pl script can convert other "genotype calling" formats into ANNOVAR format. Currently, the program can handle Samtools genotype-calling pileup format, Illumina CASAVA format, SOLiD GFF genotype-calling format, Complete Genomics variant format, SOAPsnp format, MAQ format and VCF format. Additionally, the program can generate ANNOVAR input files from a list of dbSNP identifiers, from transcript identifiers, or from a genomic region.
- Common arguments:
USAGE: convert2annovar.pl [arguments] <variantfile>
--format
the format of the input files. Currently supported formats
include pileup, cg, cgmastervar, gff3-solid, soap, maq, casava,
vcf4, vcf4old, rsid. In August 2013, the VCF file processing
subroutine is changed (multiple samples in VCF file can be
processed in genotype-aware manner), but users can use vcf4old
to have identical results as the old behavior. (Input file format;
vcf4 is the one most commonly used.)
--outfile
specify the output file name. By default, output is written to
STDOUT. (Output file; otherwise the result is printed to the screen, or it can be redirected to a file with '>'.)
--allsample
for multi-sample **VCF4** file, the --allsample argument will
process all samples in the file and generate separate output
files for each sample. By default, only the first sample in VCF4
file will be processed. (One avinput file is generated per sample.)
--withzyg
for VCF4 format, print out zygosity information, coverage
information and genotype quality information when -includeinfo
is used. By default, this information is printed out if
-includeinfo is not used. (Outputs zygosity, coverage and genotype quality information.)
--snpqual
quality score threshold in the pileup file, such that variant
calls with lower quality scores will not be printed out in the
output file. (Filter threshold applied when the input is a pileup file.)
- avinput format:
- chromosome
- start
- end
- reference allele
- alternative allele
- annotation [OPTION]
eg.
20 1110696 1110696 A G het 67 6
*The 3 extra columns are zygosity status, genotype quality and read depth.
ps:
- In some cases, users may want to specify only positions but not the actual nucleotides. In that case, "0" can be used to fill in the 4th and 5th column.
- If ANNOVAR encounters an invalid input line, it will write the invalid line into a file called {outfile}.invalid_input
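The column layout above can be checked with a few lines of awk. The function check_avinput below is a hypothetical helper, not part of ANNOVAR, and its rules are a simplified assumption: at least five whitespace-separated columns, integer start and end, and start not greater than end. ANNOVAR's own validation is stricter, so this is only a rough pre-check.

```shell
# Hypothetical pre-check for avinput lines (not the validation ANNOVAR performs).
# Assumed rules: >= 5 columns, integer start/end, start <= end.
check_avinput() {
    awk '{
        ok = (NF >= 5 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $2 + 0 <= $3 + 0)
        status = ok ? "valid" : "invalid"
        # Prefix every input line with its status, separated by a tab.
        print status "\t" $0
    }'
}

# Demo: the example line from the text, plus a line with a non-numeric start.
printf '20 1110696 1110696 A G het 67 6\n20 abc 1110696 A G\n' | check_avinput
```

A line that uses "0" in the 4th and 5th columns passes this check, since only the number of columns and the coordinates are inspected.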
Gene-based annotation
- Downloading and preparing the database:
Before working on gene-based annotation, a gene definition file and associated FASTA file must be downloaded into a directory if they are not already downloaded.
Download the gene definition file:
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar refGene humandb/
humandb/ is the local directory where the database is stored. For refGene, see the RefGene and UCSC documentation.
For other gene definition systems (such as GENCODE, CCDS) or for other species (such as mouse/fly/worm/yeast), users need to build the FASTA file themselves.
- Annotation procedure:
Arguments:
Arguments to download databases or perform annotations
--downdb download annotation database
--geneanno annotate variants by gene-based annotation (infer functional consequence on genes)
--regionanno annotate variants by region-based annotation (find overlapped regions in database)
--filter annotate variants by filter-based annotation (find identical variants in database)
Arguments to control input and output
--outfile <file> output file prefix
--webfrom <string> specify the source of database (ucsc or annovar or URL) (downdb operation)
--dbtype <string> specify database type
--buildver <string> specify genome build version (default: hg18 for human)
Basic syntax (refGene annotation): -geneanno and -dbtype refGene are the default arguments. Note that refGene contains no mitochondrial (Mt) information.
annotate_variation.pl -out ex1 -build hg19 example/ex1.avinput humandb/
Annotation produces two result files: ex1.refGene.variant_function, which holds the annotation for each variant, and ex1.refGene.exonic_variant_function, which describes the changes in exonic regions.
UCSC annotation: transcript names look like uc002eg1.1, etc.
annotate_variation.pl -out ex1 -build hg19 example/ex1.avinput humandb/ -dbtype knownGene
Ensembl annotation: ensemblToGeneName.txt.gz can translate Ensembl identifiers to gene synonyms.
annotate_variation.pl -out ex1 -build hg19 ex1.hg19.avinput humandb/ -dbtype ensGene
Technical Notes: Technically, RefSeq Gene and UCSC Gene are transcript-based gene definitions. They build gene models from transcript data and then map the gene models back to the human genome. In comparison, Ensembl Gene and GENCODE Gene are assembly-based gene definitions that attempt to build gene models directly from the reference human genome. They come from different angles but try to do the same thing: define genes in the human genome.
- Other species:
The GFF3 or GTF file downloaded from Ensembl or compiled by the user needs to be converted to GenePred format with gff3ToGenePred or gtfToGenePred.
- Please decompress both files (the GTF/GFF file and the genome FASTA file for this plant):
gunzip Arabidopsis_thaliana.TAIR10.27.dna.genome.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.27.gtf.gz
- Please use the gtfToGenePred tool to convert the GTF file to a GenePred file (refGene format):
gtfToGenePred -genePredExt Arabidopsis_thaliana.TAIR10.27.gtf AT_refGene.txt
- Please generate a transcript FASTA file with the retrieve_seq_from_fasta.pl script provided with ANNOVAR:
perl retrieve_seq_from_fasta.pl --format refGene --seqfile Arabidopsis_thaliana.TAIR10.27.dna.genome.fa AT_refGene.txt --out AT_refGeneMrna.fa
After this step, the annotation database files needed for gene-based annotation are ready. Now you can annotate a given VCF file. Please note that the --buildver argument should be set to AT.
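The build version has to be AT because ANNOVAR locates gene-based database files by name, as <buildver>_refGene.txt and <buildver>_refGeneMrna.fa, which matches the AT_refGene.txt and AT_refGeneMrna.fa files created above. The sketch below checks that both files are present before annotation starts. The function name check_gene_db is hypothetical, and the database directory passed to it is whatever directory you moved the two files into.

```shell
# Hypothetical helper (not part of ANNOVAR): confirm that the two gene-based
# database files for a given build version exist and are non-empty.
check_gene_db() {
    dbdir=$1
    buildver=$2
    for f in "${buildver}_refGene.txt" "${buildver}_refGeneMrna.fa"; do
        if [ ! -s "$dbdir/$f" ]; then
            echo "missing or empty: $dbdir/$f" >&2
            return 1
        fi
    done
    echo "gene database for build '$buildver' found in $dbdir"
}
```

For example, `check_gene_db atdb AT` (with atdb as an assumed directory name) succeeds only if atdb/AT_refGene.txt and atdb/AT_refGeneMrna.fa both exist.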
Region Based Annotation
Filter-based annotation annotates individual variant sites, whereas region-based annotation targets the region a variant falls in. Supported databases: UCSC databases, BED files and GFF files.
Filter-based annotation looks for exact matches between a query variant and a record in a database; two items are identical only if they have identical chromosome, start position, end position, ref allele and alternative allele. Region-based annotation looks for overlap of a query variant with a region (this region could be a single position) in a database; it does not care about exact match of positions, and it does not care about nucleotide identity at all.
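The difference between the two matching rules can be illustrated with two small shell functions. Both names, filter_match and region_overlap, are hypothetical, and the code is only a sketch of the rules as stated above, assuming 1-based inclusive coordinates; it is not how ANNOVAR implements its database lookups.

```shell
# Filter-style rule: all five fields (chr, start, end, ref, alt) must be identical.
# Fields are compared as strings, so "1" and "01" count as different chromosomes.
filter_match() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 1; i <= 5; i++) v[i] = $i "" }
        NR == 2 {
            same = 1
            for (i = 1; i <= 5; i++) if (($i "") != v[i]) same = 0
            if (same) print "match"; else print "no-match"
        }'
}

# Region-style rule: same chromosome and overlapping intervals; alleles are ignored.
region_overlap() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { chr = $1 ""; vstart = $2 + 0; vend = $3 + 0 }
        NR == 2 {
            if (($1 "") == chr && vstart <= $3 + 0 && vend >= $2 + 0) print "overlap"; else print "no-overlap"
        }'
}

# Demo: the same variant fails the exact comparison against a record with a
# different alt allele, but still overlaps a region that contains its position.
filter_match   "20 1110696 1110696 A G" "20 1110696 1110696 A T"
region_overlap "20 1110696 1110696 A G" "20 1110000 1120000"
```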
- Annotation with files downloaded from UCSC
UCSC data can be downloaded with the tool bundled with ANNOVAR:
annotate_variation.pl -buildver hg19 -downdb targetScanS ~
Annotation:
annotate_variation.pl -region ./test.avinput humandb/ -buildver hg19 -dbtype targetScanS -out test
- GFF annotation
See the GFF3 format description. Annotation produces an ex1.hg18_gff3 file, in which the text after "Name=" is the ID of the matching record in the GFF file.
annotate_variation.pl -regionanno -dbtype gff3 -gff3dbfile hg18_example_db_gff3.txt ex1.hg18.avinput humandb/ -out ex1
- BED file annotation
annotate_variation.pl ex1.hg18.avinput humandb/ -bedfile hg18_SureSelect_All_Exon_G3362_with_names.bed -dbtype bed -regionanno -out ex1
Test log:
- 2018-06-06: tested annotation of the Miscanthus (Miscanthus sinensis) genome:
/mnt/fvg01vol8/software/biosoft/gff3ToGenePred /mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Msinensis_497_v7.1.gene_exons.gff3 /mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Mis_refGene.txt
Error messages:
/mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Msinensis_497_v7.1.gene_exons.gff3:2: invalid meta line: ##annot-version v7.1
/mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Msinensis_497_v7.1.gene_exons.gff3:3: expected "##species NCBI_Taxonomy_URI", got "##species Miscanthus sinensis"
GFF3: 2 parser errors
Solution:
Delete lines 2 and 3, but keep the first line, ##gff-version 3.
If the first line is not kept, the following errors are reported:
/mnt/fvg01vol8/database/Plants/Miscanthus_sinensis/v7.1/annotation/Msinensis_497_v7.1.gene.gff3:1: invalid GFF3 header
Can't find annotation record "Misin01G000100.v7.1" referenced by "Misin01G000100.1.v7.1" Parent attribute
GFF3: 2 parser errors
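The header cleanup can be sketched as a small filter. Instead of deleting lines 2 and 3 by position, the version below removes the two directives the parser rejected (##annot-version and ##species) by pattern, which gives the same result on the file shown here and keeps ##gff-version 3 in place. The function name strip_rejected_meta and the demo feature line are made up for illustration.

```shell
# Hypothetical helper: drop only the two meta lines that gff3ToGenePred
# rejected in the log above; every other line, including the
# "##gff-version 3" header, passes through unchanged.
strip_rejected_meta() {
    grep -v -e '^##annot-version' -e '^##species'
}

# Demo on a made-up four-line input; the feature line is illustrative only.
printf '##gff-version 3\n##annot-version v7.1\n##species Miscanthus sinensis\nChr01\tsource\tgene\t1\t100\t.\t+\t.\tID=gene1\n' \
    | strip_rejected_meta
```

In practice the function would be used as `strip_rejected_meta < input.gff3 > cleaned.gff3` (both file names are placeholders), with gff3ToGenePred then run on the cleaned file.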
ERROR:
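The program ran to completion without any error message, but every variant was annotated as intergenic, and the Gene.refGene column showed NONE;NONE.
Solution:
The chromosome column of the VCF contained only the numbers 1, 2 ... 19, whereas the GFF file used for annotation names the chromosomes Chr01 and so on. After the VCF file was modified to use the same names, the SNP sites were annotated correctly.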
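One way to make the chromosome names agree is to rewrite the first column of the VCF. The awk sketch below assumes the GFF naming is "Chr" followed by a zero-padded two-digit number (Chr01 ... Chr19), as in this test; the function name rename_vcf_chroms is hypothetical. Header lines are passed through unchanged, so any ##contig lines in the header would still carry the old names.

```shell
# Hypothetical helper: rename purely numeric chromosome names in a VCF body
# to the ChrNN form (assumed naming: Chr01 ... Chr19).
rename_vcf_chroms() {
    awk 'BEGIN { FS = OFS = "\t" }
         # Header lines start with "#" and are printed unchanged.
         /^#/ { print; next }
         # Rewrite column 1 only when it is a plain number.
         $1 ~ /^[0-9]+$/ { $1 = sprintf("Chr%02d", $1) }
         { print }'
}

# Demo on a made-up two-record VCF body.
printf '#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tA\tG\n19\t200\t.\tC\tT\n' | rename_vcf_chroms
```

Comparing `cut -f1` of the renamed VCF with the first column of the GFF file before annotation is a quick way to confirm that the two now share the same names.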