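ANNOVAR is a widely used variant annotation tool; it can be downloaded from its official website after registration.
ANNOVAR needs no installation: unpack the download and it is ready to use. The package consists of several Perl scripts:
annotate_variation.pl downloads databases and annotates data
retrieve_seq_from_fasta.pl
coding_change.pl can be used to infer the protein sequence
convert2annovar.pl converts variant files into the input format that ANNOVAR accepts
table_annovar.pl annotates a file, running several types of annotation in one pass
variants_reduction.pl can be used to customize the filtering and annotation workflow more flexibly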
1. Download Databases
ANNOVAR provides many commonly used database files, which can be downloaded directly with annotate_variation.pl:
# Download three databases here: refGene, dbSNP, and 1000 Genomes
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsnp150 humandb
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar 1000g2015aug humandb
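2. Converting input files with convert2annovar.pl
ANNOVAR expects its input in a particular format, so input files need a simple conversion before annotation.
Only the first five columns of the input file have a fixed format. They must be, in order: Chromosome ("chr" prefix is optional), Start, End, Reference Allele, Alternative Allele. Further columns can be added as needed. An input file looks like this: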
1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion
1 105492231 105492231 A ATAAA comments: rs10552169, a block substitution
16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
Insertions and deletions are denoted by "-"; "0" means this information is not readily available.
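To make the five-column rule concrete, here is a minimal Python sketch that reads one input line and labels the variant from its reference and alternative alleles. The function name `classify_avinput_line` and the label strings are illustrative assumptions, not part of ANNOVAR.

```python
def classify_avinput_line(line):
    """Classify one ANNOVAR-style input line by its ref/alt columns (illustrative)."""
    # Split on whitespace; everything after the fifth column is free-form.
    fields = line.split(None, 5)
    if len(fields) < 5:
        raise ValueError("need at least 5 columns: chrom start end ref alt")
    chrom, start, end, ref, alt = fields[:5]
    start, end = int(start), int(end)
    if ref == "-":
        kind = "insertion"            # "-" in the reference column
    elif alt == "-":
        kind = "deletion"             # "-" in the alternative column
    elif len(ref) == 1 and len(alt) == 1:
        kind = "SNV"
    else:
        kind = "block substitution"
    return chrom, start, end, ref, alt, kind


print(classify_avinput_line("1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion"))
```

Splitting with a maximum of five splits keeps the trailing comment text intact as a single field, which mirrors the idea that only the first five columns are constrained.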
Using convert2annovar.pl
The most common use is converting a VCF file:
perl convert2annovar.pl -format vcf4 example/ex2.vcf > ex2.avinput
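Beyond this simplest and most common usage, convert2annovar.pl has several very useful options.
2.1 -allsample
For a VCF file containing multiple samples, the conversion only takes the first sample. In other words, if the first sample has no variant at a site, that site is dropped during conversion and never appears in the annotation output, even when other samples do carry a variant there. To annotate the variant sites of all samples, first split the file into one annotation input file per sample:
# During conversion, each sample in the VCF gets its own output file to be annotated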
perl convert2annovar.pl -format vcf4 example/ex2.vcf -outfile ex2 -allsample
This option can also be used much like vcf-subset in vcftools, splitting a multi-sample VCF file into single-sample files, but much more efficiently than vcf-subset.
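The reason a first-sample-only conversion can lose sites is easiest to see from the genotype columns. The sketch below is a simplified illustration in Python, not ANNOVAR's own logic: it lists which samples in one VCF record carry a non-reference allele. The function name `samples_with_variant` and the sample names are hypothetical, and the code assumes tab-delimited VCF lines with GT as the first FORMAT subfield.

```python
def samples_with_variant(header_line, record_line):
    """Return names of samples whose genotype contains a non-reference allele."""
    # Sample names start at the 10th column of the #CHROM header line.
    sample_names = header_line.rstrip("\n").split("\t")[9:]
    columns = record_line.rstrip("\n").split("\t")
    carriers = []
    for name, sample_field in zip(sample_names, columns[9:]):
        genotype = sample_field.split(":")[0]             # GT subfield
        alleles = genotype.replace("|", "/").split("/")   # phased or unphased
        if any(allele not in ("0", ".") for allele in alleles):
            carriers.append(name)
    return carriers


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2"
record = "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:10\t0/1:12"
print(samples_with_variant(header, record))
```

In this toy record only the second sample carries the variant, so a conversion that looks at the first sample alone would skip the site.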
2.2 -includeinfo, -comment
The -includeinfo option keeps all the information from the VCF file.
The -comment option keeps the header comment lines of the VCF file (lines starting with #).
convert2annovar.pl -format vcf4 example/ex2.vcf -outfile ex2 -allsample -includeinfo -comment
2.3 dbSNP identifiers
Sometimes we have a list of SNPs of interest (dbSNP rsIDs) and only want to annotate those sites. -format rsid handles this:
[kaiwang@biocluster ~/]$ cat example/snplist.txt
rs74487784
rs41534544
rs4308095
rs12345678
[kaiwang@biocluster ~/]$ convert2annovar.pl -format rsid example/snplist.txt -dbsnpfile humandb/hg19_snp138.txt > snplist.avinput
NOTICE: Scanning dbSNP file humandb/hg19_snp138.txt...
NOTICE: input file contains 4 rs identifiers, output file contains information for 4 rs identifiers
WARNING: 1 rs identifiers have multiple records (due to multiple mapping) and they are all written to output
[kaiwang@biocluster ~/]$ cat snplist.avinput
chr2 186229004 186229004 C T rs4308095
chr7 6026775 6026775 T C rs41534544
chr7 6777183 6777183 G A rs41534544
chr9 3901666 3901666 T C rs12345678
chr22 24325095 24325095 A G rs74487784
2.4 All variants in a genomic region
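Suppose a genomic region looks suspicious, with variants that may be associated with the trait under study, and we only want to annotate that region. The -format option (here -format region) handles this as well: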
[kaiwang@biocluster ~/]$ convert2annovar.pl -format region -seqdir humandb/hg19_seq/ chr1:2000001-2000003
NOTICE: Reading region from STDIN ... Done with 1 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from humandb/hg19_seq/chr1.fa
NOTICE: Finished writting FASTA for 1 genomic regions to stdout
1 2000001 2000001 A C
1 2000001 2000001 A G
1 2000001 2000001 A T
1 2000002 2000002 T A
1 2000002 2000002 T C
1 2000002 2000002 T G
1 2000003 2000003 C A
1 2000003 2000003 C G
1 2000003 2000003 C T
Additional options can further restrict the types of variants generated for the region, such as 2-bp insertions or 3-bp indels; see http://annovar.openbioinformatics.org/en/latest/user-guide/input/ for details.
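The region output above is simply every possible single-nucleotide substitution at each position. A minimal Python sketch of that enumeration follows; the function name `enumerate_snvs` is hypothetical, and the reference bases "ATC" are read off the output shown above rather than fetched from a FASTA file.

```python
def enumerate_snvs(chrom, start, ref_seq):
    """List every single-nucleotide substitution across a reference sequence."""
    rows = []
    for offset, ref in enumerate(ref_seq.upper()):
        pos = start + offset
        for alt in "ACGT":
            if alt != ref:                      # skip the reference base itself
                rows.append((chrom, pos, pos, ref, alt))
    return rows


for row in enumerate_snvs("1", 2000001, "ATC"):
    print(*row)
```

Each position yields three rows, so a region of n bases produces 3n candidate variants.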
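ANNOVAR can also handle other input formats such as MAQ and GFF.
A few points to note:
- When a VCF file is converted, a site with two different alternative alleles is written as two separate lines in the output.
- During annotation, lines that do not match the expected format are skipped rather than stopping the run; those lines are written to a separate file (*.invalid_input).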
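The first point can be sketched as follows. This is a simplified Python illustration, not ANNOVAR's actual normalization code: it splits a comma-separated ALT field into one row per allele and trims the leading bases shared by REF and ALT, using "-" for the empty side. The function name `split_multiallelic` is hypothetical, and the coordinate handling for insertions and deletions is an assumption modeled on the example input lines earlier in this post.

```python
def split_multiallelic(chrom, pos, ref, alt_field):
    """Return one (chrom, start, end, ref, alt) row per ALT allele."""
    rows = []
    for alt in alt_field.split(","):
        r, a, start = ref, alt, pos
        # Drop leading bases shared by REF and ALT, shifting the start.
        while r and a and r[0] == a[0]:
            r, a, start = r[1:], a[1:], start + 1
        if not r:
            # Pure insertion: anchor on the base before the inserted sequence.
            rows.append((chrom, start - 1, start - 1, "-", a))
        else:
            end = start + len(r) - 1
            rows.append((chrom, start, end, r, a or "-"))
    return rows


print(split_multiallelic("1", 100, "A", "G,T"))
```

A record with ALT "G,T" therefore becomes two rows, one per alternative allele, which is the behavior the note above describes.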
3. Annotate
ANNOVAR provides two scripts for annotation: annotate_variation.pl annotates against one database at a time, while table_annovar.pl annotates against several databases at once.
### Annotating against the refGene database with annotate_variation.pl
perl annotate_variation.pl input.av -buildver hg19 -geneanno -dbtype refGene humandb/ -out ex1
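# -geneanno selects gene-based annotation; -filter selects filter-based annotation and -regionanno selects region-based annotation
# -dbtype refGene uses the "refGene" database
# -out ex1 sets ex1 as the output file prefix; --outfile can also be used to give the file name directly
### Annotating against multiple databases with table_annovar.pl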
perl table_annovar.pl example/ex1.avinput humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation gx,r,f,f,f -nastring . -csvout -polish -xref example/gene_xref.txt
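# -buildver hg19 sets the reference genome version
# -out myanno sets the output file prefix; --outfile can also be used to give the file name directly
# -remove deletes intermediate files
# -protocol lists the databases to use; their order must match the annotation types given to -operation
# -operation gives the annotation type for each database (g means gene-based, r means region-based, f means filter-based, gx means gene-based with cross-reference annotation (from -xref argument))
# -nastring . fills missing values with a dot
# -csvout writes the output in CSV format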
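Because -protocol and -operation must line up one to one, it can help to build both lists from a single list of pairs. The Python sketch below only assembles the argument list (it does not run anything); the helper name `build_table_annovar_cmd` is hypothetical, and it uses only the flags that appear in the command above.

```python
def build_table_annovar_cmd(avinput, dbdir, buildver, out_prefix, steps):
    """Build a table_annovar.pl argument list from (protocol, operation) pairs."""
    allowed = {"g", "gx", "r", "f"}   # operation codes described above
    for protocol, operation in steps:
        if operation not in allowed:
            raise ValueError(f"unknown operation {operation!r} for {protocol}")
    # Deriving both strings from the same pairs keeps them in matching order.
    protocols = ",".join(protocol for protocol, _ in steps)
    operations = ",".join(operation for _, operation in steps)
    return ["perl", "table_annovar.pl", avinput, dbdir,
            "-buildver", buildver, "-out", out_prefix, "-remove",
            "-protocol", protocols, "-operation", operations,
            "-nastring", "."]


cmd = build_table_annovar_cmd(
    "example/ex1.avinput", "humandb/", "hg19", "myanno",
    [("refGene", "g"), ("cytoBand", "r"), ("avsnp147", "f")],
)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` on a machine where ANNOVAR and the listed databases are installed.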
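4. Annotation types
ANNOVAR supports three types of annotation: gene-based, region-based, and filter-based.
4.1 gene-based
Gene-based annotation uses the positions of SNPs and CNVs to determine whether they alter the coding sequence or open reading frame and thereby change amino acids. Users can choose which gene definition system to annotate against, such as RefSeq genes, UCSC genes, ENSEMBL genes, GENCODE genes, or AceView genes. Annotation produces two files: ex1.variant_function and ex1.exonic_variant_function.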
$ perl annotate_variation.pl -geneanno -dbtype refGene -out ex1 -build hg19 example/ex1.avinput humandb/
$ cat ex1.variant_function
UTR5 ISG15(NM_005101:c.-33T>C) 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
UTR3 ATAD3C(NM_001039211:c.*91G>T) 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
splicing NPHP4(NM_001291593:exon19:c.1279-2T>A,NM_001291594:exon18:c.1282-2T>A,NM_015102:exon22:c.2818-2T>A) 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
intronic DDR2 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
intronic DNASE2B 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
intergenic LOC645354(dist=11566),LOC391003(dist=116902) 1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion
intergenic UBIAD1(dist=55105),PTCHD2(dist=135699) 1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
intergenic LOC100129138(dist=872538),NONE(dist=NONE) 1 105492231 105492231 A ATAAA comments: rs10552169, a block substitution
exonic IL23R 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
exonic ATG16L1 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
exonic NOD2 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
exonic NOD2 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
exonic NOD2 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
exonic GJB2 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
exonic CRYL1,GJB6 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
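The first file annotates every variant by adding two tab-separated columns at the front of each line:
Column 1 is the type of region the variant falls in, such as exonic, intronic, UTR5, UTR3, or intergenic.
Column 2 describes column 1 further (for example the gene name); details follow below.
The remaining columns are the original input line, including its comment information.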
#ex1.exonic_variant_function
[kaiwang@biocluster ~/]$ cat ex1.exonic_variant_function
line9 nonsynonymous SNV IL23R:NM_144701:exon9:c.G1142A:p.R381Q, 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
line10 nonsynonymous SNV ATG16L1:NM_001190267:exon9:c.A550G:p.T184A,ATG16L1:NM_017974:exon8:c.A841G:p.T281A,ATG16L1:NM_001190266:exon9:c.A646G:p.T216A,ATG16L1:NM_030803:exon9:c.A898G:p.T300A,ATG16L1:NM_198890:exon5:c.A409G:p.T137A, 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
line11 nonsynonymous SNV NOD2:NM_022162:exon4:c.C2104T:p.R702W,NOD2:NM_001293557:exon3:c.C2023T:p.R675W, 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
line12 nonsynonymous SNV NOD2:NM_022162:exon8:c.G2722C:p.G908R,NOD2:NM_001293557:exon7:c.G2641C:p.G881R, 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
line13 frameshift insertion NOD2:NM_022162:exon11:c.3017dupC:p.A1006fs,NOD2:NM_001293557:exon10:c.2936dupC:p.A979fs, 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
line14 frameshift deletion GJB2:NM_004004:exon2:c.35delG:p.G12fs, 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
line15 frameshift deletion GJB6:NM_001110221:wholegene,GJB6:NM_001110220:wholegene,GJB6:NM_001110219:wholegene,CRYL1:NM_015974:wholegene,GJB6:NM_006783:wholegene, 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
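The second output file ends in .exonic_variant_function and lists only the exonic variants (where amino acids may change).
Column 1 is the line number of the variant in the first file.
Column 2 is the functional consequence of the variant, such as an amino acid change caused by an exonic change, a frameshift, a nonsense mutation, or a stop mutation.
Column 3 gives the gene name, the transcript identifier, and the sequence change in the corresponding transcript.
Column 4 onward is the original input line.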
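Column 3 packs several transcripts into one comma-separated field, which is awkward to read by eye. A minimal Python sketch for unpacking it is shown below; the function name `parse_transcript_changes` and the dictionary keys are illustrative assumptions, and the parsing is based only on the entries visible in the example output above.

```python
def parse_transcript_changes(field):
    """Split 'GENE:TRANSCRIPT:exonN:c.X:p.Y,' entries into dictionaries."""
    changes = []
    for entry in field.strip().rstrip(",").split(","):
        parts = entry.split(":")
        if len(parts) < 2:
            continue                      # skip empty or malformed entries
        record = {"gene": parts[0], "transcript": parts[1]}
        for part in parts[2:]:
            if part.startswith("exon"):
                record["exon"] = part
            elif part.startswith("c."):
                record["cdna"] = part
            elif part.startswith("p."):
                record["protein"] = part
            else:
                record["note"] = part     # e.g. "wholegene"
        changes.append(record)
    return changes


field = "NOD2:NM_022162:exon4:c.C2104T:p.R702W,NOD2:NM_001293557:exon3:c.C2023T:p.R675W,"
for change in parse_transcript_changes(field):
    print(change["transcript"], change.get("protein"))
```

Using `.get` for the optional keys matters because whole-gene deletions carry a note such as "wholegene" instead of exon, cDNA, and protein fields.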
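4.2 region-based annotation
In contrast to gene-based annotation, region-based annotation is used to assess variants that fall within specific genomic regions, for example regions conserved among 44 species, predicted transcription factor binding sites, segmental duplication regions, GWAS hits, databases of genomic variants, and epigenomic sites. Conserved genomic elements annotation is used here as an example of region-based annotation: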
[kaiwang@biocluster ~/]$ annotate_variation.pl -regionanno -build hg19 -out ex1 -dbtype phastConsElements46way example/ex1.avinput humandb/
NOTICE: Reading annotation database humandb/hg19_phastConsElements46way.txt ... Done with 5163775 regions
NOTICE: Finished region-based annotation on 12 genetic variants in ex1.hg19.avinput
NOTICE: Output files were written to ex1.hg19_phastConsElements46way
# -regionanno selects region-based annotation
# -dbtype phastConsElements46way uses the "phastConsElements46way" database; note that a region-based database is required
###### Output file
[kaiwang@biocluster ~/]$ cat ex1.hg19_phastConsElements46way
phastConsElements46way Score=387;Name=lod=50 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
phastConsElements46way Score=420;Name=lod=68 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
phastConsElements46way Score=385;Name=lod=49 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
phastConsElements46way Score=395;Name=lod=54 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
phastConsElements46way Score=545;Name=lod=218 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
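Output file: column 1 is "phastConsElements46way", the type of annotation; here the phastCons 46-way alignments annotate conserved genomic regions.
Column 2 contains a score and a name. The score comes from UCSC, and --score_threshold and --normscore_threshold can be used to filter out low-scoring variants; "Name=lod=x" is the name of the region.
The remaining columns are the original input line.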
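If the annotation has already been run, the same kind of score cut can be applied afterwards to the output file. The Python sketch below is one way to do that under the assumption that the file is tab-delimited with the "Score=...;Name=..." field in column 2; the function names are hypothetical and this is post-processing, not a replacement for --score_threshold.

```python
def parse_region_annotation(field):
    """Turn 'Score=387;Name=lod=50' into {'Score': '387', 'Name': 'lod=50'}."""
    result = {}
    for item in field.split(";"):
        key, _, value = item.partition("=")   # split at the first '=' only
        result[key] = value
    return result


def keep_high_scores(lines, min_score):
    """Keep output lines whose Score is at least min_score."""
    kept = []
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if int(parse_region_annotation(columns[1])["Score"]) >= min_score:
            kept.append(line)
    return kept


lines = [
    "phastConsElements46way\tScore=387;Name=lod=50\t1\t67705958\t67705958\tG\tA",
    "phastConsElements46way\tScore=545;Name=lod=218\t13\t20797176\t21105944\t0\t-",
]
print(keep_high_scores(lines, 400))
```

`partition` is used instead of `split` because the name itself contains an equals sign ("lod=50").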
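4.3 filter-based annotation
Filter-based annotation identifies variants that are already recorded in specific databases. For example, to decide whether a variant is novel you need to know whether it is present in dbSNP, what its allele frequency is in the 1000 Genomes Project, and how it scores under various prediction tools so that it can be filtered accordingly. It differs from region-based annotation in that it works on the exact nucleotide change, whereas region-based annotation works only on chromosomal position. For example, region-based annotation matches chr1:1000-1000, while filter-based annotation matches the A->G change at chr1:1000-1000.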
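The difference between the two matching rules can be sketched in a few lines of Python. This is a toy illustration with in-memory data, not how ANNOVAR indexes its databases; the function names, the region name "elem1", and the identifier "rs_hypothetical" are all made up for the example.

```python
def region_match(variant, regions):
    """Region-based: any interval on the same chromosome that overlaps the variant."""
    chrom, start, end, _ref, _alt = variant
    return [name for (r_chrom, r_start, r_end, name) in regions
            if r_chrom == chrom and r_start <= end and start <= r_end]


def filter_match(variant, known_variants):
    """Filter-based: exact match on chromosome, position, and both alleles."""
    return known_variants.get(variant)


regions = [("1", 900, 1100, "elem1")]
known_variants = {("1", 1000, 1000, "A", "G"): "rs_hypothetical"}

a_to_g = ("1", 1000, 1000, "A", "G")
a_to_t = ("1", 1000, 1000, "A", "T")
print(region_match(a_to_g, regions), region_match(a_to_t, regions))
print(filter_match(a_to_g, known_variants), filter_match(a_to_t, known_variants))
```

Both variants hit the same region because only coordinates are compared, while only the A->G change is found by the allele-aware lookup.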
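Many databases are available, covering variant frequencies from whole-genome sequencing, variant frequencies from whole-exome sequencing, variant frequencies in isolated or small populations, functional predictions for whole-genome variants, functional predictions for whole-exome variants, functional predictions for splicing variants, disease-associated variants, and variant identifiers. Typical uses include filtering by 1000 Genomes variant frequency, by scores from various deleteriousness prediction tools, and by variant frequency in various exome databases.
4.3.1 Frequency annotation with the 1000 Genomes database
The following annotates with 1000 Genomes data: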
[kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype 1000g2012apr_eur -buildver hg19 -out ex1 example/ex1.avinput humandb/
NOTICE: Variants matching filtering criteria are written to ex1.hg19_EUR.sites.2012_04_dropped, other variants are written to ex1.hg19_EUR.sites.2012_04_filtered
NOTICE: Processing next batch with 15 unique variants in 15 input lines
NOTICE: Database index loaded. Total number of bins is 2766067 and the number of bins to be scanned is 12
NOTICE: Scanning filter database humandb/hg19_EUR.sites.2012_04.txt...Done
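### Variants already present in the database are written to the *dropped file; variants absent from the database are written to the *filtered file.
### Column 1 is the name of the annotation database; column 2 is the allele frequency of the variant
#### Note that -maf 0.05 -reverse can also be used to filter out variants with a frequency above 0.05; for filtering on the ALT allele frequency, however, the -score_threshold option is recommended instead.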
[kaiwang@biocluster ~/]$ cat ex1.hg19_EUR.sites.2012_04_dropped
1000g2012apr_eur 0.04 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
1000g2012apr_eur 0.87 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
1000g2012apr_eur 0.81 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
1000g2012apr_eur 0.06 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
1000g2012apr_eur 0.54 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1000g2012apr_eur 0.96 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
1000g2012apr_eur 0.05 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
1000g2012apr_eur 0.01 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
1000g2012apr_eur 0.01 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
1000g2012apr_eur 0.53 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
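Once the _dropped file exists, its second column can also be used for a quick frequency split outside ANNOVAR. The Python sketch below assumes tab-delimited lines with the frequency in column 2, as in the listing above; the function name `split_by_frequency` is hypothetical and the 0.05 cutoff is only an example value.

```python
def split_by_frequency(dropped_lines, max_freq):
    """Separate lines into (rare, common) using the frequency in column 2."""
    rare, common = [], []
    for line in dropped_lines:
        columns = line.rstrip("\n").split("\t")
        freq = float(columns[1])
        if freq <= max_freq:
            rare.append(line)
        else:
            common.append(line)
    return rare, common


lines = [
    "1000g2012apr_eur\t0.04\t1\t1404001\t1404001\tG\tT",
    "1000g2012apr_eur\t0.87\t1\t162736463\t162736463\tC\tT",
]
rare, common = split_by_frequency(lines, 0.05)
print(len(rare), len(common))
```

Variants written to the _filtered file have no frequency column at all, so they would need to be handled separately (they are absent from the database rather than rare within it).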
4.3.2 Annotating with the dbSNP database
[kaiwang@biocluster ~/]$ annotate_variation.pl -filter -out ex1 -build hg19 -dbtype snp138 example/ex1.avinput humandb/
NOTICE: Variants matching filtering criteria are written to ex1.hg19_snp138_dropped, other variants are written to ex1.hg19_snp138_filtered
NOTICE: Processing next batch with 15 unique variants in 15 input lines
NOTICE: Database index loaded. Total number of bins is 2858459 and the number of bins to be scanned is 12
NOTICE: Scanning filter database humandb/hg19_snp138.txt...Done
### Variants that already have a dbSNP identifier are written to the *dropped file; variants not found in dbSNP are written to the *filtered file
[kaiwang@biocluster ~/]$ cat ex1.hg19_snp138_dropped
snp138 rs35561142 1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
snp138 rs149123833 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
snp138 rs1000050 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
snp138 rs1287637 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
snp138 rs11209026 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
snp138 rs6576700 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
snp138 rs15842 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
snp138 rs80338939 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
snp138 rs2066844 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
snp138 rs2066845 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
snp138 rs2066847 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
snp138 rs2241880 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
# *dropped file
Column 1 is the database name, as in the region-based annotation output.
Column 2 is the identifier of the variant already recorded in the database.
Column 3 onward is, again, the original input line.