需要的數(shù)據(jù)文件:ref.fa test_1.fq test_2.fq
- 測序數(shù)據(jù)質(zhì)控(具體方法另做總結(jié))葵第,由原始測序文件raw_reads.fq得到可以進(jìn)一步分析的clean_reads.fq(raw.fq--->clean.fq)
- 建索引
#index
bwa index ref.fa
bwa index -a bwtsw -p hg19 hg19.fa & #以人類為例药磺,-p指定索引文件的開頭名稱
#生成5個(gè)文件:.amb,.ann,.bwt,.pac,.sa
samtools faidx ref.fa
gatk CreateSequenceDictionary -R ref.fa -O ref.dict
- 比對(duì)+排序+標(biāo)記重復(fù)
這一步由test_1.fq伦连、test_2.fq得到test.sorted.markup.bam
# bwa
bwa mem -t 4 -R '@RG\tID:test\tPL:illumina\tSM:test' ref.fa test_1.fq test_2.fq | samtools view -bS - >test.bam
samtools sort -@ 3 -o test.sorted.bam test.bam
rm test.bam
#markduplicate
Gatk MarkDuplicates -I test.sorted.bam -O test.sorted.markdup.bam -M test.sorted.markdup_metrics.txt3
#-I代表排序后的一個(gè)或者多個(gè)bam/sam文件,-M 代表輸出重復(fù)矩陣
#對(duì)排序標(biāo)記重復(fù)后的bam文件建立索引
samtools index test.sorted.markup.bam
此外济榨,可收集比對(duì)信息、測序深度等
#Collect Alignment & Insert Size Metrics
gatk CollectAlignmentSummaryMetrics R=ref.fa I=test.sorted.markdup.bam O=alignment_metrics.txt
gatk CollectInsertSizeMetrics INPUT=test.sorted.markdup.bam OUTPUT=insert_metrics.txt HISTOGRAM_FILE=insert_size_histogram.pd
samtools depth -a test.sorted.markdup.bam > depth_out.txt
- 堿基質(zhì)量重矯正
在開始檢測變異前先進(jìn)行堿基質(zhì)量值重矯正,包含BaseRecalibrator和ApplyBQSR兩步获洲,第一步得到一個(gè)校準(zhǔn)表文件(recal_data.table)
第二步利用recal_data.table重新調(diào)整原來BAM文件中的堿基質(zhì)量值,生成一份新的BAM文件殿如,變異檢測就是利用這個(gè)新的BAM文件進(jìn)行贡珊。
堿基質(zhì)量重矯正需要利用一些已經(jīng)存在的變異信息,所以不同物種的變異檢測在堿基質(zhì)量重矯正上存在差異涉馁。對(duì)應(yīng)人類來說現(xiàn)有一些數(shù)據(jù)庫门岔,但其他物種可能沒有。這一步由test.sorted.markup.bam得到recal_reads.bam
①對(duì)于有相應(yīng)變異數(shù)據(jù)庫的物種(比如人)來說
#Base Quality Score Recalibration(BQSR)#2
gatk BaseRecalibrator -R ref.fa -I test.sorted.markdup.bam --known-sites dbsnp_138.hg19.vcf -O recal_data.table #這里以dbsnp為例說明
gatk ApplyBQSR -R ref.fa -I test.sorted.markdup.bam -bqsr recal_data.table -O recal_reads.bam #2
#由test.sorted.markup.bam直接得到recal_reads.bam
個(gè)人理解烤送,堿基質(zhì)量重矯正需要提供vcf文件寒随,那么沒有變異數(shù)據(jù)庫的物種進(jìn)行變異檢測,我們就想辦法創(chuàng)造vcf文件帮坚,類似于dbsnp_138.hg19.vcf 這樣的妻往,作為 BaseRecalibrator的--known-sites參數(shù)。
②對(duì)于沒有相應(yīng)變異數(shù)據(jù)庫的物種來說(不知是否可以略過试和?參考https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/)
#先進(jìn)行第一次Call Variants得到raw vcf讯泣,用raw vcf代替dbsnp_138.hg19.vcf等進(jìn)行矯正
gatk HaplotypeCaller -R ref.fa -I test.sorted.markdup.bam -O raw_variants.vcf
#Extract SNPs & Indels
gatk SelectVariants -R ref.fa -V raw_variants.vcf -select-type SNP -O raw_snps.vcf
gatk SelectVariants -R ref.fa -V raw_variants.vcf -select-type INDEL -O raw_indels.vcf
#Filter SNPs
gatk VariantFiltration -R ref.fa -V raw_snps.vcf \
-O filtered_snps.vcf \
-filter-name "QD_filter" -filter "QD < 2.0" \
-filter-name "FS_filter" -filter "FS > 60.0" \
-filter-name "MQ_filter" -filter "MQ < 40.0" \
-filter-name "SOR_filter" -filter "SOR > 4.0" \
-filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5" \
-filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"
#Filter Indels
gatk VariantFiltration -R ref.fa -V raw_indels.vcf \
-O filtered_indels.vcf \
-filter-name "QD_filter" -filter "QD < 2.0" \
-filter-name "FS_filter" -filter "FS > 200.0" \
-filter-name "SOR_filter" -filter "SOR > 10.0"
#Exclude Filtered Variants
gatk SelectVariants --exclude-filtered -V filtered_snps.vcf -O bqsr_snps.vcf
gatk SelectVariants --exclude-filtered -V filtered_indels.vcf -O bqsr_indels.vcf
#Base Quality Score Recalibration(BQSR)#1
gatk BaseRecalibrator -R ref.fa -I test.sorted.markdup.bam \
--known-sites bqsr_snps.vcf \
--known-sites bqsr_indels.vcf \
-O recal_data.table
#Apply BQSR #1
gatk ApplyBQSR -R ref.fa -I test.sorted.markdup.bam -bqsr recal_data.table -O recal_reads.bam
#先進(jìn)行變異檢測,變異過濾阅悍,得到bqsr_snps.vcf和bqsr_indels.vcf判帮,再利用它們進(jìn)行BaseRecalibrator和ApplyBQSR
#由test.sorted.markdup.bam局嘁,中間call一次變異,再進(jìn)行矯正晦墙,最后得到recal_reads.bam
然后開始HaplotypeCaller
- 變異檢測(HaplotypeCaller)
這一步由堿基重矯正后的recal_reads.bam生成原始vcf文件raw_variants_recal.vcf
gatk HaplotypeCaller -R ref.fa -I recal_reads.bam -O raw_variants_recal.vcf
- 變異質(zhì)控和過濾
對(duì)原始vcf文件進(jìn)行變異質(zhì)控和過濾有2種方法悦昵,一是GATK VQSR,二是通過過濾指標(biāo)進(jìn)行過濾晌畅。VQSR的應(yīng)用有一定要求但指,它需要存在相應(yīng)物種的變異數(shù)據(jù)庫(如1000G、dbsnp等)抗楔,同時(shí)要求新檢測的結(jié)果中有足夠多的變異棋凳,適合全基因組分析。
①VQSR
#snp和indels分開校準(zhǔn)
#校準(zhǔn)snp
gatk VariantRecalibrator -V raw_variants_recal.vcf -O recalibrate_SNP.recal -mode SNP --tranches-file recalibrate_SNP.tranches \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -an QD \
-an FS -an MQRankSum -an ReadPosRankSum -an SOR -an MQ --max-gaussians 6 \
-resource hapmap,known=false,training=true,truth=true,prior=15.0:$hg19_VCF/hapmap_3.3.hg19.sites.vcf \
-resource omini,known=false,training=true,truth=false,prior=12.0:$hg19_VCF/1000G_omni2.5.hg19.sites.vcf \
-resource dbsnp,known=true,training=false,truth=false,prior=6.0:$hg19_VCF/dbsnp_138.hg19.vcf
gatk ApplyVQSR -V raw_variants_recal.vcf -O recalibrated_snps_raw.vcf --recal-file recalibrate_SNP.recal \
--tranches-file recalibrate_SNP.tranches \
-truth-sensitivity-filter-level 99.5 --create-output-variant-index true -mode SNP
#校準(zhǔn)indel
gatk VariantRecalibrator -V raw_variants_recal.vcf -O recalibrate_INDEL.recal -mode INDEL --tranches-file recalibrate_INDEL.tranches \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -an QD \
-an FS -an MQRankSum -an ReadPosRankSum -an SOR -an MQ --max-gaussians 6 \
-resource hapmap,known=false,training=true,truth=true,prior=15.0:$hg19_VCF/hapmap_3.3.hg19.sites.vcf \
-resource omini,known=false,training=true,truth=false,prior=12.0:$hg19_VCF/1000G_omni2.5.hg19.sites.vcf \
-resource dbsnp,known=true,training=false,truth=false,prior=6.0:$hg19_VCF/dbsnp_138.hg19.vcf
gatk ApplyVQSR -V raw_variants_recal.vcf -O recalibrated_indel_raw.vcf --recal-file recalibrate_INDEL.recal \
--tranches-file recalibrate_INDEL.tranches \
-truth-sensitivity-filter-level 99.5 --create-output-variant-index true -mode INDEL
VariantRecalibrator和ApplyVQSR用法參照官方连躏,以下貼出可能用到的參數(shù):
USAGE: VariantRecalibrator [arguments]
Build a recalibration model to score variant quality for filtering purposes
Version:4.2.0.0
Required Arguments:
--output,-O <GATKPath> The output recal file used by ApplyVQSR Required.
--resource <FeatureInput> A list of sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run)
This argument must be specified at least once. Required.
--tranches-file <File> The output tranches file used by ApplyVQSR Required.
--use-annotation,-an <String> The names of the annotations which should used for calculations This argument must be specified at least once. Required.
--variant,-V <GATKPath> One or more VCF files containing variants This argument must be specified at least once.Required.
Optional Arguments:
--truth-sensitivity-tranche,-tranche <Double>
The levels of truth sensitivity at which to slice the data. (in percent, that is 1.0 for 1percent)
This argument may be specified 0 or more times. Default value: [100.0, 99.9,99.0, 90.0].
USAGE: ApplyVQSR [arguments]
Apply a score cutoff to filter variants based on a recalibration table
Version:4.2.0.0
Required Arguments:
--output,-O <GATKPath> The output filtered and recalibrated VCF file in which each variant is annotated with its VQSLOD value Required.
--recal-file <FeatureInput> The input recal file used by ApplyVQSR Required.
--variant,-V <GATKPath> One or more VCF files containing variants This argument must be specified at least once.Required.
②通過過濾指標(biāo)過濾
gatk SelectVariants -R ref.fa -V raw_variants_recal.vcf -select-type SNP -O raw_snps_recal.vcf
gatk SelectVariants -R ref.fa -V raw_variants_recal.vcf -select-type INDEL -O raw_indels_recal.vcf
gatk VariantFiltration -R ref.fa -V raw_snps_recal.vcf \
-O filtered_snps_final.vcf \
-filter-name "QD_filter" -filter "QD < 2.0" \
-filter-name "FS_filter" -filter "FS > 60.0" \
-filter-name "MQ_filter" -filter "MQ < 40.0" \
-filter-name "SOR_filter" -filter "SOR > 4.0" \
-filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5" \
-filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"
gatk VariantFiltration -R ref.fa -V raw_indels_recal.vcf \
-O filtered_indels_final.vcf \
-filter-name "QD_filter" -filter "QD < 2.0" \
-filter-name "FS_filter" -filter "FS > 200.0" \
-filter-name "SOR_filter" -filter "SOR > 10.0"
至此剩岳,得到了filtered_snps_final.vcf和filtered_indels_final.vcf。
GATK pipeline.jpg
參考:
1.https://gatk.broadinstitute.org/hc/en-us
2.http://www.reibang.com/p/c41e8f3266b4
3.https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/