一個完整的WGS變異檢測snakemake流程詳解

寫在前邊的話

??在此要感謝堿基礦工，第一次做全外顯子測序的變異檢測時便是參考了他寫的教程亭螟，當時的GATK還是3.8版本艇肴，而現(xiàn)在已經(jīng)是4.1了携狭。使用GATK來做變異檢測非常好用坏为，但是在有些步驟是非常耗時的螃宙，如果不能用多線程去處理的話测摔，尤其是有多個樣本時置济，就會效率很差。所以才打算寫一個snakemake流程化處理锋八，最后會附上一個完整腳本供大家參考浙于。

第一步：建立index以及下載數(shù)據(jù)庫

??Index 使用UCSC數(shù)據(jù)庫的hg38作為reference，使用bwa進行比對挟纱。下載hg38數(shù)據(jù)存為hg38.chrom.fasta, 建立bwa比對index：

$ bwa index -a bwtsw -p hg38.chrom.fasta hg38.chrom.fasta

參數(shù)說明：
-p 指定輸出index文件名字
第二個hg38.chrom.fasta為當前文件夾下的reference

??另外羞酗，還需要一個hg38.chrom.dict文件和一個hg38.chrom.fasta.fai文件, 命令如下：

$ gatk CreateSequenceDictionary -R hg38.chrom.fasta -O hg38.chrom.dict
$ samtools faidx hg38.chrom.fasta

參數(shù)說明：
-R 基因組reference文件
-O 輸出文件路徑
生成文件和輸入文件都是在當前目錄下

??index基本建立完成，然后是已知突變數(shù)據(jù)庫下載紊服，gatk提供了相關網(wǎng)站bundle可下載相關數(shù)據(jù)檀轨。下載hg38的所有數(shù)據(jù)到本地：

wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/*

第二步：數(shù)據(jù)處理

1. 去接頭

??使用trim_galore進行去接頭，自動檢測接頭序列：

input:
    R1 = '/Volumes/RawData1/WGS/Fastq/{sample}_R1.fastq.gz',
    R2 = '/Volumes/RawData1/WGS/Fastq/{sample}_R2.fastq.gz'
output:
    'Trim/{sample}_R1_val_1.fq.gz',
    'Trim/{sample}_R2_val_2.fq.gz'
shell:
    '''
        trim_galore \
        -q 20 \         
        --length 25 \ 
        --e 0.1 \
        --paired \
        {input.R1} \
        {input.R2} \
        --gzip \
        -o Trim
    '''

2. 序列比對

??BWA 的比對速度實在是太慢了欺嗤，而hisat2速度相對快一些参萄，而且一樣可以用于dna測序比對；比對之后利用samtools進行轉(zhuǎn)換為bam并排序：

input:
    r1 = 'Trim/{sample}_R1_val_1.fq.gz',
    r2 = 'Trim/{sample}_R2_val_2.fq.gz',
    index = index
output:
    bam = 'Mapping/{sample}.sorted.bam',
    sum = 'Mapping/{sample}_aln_sum.txt'
shell:
    '''
        hisat2 \
        -p {threads} \
        -x {input.index}/genome \
        -1 {input.r1} \
        -2 {input.r2} \
        --summary-file {output.sum} | \
        samtools view -Sb -q 30 - | \
        samtools sort -@ {threads} -m 2G -O bam \
        -T Mapping/{wildcards.sample}.tmp -o {output.bam}
    '''

3. 添加頭信息

??由于不是用bwa比對煎饼，所以沒辦法在一開始就為reads加上頭信息讹挎，所以這一步使用gatk去添加：

input:
    'Mapping/{sample}.sorted.bam'
output:
    'Mapping/{sample}.addhead.bam'
shell:
    '''
        gatk AddOrReplaceReadGroups \
        -I {input} \
        -O {output} \
        -LB {wildcards.sample} \
        -PL illumina \ # 測序平臺不能亂寫，其他隨意
        -PU {wildcards.sample} \
        -SM {wildcards.sample} \
        -SO coordinate
    '''

4.去掉pcr重復

??這一步只是把重復reads標記了出來吆玖，并沒有刪除筒溃，可以通過更改參數(shù)去刪除reads：

input:
    'Mapping/{sample}.addhead.bam'
output:
    bam = 'MarkDup/{sample}.markdup.bam',
    met = 'MarkDup/{sample}.metrics.txt'
shell:
    '''
        gatk MarkDuplicates \
        -I {input} \
        -O {output.bam} \
        -M {output.met}
    '''

??做完這一步需要建立一個index，方便后續(xù)調(diào)用bam文件：

input:
    'MarkDup/{sample}.markdup.bam'
output:
    'MarkDup/{sample}.markdup.bam.bai'
shell:
    'samtools index {input}'

5. 堿基質(zhì)量校正

??由于比對到SNP或INDEL上的reads附近會有很多錯配沾乘，為了避免出現(xiàn)過多假陽性怜奖，需要對這部分reads進行局部重新比對;這個過程用到了很多已知變異集，即已知的可靠的變異位點翅阵，重比對將主要圍繞這些位點進行：

# SNP and INDEL datasets
omni='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/1000G_omni2.5.hg38.vcf.gz'
thg='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/1000G_phase1.snps.high_confidence.hg38.vcf.gz'
mill='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz'
db146='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/dbsnp_146.hg38.vcf.gz'
hap='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/hapmap_3.3.hg38.vcf.gz'
dbsnp='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/beta/Homo_sapiens_assembly38.dbsnp.vcf.gz'
kn_indel='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/beta/Homo_sapiens_assembly38.known_indels.vcf.gz'
gold='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/beta/Homo_sapiens_assembly38.variantEvalGoldStandard.vcf.gz'

第一步是重比對的過程：

input:
    bam = 'MarkDup/{sample}.markdup.bam',
    ref = ref,
    db146 = db146,
    mill = mill,
    thg = thg,
    hap = hap,
    omni = omni,
    kn_indel = kn_indel,
    gold = gold,
    dbsnp = dbsnp
output:
    'BQSR/{sample}.markdup.recal.table'
shell:
    '''
        gatk BaseRecalibrator \
        -R {input.ref} \
        -I {input.bam} \
        --known-sites {input.db146} \
        --known-sites {input.mill} \
        --known-sites {input.thg} \
        --known-sites {input.hap} \
        --known-sites {input.omni} \
        --known-sites {input.kn_indel} \
        --known-sites {input.gold} \
        --known-sites {input.dbsnp} \
        -O {output}
    '''

第二步是把重比對的結(jié)果再寫入到bam文件中：

input:
    bam = 'MarkDup/{sample}.markdup.bam',
    table = 'BQSR/{sample}.markdup.recal.table',
    ref = ref
output:
    'BQSR/{sample}.BQSR.bam'
shell:
    '''
        gatk ApplyBQSR \
        --bqsr-recal-file {input.table} \
        -R {input.ref} \
        -I {input.bam} \
        -O {output}
    '''

第三步依然是為bam文件建立index：

input:
    'BQSR/{sample}.BQSR.bam'
output:
    'BQSR/{sample}.BQSR.bam.bai'
shell:
    'samtools index {input}''

6. 開始真正的變異calling

??在對reads做了校正之后歪玲，就可以拿來做Variant calling 了，這一步只是拿到初始的變異位點掷匠，因為后續(xù)還有對這些變異位點進行過濾和校正:

input:
    bam = 'BQSR/{sample}.BQSR.bam',
    ref = ref
output:
    'HC/{sample}.HC.vcf.gz'
shell:
    '''
        gatk HaplotypeCaller \
        -R {input.ref} \
        -I {input.bam} \
        -O {output}
    '''

7. 對上一步得到的變異進行過濾

??該過程也分為兩步读慎，第一步是根據(jù)已知變異集的變異位點信息，利用自己的測序數(shù)據(jù)建立一個高斯模型槐雾，用來區(qū)分好的變異位點和壞的變異位點；好的變異位點即為已知變異集相同或相似的位點幅狮，壞的變異位點則相反：

首先是SNP篩選

第一步建立模型

input:
    vcf = 'HC/{sample}.HC.vcf.gz',
    ref = ref,
    hap = hap,
    omi = omni,
    thg = thg,
    dbs = db146,
    dbsnp = dbsnp,
    gold = gold
output:
    R = 'VQSR/{sample}.snps.plots.R',
    tr = 'VQSR/{sample}.snps.tranches',
    recal = 'VQSR/${sample}.snps.recal'
shell:
    '''
        gatk VariantRecalibrator \
        -R {input.ref} \
        -V {input.vcf} \
        --resource hapmap,known=false,training=true,truth=true,prior=15.0:{input.hap} \
        --resource omini,known=false,training=true,truth=true,prior=12.0:{input.omi} \
        --resource 1000G,known=false,training=true,truth=false,prior=10.0:{input.thg} \
        --resource dbsnp,known=true,training=false,truth=false,prior=2.0:{input.dbs} \
        --resource dbsnp38,known=true,training=false,truth=false,prior=2.0:{input.dbsnp} \
        --resource gold,known=true,training=false,truth=false,prior=2.0:{input.gold} \
        -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \ # 全基因組測序還有一個參數(shù)是 -an QD 募强，外顯子測序沒有
        -mode SNP \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 95.0 -tranche 90.0 \
        --rscript-file {output.R} \
        --tranches-file {output.tr} \
        -O {output.recal}
    '''

第二步篩選變異位點：

input:
    vcf = 'HC/{sample}.HC.vcf.gz',
    ref = ref,
    tr = 'VQSR/{sample}.snps.tranches',
    recal = 'VQSR/${sample}.snps.recal'
output:
    'VQSR/{sample}.snp.vcf'
shell:
    '''
        gatk ApplyVQSR \
        -R {input.ref} \
        -V {input.vcf} \
        --ts-filter-level 99.0 \
        --tranches-file {input.tr} \
        --recal-file {input.recal} \
        -mode SNP \
        -O {output}
    '''

然后是INDEL篩選

??INDEL的篩選是建立在snp篩選基礎上的，所以snp篩選用過的變異集在這一步就不再用了。

第一步建立模型

input:
    vcf = 'VQSR/{sample}.snp.vcf',
    ref = ref,
    mill = mill,
    kn_indel = kn_indel
output:
    tr = 'VQSR/{sample}.indel.tranches',
    R = 'VQSR/{sample}.indel.plots.R',
    recal = 'VQSR/${sample}.indel.recal'
shell:
    '''
        gatk VariantRecalibrator \
        -R {input.ref} \
        -V {input.vcf} \
        --resource mills,known=true,training=true,truth=true,prior=12.0:{input.mill} \
        --resource kn_indel,known=true,training=true,truth=true,prior=10.0:{input.kn_indel} \
        -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \ # 全基因組測序還有一個參數(shù)是 -an QD 窘疮，外顯子測序沒有
        -mode INDEL \
        --max-gaussians 6 \
        --rscript-file {output.R} \
        --tranches-file {output.tr} \
        -O {output.recal}
    '''

第二步篩選變異位點

input:
    vcf = 'VQSR/{sample}.snp.vcf',
    ref = ref,
    tr = 'VQSR/{sample}.indel.tranches',
    recal = 'VQSR/${sample}.indel.recal'
output:
    'VQSR/{sample}.snp.indel.vcf'
shell:
    '''
        gatk ApplyVQSR \
        -R {input.ref} \
        -V {input.vcf} \
        --ts-filter-level 99.0 \
        --tranches-file {input.tr} \
        --recal-file {input.recal} \
        -mode INDEL \
        -O {output}
    '''

8.拆分SNP和INDEL結(jié)果

??由于前兩步產(chǎn)生的結(jié)果都在同一個文件里鲫忍，所以需要從中把SNP和INDEL分別拆分出來：

input:
    vcf = 'VQSR/{sample}.snp.indel.vcf',
    ref = ref
output:
    snp = 'SNP_INDEL/{sample}.snp.vcf.gz',
    indel = 'SNP_INDEL/{sample}.indel.vcf.gz'
shell:
    '''
        gatk SelectVariants -R {input.ref} -select-type SNP --variant {input.vcf} -O {output.snp}
        gatk SelectVariants -R {input.ref} -select-type INDEL --variant {input.vcf} -O {output.indel}
    '''

9. 選擇通過篩選的變異

??在前兩步的結(jié)果中，通過篩選的變異會加上一個'PASS'的標簽鸠儿，我們通過這個標簽可以選擇出對應的變異位點：

input:
    snp = 'SNP_INDEL/{sample}.snp.vcf.gz',
    indel = 'SNP_INDEL/{sample}.indel.vcf.gz'
output:
    snp = 'PASS/{sample}.filtered.snp.vcf',
    indel = 'PASS/{sample}.filtered.indel.vcf'
shell:
    '''
        zgrep 'PASS' {input.snp} > {output.snp}
        zgrep 'PASS' {input.indel} > {output.indel}
    '''

10. 對變異位點進行注釋

??最后得到的變異位點我們并不能直接看出它的功能等信息屹蚊，因此需要利用已知的變異庫去注釋，然后去觀察它是否和一些疾病等有關进每。注釋軟件我們選擇ANNOVAR,因為它下載即可使用汹粤，而且提供常用的變異庫，使用起來非常方便：

input:
    snp = 'PASS/{sample}.filtered.snp.vcf',
    indel = 'PASS/{sample}.filtered.indel.vcf',
    humandb = humandb,
    xref = xref
output:
    snp = 'ANNOVAR/{sample}.filtered.snp',
    indel = 'ANNOVAR/{sample}.filtered.indel'
shell:
    '''
        table_annovar.pl {input.snp} {input.humandb} -buildver hg38 -out {output.snp} \
        -remove -protocol refGene,cytoBand,exac03,dbnsfp33a,avsnp150,cosmic70,dbscsnv11 \
        -operation gx,r,f,f,f,f,f -nastring . \
        -polish -xref {input.xref}
    

        table_annovar.pl {input.indel} {input.humandb} -buildver hg38 -out {output.indel} \
        -remove -protocol refGene,cytoBand,exac03,dbnsfp33a,avsnp150,cosmic70,dbscsnv11 \
        -operation gx,r,f,f,f,f,f -nastring . \
        -polish -xref {input.xref}
    '''

最后田晚，我們得到的是Tab分割的表格嘱兼，分別儲存著變異位點的位置、序列贤徒、相關基因等信息芹壕，可以利用這些信息去進行可視化，例如利用IGV去觀察或者用circos或cirlize去作圖接奈。

?著作權歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

人面猴
序言：七十年代末踢涌，一起剝皮案震驚了整個濱河市，隨后出現(xiàn)的幾起案子序宦，更是在濱河造成了極大的恐慌睁壁，老刑警劉巖，帶你破解...
沈念sama閱讀 217,277評論 6贊 503
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件挨厚，死亡現(xiàn)場離奇詭異堡僻，居然都是意外死亡，警方通過查閱死者的電腦和手機疫剃，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 92,689評論 3贊 393
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進店門钉疫，熙熙樓的掌柜王于貴愁眉苦臉地迎上來，“玉大人巢价，你說我怎么就攤上這事牲阁。” “怎么了壤躲？”我有些...
開封第一講書人閱讀 163,624評論 0贊 353
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵城菊，是天一觀的道長。經(jīng)常有香客問我碉克，道長凌唬，這世上最難降的妖魔是什么？我笑而不...
開封第一講書人閱讀 58,356評論 1贊 293
?港島之戀（遺憾婚禮）
正文為了忘掉前任漏麦，我火速辦了婚禮客税，結(jié)果婚禮上况褪，老公的妹妹穿的比我還像新娘。我一直安慰自己更耻，他們只是感情好测垛，可當我...
茶點故事閱讀 67,402評論 6贊 392
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布。她就那樣靜靜地躺著秧均，像睡著了一般食侮。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上目胡，一...
開封第一講書人閱讀 51,292評論 1贊 301
城市分裂傳說
那天锯七，我揣著相機與錄音，去河邊找鬼讶隐。笑死起胰，一個胖子當著我的面吹牛，可吹牛的內(nèi)容都是我干的巫延。我是一名探鬼主播效五，決...
沈念sama閱讀 40,135評論 3贊 418
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼，長吁一口氣：“原來是場噩夢啊……” “哼炉峰！你這毒婦竟也來了畏妖？” 一聲冷哼從身側(cè)響起，我...
開封第一講書人閱讀 38,992評論 0贊 275
萬榮殺人案實錄
序言：老撾萬榮一對情侶失蹤疼阔，失蹤者是張志新（化名）和其女友劉穎戒劫，沒想到半個月后，有當?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體婆廊，經(jīng)...
沈念sama閱讀 45,429評論 1贊 314
?護林員之死
正文獨居荒郊野嶺守林人離奇死亡迅细，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點故事閱讀 37,636評論 3贊 334
?白月光啟示錄
正文我和宋清朗相戀三年，在試婚紗的時候發(fā)現(xiàn)自己被綠了淘邻。大學時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片茵典。...
茶點故事閱讀 39,785評論 1贊 348
活死人
序言：一個原本活蹦亂跳的男人離奇死亡，死狀恐怖宾舅，靈堂內(nèi)的尸體忽然破棺而出统阿，到底是詐尸還是另有隱情，我是刑警寧澤筹我，帶...
沈念sama閱讀 35,492評論 5贊 345
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布扶平，位于F島的核電站，受9級特大地震影響蔬蕊，放射性物質(zhì)發(fā)生泄漏结澄。R本人自食惡果不足惜，卻給世界環(huán)境...
茶點故事閱讀 41,092評論 3贊 328
男人毒藥：我在死后第九天來索命
文/蒙蒙一、第九天我趴在偏房一處隱蔽的房頂上張望麻献。院中可真熱鬧呼巷，春花似錦、人聲如沸赎瑰。這莊子的主人今日做“春日...
開封第一講書人閱讀 31,723評論 0贊 22
一樁弒父案破镰，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽餐曼。三九已至，卻和暖如春鲜漩，著一層夾襖步出監(jiān)牢的瞬間源譬，已是汗流浹背。一陣腳步聲響...
開封第一講書人閱讀 32,858評論 1贊 269
情欲美人皮
我被黑心中介騙來泰國打工孕似，沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留踩娘，地道東北人。一個月前我還...
沈念sama閱讀 47,891評論 2贊 370
代替公主和親
正文我出身青樓喉祭，卻偏偏與公主長得像养渴，于是被迫代替她去往敵國和親。傳聞我的和親對象是個殘疾皇子泛烙，可洞房花燭夜當晚...
茶點故事閱讀 44,713評論 2贊 354