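Foreword: I started my PICU rotation on July 16. The day before yesterday I shadowed a night shift. Perhaps because newcomers attract patients, we began admitting at noon and stayed busy until past 3 a.m. Yesterday I was up shortly after 6 a.m. for rounds, orders, and progress notes, which took until after 11. By then my brain was completely spent, so I went back to my place and caught up on sleep. Today I left for work a little after 6 because I had to wait on some test results, then handled only one discharge. Everything was done by 12, I attended a group meeting a little after 1, and after 2 I finally had free time to look at the earlier results.
Checking the BAM conversion showed that some samples hit an unknown error during conversion and were not converted. I extracted the names of those samples, rebuilt config1, and reran the conversion.
#Build config1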
basename -a *bam.tmp.0000* >tmp
cat tmp| while read id; do sample=${id%%.hg38.sort*}; echo $sample; done >config1
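The sample name is recovered from each leftover file name with the `%%` suffix-removal expansion. A minimal sketch of what that expansion does; the file names and the helper `strip_suffix` are made up for illustration and are not part of the original commands:

```shell
# strip_suffix is a hypothetical helper: it prints its argument with the
# longest suffix matching ".hg38.sort*" removed.
strip_suffix() {
  printf '%s\n' "${1%%.hg38.sort*}"
}

# Hypothetical leftover file names, for illustration only
for id in sampleA.hg38.sorted.bam.tmp.0000.bam sampleB.hg38.sorted.bam.tmp.0000.bam; do
  strip_suffix "$id"
done
```

Because `%%` removes the longest match, everything from the first `.hg38.sort` onward is dropped, whatever follows it.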
#Remove leftover files
rm -rf *.bam.tmp.0*
#Activate the conda environment and restart the conversion
conda activate wes
nohup cat config1 | while read id ; do bam=~/CHD_pooling_seq/${id}.dedup.bam; if [ ! -f ~/project/0.bwa/ok.${id}_marked.status ]; then echo "start CrossMap for ${id}" `date`; python /root/miniconda3/envs/py3/bin/CrossMap.py bam ~/biosoft/liftover/hg19ToHg38.over.chain.gz ${bam} ~/project/0.bwa/${id}.hg38 1>~/project/0.bwa/${id}_log.mark 2>&1; if [ $? -eq 0 ]; then touch ~/project/0.bwa/ok.${id}_marked.status; fi; echo "end CrossMap for ${id}" `date`; fi; done &
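The one-liner above relies on a status-file checkpoint: a sample is processed only if its `ok.*.status` marker is missing, and the marker is created only when the command exits with status 0. A minimal sketch of that pattern in isolation; `run_once`, the temporary directory, and the use of `true`/`false` as stand-ins for the real CrossMap call are all assumptions made for illustration:

```shell
# Hypothetical stand-in for the directory that holds the status markers
status_dir=$(mktemp -d)

# run_once NAME CMD...: run CMD only if the marker for NAME is absent,
# and create the marker only when CMD succeeds.
run_once() {
  name=$1
  shift
  marker="$status_dir/ok.${name}_marked.status"
  if [ -f "$marker" ]; then
    echo "skip $name"
    return 0
  fi
  if "$@"; then
    touch "$marker"
    echo "done $name"
  else
    echo "failed $name"
    return 1
  fi
}

run_once sampleA true            # "true" stands in for a successful conversion
run_once sampleA true            # skipped: the marker already exists
run_once sampleB false || true   # a failing command leaves no marker behind
```

Rerunning the loop after a crash then only touches the samples that have no marker yet.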
Variant calling runs at the same time.
The single-sample calling script is wesFlow_multi_to_gvcf.sh:
(base) root@1100150:~/project# vi wesFlow_multi_to_gvcf.sh
(base) root@1100150:~/project# cat wesFlow_multi_to_gvcf.sh
#!/usr/bin/env bash
# Usage: bash ~/project/wesFlow_multi_to_gvcf.sh $sample
# This is a WES workflow for a single sample
# Take the sample name from the first positional argument
sample=$1
samtools=samtools
GATK=~/biosoft/gatk-4.1.7.0/gatk
#references
ref=~/reference/genome/Homo_sapiens_assembly38.fasta
gatk_ref=~/reference/genome/Homo_sapiens_assembly38.fasta
gatk_bundle=~/annotation/variation/GATK
dbsnp=$gatk_bundle/dbsnp_146.hg38.vcf.gz
indel=$gatk_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
G1000=$gatk_bundle/1000G_phase1.snps.high_confidence.hg38.vcf.gz
hapmap=$gatk_bundle/hapmap_3.3.hg38.vcf.gz
omini=$gatk_bundle/1000G_omni2.5.hg38.vcf.gz
mills=$gatk_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
outdir=~/project
## outdir directory
if [ ! -d $outdir/0.bwa ]
then mkdir -p $outdir/0.bwa
fi
if [ ! -d $outdir/gatk ]
then mkdir -p $outdir/gatk
fi
## start the gatk analysis
## with one sample
time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" MarkDuplicates \
-I $outdir/0.bwa/$sample.hg38.sorted.bam \
-O $outdir/0.bwa/${sample}.sorted.marked.bam \
-M $outdir/0.bwa/$sample.metrics \
1>$outdir/0.bwa/${sample}_log.mark 2>&1 && echo "MarkDuplicates done!"
time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" FixMateInformation \
-I $outdir/0.bwa/${sample}.sorted.marked.bam \
-O $outdir/0.bwa/${sample}.sorted.marked.fixed.bam \
-SO coordinate \
1>$outdir/0.bwa/${sample}_log.fix 2>&1
## 86 minutes
time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" BaseRecalibrator \
-R $ref \
-I $outdir/0.bwa/${sample}.sorted.marked.fixed.bam \
--known-sites $dbsnp \
--known-sites $indel \
--known-sites $G1000 \
-O $outdir/0.bwa/${sample}_recal.table \
1>$outdir/0.bwa/${sample}_log.recal 2>&1 && echo "BaseRecalibrator done!"
## 45 minutes
time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" ApplyBQSR \
-R $ref \
-I $outdir/0.bwa/${sample}.sorted.marked.fixed.bam \
-bqsr $outdir/0.bwa/${sample}_recal.table \
-O $outdir/0.bwa/${sample}.sorted.marked.fixed.bqsr.bam \
1>$outdir/0.bwa/${sample}_log.ApplyBQSR 2>&1 && echo "ApplyBQSR done!"
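## 449m for 16G data
## optional: add "--dbsnp $dbsnp" as an extra argument line to annotate known variant IDs
## (a commented-out line inside a backslash-continued command would cut the command short)
time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" HaplotypeCaller \
-R $ref \
-I $outdir/0.bwa/${sample}.sorted.marked.fixed.bqsr.bam \
-O $outdir/gatk/${sample}.HC.vcf.gz \
1>$outdir/0.bwa/${sample}_log.HC 2>&1 && echo "HaplotypeCaller done!"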
time $samtools index $outdir/0.bwa/${sample}.sorted.marked.fixed.bqsr.bam && echo "** ${sample}.sorted.marked.fixed.bqsr.bam index done! **"
# VQSR
# first SNP mode: assess the quality of SNP and INDEL variant sites separately
# SNP mode
time $GATK VariantRecalibrator \
-R $ref \
-V $outdir/gatk/$sample.HC.vcf.gz \
--max-gaussians 4 \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 $hapmap \
-resource:omini,known=false,training=true,truth=false,prior=12.0 $omini \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 $G1000 \
-resource:snp,known=true,training=false,truth=false,prior=10.0 $dbsnp \
-an DP -an QD -an SOR -an ReadPosRankSum -an MQRankSum \
-mode SNP \
--rscript-file $outdir/gatk/${sample}.HC.snps.plots.R \
--tranches-file $outdir/gatk/${sample}.HC.snps.tranches \
-O $outdir/gatk/${sample}.HC.snps.recal
time $GATK ApplyVQSR \
-R $ref \
-V $outdir/gatk/$sample.HC.vcf.gz \
--truth-sensitivity-filter-level 99.0 \
--tranches-file $outdir/gatk/$sample.HC.snps.tranches \
--recal-file $outdir/gatk/$sample.HC.snps.recal \
-mode SNP \
-O $outdir/gatk/$sample.HC.snps.VQSR.vcf.gz && echo "** SNPs VQSR done **"
## Indel mode
time $GATK VariantRecalibrator \
-R $ref \
-V $outdir/gatk/$sample.HC.snps.VQSR.vcf.gz \
--max-gaussians 6 \
-resource:mills,known=false,training=true,truth=true,prior=15.0 $mills \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode INDEL \
--rscript-file $outdir/gatk/${sample}.HC.snps.indels.plots.R \
--tranches-file $outdir/gatk/${sample}.HC.snps.indels.tranches \
-O $outdir/gatk/${sample}.HC.snps.indels.recal
time $GATK ApplyVQSR \
-R $ref \
-V $outdir/gatk/$sample.HC.snps.VQSR.vcf.gz \
--truth-sensitivity-filter-level 99.0 \
--tranches-file $outdir/gatk/$sample.HC.snps.indels.tranches \
--recal-file $outdir/gatk/$sample.HC.snps.indels.recal \
-mode INDEL \
-O $outdir/gatk/$sample.HC.snps.indels.VQSR.vcf.gz && echo "** SNPs and Indels VQSR $sample done **"
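A note on the variable name `G1000` used for the 1000 Genomes resource: shell variable names cannot begin with a digit, and an unbraced `$1000G` is read as the positional parameter `$1` followed by the literal text `000G`. A minimal sketch of the difference; the function `demo` and the file name are hypothetical:

```shell
demo() {
  G1000="resource.vcf.gz"   # hypothetical file name, for illustration only
  # With "first" passed as $1, this prints "$1" followed by the literal "000G"
  echo "$1000G"
  # This prints the value stored in the variable G1000
  echo "$G1000"
}

demo first
```

The same reasoning applies to any `$name` in the script: a reference only resolves if a variable of exactly that name was assigned earlier.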
Wrap it in a loop and run it:
bash CHD_Flow_multi.sh
Its contents are as follows:
(wes) root@1100150:~/project# cat CHD_Flow_multi.sh
cat config | while read sample ; do echo $sample; bash ~/project/wesFlow_multi_to_gvcf.sh $sample; done
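One thing worth keeping in mind with this loop: a variable set in the loop is not visible inside a child `bash` process unless it is exported or passed as an argument, so the child has to read the sample name from `$1`. A minimal sketch under that assumption; `child.sh`, the temporary directory, and the sample names are hypothetical stand-ins:

```shell
workdir=$(mktemp -d)

# Hypothetical child script: it takes the sample name from its first argument
cat > "$workdir/child.sh" <<'EOF'
sample=$1
echo "processing ${sample}"
EOF

# Hypothetical config with one sample name per line
printf 'sampleA\nsampleB\n' > "$workdir/config"

# Same loop shape as CHD_Flow_multi.sh, with the variable quoted
while read -r sample; do
  bash "$workdir/child.sh" "$sample"
done < "$workdir/config" > "$workdir/out.txt"

cat "$workdir/out.txt"
```

Quoting `"$sample"` and using `read -r` keeps names with unusual characters intact, although plain sample IDs work either way.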
Switch to the directory that holds config and submit the job to the background:
cd ~/project/
nohup bash CHD_Flow_multi.sh &
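This is my first time using this single-sample calling script, so the run times noted in its comments may not all be accurate. I will see how it goes and continue the analysis tomorrow.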