劉小澤寫于2020-3-9
不涉及軟件的具體使用方法,重點(diǎn)在于結(jié)果的比較
從一個(gè)小問題入手不斷探索儡循,也是一種不錯(cuò)的學(xué)習(xí)方式
需要先進(jìn)行預(yù)處理步驟——得到BQSR bam
下面的流程就簡單帶過了择膝,只使用了標(biāo)準(zhǔn)流程肴捉,詳細(xì)流程可以跟著https://www.yuque.com/biotrainee/wes/
走一遍
gatk調(diào)用的mutect2
其中的-L
參數(shù)指定了:輸出exon上的結(jié)果
# 以下的變量都需要自己提前定義好,比如:
# $REF=/home/reference/mm10.fa
# Target the analysis to specific genomic intervals with the -L parameter
gatk --java-options "-Xmx20G -Djava.io.tmpdir=${TMP_DIR}" Mutect2 \
-R $REF \
-I ${ALIGN_DIR}/${N}_bqsr.bam -normal $N \
-I ${ALIGN_DIR}/${T}_bqsr.bam -tumor $T \
-L $BED \
-O ${RESULT_DIR}/${name}_mutect2.vcf
gatk --java-options "-Xmx20G -Djava.io.tmpdir=${TMP_DIR}" FilterMutectCalls \
-R $REF \
-V ${RESULT_DIR}/${name}_mutect2.vcf \
-O ${RESULT_DIR}/${name}_somatic.vcf
cat ${RESULT_DIR}/${name}_somatic.vcf | perl -alne '{if(/^#/){print}else{next unless $F[6] eq "PASS";next if $F[0] =~/_/;print } }' > \
${RESULT_DIR}/${name}_filter.vcf
然后是muse
# 需要先搞一個(gè)樣本的配置文件,例如
# cat >config
# normal1 tumor1
# normal2 tumor2
#...
cat config | while read i;do
conf=($i)
name=${conf[1]}
N=${conf[0]}
T=${conf[1]}
# for somatic calling
cd $RESULT_DIR
muse call -O $name -f $REF_DIR \
$ALIGN_DIR/${T}_bqsr.bam \
$ALIGN_DIR/${N}_bqsr.bam
muse sump -I ${name}.MuSE.txt -E \
–O ${name}_muse.vcf –D $DBSNP
done
最后是varscan
cat config | while read i;do
conf=($i)
name=${conf[1]}
N=${conf[0]}
T=${conf[1]}
# for somatic calling
# java -jar VarScan.jar somatic normal.pileup tumor.pileup output.basename
cd $RESULT_DIR
normal_pileup="samtools mpileup -q 1 -f $REF_DIR $ALIGN_DIR/${N}_bqsr.bam"
tumor_pileup="samtools mpileup -q 1 -f $REF_DIR $ALIGN_DIR/${T}_bqsr.bam"
varscan somatic <($normal_pileup) <($tumor_pileup) $name
varscan processSomatic ${name}.snp
# 重點(diǎn)關(guān)注:${name}.snp.Somatic.hc (.hc means High Confidence)
done
利用python腳本將varscan結(jié)果轉(zhuǎn)為vcf
# script: https://github.com/student-t/Varscan2VCF/blob/master/vscan2vcf.py
python varscan2vcf.py lung-1.snp.Somatic.hc >lung-1_varscan.hc.snp.vcf
三個(gè)軟件結(jié)果的統(tǒng)計(jì)
1 首先:統(tǒng)計(jì)所有原始的vcf文件variant數(shù)量
ls *vcf | while read i;do( count=`cat $i | grep -v "#" | wc -l`; name=`basename $i`;echo -e $count"\t"$name);done
# 4631 lung-1_gatk.filter.vcf
# 23875 lung-1_muse.vcf
# 29917 lung-1_varscan.hc.snp.vcf
2 然后找落在exon上的snv
首先需要exon.bed文件
BED=$wkd/reference/mm10.exon.bed
# 關(guān)于得到exon.bed
cat CCDS.current.txt | grep "Public" | perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i |awk '{if($3>$2) print "chr"$0"\t0\t+"}' > mm10.exon.bed
然后用SnpSift的intervals功能
# for muse
ls *muse.vcf | while read i;do (name=${i%%.*}.exon.vcf; cat $i | java -jar ~/biosoft/snpEff/SnpSift.jar intervals $BED >$name);done
# for varscan
ls *varscan.hc.snp.vcf | while read i;do (name=${i%%.*}.hc.snp.exon.vcf; cat $i | java -jar ~/biosoft/snpEff/SnpSift.jar intervals $BED >$name);done
# for gatk
ls *gatk.filter.vcf | while read i;do (name=${i%%.*}.filter.exon.vcf; cat $i | java -jar ~/biosoft/snpEff/SnpSift.jar intervals $BED >$name);done
3 然后添加dbSNP信息
使用SnpSift的annotate功能【需要準(zhǔn)備好dbSNP的vcf】
ls *exon.vcf | while read i;do (name=${i%%.*}.exon.dbsnp.vcf;java -jar ~/biosoft/snpEff/SnpSift.jar annotate /home/yunzeliu/project/2020-01-20-mm10_wes/reference/snp_vcf/00-All.vcf $i >$name);done
對(duì)結(jié)果再進(jìn)行一個(gè)統(tǒng)計(jì)
ls *.exon.dbsnp.vcf | while read i;do( count=`cat $i | grep -v "#" | wc -l`; name=`basename $i`;echo -e $count"\t"$name);done
# 4631 lung-1_gatk.exon.dbsnp.vcf
# 5961 lung-1_muse.exon.dbsnp.vcf
# 6840 lung-1_varscan.exon.dbsnp.vcf
4 看一下未知突變的情況
外顯子測(cè)序一般會(huì)關(guān)注:癌有癌旁沒有脖卖、數(shù)據(jù)庫中未記錄的畦木、非同義突變
ls *.exon.dbsnp.vcf | while read i;do( count=`cat $i|perl -alne '{print if $F[2] eq "."}' | wc -l`; name=`basename $i`;echo -e $count"\t"$name);done
# 389 lung-1_gatk.exon.dbsnp.vcf
# 508 lung-1_muse.exon.dbsnp.vcf
# 556 lung-1_varscan.exon.dbsnp.vcf
三個(gè)軟件結(jié)果的比較
先用bedops --intersect
統(tǒng)計(jì)
它是根據(jù)bed格式去統(tǒng)計(jì)
bedops --intersect <(vcf2bed < lung-1_gatk.exon.dbsnp.vcf) <(vcf2bed < lung-1_varscan.exon.dbsnp.vcf) <(vcf2bed < lung-1_muse.exon.dbsnp.vcf) | wc -l
# 4203
然后用bcftools isec
統(tǒng)計(jì)
如果使用bcftools馋劈,需要用vcf.gz格式晾嘶,并進(jìn)行index
# for gatk 【$F[3]指的是Ref堿基】
bcftools view -v snps lung-1_gatk.exon.dbsnp.vcf| perl -alne 'if (/^#/) {print} elsif (length($F[3]) == 1) {print}' | bgzip > lung-1_gatk.exon.dbsnp.vcf.gz
tabix -p vcf lung-1_gatk.exon.dbsnp.vcf.gz
# for varscan
bcftools view -v snps lung-1_varscan.exon.dbsnp.vcf| perl -alne 'if (/^#/) {print} elsif (length($F[3]) == 1) {print}' | bgzip > lung-1_varscan.exon.dbsnp.vcf.gz
tabix -p vcf lung-1_varscan.exon.dbsnp.vcf.gz
# for musr
bcftools view -v snps lung-1_muse.exon.dbsnp.vcf| perl -alne 'if (/^#/) {print} elsif (length($F[3]) == 1) {print}' | bgzip > lung-1_muse.exon.dbsnp.vcf.gz
tabix -p vcf lung-1_muse.exon.dbsnp.vcf.gz
然后
bcftools isec -n=3 lung-1_gatk.exon.dbsnp.vcf.gz lung-1_varscan.exon.dbsnp.vcf.gz lung-1_muse.exon.dbsnp.vcf.gz -p isec
cat ./isec/0001.vcf | grep -v "#" | wc -l
# 4176
結(jié)果輸出在isec
目錄中
對(duì)比兩個(gè)結(jié)果
bedops
結(jié)果是:4203
bcftools
結(jié)果是:4176
看一下二者差異
# 這行代碼的意思是:找出來file2(bedops運(yùn)行結(jié)果)相對(duì)于file1(bcftools isec運(yùn)行結(jié)果)不同的值
awk 'NR==FNR{a[$0]}NR>FNR{ if(!($1 in a)) print $0}' <(cat ./isec/0001.vcf | grep -v "#" | cut -f2) <(bedops --intersect <(vcf2bed < lung-1_gatk.exon.dbsnp.vcf) <(vcf2bed < lung-1_varscan.exon.dbsnp.vcf) <(vcf2bed < lung-1_muse.exon.dbsnp.vcf) | cut -f3)
# 總共28個(gè)...
# 3978708
# 76066627
# 41103551
# 112930880
# 114862377
# 114865432
# 37335619
# 37766922
# 11601930
# 11703882
以位點(diǎn)3978708為例:
# 在bedops結(jié)果中檢查
bedops --intersect <(vcf2bed < lung-1_gatk.exon.dbsnp.vcf) <(vcf2bed < lung-1_varscan.exon.dbsnp.vcf) <(vcf2bed < lung-1_muse.exon.dbsnp.vcf) | grep 3978708
# chr10 3978707 3978708
# 在bcftools結(jié)果中檢查
cat ./isec/0001.vcf | grep -v "#" | grep 3978708
# 無
繼續(xù)往上推械姻,isec是由三個(gè)軟件的vcf得到的楷拳,那么看看三個(gè)文件中是否存在這個(gè)位點(diǎn)
# for gatk
grep 3978708 lung-1_gatk.exon.dbsnp.vcf
# chr10 3978708 . TG CA . PASS CONTQ=93;DP=306;ECNT=1;GERMQ=93;MBQ=20,20;MFRL=190,181;MMQ=60,60;MPOS=51;NALOD=2.08;NLOD=35.22;POPAF=6.00;SEQQ=93;STRANDQ=93;TLOD=239.95 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:57,60:0.507:117:27,28:26,31:28,29,25,35 0/0:177,0:8.228e-03:177:71,0:93,0:77,100,0,0
# for varscan
grep 3978708 lung-1_varscan.exon.dbsnp.vcf
# chr10 3978708 rs29320259 T C . PASS AF=0.5139;DP=188;SOMATIC;SS=2;SSC=187;GPV=1.0;SPV=1.8965275762250198E-19;GENEINFO=270685:Mthfd1l;RSPOS=3978708;SAO=0;VC=snp;VP=050028000A05030100000100;dbSNPBuildID=125
# for muse
grep 3978708 lung-1_muse.exon.dbsnp.vcf
# chr10 3978708 rs29320259 T C . PASS SOMATIC;GENEINFO=270685:Mthfd1l;RSPOS=3978708;SAO=0;VC=snp;VP=050028000A05030100000100;dbSNPBuildID=125 GT:DP:AD:BQ:SS 0/1:120:57,62:29,32:2 0/0:183:182,0:29,0:.
看到三個(gè)軟件都能得到這個(gè)位點(diǎn)的結(jié)果欢揖,但是gatk得到的是:TG => CA
奋蔚,其他兩個(gè)軟件得到的是:T => C
在IGV中驗(yàn)證這個(gè)位點(diǎn)
# tumor
samtools view -h $wkd/analysis/3-align/lung-1_bqsr.bam \
chr10:3978600-3978800 | samtools view -Sb - > lung-1.test.bam
samtools index lung-1.test.bam
# normal
samtools view -h $wkd/analysis/3-align/normal-lung-2_bqsr.bam \
chr10:3978600-3978800 | samtools view -Sb - > normal-lung-2.test.bam
samtools index normal-lung-2.test.bam
發(fā)現(xiàn)兩個(gè)SNV在一起(紅T泊碑、褐G馒过、綠A腹忽、藍(lán)C)砚作,所以GATK將它認(rèn)為是:TG 變 CA
很明顯偎巢,gatk把3978708和3978709這兩個(gè)SNV放在了一起兼耀,如果單獨(dú)看3978709也是沒有結(jié)果的瘤运;但另外兩個(gè)軟件把這兩個(gè)分開展示了:
# 兩個(gè)位點(diǎn)分開寫了
grep 3978708 lung-1_varscan.exon.dbsnp.vcf
# chr10 3978708 rs29320259 T C . PASS AF=0.5139;DP=188;SOMATIC;SS=2;SSC=187;GPV=1.0;SPV=1.8965275762250198E-19;GENEINFO=270685:Mthfd1l;RSPOS=3978708;SAO=0;VC=snp;VP=050028000A05030100000100;dbSNPBuildID=125
grep 3978709 lung-1_varscan.exon.dbsnp.vcf
# chr10 3978709 rs29362746 G A . PASS AF=0.5068;DP=189;SOMATIC;SS=2;SSC=185;GPV=1.0;SPV=3.092867428835387E-19;GENEINFO=270685:Mthfd1l;RSPOS=3978709;SAO=0;VC=snp;VP=050000000305030100000100;dbSNPBuildID=125
所以拯坟,這樣造成的一個(gè)影響就是:
用之前的辦法尋找GATK郁季、varscan、muse三個(gè)的交集似枕,這個(gè)位點(diǎn)3978708
和下個(gè)位點(diǎn)3978709
都是不存在的
原因是什么凿歼?
之前用bcftools的時(shí)候冗恨,指定了一個(gè)參數(shù):-v
掀抹,其中的幾個(gè)選項(xiàng)是:
snps, indels, mnps, others
我們這里只想看snps,于是都指定了-v snps
但是gatk把相鄰的兩個(gè)snv當(dāng)成了mnps
MNP (multi-nucleotide polymorphism, e.g. a dinucleotide substitution)
兩個(gè)相鄰的SNP會(huì)成為MNP一搜索就有結(jié)果:
image
# for gatk snps
bcftools view -v snps lung-1_gatk.exon.dbsnp.vcf | grep 3978708
# 無結(jié)果
# for gatk mnps
bcftools view -v mnps lung-1_gatk.exon.dbsnp.vcf | grep 3978708
# chr10 3978708 . TG CA . PASS CONTQ=93;DP=306;ECNT=1;GERMQ=93;MBQ=20,20;MFRL=190,181;MMQ=60,60;MPOS=51;NALOD=2.08;NLOD=35.22;POPAF=6;SEQQ=93;STRANDQ=93;TLOD=239.95 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:57,60:0.507:117:27,28:26,31:28,29,25,35 0/0:177,0:0.008228:177:71,0:93,0:77,100,0,0
而GATK結(jié)果中像這樣的MNPs有多少個(gè)呢?
bcftools view -v mnps lung-1_gatk.exon.dbsnp.vcf | grep -v "#" |wc -l
# 31個(gè)
如何將MNP變成SNP顯示戒幔?
https://www.biostars.org/p/280202/
conda install -y vt
bcftools norm -m -both lung-1_gatk.exon.dbsnp.vcf | vt decompose_blocksub - -o new-lung-1_gatk.exon.dbsnp.vcf
之后再次查看位點(diǎn)chr10:3978708
就拆分成了兩個(gè)
grep 3978708 new-lung-1_gatk.exon.dbsnp.vcf
# chr10 3978708 . T C . PASS CONTQ=93;DP=306;ECNT=1;GERMQ=93;MBQ=20,20;MFRL=190,181;MMQ=60,60;MPOS=51;NALOD=2.08;NLOD=35.22;POPAF=6;SEQQ=93;STRANDQ=93;TLOD=239.95;OLD_CLUMPED=chr10:3978708:TG/CA GT:AD:AF:DP:F1R2:F2R1:SB 0/1:57,60:0.507:117:27,28:26,31:28,29,25,35 0/0:177,0:0.008228:177:71,0:93,0:77,100,0,0
# chr10 3978709 . G A . PASS CONTQ=93;DP=306;ECNT=1;GERMQ=93;MBQ=20,20;MFRL=190,181;MMQ=60,60;MPOS=51;NALOD=2.08;NLOD=35.22;POPAF=6;SEQQ=93;STRANDQ=93;TLOD=239.95;OLD_CLUMPED=chr10:3978708:TG/CA GT:AD:AF:DP:F1R2:F2R1:SB 0/1:57,60:0.507:117:27,28:26,31:28,29,25,35 0/0:177,0:0.008228:177:71,0:93,0:77,100,0,0
再次對(duì)比兩個(gè)結(jié)果
awk 'NR==FNR{a[$0]}NR>FNR{ if(!($1 in a)) print $0}' <(bedops --intersect <(vcf2bed < new-lung-1_gatk.exon.dbsnp.vcf) <(vcf2bed < lung-1_varscan.exon.dbsnp.vcf) <(vcf2bed < lung-1_muse.exon.dbsnp.vcf) | cut -f3) <(cat ./isec/0001.vcf | grep -v "#" | cut -f2) | wc -l
# 這次bcftools的結(jié)果反超27個(gè)
cat ./isec/0001.vcf | grep -v "#" | wc -l
# 4229(之前是4176)
# bedops結(jié)果是:4203
按說這次結(jié)果應(yīng)該一樣才對(duì),再次檢查一個(gè)位點(diǎn)(chr2:142256608
):
awk 'NR==FNR{a[$0]}NR>FNR{ if(!($1 in a)) print $0}' <(bedops --intersect <(vcf2bed < new-lung-1_gatk.exon.dbsnp.vcf) <(vcf2bed < lung-1_varscan.exon.dbsnp.vcf) <(vcf2bed < lung-1_muse.exon.dbsnp.vcf) | cut -f3) <(cat ./isec/0001.vcf | grep -v "#" | cut -f2) |head
# 142256608
# 92243077
# 102477559
# 102614562
# 102973071
# 104893180
# 105035217
# 106891444
# 108755519
# 109143523
如果看看bedops的結(jié)果就能明白了:
bedops --intersect <(vcf2bed < new-lung-1_gatk.exon.dbsnp.vcf) <(vcf2bed < lung-1_varscan.exon.dbsnp.vcf) <(vcf2bed < lung-1_muse.exon.dbsnp.vcf) | grep 142256608
# 無結(jié)果
bedops --intersect <(vcf2bed < new-lung-1_gatk.exon.dbsnp.vcf) <(vcf2bed < lung-1_varscan.exon.dbsnp.vcf) <(vcf2bed < lung-1_muse.exon.dbsnp.vcf) | grep 1422566089
# chr2 142256607 142256609
因?yàn)?code>bedops的結(jié)果是bed格式王污,它把相鄰的位點(diǎn)(142256608和142256609)合并在了一起楚午,所以這個(gè)結(jié)果少的那27個(gè),就是來自之前GATK檢測(cè)的31個(gè)MNPs
既然之前GATK檢測(cè)了31個(gè)MNPs阱驾,那么為何這里結(jié)果之差27呢里覆?
這是因?yàn)槲覀兦蟮氖墙患掳辏m然GATK能檢測(cè)到弓坞,但另外兩個(gè)可能檢測(cè)不到啊
例如:
grep 129697407 new-lung-1_gatk.exon.dbsnp.vcf
# chr6 129697407 . A C . PASS CONTQ=93;DP=930;ECNT=2;GERMQ=93;MBQ=20,33;MFRL=196,165;MMQ=60,33;MPOS=2;NALOD=2.39;NLOD=72.24;POPAF=6;SEQQ=93;STRANDQ=12;TLOD=29.33;OLD_CLUMPED=chr6:129697407:AG/CA GT:AD:AF:DP:F1R2:F2R1:SB 0/1:524,16:0.036:540:285,7:230,8:216,308,3,13 0/0:362,0:0.004103:362:197,0:154,0:143,219,0,0
# chr6 129697408 . G A . PASS CONTQ=93;DP=930;ECNT=2;GERMQ=93;MBQ=20,33;MFRL=196,165;MMQ=60,33;MPOS=2;NALOD=2.39;NLOD=72.24;POPAF=6;SEQQ=93;STRANDQ=12;TLOD=29.33;OLD_CLUMPED=chr6:129697407:AG/CA GT:AD:AF:DP:F1R2:F2R1:SB 0/1:524,16:0.036:540:285,7:230,8:216,308,3,13 0/0:362,0:0.004103:362:197,0:154,0:143,219,0,0
grep 129697407 lung-1_muse.exon.dbsnp.vcf
# 無結(jié)果
grep 129697407 lung-1_varscan.exon.dbsnp.vcf
# 無結(jié)果
因此即使將GATK的MNP變成SNP渡冻,但依然不會(huì)被bcftools的isec
結(jié)果收錄
一般來說菩帝,最后需要做個(gè)韋恩圖來對(duì)比一下軟件【當(dāng)然這個(gè)圖是之前的版本】,之后就可以挑取交集繼續(xù)向下分析
歡迎關(guān)注我們的公眾號(hào)~_~
我們是兩個(gè)農(nóng)轉(zhuǎn)生信的小碩,打造生信星球握础,想讓它成為一個(gè)不拽術(shù)語禀综、通俗易懂的生信知識(shí)平臺(tái)苔严。需要幫助或提出意見請(qǐng)后臺(tái)留言或發(fā)送郵件到jieandze1314@gmail.com