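Once you understand the VCF file format, filtering for sites where both parents are homozygous and differ from each other becomes much simpler. If the file is well-formed, a single awk command can do the job. To make the script applicable to a wider range of files, I wrote it in a slightly more elaborate way here.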
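To make the criterion concrete, here is a minimal sketch of how a parent's genotype is read from one VCF data line. The header, the record, the sample names `P1`/`P2`, and the helper name `get_gt` are all made up for illustration and do not come from a real dataset.

```python
# Hypothetical header and record, made up for illustration only.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tP1\tP2"
record = "chr1\t1024\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:12\t1/1:15"

def get_gt(header_line, record_line, sample):
    """Return the GT string of `sample` in one VCF data line."""
    columns = header_line.lstrip("#").split("\t")
    fields = record_line.split("\t")
    # FORMAT lists the per-sample keys in order, e.g. GT:DP,
    # so the position of GT tells us which sub-field to read.
    gt_index = fields[columns.index("FORMAT")].split(":").index("GT")
    return fields[columns.index(sample)].split(":")[gt_index]

print(get_gt(header, record, "P1"))  # 0/0
print(get_gt(header, record, "P2"))  # 1/1
```

In this made-up record one parent is 0/0 and the other is 1/1, so it is the kind of site the filter is meant to keep.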
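vcf=the VCF file to filter
p1=sample name of parent 1 (must match the sample name in the VCF)
p2=sample name of parent 2 (must match the sample name in the VCF)
python vcf_filt.py --vcf "$vcf" -p1 "$p1" -p2 "$p2" > filter.vcf
The script is as follows: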
import argparse
import sys

parser = argparse.ArgumentParser(
    description="Keep only sites where both parents are homozygous and differ (0/0 vs 1/1)."
)
parser.add_argument("-i", "-vcf", "--vcf", dest="vcf", required=True,
                    help="input VCF file (uncompressed)")
parser.add_argument("-p1", "--parent_1", dest="p1", default="P1",
                    help="sample name of parent 1")
parser.add_argument("-p2", "--parent_2", dest="p2", default="P2",
                    help="sample name of parent 2")
args = parser.parse_args()

sample_dic = {}  # column name -> column index, filled from the #CHROM line

with open(args.vcf, "r") as vcf:
    for line in vcf:
        line = line.rstrip("\n")
        if line.strip() == "" or line.startswith("##"):
            # Meta-information (and blank) lines are passed through unchanged.
            print(line)
        elif line.startswith("#C"):
            # #CHROM header: record which column each field and sample is in.
            for idx, name in enumerate(line.split("\t")):
                sample_dic[name] = idx
            for parent in (args.p1, args.p2):
                if parent not in sample_dic:
                    sys.exit("sample %s not found in the VCF header" % parent)
            print(line)
        else:
            lst = line.split("\t")
            # FORMAT is read per record, since its field order may vary between lines.
            gt_idx = lst[sample_dic["FORMAT"]].split(":").index("GT")
            # Treat phased (0|0) and unphased (0/0) genotypes alike.
            p1_gt = lst[sample_dic[args.p1]].split(":")[gt_idx].replace("|", "/")
            p2_gt = lst[sample_dic[args.p2]].split(":")[gt_idx].replace("|", "/")
            # Keep the site only if one parent is 0/0 and the other is 1/1.
            if (p1_gt, p2_gt) in (("0/0", "1/1"), ("1/1", "0/0")):
                print(line)
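The genotype comparison can also be pulled out into a small helper, which makes it easier to test on its own. The sketch below is one possible way to write it, not part of the script above: the names `homozygous_allele` and `parents_differ` are hypothetical, and it assumes that any two different homozygous alleles count as a difference (for example 1/1 vs 2/2 at a multi-allelic site), which is broader than the 0/0 vs 1/1 rule used in the script.

```python
import re

def homozygous_allele(gt):
    """Return the allele index if `gt` is a called homozygous genotype, else None."""
    alleles = re.split(r"[/|]", gt)  # accept both unphased (/) and phased (|)
    if len(alleles) < 2 or "." in alleles:
        return None  # haploid or (partially) missing call
    return alleles[0] if len(set(alleles)) == 1 else None

def parents_differ(gt1, gt2):
    """True if both genotypes are homozygous and carry different alleles."""
    a1, a2 = homozygous_allele(gt1), homozygous_allele(gt2)
    return a1 is not None and a2 is not None and a1 != a2

print(parents_differ("0/0", "1/1"))  # True
print(parents_differ("1|1", "0|0"))  # True
print(parents_differ("0/1", "1/1"))  # False
print(parents_differ("./.", "1/1"))  # False
```

Splitting on the separator rather than indexing single characters means a genotype such as 1/10 is not mistaken for 1/1.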
Putting this together took some effort, so please leave a like before you go!