設(shè)計(jì)需求
統(tǒng)計(jì)fasta文件中多條序列信息,設(shè)計(jì)目標(biāo)效果:
將結(jié)果輸入到csv格式的表格中,因?yàn)閏sv格式表格用盔性,分隔數(shù)據(jù)。
腳本使用argparse模塊呢岗,提示輸入數(shù)據(jù)冕香。
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('integers', type=str, help='輸入fasta文件名')
args = parser.parse_args()
#獲得integers參數(shù)
print(args.integers)
讀取多個(gè)序列存入字典
當(dāng)fasta文件中多條序列時(shí),需要解決序列名與對(duì)應(yīng)序列的儲(chǔ)存問(wèn)題后豫。
創(chuàng)建一個(gè)fasta字典悉尾,將序列名字和序列儲(chǔ)存進(jìn)去。
當(dāng)line[0]也就是第一個(gè)字符等于>時(shí)挫酿,就將名字賦值給header构眯。
line[1:]是去除line第一個(gè)字符。因?yàn)閒asta以>開(kāi)頭去掉>才是真的名字饭豹。
如果line[0]不是>那就是序列鸵赖,那么我們就把序列儲(chǔ)存給fasta字典并對(duì)應(yīng)的序列名。
get()方法語(yǔ)法:
dict.get(key, default=None)
key -- 字典中要查找的鍵拄衰。
default -- 如果指定鍵的值不存在時(shí)它褪,返回該默認(rèn)值。
fasta[header] = fasta.get(header, '') + sequence使用這句是為了讓序列中換行的序列也儲(chǔ)存起來(lái)翘悉,連接兩個(gè)分行的序列茫打。
#讀取序列字典
def read_fasta(input):
with open(input, 'r') as f:
fasta = {}
for line in f:
line = line.strip()
if line[0] == '>':
header = line[1:]
else:
sequence = line
fasta[header] = fasta.get(header, '') + sequence
return fasta
統(tǒng)計(jì)序列信息的函數(shù)
命名這個(gè)函數(shù)為get_info,內(nèi)部參數(shù)為chr
在咱們會(huì)將fasta中ATCG的堿基內(nèi)容賦值給chr妖混,堿基可能有大寫(xiě)有小寫(xiě)老赤,所以我們使用.upper將所以字符變成大寫(xiě)。
再使用.count('G')統(tǒng)計(jì)ATCG各自的數(shù)量并賦值給對(duì)應(yīng)count_g制市,我們用ATCG各自的統(tǒng)計(jì)數(shù)可以在后面計(jì)算中免疫N值干擾抬旺。
gc含量計(jì)算其等于(G的數(shù)量+C的數(shù)量)/(A的數(shù)量+T的數(shù)量+C的數(shù)量+G的數(shù)量)
A的含量等于(A的數(shù)量)/(A的數(shù)量+T的數(shù)量+C的數(shù)量+G的數(shù)量),其他值的計(jì)算以此類推祥楣。
.format使用:
"{1} {0} {1}".format("hello", "world")設(shè)置指定位置开财。
'world hello world'
{:.2f} 保留小數(shù)點(diǎn)后兩位
最后,使用return返回函數(shù)結(jié)果(gc_con,A_content,T_content,C_content,G_content)
#統(tǒng)計(jì)序列信息的函數(shù)
def get_info(chr):
chr = chr.upper()
count_g = chr.count('G')
count_c = chr.count('C')
count_a = chr.count('A')
count_t = chr.count('T')
gc = (count_g + count_c) / (count_a + count_t + count_c + count_g)
A = (count_a) / (count_a + count_t + count_c + count_g)
T = (count_t) / (count_a + count_t + count_c + count_g)
C = (count_c) / (count_a + count_t + count_c + count_g)
G = (count_g) / (count_a + count_t + count_c + count_g)
gc_con = '{:.2%}'.format(gc)
A_content = '{:.2%}'.format(A)
T_content = '{:.2%}'.format(T)
C_content = '{:.2%}'.format(C)
G_content = '{:.2%}'.format(G)
return (gc_con,A_content,T_content,C_content,G_content)
更方便的命令行參數(shù)輸入
主要有三個(gè)步驟:
創(chuàng)建 ArgumentParser() 對(duì)象
調(diào)用 add_argument() 方法添加參數(shù)
使用 parse_args() 解析添加的參數(shù)
我們將添加的參數(shù)賦值給args
#設(shè)置輸入使用argparse
parser = argparse.ArgumentParser()
parser.add_argument('--input', '-i',
type=str,
help='input file in fasta format')
args = parser.parse_args()
輸出設(shè)置
首先我們通過(guò)之前的函數(shù)read_fasta獲得序列以及序列名的字典误褪,然后創(chuàng)建以輸入文件名為前綴的_sum.csv文件责鳍。
再將表格抬頭輸入進(jìn)去。
dict = read_fasta(args.input)
file_name = args.input.split('.')
name = file_name[0]
file_output = open("{}_sum.csv".format(name),'a')
file_output.write('name,length,GC content,A content,T content,C content,G content\n')
輸出運(yùn)算
通過(guò)for循環(huán)將字典里的每一個(gè)鍵都?xì)v遍兽间,通過(guò)get_info獲取序列堿基信息历葛。
用print打印到屏幕展示,用.write按照抬頭順序?qū)⑾嚓P(guān)信息輸入嘀略。
for val in dict:
seq_info = get_info(dict[val])
len_fasta = len(dict[val])
print('******\n{0}\nlength:{1:d}\ngc content :{2}\nA content :{3}\nT content :{4}\nC content :{5}\nG content :{6}\n'.format(val,len_fasta,seq_info[0],seq_info[1],seq_info[2],seq_info[3],seq_info[4]))
file_output.write('{0},{1},{2},{3},{4},{5},{6}\n'.format(val,len_fasta,seq_info[0],seq_info[1],seq_info[2],seq_info[3],seq_info[4]))
file_output.close()
全部代碼
附上全部代碼恤溶,可以直接使用
import argparse
#讀取序列字典
def read_fasta(input):
with open(input, 'r') as f:
fasta = {}
for line in f:
line = line.strip()
if line[0] == '>':
header = line[1:]
else:
sequence = line
fasta[header] = fasta.get(header, '') + sequence
return fasta
#統(tǒng)計(jì)序列信息的函數(shù)
def get_info(chr):
chr = chr.upper()
count_g = chr.count('G')
count_c = chr.count('C')
count_a = chr.count('A')
count_t = chr.count('T')
gc = (count_g + count_c) / (count_a + count_t + count_c + count_g)
A = (count_a) / (count_a + count_t + count_c + count_g)
T = (count_t) / (count_a + count_t + count_c + count_g)
C = (count_c) / (count_a + count_t + count_c + count_g)
G = (count_g) / (count_a + count_t + count_c + count_g)
gc_con = '{:.2%}'.format(gc)
A_content = '{:.2%}'.format(A)
T_content = '{:.2%}'.format(T)
C_content = '{:.2%}'.format(C)
G_content = '{:.2%}'.format(G)
return (gc_con,A_content,T_content,C_content,G_content)
#設(shè)置輸入使用argparse
parser = argparse.ArgumentParser()
parser.add_argument('--input', '-i',
type=str,
help='input file in fasta format')
args = parser.parse_args()
dict = read_fasta(args.input)
file_name = args.input.split('.')
name = file_name[0]
file_output = open("{}_sum.csv".format(name),'a')
file_output.write('name,length,GC content,A content,T content,C content,G content\n')
for val in dict:
seq_info = get_info(dict[val])
len_fasta = len(dict[val])
print('******\n{0}\nlength:{1:d}\ngc content :{2}\nA content :{3}\nT content :{4}\nC content :{5}\nG content :{6}\n'.format(val,len_fasta,seq_info[0],seq_info[1],seq_info[2],seq_info[3],seq_info[4]))
file_output.write('{0},{1},{2},{3},{4},{5},{6}\n'.format(val,len_fasta,seq_info[0],seq_info[1],seq_info[2],seq_info[3],seq_info[4]))
file_output.close()