提取最長(zhǎng)轉(zhuǎn)錄本是一個(gè)經(jīng)常需要做的工作,本腳本會(huì)產(chǎn)生三個(gè)文件:.fa.max(所需結(jié)果)残炮;.len
(提取序列長(zhǎng)度)略板;.len.list(基因列表)埠巨。
本文特點(diǎn)批量操作=┲搿!!!
我這次需要達(dá)到的目標(biāo)是提取每個(gè)物種的最長(zhǎng)序列!也適用于提取最長(zhǎng)轉(zhuǎn)錄本D弧!W钐讯赏!
示例文件:
>Aco|Aqcoe2G366900.1
MDGAATPMEVNDDDISTVEKTVRIEDVHSKSKDSIVLIISPRFIQPKNVDPKFIDIIANIYLVTRRVFNLRVEEISERVS
TEDVVGSEAKEACTICMDELSLDDRPVLTLPCSHDFHFLCLSKWIKDHNSCPLC
>Aco|Aqcoe2G002600.1
MSSSGNTHWCYSCRRPVRLRGRDAVCPYCSGGFVQELNEVEGISPLDFFGIDSDDERDQRFGVMEAFSALMRQRLAGRNRELDIRGRSGVVPDHGPGVNSGPWLIFSGQIPVHMSENGGIEVLLNGGPTVGLRRANVSDYFVGPGLDELIEQLTRNDRRGPPPAAQSAIDSMPTVKISHKHLRTDSHCPVCKERFELGTEARQMPCNHMYHSDCIVPWLVQHNSCPVCRLELPPQGSTSAHSSQSSSGGSRSGTASSNTSGRENSSESQTRRHPLSFLWPFRSSNTSTNTTNSNSNSNRSEPAASASAPLHEDNHQMHYTGWPFDY*
>Aco|Aqcoe2G351800.1
MAGMLPGVELARRRRTNHHHHHHHHQFEFTSNSREPTTLHHHQRLEPSSSMDETALMARKRLEEKLRGFSSAPSRWSKQQSLKGDKSYTVAYVKETHASKRQGKQIVQLDRRNSQREVCSICLEDFQAQQQVMNLPCCHRYHSDCLLPWLSAHSHCPYCRTKVMSSAN*
>Aco|Aqcoe2G264300.1
MSNDEHETSFTTICCVGREAFWCEDKVENVFVESEVSIDIYITLVDQWITSPQEEDYSDVDPPVLLRQISTICSEVAAIIMQEKPKNISVNIVVEFVHTVLCDVQAFNESFMDEEVDPILVPAAPSYVEALKTKQYDKENSKEESCSVCYQEFLMAEDVTLMPCSHIFHNNCIGRWLEMTNTCPMCRFTMPTFV*
>Aco|Aqcoe2G072200.1
MIDSSYSELEHEIAKHHLLYFLSFTEKGSNPIAYLGNIAWLAKQRAEETGGDYESIWLETFNSISSILIQEEAKTKLHDVVVLDGDEDDMICAICHEEFEVGFRGFGNTAMDCGHMFHRNCIYEWFKNKPNCPVCRHERKV*
目標(biāo)提取Aco|Aqcoe2G002600.1
使用嵌套的字典實(shí)現(xiàn),代碼如下:
import re
import sys
file =open("/home/data2/EvoGenome/xialf/orthfinder/OrthoFinder/Results_Apr20/24group.txt","r")
for g in file:
t =g.replace("\n","")
filename = "/home/data2/EvoGenome/xialf/orthfinder/OrthoFinder/Results_Apr20/24group/%s.fa"%t
seqlist={}
i=0
with open(filename,'r')as r:
lines=r.readlines()
for l in lines:
i+=1
linecontent =l.strip()
if linecontent.startswith('>'):
if i > 1:
name = linecontent[1:]
id =name[0:3]#此處是重點(diǎn)@湮尽J妗!
if id not in seqlist.keys():
seqlist[id]={}
seqlist[id][name]=""
elif i==1:
name =linecontent[1:]
id =name[0:3]#此處是重點(diǎn)H干凇?牧隆!
seqlist[id]={}
seqlist[id][name]=""
else:
seqlist[id][name]=seqlist[id][name]+linecontent
maxseq ={}
for k,v in seqlist.items():
d=0
for k1,v1 in v.items():
d +=1
if d ==1:
maxseq[k1]=v1
maxlen =len(v1)
k_last =k1
else:
if len(v1) > maxlen:
maxlen=len(v1)
maxseq[k1]=v1
del maxseq[k_last]
k_last =k1
result1=filename+".max"
result2=filename+".max.list"
result3=filename+".max.len"
with open(result1,'w')as w1:
with open(result2,'w')as w2:
with open(result3,'w')as w3:
for key,value in maxseq.items():
w1.write(">"+key+'\n'+value+"\n")
w2.write(key+'\n')
w3.write(key+'\t'+str(len(value))+"\n")
第一篇文章完成雾棺,嘿嘿膊夹。