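Protein physicochemical properties are usually computed on the ExPASy ProtParam website (https://web.expasy.org/protparam/).
The site accepts only one sequence at a time, and each result has to be copied out by hand, which is tedious when there are many sequences.
A BioPerl script would also work; I am not sure whether Biopython can do the same.
Using requests directly should work as well.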
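Two of the values collected here follow simple published formulas, so they can also be computed locally without the website. The sketch below is only an illustration, not what the script in this post does: the function names `gravy` and `aliphatic_index` are my own, the hydropathy table is the Kyte-Doolittle scale, and the aliphatic index uses the usual mole-percent formula (Ala + 2.9 × Val + 3.9 × (Ile + Leu)). I have not compared its output against the website. Biopython also ships `Bio.SeqUtils.ProtParam.ProteinAnalysis`, which covers molecular weight, isoelectric point, instability index and GRAVY.

```python
# Kyte-Doolittle hydropathy values, the scale GRAVY is based on.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    seq = seq.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index from the mole percentages of Ala, Val, Ile and Leu."""
    seq = seq.upper()

    def percent(aa):
        return 100.0 * seq.count(aa) / len(seq)

    return percent("A") + 2.9 * percent("V") + 3.9 * (percent("I") + percent("L"))


if __name__ == "__main__":
    example = "MKVLAAGIVALLLA"  # made-up sequence, for illustration only
    print(round(gravy(example), 3), round(aliphatic_index(example), 2))
```

Non-standard residue codes such as X raise a KeyError in `gravy`, so real input would need to be checked first.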
goal: review how classes are used, learn some HTML syntax, and practice string formatting with format
issue: the results are stored inside a pre tag, and the individual values are not wrapped in tags of their own;
using /following::text()[1] fails with the error "It should be an element."
solution: take the whole text of the tag, split it, and pick the values out of the resulting list
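Picking values out of the split list by position is fragile, because every index shifts if the page adds or drops a line. A label-based lookup is one possible alternative. The sketch below is an assumption-laden example: `SAMPLE` is a hand-written excerpt that only imitates the text of the pre tag (the numbers are taken from the first row of the results further down), and the label wording in `PATTERNS` is assumed rather than copied from the live page. `parse_parameters` is a hypothetical helper name.

```python
import re

# Hand-written stand-in for the text of the <pre> tag; the real page has more lines
# and its exact wording should be checked before relying on these patterns.
SAMPLE = """Number of amino acids: 458

Molecular weight: 50457.08

Theoretical pI: 8.02

Instability index:

The instability index (II) is computed to be 49.64
This classifies the protein as unstable.

Aliphatic index: 101.27

Grand average of hydropathicity (GRAVY): -0.206
"""

# One regular expression per value, keyed by the column names used in this post.
PATTERNS = {
    "number_of_amine_acid": r"Number of amino acids:\s*(\d+)",
    "molecular_weight": r"Molecular weight:\s*([\d.]+)",
    "theoretical_pi": r"Theoretical pI:\s*([\d.]+)",
    "instability_index": r"computed to be\s*([\d.]+)",
    "aliphatic_index": r"Aliphatic index:\s*([\d.]+)",
    "gravy": r"\(GRAVY\):\s*(-?[\d.]+)",
}


def parse_parameters(text):
    """Return a dict of value strings; a label that is not found maps to None."""
    result = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        result[key] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    print(parse_parameters(SAMPLE))
```

A missing label shows up as None instead of silently returning a neighbouring value, which makes a changed page layout easier to notice.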
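issue: the script runs slowly, though it should be more accurate than copying by hand
solution: the server is probably overseas; there is no need to wait for the page to load completely before moving on to the next step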
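One way to avoid waiting for the full page is to poll only for the element you need and give up after a time limit. The helper below is a generic sketch in plain Python; `wait_until` and its parameters are names I made up. Selenium itself provides `WebDriverWait` in `selenium.webdriver.support.ui` for the same purpose, which is probably the better choice in a real script.

```python
import time


def wait_until(condition, timeout=30.0, interval=0.3, ignored=(Exception,)):
    """Call condition() until it returns a truthy value or timeout seconds pass.

    Exceptions listed in `ignored` count as "not ready yet", which suits element
    lookups that raise while the element is still missing.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            value = condition()
        except ignored:
            value = None
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(interval)


# Possible use with a Selenium driver named `expasy` (not executed here):
# pre = wait_until(lambda: expasy.find_element("xpath", '//*[@id="sib_body"]/pre[2]'))
```

Unlike a bare `while True` loop, this sleeps between attempts and stops with an error instead of spinning forever if the page never loads.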
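issue: I do not understand classes very well yet; the parameters are merged into a dictionary and each sequence is saved in JSON format
solution: a separate script is then needed to convert that output into the format I want
issue: the JSON output is malformed
solution: write tab-separated text directly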
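Writing the tab-separated file can also be left to the standard `csv` module, which takes care of the header and the delimiter. This is a minimal sketch under my own naming (`COLUMNS`, `write_tsv`); the column names match the ones used in this post and the example row is the first row of the results shown later.

```python
import csv
import io

COLUMNS = [
    "seq_id", "number_of_amine_acid", "molecular_weight", "theoretical_pi",
    "instability_index", "aliphatic_index", "gravy",
]


def write_tsv(rows, handle):
    """Write an iterable of dicts to `handle` as tab-separated text with a header."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
        handle.flush()  # keep finished rows on disk if a later sequence fails


if __name__ == "__main__":
    row = {
        "seq_id": "PgPUB56", "number_of_amine_acid": "458",
        "molecular_weight": "50457.08", "theoretical_pi": "8.02",
        "instability_index": "49.64", "aliphatic_index": "101.27",
        "gravy": "-0.206",
    }
    buffer = io.StringIO()
    write_tsv([row], buffer)
    print(buffer.getvalue())
```

Because `rows` can be a generator, each sequence is written as soon as it is computed, so a run that dies halfway still leaves usable output.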
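test: run on 58 sequences
result: 2681 s, about 45 min (roughly 46 s per sequence). Too slow... but acceptable, since the browser can be minimized while you do other things.
This script has only been tested on Windows and requires a correctly installed WebDriver driver (ChromeDriver, since it drives Chrome).
Getting the data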
#!/usr/bin/env python
# coding: utf-8
# usage: python script.py input_file output_file
import json  # only needed by the commented-out JSON output below
import re
import sys
import time

from Bio import SeqIO
from selenium import webdriver
from selenium.webdriver.common.by import By

st = time.time()
input_file = sys.argv[1]
out_file = sys.argv[2]
#import os
#os.chdir(r"C:\Users\Acer\Desktop\codee\python\expasy")
expasy = webdriver.Chrome()
expasy.get("https://web.expasy.org/protparam/")


class expasy_cal():
    '''get physical and chemical parameters for a given protein sequence file
    based on web https://web.expasy.org/protparam/'''

    # find_element(By.XPATH, ...) replaces find_element_by_xpath,
    # which was removed in Selenium 4.3.
    @staticmethod
    def inputseq(seq):
        """input the protein sequence"""
        time.sleep(0.3)
        while True:
            if expasy.find_element(By.XPATH, '//*[@id="sib_body"]/form/textarea').is_displayed():
                expasy.find_element(By.XPATH, '//*[@id="sib_body"]/form/p[1]/input[1]').click()  # start from a fresh page
                expasy.find_element(By.XPATH, '//*[@id="sib_body"]/form/textarea').send_keys(seq)
                expasy.find_element(By.XPATH, '//*[@id="sib_body"]/form/p[1]/input[2]').click()
                break
            else:
                print("input box is not displayed")
                time.sleep(0.3)  # avoid a busy loop while waiting

    @staticmethod
    def compute():
        """get the parameters shown on the result page"""
        time.sleep(0.3)  # wait for the page to load
        while True:
            if expasy.find_element(By.XPATH, '//*[@id="sib_body"]/h2').is_displayed():
                pd = {}
                # split the text of the pre tag into the individual parameters
                parameters = expasy.find_element(By.XPATH, '//*[@id="sib_body"]/pre[2]').text.split("\n\n")
                aaa = '\n'.join(parameters)
                bbb = re.split("[:\n]", aaa)  # separate parameter names from values
                pd["number_of_amine_acid"] = bbb[1].strip()
                pd["molecular_weight"] = bbb[3].strip()
                pd["theoretical_pi"] = bbb[5].strip()
                # findall is used here because filter() returns an iterator object in Python 3
                pd["instability_index"] = re.findall(r"[\d.]+", bbb[66])[0]
                pd["aliphatic_index"] = bbb[70].strip()
                pd["gravy"] = bbb[72].strip()
                return pd
            else:
                print("loading")
                time.sleep(0.3)  # avoid a busy loop while waiting


columns = ['number_of_amine_acid', 'molecular_weight', 'theoretical_pi',
           'instability_index', 'aliphatic_index', 'gravy']

with open(out_file, "w", encoding='utf-8') as f:
    f.write('{}\t{}\t{}\t{}\t{}\t{}\t{}\n'.format('seq_id', *columns))
    pros = SeqIO.parse(input_file, "fasta")
    for i, pro in enumerate(pros):
        print("=" * 10, "seq", i + 1, "->", pro.id, "on the way", "=" * 10)
        expasy_cal.inputseq(seq=str(pro.seq))  # send_keys expects plain text
        cccc = expasy_cal.compute()
        f.write('{}\t{}\t{}\t{}\t{}\t{}\t{}\n'.format(
            pro.id, *[cccc[name] for name in columns]))
        #single_id_pd = {pro.id:cccc}  # wrap one sequence's results in a dict; the output does not seem to be valid JSON
        #print(single_id_pd)
        #json.dump(single_id_pd,f,indent = 4,ensure_ascii=False)  # does writing have to wait until everything is finished?
        #f.write(json.dumps(single_id_pd,indent = 4,ensure_ascii=False)+"\n")
        #expasy.get("https://web.expasy.org/protparam/")
        expasy.back()  # back() also seems to reload the page, so it is slow too

expasy.quit()  # close the browser and end the driver session
et = time.time()
print("process finished")
print("taking time", et - st, "s")
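Since one goal of this exercise was to get more comfortable with classes, here is how the same page interaction might look with an instance that owns the driver instead of relying on a global variable. This is a sketch, not a tested replacement: `ProtParamPage`, `submit` and `result_text` are names I invented, the XPath strings are the ones used in the script above, and `"xpath"` is the string value behind Selenium's `By.XPATH`.

```python
class ProtParamPage:
    """Groups the ProtParam form actions around one driver object."""

    TEXTAREA = '//*[@id="sib_body"]/form/textarea'
    SUBMIT = '//*[@id="sib_body"]/form/p[1]/input[2]'
    RESULT = '//*[@id="sib_body"]/pre[2]'

    def __init__(self, driver):
        # The driver is passed in, so the class does not depend on a global.
        self.driver = driver

    def submit(self, seq):
        """Type the sequence into the form and press the compute button."""
        box = self.driver.find_element("xpath", self.TEXTAREA)
        box.clear()
        box.send_keys(str(seq))
        self.driver.find_element("xpath", self.SUBMIT).click()

    def result_text(self):
        """Return the raw text of the result block."""
        return self.driver.find_element("xpath", self.RESULT).text


# Possible use (not executed here):
# page = ProtParamPage(webdriver.Chrome())
# page.submit(pro.seq)
# text = page.result_text()
```

Passing the driver to `__init__` also makes the class easy to try out with a stand-in object that only implements `find_element`, without opening a browser.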
Results
seq_id   number_of_amine_acid  molecular_weight  theoretical_pi  instability_index  aliphatic_index  gravy
PgPUB56  458                   50457.08          8.02            49.64              101.27           -0.206
PgPUB54  688                   75204.03          8.46            50.28              108.21           -0.023
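Converting the JSON output to a two-dimensional table
The JSON was probably stored in a malformed way, so it is simpler to store the results as tab-separated text directly.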
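If JSON output is still wanted, a likely cause of the malformed file is that several indented objects were dumped one after another, which does not add up to a single valid JSON document. One common workaround is to write one compact object per line ("JSON Lines") and convert that to a table afterwards. The sketch below assumes that layout; `append_record` and `jsonl_to_tsv` are hypothetical helper names.

```python
import io
import json


def append_record(handle, seq_id, params):
    """Write one sequence's results as a single-line JSON object."""
    record = {"seq_id": seq_id}
    record.update(params)
    handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def jsonl_to_tsv(lines, columns):
    """Turn JSON Lines text into a tab-separated table with a header row."""
    rows = ["\t".join(columns)]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rows.append("\t".join(str(record.get(name, "")) for name in columns))
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    buffer = io.StringIO()
    append_record(buffer, "PgPUB56", {"theoretical_pi": "8.02", "gravy": "-0.206"})
    append_record(buffer, "PgPUB54", {"theoretical_pi": "8.46", "gravy": "-0.023"})
    print(jsonl_to_tsv(buffer.getvalue().splitlines(),
                       ["seq_id", "theoretical_pi", "gravy"]))
```

Each line can be parsed on its own, so records can be appended as they are computed and a partly written file remains readable.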