本文水平有點(diǎn)low邓深,本來不打算寫這么低水平的文章整胃,怎奈我的python水平就在這么low的水平巢掺。以此貼記錄初試python解決生信簡單問題的學(xué)習(xí)過程钻注,也算讓自己看清楚自己python到底有多渣淆九。
python平時自用的多统锤,就是非工作場合的自我學(xué)習(xí)期間就可以簡單寫點(diǎn)毛俏,或者清洗規(guī)整一下數(shù)據(jù)。自我約束不到位饲窿,代碼日漸奔放拧抖,已經(jīng)越來越像perl的編碼習(xí)慣了,$a $b $c免绿。唧席。。嘲驾。淌哟。?辽故?
題目完整版來自:http://rosalind.info/problems/list-view/徒仓;。每個題可能有多種解法誊垢,不同解法用分別用## 1/2/3表示掉弛。如果你也同我一樣剛用python處理生信數(shù)據(jù)的話,請務(wù)必先自己做一遍再看文中代碼喂走。
1. 計算序列中各堿基數(shù)目
test.txt文件:
GAGCCTACTAACGGGAT
CATCGTAATGACGGCCT
# 初級解法
#!/usr/bin/env python3
nts = {c:0 for c in 'ATGC'}
with open('./test.txt','r') as f:
for a in f:
a = a.upper()
for nt in a.rstrip():
nts[nt] += 1
print (nts)
# 進(jìn)階解題
test = open('test.txt','r').readlines()
[len(re.findall(b,''.join(test))) for b in ['A','T','G','C']]
2. 將DNA序列轉(zhuǎn)化為RNA序列
#初級1
import re
with open('./test.txt','r') as f:
for line in f:
line = line.upper()
RnaSeq = re.sub('T','U',line.rstrip())
print(RnaSeq)
#進(jìn)階1
with open('./test.txt','r') as f:
for line in f:
line = line.upper()
print(line.replace('T','U'))
#進(jìn)階2
with open('test.txt','r') as f:
for line in f:
print(re.sub('T','U',line.rstrip().upper()))
3. 獲取序列的反向互補(bǔ)序列
#初級1
trans = {'A':'T','T':'A','G':'C','C':'G'}
with open('./test.txt','r') as f:
for line in f:
seq = ''
line = line.upper()
for aa in line.rstrip():
seq += trans.get(aa)
print(seq[::-1])
#初級2
def reverse_complement(seq):
ntComplement = {'A':'T','T':'A','G':'C','C':'G'}
RevSeqList = list(reversed(seq))
RevComSeqList = [ntComplement[k] for k in RevSeqList]
RevComSeq = ''.join(RevComSeqList)
return RevComSeq
seq = ''
with open('./test.txt','r') as f:
for line in f:
line = line.upper()
print (reverse_complement(line.rstrip()))
# 進(jìn)階
trans = {'A':'T','T':'A','G':'C','C':'G'}
with open('test.txt','r') as f:
for line in f:
tmp_list = list(reversed(line.strip()))
tmp_list2 = [trans[a] for a in tmp_list]
print(''.join(tmp_list2))
4. 找出fasta文件中GC含量最大的序列
#初級1
import re
Seq = {}
seqGC = {}
with open('./test.fa','r') as f:
for line in f:
if re.match(">",line):
SeqName = line[1:]
Seq[SeqName] = ''
seqGC[SeqName] = 0
else:
line = line.upper()
line = line.rstrip()
Seq[SeqName] += line
seqGC[SeqName] += line.count('G')
seqGC[SeqName] += line.count('C')
maxGC = 0
for key , value in Seq.items():
if maxGC < float(seqGC[key]/ len(value)*100):
maxGC = float(seqGC[key] / len(value)*100)
tmp = key
print ('>'+tmp+Seq[tmp])
# 進(jìn)階初版
from operator import itemgetter
from collections import OrderedDict
SeqTest = OrderedDict()
GcContent = OrderedDict()
with open('./test.fa','r') as f:
for line in f:
line = line.rstrip()
if line.startswith('>'):
SeqName = line[1:]
SeqTest[SeqName] = ''
continue
SeqTest[SeqName] += line.upper()
for key, value in SeqTest.items():
totalLength = len(value)
gcNum = value.count('G') + value.count('C')
gcContent[key] = float(gcNum/totalLength)*100
sortedGC = sorted(gcContent.items(),key = itemgetter(1))
largeName = sortedGC[-1][0]
largeGCcontent = sortedGC[-1][1]
print ('most GC ratio rate is %s and it is %s ' %(largeName,largeGCContent))
# 進(jìn)階高板
seq = {}
seq_gc = {}
with open('test.fa','r') as f:
for line in f:
line = line.strip()
if re.match('>',line):
seq_id = re.sub('>', '',line)
seq[seq_id] = ''
else:
seq[seq_id] = seq[seq_id] + line.upper()
for key,s in seq.items():
seq_gc[key] = (s.count('G') + s.count('C'))/len(s)
print(max(seq_gc,key=seq_gc.get) + '\t' + max(seq_gc.values()))
5. 計算點(diǎn)突變數(shù)目
給兩個長度為t的序列s殃饿,t和s之間的哈明距離(Hamming distance)定義為dH(s,t)。該問題即返回兩條序列的哈明距離芋肠。
## 初版
fh = open('./test.txt','r')
lst = []
for line in fh:
lst.append(line.rstrip())
hamming_dis = 0
for i in range(len(lst[0])):
if lst[0][i] == lst[1][i]:
continue
hamming_dis += 1
print (hamming_dis)
#進(jìn)階1
fh = open('./test.txt','r')
seq = file.readlines()
seq1, seq2 = seq[0].strip(), seq[1].strip()
mutation = [i for i in range(len(seq1)) if seq1[i] != seq2[i]]
print (len(mutation))
#進(jìn)階2
import numpy as np
import pandas as pd
sequence = []
number = 0
with open('test.txt','r') as f:
for line in f:
sequence.append(list(line.strip().upper()))
frame = pd.DataFrame(np.asarray(sequence))
for i in range(len(frame.columns)):
if frame[i][0] != frame[i][1]:
number = number + 1
print(number)
6. 孟德爾第一定理
一個群體中有三種基因型的生物:k,顯性純合子乎芳;m,雜合子;n,隱性純合子帖池。假設(shè)這對形狀由一對等位基因控制奈惑,且群體中隨機(jī)選取的任何兩個個體都能交配,求隨機(jī)選取兩個個體交配后睡汹,子代擁有顯性等位基因的概率肴甸。
#初版
k = int(input("enter the number of homozygous dominant: "))
m = int(input("enter the number of heterozygous: "))
n = int(input("enter the number of homozygous recessive: "))
num = int(k + m + n)
choice = num*(num-1)/2.0
p = 1 - (n*(n-1)/2 + 0.25*m*(m-1)/2 + m*n*0.5)/choice
print(p)
#進(jìn)階1
from scipy.misc import comb
num = input("Number of individuals(k,m,n): ")
[k,m,n] = map(int,num.split(','))
t = k + m + n
rr = comb(n,2)/comb(t,2)
hh = comb(m,2)/comb(t,2)
hr = comb(n,1)*comb(m,1)/comb(t,2)
p = 1 - (rr+hh*1/4+hr*1/2)
print(p)
7. 將RNA翻譯成蛋白質(zhì)
def translate_rna(sequence):
codonTable = {
'AUA':'I', 'AUC':'I', 'AUU':'I', 'AUG':'M',
'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACU':'T',
'AAC':'N', 'AAU':'N', 'AAA':'K', 'AAG':'K',
'AGC':'S', 'AGU':'S', 'AGA':'R', 'AGG':'R',
'CUA':'L', 'CUC':'L', 'CUG':'L', 'CUU':'L',
'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCU':'P',
'CAC':'H', 'CAU':'H', 'CAA':'Q', 'CAG':'Q',
'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGU':'R',
'GUA':'V', 'GUC':'V', 'GUG':'V', 'GUU':'V',
'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCU':'A',
'GAC':'D', 'GAU':'D', 'GAA':'E', 'GAG':'E',
'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGU':'G',
'UCA':'S', 'UCC':'S', 'UCG':'S', 'UCU':'S',
'UUC':'F', 'UUU':'F', 'UUA':'L', 'UUG':'L',
'UAC':'Y', 'UAU':'Y', 'UAA':'', 'UAG':'',
'UGC':'C', 'UGU':'C', 'UGA':'', 'UGG':'W',
}
proteinsequence = ''
for n in range(0,len(sequence),3):
if sequence[n:n+3] in codonTable.keys():
proteinsequence += codonTable[sequence[n:n+3]]
return proteinsequence
protein_fh = open('./protein.txt','w')
with open('./rna.txt','r') as f:
for line in f:
protein_fh.write(translate_rna(line.strip('\n')))
## 進(jìn)階
import re
from collections import OrderedDict
codonTable = OrderedDict()
codonTable={
'AUA':'I','AUC':'I','AUU':'I','AUG':'M',
'ACA':'T','ACC':'T','ACG':'T','ACU':'T',
'AAC':'N','AAU':'N','AAA':'K','AAG':'K',
'AGC':'S','AGU':'S','AGA':'R','AGG':'R',
'CUA':'L','CUC':'L','CUG':'L','CUU':'L',
'CCA':'P','CCC':'P','CCG':'P','CCU':'P',
'CAC':'H','CAU':'H','CAA':'Q','CAG':'Q',
'CGA':'R','CGC':'R','CGG':'R','CGU':'R',
'GUA':'V','GUC':'V','GUG':'V','GUU':'V',
'GCA':'A','GCC':'A','GCG':'A','GCU':'A',
'GAC':'D','GAU':'D','GAA':'E','GAG':'E',
'GGA':'G','GGC':'G','GGG':'G','GGU':'G',
'UCA':'S','UCC':'S','UCG':'S','UCU':'S',
'UUC':'F','UUU':'F','UUA':'L','UUG':'L',
'UAC':'Y','UAU':'Y','UAA':'','UAG':'',
'UGC':'C','UGU':'C','UGA':'','UGG':'W',
}
rnaseq = ''
with open('./rna.txt','r') as f:
for line in f:
line = line.rstrip()
line += line.upper()
aminoAcids = []
i = 0
while i < len(rnaseq):
condon = rnaseq[i:i+3]
if codonTable[condon] != '':
aminoAcids.append(codonTable[condon])
i += 3
peptide = ''.join(aminoAcids)
print(peptide)
## 3
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna,generic_rna
# translate
rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", generic_rna)
print(rna.translate())
8. 尋找DNA motif
## 1
seq = 'GATATATGCATATACTT'
motif = 'ATAT'
motif_len = len(motif)
position = []
for i in range(len(seq)-motif_len):
if seq[i:i+motif_len] == motif:
position.append(i+1)
print(position)
## 2
import re
seq = 'GATATATGCATATACTT'
print([i.start()+1 for i in re.finditer('(?=ATAT)',seq)])
9. 多個等長序列的一致性序列
比如序列如下:
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
各位點(diǎn)堿基個數(shù):
A 5 1 0 0 5 5 0 0
C 0 0 1 4 2 0 6 1
G 1 1 6 3 0 1 0 0
T 1 5 0 0 0 1 1 6
Consensus A T G C A A C T
#1 初階
def seq_list(fasta):
seq_list = []
for line in fasta.readlines():
if not line.startswith('>'):
seq = list(line.rstrip())
seq_list.append(seq)
return seq_list
def statistic_base(seq_list):
for base in 'ATGC':
base_total = []
for sit in range(len(seq_list[0])):
col = [x[sit] for x in seq_list]
num = col.count(base)
base_total.append(num)
print('%s:%s'%(base,base_total))
fh = open('./test.fa','r')
sequence_list = seq_list(fh)
statistic_base(sequence_list)
# 進(jìn)階 初級
from collections import Counter
from collections import OrderedDict
seq = OrderedDict()
seqLength = 0
fh = open('./test.consensus.txt','wt')
with open('./test.fa','r') as f:
for line in f:
if line.startswith('>'):
seq_name = line.rstrip()
seq[seq_name] = ''
continue
seq[seq_name] += line.upper().rstrip()
seqLength = len(seq[seq_name])
a,t,g,c = [],[],[],[]
consensus = ''
for i in range(seqLength):
sequence = ''
for j in seq.keys():
sequence += seq[j][i]
a.append(sequence.count('A'))
t.append(sequence.count('T'))
g.append(sequence.count('G'))
c.append(sequence.count('C'))
counts = Counter(sequence)
consensus += counts.most_common()[0][0]
fh.write(consensus+'\n')
fh.write('\n'.join(['A:\t'+'\t'.join(map(str,a)),'C:\t'+'\t'.join(map(str,c)),'G:\t'+'\t'.join(map(str,g)),'T:\t'+'\t'.join(map(str,t))])+'\n')
fh.close()
# 進(jìn)階高級
seq = []
seq_dict = {}
with open('t.list','r') as f:
for line in f:
if not line.startswith('>'):
seq.append(list(line.strip().upper()))
for aa in ['A','T','G','C']:
seq_dict[aa] = [list(frame[i]).count(aa) for i in range(len(frame.columns))]
print(''.join(list(pd.DataFrame(seq).idxmax(axis=1))))
10. 致命的斐波那契兔子
斐波那契序列是一個序列的數(shù)字定義的遞歸關(guān)系Fn = Fn-1+ Fn?2 ,我們設(shè)置的起始值F1 = F2 = 1。
假設(shè)每只兔子可以活m個月囚巴,n個月后有多少只兔子原在?
## 1
def fib(n,m):
f= [0,1,1]
for i in range(3,n+1):
if i <= m:
total = f[i-1] + f[i-2]
elif i == m+1:
total = f[i-1] + f[i-2] - 1
else:
total = f[i-1] + f[i-2] - f[i-m-1]
f.append(total)
return(f[n])
inp = input('live month of rabbit(m),and afther n-th month;n<=100,m<=20;input(n,m): ')
[n,m]=map(int,inp.split(','))
print(fib(n,m))
11. Graph Theory
文件介紹好麻煩,自己看:http://rosalind.info/problems/grph/
總之該題有三個堿基的首尾相同就連接起來文兢,
輸入文件:
>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG
輸出結(jié)果:
Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323
seq = {}
with open('./overlap.fa','r') as f:
for line in f:
line = line.rstrip()
if line.startswith('>'):
seqname = line[1:]
seq[seqname] = ''
continue
seq[seqname] += line.upper()
for key , value in seq.items():
for key2 ,value2 in seq.items():
if key != key2 and value[-3:] == value2[:3]:
print(key+'\t'+key2)
12. 計算后代的期望值
同樣懶得解釋原理晤斩,具體原理看:http://rosalind.info/problems/iev/
現(xiàn)在有6種基因型組合夫婦:
AA-AA
AA-Aa
AA-aa
Aa-Aa
Aa-aa
aa-aa
給定6個非負(fù)整數(shù),代表6種基因型組合的夫婦數(shù)量姆坚,求下一代顯性性狀的個數(shù),假設(shè)每對夫妻有2個孩子实愚。
def expected(a,b,c,d,f,g):
AA_AA = 1
AA_Aa = 1
AA_aa = 1
Aa_Aa = 0.75
Aa_aa = 0.5
aa_aa = 0
p = (AA_AA*a + AA_Aa*b + AA_aa*c + Aa_Aa*d + Aa_aa*f + aa_aa*g)*2
return (p)
inp = input('input(a,b,c,d,f,g): ')
[a,b,c,d,f,g] = map(int,inp.split(','))
print(expected(a,b,c,d,f,g))
13. 計算序列間的最長一致性序列兼呵,即尋找序列間公有的motif(Finding a Shared Motif)
# 測試數(shù)據(jù)
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
# 輸出結(jié)果:
AC