生物信息python 入門題 初級至進(jìn)階

本文水平有點(diǎn)low邓深,本來不打算寫這么低水平的文章整胃,怎奈我的python水平就在這么low的水平巢掺。以此貼記錄初試python解決生信簡單問題的學(xué)習(xí)過程钻注,也算讓自己看清楚自己python到底有多渣淆九。
python平時自用的多统锤,就是非工作場合的自我學(xué)習(xí)期間就可以簡單寫點(diǎn)毛俏,或者清洗規(guī)整一下數(shù)據(jù)。自我約束不到位饲窿,代碼日漸奔放拧抖,已經(jīng)越來越像perl的編碼習(xí)慣了,$a $b $c免绿。唧席。。嘲驾。淌哟。?辽故?

題目完整版來自:http://rosalind.info/problems/list-view/徒仓;。每個題可能有多種解法誊垢,不同解法用分別用## 1/2/3表示掉弛。如果你也同我一樣剛用python處理生信數(shù)據(jù)的話,請務(wù)必先自己做一遍再看文中代碼喂走。

1. 計算序列中各堿基數(shù)目

test.txt文件:

GAGCCTACTAACGGGAT
CATCGTAATGACGGCCT
# 初級解法
#!/usr/bin/env python3
nts = {c:0 for c in 'ATGC'}
with open('./test.txt','r') as f:
  for a in f:
    a = a.upper()
    for nt in a.rstrip():
      nts[nt] += 1
print (nts)

# 進(jìn)階解題
test = open('test.txt','r').readlines()
[len(re.findall(b,''.join(test))) for b in ['A','T','G','C']]

2. 將DNA序列轉(zhuǎn)化為RNA序列

#初級1
import re
with open('./test.txt','r') as f:
  for line in f:
    line = line.upper()
    RnaSeq = re.sub('T','U',line.rstrip())
print(RnaSeq)
#進(jìn)階1
with open('./test.txt','r') as f:
  for line in f:
    line = line.upper()
    print(line.replace('T','U'))
#進(jìn)階2
with open('test.txt','r') as f:
    for line in f:
            print(re.sub('T','U',line.rstrip().upper()))

3. 獲取序列的反向互補(bǔ)序列

#初級1
trans = {'A':'T','T':'A','G':'C','C':'G'}
with open('./test.txt','r') as f:
  for line in f:
    seq = ''
    line = line.upper()
    for aa in line.rstrip():
      seq += trans.get(aa)
     print(seq[::-1])
#初級2
def reverse_complement(seq):
  ntComplement = {'A':'T','T':'A','G':'C','C':'G'}
  RevSeqList = list(reversed(seq))
  RevComSeqList = [ntComplement[k] for k in RevSeqList]
  RevComSeq = ''.join(RevComSeqList)
  return RevComSeq

seq = ''
with open('./test.txt','r') as f:
  for line in f:
    line = line.upper()
print (reverse_complement(line.rstrip()))
# 進(jìn)階
trans = {'A':'T','T':'A','G':'C','C':'G'}
with open('test.txt','r') as f:
    for line in f:
        tmp_list = list(reversed(line.strip()))
        tmp_list2 = [trans[a] for a in tmp_list]
        print(''.join(tmp_list2))

4. 找出fasta文件中GC含量最大的序列

#初級1
import re
Seq = {}
seqGC = {}
with open('./test.fa','r') as f:
        for line in f:
                if re.match(">",line):
                        SeqName = line[1:]
                        Seq[SeqName] = ''
                        seqGC[SeqName] = 0
                else:
                        line = line.upper()
                        line = line.rstrip()
                        Seq[SeqName] += line
                        seqGC[SeqName] += line.count('G')
                        seqGC[SeqName] += line.count('C')
maxGC = 0
for key , value in Seq.items():
        if maxGC < float(seqGC[key]/ len(value)*100):
                maxGC = float(seqGC[key] / len(value)*100)
                tmp = key
print ('>'+tmp+Seq[tmp])
# 進(jìn)階初版
from operator import itemgetter
from collections import OrderedDict
SeqTest = OrderedDict()
GcContent = OrderedDict()
with open('./test.fa','r') as f:
        for line in f:
                line = line.rstrip()
                if line.startswith('>'):
                        SeqName = line[1:]
                        SeqTest[SeqName] = ''
                        continue
                SeqTest[SeqName] += line.upper()

for key, value in SeqTest.items():
        totalLength = len(value)
        gcNum = value.count('G') + value.count('C')
        gcContent[key] = float(gcNum/totalLength)*100
sortedGC = sorted(gcContent.items(),key = itemgetter(1))
largeName = sortedGC[-1][0]
largeGCcontent = sortedGC[-1][1]
print ('most GC ratio rate is %s and it is %s ' %(largeName,largeGCContent))
# 進(jìn)階高板
seq = {}
seq_gc = {}
with open('test.fa','r') as f:
    for line in f:
        line = line.strip()
        if re.match('>',line):
            seq_id = re.sub('>', '',line)
            seq[seq_id] = ''
        else:
            seq[seq_id] = seq[seq_id] + line.upper()

for key,s in seq.items():
    seq_gc[key] = (s.count('G') + s.count('C'))/len(s)
    print(max(seq_gc,key=seq_gc.get) + '\t' + max(seq_gc.values()))

5. 計算點(diǎn)突變數(shù)目

給兩個長度為t的序列s殃饿,t和s之間的哈明距離(Hamming distance)定義為dH(s,t)。該問題即返回兩條序列的哈明距離芋肠。


image
## 初版
fh = open('./test.txt','r')
lst = []
for line in fh:
        lst.append(line.rstrip())
hamming_dis = 0
for i in range(len(lst[0])):
        if lst[0][i] == lst[1][i]:
                continue
        hamming_dis += 1
print (hamming_dis)
#進(jìn)階1
fh = open('./test.txt','r')
seq = file.readlines()
seq1, seq2 = seq[0].strip(), seq[1].strip()
mutation = [i for i in range(len(seq1)) if seq1[i] != seq2[i]]
print (len(mutation))
#進(jìn)階2 
import numpy as np
import pandas as pd
sequence = []
number = 0
with open('test.txt','r') as f:
    for line in f:
        sequence.append(list(line.strip().upper()))
frame = pd.DataFrame(np.asarray(sequence))
for i in range(len(frame.columns)):
    if frame[i][0] != frame[i][1]:
        number = number + 1
print(number)

6. 孟德爾第一定理

一個群體中有三種基因型的生物:k,顯性純合子乎芳;m,雜合子;n,隱性純合子帖池。假設(shè)這對形狀由一對等位基因控制奈惑,且群體中隨機(jī)選取的任何兩個個體都能交配,求隨機(jī)選取兩個個體交配后睡汹,子代擁有顯性等位基因的概率肴甸。

#初版
k = int(input("enter the number of homozygous dominant: "))
m = int(input("enter the number of heterozygous: "))
n = int(input("enter the number of homozygous recessive: "))

num = int(k + m + n)
choice = num*(num-1)/2.0
p = 1 - (n*(n-1)/2 + 0.25*m*(m-1)/2 + m*n*0.5)/choice
print(p)
#進(jìn)階1
from scipy.misc import comb
num = input("Number of individuals(k,m,n): ")
[k,m,n] = map(int,num.split(','))
t = k + m + n
rr = comb(n,2)/comb(t,2)
hh = comb(m,2)/comb(t,2)
hr = comb(n,1)*comb(m,1)/comb(t,2)

p = 1 - (rr+hh*1/4+hr*1/2)
print(p)

7. 將RNA翻譯成蛋白質(zhì)

def translate_rna(sequence):
    codonTable = {
    'AUA':'I', 'AUC':'I', 'AUU':'I', 'AUG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACU':'T',
    'AAC':'N', 'AAU':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGU':'S', 'AGA':'R', 'AGG':'R',
    'CUA':'L', 'CUC':'L', 'CUG':'L', 'CUU':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCU':'P',
    'CAC':'H', 'CAU':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGU':'R',
    'GUA':'V', 'GUC':'V', 'GUG':'V', 'GUU':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCU':'A',
    'GAC':'D', 'GAU':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGU':'G',
    'UCA':'S', 'UCC':'S', 'UCG':'S', 'UCU':'S',
    'UUC':'F', 'UUU':'F', 'UUA':'L', 'UUG':'L',
    'UAC':'Y', 'UAU':'Y', 'UAA':'', 'UAG':'',
    'UGC':'C', 'UGU':'C', 'UGA':'', 'UGG':'W',
    }
    proteinsequence = ''
    for n in range(0,len(sequence),3):
        if sequence[n:n+3] in codonTable.keys():
            proteinsequence += codonTable[sequence[n:n+3]]
    return proteinsequence


protein_fh = open('./protein.txt','w')
with open('./rna.txt','r') as f:
        for line in f:
                protein_fh.write(translate_rna(line.strip('\n')))
## 進(jìn)階
import re
from collections import OrderedDict

codonTable = OrderedDict()
codonTable={
'AUA':'I','AUC':'I','AUU':'I','AUG':'M',
'ACA':'T','ACC':'T','ACG':'T','ACU':'T',
'AAC':'N','AAU':'N','AAA':'K','AAG':'K',
'AGC':'S','AGU':'S','AGA':'R','AGG':'R',
'CUA':'L','CUC':'L','CUG':'L','CUU':'L',
'CCA':'P','CCC':'P','CCG':'P','CCU':'P',
'CAC':'H','CAU':'H','CAA':'Q','CAG':'Q',
'CGA':'R','CGC':'R','CGG':'R','CGU':'R',
'GUA':'V','GUC':'V','GUG':'V','GUU':'V',
'GCA':'A','GCC':'A','GCG':'A','GCU':'A',
'GAC':'D','GAU':'D','GAA':'E','GAG':'E',
'GGA':'G','GGC':'G','GGG':'G','GGU':'G',
'UCA':'S','UCC':'S','UCG':'S','UCU':'S',
'UUC':'F','UUU':'F','UUA':'L','UUG':'L',
'UAC':'Y','UAU':'Y','UAA':'','UAG':'',
'UGC':'C','UGU':'C','UGA':'','UGG':'W',
}

rnaseq = ''
with open('./rna.txt','r') as f:
        for line in f:
                line = line.rstrip()
                line += line.upper()

aminoAcids = []
i = 0
while i < len(rnaseq):
        condon = rnaseq[i:i+3]
        if codonTable[condon] != '':
                aminoAcids.append(codonTable[condon])
        i += 3

peptide = ''.join(aminoAcids)
print(peptide)
## 3
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna,generic_rna

# translate
rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", generic_rna)
print(rna.translate())

8. 尋找DNA motif

## 1 
seq = 'GATATATGCATATACTT'
motif = 'ATAT'
motif_len = len(motif)
position = []
for i in range(len(seq)-motif_len):
        if seq[i:i+motif_len] == motif:
                position.append(i+1)
print(position)
## 2
import re
seq = 'GATATATGCATATACTT'
print([i.start()+1 for i in re.finditer('(?=ATAT)',seq)])

9. 多個等長序列的一致性序列

比如序列如下:

>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

各位點(diǎn)堿基個數(shù):

A   5 1 0 0 5 5 0 0
C   0 0 1 4 2 0 6 1
G   1 1 6 3 0 1 0 0
T   1 5 0 0 0 1 1 6
Consensus   A T G C A A C T
#1 初階
def seq_list(fasta):
        seq_list = []
        for line in fasta.readlines():
                if not line.startswith('>'):
                        seq = list(line.rstrip())
                        seq_list.append(seq)
        return seq_list
def statistic_base(seq_list):
        for base in 'ATGC':
                base_total = []
                for sit in range(len(seq_list[0])):
                        col = [x[sit] for x in seq_list]
                        num = col.count(base)
                        base_total.append(num)
                print('%s:%s'%(base,base_total))
fh =  open('./test.fa','r')
sequence_list = seq_list(fh)
statistic_base(sequence_list)
# 進(jìn)階 初級
from collections import Counter
from collections import OrderedDict
seq = OrderedDict()
seqLength = 0
fh = open('./test.consensus.txt','wt')

with open('./test.fa','r') as f:
        for line in f:
                if line.startswith('>'):
                        seq_name = line.rstrip()
                        seq[seq_name] = ''
                        continue
                seq[seq_name] += line.upper().rstrip()
        seqLength = len(seq[seq_name])

a,t,g,c = [],[],[],[]
consensus = ''
for i in range(seqLength):
        sequence = ''
        for j in seq.keys():
                sequence += seq[j][i]
        a.append(sequence.count('A'))
        t.append(sequence.count('T'))
        g.append(sequence.count('G'))
        c.append(sequence.count('C'))
        counts = Counter(sequence)
        consensus += counts.most_common()[0][0]
fh.write(consensus+'\n')
fh.write('\n'.join(['A:\t'+'\t'.join(map(str,a)),'C:\t'+'\t'.join(map(str,c)),'G:\t'+'\t'.join(map(str,g)),'T:\t'+'\t'.join(map(str,t))])+'\n')
fh.close()

# 進(jìn)階高級
seq = []
seq_dict = {}
with open('t.list','r') as f:
    for line in f:
        if not line.startswith('>'):
            seq.append(list(line.strip().upper()))          
for aa in ['A','T','G','C']:
    seq_dict[aa] = [list(frame[i]).count(aa) for i in range(len(frame.columns))]
     print(''.join(list(pd.DataFrame(seq).idxmax(axis=1))))

10. 致命的斐波那契兔子

斐波那契序列是一個序列的數(shù)字定義的遞歸關(guān)系Fn = Fn-1+ Fn?2 ,我們設(shè)置的起始值F1 = F2 = 1。
假設(shè)每只兔子可以活m個月囚巴,n個月后有多少只兔子原在?

## 1
def fib(n,m):
        f= [0,1,1]
        for i in range(3,n+1):
                if i <= m:
                        total = f[i-1] + f[i-2]
                elif i == m+1:
                        total = f[i-1] + f[i-2] - 1
                else:
                        total = f[i-1] + f[i-2] - f[i-m-1]
                f.append(total)
        return(f[n])

inp = input('live month of rabbit(m),and afther n-th month;n<=100,m<=20;input(n,m): ')
[n,m]=map(int,inp.split(','))

print(fib(n,m))

11. Graph Theory

文件介紹好麻煩,自己看:http://rosalind.info/problems/grph/
總之該題有三個堿基的首尾相同就連接起來文兢,
輸入文件:

>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG

輸出結(jié)果:

Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323
seq = {}
with open('./overlap.fa','r') as f:
        for line in f:
                line = line.rstrip()
                if line.startswith('>'):
                        seqname = line[1:]
                        seq[seqname] = ''
                        continue
                seq[seqname] += line.upper()

for key , value in seq.items():
        for key2 ,value2 in seq.items():
                if key != key2 and value[-3:] == value2[:3]:
                        print(key+'\t'+key2)

12. 計算后代的期望值

同樣懶得解釋原理晤斩,具體原理看:http://rosalind.info/problems/iev/
現(xiàn)在有6種基因型組合夫婦:

AA-AA
AA-Aa
AA-aa
Aa-Aa
Aa-aa
aa-aa

給定6個非負(fù)整數(shù),代表6種基因型組合的夫婦數(shù)量姆坚,求下一代顯性性狀的個數(shù),假設(shè)每對夫妻有2個孩子实愚。

def expected(a,b,c,d,f,g):
        AA_AA = 1
        AA_Aa = 1
        AA_aa = 1
        Aa_Aa = 0.75
        Aa_aa = 0.5
        aa_aa = 0
        p = (AA_AA*a + AA_Aa*b + AA_aa*c + Aa_Aa*d + Aa_aa*f + aa_aa*g)*2
        return (p)

inp = input('input(a,b,c,d,f,g): ')
[a,b,c,d,f,g] = map(int,inp.split(','))
print(expected(a,b,c,d,f,g))

13. 計算序列間的最長一致性序列兼呵,即尋找序列間公有的motif(Finding a Shared Motif)

# 測試數(shù)據(jù)
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA

# 輸出結(jié)果:
AC
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末兔辅,一起剝皮案震驚了整個濱河市,隨后出現(xiàn)的幾起案子击喂,更是在濱河造成了極大的恐慌维苔,老刑警劉巖,帶你破解...
    沈念sama閱讀 206,839評論 6 482
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件懂昂,死亡現(xiàn)場離奇詭異介时,居然都是意外死亡,警方通過查閱死者的電腦和手機(jī)凌彬,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,543評論 2 382
  • 文/潘曉璐 我一進(jìn)店門沸柔,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人铲敛,你說我怎么就攤上這事褐澎。” “怎么了伐蒋?”我有些...
    開封第一講書人閱讀 153,116評論 0 344
  • 文/不壞的土叔 我叫張陵工三,是天一觀的道長。 經(jīng)常有香客問我先鱼,道長俭正,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 55,371評論 1 279
  • 正文 為了忘掉前任焙畔,我火速辦了婚禮段审,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘闹蒜。我一直安慰自己寺枉,他們只是感情好,可當(dāng)我...
    茶點(diǎn)故事閱讀 64,384評論 5 374
  • 文/花漫 我一把揭開白布绷落。 她就那樣靜靜地躺著姥闪,像睡著了一般。 火紅的嫁衣襯著肌膚如雪砌烁。 梳的紋絲不亂的頭發(fā)上筐喳,一...
    開封第一講書人閱讀 49,111評論 1 285
  • 那天,我揣著相機(jī)與錄音函喉,去河邊找鬼避归。 笑死,一個胖子當(dāng)著我的面吹牛管呵,可吹牛的內(nèi)容都是我干的梳毙。 我是一名探鬼主播,決...
    沈念sama閱讀 38,416評論 3 400
  • 文/蒼蘭香墨 我猛地睜開眼捐下,長吁一口氣:“原來是場噩夢啊……” “哼账锹!你這毒婦竟也來了萌业?” 一聲冷哼從身側(cè)響起,我...
    開封第一講書人閱讀 37,053評論 0 259
  • 序言:老撾萬榮一對情侶失蹤奸柬,失蹤者是張志新(化名)和其女友劉穎生年,沒想到半個月后,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體廓奕,經(jīng)...
    沈念sama閱讀 43,558評論 1 300
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡抱婉,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 36,007評論 2 325
  • 正文 我和宋清朗相戀三年,在試婚紗的時候發(fā)現(xiàn)自己被綠了桌粉。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片蒸绩。...
    茶點(diǎn)故事閱讀 38,117評論 1 334
  • 序言:一個原本活蹦亂跳的男人離奇死亡,死狀恐怖番甩,靈堂內(nèi)的尸體忽然破棺而出侵贵,到底是詐尸還是另有隱情,我是刑警寧澤缘薛,帶...
    沈念sama閱讀 33,756評論 4 324
  • 正文 年R本政府宣布窍育,位于F島的核電站,受9級特大地震影響宴胧,放射性物質(zhì)發(fā)生泄漏漱抓。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,324評論 3 307
  • 文/蒙蒙 一恕齐、第九天 我趴在偏房一處隱蔽的房頂上張望乞娄。 院中可真熱鬧,春花似錦显歧、人聲如沸仪或。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,315評論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽范删。三九已至,卻和暖如春拷肌,著一層夾襖步出監(jiān)牢的瞬間到旦,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 31,539評論 1 262
  • 我被黑心中介騙來泰國打工巨缘, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留添忘,地道東北人。 一個月前我還...
    沈念sama閱讀 45,578評論 2 355
  • 正文 我出身青樓若锁,卻偏偏與公主長得像搁骑,于是被迫代替她去往敵國和親。 傳聞我的和親對象是個殘疾皇子,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 42,877評論 2 345

推薦閱讀更多精彩內(nèi)容