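Sequences and alphabets
Bio.Alphabet.IUPAC
Provides the basic alphabet definitions for proteins, DNA, and RNA, together with extended variants:
Protein: IUPAC.protein (the basic class, the 20 standard amino acids); IUPAC.extended_protein (adds letters beyond the 20 standard amino acids)
DNA: IUPAC.unambiguous_dna (basic letters); IUPAC.ambiguous_dna (ambiguity letters); IUPAC.extended_dna (modified bases)
RNA: IUPAC.unambiguous_rna (basic letters); IUPAC.ambiguous_rna (ambiguity letters)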
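To make the difference between the basic and the ambiguity alphabets concrete, here is a minimal plain-Python sketch that checks whether a sequence only uses letters from a given set. The two letter strings are written out by hand as an assumption about what the IUPAC DNA alphabets contain, and uses_only is a hypothetical helper, not part of Biopython.

```python
# Letter sets written out by hand (assumption); not imported from Biopython.
UNAMBIGUOUS_DNA = "GATC"
AMBIGUOUS_DNA = "GATCRYWSMKHBVDN"


def uses_only(sequence, letters):
    """Return True if every character of sequence occurs in letters."""
    return set(sequence.upper()) <= set(letters)


print(uses_only("AGTACACTGGT", UNAMBIGUOUS_DNA))  # True
print(uses_only("AGTNNRCT", UNAMBIGUOUS_DNA))     # False: N and R are ambiguity codes
print(uses_only("AGTNNRCT", AMBIGUOUS_DNA))       # True
```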
Defining a sequence with the generic alphabet
In[2]: from Bio.Seq import Seq
In[3]: my_seq = Seq("AGTACACTGGT")
In[4]: my_seq
Out[4]:
Seq('AGTACACTGGT', Alphabet())
In[5]: my_seq.alphabet
Out[5]:
Alphabet()
Seq() creates a basic sequence object; when no alphabet is given, the generic Alphabet() is used.
Defining a DNA sequence
In[6]: from Bio.Seq import Seq
In[7]: from Bio.Alphabet import IUPAC
In[8]: my_seq = Seq("AGCTGCAGCGAGCGAGC", IUPAC.unambiguous_dna)
In[9]: my_seq
Out[9]:
Seq('AGCTGCAGCGAGCGAGC', IUPACUnambiguousDNA())
In[10]: my_seq.alphabet
Out[10]:
IUPACUnambiguousDNA()
Working with sequences
Iterating over elements
In[11]: from Bio.Seq import Seq
In[12]: from Bio.Alphabet import IUPAC
In[13]: my_seq = Seq("AGTCGA", IUPAC.unambiguous_dna)
In[15]: for index, letter in enumerate(my_seq):
   ...:     print(index, letter)
   ...:
0 A
1 G
2 T
3 C
4 G
5 A
enumerate() iterates over the elements of the sequence together with their indices.
Getting the length
In[17]: my_seq
Out[17]:
Seq('AGTCGA', IUPACUnambiguousDNA())
In[18]: print(len(my_seq))
6
Accessing elements
In[19]: print(my_seq[0])
A
In[20]: print(my_seq[2])
T
Non-overlapping count
In[21]: Seq("AAAAA").count("AA")
Out[21]:
2
In[22]: "AAAAA".count("AA")
Out[22]:
2
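Both calls above return 2 because count() only counts non-overlapping matches. If overlapping matches matter, they can be counted by restarting the search one position after each hit. A minimal sketch, where count_overlapping is a hypothetical helper name:

```python
def count_overlapping(sequence, sub):
    """Count occurrences of sub in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(sub)
    while start != -1:
        count += 1
        # Resume one position after the previous hit so overlaps are found.
        start = sequence.find(sub, start + 1)
    return count


print("AAAAA".count("AA"))               # 2 (non-overlapping)
print(count_overlapping("AAAAA", "AA"))  # 4 (overlapping)
```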
Computing GC content
In[27]: from Bio.SeqUtils import GC
In[28]: my_seq
Out[28]:
Seq('AGTCGA', IUPACUnambiguousDNA())
In[29]: GC(my_seq)
Out[29]:
50.0
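The GC percentage is simply the share of G and C letters in the sequence. A plain-Python sketch of that arithmetic follows; gc_percent is a hypothetical helper, and ignoring ambiguity codes is a simplifying assumption, so it may differ from the library function on ambiguous input.

```python
def gc_percent(sequence):
    """Return the percentage of G and C letters (case-insensitive)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


print(gc_percent("AGTCGA"))  # 50.0 (3 of 6 letters are G or C)
```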
Slicing
In[30]: my_seq = Seq("AGCTGACTGACGCATGAACGATAGCA", IUPAC.unambiguous_dna)
In[31]: my_seq[4:12]
Out[31]:
Seq('GACTGACG', IUPACUnambiguousDNA())
In[32]: my_seq[4:12:3]
Out[32]:
Seq('GTC', IUPACUnambiguousDNA())
The new object produced by slicing keeps the alphabet of the original Seq object.
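Slice semantics are the same as for Python strings, so the start:stop:step behaviour can be tried on a plain string. The sketch below repeats the two slices from the session and also uses a stepped range to split the sequence into codons; splitting into codons is an illustration added here, not something the session above does.

```python
seq = "AGCTGACTGACGCATGAACGATAGCA"

window = seq[4:12]         # positions 4..11
every_third = seq[4:12:3]  # positions 4, 7, 10
print(window)              # GACTGACG
print(every_third)         # GTC

# Split into complete codons; trailing bases that do not fill a codon are dropped.
usable = len(seq) - len(seq) % 3
codons = [seq[i:i + 3] for i in range(0, usable, 3)]
print(codons[0], len(codons))  # AGC 8
```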
Reversing a sequence
In[33]: my_seq[::-1]
Out[33]:
Seq('ACGATAGCAAGTACGCAGTCAGTCGA', IUPACUnambiguousDNA())
Converting to a string
In[34]: str(my_seq)
Out[34]:
'AGCTGACTGACGCATGAACGATAGCA'
In[35]: print(my_seq)
AGCTGACTGACGCATGAACGATAGCA
In[36]: fasta = ">Name\n%s\n" % my_seq
In[37]: print(fasta)
>Name
AGCTGACTGACGCATGAACGATAGCA
print() and the % operator convert a Seq object to a string automatically.
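The same % formatting can be wrapped in a small function that also breaks long sequences into fixed-width lines, as FASTA files usually do. A minimal sketch: to_fasta and its width parameter are hypothetical names, and the default of 60 columns is just a common convention assumed here.

```python
def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with lines of at most width letters."""
    text = str(sequence)  # also works for objects that define __str__
    lines = [text[i:i + width] for i in range(0, len(text), width)]
    return ">%s\n%s\n" % (name, "\n".join(lines))


print(to_fasta("Name", "AGCTGACTGACGCATGAACGATAGCA"))
print(to_fasta("Name", "AGCTGACTGACGCATGAACGATAGCA", width=10))
```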
Concatenating sequences
Same alphabet
In[39]: dna1 = Seq("AGCTAGCGA",IUPAC.unambiguous_dna)
In[40]: dna2 = Seq("AGTCCGATG", IUPAC.unambiguous_dna)
In[41]: dna = dna1 + dna2
In[42]: dna
Out[42]:
Seq('AGCTAGCGAAGTCCGATG', IUPACUnambiguousDNA())
Different alphabets
In[49]: protein = Seq("EVRNAK", IUPAC.protein)
In[50]: from Bio.Alphabet import generic_alphabet
In[51]: protein.alphabet = generic_alphabet
In[52]: dna.alphabet = generic_alphabet
In[53]: dna + protein
Out[53]:
Seq('AGCTAGCGAAGTCCGATGEVRNAK', Alphabet())
To concatenate sequences with different alphabets, first convert both to the generic alphabet; otherwise a TypeError is raised:
TypeError: Incompatible alphabets IUPACUnambiguousDNA() and IUPACProtein()
Changing case
In[56]: my_seq = Seq("acgGATC",generic_alphabet)
In[57]: my_seq.upper()
Out[57]:
Seq('ACGGATC', Alphabet())
In[58]: my_seq.lower()
Out[58]:
Seq('acggatc', Alphabet())
Complement and reverse complement
In[61]: my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
In[62]: my_seq.complement()
Out[62]:
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
In[63]: my_seq.reverse_complement()
Out[63]:
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
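What the two methods compute can be reproduced on plain strings with a translation table: the complement swaps A with T and C with G, and the reverse complement additionally reverses the result. A minimal sketch that handles only unambiguous DNA letters (an assumption; ambiguity codes pass through unchanged):

```python
# Maps each base to its partner; letters not listed are left untouched.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Return the complementary strand, read in the same direction."""
    return dna.translate(_COMPLEMENT)


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return complement(dna)[::-1]


seq = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(complement(seq))          # CTAGCTACCCGGATATATCCTAGCTTTTAGCG
print(reverse_complement(seq))  # GCGATTTTCGATCCTATATAGGCCCATCGATC
```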
Simulating biological processes
Transcription
In[64]: coding_dna = Seq("AGTCGATCGATGACTAGCATGACGCATGACT", IUPAC.unambiguous_dna)
In[65]: coding_dna
Out[65]:
Seq('AGTCGATCGATGACTAGCATGACGCATGACT', IUPACUnambiguousDNA())
In[66]: template_dna = coding_dna.reverse_complement()
In[67]: template_dna
Out[67]:
Seq('AGTCATGCGTCATGCTAGTCATCGATCGACT', IUPACUnambiguousDNA())
In[68]: mRNA = coding_dna.transcribe()
In[69]: mRNA
Out[69]:
Seq('AGUCGAUCGAUGACUAGCAUGACGCAUGACU', IUPACUnambiguousRNA())
In[70]: template_dna.reverse_complement().transcribe()
Out[70]:
Seq('AGUCGAUCGAUGACUAGCAUGACGCAUGACU', IUPACUnambiguousRNA())
transcribe() replaces T with U and adjusts the alphabet accordingly.
Back-transcription
In[71]: mRNA.back_transcribe()
Out[71]:
Seq('AGTCGATCGATGACTAGCATGACGCATGACT', IUPACUnambiguousDNA())
back_transcribe() replaces U with T and changes the alphabet accordingly.
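Since both operations are letter substitutions on the coding strand, they can be sketched with str.replace. The two functions below are plain-string stand-ins written for illustration (they do not track an alphabet), and the final check shows that the round trip returns the original sequence.

```python
def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.replace("T", "U").replace("t", "u")


def back_transcribe(rna):
    """mRNA to coding-strand DNA: replace U with T."""
    return rna.replace("U", "T").replace("u", "t")


coding_dna = "AGTCGATCGATGACTAGCATGACGCATGACT"
mrna = transcribe(coding_dna)
print(mrna)                                 # AGUCGAUCGAUGACUAGCAUGACGCAUGACU
print(back_transcribe(mrna) == coding_dna)  # True
```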
Translation
In[73]: dna_seq = Seq("ATGCGTAGCTAGCTGACGTACGTAGCA",IUPAC.unambiguous_dna)
In[74]: len(dna_seq)
Out[74]:
27
In[75]: mrna_seq = dna.transcribe()
In[76]: mrna_seq.translate()
Out[76]:
Seq('S*RSPM', HasStopCodon(ExtendedIUPACProtein(), '*'))
In[77]: dna.translate()
Out[77]:
Seq('S*RSPM', HasStopCodon(ExtendedIUPACProtein(), '*'))
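Note that the calls above translate dna, the 18-nucleotide sequence built in the concatenation example (AGCTAGCGAAGTCCGATG), not the 27-nucleotide dna_seq defined in In[73]; that is why the result has six letters.
The sequence length should be a multiple of three; otherwise translate() reports a partial codon (a BiopythonWarning in the version used here, which may become an error in future versions).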
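To show what translation does codon by codon, here is a plain-Python sketch using the standard genetic code. The table is built from the compact 64-letter string in T, C, A, G order; the translate function and its to_stop and stop_symbol parameters mirror the Biopython names but are a simplified re-implementation for illustration, and dropping a trailing partial codon silently is a choice made here.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard code for TTT, TTC, TTA, TTG, TCT, ... in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, to_stop=False, stop_symbol="*"):
    """Translate unambiguous DNA or RNA; a trailing partial codon is dropped."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            if to_stop:
                break  # stop before the first in-frame stop codon
            aa = stop_symbol
        protein.append(aa)
    return "".join(protein)


print(translate("AGCTAGCGAAGTCCGATG"))                   # S*RSPM
print(translate("AGCTAGCGAAGTCCGATG", to_stop=True))     # S
print(translate("AGCTAGCGAAGTCCGATG", stop_symbol="?"))  # S?RSPM
```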
translate(table, stop_symbol, to_stop, cds)
table
Selects the genetic code table. The standard genetic code is used by default; see NCBI's genetic code documentation for details.
In[79]: dna.translate(table="Yeast Mitochondrial")
Out[79]:
Seq('S*RSPM', HasStopCodon(ExtendedIUPACProtein(), '*'))
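Translates using the yeast mitochondrial codon table. For this sequence the result matches the standard table because none of its codons is reassigned in that code.
to_stop
Translates only up to the first in-frame stop codon and then stops; the stop codon itself is not translated.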
In[80]: dna.translate(to_stop = True)
Out[80]:
Seq('S', ExtendedIUPACProtein())
stop_symbol
Sets the symbol used to represent a stop codon.
In[82]: dna.translate(stop_symbol = "?")
Out[82]:
Seq('S?RSPM', HasStopCodon(ExtendedIUPACProtein(), '?'))
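cds
Treats the sequence as a complete coding sequence (CDS): the first three bases must be a valid start codon and are translated as methionine (M), and the stop codon at the end is left out of the result.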
In[85]: from Bio.Seq import Seq
In[86]: from Bio.Alphabet import generic_dna
In[87]: gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
...: "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
...: "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
...: "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
...: "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",
...: generic_dna)
In[88]: gene.translate(table="Bacterial")
Out[88]:
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
In[89]: gene.translate(table="Bacterial", cds=True)
Out[89]:
Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein())
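The difference between the two outputs (V versus M at the start, and the missing trailing *) follows from the checks a complete CDS has to pass. The sketch below spells out those checks in plain Python. It is a simplification under stated assumptions: translate_cds is a hypothetical helper, START_CODONS is a small illustrative subset rather than the full list of the bacterial table, and the standard amino acid assignments are used for all other codons.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
START_CODONS = {"ATG", "GTG", "TTG"}  # illustrative subset (assumption)


def translate_cds(dna):
    """Translate a complete CDS: start codon becomes M, final stop is dropped."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("length is not a multiple of three")
    if dna[:3] not in START_CODONS:
        raise ValueError("first codon is not a start codon")
    residues = [CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3)]
    if residues[-1] != "*":
        raise ValueError("final codon is not a stop codon")
    if "*" in residues[:-1]:
        raise ValueError("extra in-frame stop codon")
    # The start codon is always read as methionine, whatever it normally encodes.
    return "M" + "".join(residues[1:-1])


print(translate_cds("GTGAAAAAGATGTAA"))  # MKKM
```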
Codon tables
Online codon tables
Built-in codon tables
In[90]: from Bio.Data import CodonTable
In[92]: print(CodonTable.unambiguous_dna_by_name["Standard"])  # look up by name
Table 1 Standard, SGC0
  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
In[93]: print(CodonTable.unambiguous_dna_by_id[1])  # look up by numeric ID
Table 1 Standard, SGC0
  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
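The printed table can also be held as a dictionary, which makes questions such as "which codons are stop codons?" or "which codons encode leucine?" easy to answer. The sketch below rebuilds the standard table from its compact 64-letter form in plain Python; the variable names are hypothetical, and the start codons are copied by hand from the (s) marks in the table above.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
forward_table = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Codons marked (s) in the printed standard table.
start_codons = ["TTG", "CTG", "ATG"]
stop_codons = sorted(c for c, aa in forward_table.items() if aa == "*")

# Reverse lookup: amino acid -> list of codons that encode it.
codons_for = defaultdict(list)
for codon, aa in forward_table.items():
    codons_for[aa].append(codon)

print(stop_codons)           # ['TAA', 'TAG', 'TGA']
print(codons_for["M"])       # ['ATG']
print(len(codons_for["L"]))  # 6
```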
Seq objects
Comparison
In[2]: from Bio.Seq import Seq
In[3]: from Bio.Alphabet import IUPAC
In[4]: seq1 = Seq("AGCT", IUPAC.unambiguous_dna)
In[5]: seq2 = Seq("AGCT", IUPAC.unambiguous_dna)
In[6]: seq1 == seq2
Out[6]:
True
In[12]: id(seq1) == id(seq2)
Out[12]:
False
In[13]: id(seq1)
Out[13]:
2111559244880
In[14]: id(seq2)
Out[14]:
2111559374160
In[15]: str(seq1) == str(seq2)
Out[15]:
True
Two Seq objects with the same sequence and the same alphabet compare equal, so seq1 == seq2 returns True, yet they are not the same object in memory: id(seq1) == id(seq2) returns False. When comparing sequences, you can convert them with str() first so that they are compared purely as strings.
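The distinction here is Python's general one between equality (==, decided by __eq__) and identity (is, or comparing id()). A minimal sketch with a stand-in class; SimpleSeq is hypothetical and only mimics string-based comparison, it is not how Biopython implements Seq.

```python
class SimpleSeq:
    """Tiny stand-in for a sequence object that compares by its letters."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        return self.data

    def __eq__(self, other):
        return str(self) == str(other)

    def __hash__(self):
        return hash(self.data)


seq1 = SimpleSeq("AGCT")
seq2 = SimpleSeq("AGCT")
print(seq1 == seq2)            # True: same letters
print(seq1 is seq2)            # False: two separate objects
print(str(seq1) == str(seq2))  # True: explicit string comparison
```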
Mutability
tomutable()
In[19]: my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
In[20]: my_seq[5] = "G"
Traceback (most recent call last):
File "C:\Users\AnLau\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-20-56a40d7fb976>", line 1, in <module>
my_seq[5] = "G"
TypeError: 'Seq' object does not support item assignment
In[21]: mutable_seq = my_seq.tomutable()
In[22]: mutable_seq
Out[22]:
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
In[23]: mutable_seq[5]="G"
In[24]: mutable_seq
Out[24]:
MutableSeq('GCCATGGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
Seq objects are immutable. Use tomutable() to convert a Seq into a MutableSeq object.
Creating a MutableSeq object
In[27]: from Bio.Seq import MutableSeq
In[28]: mutable_seq = MutableSeq("AGCGATGAC", IUPAC.unambiguous_dna)
In[29]: mutable_seq
Out[29]:
MutableSeq('AGCGATGAC', IUPACUnambiguousDNA())
In[30]: mutable_seq[0]="T"
In[31]: mutable_seq
Out[31]:
MutableSeq('TGCGATGAC', IUPACUnambiguousDNA())
In[32]: mutable_seq.remove("T")
In[33]: mutable_seq
Out[33]:
MutableSeq('GCGATGAC', IUPACUnambiguousDNA())
In[34]: mutable_seq.reverse()
In[35]: mutable_seq
Out[35]:
MutableSeq('CAGTAGCG', IUPACUnambiguousDNA())
In[36]: new_seq = mutable_seq.toseq()
In[37]: new_seq
Out[37]:
Seq('CAGTAGCG', IUPACUnambiguousDNA())
In[38]: new_seq.reverse_complement()
Out[38]:
Seq('CGCTACTG', IUPACUnambiguousDNA())
In[39]: new_seq
Out[39]:
Seq('CAGTAGCG', IUPACUnambiguousDNA())
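Use toseq() to convert a MutableSeq back into a Seq object.
MutableSeq has a reverse() method, and its methods (such as remove() and reverse()) modify the MutableSeq object in place, whereas Seq methods such as reverse_complement() return a new object and leave the original unchanged.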
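The same immutable/mutable split exists between Python's str and list, which gives a quick way to try the in-place operations without Biopython. This sketch replays the edits from the session on a list of letters; it is an analogy, not the MutableSeq implementation.

```python
immutable = "AGCGATGAC"
try:
    immutable[0] = "T"  # strings, like Seq, reject item assignment
except TypeError as exc:
    print("TypeError:", exc)

bases = list(immutable)  # mutable copy, comparable to tomutable()
bases[0] = "T"           # TGCGATGAC
bases.remove("T")        # removes the first T only -> GCGATGAC
bases.reverse()          # in place -> CAGTAGCG
result = "".join(bases)  # back to an immutable string, comparable to toseq()
print(result)            # CAGTAGCG
```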
UnknownSeq objects
In[40]: from Bio.Seq import UnknownSeq
In[41]: unk = UnknownSeq(20)
In[42]: unk
Out[42]:
UnknownSeq(20, alphabet = Alphabet(), character = '?')
In[43]: unk_dna = UnknownSeq(20,IUPAC.unambiguous_dna)
In[44]: unk_dna
Out[44]:
UnknownSeq(20, alphabet = IUPACUnambiguousDNA(), character = 'N')
In[45]: print(unk)
????????????????????
In[46]: print(unk_dna)
NNNNNNNNNNNNNNNNNNNN
An UnknownSeq object only stores a single character (such as "N") and the required sequence length (an integer), which saves memory.
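The memory saving comes from keeping two small values instead of one letter per position. A minimal sketch of that idea follows; CompactUnknown is a hypothetical class written for illustration, not Biopython's UnknownSeq, and it only supports length, indexing, slicing, and conversion to a string.

```python
class CompactUnknown:
    """Stores only a length and a fill character, never the full string."""

    def __init__(self, length, character="N"):
        self._length = length
        self._character = character

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice of unknown letters is a shorter run of unknown letters.
            new_length = len(range(*index.indices(self._length)))
            return CompactUnknown(new_length, self._character)
        if not -self._length <= index < self._length:
            raise IndexError("index out of range")
        return self._character

    def __str__(self):
        # The full string is only materialised on request.
        return self._character * self._length


unk_dna = CompactUnknown(20)
print(len(unk_dna))        # 20
print(unk_dna)             # NNNNNNNNNNNNNNNNNNNN
print(len(unk_dna[5:10]))  # 5
```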
Working with strings directly
In[47]: from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate
In[48]: dna_string = "AGTCGATCGATCGACTGCGACGTCGA"
In[49]: reverse_complement(dna_string)
Out[49]:
'TCGACGTCGCAGTCGATCGATCGACT'
In[50]: transcribe(dna_string)
Out[50]:
'AGUCGAUCGAUCGACUGCGACGUCGA'
In[51]: translate(dna_string)
Out[51]:
'SRSIDCDV'
C:\Users\AnLau\Anaconda3\lib\site-packages\Bio\Seq.py:2309: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning)
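The warning appears because dna_string has 26 letters, so the last two do not form a complete codon. One way to follow the advice in the message is to trim the string explicitly before translating. A minimal sketch, where trim_to_codons is a hypothetical helper:

```python
def trim_to_codons(sequence):
    """Drop trailing letters so that the length is a multiple of three."""
    remainder = len(sequence) % 3
    return sequence[:len(sequence) - remainder]


dna_string = "AGTCGATCGATCGACTGCGACGTCGA"
trimmed = trim_to_codons(dna_string)
print(len(dna_string), len(trimmed))  # 26 24
print(trimmed)                        # AGTCGATCGATCGACTGCGACGTC
```

The trimmed string contains exactly the eight codons behind the result 'SRSIDCDV' shown above.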