序列比對(duì)是什么以及序列比對(duì)主要的作用是什么桃纯,本篇博客就一筆帶過(guò),因?yàn)椴皇侵饕窒韮?nèi)容披坏。
序列比對(duì)态坦,此處引申為pairwise alignment會(huì)更加恰當(dāng)一些,用于比較2條序列之間的相似程度棒拂,推斷它們之間的相似程度驮配,進(jìn)而探索對(duì)應(yīng)功能以及系統(tǒng)發(fā)育關(guān)系。
接下來(lái)大體分為2個(gè)部分着茸,1)全局比對(duì)壮锻,2)局部比對(duì)
首先要明確一個(gè)概念:序列比對(duì)想要達(dá)到的目的是什么?
引一張圖來(lái)說(shuō)明序列比對(duì)的目的以及全局比對(duì)涮阔、局部比對(duì)之間的區(qū)別猜绣,
總的來(lái)說(shuō),也就是全局比對(duì)和局部比對(duì)想要達(dá)到的目的是不一樣的敬特,
- 全局比對(duì)想要得到的是2條序列最佳的匹配結(jié)果(e.g. 最多的match數(shù)量掰邢、最高的比對(duì)得分牺陶、最高的identity),局部比對(duì)想要得到的是2條序列中最佳匹配片段(注意:最佳的匹配結(jié)果需要建立在相對(duì)較少的序列修改上)
- 全局比對(duì)更適用于evolution關(guān)系上更加靠近的(e.g. 粳稻和秈稻)辣之,而局部比對(duì)更加適用于evolution關(guān)系上關(guān)系比較遠(yuǎn)的(e.g. 水稻和葡萄)
「步驟拆解」Global Alignment
利用動(dòng)態(tài)規(guī)劃來(lái)解決問(wèn)題掰伸,最關(guān)鍵的一步就是列出動(dòng)態(tài)規(guī)劃公式,只要能列出公式怀估,后面的編程也都只是時(shí)間問(wèn)題狮鸭。
但是,我并不想一上來(lái)就列出數(shù)學(xué)公式多搀,我認(rèn)為以一個(gè)簡(jiǎn)單的例子入手更有利于序列比對(duì)問(wèn)題中的動(dòng)態(tài)規(guī)劃應(yīng)用歧蕉。
接下來(lái),先理一理基于動(dòng)態(tài)規(guī)劃的序列比對(duì)的過(guò)程康铭。
(1)Intialization + Matrix Filling
假設(shè)現(xiàn)在有2條長(zhǎng)度分別為n惯退、m的序列,
那則需要構(gòu)建行數(shù)為n+1从藤,列數(shù)為m+1的矩陣催跪,
而“Filling”這個(gè)過(guò)程,即將第一列和第一行進(jìn)行填充夷野,從數(shù)學(xué)公式的角度來(lái)理解的話叠荠,
- 第一列的填充:,i~[1, n]
- 第一行的填充:扫责,j~[1, m]
(2)Tracing Back
每一個(gè)單元的填充模式如下,
-
橫向和豎向的移動(dòng)代表了gap open(horizontal逃呼,vertical)
但更加復(fù)雜的情況應(yīng)該考慮到gap在哪一條序列打開(kāi)
對(duì)角線的移動(dòng)則可以分為1)match(從大的數(shù)值回溯到小的數(shù)值)鳖孤,2)mismatch(從小的數(shù)值回溯到大的數(shù)值)
Note:數(shù)值增大代表替換矩陣中,該堿基對(duì)應(yīng)關(guān)系為match抡笼,而數(shù)值減小苏揣,代表替換矩陣中堿基對(duì)應(yīng)關(guān)系為mismatch
「公式」Global Alignment
引入gap extension
出現(xiàn)4個(gè)單位長(zhǎng)度的一個(gè)完整gap將兩條序列給比對(duì)上,或者4個(gè)單位長(zhǎng)度的單獨(dú)gap將兩條序列給比對(duì)上是更符合生物學(xué)原理的推姻?
上述的文字情況如下所示平匈,
# 1.
ATCGATCGATCGATCG----
AGCTAGCTCAGTACGT----
# 2.
ATCG-ATCG-ATCG-ATCG-ATCG
ATCG-ATCG-ATCG-ATCG-ATCG
答案是前者。這就需要在序列比對(duì)中引入另一個(gè)非常重要的細(xì)節(jié) —— affine gap penalty藏古。
Note:此處引入的affine gap penalty為“not penalize open with extension”增炭,即在打開(kāi)一個(gè)gap的時(shí)候,不會(huì)在該gap上同時(shí)引入open和extension的罰分
affine gap penalty拧晕,即在打開(kāi)第一個(gè)gap的時(shí)候引入gap open罰分隙姿,而在該gap的基礎(chǔ)上進(jìn)行延續(xù)則采用gap extension罰分。
該種做法與原來(lái)的常量gap有一定區(qū)別厂捞,因此就需要改變動(dòng)態(tài)規(guī)劃公式输玷,同時(shí)引入CS中的狀態(tài)機(jī)可以幫助我們更好地理解這個(gè)問(wèn)題队丝。
上圖中存在3個(gè)狀態(tài),
1)M:當(dāng)前的比對(duì)情況下為match或mismatch
2)Ix:當(dāng)前的比對(duì)情況為在seq2上打開(kāi)一個(gè)gap欲鹏,而seq1上為一個(gè)base
3)Iy:當(dāng)前的比對(duì)情況為在seq1上打開(kāi)一個(gè)gap机久,而seq2上為一個(gè)base
三者之間是可以相互轉(zhuǎn)換的,通過(guò)d赔嚎、e膘盖、s(x, y)來(lái)調(diào)整。
因此動(dòng)態(tài)規(guī)劃公式變?yōu)槿缦碌男问剑?/p>
gap extension情況下的動(dòng)態(tài)規(guī)劃矩陣初始化
- M(0,0)=0
但由于在第一行以及在第一列的取值都是不存在的尽狠,因此定義為-inf衔憨。
同時(shí),由于每一個(gè)cell都存在一種情況袄膏,我們需要建立3個(gè)矩陣來(lái)存儲(chǔ)對(duì)應(yīng)的信息践图,分別用M、X沉馆、Y來(lái)表示码党。
經(jīng)初始化之后,就可以得到如下3個(gè)矩陣:
1)M
2)X
代表在列方向打開(kāi)一個(gè)gap斥黑,即seq2上插入一個(gè)gap
3)Y
代表在行方向上打開(kāi)一個(gè)gap揖盘,即seq1上插入一個(gè)gap
gap extension情況下的動(dòng)態(tài)規(guī)劃矩陣回溯
3個(gè)矩陣,可以使用1個(gè)矩陣來(lái)記錄當(dāng)前的cell的數(shù)值來(lái)源锌奴,3種情況如下
- 來(lái)自M兽狭,即當(dāng)前為一個(gè)match/mismatch,記錄為0
- 來(lái)自X鹿蜀,即當(dāng)前為一個(gè)gap open/gap extension箕慧,記錄為1
- 來(lái)自Y,即當(dāng)前為一個(gè)gap open/gap extension茴恰,記錄為2
給個(gè)示例的回溯矩陣颠焦,
「步驟拆解」Local Alignment
步驟與Global Alignment近似,只是引入了一個(gè)0往枣,就可以得到局部的最佳匹配伐庭。
公式如下,
代碼實(shí)現(xiàn)
自己對(duì)這部分代碼的評(píng)價(jià)是分冈,
1)封裝性還可以提高(OOP的應(yīng)用還是不夠)
2)可以將結(jié)果繪制(從老鄉(xiāng)那里得到的啟發(fā)圾另,需要熟練自己matplotlib的使用)
from ctypes import alignment
import numpy as np
# Preset variables
seq1 = "TCGTAGACGA"
seq2 = "ATAGAATGCGG"
print(f'The seq1 has length of {len(seq1)}')
print(f'The seq2 has length of {len(seq2)}')
match = 1
mismatch = -1
gap_open = -2
gap_extension = -1
#
global MIN
MIN = -float("inf")
def identity_match(base1, base2):
'''Note: this function is used to compare the bases and return match point or mismatch point'''
if base1 == base2:
return match
else:
return mismatch
def createscorematrix(n, m):
'''Note: this function is used to generate the original score function'''
# Create match matrix, x matrix and y matrix
m_mat = [np.zeros(m+1) for i in range(0, n+1)]
x_mat = [np.zeros(m+1) for i in range(0, n+1)]
y_mat = [np.zeros(m+1) for i in range(0, n+1)]
return m_mat, x_mat, y_mat
m_mat, x_mat, y_mat = createscorematrix(len(seq1), len(seq2))
# print(m_mat)
# print(x_mat)
# print(y_mat)
def scorematrix_init(m_mat, x_mat, y_mat, d, e, local=False):
'''Note: this function conduct the score matrix initialization'''
'''Global Alignment'''
if local == False:
'''match matrix filling'''
for i in range(0, len(m_mat)):
for j in range(0, len(m_mat[0])):
if i == 0 and j == 0:
m_mat[i][j] = 0
elif i == 0 and j > 0:
m_mat[i][j] = MIN
elif i > 0 and j == 0:
m_mat[i][j] = MIN
# print(m_mat)
for line in m_mat:
r_list = [str(i) for i in line]
print(' '.join(r_list))
'''x_matrix filling'''
for i in range(0, len(x_mat)):
for j in range(0, len(x_mat[0])):
if i == 0 and j == 0:
x_mat[i][j]=0
if i > 0 and j == 0:
x_mat[i][j] = d+e*(i-1)
x_first_row = [0]
x_first_row.extend([MIN]*(len(x_mat[0])-1))
x_mat[0] = x_first_row
# print(x_mat)
for line in x_mat:
r_list = [str(i) for i in line]
print(' '.join(r_list))
'''y_matrix filling'''
for i in range(0, len(y_mat)):
for j in range(0, len(y_mat[0])):
if i == 0 and j == 0:
y_mat[i][j]=0
elif i > 0 and j == 0:
y_mat[i][j] = MIN
y_first_row = [0]
y_first_row.extend([d+e*(i-1) for i in range(1, len(y_mat[0]-1))])
y_mat[0] = y_first_row
# print(y_mat)
for line in y_mat:
r_list = [str(i) for i in line]
print(' '.join(r_list))
return m_mat, x_mat, y_mat
'''Local Alignment: Initialization step for Smith-Watermen is useless'''
if local == True:
return m_mat, x_mat, y_mat
m_mat, x_mat, y_mat = scorematrix_init(m_mat, x_mat, y_mat, -2, -1, local=False)
# m_mat, x_mat, y_mat = scorematrix_init(m_mat, x_mat, y_mat, -2, -1, local=True)
def matrix_filling(m_mat, x_mat, y_mat, d, e, local=False):
'''this function is used to create the scoring matrix using three dynamic programming,
and building a tracing matrix to restore the paths for the retrieve of aliignments'''
'''Global Alignment Activation'''
if local == False:
# Filling score matrix and record the trace
trace_matrix = [np.zeros(len(m_mat[0])) for i in range(0, len(m_mat))]
for i in range(1, len(m_mat)):
# print(m_mat[0])
for j in range(1, len(m_mat[0])):
# print(i, j)
m_mat[i][j] = max(
m_mat[i-1][j-1] + identity_match(seq1[i-1], seq2[j-1]),
x_mat[i-1][j-1] + identity_match(seq1[i-1], seq2[j-1]),
y_mat[i-1][j-1] + identity_match(seq1[i-1], seq2[j-1])
)
x_mat[i][j] = max(m_mat[i-1][j] + d, x_mat[i-1][j] + e)
y_mat[i][j] = max(m_mat[i][j-1] + d, y_mat[i][j-1] + e)
# for line in m_mat:
# print(line)
# Take the greatest values in these three matrix,
# merge into one matrix,
# and record the path
new_mat = [np.zeros(len(m_mat[0])) for i in range(0, len(m_mat))]
for i in range(0, len(m_mat)):
for j in range(0, len(m_mat[0])):
new_mat[i][j] = max(m_mat[i][j], x_mat[i][j], y_mat[i][j])
# Fill the trace matrix
# Note: from match/mismatch is 0, from x_mat (open a gap in seq2) is 1, from y_mat (open a gap in seq1)
if m_mat[i][j] == max(m_mat[i][j], x_mat[i][j], y_mat[i][j]):
trace_matrix[i][j] = 0
elif x_mat[i][j] == max(m_mat[i][j], x_mat[i][j], y_mat[i][j]):
trace_matrix[i][j] = 1
elif y_mat[i][j] == max(m_mat[i][j], x_mat[i][j], y_mat[i][j]):
trace_matrix[i][j] = 2
# # Print out the scoring matrix
# for line in new_mat:
# r_list = [str(i) for i in line]
# print('\t'.join(r_list))
# # Print out the tracing matrix
for line in trace_matrix:
r_list = [str(i) for i in line]
print('\t'.join(r_list))
return new_mat, trace_matrix
'''Local Alignment Activation'''
if local == True:
# Filling score matrix and record the trace
trace_matrix = [np.zeros(len(m_mat[0])) for i in range(0, len(m_mat))]
for i in range(1, len(m_mat)):
# print(m_mat[0])
for j in range(1, len(m_mat[0])):
# print(i, j)
m_mat[i][j] = max(
m_mat[i-1][j-1] + identity_match(seq1[i-1], seq2[j-1]),
x_mat[i-1][j-1] + identity_match(seq1[i-1], seq2[j-1]),
y_mat[i-1][j-1] + identity_match(seq1[i-1], seq2[j-1]),
0
)
x_mat[i][j] = max(m_mat[i-1][j] + d, x_mat[i-1][j] + e)
y_mat[i][j] = max(m_mat[i][j-1] + d, y_mat[i][j-1] + e)
# for line in m_mat:
# print(line)
# Take the Greatest values in these three matrix
new_mat = [np.zeros(len(m_mat[0])) for i in range(0, len(m_mat))]
for i in range(0, len(m_mat)):
for j in range(0, len(m_mat[0])):
new_mat[i][j] = max(m_mat[i][j], x_mat[i][j], y_mat[i][j])
if m_mat[i][j] == max(m_mat[i][j], x_mat[i][j], y_mat[i][j]):
trace_matrix[i][j] = 0
elif x_mat[i][j] == max(m_mat[i][j], x_mat[i][j], y_mat[i][j]):
trace_matrix[i][j] = 1
elif y_mat[i][j] == max(m_mat[i][j], x_mat[i][j], y_mat[i][j]):
trace_matrix[i][j] = 2
# # Print out the scoring matrix
# for line in new_mat:
# r_list = [str(i) for i in line]
# print(' '.join(r_list))
# # Print out the tracing matrix
# for line in trace_matrix:
# r_list = [str(i) for i in line]
# print('\t'.join(r_list))
return new_mat, trace_matrix
score_matrix, trace_matrix = matrix_filling(m_mat, x_mat, y_mat, -2, -1, local=False)
# score_matrix, trace_matrix = matrix_filling(m_mat, x_mat, y_mat, -2, -1, local=True)
# seq1 = "-TCGTAGACGA"
# seq2 = "ATAGAATGCGG"
def global_backtracking(matrix, score_matrix):
'''this function is used to trace back the input matrix and output the final alignment
Note: the input matrix is trace matrix'''
ti = len(seq1)
tj = len(seq2)
alignment1 = ''
alignment2 = ''
while (ti > 0 or tj > 0):
# Choose to go left, up or diagonal
cell = matrix[ti][tj]
if cell == 0:
alignment1 = seq1[ti-1] + alignment1
alignment2 = seq2[tj-1] + alignment2
ti -= 1
tj -= 1
elif cell == 1:
alignment1 = seq1[ti-1] + alignment1
alignment2 = '-' + alignment2
ti -= 1
elif cell == 2:
alignment1 = '-' + alignment1
alignment2 = seq2[tj-1] + alignment2
tj -= 1
# fmt_alignment = f'{alignment1}\n{alignment2}'
# print(fmt_alignment)
# Formt the info
info = f"======The Global======\n {alignment1}\n {alignment2}\nSCORE: {score_matrix[len(score_matrix)-1][len(score_matrix[0])-1]}"
print(info)
global_backtracking(trace_matrix, score_matrix)
def local_backtracking(trace_matrix, score_matrix):
'''this function does backtracking like FUNCTION global_backtracking, but in the way of local aligment'''
# Convert score matrix into Numpy array to find maximum value
new_score_matrix = np.array(score_matrix)
pos = np.unravel_index(np.argmax(new_score_matrix, axis=None), new_score_matrix.shape) # retrieve the maximum value
ti = pos[0]
tj = pos[1]
# print(f'{ti}\t{tj}')
alignment1 = ''
alignment2 = ''
while (ti > 0 or tj > 0):
if new_score_matrix[ti][tj] == 0: # stop local alignment back tracking when 0 values met
break
cell = trace_matrix[ti][tj]
if cell == 0:
alignment1 = seq1[ti-1] + alignment1
alignment2 = seq2[tj-1] + alignment2
ti -= 1
tj -= 1
elif cell == 1:
alignment1 = seq1[ti-1] + alignment1
alignment2 = '-' + alignment2
ti -= 1
elif cell == 2:
alignment1 = '-' + alignment1
alignment2 = seq2[tj-1] + alignment2
tj -= 1
info = f"======The Local======\n {alignment1}\n {alignment2}\nSCORE: {np.ndarray.max(new_score_matrix)}"
print(info)
# local_backtracking(trace_matrix, score_matrix)
參考文獻(xiàn)
[1] 《組學(xué)數(shù)據(jù)中的統(tǒng)計(jì)與分析》,田衛(wèi)東
[2] https://users.soe.ucsc.edu/~karplus/bme205/f12/Alignment.html
[3] https://www.youtube.com/watch?v=ZBD9he4Zp1E
[4] Biological sequence analysis: Probabilistic models of proteins and nucleic acids
后話
開(kāi)學(xué)了一個(gè)月雕沉,但是未來(lái)的科研方向沒(méi)有定下來(lái)盯捌,
我只能說(shuō)我大概懂自己想往什么方向走,同時(shí)我也意識(shí)到優(yōu)秀的人太多了蘑秽,
所以我要更加踏實(shí)一些饺著,努力一些箫攀,更加學(xué)會(huì)總結(jié)一些。
寫(xiě)博客的思路幼衰,其實(shí)算受到了熊大哥畢業(yè)時(shí)發(fā)的文章的影響:
“利用空余時(shí)間不停地寫(xiě)”以及熊大哥在博客中提到的“討好型人格”靴跛,
從那一期播客中,我學(xué)習(xí)到了很多渡嚣,我也發(fā)現(xiàn)了自己實(shí)際上只是剛?cè)腴T(mén)梢睛,許多東西我都不熟。
繼續(xù)加油识椰。