Exact match
Naive exact match
Introduction: the simplest exact-matching algorithm; it is slow on long inputs.
Input: the pattern p and the text t, i.e., the pattern string and the string to search in.
Output: the indices of every position where p occurs in t.
Algorithm: two loops. The outer loop shifts the alignment along t; the inner loop compares character by character. The key is a match flag: if every character matches, append the offset.
def naive(p, t):
    occurrences = []
    for i in range(len(t) - len(p) + 1):  # loop over alignments
        match = True
        for j in range(len(p)):  # loop over characters
            if t[i+j] != p[j]:  # compare characters
                match = False
                break
        if match:
            occurrences.append(i)  # all chars matched; record
    return occurrences
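A quick sanity check of the naive matcher on a toy example (the function is repeated so the snippet runs on its own; the strings are made up):

```python
def naive(p, t):
    occurrences = []
    for i in range(len(t) - len(p) + 1):   # try every alignment of p against t
        match = True
        for j in range(len(p)):            # compare character by character
            if t[i+j] != p[j]:
                match = False
                break
        if match:
            occurrences.append(i)          # record a full match
    return occurrences

print(naive('AG', 'AGCAGT'))  # pattern occurs at offsets 0 and 3
```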
Boyer-Moore
Introduction: exact matching that skips alignments which cannot possibly match, so it is faster than the naive algorithm.
Input: p, t, and p_bm, a preprocessing object built from p.
Output: the positions in t where p matches.
Algorithm: core idea: compute a shift at every step, taking the largest gap allowed by the rules; whenever an alignment matches fully, append(i).
Details: the outer loop is a while loop because the number of iterations is not known in advance. The inner for loop walks the characters of p from right to left; on a mismatch, take the maximum shift given by the bad_character and good_suffix rules and break. On a full match, append the offset, use match_skip() for the shift, and keep scanning, since there may be more occurrences.
def boyer_moore(p, p_bm, t):
    """ Do Boyer-Moore matching. p=pattern, t=text,
        p_bm=BoyerMoore object for p """
    i = 0
    occurrences = []
    while i < len(t) - len(p) + 1:
        shift = 1
        mismatched = False
        for j in range(len(p)-1, -1, -1):
            if p[j] != t[i+j]:
                skip_bc = p_bm.bad_character_rule(j, t[i+j])
                skip_gs = p_bm.good_suffix_rule(j)
                shift = max(shift, skip_bc, skip_gs)
                mismatched = True
                break
        if not mismatched:
            occurrences.append(i)
            skip_gs = p_bm.match_skip()
            shift = max(shift, skip_gs)
        i += shift
    return occurrences
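The notes use a BoyerMoore preprocessing object (p_bm) but never show it. Below is a simplified sketch of such a class (my own, not the course's implementation): it implements only the bad-character rule from a dense table, and lets good_suffix_rule and match_skip fall back to 0, so the matcher above still finds every occurrence, just with fewer skips.

```python
class SimpleBoyerMoore:
    """Bad-character-rule-only preprocessing for pattern p.
    A simplified drop-in for p_bm in boyer_moore (sketch, not the course's class)."""

    def __init__(self, p, alphabet='ACGT'):
        self.p = p
        self.amap = {c: i for i, c in enumerate(alphabet)}
        # bad_char[j][c] = rightmost occurrence of c in p[:j], or -1 if none
        self.bad_char = []
        last = [-1] * len(alphabet)
        for j, c in enumerate(p):
            self.bad_char.append(last[:])   # snapshot before including p[j]
            last[self.amap[c]] = j

    def bad_character_rule(self, j, c):
        # shift so the rightmost prior occurrence of c lines up under t[i+j];
        # if c does not occur in p[:j], this shifts past the mismatch entirely
        return j - self.bad_char[j][self.amap[c]]

    def good_suffix_rule(self, j):
        return 0  # not implemented in this sketch

    def match_skip(self):
        return 0  # not implemented in this sketch
```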
Hash Algorithm
Introduction: an exact-matching algorithm that preprocesses the text; well suited when the text is large, although the hash table itself can be big.
Input: p, t, and k, the k-mer length used to build the hash table (the Index).
Output: the matching positions.
Algorithm: core: first index the text into a table of k-mers. At query time only the first k-mer of p is looked up, because this is exact matching. In Python the table is a list of (k-mer, position) tuples kept in sorted order so that binary search works; the -1 in the bisect call makes the probe tuple sort before every real (k-mer, position) entry, since positions are >= 0. After looking up the first k-mer of p, verify the remaining characters against the text.
import bisect

class Index(object):
    """ Holds a substring index for a text T """

    def __init__(self, t, k):
        """ Create index from all substrings of t of length k """
        self.k = k  # k-mer length (k)
        self.index = []
        for i in range(len(t) - k + 1):  # for each k-mer
            self.index.append((t[i:i+k], i))  # add (k-mer, offset) pair
        self.index.sort()  # alphabetize by k-mer

    def query(self, p):
        """ Return index hits for first k-mer of p """
        kmer = p[:self.k]  # query with first k-mer
        i = bisect.bisect_left(self.index, (kmer, -1))  # binary search
        hits = []
        while i < len(self.index):  # collect matching index entries
            if self.index[i][0] != kmer:
                break
            hits.append(self.index[i][1])
            i += 1
        return hits
def query_index(p, t, index):
    k = index.k
    offsets = []
    for i in index.query(p):  # candidate positions from the index
        if p[k:] == t[i+k:i+len(p)]:  # verify the rest of p
            offsets.append(i)
    return offsets
Reduce the size of the hash table.
Improve specificity and reduce the number of collisions.
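The same indexing idea can be sketched with a plain dict instead of a sorted list plus bisect (my own variant, not the notes' Index class); a dict lookup replaces the binary search:

```python
from collections import defaultdict

def build_kmer_index(t, k):
    """Map each k-mer of t to the list of offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(t) - k + 1):
        index[t[i:i+k]].append(i)
    return index

def query_kmer_index(p, t, index, k):
    """Exact-match p in t: look up p's first k-mer, then verify the rest."""
    offsets = []
    for i in index.get(p[:k], []):   # hits are just a dict lookup
        if t[i:i+len(p)] == p:       # verify the full pattern
            offsets.append(i)
    return offsets

t = 'AACCTTGGAACCTT'
idx = build_kmer_index(t, 4)
print(query_kmer_index('CCTT', t, idx, 4))  # [2, 10]
```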
Approximate match
Pigeonhole principle: if P matches T with up to k mismatches allowed, then splitting P into k+1 pieces guarantees that at least one piece has an exact match in T.
if |X| = |Y|, then EditDistance(X, Y) <= HammingDistance(X, Y)
if |X| not equal |Y|, then EditDistance(X, Y) >= ||X|-|Y||
Hamming distance
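Hamming distance, used in the bound above, counts mismatching positions between two equal-length strings; a minimal sketch:

```python
def hamming_distance(x, y):
    """Number of positions where equal-length strings x and y differ."""
    assert len(x) == len(y)
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming_distance('GATTACA', 'GACTACA'))  # one substitution -> 1
```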
Edit distance
Hashing with the pigeonhole principle
Introduction: approximate matching built on the pigeonhole principle: split P into k+1 pieces, so at least one piece must match exactly. Find that piece's position, then count mismatches in the remaining characters, which must not exceed the allowed limit.
Input: p, t, and n, the number of mismatches allowed.
Output: the matching positions (and the matched substrings).
Algorithm: core idea: split p into n+1 pieces. Only the segment length matters (the final piece may be shorter); there is no need to generate n+1 random lengths summing to len(p). Build the hash table with that segment length, then query the position of each of the n+1 segments. For each hit, if the implied alignment would start before position 0 or run past len(t), continue; otherwise compare the remaining characters one by one and count the mismatches.
def approximate_match(p, t, n):
    segment_length = int(round(len(p) / (n + 1)))
    all_matches = set()
    p_idx = Index(t, segment_length)
    idx_hits = 0
    for i in range(n + 1):
        start = i * segment_length
        end = min((i + 1) * segment_length, len(p))
        matches = p_idx.query(p[start:end])
        # Extend matching segments to see if whole p matches
        for m in matches:
            idx_hits += 1
            if m < start or m - start + len(p) > len(t):
                continue
            mismatches = 0
            for j in range(0, start):  # verify left of the segment
                if not p[j] == t[m - start + j]:
                    mismatches += 1
                    if mismatches > n:
                        break
            for j in range(end, len(p)):  # verify right of the segment
                if not p[j] == t[m - start + j]:
                    mismatches += 1
                    if mismatches > n:
                        break
            if mismatches <= n:
                all_matches.add(m - start)
    return list(all_matches), idx_hits
def editDistance(a, b):
    if len(a) == 0:
        return len(b)
    if len(b) == 0:
        return len(a)
    if a[-1] == b[-1]:
        delta = 0  # last characters match: no substitution cost
    else:
        delta = 1
    return min(editDistance(a[:-1], b[:-1]) + delta,
               editDistance(a, b[:-1]) + 1,
               editDistance(a[:-1], b) + 1)
Problem: the recursion is far too slow, mainly because it recomputes the same subproblems over and over.
Dynamic programming
For any pair of prefixes of X and Y, the edit distance is calculated only once; this is why it is faster than the plain recursion, although it clearly uses more memory.
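Memoizing the same recurrence fixes the recomputation: each pair of prefix lengths is evaluated only once (a sketch using functools.lru_cache):

```python
from functools import lru_cache

def edit_distance_memo(a, b):
    """Edit distance via the recursive recurrence, memoized on prefix lengths."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        delta = 0 if a[i-1] == b[j-1] else 1
        return min(d(i-1, j-1) + delta,  # substitute (or match)
                   d(i, j-1) + 1,        # insert
                   d(i-1, j) + 1)        # delete
    return d(len(a), len(b))

print(edit_distance_memo('kitten', 'sitting'))  # 3
```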
def editDistance(x, y):
    # create distance matrix
    D = []
    for i in range(len(x)+1):
        D.append([0]*(len(y)+1))
    # initialize first row and column of matrix
    for i in range(len(x)+1):
        D[i][0] = i
    for i in range(len(y)+1):
        D[0][i] = i
    # fill in the rest of the matrix
    for i in range(1, len(x)+1):
        for j in range(1, len(y)+1):
            disHor = D[i][j-1] + 1
            disVer = D[i-1][j] + 1
            if x[i-1] == y[j-1]:
                disDiag = D[i-1][j-1]
            else:
                disDiag = D[i-1][j-1] + 1
            D[i][j] = min(disHor, disVer, disDiag)
    return D[-1][-1]
Approximate match with edit distance
Introduction: the first row is initialized to 0 because you do not know in advance where in T the match begins, so every starting column gets edit distance 0 by default; then trace back from the position in the last row with the smallest edit distance.
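The description above can be sketched as a variant of the edit-distance DP: zero out the first row so the alignment may start anywhere in t, and read the answer off the minimum of the last row (a traceback from that cell would recover where the match lies):

```python
def approximate_match_edit(p, t):
    """Smallest edit distance between p and any substring of t."""
    # first row all zeros: an occurrence of p may begin at any offset of t
    D = [[0] * (len(t) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        D[i][0] = i  # first column as in ordinary edit distance
    for i in range(1, len(p) + 1):
        for j in range(1, len(t) + 1):
            delta = 0 if p[i-1] == t[j-1] else 1
            D[i][j] = min(D[i-1][j-1] + delta,  # diagonal
                          D[i][j-1] + 1,        # horizontal
                          D[i-1][j] + 1)        # vertical
    return min(D[-1])  # best match ends wherever the last row is smallest

print(approximate_match_edit('ABC', 'XXABXC'))  # 1
```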
Local alignment and global alignment
Assembly
Overlap graph
Introduction: build the overlap graph of a set of reads.
Input: a set of reads
Output: the overlap graph
Algorithm: core: slide over each read to generate its k-mers, then build a table keyed by k-mer whose value is the set of reads containing it, i.e., a key-value (dict) structure.
def overlap(a, b, min_length=3):
    """ Return length of longest suffix of 'a' matching
        a prefix of 'b' that is at least 'min_length'
        characters long. If no such overlap exists,
        return 0. """
    start = 0  # start all the way at the left
    while True:
        start = a.find(b[:min_length], start)  # look for b's prefix in a
        if start == -1:  # no more occurrences to right
            return 0
        # found occurrence; check for full suffix/prefix match
        if b.startswith(a[start:]):
            return len(a)-start
        start += 1  # move just past previous match
from collections import defaultdict

def overlap_all_pairs(reads, k):
    index = defaultdict(set)
    for read in reads:
        for i in range(len(read)-k+1):
            index[read[i:i+k]].add(read)
    # make graph
    graph = defaultdict(set)
    for r in reads:
        for o in index[r[-k:]]:  # only reads sharing r's final k-mer can overlap
            if r != o:
                if overlap(r, o, k):
                    graph[r].add(o)
    edges = 0
    for read in graph:
        edges += len(graph[read])
    return edges, len(graph)
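A quick example of overlap() (the function is repeated so the snippet runs on its own; the reads are made up):

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a matching a prefix of b,
    if at least min_length long; otherwise 0."""
    start = 0
    while True:
        start = a.find(b[:min_length], start)  # candidate starting point
        if start == -1:
            return 0
        if b.startswith(a[start:]):            # full suffix/prefix match?
            return len(a) - start
        start += 1

print(overlap('TTACGT', 'CGTGTGC'))  # suffix 'CGT' overlaps b's prefix -> 3
```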
Shortest common superstring
Introduction: every permutation of the input strings must be considered and the shortest merged result kept, which is very time-consuming.
Input: a list of strings ss
Output: a shortest common superstring of ss
Algorithm:
import itertools

def scs(ss):
    shortest_sup = None
    for ssperm in itertools.permutations(ss):
        sup = ssperm[0]
        for i in range(len(ssperm)-1):
            olen = overlap(ssperm[i], ssperm[i+1], min_length=1)
            sup += ssperm[i+1][olen:]  # append the non-overlapping tail
        if shortest_sup is None or len(sup) < len(shortest_sup):
            shortest_sup = sup
    return shortest_sup
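A tiny example (overlap and scs are repeated so the snippet runs standalone; the three reads are made up): for these inputs the brute force returns a length-5 superstring.

```python
import itertools

def overlap(a, b, min_length=3):
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def scs(ss):
    """Brute-force shortest common superstring over all permutations."""
    shortest_sup = None
    for ssperm in itertools.permutations(ss):
        sup = ssperm[0]
        for i in range(len(ssperm)-1):
            olen = overlap(ssperm[i], ssperm[i+1], min_length=1)
            sup += ssperm[i+1][olen:]
        if shortest_sup is None or len(sup) < len(shortest_sup):
            shortest_sup = sup
    return shortest_sup

sup = scs(['ABC', 'BCA', 'CAB'])
print(sup, len(sup))  # 'ABCAB' 5 (one of several length-5 superstrings)
```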
Greedy shortest common superstring
Introduction: not guaranteed to find the optimal solution, but fast, and usually not far from optimal.
Input: a list of reads and a minimum overlap length k
Output: an assembled superstring
Algorithm:
import itertools

def pick_maximal_overlap(reads, k):
    read_a = None
    read_b = None
    best_olen = 0
    for a, b in itertools.permutations(reads, 2):
        olen = overlap(a, b, min_length=k)
        if olen > best_olen:
            read_a, read_b = a, b
            best_olen = olen
    return read_a, read_b, best_olen
def greedy_scs(reads, k):
    read_a, read_b, olen = pick_maximal_overlap(reads, k)
    while olen > 0:
        reads.remove(read_a)
        reads.remove(read_b)
        reads.append(read_a + read_b[olen:])  # merge the best-overlapping pair
        read_a, read_b, olen = pick_maximal_overlap(reads, k)
    return ''.join(reads)
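A small end-to-end run of the greedy assembler (the helpers are repeated so the snippet runs standalone; the three reads are made up):

```python
import itertools

def overlap(a, b, min_length=3):
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def pick_maximal_overlap(reads, k):
    read_a, read_b, best_olen = None, None, 0
    for a, b in itertools.permutations(reads, 2):
        olen = overlap(a, b, min_length=k)
        if olen > best_olen:
            read_a, read_b, best_olen = a, b, olen
    return read_a, read_b, best_olen

def greedy_scs(reads, k):
    """Repeatedly merge the pair of reads with the largest overlap."""
    read_a, read_b, olen = pick_maximal_overlap(reads, k)
    while olen > 0:
        reads.remove(read_a)
        reads.remove(read_b)
        reads.append(read_a + read_b[olen:])
        read_a, read_b, olen = pick_maximal_overlap(reads, k)
    return ''.join(reads)

print(greedy_scs(['ABC', 'BCA', 'CAB'], 1))  # 'CABCA'
```

Here greedy happens to also reach length 5; in general the greedy answer can be longer than the true shortest common superstring.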
De Bruijn graph
def de_bruijn_ize(st, k):
    edges = []
    nodes = set()
    for i in range(len(st)-k+1):  # for each k-mer of st
        edges.append((st[i:i+k-1], st[i+1:i+k]))  # edge: left to right (k-1)-mer
        nodes.add(st[i:i+k-1])
        nodes.add(st[i+1:i+k])
    return nodes, edges
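A small example (the function is repeated so the snippet runs standalone; the input string is made up): each k-mer of the string contributes one edge between its left and right (k-1)-mers.

```python
def de_bruijn_ize(st, k):
    edges = []
    nodes = set()
    for i in range(len(st)-k+1):                  # for each k-mer
        edges.append((st[i:i+k-1], st[i+1:i+k]))  # left (k-1)-mer -> right (k-1)-mer
        nodes.add(st[i:i+k-1])
        nodes.add(st[i+1:i+k])
    return nodes, edges

nodes, edges = de_bruijn_ize('ACGCGTCG', 3)
print(sorted(nodes))  # 5 distinct 2-mers
print(len(edges))     # 6 edges, one per 3-mer
```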