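Problem description: string matching search
Suppose we face the following problem: given a text string S and a pattern string P, how do we find the position of P in S?
Brute-force matching algorithm
With the brute-force approach, suppose the text string S has been matched up to position i and the pattern string P up to position j. Then:
1. If the current characters match (S[i] == P[j]), do i++, j++ and continue with the next character;
2. If they do not match (S[i] != P[j]), set i = i - (j - 1), j = 0. In other words, on every mismatch i backtracks and j is reset to 0.
With the flow and logic of the brute-force algorithm clear, we can write the code as follows: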
#include <string.h>

int ViolentMatch(const char* s, const char* p)
{
    int sLen = (int)strlen(s);
    int pLen = (int)strlen(p);
    int i = 0;
    int j = 0;
    while (i < sLen && j < pLen)
    {
        if (s[i] == p[j])
        {
            // ① The current characters match (S[i] == P[j]): i++, j++
            i++;
            j++;
        }
        else
        {
            // ② Mismatch (S[i] != P[j]): set i = i - (j - 1), j = 0
            i = i - j + 1;
            j = 0;
        }
    }
    // On success, return the position of pattern p in text s; otherwise return -1
    if (j == pLen)
        return i - j;
    else
        return -1;
}
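To make the cost of the backtracking step i = i - j + 1 visible, here is a small sketch of the same loop with a comparison counter added. The name brute_force_count and the comparisons out-parameter are assumptions made for illustration; they are not part of the original code.

```c
#include <string.h>

/* Hypothetical helper: same loop as the brute-force matcher above, but it
   also reports the number of character comparisons via *comparisons. */
int brute_force_count(const char *s, const char *p, long *comparisons)
{
    int sLen = (int)strlen(s);
    int pLen = (int)strlen(p);
    int i = 0;
    int j = 0;
    *comparisons = 0;
    while (i < sLen && j < pLen) {
        (*comparisons)++;          /* one comparison per loop iteration */
        if (s[i] == p[j]) {
            i++;
            j++;
        } else {
            i = i - j + 1;         /* i backtracks to the next alignment */
            j = 0;
        }
    }
    return (j == pLen) ? i - j : -1;
}
```

For s = "aaaaab" and p = "aab", each of the four alignments costs three comparisons, so the counter ends at 12 for a text of only 6 characters: text characters that already matched are compared again after every backtrack.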
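KMP algorithm
The Knuth-Morris-Pratt string-searching algorithm, "KMP algorithm" for short, is commonly used to find where a pattern string P occurs in a text string S. It was published jointly in 1977 by Donald Knuth, Vaughan Pratt, and James H. Morris, and is named after their three surnames.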
The algorithm of Knuth, Morris and Pratt [KMP 77] makes use of the information gained by previous symbol comparisons. It never re-compares a text symbol that has matched a pattern symbol. As a result, the complexity of the searching phase of the Knuth-Morris-Pratt algorithm is in O(n).
However, a preprocessing of the pattern is necessary in order to analyze its structure. The preprocessing phase has a complexity of O(m). Since m ≤ n, the overall complexity of the Knuth-Morris-Pratt algorithm is in O(n).
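The O(n) bound for the searching phase can be checked empirically with a sketch like the one below. It is a minimal KMP search with a comparison counter, not a reference implementation; the name kmp_count, the comparisons out-parameter, and the use of an lps table (longest proper prefix that is also a suffix) are assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: KMP search that also reports how many character
   comparisons the searching phase performed. Returns the index of the
   first match, or -1 if there is none (or if allocation fails). */
int kmp_count(const char *s, const char *p, long *comparisons)
{
    int n = (int)strlen(s);
    int m = (int)strlen(p);
    *comparisons = 0;
    if (m == 0)
        return 0;

    /* Preprocessing, O(m): lps[k] is the length of the longest proper
       prefix of p[0..k] that is also a suffix of p[0..k]. */
    int *lps = malloc((size_t)m * sizeof *lps);
    if (lps == NULL)
        return -1;
    lps[0] = 0;
    int k = 1;
    int len = 0;
    while (k < m) {
        if (p[k] == p[len]) {
            len++;
            lps[k] = len;
            k++;
        } else if (len > 0) {
            len = lps[len - 1];
        } else {
            lps[k] = 0;
            k++;
        }
    }

    /* Searching phase: i only ever moves forward. */
    int i = 0;
    int j = 0;
    int result = -1;
    while (i < n) {
        (*comparisons)++;
        if (s[i] == p[j]) {
            i++;
            j++;
            if (j == m) {
                result = i - j;
                break;
            }
        } else if (j > 0) {
            j = lps[j - 1];   /* shift the pattern, keep i where it is */
        } else {
            i++;
        }
    }
    free(lps);
    return result;
}
```

Every loop iteration either advances i or decreases j, and j can only decrease as often as it was increased, so the counter never exceeds 2n in this sketch.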
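Diagram of the core principle of the KMP algorithm
Core idea for computing the prefix table
Treat the prefix P[0:j] as the pattern to be matched against P (P[0:i]), and treat P itself as the text being searched.
The next array is the prefix table; in the figure above it is the lps array.
KMP source code
A minimal version of the KMP source code:
The first element of the next array is filled with -1, which makes reasoning about lengths less convoluted.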
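What the prefix table stores can be made concrete with a deliberately naive sketch that applies the definition directly: for each position i, try every candidate length, longest first, and keep the first one where a proper prefix equals a suffix. The name lps_naive is an assumption for illustration, and this is much slower than the linear-time construction KMP actually uses.

```c
#include <string.h>

/* Hypothetical helper: fills lps[i] with the length of the longest proper
   prefix of p[0..i] that is also a suffix of p[0..i].
   The caller must provide room for strlen(p) ints. */
void lps_naive(const char *p, int *lps)
{
    int m = (int)strlen(p);
    for (int i = 0; i < m; i++) {
        lps[i] = 0;
        /* k is the candidate length; k <= i keeps the prefix proper */
        for (int k = i; k > 0; k--) {
            if (memcmp(p, p + (i + 1 - k), (size_t)k) == 0) {
                lps[i] = k;
                break;
            }
        }
    }
}
```

For the pattern "aabbcaab" this yields 0, 1, 0, 0, 0, 1, 2, 3: the last entry is 3 because "aab" is both a prefix and a suffix of the whole pattern.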
/**
 * getNext(pattern): computes, for the string pattern, the length of the longest common prefix and suffix (max common prefix suffix length)
 */
fun getNext(P: String): IntArray {
    val M = P.length
    val next = IntArray(M + 1) { -1 }
    // i: current index of P
    var i = 0
    // j: current index of the longest prefix of P
    var j = -1
    next[0] = -1 // next[i] = j
    // compute next[i]
    while (i < M) {
        // On a mismatch (P[i] != P[j]) with j >= 0, keep i unchanged and set j = next[j].
        // The "pattern", i.e. the prefix P[0:j], is then not compared again from position 0
        // but directly from position j = next[j].
        while (j >= 0 && P[i] != P[j]) {
            j = next[j]
        }
        i++
        j++
        next[i] = j
    }
    return next
}
/**
 * kmp substring search algorithm
 * @param S : the source text string
 * @param P : the search pattern string
 */
fun kmp(S: String, P: String): Int {
    val N = S.length
    val M = P.length
    if (P.isEmpty()) {
        return 0
    }
    // j: the current index of P
    var j = 0
    // i: the current index of S
    var i = 0
    // next array
    val next = getNext(P)
    while (i < N) {
        while (j >= 0 && S[i] != P[j]) {
            j = next[j]
        }
        i++
        j++
        // when j == M, the pattern has been found in the text; return the index (i - j)
        if (j == M) {
            return i - j
        }
    }
    return -1
}
fun main() {
    var text = "addaabbcaabffffggghhddabcdaaabbbaab"
    var pattern = "aabbcaab"
    println(getNext(pattern).joinToString())
    var index = kmp(text, pattern)
    println("$pattern is the substring of $text, the index is: $index")
    text = "hello"
    pattern = "ll"
    println(getNext(pattern).joinToString())
    index = kmp(text, pattern)
    println("$pattern is the substring of $text, the index is: $index")
    text = "abbbbbbcccddddaabaacabdcddaabbbbaad"
    pattern = "aabaacab"
    println(getNext(pattern).joinToString())
    index = kmp(text, pattern)
    println("$pattern is the substring of $text, the index is: $index")
}
// Output:
//-1, 0, 1, 0, 0, 0, 1, 2, 3
//aabbcaab is the substring of addaabbcaabffffggghhddabcdaaabbbaab, the index is: 3
//-1, 0, 1
//ll is the substring of hello, the index is: 2
//-1, 0, 1, 0, 1, 2, 0, 1, 0
//aabaacab is the substring of abbbbbbcccddddaabaacabdcddaabbbbaad, the index is: 14
Another version of the code:
/**
 * getNext(pattern): computes, for the string pattern, the length of the longest common prefix and suffix (max common prefix suffix length)
 */
fun getNext(P: String): IntArray {
    val M = P.length
    val next = IntArray(M) { -1 }
    // i: current index of P
    var i = 1
    // j: current index of the longest prefix of P
    var j = 0
    next[0] = 0
    // compute next[i]
    while (i < M) {
        if (P[i] == P[j]) { // ①
            val len = j + 1
            next[i] = len
            i++
            j++
        } else {
            // On a mismatch (P[i] != P[j]) with j != 0, keep i unchanged and set j = next[j - 1].
            // The "pattern", i.e. the prefix P[0:j], is then not compared again from position 0
            // but directly from position next[j - 1].
            if (j != 0) {
                j = next[j - 1] // j shifts left, jump back to ①
            } else {
                next[i] = 0 // j is already 0, move on to the next i
                i++
            }
        }
    }
    return next
}
/**
 * kmp substring search algorithm
 * @param S : the source text string
 * @param P : the search pattern string
 */
fun kmp(S: String, P: String): Int {
    val N = S.length
    val M = P.length
    if (P.isEmpty()) {
        return 0
    }
    // j: the current index of P
    var j = 0
    // i: the current index of S
    var i = 0
    // next array
    val next = getNext(P)
    // i is the index into the text S, so scan until the end of S
    while (i < N) {
        if (S[i] == P[j]) {
            i++
            j++
        } else {
            if (j > 0) {
                // On a mismatch (S[i] != P[j]), keep i unchanged and set j = next[j - 1].
                // The pattern P is then not compared again from position 0
                // but directly from position next[j - 1].
                j = next[j - 1]
            } else {
                i++
            }
        }
        // when j == M, the pattern has been found in the text
        if (j == M) {
            return i - M
        }
    }
    return -1
}
fun main() {
    var text = "addaabbcaabffffggghhddabcdaaabbbaab"
    var pattern = "aabbcaab"
    println(getNext(pattern).joinToString())
    var index = kmp(text, pattern)
    println("$pattern is the substring of $text, the index is: $index")
    text = "hello"
    pattern = "ll"
    println(getNext(pattern).joinToString())
    index = kmp(text, pattern)
    println("$pattern is the substring of $text, the index is: $index")
    text = "abbbbbbcccddddaabaacabdcddaabbbbaad"
    pattern = "aabaacab"
    println(getNext(pattern).joinToString())
    index = kmp(text, pattern)
    println("$pattern is the substring of $text, the index is: $index")
}
// Output:
//0, 1, 0, 0, 0, 1, 2, 3
//aabbcaab is the substring of addaabbcaabffffggghhddabcdaaabbbaab, the index is: 3
//0, 1
//ll is the substring of hello, the index is: 2
//0, 1, 0, 1, 2, 0, 1, 0
//aabaacab is the substring of abbbbbbcccddddaabaacabdcddaabbbbaad, the index is: 14
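Comparing the printed tables of the two versions suggests that they hold the same information under different indexing: the table that starts with -1 is the 0-based table shifted right by one position, with -1 prepended. A minimal sketch of that conversion follows; the name lps_to_next is an assumption for illustration and does not appear in either version above.

```c
/* Hypothetical helper: converts a 0-based prefix table of length m
   (second version) into the -1-prefixed table of length m + 1
   (first version). The caller must provide room for m + 1 ints in next. */
void lps_to_next(const int *lps, int m, int *next)
{
    next[0] = -1;
    for (int i = 0; i < m; i++) {
        next[i + 1] = lps[i];
    }
}
```

For the pattern "aabaacab", the input 0, 1, 0, 1, 2, 0, 1, 0 becomes -1, 0, 1, 0, 1, 2, 0, 1, 0, which matches the output of the first version. This is also why one search loop falls back with j = next[j] and the other with j = next[j - 1].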
References
https://www.inf.hs-flensburg.de/lang/algorithmen/pattern/kmpen.htm
https://blog.csdn.net/v_july_v/article/details/7041827