1. Problem
Find the start position of first occurrence of String W in String S? (Leetcode Address)
2. Simple Solution
Each round (i
from 0
to len(S)-len(W)
), start from the ith position and match if the following substring match W
or not, if match return i
, go to next round( start from i+1
position) if not. Time complexity is O(n*m)
.
Example:S = "THIS IS A SIMPLE EXAMPLE" W = "EXAMPLE"
Solution:
round 1:
THIS IS A SIMPLE EXAMPLE
EXAMPLE
round 2:
THIS IS A SIMPLE EXAMPLE
EXAMPLE
......
round 18:
THIS IS A SIMPLE EXAMPLE
EXAMPLE
match successful, return position of E (17).
Show me the code (haystack is S, needle is W):
public int strStr(String haystack, String needle) {
if(needle.equals("")) return 0;
if(needle.length() > haystack.length()) return -1;
for(int i = 0; i < haystack.length(); i ++){
if(haystack.length() - i < needle.length()) return -1;
int j = 0;
while(j < needle.length()){
if(haystack.charAt(i+j) != needle.charAt(j)) break;
j ++;
}
if(j == needle.length()) return i;
}
return -1;
}
3. KMP Alg.
KMP is the abbr. of Knuth-Morris-Pratt, one of the authors of this algorithm. The other two are Donald Knuth and Vaughan Pratt.
They solve the above problem in a more efficient way. Let's explain it using an example.
The last example is not good enough for explain this algorithm, let's change another one: S = "BBCABCDAB ABCDABCDABDE"
, W = "ABCDABD"
Solution:
pre-process:(partial match table)
traversal W once (O(m)) we can get table shown below, the 1 under A represents how many char match from the beginning.
A | B | C | D | A | B | D |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1 | 2 | 0 |
round 1:
BBCABCDAB ABCDABCDABDE
ABCDABD
round 4:
BBCABCDAB ABCDABCDABDE
ABCDABD
' '
and D
not match, and ' '
not part of ABCDABD
, so move to the position next to ' '
.
round 5:
BBCABCDAB ABCDABCDABDE
ABCDABD
as can be seen that till D, already 6 chars match, and partial match number is 2, so number of positions to move is num(match) - num(partial match) = (6 -2) = 4
. And comparing should start from num(partial match) + 1 = 2 + 1 = 3
after finish moving. Then we get round 11.
round 6:
BBCABCDAB ABCDABCDABDE
ABCDABD
start comparing from num(partial match) + 1 = 2 + 1 = 3
position of ABCDABD
, match successful, return position of first element (10).
As we can see, if use the simple solution, it would take 15 rounds to find the position, but with KMP it only takes 6 round to find it.
Time Complexity:
table algorithm is O(k), where k is the length of W. the traversal algorithm is O(n), where n is the length of S. so the time complexity of the whole algorithm is O(k+n).
Show me the code!
public static int kmp(String S, String W) {
if(W.length() == 0) return 0;
//calculate partial table
int count = 0;
Set<Character> map = new HashSet<>();
int[] partialTable = new int[W.length()];
partialTable[0] = 0;
map.add(W.charAt(0));
for(int k = 1; k < W.length(); k ++) {
map.add(W.charAt(k));
if(W.charAt(k) == W.charAt(count)) {
partialTable[k] = (++count);
}else {
if(count != 0 && W.charAt(k) == W.charAt(k - 1) && partialTable[k - 1] == partialTable[count -1] + 1) {//deal with special case (last 'a' in 'aabaaa')
partialTable[k] = count;
}else {
count = 0;
if(W.charAt(k) == W.charAt(count)) partialTable[k] = (++count);
else partialTable[k] = count;
}
}
}
// start to find if W exist in S
int i = 0, j = 0;
while (i <= S.length() - W.length()) {
for(; j < W.length(); j ++) {
if(S.charAt(i + j) == W.charAt(j)) {
if(j == W.length() - 1) return i;
}else {
if(!map.contains(S.charAt(i + j))) {
i += (j+1);
j = 0;
break;
}
if(j == 0) {
i ++;
j = 0;
break;
}
i += (j - partialTable[j - 1]); //jump to the proper position
j = partialTable[j - 1]; //jump the one already compared
break;
}
}
}
return -1;
}
4. BM Alg.
BM algorithm was created by Boyer-Moore and J Strother Moore. Which is more efficient that KMP in most cases, especially in real world cases. That's why BM is widely used in content matching in files in real world.
Back to the example:S = "THIS IS A SIMPLE EXAMPLE" W = "EXAMPLE"
Solution:
round 1:
THIS IS A SIMPLE EXAMPLE
EXAMPLE
start matching from the end of EXAMPLE
, E
not match, and S
not in EXAMPLE
, so direct jump to the position after S
:
round 2:
THIS IS A SIMPLE EXAMPLE
EXAMPLE
E
not match P
, P
is treated as a "bad character" and P
's last position in EXAMPLE
is 4 (count from 0), so should move pst(mapped by 'P' in 'EXAMPLE') - pst(last 'P' in 'EXAMPLE') = 6 - 4 = 2
steps forward.
round 3:
THIS IS A SIMPLE EXAMPLE
EXAMPLE
As we can see, MPLE
match (treat as "good character"), A
and I
not match ("bad character").
good character rule: pst(from 0, good character's position) - pst(last position of good character) = 3 - (-1) = 4
bad character rule: pst(bad character's positon) - pst(last position of bad character) = 2-(-1) = 3
4 > 3, so move 4 steps forward.
round 4:
THIS IS A SIMPLE EXAMPLE
EXAMPLE
E
not match A
, A
is bad character, its current mapping position is 6, A
's last position in EXAMPLE
is 2, so should move 6 - 2 = 4
steps forward.
round 5:
THIS IS A SIMPLE EXAMPLE
EXAMPLE
match, return start position in S: 17.
Show me the code!
public static int bm(String S, String W) {
if(W.length() == 0) return 0;
Map<Character, List<Integer>> map = new HashMap<>();// to record each char's position
for(int i = 0; i < W.length(); i ++) {
if(map.get(W.charAt(i)) == null)
map.put(W.charAt(i), new ArrayList<>());
map.get(W.charAt(i)).add(i);
}
int stepNum = 0;
for(int k = 0; k <= S.length() - W.length(); k += stepNum) {
for(int m = W.length() - 1; m >= 0; m --) {
if(S.charAt(k + m) == W.charAt(m)) {//good char
if(m == 0) return k;//match found
}else {//bad char
// steps calculate based on bad char
int badStepNum = 0;
List<Integer> list = map.get(S.charAt(k + m));
if(list == null) {//not exist in W
badStepNum = m - (-1);
}else {
int badCharLastPst = -1;
for(int i = list.size() - 1; i >= 0; i --)
if(m > list.get(i) && m - list.get(i) < m - badCharLastPst)
badCharLastPst = list.get(i);
badStepNum = m - badCharLastPst;
}
// steps calculate based on good char
int goodStepNum = 0;
if(m < W.length() - 1) {
list = map.get(S.charAt(k + m + 1));
if(list == null) {//not exist in W
goodStepNum = (m + 1) - (-1);
}else {
int goodCharLastPst = -1;
for(int i = list.size() - 1; i >= 0; i --)
if(m + 1 > list.get(i) && m + 1 - list.get(i) < m + 1 - goodCharLastPst)
goodCharLastPst = list.get(i);
goodStepNum = m + 1 - goodCharLastPst;
}
}
stepNum = Math.max(badStepNum, goodStepNum);
break;
}
}
}
return -1;
}
Reference
https://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm
https://en.wikipedia.org/wiki/Boyer–Moore_string-search_algorithm