String matching – KMP algorithm [C language]

[KMP] algorithm is an improved string matching algorithm proposed by DEKnuth, JH Morris and VRPratt, so people call it Knuth-Morris-Platt operation (KMP algorithm for short). The core of the KMP algorithm is to use the information after the matching failure to minimize the matching times between the pattern string and the main string to achieve the purpose of fast matching. The specific implementation is through a next() function, and the function itself contains the local matching information of the pattern string. The time complexity of KMP algorithm is O(m+n)

A brute force algorithm

int BF(char *chang,char *duan)
{
    int c_strlen=strlen(chang);
    int d_strlen=strlen(duan);
    int c=0,d=0;
    while(c<c_strlen && d<d_strlen){
        if(chang[c]==duan[d]){
            c++;
            d++;
        }
        else{
            c = c - d + 1 ; // here we have to go back to the beginning, and then continue to match the next character so +1; 
            d = 0 ;
        }
    }
    return d<d_strlen?-1:c-d;
}

Its characteristics, the use of two pointers, one points to the main string, one points to the substring. 

If the match is successful, the two pointers move one bit back and continue to match. 

===============================================================================================================================================================================================

If the match fails, the main string pointer traces back to the next character where the match started, and the substring pointer traces back to the beginning character.

Until the [string] matches successfully, return the subscript of the main string at the beginning of the match.

The time complexity of          this algorithm is O(m*n) .

2. KMP algorithm

When using the BF algorithm, it will be found that if there is a mismatch, both pointers will be backtracked. The idea of ​​the KMP algorithm is to backtrack only one substring pointer, and the main string pointer remains unchanged. [In this way, the time complexity of] the KMP algorithm is O(n+m).

Substring pointer backtracking:

Now that we know that the KMP algorithm only backtracks the substring pointer, how to backtrack the substring pointer? Here is explained by example.

PS: One substring type of the KMP algorithm is not enough to explain all cases, so the following will use multiple types of substrings for parsing

In the above figure, it is assumed that the “stream” has been matched, and the pointer backtracking is performed for the mismatch, and pay attention to the discovery. The substring part of the string before and after the string is the same.  

Since the previous part is the same, it matches the main string. Then using KMP here, the substring pointer can be skipped directly, and the same characters that have already been matched are matched directly from the third position “lou” of the substring. 

Then the question arises, how does the substring know that it needs to go back to the “floor” to match? 

This is where the Next[] array needs to be calculated. What is stored in the Next[] array is the position of the pointer backtracking after the substring corresponding to the subscript position mismatch.

Calculate Next[] array and principle:

Calculating the Next[] array is mainly to calculate a subscript position of the substring, and the length of the longest same string in front . It is also called true prefix and true suffix.

PS: Note that when judging whether the true prefix and suffix are equal, it must start with the character at Str[0] and end with the character at Str[i-1]. Here i is the position where the match failed and is the index of Next[].

First of all, the first two characters have no prefix and suffix, so the Next[] array, the first and second bits are fixed to 0.

Take the above as an example, that is to say, the real prefix and the real suffix must start with a and end with b. Then its length is the value of Next[i]. Then if the substring does not match at the i subscript, the substring pointer falls back to Str[2], which happens to be c.

Knowing its calculation method, you can directly calculate the Next[] array of substrings.

Features: If the numbers form an increasing trend, the latter must be only 1 larger than the former. That is +1 for growth. But if it goes down, it doesn’t have to be just one. Using this feature can also help to quickly calculate the Next[] array, and judge the correctness of its calculation.

According to the above two examples, it can be found that when calculating the Next[] array, its true prefix and true suffix can be coincident. (But not completely coincident!!)

Introduced here, you can write the code of KMP’s substring search, and found that it is very similar to the BF algorithm. It’s just that when the strings are mismatched, the substring pointer goes back to the Next[] value of the corresponding subscript, which is the core of the KMP algorithm.

{
    int c_strlen=strlen(chang);
    int d_strlen=strlen(duan);
    int c=0,d=0;
    while(c<c_strlen && d<d_strlen){
        if(chang[c]==duan[d]){
            c++;
            d++;
        }
        else{
            d = Next[d]; // Mismatch, the pointer falls back to the corresponding Next[] subscript element.
        }
    }
    return d<d_strlen?-1:c-d;
}
//problem code

PS: There is a small problem with the code. If it is wrong at the beginning, that is to say, when chang[0] != duan[0], the Next[] at the 0 subscript is still 0. will enter an infinite loop. You can write an if() judgment. If it is mismatched at the beginning, let the main string go one bit backward.

int KMP(char *chang,char *duan)
{
    int *next=getNext(duan);

    int c_strlen=strlen(chang);
    int d_strlen=strlen(duan);
    int c=0,d=0;
    while(c<c_strlen && d<d_strlen){

        if(c=0 && chang[c]!=duan[d])
            c++;
        if(chang[c]==duan[d]){
            c++;
            d++;
        }
        else{
            d = Next[d]; // Mismatch, the pointer falls back to the corresponding Next[] subscript element, the storage location.
        }
    }
    return d<d_strlen?-1:c-d;
}
// fix

But the big guy who created the KMP algorithm used a better way of writing.  

They assign -1 to item 0 of the Next[] array. If the first character of the substring does not match, then d==-1, and then it is judged that if d is -1, it will enter the matching branch. In this case, in C++, the main string pointer goes one back. After d++, the substring pointer still points to the 0 position.

It will be found that this not only ensures that after the 0 subscript position is mismatched, the next match will continue from the next character of the main string, but also ensures that the substring traces back to the beginning position.

int KMP(char *chang,char *duan)
{
    int *next=getNext(duan);

    int c_strlen=strlen(chang);
    int d_strlen=strlen(duan);
    int c=0,d=0;
    while(c<c_strlen && d<d_strlen){

        if ( d== -1 || chang[c]==duan[d] ){ //match
            c++;
            d++;
        }
        else{
            d = Next[d]; // Mismatch, the pointer falls back to the corresponding Next[] subscript element, the storage location.
        }
    }
    return d<d_strlen?-1:c-d;
}
//Big guy writing

So Next[] array, the first item is fixed to -1, and the second item is fixed to 0.

Calculate the Next[] array, the code implementation:

Using oral arithmetic, you will find that it is relatively simple to calculate the Next[] array. But how to implement it as code? In fact, Next[] also has certain rules.

Looking at this substring, assuming a mismatch at the i subscript, its pointer backtracks to Str[3]. Let its backtracked subscript be k.

Because Next[] stores the index of the pointer backtracking, then there is Next[i]==k .

Then since Next[i]==K, it means that there is a maximum true prefix and a maximum true suffix at the subscript of i.

Here is “Wangjiangwang”. Because Next[i] here is derived from the length of “Wangjiangwang”. As mentioned earlier, when looking for a true prefix and a true suffix, it must start with the Str[0] character and end with the Str[i-1] character.

Then it means that Str[0]…Str[k-1]==Str[x]…Str[i-1] . (Here, the starting position of the true prefix is ​​fixed to 0, and the true suffix position is set to the unknown number X)

So, if Next[i]==k, then Str[0]…Str[k-1]==Str[x]…Str[i-1];

Because the length of the true prefix and the true suffix are equal, so k-1-0==i-1-x (using the subscript here, the length of the true prefix and the true suffix can be calculated)

Arrange it to get k = i – x and shift the term to get x = i – k 

According to x = i – k , Str[0]…Str[k-1]==Str[ik]…Str[i-1]; (insert the above formula) 

Let’s look at the substring again, because here Str[i]==Str[k]  , so the true prefix and true suffix at i+1 is “Wangjiang Wangjiang”, which is Str[0]…Str[k] ==Str[ik]…Str[i] ; (If you think about it here, you will find that the Next[] value at i+1 is actually the Next[] value of i plus 1.)

Observe the two formulas:

Next[i]==k; Str[0]…Str[k-1] == Str[i-k]…Str[i-1];

Str[i]==Str[k]; Str[0]…Str[k] == Str[ik]…Str[i]; (this formula, 1 more than the above)

Then it can be concluded that if Str[i]==Str[k], then: Next[i+1]=k+1;

So what if Str[i] != Str[k]?

It is found that Str[k]!=Str[i]; k continues to backtrack. Until backtracking to Str[i]==Str[k]; let Next[i+1]==k+1 hold .

Of course , after backtracking to the beginning, Str[i]==Str[k] is still not found; then Next[i+1]==0 ; the next Next[] array element of i is 0

Assuming that k backtracks to the beginning, Str[i]==Str[k], Next[i+1]==k+1 is established, where Next[i+1]==1. 

At this point, all situations have been considered. The code that computes the Next[] array can be written:

int *getNext(char *str)
{
    int len=strlen(str);
    int *next[]=(int *)calloc(len,sizeof(int));
    next[0]=-1;
    next[ 1 ]= 0 ;
     int i= 2 ; // main string pointer 
    int k= 0 ; // backtrack position 
    while ( i < len ){
         if ( k== -1 || str[i -1 ]== str[k] ){     //There are multiple ways to write 
            next[i]=k+ 1 ;   
            i++;
            k++;
        }
        else{
            k=next[k]; 
        }
    }
    return next;
}

PS: Because the first two numbers are fixed in the two Next[] arrays, the calculation starts directly from the third one. ( In fact, you will find that if there is a true prefix and a true suffix in front of the third element, it is also fixed to 1, and the first character and the second character are equal)

Another way to understand Next[] arrays:

Seeking the Next[] array is equivalent to substring matching itself.

Match yourself with yourself, and match at i, so it means that there is also a “hope” in front of it. Matching in this way is equivalent to i+1 finding a true prefix and a true suffix. Then the Next[] array element at i+1 is 1. Next[i+1]==1.

Continue to match down.

In the case of matching again here, it is also str[i]==str[k], which is equivalent to adding a same character to the true prefix and true suffix, true prefix and true suffix + 1. Because Next[i] is not only the position of the pointer backtracking, but also represents the length of the true prefix and true suffix at i, so  Next[i+1]==Next[i]+1  ; 

But because the value stored in the Next[] array is the subscript position that represents the substring pointer to be backtracked, that is, Next[i]==k , then it can also be obtained. Next[i+1]==k+1 (insert the above formula)

Then keep calculating the comparison, you can get a part of the Next[] array. Then as shown below

A mismatch has occurred here. Next[i+1]==5 if i here matches. But there’s a mismatch here, so Next[i+1] can’t be equal to 5 in the first place.

Because it is equivalent to doing string matching, the top is the main string, and the bottom is the substring. Then a mismatch occurs and the pointer backtracks. Where to go back and find out! ! The backtracking here is the value stored in the substring corresponding to the subscript Next[], Next[i]==4. Backtrack to the “floor” where the subscript position is 4, that is, k == Next[k];

It is found that str[i]!=str[k] is still mismatched, and the pointer continues to backtrack, k==Next[k];  

The mismatch goes into the mismatch branch of if(), but since the pointer backtracks to the initial position, since Next[0]==-1, k==-1. You will find that it is very good here, k==-1, entering the matching branch of if(), then i++, k++. The main string and substring pointers are all moved one bit backward, Next[i]=k; k==0, Next[i]==0;

Or Next[i+1]=k+1; the previous i and k have been added, so there is no need to add them.

Code:

int * getNext(char* str)
{
    int s_len = strlen(str);
    int *next = (int *) calloc (s_len,sizeof(int));
    next[ 0 ] = -1 ; 
     int i= 0 , k= -1 ;    //k is the substring pointer, i is the main string pointer. The main string pointer is assigned -1, in order to match from the next position 
    while (i<s_len -1 ){     //Because if it starts from 0, if you match yourself, you will all match yourself, and you will not be able to calculate the array elements 
        if (k == -1 ||str[i]==str[k]){
            i++;
            k++;
            Next[i]=k; //No need to add it before
        }
        else{
            k=Next[k];
        }
    }
    return next;
}

PS: The writing is almost the same as before. Because it is equivalent to string matching, and because the first element of Next[] is fixed and does not need to be calculated. So the main string pointer is one step faster than the substring pointer.

It is equivalent to the first comparison as shown above, and then compare a character to calculate the value of a Next[]. Of course the first value is -1 and the second value is 0, which is fixed.

int * getNext(char* str)
{
    int s_len = strlen(str);
    int *next = (int *) calloc (s_len,sizeof(char));
    next[ 0 ] = -1 ;
     int i= 0 ,k= -1 ; //k is the main string pointer, i is the main string pointer 
    while (i<s_len -1 ){
         if (k== -1 ||str [i]==str[k]){
            Next[i+ 1 ]=k+ 1 ;   //It can also be written like this for easy understanding.
            i++;
            k++;     
        }
        else{
            k=Next[k];
        }
    }
    return next;
}

PS: 

3. NextValue

In fact, Next[] can also be optimized. Let’s look at a substring first.

Assuming a mismatch at i, the pointer backtracks. A problem will be found, the backtracking will go back to subscript 6, and then the character with subscript 6 is also a, the same as the original, so it must not match, it will continue to backtrack, 5 4 3 2 1 0, you will find that the pointer will always be Backtrack to 0.

Then you will find that no matter which of the previous a is mismatched, the pointer will actually backtrack to 0. In fact, you can skip the step-by-step backtracking steps in the middle, so that the pointer is in place one step at a time, and backtracking to the very beginning at a time. That is, when Str[i]==Str[k] (the character pointed to by the pointer now is equal to the character backtracked to), Next[i]==Next[k], thus regaining a NextValue[] array.

It is also the value of Next[] if Str[i]!=Str[k].

Let’s look at another substring

If there is a mismatch at i, and Str[3]==Str[0]; so Nextval[3]==Nextval[0];  

If i is mismatched at subscript position 5, because Str[5]!=Str[2]; so Nextval[5]==2;

PS: When actually calculating the NextValue[] array, pay attention to the sequential calculation, because when the previous data covers the later data, the overwritten data must be used later. What does that mean?

This sequentially computes a NextValue[], 

This will use the value that was previously overwritten with 0, and it will be overwritten with 0 instead of the original 1. So it should be calculated in order. 

Using Nextvalue[] and Next[] array results are the same, Nextvalue[] is the optimized version. its essence is 

Code:

#include <iostream>
#include<stdio.h>

int* getNext(char *str)
{
    int len = strlen(str);
    int* next = (int*)calloc(len, sizeof(int));
    next[0] = -1;
    int i = 0, k = -1;

    while ( i < len -1 ) { //note -1 here

        if (k==-1||str[i]==str[k]) {
            i++;
            k++;
            next[i] = (str[i] == str[k] ? next[k] : k);
        }
        else {
            k = next[k];
        }
    }

    return next;
}


int KMP(char* chang,char* duan)
{

    int* next = getNext(duan);
    int c_len = strlen(chang);
    int d_len = strlen(duan);
    int i = 0, k = 0;

    while (i<c_len && k<d_len) {

        if (k == -1 || chang[i] == duan[k]) {
            i++;
            k++;
        }
        else {
            k = next[k];
        }
    }

    return k < d_len ? -1 : i - k;
}

int main()
{

    char str1[] = "wjl,wjn,wjlswjn,jlqg,jnqg" ;
     char str2[] ​​= "wjlswjn" ;
     printf ( "Found string at subscript position %d of position array\n" , KMP(str1,str2 ));
     return  0 ;

}

PS: Note again that if next[i]=k, this is written after i++, k++. If it is written in front, use next[i+1]=k+1.

Then i<s_len-1, note that it is -1 here. Take the code here as an example, the length of str2 is 7. If i<s_len is used, when calculating the 6 position, i++ is done once, and it becomes next[7], and the array is out of bounds. If i<s_len is used here, i=5, i++ becomes 6 once, and next[6] is to calculate the last element, which is just finished.

test:

PS: If you look closely at the getNext() function and the KMP() function, you will find that they are very similar. The side also explains the process of obtaining the Next[] array, which is equivalent to matching the substring with its own string.

Summarize:

If you understand the KMP algorithm, you actually know that when it is nothing more than a mismatch, skip the true prefix that has been successfully matched before. Then you can know that if there are more repeated characters in a substring, the efficiency of the algorithm will be higher. Otherwise, if there are no repeated characters or there are few repeated characters, it will degenerate into a BF algorithm.

Leave a Comment

Your email address will not be published. Required fields are marked *