Analysis of Longest common substring matching

This topic summarizes the contents of our assignment. In this assignment, we presented and discussed the longest common substring problem is to find the longest string (or strings) that is a substring (or are substrings) of two or more strings. The problem of longest common sub sequence arises whenever we search for similarities across multiple texts. A important application is in finding a consensus among DNA sequences.

A string in C is an array of characters. C strings can be numbers, letters or symbols, words or random text, but all strings in this programming language terminate with a null character. Programming in C, you can create strings or manipulate them using various C functions. A substring is a portion of a larger string. There are several mathematical tools for determining if two strings share a common substring.

To qualify as a sub string, the characters of a shorter string must appear in a longer string in order but not necessarily together. Now let’s take an example, "tiger" is a substring of "tiny gherkin" as "t," "i," "g," "e" and "r" all appear in that order in the larger string. If an entire string shows up inside another, larger string, the smaller string is a sub sequence of the big string. With a pair of short strings, picking out a common sub sequence is simple, but as strings grow longer, it's harder to do that with the naked eye.

Let us start with a simple approach, given are two strings: Given two (or three strings), we will find the longest sub string that appears in both two string. the longest common substring problem is to find the longest string (or strings) that is a sub string (or are substrings) of two or more strings. One should not be confused with the longest common subsequence problem. Example the longest common substring of the strings "ABABC", "BABCA" and "ABCBA" is string "ABC" of length 3. Other common substrings are "AB", "BC" and "BA".

Problem definition:
Given two strings say “S” and “T” where S of length m and T of length n, find the longest strings which are substrings of both S and T.
A generalisation is the k-common substring problem. Given the set of strings S = {S_1, ..., S_K}, where |S_i|=n_i and Σn_i = N.
Find for each 2 ≤ k ≤ K, the longest strings which occur as sub strings of at least k strings.

Algorithms:
One can find the starting position and lengths of the longest common substrings of S and T in Theta (n+m)
Pseudo code
The following pseudo code will find the set of longest common substrings between two strings with dynamic programming:

 function LCSubstr(S[1..m], 
T[1..n])  
   L := array(1..m, 1..n)  
   z := 0  
   ret := {}  
  or i := 1..m  
     for j := 1..n  
       if S[i] = T[j]  
         if i = 1 or j = 1  
           L[i,j] := 1  
         else  
           L[i,j] := L[i-1,j-1] + 1  
         if L[i,j] > z  
           z := L[i,j]  
           ret := {}  
         if L[i,j] = z  
           ret := ret ∪ {S[i-z+1..i]}  
       else L[i,j]=0;  
   return ret

This algorithm runs in O(n m) time. The variable z is used to hold the length of the longest common substring found so far. The set ret is used to store the set of strings which are of length z. The set ret can be stored efficiently by just storing the index i, and that is the last character of the longest common substring (of size z) instead of S[i-z+1..z]. Thus all the longest common substrings would be, for each i in set ret, S[(ret[i]-z).,.(ret[i])].

The following method can be used to reduce the memory usage of an implementation:
Keep only the last and current row of the DP table to save memory (O(\min(m, n)) instead of O(n m))
Store only non-zero values in the rows. Hence this can be done using hash tables instead of arrays. This method is useful for large alphabets.

Algorithm for longest common substring matching

 Input : s 1 [ ]< - 
substring 1   
       s 2 [ ]< - substring 2  
 Output : longest common substring  
  l1 = string length (s1);    
  l2 = string length (s2);  
 for i=0 to l1 do  
     for j=0 to l2 do  
        for k=0 to (i+k)<l1 && (j+k)<l2 do  
 //sub function call to check are they equal   
       If (AreEqual(p1, p2, i, j, k))  
     // if same strings found then count as max  
       if (k>max)  
       lcss = i;  
       max = k;  
     end if  
       end if  
         end for  
 // if max is 0 then there is no common strings   
 if (max==-1)  
   display ("No substring found in both string")  
     else  
   display ("Longest common substring length is:" max+1)  
 display ("Longest common substring is:")  
 for i=lcss, j=0 to j<=max do  
 display( p1[i+j])  
 sub function call:  
 int AreEqual (char *p, char *q, int i, int j, int k)  
  for c=0 to c<=k do  
   if (p[i+c] != q[j+c])  
    return 0;  
   return 1;

Analysis of algorithm :

In this we take the run time of the program logic and values are varied based on the number of character in string1 and string2, then the total number of execution of all logical expressions. we carry this process say upto 20 character.
Suppose, we are taking the
string1 =”a”
string2 =”a”
in the above both substring are equal hence LCS found.

LCS	Window Length (w)	String1 (n)	String2 (m)	Count
	1	2	1	40
	2	4	3	226
	3	6	5	660
	4	8	7	1438
	5	10	9	2656
	6	12	11	4485
	7	14	13	6916
	8	16	15	8994
	9	18	17	13060
	10	20	19	18706

Again we increase characters in string1 and keeping string2 same as two, three etc ,.. character length as N.

We observe that the time complexity increases as the string1 increases.

The count value increases linearly as string length increases. We say that a function has linear complexity as the amount of work is proportional to the size of the input. Complexity = n*m*w.
If both strings are of same length = N,
then O (N^3)

Time complexity	LCS
S/E	Frequency	Total no. of steps

1	1	1
1	1	1
1	L1+1	L1+1
1	L2^2	L2^2
1	(L)^3	(L)^3
1	c	C
1	1	1
1	1	1
0	0	0
0	0	0
0	0	0
1	1	1
0	0	0
		L^3(7+L1+L2+C)

 Here is a implementation of longest common substring matching in C:  
 #include<stdio.h>  
 #include<string.h>  
 #include<conio.h>  
 //sub function to check if two strings are equal or not  
 int AreEqual(char *p, char *q, int i, int j, int k)  
 {  
  int c;  
  for(c=0; c<=k; ++c)  
  if(p[i+c] != q[j+c])  
  return 0;  
  return 1;  
 }  
 void main()  
 {  
 int l1,l2,temp;  
 int i, j, k, lcss, max=-1;  
 char s1[1024], s2[1024];  
 char *p1, *p2;  
  p1= s1;  
  p2= s2;  
  printf ("\n enter the 1st string ");  
  scanf("%[^\n]", s1);  
  getchar();  
  printf (" \nenter the 2nd string ");  
  scanf("%[^\n]", s2);  
  l1 = strlen(s1);  
  l2 = strlen(s2);  
 for(i=0; i<l1; ++i)  
 for(j=0; j<l2; ++j)  
 for(k=0; (i+k)<l1 && (j+k)<l2; ++k)  
 {  
 //sub function call to check are they equal ??  
         if(AreEqual(p1, p2, i, j, k))  
         {  
 // if same strings found then count as max  
         if (k>max)  
         {  
         lcss = i;  
         max = k;  
         }  
         }  
 }  
 // if max is 0 then there is no common strings :(  
 if (max==-1)  
 {  
 printf ("No substring found in both string ");  
  }  
  else  
 {  
 printf("Longest common substring length: %d", max+1);  
 printf("\nLongest common substring is:");  
 printf("lcs: ");  
 for(i=lcss, j=0; j<=max; ++j)  
 printf("%c", p1[i+j]);  
 printf("\n");  
 }  
  getch();  
 }

M.Sc Computer Science - A guidance to all computer science students

Saturday, 18 October 2014

Analysis of Longest common substring matching

1 comment:

Popular Posts

Follow us on FaceBook

Categories

Support

Featured post

List of Universities in Karnataka offering M.Sc Computer Science

Popular Posts

Labels

Site Links

Total Views