Tuesday, May 31, 2011

understanding regular expressions to finite state machines

i recently came across a question on designing a state machine for detecting a binary string sequence. the solution to such a problem has two approaches. the very commonly followed approach is analyze the "given" sequence over and over again and try to scribble state transitions and test the patterns on it to check if breaks. i believe this is the way hardware engineers work this out, primarily because they spend some time which is considerably less than what they spend for other problems they analyze :) i say hi to "theory of relativity" here..

the second approach is to have rules to analyze such problems because they only consist of binary patterns (0s and 1s) and a rule based solution is like some kind of a program, so you don't necessarily have to do much testing or alter, adjust your state machine as you try to solve the problem. the theory that deals with such kind of problems of pattern matching and sequence detection is the famous regular expression or the re. regular expressions can be analyzed as finite automata, or what can be called as our commonly known finite state machine.

so the best way to deal with such problems is to represent the sequence as a regular expression and just try converting it to finite state machine based on rules.

So some notations wrt re:

a* : an a* would mean 0 or more matches of the letter "a" followed by successive letters. for instance a*b matches "ab", "aab", "aaab" or just "b".

a+ : an a+ would mean 1 or more matches of the letter "a" followed by successive letters. for instance a+b matches "ab", "aab", "aaaab" but will not match just b.

a? : an a? would mean 0 or 1 matches of the letter "a" followed by successive letters. for instance a?b matches "ab" or just "b" but will not match "aaab".

the problem statement mentioned the sequence detection for the string 101011, which also meant that the state machine should be able to sustain 101010111 or a 111101011 as a match since both these string involve the string 101011.

Looking more closely at the sequence, a few points to notice.

a. match as many 1s preceeding this sequence so that we end up matching  11101011, or 11111111010111 or 11111111111111101011. to put this in the basic atom terminology, we match (1+)01011. 

b. match as many 10s preceeding this sequence so that we end up matching 10101011 or 101010101011 ... so we can end up consuming any number of 10s once we get past the first occurrence of 10. to put this in the basic atom terminology, we match (1+)0(10)+11

so that is pretty much translating problem into a regex string. The following slideslow will try to illustrate a method to translate a regular expression into a statemachine. There might have been mistakes in my understanding and representation. Comment or e-mail me if you think there is something wrongly interpreted.


Thanks
Shyam

3 comments:

  1. This crack works well if the match pattern is of the type 'one or more'. I don't understand how this tackles the problem of matching a definite sequences.

    Apart from that, this is the new approach i came across and it will be a great deal to automation.

    ReplyDelete
  2. The intent here is to set "rules" to solve pattern matching types of problems. As I mentioned earlier in the blog, the conventional way of rtl engineers writing sequence matching state machines is to apply logic by mind to every such problem and iterate over and over the result with different test vectors and freeze a solution.
    An alternative way of dealing with such problems is to just represent the given matching pattern as a resultant regular expression which is just writing the problem in terms of, if i can call it a mathematical representation.
    Once written in such a manner, it becomes easy to apply rules to get problem solved. It does not matter if it is a "one or more" or "zero or more"..

    ReplyDelete
  3. nice approach...EDA companies can buckle up thr shoes n develop pattern detector plugins (re -> rtl) in thr tools n increase licensing cost :)... Good one Shyam bhai!

    Thanks,Navneet

    ReplyDelete