Finite State Machine - abstract
Data often has to be analyzed chunk by chunk: every chunk must be checked for validity (both on its own and in the context of the chunks that came before it), and some action must be taken depending on the type of each chunk. This can be modeled easily with a finite-state machine (FSM).
A state machine consists of states and the transitions between them. There is at least one entry state, and there may be a termination state.
The same task can be solved by different FSMs, but of course we should keep them as simple as possible. What is nice about these tools is that they are easy to extend: you may add new states and transitions, split an existing state into several new ones, or nest a new FSM inside an existing state.
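To make the vocabulary concrete, here is a minimal sketch in Python 3 of a machine written as a table of transitions. The example itself is hypothetical and not part of the assembly language discussed later: a message must consist of an "open" chunk, any number of "data" chunks and a final "close" chunk. The names TRANSITIONS, run and is_valid are made up for this illustration.

```python
# Hypothetical toy machine: "open", then any number of "data", then "close".
# Keys are (current state, chunk); values are the next state.
TRANSITIONS = {
    ("idle", "open"): "receiving",
    ("receiving", "data"): "receiving",
    ("receiving", "close"): "done",
}


def run(chunks, state="idle"):
    """Feed chunks one by one; return the final state, or None if a chunk
    is not valid in the context of the previous ones."""
    for chunk in chunks:
        state = TRANSITIONS.get((state, chunk))
        if state is None:
            return None
    return state


def is_valid(chunks):
    """A sequence is valid only if it ends in the termination state."""
    return run(chunks) == "done"


print(is_valid(["open", "data", "data", "close"]))
print(is_valid(["data", "close"]))
```

In this representation, extending the machine means adding entries to the table, for example a new state with its own incoming and outgoing transitions, without touching the loop that drives it.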
The title of this post may be confusing to people looking for concepts to build their own compilers or parsers. In short, reading source code requires processing it on three levels: lexical, syntactic and semantic, and the FSMs presented here can be applied directly only to the first one.
In the next posts we will explain this assembly language in detail and show how to build its virtual machine, in which the FSM will be used as one of the parts.
Finite-State Machine used for parsing code
Below is an example of the assembly language that we are going to parse. Each line contains zero or more words, such as numbers, labels, mnemonic names or comments, separated by whitespace.
What each of those parts means and how to execute it is not the point here; we will focus on splitting the source line by line and removing insignificant whitespace and comments.
    main:
        int 1
        call square
        int 0
        ret

    square:
        store 0
        load 0
        load 0
        mul
        load 0
        mul
        ret
Below are several notes that characterise our language:
- a comment starts with the ; sign and ends at the end of the line,
- a mnemonic may be indented using either \t or any number of spaces,
- allowed characters in mnemonics/labels are [0-9a-z:],
- a mnemonic can't have more than one parameter.
With this knowledge we may design the following FSM:
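In table form, the machine can be sketched as follows. This is a Python 3 sketch read off from the implementation in this post, not a replacement for it: the names classify, TRANSITIONS and trace are hypothetical, the tab character is treated as a separator on the assumption that the notes above intend it, and a missing transition raises an exception here instead of printing a message.

```python
import re


def classify(character):
    """Map a character to the input class the machine reacts to."""
    if re.match(r"[0-9a-z]", character):
        return "start"   # may start or continue a token
    if character in ".:":
        return "inner"   # may only continue a token
    if character in " \t":
        return "space"   # assumption: tab counts as a separator too
    if character in ";\0":
        return "stop"    # comment start or end-of-line marker
    return "other"


# (current state, input class) -> next state
TRANSITIONS = {
    ("init", "start"): "token",
    ("init", "space"): "separator",
    ("init", "stop"): "end",
    ("token", "start"): "token",
    ("token", "inner"): "token",
    ("token", "space"): "separator",
    ("token", "stop"): "end",
    ("separator", "start"): "token",
    ("separator", "space"): "separator",
    ("separator", "stop"): "end",
}


def trace(line):
    """Return the list of states visited while reading one line."""
    state, visited = "init", ["init"]
    for character in line + "\0":
        if state == "end":
            break  # everything after ';' is a comment and is ignored
        key = (state, classify(character))
        if key not in TRANSITIONS:
            raise ValueError("no transition from %s on %r" % (state, character))
        state = TRANSITIONS[key]
        visited.append(state)
    return visited


print(trace("int 1"))
```

For the line "int 1" the machine goes from init into token, stays there for three characters, passes through separator, enters token again for the parameter and finishes in end.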
Finally, here's the full code of the finite-state machine and an example of the output produced by parsing the assembly source given earlier.
    #!/usr/bin/env python
    import sys, re

    class FiniteStateMachine:

        def get_instruction_parts(self, instruction):
            """states: 0 - init, 1 - token, 2 - separator, 3 - end"""
            state, accumulator, buffer = 0, [], ''
            # '\0' marks the end of the line so the last token gets flushed
            for character in instruction + '\0':
                # init
                if state == 0:
                    # init -> token (append)
                    if re.match(r"[0-9a-z]", character):
                        buffer += character
                        state = 1
                    # init -> separator (skip)
                    elif character in [' ', '\t']:
                        state = 2
                    # init -> end (skip)
                    elif character in [';', '\0']:
                        state = 3
                    else:
                        print("error, no transition from init state (%c)" % character)
                # token
                elif state == 1:
                    # token -> token (append)
                    if re.match(r"[0-9a-z.:]", character):
                        buffer += character
                        state = 1
                    # token -> separator (store the finished token)
                    elif character in [' ', '\t']:
                        accumulator.append(buffer)
                        buffer = ''
                        state = 2
                    # token -> end (store the finished token)
                    elif character in [';', '\0']:
                        accumulator.append(buffer)
                        state = 3
                    else:
                        print("error, no transition from token state (%c)" % character)
                # separator
                elif state == 2:
                    # separator -> token (append)
                    if re.match(r"[0-9a-z]", character):
                        buffer += character
                        state = 1
                    # separator -> separator (skip)
                    elif character in [' ', '\t']:
                        state = 2
                    # separator -> end (skip)
                    elif character in [';', '\0']:
                        state = 3
                    else:
                        print("error, no transition from separator state (%c)" % character)
                # end: the rest of the line is a comment, ignore it
                elif state == 3:
                    pass
            return accumulator

    if __name__ == "__main__":
        fsm = FiniteStateMachine()
        for line in open(sys.argv[1], 'r').read().splitlines():
            print(fsm.get_instruction_parts(line))
    bash-3.2$ python instructiondecoder.py cube.asm
    ['main:']
    ['int', '1']
    ['call', 'square']
    ['int', '0']
    ['ret']
    []
    ['square:']
    ['store', '0']
    ['load', '0']
    ['load', '0']
    ['mul']
    ['load', '0']
    ['mul']
    ['ret']
One can ask: why create an FSM instead of using functions like split(), join() and replace()? The major benefit is the ability to extend an FSM once it has been created. It can also be presented clearly as a diagram. As an example, the code above could easily be extended to deal with C-like strings (delimited by the " sign, where a " sign may also be escaped inside the body of the string).
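A possible shape of that extension is sketched below in Python 3. It is only an illustration under several assumptions: the two extra states (STRING and ESCAPE) and all names are hypothetical, the backslash is assumed to be the escape character as in C, a string token keeps its surrounding quotes, and errors are raised rather than printed.

```python
TOKEN_START = set("0123456789abcdefghijklmnopqrstuvwxyz")
TOKEN_INNER = TOKEN_START | set(".:")
INIT, TOKEN, SEPARATOR, END, STRING, ESCAPE = range(6)


def get_instruction_parts(instruction):
    """Split one line into parts; quoted strings are kept as single parts."""
    state, parts, buf = INIT, [], ""
    for ch in instruction + "\0":
        if state in (INIT, SEPARATOR):
            if ch in TOKEN_START:
                buf, state = ch, TOKEN
            elif ch == '"':
                buf, state = ch, STRING      # new transition: open a string
            elif ch in " \t":
                state = SEPARATOR
            elif ch in ";\0":
                state = END
            else:
                raise ValueError("unexpected character %r" % ch)
        elif state == TOKEN:
            if ch in TOKEN_INNER:
                buf += ch
            elif ch in " \t":
                parts.append(buf)
                state = SEPARATOR
            elif ch in ";\0":
                parts.append(buf)
                state = END
            else:
                raise ValueError("unexpected character %r" % ch)
        elif state == STRING:
            if ch == "\\":
                buf += ch
                state = ESCAPE               # next character is taken literally
            elif ch == '"':
                parts.append(buf + ch)       # closing quote ends the string
                state = SEPARATOR
            elif ch == "\0":
                raise ValueError("unterminated string")
            else:
                buf += ch                    # ';' and spaces are plain text here
        elif state == ESCAPE:
            if ch == "\0":
                raise ValueError("unterminated string")
            buf += ch
            state = STRING
        else:  # END: the rest of the line is a comment
            break
    return parts


print(get_instruction_parts('print "a; b" ; a comment'))
```

The existing states keep their transitions; the extension only adds one outgoing transition on the " sign and two new states. Doing the same with split() would be harder, because a ; or a space inside a string must not be treated as a comment start or a separator.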