How to use finite state machine in parsing of assembly language? ~ Robert Gawron

Finite State Machine - abstract

Often data must be analyzed chunk by chunk, checked if all of those chunks are valid (if the chunk is valid itself and if it's valid in context of previous chunks) and when some actions must be taken according to each type of this chunks. This can be easily modeled using finite-state machines (FSM).

State Machine has its state and transitions. There is at least one entry state, there may be termination state.

One task can be done by using different FSMs, but of course we should create them as simple as possible. What is nice in those tools, is that they may be easily extended. You may add new states, transitions, split existing state into new ones or nest new FSM into existing state.

Title of this post may confusing to people looking for concepts to build their own compilers or parsers. Shortly, reading them requires parsing code on three levels: lexical, syntactical and semantical, but presented FSMs may be directly applied only in the first case.

In next posts we will explain in details this assembly language and show, how to made its virtual machine, where FSM will be used as a part of it.

Finite-State Machine used to parsing code

Below is an example of assembly language that we are going to parse. Each line contains zero or more words like numbers, labels, names of mnemonics or comments that are separated by whitespaces.

What each of those parts means and how to execute them isn't the case here, we will focus on splitting them, line byline and remove pointless whitespaces and comments.

main:
    int 1
    call square
    int 0
ret

square:
    store 0
    load 0
    load 0
    mul
    load 0
    mul
ret

Below are given severals notes that characterise our language:

comment starts from ; sign and ends at the end of the line,
mnemonic may be indented by using either \t or any amount of spaces,
allowed characters in mnemonics/labels are [0-9a-z:],
mnemonic can't have more that one parameter.

With this knowledge we may design following FSM:

finite state machine used to parse a simple assembly language

Finally, here's the full code of its Finite-State Machine and example of output from parsing given previous assembly source.

#!/usr/bin/env python
import sys, re

class FiniteStateMachine:
    def get_instruction_parts(self, instruction):
        """states: 0 - init, 1 - token, 2 - separator, 3 - end"""
        state, acumulator, bufer = 0, [], ''
        for character in instruction + '\0':
            # init
            if state == 0:
                 # init -> token (append)
                if re.compile(r"[0-9a-z]").match(character):
                    bufer += character
                    state = 1 
                # init -> separator (skip)
                elif character in [' ']:
                    state = 2
                # init -> end (append)
                elif character in [';', '\0']:
                    state = 3
                else:
                    print "error, to transition from init state (%c)" % character

            # token
            elif state == 1:
                # token -> token (append)
                if re.compile(r"[0-9a-z.:]").match(character):
                    bufer += character
                    state = 1 
                # token -> separator (skip)
                elif character == ' ':
                    acumulator.append(bufer)
                    bufer = ''
                    state = 2
                # token -> end (append)
                elif character in [';', '\0']:
                    acumulator.append(bufer)
                    state = 3
                else:
                    print "error, to transition from token state (%c)" % character

            # separator
            elif state == 2:
                 # separator -> token (append)
                if re.compile(r"[0-9a-z]").match(character):
                    bufer += character
                    state = 1 
                # separator -> separator (skip)
                elif character  == ' ':
                    state = 2
                # separator -> end (append)
                elif character in [';', '\0']:
                    state = 3
                else:
                    print "error, to transition from eparator state (%c)" % character

            # end 
            elif state == 3:
                pass
 
        return acumulator


if __name__=="__main__":
    fsm = FiniteStateMachine()
    for line in open(sys.argv[1], 'r').read().splitlines():
        print fsm.get_instruction_parts(line)

bash-3.2$ python instructiondecoder.py cube.asm 
['main:']
['int', '1']
['call', 'square']
['int', '0']
['ret']
[]
['square:']
['store', '0']
['load', '0']
['load', '0']
['mul']
['load', '0']
['mul']
['ret']

Conclusions

One can ask, why create FSM and not use functions like split(), join() and replace() instead? The major benefit is ability to extend one created FSM. It can be also easily presented by using its diagrams. As example, mentioned code may be easily extended to deal with C-like strings (marked by " sign, where " sign may be also masked in the body of string).

Robert Gawron