Chapter 4: create a custom parser

After we made a graphical representation of the syntax of a language, it's quite easy to create a parser for that language out of it. Consider each graph as a flowchart for the corresponding procedure or keyword. And each flowchart corresponds with a few lines of source code. By combining the code snippets into one bigger procedure, the parser is done! For the time being we will consider each symbol as one-token words. As a result, each parser will look somthing like:

	MODULE Parser;
	
	FROM	InOut		IMPORT	Read;
	
	VAR	ch		: CHAR;
	
	PROCEDURE S;
	
	BEGIN
	   (* 	Declaration of the subroutines		*)
	END S;
	
	BEGIN
	   Read (ch);
	   S
	END Parser.
   
In which 'S' is the start-symbol (the first line in an EBNF diagram). Below, the graphs are shown once more, this time with the code fragments on it's side.

B1

Every structure with the form on the right will be replaced by either of the following code fragments:
	IF  ch IN L1  THEN  P (S1)
	ELSIF  ch IN L2  THEN  P (S2)
	   ...
	ELSIF  ch IN Ln  THEN  P (Sn)
	ELSE
	  error
	END
or
	CASE  ch  OF
	 L1 : P (S1)	|
	 L2 : P (S2)	|
	    ...
	 Ln : P (Sn)
	ELSE
	  error
	END
   


B2

Each graph with this kind of shape is translated into:
	P (S1);
	P (S2);
	 ...
	P (S3);
   


B3

Every structure with the shape of {s} on the right is translated by:
	WHILE  ch IN L  DO
	  P (S)
	END
   


B4

The graph on the right, corresponding with [s] in EBNF, is translated into
	IF  ch IN L  THEN  P (S)  END
   


B5

Every graph with an element in a rectangular box is replaced by a call to the mentioned procedure P (S).


B6

Every reference to a terminal symbol (i.e. enclosed in a circle) is replaced by a conditional READ command like in
	IF  ch = "x"  THEN  Read (ch)  ELSE  error  END
   


Remember the language and corresponding graph from Example 5? The simplified graph is on the right. The EBNF is below.

EXAMPLE 5:

	A = "x" | "(" B ")".
	B = AC.
	C = { "+" A }
   
Look how we will translate this graph into a parser module:
	MODULE Pars5;
	
	FROM	InOut		IMPORT	Read, WriteString, WriteLn;
	
	VAR	ch		: CHAR;


	PROCEDURE error;

	BEGIN
	   WriteString ("Error in parsing");
	   WriteLn;
	   HALT
	END error;


	PROCEDURE A;
	
	BEGIN
	   IF  ch = "x"  THEN
	      Read (ch)
	   ELSIF  ch = "("  THEN
	      Read (ch);
	      A;
	      WHILE  ch = "+"  DO
	         Read (ch);
		 A
	      END;
	      IF  ch = ")"  THEN  Read (ch)  ELSE  error  END
	   ELSE
	      error
	   END
	END A;
	
	BEGIN
	   REPEAT
	      Read (ch);
	      A
	   UNTIL  ch = '.';
	   WriteString ("No errors in syntax.");
	   WriteLn
	END Pars5.
   
See what happens if we compile this:
jan@beryllium:~/tmp/modula$ mocka
Mocka 0608m
>> i Pars5
>> p Pars5
.. Compiling Program Module Pars5 I/0003 II/0003
.. Linking Pars5
>> ./Pars5
(x+x).
No errors in syntax!
>> ./Pars5
Hi.
Error in parsing
>> ./Pars5
((x+(x+x)).
Error in parsing
>> ./Pars5
((x+(x+x))).
No errors in syntax!
>>
   
I call this 'hopeful'!


One more example: the SAB language of example 1. The 'compiled graph' is on the right, the EBNF is below. Make sure you understand the graph and convince yourself that it is correct. If you have questions, send a message to mocka@yahoogroups.com.

	S = AB.
	A = x | y.
	B = z | w.
   
We are going to use the graph to Modula-2 conversions of the B1-B6 figures above. That leads to the following parser:
MODULE Parse1;

IMPORT	InOut;

VAR	ch		: CHAR;


PROCEDURE error;

BEGIN
   InOut.WriteString ("Syntax error.");
   InOut.WriteLn;
   HALT
END error;


PROCEDURE A;

BEGIN
   CASE  ch  OF
    'x',
    'y' : InOut.Read (ch)
   ELSE
      error
   END
END A;

PROCEDURE B;

BEGIN
  CASE  ch  OF
   'z',
   'w' : InOut.Read (ch)
  ELSE
    error
  END
END B;


PROCEDURE S;

BEGIN
   A;
   B;
END S;

BEGIN
   REPEAT
      InOut.Read (ch);
      S
   UNTIL ch = '.';
   InOut.WriteString ("Syntax correct");
   InOut.WriteLn
END Parse1.
   
See what happens if we compile this:
jan@beryllium:~/tmp/modula$ mocka
Mocka 0608m
>> i Parse1
>> p Parse1
.. Compiling Program Module Parse1 I/0005 II/0005
.. Linking Parse1
>> ./Parse1
xz.
Syntax correct
>> ./Parse1
xy.
Syntax error.
>> ./Parse1
yw.
Syntax correct
   
Now I understand why Senor Droopy can be so happy....

Chapter 5: A scanner for EBNF

Until now, we have made scanners for one specific language or EBNF style production. Now it is the time to set out to make a universal scanner, based on some input mechanism. We start out with an EBNF scanner. EBNF can be expressed in EBNF itself:

	production   = identifier "=" expression "." .
	expression   = term {|term}.
	term	     = factor {factor}.
	factor	     = letter | literal | "(" expression ")" | "[" expression "]" | "{" expression "}".
   


A 'Letter' is any ASCII token from the set: {"A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"}. There are some reasons to use the graph on the right instead of the full list:


A 'Digit' is any ASCII character from the set {"0" "1" "2" "3" "4" "5" "6" "7" "8" "9"}. The same reasons apply for Digit with respect to showing only a subset of the branches.


A 'Literal' is ANY printable ASCII character, enclosed between double quotes (").


In this example, an identifier is the name of an object, consisting of one single letter.


A 'Production' is a recepy for constructing the language in question.
In this case: "Identifier = expression ."


An expression consists of a series of one or more 'Terms'.


A term consists of a Factor, followed by zero or more other factors.


A factor is defined to contain a Letter, a Literal or an expression between sorts of brackets. Now we only need to combine the graphs and simplify the graph. Let's see how far we're going to come here. f you look at the production rules and the graphs, things tend to get messy real soon... The production boils down to an infinite series of parallel and series 'connected' diagrams of 'Factor'. So I aim to keep it a bit simple here...

And this is what it looks like... I think this graph cannot be simplified. It cab however be desimplified by full replacement of each 'factor n' rectangle by the equivalent on the right.


EBNF.mod

Below is the sourcecode for the EBNF checker depicted above. This source does not comply with that in the Wirth books for several reasons:

MODULE EBNF;

FROM	InOut		IMPORT	Read, Write, WriteLn, WriteString, WriteBf;
CONST	empty		= " ";
VAR	sym		: CHAR;


PROCEDURE GetSym;

BEGIN
  REPEAT 
    Read (sym); 
    Write (sym)
  UNTIL sym > empty
END GetSym;


PROCEDURE error;
	
BEGIN
  WriteString ("Syntax incorrect");
  WriteLn;
  HALT
END error;


PROCEDURE expression;

  PROCEDURE term;

    PROCEDURE factor;

    BEGIN
      IF ("A" <= CAP (sym)) & (CAP (sym) <="Z") THEN	(* nonterminal symbol	*)
	GetSym
	ELSIF sym = '"' THEN					(* terminal symbol	*)
	GetSym;
	GetSym;
	IF sym = '"' THEN GetSym ELSE error END
      ELSIF sym = "(" THEN
	GetSym;
	expression;
	IF sym = ")" THEN GetSym ELSE error END
      ELSIF sym = "[" THEN
	GetSym;
	expression;
	IF sym = "]" THEN GetSym ELSE error END
      ELSIF sym = "{" THEN  
	GetSym;
	expression;
	IF sym = "}" THEN GetSym ELSE error END
      ELSE  
	error
      END
    END factor;

  BEGIN					(* term *)
    factor;
    WHILE ("A" <= sym) & (sym <= "Z") OR (sym = '"') OR (sym ="(") OR (sym = "[") OR (sym = "{") DO 
      factor
    END
  END term;

BEGIN 					(* expression *)
  term;
  WHILE sym = "|" DO  
     GetSym;
     term
  END
END expression;

BEGIN 					(* main program	*)
  GetSym;
  REPEAT
    IF  ("A" <= sym) & (sym <= "Z")  THEN  GetSym  ELSE  error  END;
    IF  sym = "="  THEN  GetSym  ELSE  error  END;
    expression;
    IF  sym = "."  THEN  
      WriteBf;
      GetSym
    ELSE 
      error
    END;
  UNTIL sym="~"
END EBNF.
   
As you can see, the program closely follows the written EBNF rules:
EBNF Modula-2
expression = term {|term}.
PROCEDURE  expression;

BEGIN
  term;
  WHILE sym = "|" DO
    GetSym;
    term
  END
END expression;
     
An expression expects a 'term' as its first argument. If there are extra term's, they need to be preceeded by the pipe token '|'. So, as long as the next sym is a '|', the next symbol is a term as well so we need to check it.
term = factor {factor}.
PROCEDURE term

BEGIN
  factor;
  WHILE ("A" <= sym) & (sym <= "Z") OR (sym = '"')
	OR (sym ="(") OR (sym = "[") OR (sym = "{") DO
    factor
  END
END term;
Same here: a term must start with a 'Factor' and may be followed by other Factors. The condition in the WHILE statement checks if the next symbol is part of the set 'First (Factor)'.

Now let's see if it runs:

jan@beryllium:~/modula/m4m$ ./EBNF <SAB
S = AB.
A = x| y.
B = z|w.
~
jan@beryllium:~/modula/m4m$ ./EBNF
A = "x" | "(" B ")". B = AC. C = { "+" A }. ~
A = "x" | "(" B ")". B = AC. C = { "+" A }. ~
jan@beryllium:~/modula/m4m$
   
This is what you get when you check the EBNF syntax of the SAB example and from example 5 from previous chapters. I can only say: it works, for these simple grammars.


Page created on 4 September 2007 and

Page equipped with FroogleBuster technology