Chapter 4: create a custom parser
After we made a graphical representation of the syntax of a language, it's quite easy to create a parser for that language out of it. Consider each graph as a flowchart for the corresponding procedure or keyword. And each flowchart corresponds with a few lines of source code. By combining the code snippets into one bigger procedure, the parser is done! For the time being we will consider each symbol as one-token words. As a result, each parser will look somthing like:
MODULE Parser; FROM InOut IMPORT Read; VAR ch : CHAR; PROCEDURE S; BEGIN (* Declaration of the subroutines *) END S; BEGIN Read (ch); S END Parser.In which 'S' is the start-symbol (the first line in an EBNF diagram). Below, the graphs are shown once more, this time with the code fragments on it's side.
IF ch IN L1 THEN P (S1) ELSIF ch IN L2 THEN P (S2) ... ELSIF ch IN Ln THEN P (Sn) ELSE error END or CASE ch OF L1 : P (S1) | L2 : P (S2) | ... Ln : P (Sn) ELSE error END
P (S1); P (S2); ... P (S3);
WHILE ch IN L DO P (S) END
IF ch IN L THEN P (S) END
IF ch = "x" THEN Read (ch) ELSE error END
Remember the language and corresponding graph from Example 5? The simplified graph is on the right. The EBNF
is below.
EXAMPLE 5:
A = "x" | "(" B ")".
B = AC.
C = { "+" A }
Look how we will translate this graph into a parser module:
MODULE Pars5;
FROM InOut IMPORT Read, WriteString, WriteLn;
VAR ch : CHAR;
PROCEDURE error;
BEGIN
WriteString ("Error in parsing");
WriteLn;
HALT
END error;
PROCEDURE A;
BEGIN
IF ch = "x" THEN
Read (ch)
ELSIF ch = "(" THEN
Read (ch);
A;
WHILE ch = "+" DO
Read (ch);
A
END;
IF ch = ")" THEN Read (ch) ELSE error END
ELSE
error
END
END A;
BEGIN
REPEAT
Read (ch);
A
UNTIL ch = '.';
WriteString ("No errors in syntax.");
WriteLn
END Pars5.
See what happens if we compile this:
jan@beryllium:~/tmp/modula$ mocka Mocka 0608m >> i Pars5 >> p Pars5 .. Compiling Program Module Pars5 I/0003 II/0003 .. Linking Pars5 >> ./Pars5 (x+x). No errors in syntax! >> ./Pars5 Hi. Error in parsing >> ./Pars5 ((x+(x+x)). Error in parsing >> ./Pars5 ((x+(x+x))). No errors in syntax! >>I call this 'hopeful'!
One more example: the SAB language of example 1. The 'compiled graph' is on the right, the EBNF is below. Make
sure you understand the graph and convince yourself that it is correct. If you have questions, send a message
to mocka@yahoogroups.com.
S = AB. A = x | y. B = z | w.We are going to use the graph to Modula-2 conversions of the B1-B6 figures above. That leads to the following parser:
MODULE Parse1;
IMPORT InOut;
VAR ch : CHAR;
PROCEDURE error;
BEGIN
InOut.WriteString ("Syntax error.");
InOut.WriteLn;
HALT
END error;
PROCEDURE A;
BEGIN
CASE ch OF
'x',
'y' : InOut.Read (ch)
ELSE
error
END
END A;
PROCEDURE B;
BEGIN
CASE ch OF
'z',
'w' : InOut.Read (ch)
ELSE
error
END
END B;
PROCEDURE S;
BEGIN
A;
B;
END S;
BEGIN
REPEAT
InOut.Read (ch);
S
UNTIL ch = '.';
InOut.WriteString ("Syntax correct");
InOut.WriteLn
END Parse1.
See what happens if we compile this:
jan@beryllium:~/tmp/modula$ mocka Mocka 0608m >> i Parse1 >> p Parse1 .. Compiling Program Module Parse1 I/0005 II/0005 .. Linking Parse1 >> ./Parse1 xz. Syntax correct >> ./Parse1 xy. Syntax error. >> ./Parse1 yw. Syntax correctNow I understand why Senor Droopy can be so happy....
Chapter 5: A scanner for EBNF
Until now, we have made scanners for one specific language or EBNF style production. Now it is the time to set out to make a universal scanner, based on some input mechanism. We start out with an EBNF scanner. EBNF can be expressed in EBNF itself:
production = identifier "=" expression "." .
expression = term {|term}.
term = factor {factor}.
factor = letter | literal | "(" expression ")" | "[" expression "]" | "{" expression "}".
A 'Letter' is any ASCII token from the set: {"A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P"
"Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"}. There are some reasons to use the graph on the right instead of the
full list:
A 'Digit' is any ASCII character from the set {"0" "1" "2" "3" "4" "5" "6" "7" "8" "9"}. The same reasons
apply for Digit with respect to showing only a subset of the branches.
A 'Literal' is ANY printable ASCII character, enclosed between double quotes (").
In this example, an identifier is the name of an object, consisting of one single letter.
A 'Production' is a recepy for constructing the language in question.
In this case: "Identifier = expression ."
An expression consists of a series of one or more 'Terms'.
A term consists of a Factor, followed by zero or more other factors.
A factor is defined to contain a Letter, a Literal or an expression between sorts of brackets. Now we only
need to combine the graphs and simplify the graph. Let's see how far we're going to come here. f you look at
the production rules and the graphs, things tend to get messy real soon... The production boils down to an
infinite series of parallel and series 'connected' diagrams of 'Factor'. So I aim to keep it a bit simple
here...
And this is what it looks like... I think this graph cannot be simplified. It cab however be desimplified by full replacement of each 'factor n' rectangle by the equivalent on the right.

EBNF.mod
Below is the sourcecode for the EBNF checker depicted above. This source does not comply with that in the Wirth books for several reasons:
MODULE EBNF;
FROM InOut IMPORT Read, Write, WriteLn, WriteString, WriteBf;
CONST empty = " ";
VAR sym : CHAR;
PROCEDURE GetSym;
BEGIN
REPEAT
Read (sym);
Write (sym)
UNTIL sym > empty
END GetSym;
PROCEDURE error;
BEGIN
WriteString ("Syntax incorrect");
WriteLn;
HALT
END error;
PROCEDURE expression;
PROCEDURE term;
PROCEDURE factor;
BEGIN
IF ("A" <= CAP (sym)) & (CAP (sym) <="Z") THEN (* nonterminal symbol *)
GetSym
ELSIF sym = '"' THEN (* terminal symbol *)
GetSym;
GetSym;
IF sym = '"' THEN GetSym ELSE error END
ELSIF sym = "(" THEN
GetSym;
expression;
IF sym = ")" THEN GetSym ELSE error END
ELSIF sym = "[" THEN
GetSym;
expression;
IF sym = "]" THEN GetSym ELSE error END
ELSIF sym = "{" THEN
GetSym;
expression;
IF sym = "}" THEN GetSym ELSE error END
ELSE
error
END
END factor;
BEGIN (* term *)
factor;
WHILE ("A" <= sym) & (sym <= "Z") OR (sym = '"') OR (sym ="(") OR (sym = "[") OR (sym = "{") DO
factor
END
END term;
BEGIN (* expression *)
term;
WHILE sym = "|" DO
GetSym;
term
END
END expression;
BEGIN (* main program *)
GetSym;
REPEAT
IF ("A" <= sym) & (sym <= "Z") THEN GetSym ELSE error END;
IF sym = "=" THEN GetSym ELSE error END;
expression;
IF sym = "." THEN
WriteBf;
GetSym
ELSE
error
END;
UNTIL sym="~"
END EBNF.
As you can see, the program closely follows the written EBNF rules:
| EBNF | Modula-2 |
|---|---|
| expression = term {|term}. |
PROCEDURE expression;
BEGIN
term;
WHILE sym = "|" DO
GetSym;
term
END
END expression;
|
| An expression expects a 'term' as its first argument. If there are extra term's, they need to be preceeded by the pipe token '|'. So, as long as the next sym is a '|', the next symbol is a term as well so we need to check it. | |
| term = factor {factor}. |
PROCEDURE term
BEGIN
factor;
WHILE ("A" <= sym) & (sym <= "Z") OR (sym = '"')
OR (sym ="(") OR (sym = "[") OR (sym = "{") DO
factor
END
END term;
|
| Same here: a term must start with a 'Factor' and may be followed by other Factors. The condition in the WHILE statement checks if the next symbol is part of the set 'First (Factor)'. | |
Now let's see if it runs:
jan@beryllium:~/modula/m4m$ ./EBNF <SAB
S = AB.
A = x| y.
B = z|w.
~
jan@beryllium:~/modula/m4m$ ./EBNF
A = "x" | "(" B ")". B = AC. C = { "+" A }. ~
A = "x" | "(" B ")". B = AC. C = { "+" A }. ~
jan@beryllium:~/modula/m4m$
This is what you get when you check the EBNF syntax of the SAB example and from example 5 from previous
chapters. I can only say: it works, for these simple grammars.
Page created on 4 September 2007 and
Page equipped with FroogleBuster technology