Compiler construction 1986

In the other file I told about the dutch book we happened to have in our posession which is dealing with creating and being able to extend a compiler for a non-trivial language. This book was the dutch edition of Compiler Construction by Niklaus Wirth some years before.
Although I realised that this book was special in that it uses Modula-2 as programming language and because of the added topics by Teus de Jong, I still was reluctant to put a translation of it online.

I came back from that opinion. The 1986 book is a mere 122 pages of which close to 50% is dedicated to source code listings. So the rest is not very much and thepages are small: close to A5 format. So, in A4 parlance, the book is just slightly over 60 pages.
Yet, this version of Compiler construction was written by a still young and relatively inexperienced Wirth. Sometimes it looks like you can only understand the later Compiler Construction books when you have closely followed the carreer of Niklaus Wirth. With this early book, this is not necessary yet since the career was still starting...

Therefore I decided to make summaries of each chapter and put these in a few HTML files. You are now reading the first of them. The essence of this topic is to present the book, in english, and to play with the assets of the book.

Chapter 1: Formal languages

A formal language contains sentences that are not made up of words, but with symbols. Syntax rules are composed to specify and recognize meaningful language constructs.

A sentence contains two aspects: a subject and a predicate. Our PET language is defined as follows:

EXAMPLE 0:

	sentence	= subject predicate
	subject		= cats | dogs
	predicate	= eat | sleep
   
The vertical bar '|' means 'OR'. If we now start the substitutions for the PET language, we end up with the following valid constructs:
	cats eat 
	cats sleep
	dogs eat
	dogs sleep
   
Now lets pretend we have an SAB language:
EXAMPLE 1:

	S = AB.
	A = x | y.
	B = z | w.
   
The SAB language is presented in what is known as BNF (Backus Naur Form). The first line is the BNF top level specification and all following lines of text give one more layer of explanation. The above BNF means: This makes our SAB language having four sentences: 'xz', 'xw', 'yz', 'yw'.

We can add recursion to the language like this:

EXAMPLE 2:

	S = xA.
	A = z | yA.
   
Here, x, y and z are the terminal symbols. S and A are productions. The shortest sentence we can make now is 'xz'. The next shortest is 'xyz'. Each time when we choose 'A = z', the substitution stops. But als long as we choose 'A = yA', substitution is in recursion. So the first five sentences we can formulate from this BNF set is
	xz
	xyz
	xyyz
	xyyyz
	xyyyyz
   
In EBNF (the Extended BNF) there are two extra operators: {} and []. These act as repeaters for the enclosed symbols:

operator does
{ σ } repeat symbol σ for any number of times, zero times included
[ σ ] repeat symbol σ zero or one time
ε Short for 'the empty set' or ∅

Which enables us to reformulate our second BNF production example to:

	S = x {y} z.
   
which makes the recursive element pop out immediately! And we don't need the auxilliary production 'A' any longer.
In the same breath we present the '[]' operator. It's meaning:
	S = x [y] z  => 'xz' , 'xyz' 
   

Chapter 2: analysing sentences

Building a sentence is nice, but a compiler is mostly working on analyzing sentences for correctness. A sentence is fed into the compiler and it must then try to resolve the correctness of it. For this, it is good to have the EBNF set of the related language and then play it in reverse.
Let's replay the SAB language of example 1:

EXAMPLE 1:

	S = AB.
	A = x | y.
	B = z | w.
   
The target is to see if the sentence 'xw' is part of the SAB language. We do so by top-down parsing. We know that 'S = AB' and 'A = x|y'. So we start subtitutions:

Step Action Formula Remaining Result
1 S xw Apply top level BNF rule
2 S = AB AB xw Substitute first element of 'A'
3 Substituted 'x' for 'A' xB xw We have a match!
4 Simplify formula B w Substitute element of 'B'
5 Use B = 'z' z w No match; try once more
6 Use B = 'w' w w Match!

So we have proved that 'xw' is part of the SAB language by repeated substitutions as specified in the EBNF description of the SAB language.

But things tend to be more complex than this case. Let's have a look at the following language:

      
EXAMPLE 3:

	S = A | B.
	A = xA | y.
	B = xB | z.
   
Can we now prove that 'xxz' is part of the language? Keep in mind that we are now a computer program, able to do 5 billion things in one second, but not able to see more than one letter ahead! So our future looks bright if we see that the 'A' production starts with the letter 'x' as in 'xxz'.

Step Action Formula Remaining Result
1 S xxz Apply top level BNF rule
2 S = A A xxz Substitute first element of 'A'
3 Use A = xA xA xxz We have a match!
4 Simplify formula A xz Resubstitute A
5 Use A = xA xA xz No match
6 Simplify A z Resubstitute A
7 Use A = y y z No match!

For the computer it now is perfectly clear that there will never be a match. The terminal symbols 'z' and 'y' are different and cannot be resubstituted any further. So that's the end of it all.
Of course, we, human beings, saw with one eye closed that choosing the 'S = B' path would lead to a very fast match, but this is due to our human abillity to grasp information by looking ahead close to a full page. Computers prefer to look ahead only in small amounts. Resulting in a 'No match'.

In this case, the problem arose because both the A and the B production started tith the same terminal symbol: 'x'. For this we introduce Rule 1 for syntax analysis:

Rule 1:

If we have productions Ri ranging from i=0 to i=n then the initial symbols must be distinct between all Ri.

In a formula:
	first (Ri) ∪ first (Rj) = ∅			for every i and j 
   
Now let's see if we can find the sentence 'x' from the following definition. See how far we can come.
EXAMPLE 4:

	S = Ax.
	A = [x].
   

Step Action Formula Remaining Result
1 S x Apply toplevel EBNF rule
2 S = Ax Ax x Substitute A
3 Use A = x xx x We have a partial match
4 Simplify x &empty Impossible

Things looked so hopefull. And then it all fell in pieces. The second substitution should not have been 'Use A = x' but 'Use A = ∅'. To counter for this kind of problem, we introduce

Rule 2:

For every non-terminal symbol 'A' which can result in ∅ the union of sets of first (A) and follow (A) symbols must be empty.

In example 4 we have:
	first (A) = x
	follow (A) = x
	first (A) ∪ follow (A)    <=>    {x} ∪ {x} # ∅
   

Chapter 3: syntax graphs

Expressing a syntax with EBNF is just one of the many options. One other option is to use graphs to represent a syntax. For us, humans, graphs are more explicit. The following graphs are EBNF dictated graphs.

A1

Every non-terminal symbol which can be represented by a series of productions as in 'A = R1 | R2 | R3 | ... | Rn' will be denoted in a graph as show on the right.


A2

Every term α1, α2 ... αn is depicted in this kind of graph


A3

If an element has the EBNF notation {S} it will be depicted graphically like this:


A4

If an element has the EBNF notation [S] it will be depicted graphically like this:


A5

If an element is a non-terminal symbol 'A' it is encased in a rectangle


A6

If an element is a terminal symbol 'S' it is enclosed in a circle


As of now terminal symbols will be enclosed in double quotes: "α" since that makes the easier to notice. Let's kick off with another example:

EXAMPLE 5:

	A = "x" | "(" B ")".
	B = AC.
	C = { "+" A }
   
producing, among others, the following sentences:
	x
	(x)
	(x+x)
	((x))
	((x+(x+x)))
   
Now, instead of elaborate analysis and trials let's just pour the EBNF rules in graphs like defined above. The result is on the right. As you can see, the graphs are just made by conbnecting the above elements into a unity. Like a colouring book for children.

The only thing left for now is substituting the B and C productions into the A production. the result is the graph below left. It just tells it all. In graphical language, Rule 1 says that two branches may not start with the same symbol. And Rule 2 says that in any point of the graph, it must be clear if the next symbol is still part of the current or already part of the next production.


Page created on 2 September 2007 and

Page equipped with FroogleBuster technology