Compiler construction 1986
In the other file I told about the dutch book we happened to have in our posession which is dealing with
creating and being able to extend a compiler for a non-trivial language. This book was the dutch edition of
Compiler Construction by Niklaus Wirth some years before.
Although I realised that this book was special in that it uses Modula-2 as programming language and because of
the added topics by Teus de Jong, I still was reluctant to put a translation of it online.
I came back from that opinion. The 1986 book is a mere 122 pages of which close to 50% is dedicated to source
code listings. So the rest is not very much and thepages are small: close to A5 format. So, in A4 parlance,
the book is just slightly over 60 pages.
Yet, this version of Compiler construction was written by a still young and relatively inexperienced Wirth.
Sometimes it looks like you can only understand the later Compiler Construction books when you have closely
followed the carreer of Niklaus Wirth. With this early book, this is not necessary yet since the career was
still starting...
Therefore I decided to make summaries of each chapter and put these in a few HTML files. You are now reading the first of them. The essence of this topic is to present the book, in english, and to play with the assets of the book.
Chapter 1: Formal languages
A formal language contains sentences that are not made up of words, but with symbols. Syntax rules are composed to specify and recognize meaningful language constructs.
A sentence contains two aspects: a subject and a predicate. Our PET language is defined as follows:
EXAMPLE 0: sentence = subject predicate subject = cats | dogs predicate = eat | sleepThe vertical bar '|' means 'OR'. If we now start the substitutions for the PET language, we end up with the following valid constructs:
cats eat cats sleep dogs eat dogs sleepNow lets pretend we have an SAB language:
EXAMPLE 1: S = AB. A = x | y. B = z | w.The SAB language is presented in what is known as BNF (Backus Naur Form). The first line is the BNF top level specification and all following lines of text give one more layer of explanation. The above BNF means:
We can add recursion to the language like this:
EXAMPLE 2: S = xA. A = z | yA.Here, x, y and z are the terminal symbols. S and A are productions. The shortest sentence we can make now is 'xz'. The next shortest is 'xyz'. Each time when we choose 'A = z', the substitution stops. But als long as we choose 'A = yA', substitution is in recursion. So the first five sentences we can formulate from this BNF set is
xz xyz xyyz xyyyz xyyyyzIn EBNF (the Extended BNF) there are two extra operators: {} and []. These act as repeaters for the enclosed symbols:
| operator | does |
|---|---|
| { σ } | repeat symbol σ for any number of times, zero times included |
| [ σ ] | repeat symbol σ zero or one time |
| ε | Short for 'the empty set' or ∅ |
Which enables us to reformulate our second BNF production example to:
S = x {y} z.
which makes the recursive element pop out immediately! And we don't need the auxilliary production 'A' any
longer.
S = x [y] z => 'xz' , 'xyz'
Chapter 2: analysing sentences
Building a sentence is nice, but a compiler is mostly working on analyzing sentences for correctness. A
sentence is fed into the compiler and it must then try to resolve the correctness of it. For this, it is good
to have the EBNF set of the related language and then play it in reverse.
Let's replay the SAB language of example 1:
EXAMPLE 1: S = AB. A = x | y. B = z | w.The target is to see if the sentence 'xw' is part of the SAB language. We do so by top-down parsing. We know that 'S = AB' and 'A = x|y'. So we start subtitutions:
| Step | Action | Formula | Remaining | Result |
|---|---|---|---|---|
| 1 | S | xw | Apply top level BNF rule | |
| 2 | S = AB | AB | xw | Substitute first element of 'A' |
| 3 | Substituted 'x' for 'A' | xB | xw | We have a match! |
| 4 | Simplify formula | B | w | Substitute element of 'B' |
| 5 | Use B = 'z' | z | w | No match; try once more |
| 6 | Use B = 'w' | w | w | Match! |
So we have proved that 'xw' is part of the SAB language by repeated substitutions as specified in the EBNF
description of the SAB language.
But things tend to be more complex than this case. Let's have a look at the following language:
EXAMPLE 3: S = A | B. A = xA | y. B = xB | z.Can we now prove that 'xxz' is part of the language? Keep in mind that we are now a computer program, able to do 5 billion things in one second, but not able to see more than one letter ahead! So our future looks bright if we see that the 'A' production starts with the letter 'x' as in 'xxz'.
| Step | Action | Formula | Remaining | Result |
|---|---|---|---|---|
| 1 | S | xxz | Apply top level BNF rule | |
| 2 | S = A | A | xxz | Substitute first element of 'A' |
| 3 | Use A = xA | xA | xxz | We have a match! |
| 4 | Simplify formula | A | xz | Resubstitute A |
| 5 | Use A = xA | xA | xz | No match |
| 6 | Simplify | A | z | Resubstitute A |
| 7 | Use A = y | y | z | No match! |
For the computer it now is perfectly clear that there will never be a match. The terminal symbols 'z' and 'y'
are different and cannot be resubstituted any further. So that's the end of it all.
Of course, we, human beings, saw with one eye closed that choosing the 'S = B' path would lead to a very fast
match, but this is due to our human abillity to grasp information by looking ahead close to a full page.
Computers prefer to look ahead only in small amounts. Resulting in a 'No match'.
In this case, the problem arose because both the A and the B production started tith the same terminal symbol: 'x'. For this we introduce Rule 1 for syntax analysis:
first (Ri) ∪ first (Rj) = ∅ for every i and jNow let's see if we can find the sentence 'x' from the following definition. See how far we can come.
EXAMPLE 4: S = Ax. A = [x].
| Step | Action | Formula | Remaining | Result |
|---|---|---|---|---|
| 1 | S | x | Apply toplevel EBNF rule | |
| 2 | S = Ax | Ax | x | Substitute A |
| 3 | Use A = x | xx | x | We have a partial match |
| 4 | Simplify | x | &empty | Impossible |
Things looked so hopefull. And then it all fell in pieces. The second substitution should not have been 'Use A = x' but 'Use A = ∅'. To counter for this kind of problem, we introduce
first (A) = x
follow (A) = x
first (A) ∪ follow (A) <=> {x} ∪ {x} # ∅
Chapter 3: syntax graphs
Expressing a syntax with EBNF is just one of the many options. One other option is to use graphs to represent a syntax. For us, humans, graphs are more explicit. The following graphs are EBNF dictated graphs.
As of now terminal symbols will be enclosed in double quotes: "α" since that makes the easier to notice.
Let's kick off with another example:
EXAMPLE 5:
A = "x" | "(" B ")".
B = AC.
C = { "+" A }
producing, among others, the following sentences:
x (x) (x+x) ((x)) ((x+(x+x)))Now, instead of elaborate analysis and trials let's just pour the EBNF rules in graphs like defined above. The result is on the right. As you can see, the graphs are just made by conbnecting the above elements into a unity. Like a colouring book for children.
The only thing left for now is substituting the B and C productions into the A production. the result is the
graph below left. It just tells it all. In graphical language, Rule 1 says that two branches may not start
with the same symbol. And Rule 2 says that in any point of the graph, it must be clear if the next symbol is
still part of the current or already part of the next production.
Page created on 2 September 2007 and
Page equipped with FroogleBuster technology