1 Consider the following grammar (describing LISP arithmetic):

X -> ( E )

E -> O | O T

O -> + | * | - | /

T -> n | X

X == executable, E == expression, T == term, n == number

terminals == ( ) n + * - /

Find FIRST, FOLLOW and LR(0) sets for this grammar.

Is the grammar LR(0)? Is it SLR?

2. Give a rightmost derivation of the string (x+a)*x using:

S -> E

E -> E+T | T

T -> T*F | F

F -> i | (E)

The lexical analyzer returns a token i==identifier for variables ‘x’ and ‘a’.

Display the parse tree, the syntax tree, and the expression DAG.

3. The algorithm for DOM in the text is based on data flow analysis, but it is often desirable to find the DOM tree from the control flow graph without the need to do data flow. Describe a possible algorithm based on breadth-first search to find DOM given a control flow graph. (An overview description in English is sufficient; you do not need a formal specification or code of an algorithm.)

Dr. Ernesto Gomez : CSE 570/670

The structure of a computer language translator

1. Overview and practical issues

Recalling Lecture 1 – we are going to define a computer language using a formal

description because we want to be able to decide in an unambiguous way if input

text is in the language or not. (This is a big part of the reason for the invention

of formal systems – mathematicians got into arguments about whether something

was a proof of a theorem or not. Formal systems came into being as definitions

of the rules of reasoning that are acceptable for proving stuff, so all could agree

that if the proof followed the rules, it was a correct proof). When we do this,

we split the translation problem into two hopefully simpler chunks: First, decide

if the text follows the syntax rules. If it doesn’t, we reject it as not being in the

language, rather than trying to guess what the programmer really wanted to say.

Once we know some string is in the language, then we decide what it means (the

semantics). Language theorists have several ways of defining meaning, but from

the point of view of compilers, we need to translate the input text into actions by

the computer. We therefore use operational semantics – the text means what the

computer is required to do.

How should we implement these concepts? We have seen that we have a range

of options, from a traditional interpreter to a compiler that outputs machine code,

with a whole range of possibilities in between. There are advantages and disadvan-

tages to every design choice, and as we have seen, there are real world versions of

every one of them.

Consider first the traditional compiler: we want it to translate from source text

to machine language. We could just build a large, monolithic program that incor-

porates both the input in a computer language and the machine executable output.

The problem here is, both the input language and the machine environment are mov-

ing targets. Languages are revised periodically, sometimes drastically. For example,

there have been five official revisions of the C++ standard, in 1998, 2003, 2011,

2014 and 2017, and a new revision is due this year. But this is not the whole story

– C++ started in 1985, and a lot changed before the International Organization for

Standardization (ISO) defined the 1998 standard (for example, templates were not

in the original language).

The runtime environment also changes. Early compilers (COBOL, FORTRAN,

others) could build on the assumption that the generated code would have the ma-

chine to itself, but modern compilers need to target both the machine architecture

and the operating system that runs on it. Both of these change quickly with time,

and are not unique – at any given time we may need to generate code for multiple

versions of multiple architectures (different versions of Intel, AMD and ARM pro-

cessors just to cover the most basic types), and operating systems (multiple versions

of Windows, Linux, Unix, iOS, and others).

1. Notes on Code generation and analysis (see Ch. 5, 8, 9, Aho and Ullman)

What to generate depends on source language, architecture of target machine or

runtime system.

Common intermediate codes: 3 address, high-level ( Lisp/Scheme, C, Modula 2)

parse tree/CFG. I-code is still a research topic.

Three-address code.

Statements of form x := y op z

types:

assignment

binary op x := y op z => OP x y z

unary op x := op y => OP x y

copy – x := y => MOV x y

indexed assignment

x[i] := y => MOVI x y offset(i)

x := y[i] (i is offset on address of x | y)

pointer/address assignment

*x := y => MOV contents(address(x)) y

x := *y => MOV x contents(address(y))

x := &y => MOV x address(y)

jumps

unconditional jump: GOTO label(x) => JMP x

conditional jump: IF x op y GOTO z => COND x y z

calls

param x => PAR x

..

call label(x) number-parm(y) => GOSUB x y

..

return value(x) => RET x

Loops := conditional backward jumps

// do {} while

label z

{ loop body }

if(x op y) jmp z

or

// while{}

label w

if NOT (x op y) jmp z

{ loop body }

jmp w

label z
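The loop templates above can be exercised with a small code-emitter sketch. The `emit`/`new_label` helpers and the `ifFalse` mnemonic are invented for this illustration, not taken from the text:

```python
# Sketch: lower `while (cond) body` to three-address code using the template
#   label w; if NOT cond jmp z; { body }; jmp w; label z
code = []

def emit(instr):
    code.append(instr)

def new_label(counter=[0]):
    counter[0] += 1
    return f"L{counter[0]}"

def gen_while(cond, body):
    w, z = new_label(), new_label()
    emit(f"label {w}")
    emit(f"ifFalse {cond} jmp {z}")   # conditional forward jump
    for stmt in body:
        emit(stmt)
    emit(f"jmp {w}")                  # unconditional backward jump
    emit(f"label {z}")

gen_while("i < n", ["t1 := i + 1", "i := t1"])
print("\n".join(code))
```

Note that the forward jump target `z` is known here only because we generate both labels before the body; a single pass over source text would not be so lucky, which is what motivates backpatching below.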


Logic (if-then-else) := conditional/unconditional forward jumps

if( NOT (x op y ) ) jmp z

{ true block }

jmp endif

label z

{ false block }

label endif

Note: jumps are not L-attributed (inherited values) or synthesized values:

(L-attribute means on the parse tree to the left of where we need the value)

Backward jumps – label is on LH side of tree, but arbitrarily far

Forward jumps – label is on RH side of tree.

Therefore handling labels cannot be done in the grammar; it requires a symbol table.

Implies => we can’t (easily) check correctness of IF-THEN-ELSE structure built

with jumps (so we dislike GOTOs!!). When we build structure using

jumps from IF-THEN-ELSE, we know it is correct.

Backpatching:

To handle things like addresses of forward jumps, we place marker when

we encounter the jump, fill it in on second pass after we find the labels

(error message if we don’t find the label).
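A minimal sketch of backpatching over a flat instruction list; the helper names (`emit_jump`, `define_label`) and list encoding are invented for this sketch:

```python
instrs = []   # each jump is ["jmp", target_index_or_None]
fixups = {}   # label name -> indices of jumps waiting for that label
labels = {}   # label name -> instruction index

def emit(op, arg=None):
    instrs.append([op, arg])

def emit_jump(label):
    if label in labels:                  # backward jump: target already known
        emit("jmp", labels[label])
    else:                                # forward jump: leave a placeholder
        fixups.setdefault(label, []).append(len(instrs))
        emit("jmp", None)

def define_label(label):
    labels[label] = len(instrs)
    for i in fixups.pop(label, []):      # fill in the recorded placeholders
        instrs[i][1] = labels[label]

emit_jump("end")       # forward jump, address unknown at this point
emit("body")
define_label("end")    # patches the placeholder in instrs[0]
emit("done")
```

Any entries left in `fixups` at the end of translation are exactly the undefined labels that should produce an error message.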

Analysis of larger code blocks: Expression has value as synthesized attribute

– what is the value of an assignment statement? (In C/C++ it is the value assigned, usable as a truth value.)

Values/types transferred from one statement to another via memory/symbol

table.

Problem: when can values change?

Example: the ++x, x++ operators in C/C++ cause a change during execution of a

statement, gets in the way of DAG representation/optimization

We dislike side effects because they get in the way of optimizations / interpretation of

semantics:

x = 5; y = 10;

z = y*(x + y++); // what is value of z?

// otherwise

z = (++y)*(x + y);


Change in the value of y is a side effect of the z assignment statements.
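The ambiguity can be made concrete by writing out both readings explicitly (Python has no ++, so the increment is spelled out by hand):

```python
# Reading 1: every y in the expression is read before the post-increment.
x, y = 5, 10
z_post = y * (x + y)        # 10 * (5 + 10)
y = y + 1                   # side effect lands after the expression

# Reading 2: the increment happens first (the ++y version).
x, y = 5, 10
y = y + 1
z_pre = y * (x + y)         # 11 * (5 + 11)

print(z_post, z_pre)        # 150 vs 176: same expression text, different values
```

Because the two readings disagree, an optimizer cannot safely treat the two occurrences of y as a common subexpression, which is exactly why side effects interfere with DAG representation.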

Questions: – When do we know that contents of a given memory location/register

is unchanged? (so we can use common subexpression elimination). –


When can we change statement/evaluation order without affecting meaning? (so

we can reorder things so we can keep values in registers or cache).

Statements in a basic block may be analyzed together, since they will be executed in sequence as a unit.

Dr. Ernesto Gomez : CSE 570/670

We now consider parsing algorithms. This material is in chapter 4

1. An LR parser

Having developed algorithms for FIRST and FOLLOW sets, we have seen how to

construct LR(0) sets with CLOSURE and GOTO functions, such that our LR(0) sets

are states, and the GOTO functions are transitions between states which occur

when we “read” a specific symbol, as in our finite automata. We extend the meaning

of “read” to mean “PUSH a symbol to stack top”, this happens when we move the

first symbol in the (unprocessed) input text and push it onto the stack, or when

we pop symbols from the stack corresponding to a handle (the right-hand side of a

production) and PUSH the left-hand symbol on the production on the stack. The

first case (a standard read action) gets a terminal symbol on the stack, the second

case gets a non-terminal symbol on the stack. Terminal characters are handled just

like we would in a finite automaton, non-terminals are pushed on the stack when

we reach a state which has an item of the form A → α· ; an item of this form

means we have finished processing a handle that corresponds to α and it is on stack

top, and that we use the production A → α to POP α and PUSH A.
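The shift/reduce cycle just described can be written as a small table-driven loop. This sketch uses a deliberately tiny grammar, S → (S) | x, with LR(0) tables worked out by hand for the illustration, rather than the expression grammar below:

```python
# Hand-built LR(0) states for S' -> S, S -> ( S ) | x:
#  0: start   1: accept   2: after '('   3: reduce S->x
#  4: after '(S'          5: reduce S->(S)
SHIFT = {0: {"(": 2, "x": 3}, 2: {"(": 2, "x": 3}, 4: {")": 5}}
REDUCE = {3: ("S", 1), 5: ("S", 3)}        # state -> (lhs, handle length)
GOTO = {(0, "S"): 1, (2, "S"): 4}

def parse(tokens):
    stack = [0]                            # stack of states
    i = 0
    while True:
        state, a = stack[-1], tokens[i]
        if state == 1 and a == "$":
            return True                    # accept
        if state in REDUCE:                # POP the handle, PUSH via GOTO
            lhs, n = REDUCE[state]
            del stack[-n:]
            stack.append(GOTO[(stack[-1], lhs)])
        elif a in SHIFT.get(state, {}):    # shift: read a terminal onto the stack
            stack.append(SHIFT[state][a])
            i += 1
        else:
            return False                   # no action defined: reject

print(parse(list("((x))") + ["$"]))   # True
print(parse(list("(x") + ["$"]))      # False
```

In LR(0) a reduce state reduces regardless of the lookahead, which is why `REDUCE` is keyed by state alone; SLR and LR(1) tables would consult the next symbol as well.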

We will continue to use our expression grammar for examples,

S → E

E → E + T | E → T

T → T ∗ F | T → F

F → id | F → num | F → (E)

with N = {S,E,T,F}, T = {+,∗,(,), id, num}

We will also make reference to figure 4.31, page 244 of the text, which gives an

LR(0) automaton for this grammar, built using the algorithms in the previous set

of notes and in section 4.6.2 in Aho and Ullman.

1.1. Parsing with a shift-reduce automaton. Our automaton is table driven,

much like the Deterministic Finite Automata you have worked on previously in

Lab 1. The difference is that our control table is divided into 2 sections, called the

Action table and the Goto table (these names are somewhat confusing, for example

the Goto table is not identical to the GOTO function – also note that since the

rows are the same for both tables, they are usually written as two sections of the

same control table).

The states of the parser are given by the LR sets (in our example, these are LR(0)

sets, but the same method works for LR(k), the only difference being the parse

tables).

Rows of our control table are numbered with the state numbers we assigned

to the LR sets when we constructed them. The numbering depends on the order in

which we generate the sets, and makes no difference to the parse function, the only

fixed thing is that the start state – state 0, in row 0 – is generated from a single

item S → .α corresponding to the start production.

Columns of the Action section of the control table are labelled with all the

characters ∈ T , and there is an added column for the symbol $ which denotes end

of input text (we have seen this convention when we generated the FOLLOW sets).

Columns of the Goto section are labelled with all the symbols ∈ N. Notice that

every combination (state, X ∈ N ∪ T) indexes one entry in the control table.
Dr. Ernesto Gomez : CSE 570/670

1. Miscellaneous updates

These will be incorporated into Lecture Notes 1-4, where these topics are covered

1.1. Top down parsing. I have just found a reference on modern top-down pars-

ing methods : https://www.sanity.io/blog/why-we-wrote-yet-another-parser-compiler.

Have not had a chance to review in detail, but at first glance this looks good – it

is also very new, the site went up in December 2019. It links to a selection of the

papers where the techniques are developed, and it introduces a parser generator

equivalent to Yacc for topdown parsing. This material will probably be covered in

class next year, but you can start looking at it now – If your work in the future

involves compilers or any large application that incorporates a language, you will

need to know about this.

1.2. Non-determinism and Finite Automata (this text has been incorporated

in Lecture Notes 3). In Lecture Notes 3, we suggested an approach to

constructing a deterministic finite automaton from a non-deterministic one using

depth-first search and backtracking, with a stack. The text book constructs (chap-

ter 3) a detailed algorithm, first for converting a non-deterministic finite automaton

(NFA) to a deterministic one (DFA), and then another algorithm to minimize the

resulting DFA to the smallest possible DFA that accepts the same language. We then

go full circle by converting the DFA to a regular expression – this whole sequence

serves as a proof that Regular Expressions (RE) are equivalent to NFA, and that

NFA are equivalent to DFA. That is, RE, DFA and NFA can recognize the same

set of languages.

Further, the ability to minimize a DFA gives us the ability to determine if any

pair of language definitions (RE, DFA or NFA) accept the same language – if two

RE, or two FA minimize to the same DFA, then they accept the same language –

that is, they mean the same thing.

All of this is of great theoretical interest but very little practical interest. The

problem is, the algorithm to convert NFA to DFA has exponential complexity.

Therefore the conversion can only be done for small languages and automata – the

amount of work required for a more complex automaton or expression grows too fast.

(The same applies to the approach we suggested in Notes 3 – it is a

backtracking algorithm, and such algorithms also have exponential complexity).
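For reference, the subset construction itself is short; the exponential cost lies in how many subset-states can appear. This sketch assumes an NFA without ε-moves, and the example NFA (strings over {a, b} whose second-to-last symbol is 'a') is invented here:

```python
from collections import deque

def subset_construction(nfa, start, accepting, alphabet):
    """nfa: dict mapping (state, symbol) -> set of successor states."""
    start_set = frozenset([start])
    dfa, worklist, seen = {}, deque([start_set]), {start_set}
    while worklist:
        S = worklist.popleft()
        for c in alphabet:
            # the DFA state reached from subset S on symbol c
            T = frozenset(q2 for q in S for q2 in nfa.get((q, c), ()))
            dfa[(S, c)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    dfa_accept = {S for S in seen if S & accepting}
    return dfa, start_set, dfa_accept

# NFA: q0 loops on a,b; q0 --a--> q1; q1 --a,b--> q2 (accepting)
nfa = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"},
       ("q1", "a"): {"q2"}, ("q1", "b"): {"q2"}}
dfa, start, accept = subset_construction(nfa, "q0", {"q2"}, "ab")
print(len({s for (s, _) in dfa}), "DFA states")   # 4 for this 3-state NFA
```

This 3-state NFA yields only 4 DFA states, but the family "k-th-from-last symbol is 'a'" forces roughly 2^k DFA states as k grows, which is the exponential blowup the notes refer to.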

1.3. Lex and Yacc. The front end of the compiler uses a lexical analyzer,

built on top of finite automata,

defined by regular expressions. An example of this is described in 5.7, Lecture Notes

3. You can find a RE that recognizes the format of integers, so it can read text and

pick out integers, get their value, and report this to the parser. We do this

because that means we can define something like: “addition → number+number”

and we don’t have to define the details of the number format when we define what

addition looks like.

Dr. Ernesto Gomez : CSE 570/670

1. A note on parsing strategy

Suppose we have a set of TERMINALS T= { terminals – things that compose

text in the language }. T stands for “Terminals”, they are a more general form of

the definition of alphabet Σ. The alphabet is a finite list of symbols. The set of

terminals is also a finite list, but the terminals themselves don’t have to be finite.

Imagine we want to describe the syntax for adding a list of numbers. It might look

something like:

ADD_LIST = { RESULT = SUMS, where SUMS = NUMBER or SUMS =

SUMS + NUMBER }

(In Chapter 4 we will call this kind of thing a context-free grammar).

T= {NUMBER, =, +}. Why not SUMS and RESULT? We can see that these

two words are placeholders (we will be calling them NON-TERMINALS later),

when we actually express a sum, it will look like: NUMBER = NUMBER + …

+ NUMBER. SUMS and RESULT never appear in the actual sum, all we have are

numbers, and the symbols “+” and “=”.

So : if we want to express a sum, we could start from RESULT = SUMS, then

expand SUMS into SUMS+NUMBER. We can keep on doing this to SUMS as long

as we want, then when we want to stop we use the rule SUMS = NUMBER. Then

we can add everything, and replace RESULT with whatever we have added and we

end up with NUMBER = list of numbers separated by + signs. (do we need a rule

RESULT = NUMBER?). Anyway, once we have replaced everything with numbers

and symbols + and =, we stop – there are no rules for changing a number or {+,=}

into anything else. That is why we call them “terminals” – they end (terminate) the

sequence of converting our syntactic definition to a particular sum.
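Written out step by step, the expansion just described looks like this for a sum of three numbers:

```
RESULT = SUMS
RESULT = SUMS + NUMBER              (using SUMS = SUMS + NUMBER)
RESULT = SUMS + NUMBER + NUMBER     (using SUMS = SUMS + NUMBER)
RESULT = NUMBER + NUMBER + NUMBER   (using SUMS = NUMBER)
```

At the last line only terminals remain on the right-hand side, so the expansion stops.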

(What, then, are the terms RESULT and SUMS? They don’t appear in the final

expression, we have rules that allow us to change them into something else – they are

not “terminals” so we call them “non-terminals” (sometimes we can be very literal

in our naming conventions!). Now, what are we to make of the term NUMBER?

The word NUMBER appears in our description for the sum of a list of numbers

– but when we actually write such a sum down, we will replace each instance of

NUMBER with actual numbers – NUMBER=NUMBER+NUMBER would actually

be something like 42=29+13. So NUMBER is a terminal, but it is not a fixed

symbol – rather it describes a pattern that allows us to generate the actual text.

For example, NUMBER = (−|ε)(1 . . . 9)(0 . . . 9)∗ | 0, the regular expression we used

before to describe what an integer looks like. When we said (in lecture notes 2)

that we were going to use multiple machine types in translation, this is the kind of

thing we meant. )

We will use finite automata to describe simple patterns that we will then use to

simplify higher-level definitions; this is in Chapter 3 of our text.
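As a quick check, the integer pattern above can be tried with Python's `re` module standing in for the finite automaton:

```python
import re

# (−|ε)(1...9)(0...9)* | 0  from the note above, as a Python regex
INTEGER = re.compile(r"-?[1-9][0-9]*|0")

for text in ["42", "-7", "0", "007", "x1"]:
    m = INTEGER.fullmatch(text)
    print(text, "matches" if m else "no match")
```

Note that "007" is rejected: the first alternative requires a leading nonzero digit, and the second matches only the single string "0", exactly as the pattern specifies.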

1.1. Derivation. What we have done: ADD_LIST is a set of rules that describe

what adding a list of numbers looks like – it is a formal language definition that

describes which strings in (numbers, + and = signs)∗ are actually in the language

ADD_LIST.

Dr. Ernesto Gomez : CSE 570/670

We now consider parsing algorithms. This material is in chapter 4

1. Support algorithms : First and Follow

Recall that parsing is the application of grammar definitions in reverse. Take

our example grammar for arithmetic expressions:

S → E

E → E + T | E → T

T → T ∗ F | T → F

F → id | F → num | F → (E)

with N = {S,E,T,F}, T = {+,∗,(,), id, num}

(Same grammar, with added terminal “num” and using “or” for a more compact

representation).

In order to parse, we need to have some idea of what different grammar rules can

generate, so we can make decisions on what text is a candidate for replacement by

the left-hand side of a rule. (In the terminology we defined in the previous lecture

notes, we need to identify the “handle” – which is in (N ∪ T)∗.)

FIRST sets tell us, given any symbol in (N ∪ T), what is the first terminal that

can be produced by that symbol, FOLLOW tells us what can appear immediately

after each symbol in N.


2. The following material gives a slight variation on the algorithms

presented in 4.4.2

3. First Sets:

To generate: FIRST(X) for all X ∈ N ∪T.

(1) If X ∈ T , FIRST(X) = {X}.

(2) If X → ε is a production, add ε to FIRST(X).

(3) If X → Y1Y2 . . . Yi . . . Yk is a production: Base case i = 1. Induction:

add everything in FIRST(Yi) except for ε to FIRST(X), then if ε is in

FIRST(Yi) increment i and repeat; else stop. If i > k add ε to FIRST(X)

and stop. Repeat for all productions.

(4) Repeat step 3 until there is no change to any of the FIRST sets.

4. Follow Sets:

To generate: FOLLOW(X) for all X ∈ N.

Add end marker $ ∉ N ∪ T to the symbol set.

(1) Place $ in FOLLOW(S), where S is the start symbol in G.

(2) If A → αBβ is a production, add everything in FIRST(β) except ε to

FOLLOW(B). Repeat for every production and every variable that is not

at the end of the production

(3) If A → αBβ and ε is in FIRST(β), or A → αB is a production, add

everything in FOLLOW(A) to FOLLOW(B).

(4) Repeat step 3 for every non-terminal in every production until nothing new

is added to any of the FOLLOW sets.
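The two fixed-point computations above can be sketched directly in Python for our expression grammar. Here `eps` stands for ε and `$` for the end marker; this is an illustration of the algorithm, not production code:

```python
EPS, END = "eps", "$"
G = {  # nonterminal -> list of right-hand sides
    "S": [["E"]],
    "E": [["E", "+", "T"], ["T"]],
    "T": [["T", "*", "F"], ["F"]],
    "F": [["id"], ["num"], ["(", "E", ")"]],
}
N = set(G)
TERM = {"+", "*", "(", ")", "id", "num"}

FIRST = {x: {x} for x in TERM}          # rule 1: FIRST of a terminal is itself
FIRST.update({A: set() for A in N})
changed = True
while changed:                          # rule 4: repeat until no change
    changed = False
    for A, rhss in G.items():
        for rhs in rhss:
            add = set()
            for Y in rhs:               # rule 3: walk Y1 Y2 ... while ε possible
                add |= FIRST[Y] - {EPS}
                if EPS not in FIRST[Y]:
                    break
            else:
                add.add(EPS)            # every Yi was nullable
            if not add <= FIRST[A]:
                FIRST[A] |= add
                changed = True

FOLLOW = {A: set() for A in N}
FOLLOW["S"].add(END)                    # step 1: $ into FOLLOW(start)
changed = True
while changed:                          # step 4: repeat until no change
    changed = False
    for A, rhss in G.items():
        for rhs in rhss:
            for i, B in enumerate(rhs):
                if B not in N:
                    continue
                beta, first_beta, eps_all = rhs[i + 1:], set(), True
                for Y in beta:          # step 2: FIRST(beta) minus ε
                    first_beta |= FIRST[Y] - {EPS}
                    if EPS not in FIRST[Y]:
                        eps_all = False
                        break
                add = first_beta | (FOLLOW[A] if eps_all else set())
                if not add <= FOLLOW[B]:   # step 3 when beta can vanish
                    FOLLOW[B] |= add
                    changed = True

print(FIRST["E"], FOLLOW["E"])
```

For this grammar the loops converge quickly: FIRST(E) = {id, num, (} and FOLLOW(E) = {+, ), $}, which you can confirm by hand against the rules above.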


5. LR parsing automata

We here discuss material from sections 4.5 to 4.7 in Aho and Ullman, on bottom-

up parsers using LR methods.

We are not going into recursive descent top-down methods (section 4.4) because

they only work for a strict subset of the languages that can be parsed by the LR(k)

bottom-up methods. The top-down method described in

https://www.sanity.io/blog/why-we-wrote-yet-another-parser-compiler

is more general than LR(k), it is claimed to be able to parse all the context-

free languages, whereas LR(k) can parse a subset of the deterministic context free

grammars. As developed so far, the new top-down methods and tools work sub-

stantially more slowly than LR(k) – both in complexity which is worst-case cubic,

and in speed on equivalent languages. At present, the bottom-up methods that we

will study are more practical.

Dr. Ernesto Gomez : CSE 570/670

We revisit some concepts from lecture 1, then start on formal language defini-

tions.

1. Thoughts about human and computer languages

Human languages tend to be contextual and ambiguous.

Contextual: the meaning of anything we say must be interpreted taking into

account the context in which it is said. For example: if I say “Look out! A shark!”

in class, context tells you this is an example of something. If I yell the same thing

while we are swimming in the ocean, it is either an urgent warning or a bad practical

joke. Cultural context also plays a large part in interpretation of human language.

Inoffensive statements in some cultures can be insulting in others, even when the

same language is spoken in both – when I moved from Cuba to Puerto Rico (both

Spanish speaking countries) as a child, I found that some common everyday Cuban

words were at best impolite in Puerto Rican usage.

Ambiguous: the same phrase can be understood multiple ways. In this case, we

are talking about the same expression in the same context. Consider the signs on

the stairs in this building: IN and OUT. Does IN mean in to the building, or in to

the stairwell? Or consider: if cat food is fed to cats, and baby food to babies, what

is “cheese food”?

An interesting thing about human language is that even incorrect expressions

carry meaning: for example, when we are starting to learn a language we can make

ourselves understood by others before we have fully mastered the grammar. “I have

hungry” is not correct, but we would understand it to mean the speaker is hungry.

All the above makes human languages drastically unsuitable for communicating

with a computer. People have long tried to get a human language interface to

control computers (computers that accept voice commands are the latest variation

on this). What we really want, however, is a computer that does what we mean,

and in human language we can’t always tell what we mean by what we actually

say.

Our best solution to this problem to date is to design artificial languages for

computers, with properties different from human languages. We want the following

properties:

• Freedom from ambiguity: A statement in a computer language should have

only one meaning.

• No dependence on external context: The meaning of a statement should be

understandable given only the statement itself (or at least the program of

which it is a part), and the meaning should not change with circumstances.

To these we add two practical considerations:

• It should be possible to determine the meaning of a statement without doing

too much work or spending too much time. That is, the time complexity

of translation should be small.

• The syntactic correctness of a statement should be unambiguous. That is,

there should be no doubt, independent of meaning, whether a statement is

correct (in the language) or not.

As computers become more powerful, what was considered “too much work” in the

past may become practical.

Dr. Ernesto Gomez : CSE 570/670

We will be covering material from :

Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman; “Com-

pilers – Principles, Techniques and Tools”,Second Edition, Addison Wes-

ley

(This is the classic text on Compilers – it has been around in various versions

and updates since the 1970s and has been revised and updated, most recently in

the second edition in 2007. Unlike many other texts, it gets better in every edition.

It is often called “the Dragon book”, in part because it has a dragon on the

cover, but mainly because it is not a friendly book – you will find that there are

places where you have to study one page for an hour to understand a theorem or

algorithm. It feels like the dragon is eating your brain – it is worth it, however –

or it will feel that it is worth it after your brain has been rewired. You will love

the book, appreciate and understand it the third time you read it. And you will

come back to it – it is the compiler text that has shaped everybody in the field,

the standard reference for compilers. It belongs in the professional library of every

computer scientist and engineer.)

A tentative schedule:

Chapter 1 is an introduction, chapter 2 gives an outline of how compilers work,

using top-down parsing. We will be concentrating on bottom-up parsing in the

class, this starts with chapter 3. We will study most of chapters 3, 4, 5 and then skip

around the rest of the book – details will appear in lecture notes.

We will not cover everything in the book, and we will explore some topics that

are not in the book – these will appear in the notes. All material covered in class

may appear in examinations.

1. Motivation

Why study compilers? Few of us will ever be called upon to write or modify a

compiler. Here are some justifications for all the work involved.

<1> Pragmatism: Compilers are tools that we use to build all our applications. We

need to understand the properties and limitations of our tools.

<2> Art: To the right(?) kind of mind, compilers are beautiful!

<3> Science: We need to be able to express things in a language understandable to

us, then translate it to a language understandable to machines. The problem: how

do we make sure that the meaning is translated?

• Syntax: the textual rules for expressing things in a language.

• Semantics: the meaning conveyed by the expression.

• The syntax is fully visible, but the meaning can be context dependent, may

require additional knowledge to interpret.

• Syntax + (context) + (world knowledge) => Semantics.

• Ambiguity: same syntax can mean different things. English is ambiguous.

• “Time flies like an arrow.” – What does it mean? Consider: Is the subject

“time” or “flies” (or “time flies”, a kind of fly)? Is the verb “time”, “flies”,

or “like”?

• Early computer translation example: “Out of sight, out of mind.” => to

Russian; back from Russian => “Invisible insanity.”


Two tracks: “the Dragon book”: Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman.

Dr. Ernesto Gomez : CSE 570/670

1. Syntax-directed translation

Human languages communicate information through a combination of syntax,

meaning and context. Arguably meaning and context are more important, we can

communicate well with bad grammar. One of the consequences is that human

languages convey a much richer, but less precise meaning. Computer languages are

much more limited in what they communicate, all they do is specify what actions a

machine will perform to carry out some algorithm. We design computer languages

to be unambiguous – any string in the language should have a unique meaning.

We use context free grammars because this makes meaning local to the text we

are translating – an expression in the language should be translatable only using

information that appears next to it in the text. Technically, a context-free grammar

lets us guarantee that we can always decide if a text string is in the language or

not, and we can do this efficiently (The LR grammars and shift-reduce parsers we

have defined earlier can parse in linear time, other algorithms may take somewhat

longer (and allow dealing with extensions to context-free grammars) but we can

still guarantee worst-case polynomial time, usually better than n³).

2. Simple translation without optimization.

Consider our expression grammar from chapter 4, which might be appropriate

for a calculator – we here extend it to an assign statement. We add A (assign

statement) to the non-terminals, and the symbols v,= to the terminals. In this

context, i stands for either a variable or a numeric value. We have annotated each

line with an expression in curly brackets that says what it does – the arithmetic

operators have their usual meaning, and we are adding two functions – the comment

field here describes what the functions do. We are using a pseudo-code to describe

what we do to translate the expression, but the actual code, like what we would

write in the curly brackets in Yacc, depends on what we are translating our

statements to – another language? binary code? functions of an interpreter?

2.1. Assign statement.

(1) A → v = E { store( E.value, v ) // if v is a variable, store a value to its

address in memory – else error }

(2) E1 → E + T { E1.value = E.value + T.value }

(3) E → T { E.value = T.value }

(4) T1 → T ∗F { T1.value = T.value ∗ F.value }

(5) T → F{ T.value = F.value }

(6) F → i { F.value = value_of( i ) // if i is a string, get numeric value – else

if i is a variable, get its value – else error}

(7) F → (E) { F.value = E.value }
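The semantic actions above can be run directly as a tiny interpreter over a nested-tuple parse; `memory`, `value_of`, and the tuple encoding are invented for this sketch:

```python
memory = {}

def value_of(i):
    """Production 6: numeric text gives its value, a known variable its stored value."""
    if isinstance(i, str) and i.lstrip("-").isdigit():
        return int(i)
    if i in memory:
        return memory[i]
    raise NameError(f"undefined variable {i}")

def eval_E(e):
    """E.value for an expression given as ('+', l, r), ('*', l, r), or a leaf i."""
    if isinstance(e, tuple):
        op, l, r = e
        lv, rv = eval_E(l), eval_E(r)
        return lv + rv if op == "+" else lv * rv   # productions 2 and 4
    return value_of(e)                             # productions 6 and 7

def assign(v, e):
    memory[v] = eval_E(e)                          # production 1: store(E.value, v)

assign("a", ("+", "2", "3"))               # a = 2 + 3
assign("b", ("*", "a", ("+", "a", "1")))   # b = a * (a + 1)
print(memory["a"], memory["b"])            # 5 30
```

In Yacc the same actions would sit in the curly brackets of each production and run at reduce time; here the recursion over tuples plays the role of the parser's reductions.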

Note that we have added something that is not a syntactic property, in productions

1 and 6, the term “variable”. We can define the syntax for what a variable should

look like (a string that begins with a letter, followed by more letters, digits, special

characters like $ and _ which ends just before a space, line end, punctuation or

operator which are not allowed in the name ). In modern languages, however, such

a name is only a variable if it ha���������

���

����������� �

������������

�! #”%$’&)(+*,(+*+-/.�01&)23$54563798:(1*+;+$'(5;16=<
>#[email protected]:C�DEA:C3F:?=A:F:GIHKJLGMC=NOJLP�QRG�S�P!CTJUG�VWJL?=AXD’A:C=NOA:@YQ3ZEF:?=P:?=H

[ P:CTJLG�VWJU?=A:D]^JU_3G�@BG`A:C3ZEC3F�P:a�AXCWbTJU_=ZcC3FedfG�[email protected]%?=HiJBQRGjZcCTJUG`kUl=kUGMJUG`NmJLA:nTZEC3FhZcCTJLP

A!SMS�P!?3CTJfJL_3G�S�P!CTJUG�VWJoZcC�dp_3ZqSr_jZsJ#ZqHpHUA:ZEN ut P:kpGMVvA:@Bl3DcG!uZsauwxHLAgbYyUz’PWP:n{P!?vJ`|~}HU_=AXkLn�| €

ZEC�S�DqA:HLHM3SMP:CTJUGMVTJoJLGMDEDEHpb!P:?OJU_3ZqHpZqH#AXCjG�V3A:@�l=DcG‚P:aƒHUP:@BG�JL_3ZcC=F w„a~wfb!GMDED�JU_3G�[email protected]�JU_=ZcC3F

[email protected]@BZEC3F)ZEC‚JU_=GfPvS�GIAXC’gZcJƒZqH1GMZcJU_3G`kƒA:C‚?3kUF!GMCTJ1doAXkLC3ZEC3F#P!k1A#Q�A:N‚l3krA:S�JUZqSMA:D

… P!n:G [ ?3DcJU?3krAXD’SMP:CTJUGMVWJoA:DEHUP%l=DEAgbvHxAYDqAXkLF:G�l=AXkUJoZcCjZcCTJLGMkLl3kUGMJLAXJUZEP:C�P:[email protected]{AXCODEA:C3F:?=A:F:G

w†C3P:‡RG`C=HUZcˆ!G%HiJLA‰[email protected]ŠZECeHUP:@BGBS�?3DcJU?=kUGIHŠS`AXC�QRGBZcC�Hi?3DcJUZEC3FjZcCePXJL_3GMkrHM�G`ˆ:GMCedp_3GMC�JU_3G

[email protected]�DEA:C3F:?�AXF:G�ZqH#HUl�P!n:GMCjZEC�Q�P:JU_�‹Kdp_3G`C�[email protected]‰ˆ:GINOaŒ[email protected] [ ?3Q=A�JLP{Ž?3G`kiJLP�pZESMP‘ŒQRPXJL_

’ l�AXC3ZqHi_�HUlRG`AXnWZEC3F�S�P!?3CTJUkLZcGIHL“uA!HŽA�Sr_3ZEDqN5!wƒaŒP!?3C=N�JU_�A‰JxHUP:@BG#SMP:@[email protected]:C{GMˆ!GMkLbvN3Agb [ ?3Q=A:C

dxP:krN3HxdfG`kUG�A‰J#QRG`HiJ#[email protected]�lRP:DEZcJUG‚ZECjK?3GMkUJUP{pZqSMA:Cj?�HUA:F:G

}#@%Q3ZEF:?3P!?=HMƒJL_3G‚HUA:@�G�l3_3krA:HUG�SMAXCjQRG�?=C=NvGMkrHiJUPWPvN�@Y?=DsJLZcl3DEG‚doAgbWH w†C^JL_3ZEHpS`A:HUG:WdxG

A:kUG�JrAXDEnTZEC3F^A:Q�P!?vJ#JL_3G%[email protected]‚G�Vvl3kLG`HLHUZcP!CjZEC�JL_3G%[email protected]�S�P!CTJUG�VWJ [ P_C=HUZEN3GMkpJL_3G%HUZcF!C=H#P!C

JL_3G‚H”JrAXZEkLHfZcCOJU_=ZEHoQ3?=ZcDqNvZEC3F=~w”•–AXC�Nj—�˜)™ 3š PWG`How”•›@BG`A:C^ZEC^JLP%JL_3G�Q3?3ZEDEN3ZcC3F�vP:koZEC^JLP

JL_3GŠHiJLA:ZckLdfG`DcDœ�—ŠkfSMP:C=HUZEN3GMkI+Zca1SMAXJŽaŒPWPWN�ZEHKaŒG`NBJLP%S`A‰JLH`WAXC=N{Q�AXQWbYaŒPWPvNBJLPYQ=A:Q3ZcGIHMTdp_=AXJ

ZqHŽyiSr_=GMG`HUG�aŒPWPWNW€”œ

}#CžZcCTJUG`kUGIH”JLZcC=F�JU_3ZEC3FeA:Q�[email protected]{AXC,DqAXC=F:?=A:F:GBZEH�JU_=AXJ%G`ˆ:GMCŸZEC=S�P!kUkLG`S�J�G�Vvl3kLG`HLHiZEP:C�H

S`AXkLkUb�@BGIAXC3ZEC3F=1aŒP:kfG�[email protected]:Tdp_3G`C�dfGŠAXkLG#H”JrAXkUJUZEC3F‚JUPYDcGIAXkLC�A‚DEA:C3F:?�AXF:GpdxG)S`AXC{@{AXn!G

P!?3kLHUGMDEˆ:GIHƒ?3C=NvG`kLHiJUPWPvN�QTb%PXJL_3GMkrHuQ�GMaŒP:kLGpdfGp_�Agˆ:GxaŒ?3DEDcb%@BA!H”JLGMkLG`N%JU_=GpF:[email protected]@{AXk yiw1_=Agˆ!G

_W?3C3F!kUb!€)ZqHoC3PXJ#SMP:kLkUGIS J`TQ3?vJ#dxG�dfP!?3DENO?3C=N3GMkrH”JrAXC=N^ZsJpJLP%@BG`A:C^JL_3G‚HilRG`A:n:G`kxZqHx_W?3C3F:kLb

}#DED�JU_=G�A:Q�P‰ˆ!G)@{AXn!G`[email protected]{AXC^DqAXC=F:?=A:F:G`HfNvkrA:HiJUZqSMAXDEDEb�?=C=Hi?=ZsJrAXQ3DEG�aŒP:koS�P:@[email protected]%?3C3ZqSMA‰JLZcC=F

dpZcJU_¡AŸS�[email protected]�l=?vJUG`k uGMP!l3DEG�_=Agˆ:GjDcP!C3FŸJUkLZEG`NžJUP,F:GMJ�AŸ[email protected]:C9DqAXC=F:?=A:F:G�ZcCTJLGMkUa¢A:S�GjJLP

SMP:CTJUkLP:D�S�[email protected]�l=?vJUG`kLH�¢SMP:@Bl3?vJLGMkrHKJL_=A‰J#A!SMS�G`lvJoˆ:P:ZqS�G�S�[email protected]Dr. Ernesto Gomez : CSE 570/670

1. Miscellaneous updates

These will be incorporated into Lecture Notes 1-4, where these topics are covered

1.1. Top down parsing. I have just found a reference on modern top-down pars-

ing methods : https://www.sanity.io/blog/why-we-wrote-yet-another-parser-compiler.

Have not had a chance to review in detail, but at first glance this looks good – it

is also very new, the site went up in December 2019. It links to a selection of the

papers where the techniques are developed, and it introduces a parser generator

equivalent to Yacc for topdown parsing. This material will probably be covered in

class next year, but you can start looking at it now – If your work in the future

involves compilers or any large application that incorporates a language, you will

need to know about this.

1.2. Non-determinism and Finite Automata (this text has been incorporated in Lecture Notes 3). In Lecture Notes 3, we suggested an approach to

constructing a deterministic finite automaton from a non-deterministic one using

depth-first search and backtracking, with a stack. The text book constructs (chap-

ter 3) a detailed algorithm, first for converting a non-deterministic finite automaton

(NFA) to a deterministic one (DFA), and then another algorithm to minimize the resulting DFA to the smallest possible DFA that accepts the same language. We then

go full circle by converting the DFA to a regular expression – this whole sequence

serves as a proof that Regular Expressions (RE) are equivalent to NFA, and that

NFA are equivalent to DFA. That is, RE, DFA and NFA can recognize the same

set of languages.

Further, the ability to minimize a DFA gives us the ability to determine if any

pair of language definitions (RE, DFA or NFA) accept the same language – if two

RE, or two FA minimize to the same DFA, then they accept the same language –

that is, they mean the same thing.

All of this is of great theoretical interest but very little practical interest. The

problem is, the algorithm to convert NFA to DFA has exponential complexity.

Therefore the conversion can only be done for small languages and automata – the

amount of work required for a more complex automaton or expression grows too fast.

(The same applies to the approach we suggested in Notes 3 – it is a backtracking algorithm, and such algorithms also have exponential complexity.)
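To make the subset construction concrete, here is a small sketch in Python (our own illustration, not the book's algorithm); it omits ε-moves for brevity. The exponential blowup mentioned above comes from the fact that each DFA state is a set of NFA states, and there can be 2^n such sets for an n-state NFA:

```python
# Sketch of the NFA-to-DFA subset construction (illustrative, no epsilon moves).
# Each DFA state is a frozenset of NFA states.
from collections import deque

def nfa_to_dfa(nfa, start, alphabet):
    """nfa: dict mapping (state, symbol) -> set of next NFA states."""
    start_set = frozenset([start])
    dfa = {}                              # (state-set, symbol) -> state-set
    worklist = deque([start_set])
    seen = {start_set}
    while worklist:
        current = worklist.popleft()
        for sym in alphabet:
            # union of all NFA moves from any state in the current set
            nxt = frozenset(s for q in current for s in nfa.get((q, sym), ()))
            dfa[(current, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                worklist.append(nxt)
    return dfa, seen

# NFA for strings over {a, b} ending in "ab" (a toy example of ours)
nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}}
dfa, states = nfa_to_dfa(nfa, 0, 'ab')
print(len(states))  # 3
```

Here the DFA stays small, but in the worst case the loop visits exponentially many state sets, which is exactly why the conversion is only practical for small automata.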

1.3. Lex and Yacc. The front end to the compiler uses a lexical analyzer, built on top of finite automata,

defined by regular expressions. An example of this is described in 5.7, Lecture Notes

3. You can find a RE that recognizes the format of integers, so it can read text and

pick out integers, get their value, and report this to the parse program. We do this

because that means we can define something like: “addition → number+number”

and we don’t have to define the details of the number format when we define what

addition looks like.
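As an illustration of the idea (our own sketch, not LEX output), a few lines of Python can use the integer regular expression to pick numbers out of text, get their values, and report token/value pairs to the parser; the token names NUMBER and PLUS are invented for the example:

```python
# Sketch of a lexer that recognizes integers by regular expression and
# reports (token, value) pairs. NUMBER and PLUS are illustrative names.
import re

TOKEN_RE = re.compile(r"\s*(?:(?P<NUMBER>[1-9][0-9]*|0)|(?P<PLUS>\+))")

def tokens(text):
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"bad input at position {pos}")
        pos = m.end()
        kind = m.lastgroup                     # which named group matched
        # report the token kind, converting NUMBER lexemes to their value
        yield (kind, int(m.group(kind)) if kind == 'NUMBER' else m.group(kind))

print(list(tokens("29 + 13")))
# [('NUMBER', 29), ('PLUS', '+'), ('NUMBER', 13)]
```

The parser then sees only the token stream, never the digit-by-digit format of a number, which is exactly the division of labor described above.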

The structure of a computer language translator

1. Overview and practical issues

Recalling Lecture 1 – we are going to define a computer language using a formal

description because we want to be able to decide in an unambiguous way if input

text is in the language or not. (This is a big part of the reason for the invention

of formal systems – mathematicians got into arguments about whether something

was a proof of a theorem or not. Formal systems came into being as definitions

of the rules of reasoning that are acceptable for proving stuff, so all could agree

that if the proof followed the rules, it was a correct proof). When we do this,

we split the translation problem into two hopefully simpler chunks: First, decide

if the text follows the syntax rules. If it doesn’t, we reject it as not being in the

language, rather than trying to guess what the programmer really wanted to say.

Once we know some string is in the language, then we decide what it means (the

semantics). Language theorists have several ways of defining meaning, but from

the point of view of compilers, we need to translate the input text into actions by

the computer. We therefore use operational semantics – the text means what the

computer is required to do.

How should we implement these concepts? We have seen that we have a range

of options, from a traditional interpreter to a compiler that outputs machine code,

with a whole range of possibilities in between. There are advantages and disadvantages to every design choice, and as we have seen, there are real world versions of

every one of them.

Consider first the traditional compiler: we want it to translate from source text

to machine language. We could just build a large, monolithic program that incor-

porates both the input in a computer language and the machine executable output.

The problem here is, both the input language and the machine environment are moving targets. Languages are revised periodically, sometimes drastically. For example,

there have been five official revisions of the C++ standard, in 1998, 2003, 2011, 2014 and 2017, and a new revision is due this year. But this is not the whole story

– C++ started in 1985, and a lot changed before the International Organization for

Standardization (ISO) defined the 1998 standard (for example, templates were not

in the original language).

The runtime environment also changes. Early compilers (COBOL, FORTRAN,

others) could build on the assumption that the generated code would have the ma-

chine to itself, but modern compilers need to target both the machine architecture

and the operating system that runs on it. Both of these change quickly with time,

and are not unique – at any given time we may need to generate code for multiple

versions of multiple architectures (different versions of Intel, AMD and ARM processors, just to cover the most basic types) and operating systems (multiple versions of Windows, Linux, Unix, iOS, and others).

1. A note on parsing strategy

Suppose we have a set of TERMINALS T= { terminals – things that compose

text in the language }. T stands for “Terminals”, they are a more general form of

the definition of alphabet Σ. The alphabet is a finite list of symbols. The set of

terminals is also a finite list, but the terminals themselves don’t have to be finite.

Imagine we want to describe the syntax for adding a list of numbers. It might look

something like:

ADD_LIST = { RESULT = SUMS, where SUMS = NUMBER or SUMS =

SUMS + NUMBER }

(In Chapter 4 we will call this kind of thing a context-free grammar).

T= {NUMBER, =, +}. Why not SUMS and RESULT? We can see that these

two words are placeholders (we will be calling them NON-TERMINALS later),

when we actually express a sum, it will look like: NUMBER = NUMBER + …

+ NUMBER. SUM and RESULT never appear in the actual sum, all we have are

numbers, and the symbols “+” and “=”.

So : if we want to express a sum, we could start from RESULT = SUMS, then

expand SUMS into SUMS+NUMBER. We can keep on doing this to SUMS as long

as we want, then when we want to stop we use the rule SUMS = NUMBER. Then we can add everything, and replace RESULT with whatever we have added, and we

end up with NUMBER = list of numbers separated by + signs. (do we need a rule

RESULT = NUMBER?). Anyway, once we have replaced everything with numbers

and symbols + and =, we stop – there are no rules for changing a number or {+,=}

into anything else. That is why we call them “terminals” – they end (terminate) the

sequence of converting our syntactic definition to a particular sum.
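The expansion just described can be played out mechanically; this little sketch (our own illustration) performs the derivation by string rewriting, one rule application per step:

```python
# Derivation by string rewriting, using the rules of ADD_LIST.
# At each step exactly one SUMS is present, so str.replace rewrites it.
s = "RESULT = SUMS"
s = s.replace("SUMS", "SUMS + NUMBER")   # SUMS -> SUMS + NUMBER
s = s.replace("SUMS", "SUMS + NUMBER")   # apply the same rule again
s = s.replace("SUMS", "NUMBER")          # stop with SUMS -> NUMBER
print(s)  # RESULT = NUMBER + NUMBER + NUMBER
```

Note that RESULT is still standing at the end, which is the point of the parenthetical question in the notes: we would need one more rule to replace it.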

(What, then, are the terms RESULT and SUMS? They don’t appear in the final

expression, we have rules that allow us to change them into something else – they are

not “terminals” so we call them “non-terminals” (sometimes we can be very literal

in our naming conventions!). Now, what are we to make of the term NUMBER?

The word NUMBER appears in our description for the sum of a list of numbers

– but when we actually write such a sum down, we will replace each instance of

NUMBER with actual numbers – NUMBER=NUMBER+NUMBER would actually

be something like 42=29+13. So NUMBER is a terminal, but it is not a fixed

symbol – rather it describes a pattern that allows us to generate the actual text.

For example, NUMBER = (−|ε)(1 . . . 9)(0 . . . 9)∗ | 0, the regular expression we used

before to describe what an integer looks like. When we said in (lecture notes 2)

that we were going to use multiple machine types in translation, this is the kind of

thing we meant. )

We will use finite automata to describe simple patterns that we will then use to

simplify higher-level definitions, this is in Chapter 3 of our text.

1.1. Derivation. What we have done: ADD_LIST is a set of rules that describe

what adding a list of numbers look like – it is a formal language definition, that

describes which strings in (numbers, + and = signs)∗ are actually in the language

ADD_LIST.

We now consider parsing algorithms. This material is in chapter 4

1. Support algorithms : First and Follow

Recall that parsing is the application of grammar definitions in reverse. Take

our example grammar for arithmetic expressions:

S → E

E → E + T | E → T

T → T ∗ F | T → F

F → id | F → num | F → (E)

with N = {S,E,T,F}, T = {+,∗,(,), id, num}

(Same grammar, with added terminal “num” and using “or” for a more compact

representation).

In order to parse, we need to have some idea of what different grammar rules can

generate, so we can make decisions on what text is a candidate for replacement by

the left-hand side of a rule. (In the terminology we defined in the previous lecture notes, we need to identify the “handle” – which is in (N ∪ T)∗.)

FIRST sets tell us, for any symbol in (N ∪ T), which terminals can appear first in a string derived from that symbol; FOLLOW tells us which terminals can appear immediately after each symbol in N.


2. The following material gives a slight variation on the algorithms

presented in 4.4.2

3. First Sets:

To generate: FIRST(X) for all X ∈ N ∪T.

(1) If X ∈ T , FIRST(X) = {X}.

(2) If X → ε is a production, add ε to FIRST(X).

(3) If X → Y1Y2 . . . Yi . . . Yk is a production: Base case i = 1. Induction: add everything in FIRST(Yi) except ε to FIRST(X); then if ε is in FIRST(Yi), increment i and repeat; else stop. If i > k, add ε to FIRST(X) and stop. Repeat for all productions.

(4) Repeat step 3 until there is no change to any of the FIRST sets.

4. Follow Sets:

To generate: FOLLOW(X) for all X ∈ N.

Add an end marker $ ∉ N ∪ T to the symbol set.

(1) Place $ in FOLLOW(S), where S is the start symbol in G.

(2) If A → αBβ is a production, add everything in FIRST(β) except ε to

FOLLOW(B). Repeat for every production and every variable that is not

at the end of the production.

(3) If A → αBβ and ε is in FIRST(β), or A → αB is a production, add

everything in FOLLOW(A) to FOLLOW(B).

(4) Repeat steps 2 and 3 for every production until nothing new is added to any of the FOLLOW sets.
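The two fixed-point constructions above can be sketched in Python (our own illustration, using the expression grammar from these notes):

```python
# Fixed-point computation of FIRST and FOLLOW, following the steps above.
# Grammar symbols are plain strings; EPS marks the empty string epsilon.
EPS = 'eps'

grammar = {                      # the expression grammar from the notes
    'S': [['E']],
    'E': [['E', '+', 'T'], ['T']],
    'T': [['T', '*', 'F'], ['F']],
    'F': [['id'], ['num'], ['(', 'E', ')']],
}
nonterminals = set(grammar)
terminals = {'+', '*', '(', ')', 'id', 'num'}

def first_sets():
    first = {t: {t} for t in terminals}          # step 1: FIRST(a) = {a}
    first.update({n: set() for n in nonterminals})
    changed = True
    while changed:                               # step 4: iterate to fixed point
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                for sym in body:                 # step 3: scan Y1..Yk
                    first[head] |= first[sym] - {EPS}
                    if EPS not in first[sym]:
                        break
                else:                            # every Yi can derive epsilon
                    first[head].add(EPS)
                if len(first[head]) != before:
                    changed = True
    return first

def follow_sets(first):
    follow = {n: set() for n in nonterminals}
    follow['S'].add('$')                         # step 1: $ in FOLLOW(start)
    changed = True
    while changed:                               # step 4: iterate to fixed point
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in nonterminals:
                        continue
                    before = len(follow[sym])
                    eps_all = True               # does the rest derive epsilon?
                    for b in body[i + 1:]:       # step 2: FIRST(beta) - {eps}
                        follow[sym] |= first[b] - {EPS}
                        if EPS not in first[b]:
                            eps_all = False
                            break
                    if eps_all:                  # step 3: beta empty or nullable
                        follow[sym] |= follow[head]
                    if len(follow[sym]) != before:
                        changed = True
    return follow

first = first_sets()
follow = follow_sets(first)
print(sorted(first['E']))    # ['(', 'id', 'num']
print(sorted(follow['E']))   # ['$', ')', '+']
```

Running this reproduces what the algorithm gives by hand: every FIRST set of a non-terminal here is {(, id, num}, and FOLLOW(T) additionally picks up ∗ from the production T → T ∗ F.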


5. LR parsing automata

We here discuss material from sections 4.5 to 4.7 in Aho and Ullman, on bottom-

up parsers using LR methods.

We are not going into recursive descent topdown methods (section 4.4) because

they only work for a strict subset of the languages that can be parsed by the LR(k)

bottom-up methods. The top-down method described in

https://www.sanity.io/blog/why-we-wrote-yet-another-parser-compiler

is more general than LR(k); it is claimed to be able to parse all the context-free languages, whereas LR(k) parses only the deterministic context-free languages. As developed so far, the new top-down methods and tools work substantially more slowly than LR(k) – both in complexity, which is worst-case cubic,

and in speed on equivalent languages. At present, the bottom-up methods that we

will study are more practical.

We now consider parsing algorithms. This material is in chapter 4

1. An LR parser

Having developed algorithms for FIRST and FOLLOW sets, we have seen how to construct LR(0) sets with CLOSURE and GOTO functions, such that our LR(0) sets

are states, and the GOTO functions are transitions between states which occur

when we “read” a specific symbol, as in our finite automata. We extend the meaning of “read” to mean “PUSH a symbol to stack top”; this happens when we take the first symbol of the (unprocessed) input text and push it onto the stack, or when

we pop symbols from the stack corresponding to a handle (the right-hand side of a

production) and PUSH the left-hand symbol of the production onto the stack. The

first case (a standard read action) gets a terminal symbol on the stack, the second

case gets a non-terminal symbol on the stack. Terminal characters are handled just

like we would in a finite automaton; non-terminals are pushed on the stack when we reach a state which has an item of the form A → α. ; an item of this form

means we have finished processing a handle that corresponds to α and it is on stack

top, and that we use the production A → α to POP α and PUSH A.
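The stack discipline just described can be sketched as a table-driven loop. The grammar and tables below are a hand-built toy of ours (E → E + id | id), not the full expression grammar of the notes, but the shift/reduce/accept mechanics are the same:

```python
# Minimal table-driven shift-reduce driver for the toy grammar
#   0: S -> E    1: E -> E + id    2: E -> id
productions = [('S', 1), ('E', 3), ('E', 1)]   # (head, body length)

# Action table: (state, terminal) -> ('s', state) | ('r', production) | ('acc',)
action = {
    (0, 'id'): ('s', 2),
    (1, '+'): ('s', 3), (1, '$'): ('acc',),
    (2, '+'): ('r', 2), (2, '$'): ('r', 2),    # reduce E -> id
    (3, 'id'): ('s', 4),
    (4, '+'): ('r', 1), (4, '$'): ('r', 1),    # reduce E -> E + id
}
goto = {(0, 'E'): 1}                           # Goto section: non-terminals

def parse(tokens):
    stack = [0]                                # stack of states
    tokens = tokens + ['$']                    # end-of-input marker
    i = 0
    while True:
        act = action.get((stack[-1], tokens[i]))
        if act is None:
            return False                       # syntax error: empty table entry
        if act[0] == 'acc':
            return True
        if act[0] == 's':                      # shift: consume token, push state
            stack.append(act[1])
            i += 1
        else:                                  # reduce: pop handle, use Goto
            head, length = productions[act[1]]
            del stack[-length:]
            stack.append(goto[(stack[-1], head)])

print(parse(['id', '+', 'id']))  # True
print(parse(['id', '+']))        # False
```

Notice that a reduce move does not consume input: after popping the handle and pushing the Goto state, the same lookahead token is examined again, exactly as in the description above.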

We will continue to use our expression grammar for examples,

S → E

E → E + T | E → T

T → T ∗ F | T → F

F → id | F → num | F → (E)

with N = {S,E,T,F}, T = {+,∗,(,), id, num}

We will also make reference to figure 4.31, page 244 of the text, which gives an

LR(0) automaton for this grammar, built using the algorithms in the previous set

of notes and in section 4.6.2 in Aho and Ullman.

1.1. Parsing with a shift-reduce automaton. Our automaton is table driven,

much like the Deterministic Finite Automata you have worked on previously in

Lab 1. The difference is that our control table is divided into 2 sections, called the

Action table and the Goto table (these names are somewhat confusing, for example

the Goto table is not identical to the GOTO function – also note that since the

rows are the same for both tables, they are usually written as two sections of the same control table).

The states of the parser are given by the LR sets (in our example these are LR(0) sets, but the same method works for LR(k), the only difference being the parse tables).

Rows of our control table are numbered with the state numbers we assigned

to the LR sets when we constructed them. The numbering depends on the order in

which we generate the sets, and makes no difference to the parse function; the only fixed thing is that the start state – state 0, in row 0 – is generated from a single item S → .α corresponding to the start production.

Columns of the Action section of the control table are labelled with all the

characters ∈ T , and there is an added column for the symbol $ which denotes end

of input text (we have seen this convention when we generated the FOLLOW sets).

Columns of the Goto section are labelled with all the symbols ∈ N. Notice that

every combination (state, X ∈ N ∪ T |

Look at section 3.5 in the text, which goes into how LEX works and how to use it.

There is a useful summary of the material in Chapter 3, starting on page 189.

1. Parsing

Read section 4.1, which is a general introduction to parsing. In particular, 4.1.2

lists three equivalent grammar specifications for the same arithmetic expressions –

we will be using these, and variants of them in most of the examples. 4.1.3 and

4.1.4 are useful background, but: With modern compilers and computer speeds, the

compilation cycle is much faster than it used to be – trying to find and report all the

errors in a program can be necessary if the edit-compile program cycle takes hours

(typically it took a day or more before the personal computer era), but when you

can compile in minutes or less on an interactive system, it is legitimate to report

the first one or two errors and halt the compilation. In practice, if we see a list of

50 errors (as an example), most of us will correct the first two and then try again,

based on the experience that many of the later errors are triggered by earlier errors; many will disappear when we fix the early ones.

1.1. Context-Free languages and grammars (CFG).

1.1.1. Parsing and Derivation. We have defined a language L as a subset, selected

from a set of strings that are combinations of specific building blocks. For regular

languages, the building blocks were a set of symbols we called an alphabet Σ; we later allowed some of our building blocks to be more complicated things that could be defined by regular expressions (keywords like if and while, things like number, built up of a restricted set of alphabetic characters and having a format that can be defined by a regular expression (RE)). We have called this the set of terminals

T. Therefore L ⊆ T∗ (or Σ∗) and we have a set of rules defined in some way that

say what combinations of symbols are in the set L and what combinations are not.

We divide languages into classes, depending on how complicated the rules can

get, and what theoretical machine can implement the rules. Our simplest class, the

regular languages, are defined by a regular expression and an alphabet Σ. There

is an allowed start state, denoting the symbol(s) that can appear at the start

of a string, and what can be done next after seeing a particular character. The

rules can be used to generate a string by successive application of rules; this is

called a derivation. Or, given a string, the rules can be used to decide if the

string is in the language or not, called a parse. We have seen that a particular

machine, the Finite State Automaton (FA), which can be deterministic (DFA) or

non-deterministic (NFA), is equivalent to a regular expression, and in fact that

NFA is equivalent to DFA. This means – any derivation that can be done using the

regular expression can also be done by a DFA or an NFA, and that any language that

can be defined by a regular expression.

1. Notes on Code generation and analysis (see Ch. 5, 8, 9, Aho + Ullman)

What to generate depends on source language, architecture of target machine or

runtime sys.

Common intermediate codes: 3 address, high-level ( Lisp/Scheme, C, Modula 2)

parse tree/CFG. I-code is still research topic.

Three-address code.

Statements of form x := y op z

types:

assignment

binary op x := y op z => OP x y z

unary op x := op y => OP x y

copy – x := y => MOV x y

indexed assignment

x[i] := y => MOVI x y offset(i)

x := y[i] (i is offset on address of x | y)

pointer/address assignment

*x := y => MOV contents(address(x)) y

x := *y => MOV x content(address(y))

x := &y => MOV x address(y)

jumps

unconditional jump: GOTO label(x) => JMP x

conditional jump: IF x op y GOTO z => COND x y z

calls

param x => PARM x => PAR x

..

call label(x) number-parm(y) => GOSUB x y

..

return value(x) => RET x
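As an illustration (our own sketch, not from the text), three-address statements of the form x := y op z can be generated from an expression tree by introducing a fresh temporary for each operator:

```python
# Sketch: linearize an expression tree into three-address statements,
# introducing a fresh temporary t1, t2, ... for each interior operator node.
counter = 0
def new_temp():
    global counter
    counter += 1
    return f"t{counter}"

def gen(node, code):
    """node is a leaf name/constant (str) or a tuple (op, left, right);
    appends three-address statements to `code` and returns the result name."""
    if isinstance(node, str):
        return node                      # leaves need no code
    op, left, right = node
    l = gen(left, code)                  # generate operands first
    r = gen(right, code)
    t = new_temp()
    code.append(f"{t} := {l} {op} {r}")  # one x := y op z statement per node
    return t

code = []
result = gen(('*', 'b', ('+', 'c', 'd')), code)   # a = b * (c + d)
code.append(f"a := {result}")
print('\n'.join(code))
# t1 := c + d
# t2 := b * t1
# a := t2
```

The copy statement at the end is the MOV form from the list above; a real generator would also fold it into the last operation when possible.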

Loops := conditional backward jumps

// do {} while

label z

{ loop body }

if(x op y) jmp z

or

// while{}

label w

if NOT (x op y) jmp z

{ loop body }

jmp w

label z


Logic (if-then-else) := conditional/unconditional forward jumps

if( NOT (x op y ) ) jmp z

{ true block }

jmp endif

label z

{ false block }

label endif

Note: jumps are not L-attributed (inherited values) or synthesized values:

(L-attribute means on the parse tree to the left of where we need the value)

Backward jumps – label is on LH side of tree, but arbitrarily far

Forward jumps – label is on RH side of tree.

Therefore handling labels cannot be done in grammar, requires symbol table.

Implies => we can’t (easily) check correctness of IF-THEN-ELSE structure built

with jumps (so we dislike GOTOs!!). When we build structure using

jumps from IF-THEN-ELSE, we know it is correct.

Backpatching:

To handle things like addresses of forward jumps, we place marker when

we encounter the jump, fill it in on second pass after we find the labels

(error message if we don’t find the label).
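A minimal sketch of this two-pass scheme (our own illustration, over a toy instruction list) looks like:

```python
# Backpatching sketch: pass 1 records label addresses while emitting code
# with symbolic jump targets; pass 2 fills the targets in (or reports an
# error for an undefined label).
code = [
    ('jmp', 'endif'),    # forward jump: target unknown on the first pass
    ('op', 'x'),
    ('label', 'endif'),  # pseudo-instruction, occupies no output slot
    ('op', 'y'),
]

def backpatch(code):
    labels, out = {}, []
    for instr in code:                    # pass 1: strip labels, note addresses
        if instr[0] == 'label':
            labels[instr[1]] = len(out)
        else:
            out.append(instr)
    patched = []
    for op, arg in out:                   # pass 2: fill in jump targets
        if op == 'jmp':
            if arg not in labels:
                raise ValueError(f"undefined label {arg}")
            patched.append((op, labels[arg]))
        else:
            patched.append((op, arg))
    return patched

print(backpatch(code))  # [('jmp', 2), ('op', 'x'), ('op', 'y')]
```

Backward jumps could be resolved in the first pass, since their labels are already known when the jump is emitted; the second pass is what the forward jumps require.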

Analysis of larger code blocks: Expression has value as synthesized attribute

– what is value of an assignment statement? (C++ says it is a boolean, value T-F)

Values/types transferred from one statement to another via memory/symbol

table.

Problem: when can values change?

Example: the ++x, x++ operators in C/C++ cause a change during execution of a statement, which gets in the way of DAG representation/optimization.

We dislike side effects because they get in way of optimizations/ interpretation of

semantics: .

x = 5; y = 10;

z = y*(x + y++); // what is value of z?

// otherwise

z = (++y)*(x + y);

.

Change in the value of y is a side effect of the z assignment statements.

Questions: – When do we know that contents of a given memory location/register

is unchanged? (so we can use common expression elimination). –

3

When can we change statement/evaluation order without affecting meaning? (so

we can reorder things so can keep values in registers or cache).

Statements in a basic block may be analyzed together, since they will be executed


We will be covering material from :

Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman; “Compilers – Principles, Techniques and Tools”, Second Edition, Addison Wesley.

(This is the classic text on Compilers – it has been around in various versions

and updates since the 1970s and has been revised and updated, most recently in

the second edition in 2007. Unlike many other texts, it gets better in every edition.

It is often called “the Dragon book”, in part because it has a dragon on the

cover, but mainly because it is not a friendly book – you will find that there are

places where you have to study one page for an hour to understand a theorem or

algorithm. It feels like the dragon is eating your brain – it is worth it, however –

or it will feel that it is worth it after your brain has been rewired. You will love

the book, appreciate and understand it the third time you read it. And you will

come back to it – it is the compiler text that has shaped everybody in the field,

the standard reference for compilers. It belongs in the professional library of every

computer scientist and engineer.)

A tentative schedule:

Chapter 1 is an introduction, chapter 2 gives an outline of how compilers work,

using top-down parsing. We will be concentrating on bottom-up parsing in the class; this starts with chapter 3. We will study most of chapters 3, 4, 5 and then skip around the rest of the book – details will appear in lecture notes.

We will not cover everything in the book, and we will explore some topics that

are not in the book – these will appear in the notes. All material covered in class

may appear in examinations.

1. Motivation

Why study compilers? Few of us will ever be called upon to write or modify a

compiler. Here are some justifications for all the work involved.

<1> Pragmatism: Compilers are tools that we use to build all our applications. We

need to understand the properties and limitations of our tools.

<2> Art: To the right(?) kind of mind, compilers are beautiful!

<3> Science: We need to be able to express things in a language understandable to

us, then translate it to a language understandable to machines. The problem: how

do we make sure that the meaning is translated?

• Syntax: the textual rules for expressing things in a language.

• Semantics: the meaning conveyed by the expression.

• The syntax is fully visible, but the meaning can be context dependent, may

require additional knowledge to interpret.

• Syntax + (context) + (world knowledge) => Semantics.

• Ambiguity: same syntax can mean different things. English is ambiguous.

• “Time flies like an arrow.” – What does it mean? Consider: Is the subject “time” or “flies” (or “time flies”, a kind of fly)? Is the verb “time”, “flies”, or “like”?

• Early computer translation example: “Out of sight, out of mind.” => to Russian; back from Russian => “Invisible insanity.”


Two tracks: “the Dragon book”: Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman.

We revisit some concepts from lecture 1, then start on formal language defini-

tions.

1. Thoughts about human and computer languages

Human languages tend to be contextual and ambiguous.

Contextual: the meaning of anything we say must be interpreted taking into

account the context in which it is said. For example: if I say “Look out! A shark!”

in class, context tells you this is an example of something. If I yell the same thing

while we are swimming in the ocean, it is either an urgent warning or a bad practical

joke. Cultural context also plays a large part in interpretation of human language.

Inoffensive statements in some cultures can be insulting in others, even when the

same language is spoken in both – when I moved from Cuba to Puerto Rico (both

Spanish speaking countries) as a child, I found that some common everyday Cuban

words were at best impolite in Puerto Rican usage.

Ambiguous: the same phrase can be understood multiple ways. In this case, we

are talking about the same expression in the same context. Consider the signs on

the stairs in this building: IN and OUT. Does IN mean in to the building, or in to

the stairwell? Or consider: if cat food is fed to cats, and baby food to babies, what

is “cheese food”?

An interesting thing about human language is that even incorrect expressions

carry meaning: for example, when we are starting to learn a language we can make

ourselves understood by others before we have fully mastered the grammar. “I have

hungry” is not correct, but we would understand it to mean the speaker is hungry.

All the above makes human languages drastically unsuitable for communicating

with a computer. People have long tried to get a human language interface to

control computers (computers that accept voice commands are the latest variation

on this). What we really want, however, is a computer that does what we mean,

and in human language we can’t always tell what we mean by what we actually

say.

Our best solution to this problem to date is to design artificial languages for

computers, with properties different from human languages. We want the following

properties:

• Freedom from ambiguity: A statement in a computer language should have

only one meaning.

• No dependence on external context: The meaning of a statement should be

understandable given only the statement itself (or at least the program of

which it is a part), and the meaning should not change with circumstances.

To these we add two practical considerations:

• It should be possible to determine the meaning of a statement without doing

too much work or spending too much time. That is, the time complexity

of translation should be small.

• The syntactic correctness of a statement should be unambiguous. That is,

there should be no doubt, independent of meaning, whether a statement is

correct (in the language) or not.

As computers become more powerful, what was considered “too much work” in the

past may become practical.

1. Syntax-directed translation

Human languages communicate information through a combination of syntax,

meaning and context. Arguably meaning and context are more important, we can

communicate well with bad grammar. One of the consequences is that human

languages convey a much richer, but less precise meaning. Computer languages are

much more limited in what they communicate, all they do is specify what actions a

machine will perform to carry out some algorithm. We design computer languages

to be unambiguous – any string in the language should have a unique meaning.

We use context free grammars because this makes meaning local to the text we

are translating – an expression in the language should be translatable only using

information that appears next to it in the text. Technically, a context-free grammar

lets us guarantee that we can always decide if a text string is in the language or

not, and we can do this efficiently (The LR grammars and shift-reduce parsers we

have defined earlier can parse in linear time, other algorithms may take somewhat

longer (and allow dealing with extensions to context-free grammars) but we can

still guarantee worst-case polynomial time, usually better than n³).

2. Simple translation without optimization.

Consider our expression grammar from chapter 4, which might be appropriate

for a calculator – we here extend it to an assign statement. We add A (assign

statement) to the non-terminals, and the symbols v,= to the terminals. In this

context, istands for either a vatiable or a numeric value. We have annotated each

line with an expression in curly brackets that says what it does – the arithmetic

operators have their usual meaning, and we are adding two functions – the comment

field here describes what the functions do. We are using a pseudo-code to describe

what we do to translate the expression, but the actual code, like what we would

write in the curly brackets in Yacc, depends on what we are translating our

statements to – another language? binary code? functions of an interpreter?

2.1. Assign statement.

(1) A → v = E { store( E.value, v ) // if v is a variable, store a value to its

address in memory – else error }

(2) E1 → E + T { E1.value = E.value + T.value }

(3) E → T { E.value = T.value }

(4) T1 → T ∗ F { T1.value = T.value * F.value }

(5) T → F{ T.value = F.value }

(6) F → i { F.value = value_of( i ) // if i is a string, get numeric value – else

if i is a variable, get its value – else error}

(7) F → (E) { F.value = E.value }
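To make the semantic actions concrete, here is a minimal sketch – not the course's actual Yacc code – that runs the same actions from a hand-written recursive-descent evaluator instead of a shift-reduce parser. The functions store() and value_of() follow the comment fields in productions 1 and 6; the dict-based symbol table and the tokenizer are assumptions added for illustration.

```python
import re

symtab = {}  # assumed symbol table: variable name -> numeric value

def store(value, v):            # production 1: A -> v = E
    symtab[v] = value

def value_of(i):                # production 6: F -> i
    try:
        return float(i)         # i is a numeric string
    except ValueError:
        return symtab[i]        # i is a variable; KeyError if undeclared

def tokenize(s):
    return re.findall(r'[A-Za-z_][A-Za-z0-9_$]*|\d+\.?\d*|[-+*/()=]', s)

def parse_assign(tokens):       # production 1: A -> v = E
    v, eq = tokens.pop(0), tokens.pop(0)
    assert eq == '='
    store(parse_E(tokens), v)

def parse_E(tokens):            # productions 2, 3: E -> E + T | T
    value = parse_T(tokens)
    while tokens and tokens[0] == '+':
        tokens.pop(0)
        value = value + parse_T(tokens)
    return value

def parse_T(tokens):            # productions 4, 5: T -> T * F | F
    value = parse_F(tokens)
    while tokens and tokens[0] == '*':
        tokens.pop(0)
        value = value * parse_F(tokens)   # note *, not +
    return value

def parse_F(tokens):            # productions 6, 7: F -> i | ( E )
    if tokens[0] == '(':
        tokens.pop(0)
        value = parse_E(tokens)
        tokens.pop(0)                     # discard ')'
        return value
    return value_of(tokens.pop(0))

parse_assign(tokenize("x = 2 + 3 * 4"))
print(symtab["x"])   # 14.0
```

Each parse function returns the .value attribute of its non-terminal, so the attribute flow is exactly the bottom-up flow the annotations describe: values are synthesized at the leaves (production 6) and combined as reductions happen.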

Note that we have added something that is not a syntactic property: in productions

1 and 6, the term “variable”. We can define the syntax for what a variable should

look like (a string that begins with a letter, followed by more letters, digits, special

characters like $ and _ which ends just before a space, line end, punctuation or

operator which are not allowed in the name ). In modern languages, however, such

a name is only a variable if it has been declared.
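The variable-name syntax described above can be captured by a single regular expression. This is a sketch under the assumptions stated in the text (leading letter, then letters, digits, '$' or '_', ending at the first disallowed character); a real lexer generator would compile an equivalent automaton.

```python
import re

# Leading letter, then any run of letters, digits, '$' or '_'.
# A match ends at the first space, operator, or punctuation character.
NAME = re.compile(r'[A-Za-z][A-Za-z0-9$_]*')

m = NAME.match("total_2$ + 5")
print(m.group(0))   # total_2$
```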
