1 Consider the following grammar (describing LISP arithmetic):

X -> ( E )

E -> O | O T

O -> + | * | - | /

T -> n | X

X == executable, E == expression, T == term, n == number

terminals == ( ) n + * - /

Find FIRST, FOLLOW and LR(0) sets for this grammar.

Is the grammar LR(0)? Is it SLR?

2. Give a rightmost derivation of the string (x+a)*x using:

S -> E

E -> E+T | T

T -> T*F | F

F -> i | (E)

The lexical analyzer returns a token i==identifier for variables ‘x’ and ‘a’.

Display the parse tree, the syntax tree, and the expression DAG.

3. The algorithm for DOM in the text is based on data flow analysis, but it is often desirable to find the DOM tree from the control flow graph without the need to do data flow. Describe a possible algorithm based on breadth-first search to find DOM given a control flow graph. (An overview description in English is sufficient; you do not need a formal specification or code of an algorithm.)

Dr. Ernesto Gomez : CSE 570/670

The structure of a computer language translator

1. Overview and practical issues

Recalling Lecture 1 – we are going to define a computer language using a formal

description because we want to be able to decide in an unambiguous way if input

text is in the language or not. (This is a big part of the reason for the invention

of formal systems – mathematicians got into arguments about whether something

was a proof of a theorem or not. Formal systems came into being as definitions

of the rules of reasoning that are acceptable for proving stuff, so all could agree

that if the proof followed the rules, it was a correct proof). When we do this,

we split the translation problem into two hopefully simpler chunks: First, decide

if the text follows the syntax rules. If it doesn’t, we reject it as not being in the

language, rather than trying to guess what the programmer really wanted to say.

Once we know some string is in the language, then we decide what it means (the

semantics). Language theorists have several ways of defining meaning, but from

the point of view of compilers, we need to translate the input text into actions by

the computer. We therefore use operational semantics – the text means what the

computer is required to do.

How should we implement these concepts? We have seen that we have a range

of options, from a traditional interpreter to a compiler that outputs machine code,

with a whole range of possibilities in between. There are advantages and disadvan-

tages to every design choice, and as we have seen, there are real world versions of

every one of them.

Consider first the traditional compiler: we want it to translate from source text

to machine language. We could just build a large, monolithic program that incor-

porates both the input in a computer language and the machine executable output.

The problem here is, both the input language and the machine environment are mov-

ing targets. Languages are revised periodically, sometimes drastically. For example,

there have been five official revisions of the C++ standard, in 1998, 2003, 2011,

2014 and 2017, and a new revision is due this year. But this is not the whole story

– C++ started in 1985, and a lot changed before the International Organization for

Standardization (ISO) defined the 1998 standard (for example, templates were not

in the original language).

The runtime environment also changes. Early compilers (COBOL, FORTRAN,

others) could build on the assumption that the generated code would have the ma-

chine to itself, but modern compilers need to target both the machine architecture

and the operating system that runs on it. Both of these change quickly with time,

and are not unique – at any given time we may need to generate code for multiple

versions of multiple architectures (different versions of Intel, AMD and ARM pro-

cessors just to cover the most basic types), and operating systems (multiple versions

of Windows, Linux, Unix, iOS, and others).

1. Notes on Code generation and analysis (see Ch. 5, 8, 9, Aho and Ullman)

What to generate depends on source language, architecture of target machine or

runtime system.

Common intermediate codes: 3 address, high-level ( Lisp/Scheme, C, Modula 2)

parse tree/CFG. I-code is still a research topic.

Three-address code.

Statements of form x := y op z

types:

assignment

binary op x := y op z => OP x y z

unary op x := op y => OP x y

copy – x := y => MOV x y

indexed assignment

x[i] := y => MOVI x y offset(i)

x := y[i] (i is offset on address of x | y)

pointer/address assignment

*x := y => MOV contents(address(x)) y

x := *y => MOV x contents(address(y))

x := &y => MOV x address(y)

jumps

unconditional jump: GOTO label(x) => JMP x

conditional jump: IF x op y GOTO z => COND x y z

calls

param x => PAR x

..

call label(x) number-parm(y) => GOSUB x y

..

return value(x) => RET x

Loops := conditional backward jumps

// do {} while

label z

{ loop body }

if(x op y) jmp z

or

// while{}

label w

if NOT (x op y) jmp z

{ loop body }

jmp w

label z
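The loop templates above can be exercised with a small code-emitter sketch. The `emit`/`new_label` helpers and the `ifFalse` mnemonic are invented for this illustration, not taken from the text:

```python
# Sketch: lower `while (cond) body` to three-address code using the template
#   label w; if NOT cond jmp z; { body }; jmp w; label z
code = []

def emit(instr):
    code.append(instr)

def new_label(counter=[0]):
    counter[0] += 1
    return f"L{counter[0]}"

def gen_while(cond, body):
    w, z = new_label(), new_label()
    emit(f"label {w}")
    emit(f"ifFalse {cond} jmp {z}")   # conditional forward jump
    for stmt in body:
        emit(stmt)
    emit(f"jmp {w}")                  # unconditional backward jump
    emit(f"label {z}")

gen_while("i < n", ["t1 := i + 1", "i := t1"])
print("\n".join(code))
```

Note that the forward jump target `z` is known here only because we generate both labels before the body; a single pass over source text would not be so lucky, which is what motivates backpatching below.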


Logic (if-then-else) := conditional/unconditional forward jumps

if( NOT (x op y ) ) jmp z

{ true block }

jmp endif

label z

{ false block }

label endif

Note: jumps are not L-attributed (inherited values) or synthesized values:

(L-attribute means on the parse tree to the left of where we need the value)

Backward jumps – label is on LH side of tree, but arbitrarily far

Forward jumps – label is on RH side of tree.

Therefore handling labels cannot be done in the grammar; it requires a symbol table.

Implies => we can’t (easily) check correctness of IF-THEN-ELSE structure built

with jumps (so we dislike GOTOs!!). When we build structure using

jumps from IF-THEN-ELSE, we know it is correct.

Backpatching:

To handle things like addresses of forward jumps, we place marker when

we encounter the jump, fill it in on second pass after we find the labels

(error message if we don’t find the label).
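A minimal sketch of backpatching over a flat instruction list; the helper names (`emit_jump`, `define_label`) and list encoding are invented for this sketch:

```python
instrs = []   # each jump is ["jmp", target_index_or_None]
fixups = {}   # label name -> indices of jumps waiting for that label
labels = {}   # label name -> instruction index

def emit(op, arg=None):
    instrs.append([op, arg])

def emit_jump(label):
    if label in labels:                  # backward jump: target already known
        emit("jmp", labels[label])
    else:                                # forward jump: leave a placeholder
        fixups.setdefault(label, []).append(len(instrs))
        emit("jmp", None)

def define_label(label):
    labels[label] = len(instrs)
    for i in fixups.pop(label, []):      # fill in the recorded placeholders
        instrs[i][1] = labels[label]

emit_jump("end")       # forward jump, address unknown at this point
emit("body")
define_label("end")    # patches the placeholder in instrs[0]
emit("done")
```

Any entries left in `fixups` at the end of translation are exactly the undefined labels that should produce an error message.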

Analysis of larger code blocks: Expression has value as synthesized attribute

– what is the value of an assignment statement? (In C/C++ it is the value assigned, usable as a truth value.)

Values/types transferred from one statement to another via memory/symbol

table.

Problem: when can values change?

Example: the ++x, x++ operators in C/C++ cause a change during execution of a

statement, gets in the way of DAG representation/optimization

We dislike side effects because they get in the way of optimizations / interpretation of

semantics:

x = 5; y = 10;

z = y*(x + y++); // what is value of z?

// otherwise

z = (++y)*(x + y);


Change in the value of y is a side effect of the z assignment statements.
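The ambiguity can be made concrete by writing out both readings explicitly (Python has no ++, so the increment is spelled out by hand):

```python
# Reading 1: every y in the expression is read before the post-increment.
x, y = 5, 10
z_post = y * (x + y)        # 10 * (5 + 10)
y = y + 1                   # side effect lands after the expression

# Reading 2: the increment happens first (the ++y version).
x, y = 5, 10
y = y + 1
z_pre = y * (x + y)         # 11 * (5 + 11)

print(z_post, z_pre)        # 150 vs 176: same expression text, different values
```

Because the two readings disagree, an optimizer cannot safely treat the two occurrences of y as a common subexpression, which is exactly why side effects interfere with DAG representation.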

Questions: – When do we know that contents of a given memory location/register

is unchanged? (so we can use common subexpression elimination). –


When can we change statement/evaluation order without affecting meaning? (so

we can reorder things so we can keep values in registers or cache).

Statements in a basic block may be analyzed together, since they will be executed in sequence as a unit.

Dr. Ernesto Gomez : CSE 570/670

We now consider parsing algorithms. This material is in chapter 4

1. An LR parser

Having developed algorithms for FIRST and FOLLOW sets, we have seen how to

construct LR(0) sets with CLOSURE and GOTO functions, such that our LR(0) sets

are states, and the GOTO functions are transitions between states which occur

when we “read” a specific symbol, as in our finite automata. We extend the meaning

of “read” to mean “PUSH a symbol to stack top”, this happens when we move the

first symbol in the (unprocessed) input text and push it onto the stack, or when

we pop symbols from the stack corresponding to a handle (the right-hand side of a

production) and PUSH the left-hand symbol on the production on the stack. The

first case (a standard read action) gets a terminal symbol on the stack, the second

case gets a non-terminal symbol on the stack. Terminal characters are handled just

like we would in a finite automaton, non-terminals are pushed on the stack when

we reach a state which has an item of the form A → α· ; an item of this form

means we have finished processing a handle that corresponds to α and it is on stack

top, and that we use the production A → α to POP α and PUSH A.
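The shift/reduce cycle just described can be written as a small table-driven loop. This sketch uses a deliberately tiny grammar, S → (S) | x, with LR(0) tables worked out by hand for the illustration, rather than the expression grammar below:

```python
# Hand-built LR(0) states for S' -> S, S -> ( S ) | x:
#  0: start   1: accept   2: after '('   3: reduce S->x
#  4: after '(S'          5: reduce S->(S)
SHIFT = {0: {"(": 2, "x": 3}, 2: {"(": 2, "x": 3}, 4: {")": 5}}
REDUCE = {3: ("S", 1), 5: ("S", 3)}        # state -> (lhs, handle length)
GOTO = {(0, "S"): 1, (2, "S"): 4}

def parse(tokens):
    stack = [0]                            # stack of states
    i = 0
    while True:
        state, a = stack[-1], tokens[i]
        if state == 1 and a == "$":
            return True                    # accept
        if state in REDUCE:                # POP the handle, PUSH via GOTO
            lhs, n = REDUCE[state]
            del stack[-n:]
            stack.append(GOTO[(stack[-1], lhs)])
        elif a in SHIFT.get(state, {}):    # shift: read a terminal onto the stack
            stack.append(SHIFT[state][a])
            i += 1
        else:
            return False                   # no action defined: reject

print(parse(list("((x))") + ["$"]))   # True
print(parse(list("(x") + ["$"]))      # False
```

In LR(0) a reduce state reduces regardless of the lookahead, which is why `REDUCE` is keyed by state alone; SLR and LR(1) tables would consult the next symbol as well.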

We will continue to use our expression grammar for examples,

S → E

E → E + T | E → T

T → T ∗ F | T → F

F → id | F → num | F → (E)

with N = {S,E,T,F}, T = {+,∗,(,), id, num}

We will also make reference to figure 4.31, page 244 of the text, which gives an

LR(0) automaton for this grammar, built using the algorithms in the previous set

of notes and in section 4.6.2 in Aho and Ullman.

1.1. Parsing with a shift-reduce automaton. Our automaton is table driven,

much like the Deterministic Finite Automata you have worked on previously in

Lab 1. The difference is that our control table is divided into 2 sections, called the

Action table and the Goto table (these names are somewhat confusing, for example

the Goto table is not identical to the GOTO function – also note that since the

rows are the same for both tables, they are usually written as two sections of the

same control table).

The states of the parser are given by the LR sets (in our example, these are LR(0)

sets, but the same method works for LR(k), the only difference being the parse

tables).

Rows of our control table are numbered with the state numbers we assigned

to the LR sets when we constructed them. The numbering depends on the order in

which we generate the sets, and makes no difference to the parse function, the only

fixed thing is that the start state – state 0, in row 0 – is generated from a single

item S → .α corresponding to the start production.

Columns of the Action section of the control table are labelled with all the

characters ∈ T , and there is an added column for the symbol $ which denotes end

of input text (we have seen this convention when we generated the FOLLOW sets).

Columns of the Goto section are labelled with all the symbols ∈ N. Notice that

every combination (state, X ∈ N ∪ T) indexes one entry in the control table.
Dr. Ernesto Gomez : CSE 570/670

1. Miscellaneous updates

These will be incorporated into Lecture Notes 1-4, where these topics are covered

1.1. Top down parsing. I have just found a reference on modern top-down pars-

ing methods : https://www.sanity.io/blog/why-we-wrote-yet-another-parser-compiler.

Have not had a chance to review in detail, but at first glance this looks good – it

is also very new, the site went up in December 2019. It links to a selection of the

papers where the techniques are developed, and it introduces a parser generator

equivalent to Yacc for topdown parsing. This material will probably be covered in

class next year, but you can start looking at it now – If your work in the future

involves compilers or any large application that incorporates a language, you will

need to know about this.

1.2. Non-determinism and Finite Automata (this text has been incorporated

in Lecture Notes 3). In Lecture Notes 3, we suggested an approach to

constructing a deterministic finite automaton from a non-deterministic one using

depth-first search and backtracking, with a stack. The text book constructs (chap-

ter 3) a detailed algorithm, first for converting a non-deterministic finite automaton

(NFA) to a deterministic one (DFA), and then another algorithm to minimize the

resulting DFA to the smallest possible DFA that accepts the same language. We then

go full circle by converting the DFA to a regular expression – this whole sequence

serves as a proof that Regular Expressions (RE) are equivalent to NFA, and that

NFA are equivalent to DFA. That is, RE, DFA and NFA can recognize the same

set of languages.

Further, the ability to minimize a DFA gives us the ability to determine if any

pair of language definitions (RE, DFA or NFA) accept the same language – if two

RE, or two FA minimize to the same DFA, then they accept the same language –

that is, they mean the same thing.

All of this is of great theoretical interest but very little practical interest. The

problem is, the algorithm to convert NFA to DFA has exponential complexity.

Therefore the conversion can only be done for small languages and automata – the

amount of work required for a more complex automaton or expression grows too fast.

(The same applies to the approach we suggested in Notes 3 – it is a

backtracking algorithm, and such algorithms also have exponential complexity).
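For reference, the subset construction itself is short; the exponential cost lies in how many subset-states can appear. This sketch assumes an NFA without ε-moves, and the example NFA (strings over {a, b} whose second-to-last symbol is 'a') is invented here:

```python
from collections import deque

def subset_construction(nfa, start, accepting, alphabet):
    """nfa: dict mapping (state, symbol) -> set of successor states."""
    start_set = frozenset([start])
    dfa, worklist, seen = {}, deque([start_set]), {start_set}
    while worklist:
        S = worklist.popleft()
        for c in alphabet:
            # the DFA state reached from subset S on symbol c
            T = frozenset(q2 for q in S for q2 in nfa.get((q, c), ()))
            dfa[(S, c)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    dfa_accept = {S for S in seen if S & accepting}
    return dfa, start_set, dfa_accept

# NFA: q0 loops on a,b; q0 --a--> q1; q1 --a,b--> q2 (accepting)
nfa = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"},
       ("q1", "a"): {"q2"}, ("q1", "b"): {"q2"}}
dfa, start, accept = subset_construction(nfa, "q0", {"q2"}, "ab")
print(len({s for (s, _) in dfa}), "DFA states")   # 4 for this 3-state NFA
```

This 3-state NFA yields only 4 DFA states, but the family "k-th-from-last symbol is 'a'" forces roughly 2^k DFA states as k grows, which is the exponential blowup the notes refer to.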

1.3. Lex and Yacc. The front end of the compiler uses a lexical analyzer,

built on top of finite automata,

defined by regular expressions. An example of this is described in 5.7, Lecture Notes

3. You can find a RE that recognizes the format of integers, so it can read text and

pick out integers, get their value, and report this to the parser. We do this

because that means we can define something like: “addition → number+number”

and we don’t have to define the details of the number format when we define what

addition looks like.

Dr. Ernesto Gomez : CSE 570/670

1. A note on parsing strategy

Suppose we have a set of TERMINALS T= { terminals – things that compose

text in the language }. T stands for “Terminals”, they are a more general form of

the definition of alphabet Σ. The alphabet is a finite list of symbols. The set of

terminals is also a finite list, but the terminals themselves don’t have to be finite.

Imagine we want to describe the syntax for adding a list of numbers. It might look

something like:

ADD_LIST = { RESULT = SUMS, where SUMS = NUMBER or SUMS =

SUMS + NUMBER }

(In Chapter 4 we will call this kind of thing a context-free grammar).

T= {NUMBER, =, +}. Why not SUMS and RESULT? We can see that these

two words are placeholders (we will be calling them NON-TERMINALS later),

when we actually express a sum, it will look like: NUMBER = NUMBER + …

+ NUMBER. SUMS and RESULT never appear in the actual sum, all we have are

numbers, and the symbols “+” and “=”.

So : if we want to express a sum, we could start from RESULT = SUMS, then

expand SUMS into SUMS+NUMBER. We can keep on doing this to SUMS as long

as we want, then when we want to stop we use the rule SUMS = NUMBER. Then

we can add everything, and replace RESULT with whatever we have added and we

end up with NUMBER = list of numbers separated by + signs. (do we need a rule

RESULT = NUMBER?). Anyway, once we have replaced everything with numbers

and symbols + and =, we stop – there are no rules for changing a number or {+,=}

into anything else. That is why we call them “terminals” – they end (terminate) the

sequence of converting our syntactic definition to a particular sum.
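Written out step by step, the expansion just described looks like this for a sum of three numbers:

```
RESULT = SUMS
RESULT = SUMS + NUMBER              (using SUMS = SUMS + NUMBER)
RESULT = SUMS + NUMBER + NUMBER     (using SUMS = SUMS + NUMBER)
RESULT = NUMBER + NUMBER + NUMBER   (using SUMS = NUMBER)
```

At the last line only terminals remain on the right-hand side, so the expansion stops.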

(What, then, are the terms RESULT and SUMS? They don’t appear in the final

expression, we have rules that allow us to change them into something else – they are

not “terminals” so we call them “non-terminals” (sometimes we can be very literal

in our naming conventions!). Now, what are we to make of the term NUMBER?

The word NUMBER appears in our description for the sum of a list of numbers

– but when we actually write such a sum down, we will replace each instance of

NUMBER with actual numbers – NUMBER=NUMBER+NUMBER would actually

be something like 42=29+13. So NUMBER is a terminal, but it is not a fixed

symbol – rather it describes a pattern that allows us to generate the actual text.

For example, NUMBER = (−|ε)(1 . . . 9)(0 . . . 9)∗ | 0, the regular expression we used

before to describe what an integer looks like. When we said (in lecture notes 2)

that we were going to use multiple machine types in translation, this is the kind of

thing we meant. )

We will use finite automata to describe simple patterns that we will then use to

simplify higher-level definitions; this is in Chapter 3 of our text.
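As a quick check, the integer pattern above can be tried with Python's `re` module standing in for the finite automaton:

```python
import re

# (−|ε)(1...9)(0...9)* | 0  from the note above, as a Python regex
INTEGER = re.compile(r"-?[1-9][0-9]*|0")

for text in ["42", "-7", "0", "007", "x1"]:
    m = INTEGER.fullmatch(text)
    print(text, "matches" if m else "no match")
```

Note that "007" is rejected: the first alternative requires a leading nonzero digit, and the second matches only the single string "0", exactly as the pattern specifies.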

1.1. Derivation. What we have done: ADD_LIST is a set of rules that describe

what adding a list of numbers looks like – it is a formal language definition that

describes which strings in (numbers, + and = signs)∗ are actually in the language

ADD_LIST.

Dr. Ernesto Gomez : CSE 570/670

We now consider parsing algorithms. This material is in chapter 4

1. Support algorithms : First and Follow

Recall that parsing is the application of grammar definitions in reverse. Take

our example grammar for arithmetic expressions:

S → E

E → E + T | E → T

T → T ∗ F | T → F

F → id | F → num | F → (E)

with N = {S,E,T,F}, T = {+,∗,(,), id, num}

(Same grammar, with added terminal “num” and using “or” for a more compact

representation).

In order to parse, we need to have some idea of what different grammar rules can

generate, so we can make decisions on what text is a candidate for replacement by

the left-hand side of a rule. (In the terminology we defined in the previous lecture

notes, we need to identify the “handle” – which is in (N ∪ T)∗.)

FIRST sets tell us, given any symbol in (N ∪ T), what is the first terminal that

can be produced by that symbol, FOLLOW tells us what can appear immediately

after each symbol in N.


2. The following material gives a slight variation on the algorithms

presented in 4.4.2

3. First Sets:

To generate: FIRST(X) for all X ∈ N ∪T.

(1) If X ∈ T , FIRST(X) = {X}.

(2) If X → ε is a production, add ε to FIRST(X).

(3) If X → Y1Y2 . . . Yi . . . Yk is a production: Base case i = 1. Induction:

add everything in FIRST(Yi) except for ε to FIRST(X), then if ε is in

FIRST(Yi) increment i and repeat; else stop. If i > k add ε to FIRST(X)

and stop. Repeat for all productions.

(4) Repeat step 3 until there is no change to any of the FIRST sets.

4. Follow Sets:

To generate: FOLLOW(X) for all X ∈ N.

Add end marker $ ∉ N ∪ T to the symbol set.

(1) Place $ in FOLLOW(S), where S is the start symbol in G.

(2) If A → αBβ is a production, add everything in FIRST(β) except ε to

FOLLOW(B). Repeat for every production and every variable that is not

at the end of the production

(3) If A → αBβ and ε is in FIRST(β), or A → αB is a production, add

everything in FOLLOW(A) to FOLLOW(B).

(4) Repeat step 3 for every non-terminal in every production until nothing new

is added to any of the FOLLOW sets.
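The two fixed-point computations above can be sketched directly in Python for our expression grammar. Here `eps` stands for ε and `$` for the end marker; this is an illustration of the algorithm, not production code:

```python
EPS, END = "eps", "$"
G = {  # nonterminal -> list of right-hand sides
    "S": [["E"]],
    "E": [["E", "+", "T"], ["T"]],
    "T": [["T", "*", "F"], ["F"]],
    "F": [["id"], ["num"], ["(", "E", ")"]],
}
N = set(G)
TERM = {"+", "*", "(", ")", "id", "num"}

FIRST = {x: {x} for x in TERM}          # rule 1: FIRST of a terminal is itself
FIRST.update({A: set() for A in N})
changed = True
while changed:                          # rule 4: repeat until no change
    changed = False
    for A, rhss in G.items():
        for rhs in rhss:
            add = set()
            for Y in rhs:               # rule 3: walk Y1 Y2 ... while ε possible
                add |= FIRST[Y] - {EPS}
                if EPS not in FIRST[Y]:
                    break
            else:
                add.add(EPS)            # every Yi was nullable
            if not add <= FIRST[A]:
                FIRST[A] |= add
                changed = True

FOLLOW = {A: set() for A in N}
FOLLOW["S"].add(END)                    # step 1: $ into FOLLOW(start)
changed = True
while changed:                          # step 4: repeat until no change
    changed = False
    for A, rhss in G.items():
        for rhs in rhss:
            for i, B in enumerate(rhs):
                if B not in N:
                    continue
                beta, first_beta, eps_all = rhs[i + 1:], set(), True
                for Y in beta:          # step 2: FIRST(beta) minus ε
                    first_beta |= FIRST[Y] - {EPS}
                    if EPS not in FIRST[Y]:
                        eps_all = False
                        break
                add = first_beta | (FOLLOW[A] if eps_all else set())
                if not add <= FOLLOW[B]:   # step 3 when beta can vanish
                    FOLLOW[B] |= add
                    changed = True

print(FIRST["E"], FOLLOW["E"])
```

For this grammar the loops converge quickly: FIRST(E) = {id, num, (} and FOLLOW(E) = {+, ), $}, which you can confirm by hand against the rules above.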


5. LR parsing automata

We here discuss material from sections 4.5 to 4.7 in Aho and Ullman, on bottom-

up parsers using LR methods.

We are not going into recursive descent top-down methods (section 4.4) because

they only work for a strict subset of the languages that can be parsed by the LR(k)

bottom-up methods. The top-down method described in

https://www.sanity.io/blog/why-we-wrote-yet-another-parser-compiler

is more general than LR(k), it is claimed to be able to parse all the context-

free languages, whereas LR(k) can parse a subset of the deterministic context free

grammars. As developed so far, the new top-down methods and tools work sub-

stantially more slowly than LR(k) – both in complexity which is worst-case cubic,

and in speed on equivalent languages. At present, the bottom-up methods that we

will study are more practical.

Dr. Ernesto Gomez : CSE 570/670

We revisit some concepts from lecture 1, then start on formal language defini-

tions.

1. Thoughts about human and computer languages

Human languages tend to be contextual and ambiguous.

Contextual: the meaning of anything we say must be interpreted taking into

account the context in which it is said. For example: if I say “Look out! A shark!”

in class, context tells you this is an example of something. If I yell the same thing

while we are swimming in the ocean, it is either an urgent warning or a bad practical

joke. Cultural context also plays a large part in interpretation of human language.

Inoffensive statements in some cultures can be insulting in others, even when the

same language is spoken in both – when I moved from Cuba to Puerto Rico (both

Spanish speaking countries) as a child, I found that some common everyday Cuban

words were at best impolite in Puerto Rican usage.

Ambiguous: the same phrase can be understood multiple ways. In this case, we

are talking about the same expression in the same context. Consider the signs on

the stairs in this building: IN and OUT. Does IN mean in to the building, or in to

the stairwell? Or consider: if cat food is fed to cats, and baby food to babies, what

is “cheese food”?

An interesting thing about human language is that even incorrect expressions

carry meaning: for example, when we are starting to learn a language we can make

ourselves understood by others before we have fully mastered the grammar. “I have

hungry” is not correct, but we would understand it to mean the speaker is hungry.

All the above makes human languages drastically unsuitable for communicating

with a computer. People have long tried to get a human language interface to

control computers (computers that accept voice commands are the latest variation

on this). What we really want, however, is a computer that does what we mean,

and in human language we can’t always tell what we mean by what we actually

say.

Our best solution to this problem to date is to design artificial languages for

computers, with properties different from human languages. We want the following

properties:

• Freedom from ambiguity: A statement in a computer language should have

only one meaning.

• No dependence on external context: The meaning of a statement should be

understandable given only the statement itself (or at least the program of

which it is a part), and the meaning should not change with circumstances.

To these we add two practical considerations:

• It should be possible to determine the meaning of a statement without doing

too much work or spending too much time. That is, the time complexity

of translation should be small.

• The syntactic correctness of a statement should be unambiguous. That is,

there should be no doubt, independent of meaning, whether a statement is

correct (in the language) or not.

As computers become more powerful, what was considered “too much work” in the

past may become practical.

Dr. Ernesto Gomez : CSE 570/670

We will be covering material from :

Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman; “Com-

pilers – Principles, Techniques and Tools”,Second Edition, Addison Wes-

ley

(This is the classic text on Compilers – it has been around in various versions

and updates since the 1970s and has been revised and updated, most recently in

the second edition in 2007. Unlike many other texts, it gets better in every edition.

It is often called “the Dragon book”, in part because it has a dragon on the

cover, but mainly because it is not a friendly book – you will find that there are

places where you have to study one page for an hour to understand a theorem or

algorithm. It feels like the dragon is eating your brain – it is worth it, however –

or it will feel that it is worth it after your brain has been rewired. You will love

the book, appreciate and understand it the third time you read it. And you will

come back to it – it is the compiler text that has shaped everybody in the field,

the standard reference for compilers. It belongs in the professional library of every

computer scientist and engineer.)

A tentative schedule:

Chapter 1 is an introduction, chapter 2 gives an outline of how compilers work,

using top-down parsing. We will be concentrating on bottom-up parsing in the

class, this starts with chapter 3. We will study most of chapters 3, 4, 5 and then skip

around the rest of the book – details will appear in lecture notes.

We will not cover everything in the book, and we will explore some topics that

are not in the book – these will appear in the notes. All material covered in class

may appear in examinations.

1. Motivation

Why study compilers? Few of us will ever be called upon to write or modify a

compiler. Here are some justifications for all the work involved.

<1> Pragmatism: Compilers are tools that we use to build all our applications. We

need to understand the properties and limitations of our tools.

<2> Art: To the right(?) kind of mind, compilers are beautiful!

<3> Science: We need to be able to express things in a language understandable to

us, then translate it to a language understandable to machines. The problem: how

do we make sure that the meaning is translated?

• Syntax: the textual rules for expressing things in a language.

• Semantics: the meaning conveyed by the expression.

• The syntax is fully visible, but the meaning can be context dependent, may

require additional knowledge to interpret.

• Syntax + (context) + (world knowledge) => Semantics.

• Ambiguity: same syntax can mean different things. English is ambiguous.

• “Time flies like an arrow.” – What does it mean? Consider: Is the subject

“time” or “flies” (or “time flies”, a kind of fly)? Is the verb “time”, “flies”,

or “like”?

• Early computer translation example: “Out of sight, out of mind.” => to

Russian; back from Russian => “Invisible insanity.”


Two tracks: “the Dragon book”: Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman.

Dr. Ernesto Gomez : CSE 570/670

1. Syntax-directed translation

Human languages communicate information through a combination of syntax,

meaning and context. Arguably meaning and context are more important, we can

communicate well with bad grammar. One of the consequences is that human

languages convey a much richer, but less precise meaning. Computer languages are

much more limited in what they communicate, all they do is specify what actions a

machine will perform to carry out some algorithm. We design computer languages

to be unambiguous – any string in the language should have a unique meaning.

We use context free grammars because this makes meaning local to the text we

are translating – an expression in the language should be translatable only using

information that appears next to it in the text. Technically, a context-free grammar

lets us guarantee that we can always decide if a text string is in the language or

not, and we can do this efficiently (The LR grammars and shift-reduce parsers we

have defined earlier can parse in linear time, other algorithms may take somewhat

longer (and allow dealing with extensions to context-free grammars) but we can

still guarantee worst-case polynomial time, usually better than n³).

2. Simple translation without optimization.

Consider our expression grammar from chapter 4, which might be appropriate

for a calculator – we here extend it to an assign statement. We add A (assign

statement) to the non-terminals, and the symbols v,= to the terminals. In this

context, i stands for either a variable or a numeric value. We have annotated each

line with an expression in curly brackets that says what it does – the arithmetic

operators have their usual meaning, and we are adding two functions – the comment

field here describes what the functions do. We are using a pseudo-code to describe

what we do to translate the expression, but the actual code, like what we would

write in the curly brackets in Yacc, depends on what we are translating our

statements to – another language? binary code? functions of an interpreter?

2.1. Assign statement.

(1) A → v = E { store( E.value, v ) // if v is a variable, store a value to its

address in memory – else error }

(2) E1 → E + T { E1.value = E.value + T.value }

(3) E → T { E.value = T.value }

(4) T1 → T ∗F { T1.value = T.value ∗ F.value }

(5) T → F{ T.value = F.value }

(6) F → i { F.value = value_of( i ) // if i is a string, get numeric value – else

if i is a variable, get its value – else error}

(7) F → (E) { F.value = E.value }
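The semantic actions above can be run directly as a tiny interpreter over a nested-tuple parse; `memory`, `value_of`, and the tuple encoding are invented for this sketch:

```python
memory = {}

def value_of(i):
    """Production 6: numeric text gives its value, a known variable its stored value."""
    if isinstance(i, str) and i.lstrip("-").isdigit():
        return int(i)
    if i in memory:
        return memory[i]
    raise NameError(f"undefined variable {i}")

def eval_E(e):
    """E.value for an expression given as ('+', l, r), ('*', l, r), or a leaf i."""
    if isinstance(e, tuple):
        op, l, r = e
        lv, rv = eval_E(l), eval_E(r)
        return lv + rv if op == "+" else lv * rv   # productions 2 and 4
    return value_of(e)                             # productions 6 and 7

def assign(v, e):
    memory[v] = eval_E(e)                          # production 1: store(E.value, v)

assign("a", ("+", "2", "3"))               # a = 2 + 3
assign("b", ("*", "a", ("+", "a", "1")))   # b = a * (a + 1)
print(memory["a"], memory["b"])            # 5 30
```

In Yacc the same actions would sit in the curly brackets of each production and run at reduce time; here the recursion over tuples plays the role of the parser's reductions.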

Note that we have added something that is not a syntactic property, in productions

1 and 6, the term “variable”. We can define the syntax for what a variable should

look like (a string that begins with a letter, followed by more letters, digits, special

characters like $ and _ which ends just before a space, line end, punctuation or

operator which are not allowed in the name ). In modern languages, however, such

a name is only a variable if it ha���������

���

����������� �

������������

�! #”%$’&)(+*,(+*+-/.�01&)23$54563798:(1*+;+$'(5;16=<
>#[email protected]:C�DEA:C3F:?=A:F:GIHKJLGMC=NOJLP�QRG�S�P!CTJUG�VWJL?=AXD’A:C=NOA:@YQ3ZEF:?=P:?=H

[ P:CTJLG�VWJU?=A:D]^JU_3G�@BG`A:C3ZEC3F�P:a�AXCWbTJU_=ZcC3FedfG�[email protected]%?=HiJBQRGjZcCTJUG`kUl=kUGMJUG`NmJLA:nTZEC3FhZcCTJLP

A!SMS�P!?3CTJfJL_3G�S�P!CTJUG�VWJoZcC�dp_3ZqSr_jZsJ#ZqHpHUA:ZEN ut P:kpGMVvA:@Bl3DcG!uZsauwxHLAgbYyUz’PWP:n{P!?vJ`|~}HU_=AXkLn�| €

ZEC�S�DqA:HLHM3SMP:CTJUGMVTJoJLGMDEDEHpb!P:?OJU_3ZqHpZqH#AXCjG�V3A:@�l=DcG‚P:aƒHUP:@BG�JL_3ZcC=F w„a~wfb!GMDED�JU_3G�[email protected]�JU_=ZcC3F

[email protected]@BZEC3F)ZEC‚JU_=GfPvS�GIAXC’gZcJƒZqH1GMZcJU_3G`kƒA:C‚?3kUF!GMCTJ1doAXkLC3ZEC3F#P!k1A#Q�A:N‚l3krA:S�JUZqSMA:D

… P!n:G [ ?3DcJU?3krAXD’SMP:CTJUGMVWJoA:DEHUP%l=DEAgbvHxAYDqAXkLF:G�l=AXkUJoZcCjZcCTJLGMkLl3kUGMJLAXJUZEP:C�P:[email protected]{AXCODEA:C3F:?=A:F:G

w†C3P:‡RG`C=HUZcˆ!G%HiJLA‰[email protected]ŠZECeHUP:@BGBS�?3DcJU?=kUGIHŠS`AXC�QRGBZcC�Hi?3DcJUZEC3FjZcCePXJL_3GMkrHM�G`ˆ:GMCedp_3GMC�JU_3G

[email protected]�DEA:C3F:?�AXF:G�ZqH#HUl�P!n:GMCjZEC�Q�P:JU_�‹Kdp_3G`C�[email protected]‰ˆ:GINOaŒ[email protected] [ ?3Q=A�JLP{Ž?3G`kiJLP�pZESMP‘ŒQRPXJL_

’ l�AXC3ZqHi_�HUlRG`AXnWZEC3F�S�P!?3CTJUkLZcGIHL“uA!HŽA�Sr_3ZEDqN5!wƒaŒP!?3C=N�JU_�A‰JxHUP:@BG#SMP:@[email protected]:C{GMˆ!GMkLbvN3Agb [ ?3Q=A:C

dxP:krN3HxdfG`kUG�A‰J#QRG`HiJ#[email protected]�lRP:DEZcJUG‚ZECjK?3GMkUJUP{pZqSMA:Cj?�HUA:F:G

}#@%Q3ZEF:?3P!?=HMƒJL_3G‚HUA:@�G�l3_3krA:HUG�SMAXCjQRG�?=C=NvGMkrHiJUPWPvN�@Y?=DsJLZcl3DEG‚doAgbWH w†C^JL_3ZEHpS`A:HUG:WdxG

A:kUG�JrAXDEnTZEC3F^A:Q�P!?vJ#JL_3G%[email protected]‚G�Vvl3kLG`HLHUZcP!CjZEC�JL_3G%[email protected]�S�P!CTJUG�VWJ [ P_C=HUZEN3GMkpJL_3G%HUZcF!C=H#P!C

JL_3G‚H”JrAXZEkLHfZcCOJU_=ZEHoQ3?=ZcDqNvZEC3F=~w”•–AXC�Nj—�˜)™ 3š PWG`How”•›@BG`A:C^ZEC^JLP%JL_3G�Q3?3ZEDEN3ZcC3F�vP:koZEC^JLP

JL_3GŠHiJLA:ZckLdfG`DcDœ�—ŠkfSMP:C=HUZEN3GMkI+Zca1SMAXJŽaŒPWPWN�ZEHKaŒG`NBJLP%S`A‰JLH`WAXC=N{Q�AXQWbYaŒPWPvNBJLPYQ=A:Q3ZcGIHMTdp_=AXJ

ZqHŽyiSr_=GMG`HUG�aŒPWPWNW€”œ

}#CžZcCTJUG`kUGIH”JLZcC=F�JU_3ZEC3FeA:Q�[email protected]{AXC,DqAXC=F:?=A:F:GBZEH�JU_=AXJ%G`ˆ:GMCŸZEC=S�P!kUkLG`S�J�G�Vvl3kLG`HLHiZEP:C�H

S`AXkLkUb�@BGIAXC3ZEC3F=1aŒP:kfG�[email protected]:Tdp_3G`C�dfGŠAXkLG#H”JrAXkUJUZEC3F‚JUPYDcGIAXkLC�A‚DEA:C3F:?�AXF:GpdxG)S`AXC{@{AXn!G

P!?3kLHUGMDEˆ:GIHƒ?3C=NvG`kLHiJUPWPvN�QTb%PXJL_3GMkrHuQ�GMaŒP:kLGpdfGp_�Agˆ:GxaŒ?3DEDcb%@BA!H”JLGMkLG`N%JU_=GpF:[email protected]@{AXk yiw1_=Agˆ!G

_W?3C3F!kUb!€)ZqHoC3PXJ#SMP:kLkUGIS J`TQ3?vJ#dxG�dfP!?3DENO?3C=N3GMkrH”JrAXC=N^ZsJpJLP%@BG`A:C^JL_3G‚HilRG`A:n:G`kxZqHx_W?3C3F:kLb

}#DED�JU_=G�A:Q�P‰ˆ!G)@{AXn!G`[email protected]{AXC^DqAXC=F:?=A:F:G`HfNvkrA:HiJUZqSMAXDEDEb�?=C=Hi?=ZsJrAXQ3DEG�aŒP:koS�P:@[email protected]%?3C3ZqSMA‰JLZcC=F

dpZcJU_¡AŸS�[email protected]�l=?vJUG`k uGMP!l3DEG�_=Agˆ:GjDcP!C3FŸJUkLZEG`NžJUP,F:GMJ�AŸ[email protected]:C9DqAXC=F:?=A:F:G�ZcCTJLGMkUa¢A:S�GjJLP

SMP:CTJUkLP:D�S�[email protected]�l=?vJUG`kLH�¢SMP:@Bl3?vJLGMkrHKJL_=A‰J#A!SMS�G`lvJoˆ:P:ZqS�G�S�[email protected]Dr. Ernesto Gomez : CSE 570/670

1. Miscellaneous updates

These will be incorporated into Lecture Notes 1-4, where these topics are covered

1.1. Top down parsing. I have just found a reference on modern top-down pars-

ing methods : https://www.sanity.io/blog/why-we-wrote-yet-another-parser-compiler.

Have not had a chance to review in detail, but at first glance this looks good – it

is also very new, the site went up in December 2019. It links to a selection of the

papers where the techniques are developed, and it introduces a parser generator

equivalent to Yacc for topdown parsing. This material will probably be covered in

class next year, but you can start looking at it now – If your work in the future

involves compilers or any large application that incorporates a language, you will

need to know about this.

1.2. Non-determinism and Finite Automata (this text has been incorporated in Lecture Notes 3). In Lecture Notes 3, we suggested an approach to

constructing a deterministic finite automaton from a non-deterministic one using

depth-first search and backtracking, with a stack. The text book constructs (chap-

ter 3) a detailed algorithm, first for converting a non-deterministic finite automaton

(NFA) to a deterministic one (DFA), and then another algorithm to minimize the resulting DFA to the smallest possible DFA that accepts the same language. We then

go full circle by converting the DFA to a regular expression – this whole sequence

serves as a proof that Regular Expressions (RE) are equivalent to NFA, and that

NFA are equivalent to DFA. That is, RE, DFA and NFA can recognize the same

set of languages.

Further, the ability to minimize a DFA gives us the ability to determine if any

pair of language definitions (RE, DFA or NFA) accept the same language – if two

RE, or two FA minimize to the same DFA, then they accept the same language –

that is, they mean the same thing.

All of this is of great theoretical interest but very little practical interest. The

problem is, the algorithm to convert NFA to DFA has exponential complexity.

Therefore the conversion can only be done for small languages and automata – the

amount of work required for a more complex automaton or expression grows too fast.

(The same applies to the approach we suggested in Notes 3 – it is a backtracking algorithm, and such algorithms also have exponential complexity.)
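To make the subset construction concrete, here is a small sketch in Python (our own illustration, not the book's algorithm); it omits ε-moves for brevity. The exponential blowup mentioned above comes from the fact that each DFA state is a set of NFA states, and there can be 2^n such sets for an n-state NFA:

```python
# Sketch of the NFA-to-DFA subset construction (illustrative, no epsilon moves).
# Each DFA state is a frozenset of NFA states.
from collections import deque

def nfa_to_dfa(nfa, start, alphabet):
    """nfa: dict mapping (state, symbol) -> set of next NFA states."""
    start_set = frozenset([start])
    dfa = {}                              # (state-set, symbol) -> state-set
    worklist = deque([start_set])
    seen = {start_set}
    while worklist:
        current = worklist.popleft()
        for sym in alphabet:
            # union of all NFA moves from any state in the current set
            nxt = frozenset(s for q in current for s in nfa.get((q, sym), ()))
            dfa[(current, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                worklist.append(nxt)
    return dfa, seen

# NFA for strings over {a, b} ending in "ab" (a toy example of ours)
nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}}
dfa, states = nfa_to_dfa(nfa, 0, 'ab')
print(len(states))  # 3
```

Here the DFA stays small, but in the worst case the loop visits exponentially many state sets, which is exactly why the conversion is only practical for small automata.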

1.3. Lex and Yacc. The front end to the compiler uses a lexical analyzer, built on top of finite automata,

defined by regular expressions. An example of this is described in 5.7, Lecture Notes

3. You can find a RE that recognizes the format of integers, so it can read text and

pick out integers, get their value, and report this to the parse program. We do this

because that means we can define something like: “addition → number+number”

and we don’t have to define the details of the number format when we define what

addition looks like.
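As an illustration of the idea (our own sketch, not LEX output), a few lines of Python can use the integer regular expression to pick numbers out of text, get their values, and report token/value pairs to the parser; the token names NUMBER and PLUS are invented for the example:

```python
# Sketch of a lexer that recognizes integers by regular expression and
# reports (token, value) pairs. NUMBER and PLUS are illustrative names.
import re

TOKEN_RE = re.compile(r"\s*(?:(?P<NUMBER>[1-9][0-9]*|0)|(?P<PLUS>\+))")

def tokens(text):
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"bad input at position {pos}")
        pos = m.end()
        kind = m.lastgroup                     # which named group matched
        # report the token kind, converting NUMBER lexemes to their value
        yield (kind, int(m.group(kind)) if kind == 'NUMBER' else m.group(kind))

print(list(tokens("29 + 13")))
# [('NUMBER', 29), ('PLUS', '+'), ('NUMBER', 13)]
```

The parser then sees only the token stream, never the digit-by-digit format of a number, which is exactly the division of labor described above.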

The structure of a computer language translator

1. Overview and practical issues

Recalling Lecture 1 – we are going to define a computer language using a formal

description because we want to be able to decide in an unambiguous way if input

text is in the language or not. (This is a big part of the reason for the invention

of formal systems – mathematicians got into arguments about whether something

was a proof of a theorem or not. Formal systems came into being as definitions

of the rules of reasoning that are acceptable for proving stuff, so all could agree

that if the proof followed the rules, it was a correct proof). When we do this,

we split the translation problem into two hopefully simpler chunks: First, decide

if the text follows the syntax rules. If it doesn’t, we reject it as not being in the

language, rather than trying to guess what the programmer really wanted to say.

Once we know some string is in the language, then we decide what it means (the

semantics). Language theorists have several ways of defining meaning, but from

the point of view of compilers, we need to translate the input text into actions by

the computer. We therefore use operational semantics – the text means what the

computer is required to do.

How should we implement these concepts? We have seen that we have a range

of options, from a traditional interpreter to a compiler that outputs machine code,

with a whole range of possibilities in between. There are advantages and disadvantages to every design choice, and as we have seen, there are real world versions of

every one of them.

Consider first the traditional compiler: we want it to translate from source text

to machine language. We could just build a large, monolithic program that incor-

porates both the input in a computer language and the machine executable output.

The problem here is, both the input language and the machine environment are moving targets. Languages are revised periodically, sometimes drastically. For example,

there have been five official revisions of the C++ standard, in 1998, 2003, 2011, 2014 and 2017, and a new revision is due this year. But this is not the whole story

– C++ started in 1985, and a lot changed before the International Organization for

Standardization (ISO) defined the 1998 standard (for example, templates were not

in the original language).

The runtime environment also changes. Early compilers (COBOL, FORTRAN,

others) could build on the assumption that the generated code would have the ma-

chine to itself, but modern compilers need to target both the machine architecture

and the operating system that runs on it. Both of these change quickly with time,

and are not unique – at any given time we may need to generate code for multiple

versions of multiple architectures (different versions of Intel, AMD and ARM processors, just to cover the most basic types) and operating systems (multiple versions of Windows, Linux, Unix, iOS, and others).

1. A note on parsing strategy

Suppose we have a set of TERMINALS T= { terminals – things that compose

text in the language }. T stands for “Terminals”, they are a more general form of

the definition of alphabet Σ. The alphabet is a finite list of symbols. The set of

terminals is also a finite list, but the terminals themselves don’t have to be finite.

Imagine we want to describe the syntax for adding a list of numbers. It might look

something like:

ADD_LIST = { RESULT = SUMS, where SUMS = NUMBER or SUMS =

SUMS + NUMBER }

(In Chapter 4 we will call this kind of thing a context-free grammar).

T= {NUMBER, =, +}. Why not SUMS and RESULT? We can see that these

two words are placeholders (we will be calling them NON-TERMINALS later),

when we actually express a sum, it will look like: NUMBER = NUMBER + …

+ NUMBER. SUM and RESULT never appear in the actual sum, all we have are

numbers, and the symbols “+” and “=”.

So : if we want to express a sum, we could start from RESULT = SUMS, then

expand SUMS into SUMS+NUMBER. We can keep on doing this to SUMS as long

as we want, then when we want to stop we use the rule SUMS = NUMBER. Then we can add everything, and replace RESULT with whatever we have added, and we

end up with NUMBER = list of numbers separated by + signs. (do we need a rule

RESULT = NUMBER?). Anyway, once we have replaced everything with numbers

and symbols + and =, we stop – there are no rules for changing a number or {+,=}

into anything else. That is why we call them “terminals” – they end (terminate) the

sequence of converting our syntactic definition to a particular sum.
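The expansion just described can be played out mechanically; this little sketch (our own illustration) performs the derivation by string rewriting, one rule application per step:

```python
# Derivation by string rewriting, using the rules of ADD_LIST.
# At each step exactly one SUMS is present, so str.replace rewrites it.
s = "RESULT = SUMS"
s = s.replace("SUMS", "SUMS + NUMBER")   # SUMS -> SUMS + NUMBER
s = s.replace("SUMS", "SUMS + NUMBER")   # apply the same rule again
s = s.replace("SUMS", "NUMBER")          # stop with SUMS -> NUMBER
print(s)  # RESULT = NUMBER + NUMBER + NUMBER
```

Note that RESULT is still standing at the end, which is the point of the parenthetical question in the notes: we would need one more rule to replace it.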

(What, then, are the terms RESULT and SUMS? They don’t appear in the final

expression, we have rules that allow us to change them into something else – they are

not “terminals” so we call them “non-terminals” (sometimes we can be very literal

in our naming conventions!). Now, what are we to make of the term NUMBER?

The word NUMBER appears in our description for the sum of a list of numbers

– but when we actually write such a sum down, we will replace each instance of

NUMBER with actual numbers – NUMBER=NUMBER+NUMBER would actually

be something like 42=29+13. So NUMBER is a terminal, but it is not a fixed

symbol – rather it describes a pattern that allows us to generate the actual text.

For example, NUMBER = (−|ε)(1 . . . 9)(0 . . . 9)∗ | 0, the regular expression we used

before to describe what an integer looks like. When we said in (lecture notes 2)

that we were going to use multiple machine types in translation, this is the kind of

thing we meant. )

We will use finite automata to describe simple patterns that we will then use to

simplify higher-level definitions, this is in Chapter 3 of our text.

1.1. Derivation. What we have done: ADD_LIST is a set of rules that describe

what adding a list of numbers look like – it is a formal language definition, that

describes which strings in (numbers, + and = signs)∗ are actually in the language

ADD_LIST.

We now consider parsing algorithms. This material is in chapter 4

1. Support algorithms : First and Follow

Recall that parsing is the application of grammar definitions in reverse. Take

our example grammar for arithmetic expressions:

S → E

E → E + T | E → T

T → T ∗ F | T → F

F → id | F → num | F → (E)

with N = {S,E,T,F}, T = {+,∗,(,), id, num}

(Same grammar, with added terminal “num” and using “or” for a more compact

representation).

In order to parse, we need to have some idea of what different grammar rules can

generate, so we can make decisions on what text is a candidate for replacement by

the left-hand side of a rule. (In the terminology we defined in the previous lecture notes, we need to identify the “handle” – which is in (N ∪ T)∗.)

FIRST sets tell us, for any symbol in (N ∪ T), which terminals can appear first in a string derived from that symbol; FOLLOW tells us which terminals can appear immediately after each symbol in N.


2. The following material gives a slight variation on the algorithms

presented in 4.4.2

3. First Sets:

To generate: FIRST(X) for all X ∈ N ∪T.

(1) If X ∈ T , FIRST(X) = {X}.

(2) If X → ε is a production, add ε to FIRST(X).

(3) If X → Y1Y2 . . . Yi . . . Yk is a production: Base case i = 1. Induction: add everything in FIRST(Yi) except ε to FIRST(X); then if ε is in FIRST(Yi), increment i and repeat; else stop. If i > k, add ε to FIRST(X) and stop. Repeat for all productions.

(4) Repeat step 3 until there is no change to any of the FIRST sets.

4. Follow Sets:

To generate: FOLLOW(X) for all X ∈ N.

Add an end marker $ ∉ N ∪ T to the symbol set.

(1) Place $ in FOLLOW(S), where S is the start symbol in G.

(2) If A → αBβ is a production, add everything in FIRST(β) except ε to

FOLLOW(B). Repeat for every production and every variable that is not

at the end of the production.

(3) If A → αBβ and ε is in FIRST(β), or A → αB is a production, add

everything in FOLLOW(A) to FOLLOW(B).

(4) Repeat steps 2 and 3 for every production until nothing new is added to any of the FOLLOW sets.
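The two fixed-point constructions above can be sketched in Python (our own illustration, using the expression grammar from these notes):

```python
# Fixed-point computation of FIRST and FOLLOW, following the steps above.
# Grammar symbols are plain strings; EPS marks the empty string epsilon.
EPS = 'eps'

grammar = {                      # the expression grammar from the notes
    'S': [['E']],
    'E': [['E', '+', 'T'], ['T']],
    'T': [['T', '*', 'F'], ['F']],
    'F': [['id'], ['num'], ['(', 'E', ')']],
}
nonterminals = set(grammar)
terminals = {'+', '*', '(', ')', 'id', 'num'}

def first_sets():
    first = {t: {t} for t in terminals}          # step 1: FIRST(a) = {a}
    first.update({n: set() for n in nonterminals})
    changed = True
    while changed:                               # step 4: iterate to fixed point
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                for sym in body:                 # step 3: scan Y1..Yk
                    first[head] |= first[sym] - {EPS}
                    if EPS not in first[sym]:
                        break
                else:                            # every Yi can derive epsilon
                    first[head].add(EPS)
                if len(first[head]) != before:
                    changed = True
    return first

def follow_sets(first):
    follow = {n: set() for n in nonterminals}
    follow['S'].add('$')                         # step 1: $ in FOLLOW(start)
    changed = True
    while changed:                               # step 4: iterate to fixed point
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in nonterminals:
                        continue
                    before = len(follow[sym])
                    eps_all = True               # does the rest derive epsilon?
                    for b in body[i + 1:]:       # step 2: FIRST(beta) - {eps}
                        follow[sym] |= first[b] - {EPS}
                        if EPS not in first[b]:
                            eps_all = False
                            break
                    if eps_all:                  # step 3: beta empty or nullable
                        follow[sym] |= follow[head]
                    if len(follow[sym]) != before:
                        changed = True
    return follow

first = first_sets()
follow = follow_sets(first)
print(sorted(first['E']))    # ['(', 'id', 'num']
print(sorted(follow['E']))   # ['$', ')', '+']
```

Running this reproduces what the algorithm gives by hand: every FIRST set of a non-terminal here is {(, id, num}, and FOLLOW(T) additionally picks up ∗ from the production T → T ∗ F.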


5. LR parsing automata

We here discuss material from sections 4.5 to 4.7 in Aho and Ullman, on bottom-

up parsers using LR methods.

We are not going into recursive descent topdown methods (section 4.4) because

they only work for a strict subset of the languages that can be parsed by the LR(k)

bottom-up methods. The top-down method described in

https://www.sanity.io/blog/why-we-wrote-yet-another-parser-compiler

is more general than LR(k); it is claimed to be able to parse all the context-free languages, whereas LR(k) parses only the deterministic context-free languages. As developed so far, the new top-down methods and tools work substantially more slowly than LR(k) – both in complexity, which is worst-case cubic,

and in speed on equivalent languages. At present, the bottom-up methods that we

will study are more practical.

We now consider parsing algorithms. This material is in chapter 4

1. An LR parser

Having developed algorithms for FIRST and FOLLOW sets, we have seen how to construct LR(0) sets with CLOSURE and GOTO functions, such that our LR(0) sets

are states, and the GOTO functions are transitions between states which occur

when we “read” a specific symbol, as in our finite automata. We extend the meaning of “read” to mean “PUSH a symbol to stack top”; this happens when we take the first symbol of the (unprocessed) input text and push it onto the stack, or when

we pop symbols from the stack corresponding to a handle (the right-hand side of a

production) and PUSH the left-hand symbol of the production onto the stack. The

first case (a standard read action) gets a terminal symbol on the stack, the second

case gets a non-terminal symbol on the stack. Terminal characters are handled just

like we would in a finite automaton; non-terminals are pushed on the stack when we reach a state which has an item of the form A → α. ; an item of this form

means we have finished processing a handle that corresponds to α and it is on stack

top, and that we use the production A → α to POP α and PUSH A.
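The stack discipline just described can be sketched as a table-driven loop. The grammar and tables below are a hand-built toy of ours (E → E + id | id), not the full expression grammar of the notes, but the shift/reduce/accept mechanics are the same:

```python
# Minimal table-driven shift-reduce driver for the toy grammar
#   0: S -> E    1: E -> E + id    2: E -> id
productions = [('S', 1), ('E', 3), ('E', 1)]   # (head, body length)

# Action table: (state, terminal) -> ('s', state) | ('r', production) | ('acc',)
action = {
    (0, 'id'): ('s', 2),
    (1, '+'): ('s', 3), (1, '$'): ('acc',),
    (2, '+'): ('r', 2), (2, '$'): ('r', 2),    # reduce E -> id
    (3, 'id'): ('s', 4),
    (4, '+'): ('r', 1), (4, '$'): ('r', 1),    # reduce E -> E + id
}
goto = {(0, 'E'): 1}                           # Goto section: non-terminals

def parse(tokens):
    stack = [0]                                # stack of states
    tokens = tokens + ['$']                    # end-of-input marker
    i = 0
    while True:
        act = action.get((stack[-1], tokens[i]))
        if act is None:
            return False                       # syntax error: empty table entry
        if act[0] == 'acc':
            return True
        if act[0] == 's':                      # shift: consume token, push state
            stack.append(act[1])
            i += 1
        else:                                  # reduce: pop handle, use Goto
            head, length = productions[act[1]]
            del stack[-length:]
            stack.append(goto[(stack[-1], head)])

print(parse(['id', '+', 'id']))  # True
print(parse(['id', '+']))        # False
```

Notice that a reduce move does not consume input: after popping the handle and pushing the Goto state, the same lookahead token is examined again, exactly as in the description above.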

We will continue to use our expression grammar for examples,

S → E

E → E + T | E → T

T → T ∗ F | T → F

F → id | F → num | F → (E)

with N = {S,E,T,F}, T = {+,∗,(,), id, num}

We will also make reference to figure 4.31, page 244 of the text, which gives an

LR(0) automaton for this grammar, built using the algorithms in the previous set

of notes and in section 4.6.2 in Aho and Ullman.

1.1. Parsing with a shift-reduce automaton. Our automaton is table driven,

much like the Deterministic Finite Automata you have worked on previously in

Lab 1. The difference is that our control table is divided into 2 sections, called the

Action table and the Goto table (these names are somewhat confusing, for example

the Goto table is not identical to the GOTO function – also note that since the

rows are the same for both tables, they are usually written as two sections of the same control table).

The states of the parser are given by the LR sets (in our example these are LR(0) sets, but the same method works for LR(k), the only difference being the parse tables).

Rows of our control table are numbered with the state numbers we assigned

to the LR sets when we constructed them. The numbering depends on the order in

which we generate the sets, and makes no difference to the parse function; the only fixed thing is that the start state – state 0, in row 0 – is generated from a single item S → .α corresponding to the start production.

Columns of the Action section of the control table are labelled with all the

characters ∈ T , and there is an added column for the symbol $ which denotes end

of input text (we have seen this convention when we generated the FOLLOW sets).

Columns of the Goto section are labelled with all the symbols ∈ N. Notice that

every combination (state, X ∈ N ∪ T |

Look at section 3.5 in the text, which goes into how LEX works and how to use it.

There is a useful summary of the material in Chapter 3, starting on page 189.

1. Parsing

Read section 4.1, which is a general introduction to parsing. In particular, 4.1.2

lists three equivalent grammar specifications for the same arithmetic expressions –

we will be using these, and variants of them in most of the examples. 4.1.3 and

4.1.4 are useful background, but: With modern compilers and computer speeds, the

compilation cycle is much faster than it used to be – trying to find and report all the

errors in a program can be necessary if the edit-compile program cycle takes hours

(typically it took a day or more before the personal computer era), but when you

can compile in minutes or less on an interactive system, it is legitimate to report

the first one or two errors and halt the compilation. In practice, if we see a list of

50 errors (as an example), most of us will correct the first two and then try again,

based on the experience that many of the later errors are triggered by earlier errors; many will disappear when we fix the early ones.

1.1. Context-Free languages and grammars (CFG).

1.1.1. Parsing and Derivation. We have defined a language L as a subset, selected

from a set of strings that are combinations of specific building blocks. For regular

languages, the building blocks were a set of symbols we called an alphabet Σ; we later allowed some of our building blocks to be more complicated things that could be defined by regular expressions (keywords like if and while, things like number, built up of a restricted set of alphabetic characters and having a format that can be defined by a regular expression (RE)). We have called this the set of terminals

T. Therefore L ⊆ T∗ (or Σ∗) and we have a set of rules defined in some way that

say what combinations of symbols are in the set L and what combinations are not.

We divide languages into classes, depending on how complicated the rules can

get, and what theoretical machine can implement the rules. Our simplest class, the

regular languages, are defined by a regular expression and an alphabet Σ. There

is an allowed start state, denoting the symbol(s) that can appear at the start

of a string, and what can be done next after seeing a particular character. The

rules can be used to generate a string by successive application of rules; this is

called a derivation. Or, given a string, the rules can be used to decide if the

string is in the language or not, called a parse. We have seen that a particular

machine, the Finite State Automaton (FA), which can be deterministic (DFA) or

non-deterministic (NFA), is equivalent to a regular expression, and in fact that

NFA is equivalent to DFA. This means – any derivation that can be done using the

regular expression can also be done by a DFA or an NFA, and that any language that

can be defined by a regular expression.

1. Notes on Code generation and analysis (see Ch. 5, 8, 9, Aho + Ullman)

What to generate depends on source language, architecture of target machine or

runtime sys.

Common intermediate codes: 3 address, high-level ( Lisp/Scheme, C, Modula 2)

parse tree/CFG. I-code is still research topic.

Three-address code.

Statements of form x := y op z

types:

assignment

binary op x := y op z => OP x y z

unary op x := op y => OP x y

copy – x := y => MOV x y

indexed assignment

x[i] := y => MOVI x y offset(i)

x := y[i] (i is offset on address of x | y)

pointer/address assignment

*x := y => MOV contents(address(x)) y

x := *y => MOV x content(address(y))

x := &y => MOV x address(y)

jumps

unconditional jump: GOTO label(x) => JMP x

conditional jump: IF x op y GOTO z => COND x y z

calls

param x => PARM x => PAR x

..

call label(x) number-parm(y) => GOSUB x y

..

return value(x) => RET x
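As an illustration (our own sketch, not from the text), three-address statements of the form x := y op z can be generated from an expression tree by introducing a fresh temporary for each operator:

```python
# Sketch: linearize an expression tree into three-address statements,
# introducing a fresh temporary t1, t2, ... for each interior operator node.
counter = 0
def new_temp():
    global counter
    counter += 1
    return f"t{counter}"

def gen(node, code):
    """node is a leaf name/constant (str) or a tuple (op, left, right);
    appends three-address statements to `code` and returns the result name."""
    if isinstance(node, str):
        return node                      # leaves need no code
    op, left, right = node
    l = gen(left, code)                  # generate operands first
    r = gen(right, code)
    t = new_temp()
    code.append(f"{t} := {l} {op} {r}")  # one x := y op z statement per node
    return t

code = []
result = gen(('*', 'b', ('+', 'c', 'd')), code)   # a = b * (c + d)
code.append(f"a := {result}")
print('\n'.join(code))
# t1 := c + d
# t2 := b * t1
# a := t2
```

The copy statement at the end is the MOV form from the list above; a real generator would also fold it into the last operation when possible.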

Loops := conditional backward jumps

// do {} while

label z

{ loop body }

if(x op y) jmp z

or

// while{}

label w

if NOT (x op y) jmp z

{ loop body }

jmp w

label z


Logic (if-then-else) := conditional/unconditional forward jumps

if( NOT (x op y ) ) jmp z

{ true block }

jmp endif

label z

{ false block }

label endif

Note: jumps are not L-attributed (inherited values) or synthesized values:

(L-attribute means on the parse tree to the left of where we need the value)

Backward jumps – label is on LH side of tree, but arbitrarily far

Forward jumps – label is on RH side of tree.

Therefore handling labels cannot be done in grammar, requires symbol table.

Implies => we can’t (easily) check correctness of IF-THEN-ELSE structure built

with jumps (so we dislike GOTOs!!). When we build structure using

jumps from IF-THEN-ELSE, we know it is correct.

Backpatching:

To handle things like addresses of forward jumps, we place marker when

we encounter the jump, fill it in on second pass after we find the labels

(error message if we don’t find the label).
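A minimal sketch of this two-pass scheme (our own illustration, over a toy instruction list) looks like:

```python
# Backpatching sketch: pass 1 records label addresses while emitting code
# with symbolic jump targets; pass 2 fills the targets in (or reports an
# error for an undefined label).
code = [
    ('jmp', 'endif'),    # forward jump: target unknown on the first pass
    ('op', 'x'),
    ('label', 'endif'),  # pseudo-instruction, occupies no output slot
    ('op', 'y'),
]

def backpatch(code):
    labels, out = {}, []
    for instr in code:                    # pass 1: strip labels, note addresses
        if instr[0] == 'label':
            labels[instr[1]] = len(out)
        else:
            out.append(instr)
    patched = []
    for op, arg in out:                   # pass 2: fill in jump targets
        if op == 'jmp':
            if arg not in labels:
                raise ValueError(f"undefined label {arg}")
            patched.append((op, labels[arg]))
        else:
            patched.append((op, arg))
    return patched

print(backpatch(code))  # [('jmp', 2), ('op', 'x'), ('op', 'y')]
```

Backward jumps could be resolved in the first pass, since their labels are already known when the jump is emitted; the second pass is what the forward jumps require.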

Analysis of larger code blocks: Expression has value as synthesized attribute

– what is value of an assignment statement? (C++ says it is a boolean, value T-F)

Values/types transferred from one statement to another via memory/symbol

table.

Problem: when can values change?

Example: the ++x, x++ operators in C/C++ cause a change during execution of a statement, which gets in the way of DAG representation/optimization.

We dislike side effects because they get in way of optimizations/ interpretation of

semantics: .

x = 5; y = 10;

z = y*(x + y++); // what is value of z?

// otherwise

z = (++y)*(x + y);

.

Change in the value of y is a side effect of the z assignment statements.

Questions: – When do we know that contents of a given memory location/register

is unchanged? (so we can use common expression elimination). –

3

When can we change statement/evaluation order without affecting meaning? (so

we can reorder things so can keep values in registers or cache).

Statements in a basic block may be analyzed together, since they will be executed


We will be covering material from :

Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman; “Compilers – Principles, Techniques and Tools”, Second Edition, Addison Wesley.

(This is the classic text on Compilers – it has been around in various versions

and updates since the 1970s and has been revised and updated, most recently in

the second edition in 2007. Unlike many other texts, it gets better in every edition.

It is often called “the Dragon book”, in part because it has a dragon on the

cover, but mainly because it is not a friendly book – you will find that there are

places where you have to study one page for an hour to understand a theorem or

algorithm. It feels like the dragon is eating your brain – it is worth it, however –

or it will feel that it is worth it after your brain has been rewired. You will love

the book, appreciate and understand it the third time you read it. And you will

come back to it – it is the compiler text that has shaped everybody in the field,

the standard reference for compilers. It belongs in the professional library of every

computer scientist and engineer.)

A tentative schedule:

Chapter 1 is an introduction, chapter 2 gives an outline of how compilers work,

using top-down parsing. We will be concentrating on bottom-up parsing in the class; this starts with chapter 3. We will study most of chapters 3, 4, 5 and then skip around the rest of the book – details will appear in lecture notes.

We will not cover everything in the book, and we will explore some topics that

are not in the book – these will appear in the notes. All material covered in class

may appear in examinations.

1. Motivation

Why study compilers? Few of us will ever be called upon to write or modify a

compiler. Here are some justifications for all the work involved.

<1> Pragmatism: Compilers are tools that we use to build all our applications. We

need to understand the properties and limitations of our tools.

<2> Art: To the right(?) kind of mind, compilers are beautiful!

<3> Science: We need to be able to express things in a language understandable to

us, then translate it to a language understandable to machines. The problem: how

do we make sure that the meaning is translated?

• Syntax: the textual rules for expressing things in a language.

• Semantics: the meaning conveyed by the expression.

• The syntax is fully visible, but the meaning can be context dependent, may

require additional knowledge to interpret.

• Syntax + (context) + (world knowledge) => Semantics.

• Ambiguity: same syntax can mean different things. English is ambiguous.

• “Time flies like an arrow.” – What does it mean? Consider: Is the subject “time” or “flies” (or “time flies”, a kind of fly)? Is the verb “time”, “flies”, or “like”?

• Early computer translation example: “Out of sight, out of mind.” => to Russian; back from Russian => “Invisible insanity.”


Two tracks: “the Dragon book”: Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman.

We revisit some concepts from lecture 1, then start on formal language defini-

tions.

1. Thoughts about human and computer languages

Human languages tend to be contextual and ambiguous.

Contextual: the meaning of anything we say must be interpreted taking into

account the context in which it is said. For example: if I say “Look out! A shark!”

in class, context tells you this is an example of something. If I yell the same thing

while we are swimming in the ocean, it is either an urgent warning or a bad practical

joke. Cultural context also plays a large part in interpretation of human language.

Inoffensive statements in some cultures can be insulting in others, even when the

same language is spoken in both – when I moved from Cuba to Puerto Rico (both

Spanish speaking countries) as a child, I found that some common everyday Cuban

words were at best impolite in Puerto Rican usage.

Ambiguous: the same phrase can be understood multiple ways. In this case, we

are talking about the same expression in the same context. Consider the signs on

the stairs in this building: IN and OUT. Does IN mean in to the building, or in to

the stairwell? Or consider: if cat food is fed to cats, and baby food to babies, what

is “cheese food”?

An interesting thing about human language is that even incorrect expressions

carry meaning: for example, when we are starting to learn a language we can make

ourselves understood by others before we have fully mastered the grammar. “I have

hungry” is not correct, but we would understand it to mean the speaker is hungry.

All the above makes human languages drastically unsuitable for communicating

with a computer. People have long tried to get a human language interface to

control computers (computers that accept voice commands are the latest variation

on this). What we really want, however, is a computer that does what we mean,

and in human language we can’t always tell what we mean by what we actually

say.

Our best solution to this problem to date is to design artificial languages for

computers, with properties different from human languages. We want the following

properties:

• Freedom from ambiguity: A statement in a computer language should have

only one meaning.

• No dependence on external context: The meaning of a statement should be

understandable given only the statement itself (or at least the program of

which it is a part), and the meaning should not change with circumstances.

To these we add two practical considerations:

• It should be possible to determine the meaning of a statement without doing

too much work or spending too much time. That is, the time complexity

of translation should be small.

• The syntactic correctness of a statement should be unambiguous. That is,

there should be no doubt, independent of meaning, whether a statement is

correct (in the language) or not.

As computers become more powerful, what was considered “too much work” in the

past may become practical.

1. Syntax-directed translation

Human languages communicate information through a combination of syntax,

meaning and context. Arguably meaning and context are more important, we can

communicate well with bad grammar. One of the consequences is that human

languages convey a much richer, but less precise meaning. Computer languages are

much more limited in what they communicate, all they do is specify what actions a

machine will perform to carry out some algorithm. We design computer languages

to be unambiguous – any string in the language should have a unique meaning.

We use context free grammars because this makes meaning local to the text we

are translating – an expression in the language should be translatable only using

information that appears next to it in the text. Technically, a context-free grammar

lets us guarantee that we can always decide if a text string is in the language or

not, and we can do this efficiently (The LR grammars and shift-reduce parsers we

have defined earlier can parse in linear time, other algorithms may take somewhat

longer (and allow dealing with extensions to context-free grammars) but we can

still guarantee worst-case polynomial time, usually better than n³).

2. Simple translation without optimization.

Consider our expression grammar from chapter 4, which might be appropriate

for a calculator – we here extend it to an assign statement. We add A (assign

statement) to the non-terminals, and the symbols v,= to the terminals. In this

context, istands for either a vatiable or a numeric value. We have annotated each

line with an expression in curly brackets that says what it does – the arithmetic

operators have their usual meaning, and we are adding two functions – the comment

field here describes what the functions do. We are using a pseudo-code to describe

what we do to translate the expression, but the actual code, like what we would

write in the curly brackets in Yacc, depends on what we are translating our

statements to – another language? binary code? functions of an interpreter?

2.1. Assign statement.

(1) A → v = E { store( E.value, v ) // if v is a variable, store a value to its

address in memory – else error }

(2) E1 → E + T { E1.value = E.value + T.value }

(3) E → T { E.value = T.value }

(4) T1 → T ∗ F { T1.value = T.value * F.value }

(5) T → F{ T.value = F.value }

(6) F → i { F.value = value_of( i ) // if i is a string, get numeric value – else

if i is a variable, get its value – else error}

(7) F → (E) { F.value = E.value }
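To make the semantic actions concrete, here is a minimal sketch – not the course's actual Yacc code – that runs the same actions from a hand-written recursive-descent evaluator instead of a shift-reduce parser. The functions store() and value_of() follow the comment fields in productions 1 and 6; the dict-based symbol table and the tokenizer are assumptions added for illustration.

```python
import re

symtab = {}  # assumed symbol table: variable name -> numeric value

def store(value, v):            # production 1: A -> v = E
    symtab[v] = value

def value_of(i):                # production 6: F -> i
    try:
        return float(i)         # i is a numeric string
    except ValueError:
        return symtab[i]        # i is a variable; KeyError if undeclared

def tokenize(s):
    return re.findall(r'[A-Za-z_][A-Za-z0-9_$]*|\d+\.?\d*|[-+*/()=]', s)

def parse_assign(tokens):       # production 1: A -> v = E
    v, eq = tokens.pop(0), tokens.pop(0)
    assert eq == '='
    store(parse_E(tokens), v)

def parse_E(tokens):            # productions 2, 3: E -> E + T | T
    value = parse_T(tokens)
    while tokens and tokens[0] == '+':
        tokens.pop(0)
        value = value + parse_T(tokens)
    return value

def parse_T(tokens):            # productions 4, 5: T -> T * F | F
    value = parse_F(tokens)
    while tokens and tokens[0] == '*':
        tokens.pop(0)
        value = value * parse_F(tokens)   # note *, not +
    return value

def parse_F(tokens):            # productions 6, 7: F -> i | ( E )
    if tokens[0] == '(':
        tokens.pop(0)
        value = parse_E(tokens)
        tokens.pop(0)                     # discard ')'
        return value
    return value_of(tokens.pop(0))

parse_assign(tokenize("x = 2 + 3 * 4"))
print(symtab["x"])   # 14.0
```

Each parse function returns the .value attribute of its non-terminal, so the attribute flow is exactly the bottom-up flow the annotations describe: values are synthesized at the leaves (production 6) and combined as reductions happen.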

Note that we have added something that is not a syntactic property: in productions

1 and 6, the term “variable”. We can define the syntax for what a variable should

look like (a string that begins with a letter, followed by more letters, digits, special

characters like $ and _ which ends just before a space, line end, punctuation or

operator which are not allowed in the name ). In modern languages, however, such

a name is only a variable if it has been declared.
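The variable-name syntax described above can be captured by a single regular expression. This is a sketch under the assumptions stated in the text (leading letter, then letters, digits, '$' or '_', ending at the first disallowed character); a real lexer generator would compile an equivalent automaton.

```python
import re

# Leading letter, then any run of letters, digits, '$' or '_'.
# A match ends at the first space, operator, or punctuation character.
NAME = re.compile(r'[A-Za-z][A-Za-z0-9$_]*')

m = NAME.match("total_2$ + 5")
print(m.group(0))   # total_2$
```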
