The purpose of a compiler is
- to detect invalid programs
- to translate valid ones
Lexical Analysis
partition the input string into tokens
Token Class: Identifier, Keyword …
Regular Languages
Def. The regular expressions over $\Sigma$ are the smallest set of expressions that includes $\epsilon$ and the single characters of $\Sigma$ and is closed under
- Union
- Concatenation
- Iteration
Formal Languages
Def. Let $\Sigma$ be a set of characters (an alphabet). A language over $\Sigma$ is a set of strings of characters drawn from $\Sigma$
A regular language is a type of formal language that can be generated by a regular expression or recognized by a finite automaton (FA). Regular languages are a subset of the class of formal languages.
Meaning (the mapping from syntax to semantics) is many-to-one rather than one-to-many.
Lexical Specification
- maximal match: choose the longest possible match
- prioritized match: when several token classes match, choose the class listed first
- error match: Error = {all strings not in the lexical specification}, listed last (lowest priority)
Finite Automata
- regular expressions = specification
- finite automata = implementation
a finite automaton consists of
- An input alphabet $\Sigma$
- A set of states $S$
- A start state $n$
- A set of accepting states $F\subseteq S$
- A set of transitions $state \rightarrow^{input} state$
Deterministic Finite Automata (DFA)
- one transition per input per state
- no $\epsilon$-moves
Nondeterministic Finite Automata (NFA)
- can have multiple transitions for one input in a given state
- can have $\epsilon$-moves
DFAs are faster to execute
Subset Method for NFA Determinization
- $\epsilon$-closure
- transition
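A minimal sketch of these two ingredients (not from the course notes), assuming the NFA's transitions are given as a dict mapping `(state, symbol)` to a set of states, with symbol `None` standing for $\epsilon$-edges:

```python
from collections import deque

def epsilon_closure(states, delta):
    """All NFA states reachable from `states` via epsilon (None) edges."""
    closure, work = set(states), list(states)
    while work:
        s = work.pop()
        for t in delta.get((s, None), set()):
            if t not in closure:
                closure.add(t)
                work.append(t)
    return frozenset(closure)

def determinize(start, delta, alphabet, accepting):
    """Subset construction: each DFA state is a set of NFA states."""
    dfa_start = epsilon_closure({start}, delta)
    dfa_delta, seen, work = {}, {dfa_start}, deque([dfa_start])
    while work:
        S = work.popleft()
        for a in alphabet:
            moved = set()
            for s in S:
                moved |= delta.get((s, a), set())   # transition on symbol a
            T = epsilon_closure(moved, delta)
            if not T:
                continue
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                work.append(T)
    dfa_accepting = {S for S in seen if S & accepting}
    return dfa_start, dfa_delta, dfa_accepting
```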
Partition Method for DFA Minimization
Syntactic Analysis (Parsing)
CFG
Context-free grammars (CFGs) are a natural notation for the recursive structure of programming languages.
A CFG consists of
- A set of terminals $T$
- A set of non-terminals $N$
- A start symbol $S$
- A set of productions (Productions can be read as rules)
steps:
- Begin with a string with only the start symbol $S$
- Replace any non-terminal $X$ in the string by the right-hand side of some production $X \rightarrow Y_1…Y_n$
- Repeat the previous step until there are no non-terminals left
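For example, with the standard expression grammar $E\rightarrow E+E \mid E*E \mid (E) \mid id$ (not one defined above), the steps can produce $id * id$: $$ E\rightarrow E*E\rightarrow id*E\rightarrow id*id $$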
Def. Let $G$ be a context-free grammar with start symbol $S$. Then the language $L(G)$ of $G$ is: $$ L(G) = \{a_1\dots a_n \mid a_i\in T \text{ for all } i,\ S\rightarrow^{*} a_1\dots a_n\} $$
- Terminals are so-called because there are no rules for replacing them
- Once generated, terminals are permanent
- Terminals ought to be tokens of the language
Derivations
A derivation is a sequence of productions $$ S\rightarrow … \rightarrow … \rightarrow … \rightarrow … $$ A derivation can be drawn as a tree
- start symbol is the tree’s root
- For a production $X\rightarrow Y_1…Y_n$ add children $Y_1…Y_n$ to node $X$
left-most derivation: at each step, replace the left-most non-terminal
right-most derivation: at each step, replace the right-most non-terminal
right-most and left-most derivations have the same parse tree
Parse Tree
A parse tree has
- terminals at the leaves
- non-terminals at the interior nodes
An in-order traversal of the leaves is the original input
The parse tree shows the association of operations, the input string does not
Ambiguity
A grammar is ambiguous if it has more than one parse tree for some string. Equivalently, there is more than one right-most or left-most derivation for some string
- It is impossible in general to automatically convert an ambiguous grammar to an unambiguous one
- Used with care, ambiguity can simplify the grammar
- sometimes allows more natural definitions
- we need disambiguation mechanisms
- most tools allow precedence and associativity declarations to disambiguate grammars
Error Handling
Panic mode
- when an error is detected:
- discard tokens until one with a clear role is found
- continue from there
- Looking for synchronizing tokens
- typically the statement or expression terminators
Error productions
specify known common mistakes in the grammar
Abstract Syntax Tree
Like parse trees but ignore some details
A parse tree
- Traces the operation of the parser
- Captures nesting structure
- But too much information
- Parentheses
- Single-successor nodes
Abstract Syntax Tree
- Also captures the nesting structure
- But abstracts from the concrete syntax
- more compact and easier to use
- An important data structure in a compiler
Recursive Descent Algorithm
- The parse tree is constructed
- from the top down
- from left to right
- Terminals are seen in order of appearance in the token stream
Limitation
- if a production for non-terminal X succeeds, the parser cannot backtrack to try a different production for X later
- left-recursion must be eliminated first
sufficient for grammars where, for any non-terminal, at most one production can succeed
left recursion
e.g. $S\rightarrow Sa$; left recursion can be eliminated mechanically, as shown below
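The standard rewrite: for productions $S\rightarrow S\alpha \mid \beta$ (where $\beta$ does not start with $S$), introduce a new non-terminal $S'$ and replace them with $$ S\rightarrow \beta S' \qquad S'\rightarrow \alpha S' \mid \epsilon $$ e.g. $S\rightarrow Sa \mid b$ becomes $S\rightarrow bS'$, $S'\rightarrow aS' \mid \epsilon$; both generate $ba^{*}$.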
Predictive Parser
like recursive-descent but parser can “predict” which production to use
- by looking at the next few tokens
- no backtracking
Predictive parsers accept LL(k) grammars (Left-to-right scan, Left-most derivation, k tokens of lookahead)
LL(1): at each step, only one choice of production
left-factoring: to eliminate the common prefixes of multiple productions for one non-terminal
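For example, the productions $E\rightarrow T+E \mid T$ share the prefix $T$; left-factoring rewrites them as $$ E\rightarrow TX \qquad X\rightarrow +E \mid \epsilon $$ so the choice of production can be delayed until after the shared prefix has been parsed.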
steps:
- for the leftmost non-terminal $S$
- look at the next input token $a$
- choose the production shown at $[S, a]$ (parsing table)
- reject on reaching error state
- accept on end of input & empty stack
use a stack to record frontier of parse tree
frontier:
- non-terminals that have yet to be expanded
- terminals that have yet to be matched against the input
top of stack: leftmost pending terminal or non-terminal
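A minimal sketch of this table-driven loop (not the course's implementation), assuming `table[(A, t)]` holds the chosen right-hand side as a tuple of symbols, terminals and non-terminals are strings, and `'$'` marks end of input:

```python
def ll1_parse(tokens, table, start, terminals):
    """Table-driven LL(1) parsing; the stack holds the frontier of the parse tree."""
    input_ = list(tokens) + ['$']
    stack = ['$', start]                 # start symbol on top, end marker at the bottom
    pos = 0
    while stack:
        top = stack.pop()
        look = input_[pos]
        if top in terminals or top == '$':
            if top != look:
                raise SyntaxError(f"expected {top!r}, found {look!r}")
            pos += 1                     # matched a terminal: consume the input token
        else:
            rhs = table.get((top, look))
            if rhs is None:
                raise SyntaxError(f"no production for ({top!r}, {look!r})")
            stack.extend(reversed(rhs))  # push RHS so its leftmost symbol is on top
    return pos == len(input_)            # accept on end of input and empty stack
```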
- consider non-terminal $A$, production $A\rightarrow \alpha$ and token $t$
- $T[A,t] = \alpha$ in two cases:
- if $\alpha \rightarrow^{*} t\beta$
- $\alpha$ can derive a $t$ in the first position
- we say that $t\in First(\alpha)$
- if $A\rightarrow \alpha$ and $\alpha \rightarrow^{*} \epsilon$ and $S\rightarrow^{*} \beta A t \delta$
- useful if stack has $A$, input is $t$, and $A$ cannot derive $t$
- in this case only option is to get rid of $A$ (by deriving $\epsilon$) : can work only if $t$ can follow $A$ in at least one derivation
- We say that $t\in Follow(A)$
First Set
Def. $$ First(X)=\{t \mid X\rightarrow^{*}t\alpha\}\cup\{\epsilon \mid X\rightarrow^{*}\epsilon\} $$ Algorithm
- $First(t)={t}$
- $\epsilon\in First(X)$
- if $X\rightarrow \epsilon$
- if $X\rightarrow A_1…A_n$ and $\epsilon\in First(A_i)$ for $1\le i \le n$
- $First(\alpha)\subseteq First(X)$ if $X\rightarrow A_1…A_n\alpha$ and $\epsilon\in First(A_i)$ for $1\le i \le n$
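A sketch of this algorithm as a fixed-point iteration, assuming a hypothetical grammar representation: a dict mapping each non-terminal to a list of right-hand sides (tuples of symbols), with `EPS` standing for $\epsilon$:

```python
EPS = 'eps'

def first_sets(grammar, terminals):
    """Iterate the First-set rules until nothing changes (a fixed point)."""
    first = {t: {t} for t in terminals}          # First(t) = {t}
    first.update({A: set() for A in grammar})
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                before = len(first[A])
                if not rhs:                       # A -> eps
                    first[A].add(EPS)
                all_nullable = True
                for Y in rhs:
                    first[A] |= first[Y] - {EPS}  # First(Y) - {eps} flows into First(A)
                    if EPS not in first[Y]:
                        all_nullable = False
                        break
                if rhs and all_nullable:          # every symbol on the RHS can derive eps
                    first[A].add(EPS)
                changed |= len(first[A]) != before
    return first
```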
Follow Set
Def. $$ Follow(X) = \{t \mid S\rightarrow^{*} \beta X t \delta\} $$ intuition
- if $X\rightarrow AB$ then $First(B)\subseteq Follow(A)$ and $Follow(X)\subseteq Follow(B)$
- if $B\rightarrow^{*}\epsilon$ then $Follow(X)\subseteq Follow(A)$
- if $S$ is the start symbol then $\$\in Follow(S)$
Algorithm
- $\$\in Follow(S)$
- $First(\beta) \setminus \{\epsilon\}\subseteq Follow(X)$ for each production $A\rightarrow\alpha X\beta$
- $Follow(A)\subseteq Follow(X)$ for each production $A\rightarrow \alpha X\beta$ where $\epsilon\in First(\beta)$
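A companion sketch for Follow sets, reusing `first_sets` and the same hypothetical grammar representation; `'$'` is the end-of-input marker:

```python
def follow_sets(grammar, terminals, start):
    first = first_sets(grammar, terminals)
    follow = {A: set() for A in grammar}
    follow[start].add('$')                          # $ is in Follow(S)
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                for i, X in enumerate(rhs):
                    if X not in grammar:            # only non-terminals get Follow sets
                        continue
                    before = len(follow[X])
                    beta = rhs[i + 1:]
                    nullable = True
                    for Y in beta:
                        follow[X] |= first[Y] - {EPS}   # First(beta) - {eps} into Follow(X)
                        if EPS not in first[Y]:
                            nullable = False
                            break
                    if nullable:                    # beta is empty or derives eps
                        follow[X] |= follow[A]      # Follow(A) flows into Follow(X)
                    changed |= len(follow[X]) != before
    return follow
```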
Parsing Table
for each production $A\rightarrow \alpha$ in G do:
- for each terminal $t\in First(\alpha)$ do $T[A,t] = \alpha$
- if $\epsilon\in First(\alpha)$, for each $t\in Follow(A)$ do $T[A,t] = \alpha$
- if $\epsilon \in First(\alpha)$ and $\$\in Follow(A)$ do $T[A,\$] = \alpha$
LL(1) Table
- if any entry is multiply defined then G is not LL(1)
- Most programming language CFGs are not LL(1)
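The table-construction rules and the multiply-defined-entry check translate almost directly into code; a sketch reusing `first_sets` and `follow_sets` from above:

```python
def first_of_string(alpha, first):
    """First set of a sentential form alpha = Y1 ... Yn."""
    out = set()
    for Y in alpha:
        out |= first[Y] - {EPS}
        if EPS not in first[Y]:
            return out
    out.add(EPS)                                    # empty string, or every symbol nullable
    return out

def ll1_table(grammar, terminals, start):
    first = first_sets(grammar, terminals)
    follow = follow_sets(grammar, terminals, start)
    table = {}
    for A, rhss in grammar.items():
        for rhs in rhss:
            fa = first_of_string(rhs, first)
            targets = fa - {EPS}
            if EPS in fa:
                targets |= follow[A]                # includes '$' when $ is in Follow(A)
            for t in targets:
                if (A, t) in table:
                    raise ValueError(f"grammar is not LL(1): conflict at ({A}, {t})")
                table[(A, t)] = rhs
    return table
```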
Bottom-Up Parsing
- Bottom-Up parsing is more general than (deterministic) top-down parsing
- Bottom-Up is the preferred method
- Bottom-Up parsing reduces a string to the start symbol by inverting productions
- A bottom-up parser traces a rightmost derivation in reverse
- Bottom-up parsing uses only two kinds of actions
- Shift
- Reduce
shift-reduce conflict: it is legal to shift or reduce
reduce-reduce conflict: it is legal to reduce by two different productions (bad)
Handle: a reduction that also allows further reductions back to the start symbol
- handles always appear at the top of the stack
- never to the left of the rightmost non-terminal
Bottom-up parsing algorithms are based on recognizing handles
viable prefix
Def. $\alpha$ is a viable prefix if there is an $\omega$ such that $\alpha|\omega$ is a state of a shift-reduce parser
- a viable prefix does not extend past the right end of the handle
- viable prefix is a prefix of the handle
- as long as a parser has viable prefixes on the stack no parsing error has been detected
For any grammar, the set of viable prefixes is a regular language
Recognizing VP
steps:
add a dummy production $S'\rightarrow S$ to $G$
The NFA states are the items of $G$
for item $E\rightarrow \alpha .X\beta$ add transition: $$ (E\rightarrow \alpha.X\beta)\rightarrow^{X}(E\rightarrow \alpha X.\beta) $$
For item $E\rightarrow \alpha.X\beta$ and production $X\rightarrow \gamma$ add $$ (E\rightarrow \alpha.X\beta)\rightarrow^{\epsilon}(X\rightarrow.\gamma) $$
every state is an accepting state
start state is $S'\rightarrow .S$
LR(0) Parsing
LR(0) Parsing steps:
- LR(0) Parsing: Assume
- stack contains $\alpha$
- next input is $t$
- DFA on input $\alpha$ terminates in state $s$
- Reduce by $X\rightarrow\beta$ if
- $s$ contains item $X\rightarrow\beta.$
- Shift if
- $s$ contains item $X\rightarrow\beta.t\omega$
- equivalent to saying $s$ has a transition labeled $t$
- LR(0) has a reduce/reduce conflict if:
- any state has two reduce items
- $X\rightarrow \beta.$ and $Y\rightarrow \omega.$
- LR(0) has a shift/reduce conflict if:
- any state has a reduce item and a shift item
- $X\rightarrow \beta.$ and $Y\rightarrow\omega .t \delta$
SLR Parsing
SLR = Simple LR (Left-to-right scan, Right-most derivation in reverse); it improves on LR(0) by consulting Follow sets before reducing
- Assume
- stack contains $\alpha$
- next input is $t$
- DFA on input $\alpha$ terminates in state $s$
- Reduce by $X\rightarrow\beta$ if
- $s$ contains item $X\rightarrow\beta.$ and $t\in Follow(X)$
- Shift if
- $s$ contains item $X\rightarrow\beta.t\omega$
- equivalent to saying $s$ has a transition labeled $t$
- SLR has a reduce/reduce conflict if:
- any state has two reduce items
- $X\rightarrow \beta.$ and $Y\rightarrow \omega.$ with $Follow(X)\cap Follow(Y)\ne \emptyset$
- SLR has a shift/reduce conflict if:
- any state has a reduce item and a shift item
- $X\rightarrow \beta.$ and $Y\rightarrow\omega .t \delta$ with $t\in Follow(X)$
steps:
- let $M$ be DFA for viable prefixes of $G$
- let $|x_1\dots x_n\$$ be the initial configuration
- repeat until the configuration is $S|\$$
- let $\alpha|\omega$ be current configuration
- run $M$ on current stack $\alpha$
- if $M$ rejects $\alpha$, report parsing error
- if $M$ accepts $\alpha$ with items $I$, let $a$ be next input
- shift if $X\rightarrow \beta.a\gamma\in I$
- reduce if $X\rightarrow \beta.\in I$ and $a\in Follow(X)$
- report parsing error if neither applies
Improvement
change stack to contain pairs $$ <Symbol,DFA\space State> $$ the bottom of the stack is $<any, start>$ where
$any$ is any dummy symbol
$start$ is the start state of the DFA
Define $goto[i,A]=j$ if $state_i\rightarrow^{A}state_j$
$goto$ is just the transition function of the DFA
Semantic Analysis
last “front end” phase
catches all remaining errors
Scope
matching identifier declarations with uses
the scope of an identifier is the portion of a program in which that identifier is accessible
most languages have static scope: scope depends only on the program text, not run-time behavior
a few languages are dynamically scoped: scope depends on execution of the program
Symbol Tables
much of semantic analysis can be expressed as a recursive descent of an AST
A symbol table is a data structure that tracks the current bindings of identifiers
Types
- the notion varies from language to language
- consensus
- a set of values
- a set of operations on those values
- classes are one instantiation of the modern notion of type
the goal of type checking is to ensure that operations are used with the correct types
three kinds of languages:
- Statically typed: all or almost all checking of types is done as part of compilation (C, Java)
- Dynamically typed: Almost all checking of types is done as part of program execution
- Untyped: No type checking (machine code)
A lot of code is written in statically typed languages with an “escape” mechanism
- Type Checking is the process of verifying fully typed programs
- Type Inference is the process of filling in missing type information
- The two are different, but the terms are often used interchangeably
Type Checking
A type system is sound if
- whenever $\vdash e: T$
- then $e$ evaluates to a value of type $T$
we only want sound rules, but some sound rules are better than others
Type checking proves fact $e: T$
- proof is on the structure of the AST
- proof has the shape of the AST
- one type rule is used for each AST node
in the type rule used for a node $e$:
- Hypotheses are the proofs of types of $e$’s subexpressions
- conclusion is the type of $e$
types are computed in a bottom-up pass over the AST
Type environment
a type environment gives types for free variables
- a type environment is a function from Object Identifiers to Types
- A variable is free in an expression if it is not defined within the expression
let $O$ be a function from $ObjectIdentifiers$ to $Types$
the sentence $O \vdash e: T$ is read: Under the assumption that variables have the types given by $O$, it is provable that the expression $e$ has the type $T$
and we can write new rules: $$ \frac{O(x)=T}{O\vdash x: T} $$ The type environment is passed down the AST from the root towards the leaves
Types are computed up the AST from the leaves towards the root
Subtyping
Define a relation $\le$ on classes
- $X\le X$
- $X\le Y$ if $X$ inherits from $Y$
- $X\le Z$ if $X\le Y$ and $Y\le Z$
Typing Methods
method environment $M$
General themes
- Type rules are defined on the structure of expressions
- Types of variables are modeled by an environment
- can be implemented in a single traversal over the AST
- Type environment is passed down the tree
- Types are passed up the tree
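A minimal sketch of these themes for a tiny made-up expression language (tuple-shaped AST nodes, `int` as the only interesting type); not any particular real type system:

```python
def type_check(e, env):
    """env (the type environment) flows down the AST; the computed type flows back up."""
    kind = e[0]
    if kind == 'num':                       # ('num', 3)
        return 'int'
    if kind == 'var':                       # ('var', 'x')
        name = e[1]
        if name not in env:
            raise TypeError(f"unbound variable {name}")
        return env[name]
    if kind == 'add':                       # ('add', e1, e2)
        if type_check(e[1], env) != 'int' or type_check(e[2], env) != 'int':
            raise TypeError("'+' expects int operands")
        return 'int'
    if kind == 'let':                       # ('let', name, init, body)
        _, name, init, body = e
        return type_check(body, {**env, name: type_check(init, env)})
    raise TypeError(f"unknown expression {e!r}")

# e.g. type_check(('let', 'x', ('num', 1), ('add', ('var', 'x'), ('num', 2))), {}) == 'int'
```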
SELF_TYPE
if $SELF\_TYPE$ appears textually in the class $C$ as the declared type of $E$ then $$ dynamic\_type(E)\le C $$
- in type checking it is always safe to replace $SELF\_TYPE_C$ by $C$
Code Generation
- Management of run-time resources
- Correspondence between
- static (compile-time)
- dynamic (run-time)
- Storage organization
compiler is responsible for:
- Generating code
- Orchestrating use of the data area
two goals:
- correctness
- speed
Activations
an invocation of procedure $P$ is an activation of $P$
the lifetime of an activation of $P$ is
- all the steps to execute $P$
- including all the steps in procedures $P$ calls
the lifetime of a variable $x$ is the portion of execution in which $x$ is defined
note
- lifetime is a dynamic (run-time) concept
- scope is a static concept
lifetimes of procedure activations are properly nested
activation lifetimes can be depicted as a tree
the activation tree depends on run-time behavior
since activations are properly nested, a stack can track currently active procedures
Activation Record (AR)
the information needed to manage one procedure activation is called an activation record (AR) or frame
if procedure $F$ calls $G$, then $G$’s activation record contains a mix of info about $F$ and $G$, $G$’s AR contains information needed to
- Complete execution of $G$
- Resume execution of $F$
The compiler must determine, at compile-time, the layout of activation records and generate code that correctly accesses locations in the activation record
Globals & Heap
Globals are assigned a fixed address once
- Variables with fixed address are “statically allocated”
A value that outlives the procedure that creates it cannot be kept in the AR
Languages with dynamically allocated data use a Heap to store dynamic data
Spaces
The code area contains object code
for many languages, fixed size and read only
The static area contains data (not code) with fixed addresses
fixed size, may be readable and writable
The stack contains an AR for each currently active procedure
each AR usually fixed size, contains locals
Heap contains all other data
heap is managed by malloc and free in C
Alignment
Data is word aligned if it begins at a word boundary
Most machines have some alignment restrictions or performance penalties for poor alignment
Stack Machines
only storage is a stack
an instruction $r = F(a_1,…a_n)$ :
- Pops $n$ operands from the stack
- Computes the operation $F$ using the operands
- Pushes the result $r$ on the stack
location of the operands/ result is not explicitly stated
- always the top of the stack
In contrast to a register machine
- add instead of add r1, r2, r3
- more compact programs
There is an intermediate point between a pure stack machine and a pure register machine
An n-register stack machine
- conceptually, keep the top n locations of the pure stack machine’s stack in registers
consider a 1-register stack machine
- the register is called the accumulator
- consider an expression $op(e_1,…,e_n)$, where $e_1,…,e_n$ are subexpressions
- for each $e_i$ with $i < n$
- compute $e_i$
- push the result on the stack
- compute $e_n$ (its value is left in the accumulator)
- pop the $n - 1$ pushed values from the stack, compute $op$
- store the result in the accumulator
code generation
for each expression $e$ we generate MIPS code that
- computes the value of $e$ in $\$a_0$
- Preserves $\$sp$ and the contents of the stack
define a code generation function $cgen(e)$ whose result is the code generated for $e$
code generation can be written as a recursive-descent of the AST
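A minimal sketch of such a `cgen`, handling only integer constants and addition and emitting MIPS-style text; the accumulator is `$a0` and pushes/pops go through `$sp` (a sketch of the scheme above, assuming tuple-shaped AST nodes, not a complete code generator):

```python
def cgen(e):
    """Emit MIPS-like code that leaves the value of e in $a0 and preserves $sp."""
    if e[0] == 'int':                        # ('int', n)
        return [f"li $a0 {e[1]}"]
    if e[0] == 'add':                        # ('add', e1, e2)
        return (cgen(e[1])                   # value of e1 -> $a0
                + ["sw $a0 0($sp)",          # push e1's value
                   "addiu $sp $sp -4"]
                + cgen(e[2])                 # value of e2 -> $a0
                + ["lw $t1 4($sp)",          # reload e1's value
                   "add $a0 $t1 $a0",
                   "addiu $sp $sp 4"])       # pop
    raise ValueError(f"unknown expression {e!r}")

# print("\n".join(cgen(('add', ('int', 1), ('int', 2)))))
```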
variables
use a frame pointer
- always points to the return address on the stack
- since it does not move it can be used to find the variables
let $x_i$ be the $i$-th formal parameter of the function for which code is being generated $$ cgen(x_i) = lw\ \$a_0\ 4*i(\$fp) $$ The activation record must be designed together with the code generator
Using a stack machine discipline is a reasonable choice for a simple compiler.
Production compilers do different things
- emphasis is on keeping values in registers
- intermediate results are laid out in the AR, not pushed and popped from the stack
temporaries
the code generator must assign a fixed location in the AR for each temporary
let $NT(e)$ = count of temps needed to evaluate $e$
$NT(e_1+e_2)$
- needs at least as many temporaries as $NT(e_1)$
- needs at least as many temporaries as $NT(e_2) + 1$
- so $NT(e_1+e_2) = \max(NT(e_1),\ 1 + NT(e_2))$; a small sketch follows
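A small sketch of $NT$ over the same tuple-shaped AST used above (hypothetical representation, constants/variables and `+` only):

```python
def NT(e):
    """Number of temporaries needed to evaluate e (0 for leaves)."""
    if e[0] in ('int', 'var'):
        return 0
    if e[0] == 'add':                       # ('add', e1, e2)
        return max(NT(e[1]), 1 + NT(e[2]))  # e1's value is held while e2 is evaluated
    raise ValueError(f"unknown expression {e!r}")
```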
for a function definition $f(x_1,…,x_n) = e$, the AR has $2+n+NT(e)$ elements
- return address
- frame pointer
- n arguments
- $NT(e)$ locations for intermediate results
code generation must know how many temporaries are in use at each point
add a new argument to code generation: the position of the next available temporary
the temporary area is used like a small, fixed-size stack
Object Layout
OO implementation = Basic code generation + More stuff
OO slogan: If B is a subclass of A, then an object of class B can be used wherever an object of class A is expected
This means that code in class A works unmodified for an object of class B
Objects are laid out in contiguous memory
Class tag is an integer
identifies class of the object
Object size is an integer
size of the object in words
Dispatch ptr is a pointer to a table of methods
attributes in subsequent slots
Each attribute stored at a fixed offset in the object
Observation: Given a layout for class A, a layout for subclass B can be defined by extending the layout of A with additional slots for the additional attributes of B
Semantics
many ways to specify semantics:
Operational Semantics
- describes program evaluation via execution rules (on an abstract machine)
- most useful for specifying implementations
Denotational Semantics
- program’s meaning is a mathematical function
Axiomatic Semantics
- Program behavior described via logical formulae
- if execution begins in state satisfying X, then it ends in state satisfying Y
- X, Y formulas
- Foundation of many program verification systems
Operational Semantics
Recall the typing judgment $$ Context \vdash e: C $$ in the given context, expression e has type C
We use something similar for evaluation $$ Context \vdash e: v $$ In the given context, expression e evaluates to value v
We track variables and their values with:
- An environment : where in memory a variable is
- A store: what is in the memory
A variable environment maps variables to locations
- keep track of which variables are in scope
- tells us where those variables are
$$ E = [a: I_1,b : I_2] $$
A store maps memory locations to values $$ S=[I_1\rightarrow5,I_2\rightarrow 7] $$ $S'= S[12/I_1]$ defines a store $S'$ such that $S'(I_1)=12$ and $S'(I) = S(I)$ if $I\ne I_1$
The evaluation judgement is $$ so, E, S\vdash e: v, S' $$
- Given so the current value of self
- And E the current variable environment
- And S the current store
- If the evaluation of e terminates then
- The value of e is v
- And the new store is $S'$
$$ \frac{E(id) = I_{id}\qquad S(I_{id}) = v}{so,E,S\vdash id:v,S} $$
Intermediate Code
A language between the source and the target
Provides an intermediate level of abstraction
- more details than the source
- fewer details than the target
high-level assembly
- uses register names, but has an unlimited number
- uses control structures like assembly language
- uses opcodes but some are higher level
common form of intermediate code
each instruction is of the form $$ x:= y\space op\space z \newline x:= op\space y $$ $y$ and $z$ are registers or constants
similar to assembly code generation
but use any number of IL registers to hold intermediate results
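A minimal sketch of generating such instructions for expressions, assuming fresh IL register names `t0, t1, …` and the same tuple-shaped AST as above (all hypothetical):

```python
import itertools

_fresh = itertools.count()

def igen(e, code):
    """Append 'x := y op z' instructions to code; return the name holding e's value."""
    if e[0] == 'int':
        return str(e[1])                     # constants may appear directly as operands
    if e[0] == 'var':
        return e[1]
    op = {'add': '+', 'mul': '*'}[e[0]]      # ('add', e1, e2) / ('mul', e1, e2)
    y, z = igen(e[1], code), igen(e[2], code)
    t = f"t{next(_fresh)}"                   # unlimited supply of IL registers
    code.append(f"{t} := {y} {op} {z}")
    return t

# code = []
# igen(('add', ('var', 'a'), ('mul', ('var', 'b'), ('var', 'c'))), code)
# code -> ['t0 := b * c', 't1 := a + t0']
```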
Optimization
optimization is our last compiler phase
most complexity in modern compilers is in the optimizer
seeks to improve a program’s resource utilization
- execution time (most often)
- code size
- network messages sent, etc.
when:
On AST
- Pro: machine independent
- Con: too high level
On assembly language
- Pro: exposes optimization opportunities
- Con: machine dependent; must reimplement optimizations when retargeting
On an intermediate language
- Pro: machine independent; exposes optimization opportunities
Basic Block
A basic block is a maximal sequence of instructions with:
- no labels (except at the first instruction)
- no jumps (except at the last instruction)
idea
- cannot jump into a basic block (except at beginning)
- cannot jump out of a basic block (except at end)
- a basic block is a single-entry, single-exit, straight-line code segment
control-flow graph
a control-flow-graph is a directed graph with
- basic blocks as nodes
- an edge from block A to block B if the execution can pass from the last instruction in A to the first instruction in B
the body of a method (or procedure) can be represented as a control-flow graph
Granularity
local optimizations
apply to a basic block in isolation
Global optimizations
apply to a control-flow graph (method body) in isolation
Inter-procedural optimizations
apply across method boundaries
most compilers do local optimizations, many do global optimizations, few do inter-procedural optimizations
In practice, often a conscious decision is made not to implement the fanciest optimization known
- some optimizations are hard to implement
- some optimizations are costly in compilation time
- some optimizations have low payoff
- many fancy optimizations are all three!
Local Optimization
optimize one basic block
no need to analyze the whole procedure body
some statements can be deleted
$x:= x + 0$
$x:= x*1$
some statements can be simplified
$x:=x*0$ to $x:=0$
$y:= y**2$ to $y:=y*y$
$x:=x*8$ to $x:=x\ll 3$
$x:=x*15$ to $t:=x \ll 4; x:=t-x$
Operations on constants can be computed at compile time
- if there is a statement $x:=y\space op\space z$
- and $y$ and $z$ are constants
- then $y$ op $z$ can be computed at compile time
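A minimal local constant-folding pass in that spirit, over three-address instructions written as strings of the form `x := y op z` (a hypothetical format):

```python
def fold_constants(block):
    """Replace 'x := c1 op c2' with 'x := result' when both operands are integer literals."""
    ops = {'+': lambda a, b: a + b, '-': lambda a, b: a - b, '*': lambda a, b: a * b}
    out = []
    for instr in block:
        parts = instr.split()
        if len(parts) == 5 and parts[1] == ':=' and parts[3] in ops:   # x := y op z
            x, _, y, op, z = parts
            if y.lstrip('-').isdigit() and z.lstrip('-').isdigit():
                instr = f"{x} := {ops[op](int(y), int(z))}"
        out.append(instr)
    return out

# fold_constants(['t0 := 2 * 8', 'x := t0 + y']) -> ['t0 := 16', 'x := t0 + y']
```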
constant folding can be dangerous in a cross-compiler: the compile-time (host) arithmetic may differ from the target machine's arithmetic, e.g. for floating point
Eliminate unreachable blocks
some optimizations are simplified if each register occurs only once on the left-hand side of an assignment
common subexpression elimination
if
- basic block is in single assignment form
- a definition $x:=$ is the first use of $x$ in a block
then
- When two assignments have the same rhs, they compute the same value
copy propagation
if $w:= x$ appears in a block, replace subsequent uses of $w$ with uses of $x$
dead code elimination
if
$w:=rhs$ appears in a basic block
$w$ does not appear anywhere else in the program
then
the statement $w:=rhs$ is dead and can be eliminated
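For example, given $b := a$ followed by $c := b + 1$ (and no other uses of $b$), copy propagation rewrites the second statement to $c := a + 1$; the assignment $b := a$ is then dead and can be eliminated.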
typically optimizations interact: performing one optimization enables another
optimizing compilers repeat optimizations until no improvement is possible
Peephole optimization
optimizations can be directly applied to assembly code
Peephole optimization is effective for improving assembly code
the “peephole” is a short sequence of (usually contiguous) instructions
the optimizer replaces the sequence with another equivalent one (but faster)
Global Optimization
Global optimization tasks share several traits:
- The optimization depends on knowing a property X at a particular point in program execution
- Proving X at any point requires knowledge of the entire program
Dataflow Analysis
To replace a use of x by a constant k we must know: on every path to the use of x, the last assignment to x is $x:=k$
Checking the condition requires global dataflow analysis: an analysis of the entire control-flow graph