R. Muller
- Concrete Syntax, Grammars and Derivations
- Parsing & Abstract Syntax
####1. Syntax####
Context:
- We're interested in the design, specification and implementation of programming languages;
- We're more or less following the slogan:
Language = Syntax + Semantics
- We don't have much to say about the design of concrete syntax other than the common sense idea that it shouldn't interfere with understanding and, if you're lucky, it should be in reasonable taste.
####Specification of Syntax####
Notation: We're going to use the latex symbols \alpha, \beta etc for the Greek letters that we cannot render in plain text.
Consider a finite set of symbols T = {a, b, c, ...}. We're going to view a language over T as elements of T* (i.e., sequences of symbols drawn from T). $T^$ has no constraints so we'll employ a device, a context free grammar (CFG) to define a reasonable subset of T.
A context free grammar G is a 4-tuple $ (N, T, P, S)$ where
-
N is a finite set of nonterminal symbols;
-
T is a finite set of terminal symbols;
-
$P \subseteq N \times (N \cup T)^*$ is a finite set of productions, and -
$S \in N$ is a distinguished element of$N$ called the start symbol.
We'll follow well-established conventions using:
- uppercase letters A, B, ... to range over N;
- lowercase letters at the front of the alphabet a, b, ... to range over T;
- lowercase letters at the end of the alphabet w, u, v to range over T*;
- Greek letters
$\alpha$ ,$\beta$ ,$\gamma$ to range over (N U T)*.
Example
G0 = ({S, A}, {a, b}, {(S, aA), (A, aA), (A, b)}, S)
The elements of P, in fact, the grammar G is usually written thusly:
S --> aA S ::= aA
A --> aA or sometimes as A ::= aA
A --> b A ::= b
Given our conventions, we can say things like
Derives in One Step
Let
The double arrow
Example
For G0 we have that
Derives in Zero or More Steps
The relation
Example
For G0 we have that aaA ==>* aaA (derives in zero steps) and that aaA ==>* aaaA (derives in one step) and aaA ==>* aaab (derives in two steps).
Sentential Form
Let G = (N, T, P, S) be a CFG. Then the set of sentential forms of G is
Intuitively,
Definition: L(G) -- the language of G
Let G = (N, T, P, S) be a CFG. Then the language defined by G is
Definition: parser for L(G):
a parser for L(G) is a function parse(
Comments
-
L(G) is considered as set of sentences. So from this perspective, programs are sentences.
-
It's easy to see that
$L(G) \subseteq T^*$ — we've thrown out a lot (but definitely not all!) of the nonsensical stuff. -
Given a grammar G and a string of symbols
$w$ , a derivation$S \Rightarrow^* w$ can be seen as a proof that$w \in L(G)$ . (Technically speaking, it's actually called a "witness".)
Conventions:
-
If
$(A ::= \alpha) \in P$ and$(A ::= \beta) \in P$ we usually write$A ::= \alpha \mid \beta$ where the vertical bar is pronounced "or". -
Given a set of productions P, the grammar G = (N, T, P, S) is easy to infer. So here to fore, we'll write grammars by writing the productions.
Leftmost/Rightmost Derivations
A derivation step in which the leftmost nonterminal is replaced:
is called a leftmost derivation step. A derivation:
$ \alpha \Rightarrow^* \beta$
is leftmost if each step in the derivation is leftmost.
And likewise for rightmost.
Ambiguity
A grammar G is ambiguous if there exists a
Derivations and Trees
Let
Tree($X_1 \Rightarrow^ \beta_1$) $\ldots$ Tree($X_n \Rightarrow^ \beta_n$)**.
Example
E ::= (E + E) | a
Let
E
/ / | \ \
( E + E )
| |
a a
The tree of a derivation is called a concrete syntax tree.
The phrase "abstract syntax tree" (AST) was coined by John McCarthy, the inventor of LISP. The abstract syntax tree has only the essential information contained in the concrete syntax tree. For the example above, the AST is:
+
/ \
a a