
CSCI 3366 Programming Languages

Spring 2017

R. Muller


Lecture Notes: 5

Today

  1. Concrete Syntax, Grammars and Derivations
  2. Parsing & Abstract Syntax

#### 1. Syntax ####

Context:

  • We're interested in the design, specification and implementation of programming languages;
  • We're more or less following the slogan:
                             Language = Syntax + Semantics
  • We don't have much to say about the design of concrete syntax other than the common sense idea that it shouldn't interfere with understanding and, if you're lucky, it should be in reasonable taste.

#### Specification of Syntax ####

Notation: We're going to use the LaTeX names \alpha, \beta, etc. for the Greek letters that we cannot render in plain text.

Consider a finite set of symbols T = {a, b, c, ...}. We're going to view a language over T as a set of elements of $T^*$ (i.e., of finite sequences of symbols drawn from T). $T^*$ itself is unconstrained, so we'll employ a device, a context-free grammar (CFG), to define a reasonable subset of $T^*$.

A context-free grammar G is a 4-tuple $(N, T, P, S)$ where

  • N is a finite set of nonterminal symbols;

  • T is a finite set of terminal symbols;

  • $P \subseteq N \times (N \cup T)^*$ is a finite set of productions, and

  • $S \in N$ is a distinguished element of $N$ called the start symbol.

We'll follow well-established conventions using:

  • uppercase letters A, B, ... to range over N;
  • lowercase letters at the front of the alphabet a, b, ... to range over T;
  • lowercase letters at the end of the alphabet w, u, v to range over $T^*$;
  • Greek letters $\alpha$, $\beta$, $\gamma$ to range over $(N \cup T)^*$.

Example

G0 = ({S, A}, {a, b}, {(S, aA), (A, aA), (A, b)}, S)

The elements of P, and in fact the whole grammar G, are usually written as follows:

    S --> aA                             S ::= aA
    A --> aA       or sometimes as       A ::= aA
    A --> b                              A ::= b

Given our conventions, we can say things like $(A ::= \alpha) \in P$.
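
To make the 4-tuple concrete, here is one way G0 might be represented in OCaml. This is only a sketch, and all of the type and field names are ours, not from any standard library:

```ocaml
(* One possible representation of a CFG; names are ours. *)
type symbol =
  | N of string   (* nonterminal *)
  | T of string   (* terminal *)

(* A production (lhs, rhs) encodes lhs ::= rhs, with rhs in (N u T)* *)
type production = string * symbol list

type grammar = {
  nonterminals : string list;
  terminals    : string list;
  productions  : production list;
  start        : string;
}

(* G0 = ({S, A}, {a, b}, {S ::= aA, A ::= aA, A ::= b}, S) *)
let g0 : grammar = {
  nonterminals = ["S"; "A"];
  terminals    = ["a"; "b"];
  productions  = [ ("S", [T "a"; N "A"]);
                   ("A", [T "a"; N "A"]);
                   ("A", [T "b"]) ];
  start        = "S";
}
```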

Derives in One Step

Let $(A ::= \gamma) \in P$. Then $\alpha A \beta \Rightarrow \alpha \gamma \beta$.

The double arrow $\Rightarrow$ is pronounced "derives in one step".

Example

For G0 we have that $aaA \Rightarrow aaaA$, using $\alpha = aa$, $\beta = \epsilon$ and $\gamma = aA$.

Derives in Zero or More Steps

The relation $\Rightarrow^*$ is the reflexive and transitive closure of $\Rightarrow$.

Example

For G0 we have $aaA \Rightarrow^* aaA$ (derives in zero steps), $aaA \Rightarrow^* aaaA$ (derives in one step) and $aaA \Rightarrow^* aaab$ (derives in two steps).

Sentential Form

Let G = (N, T, P, S) be a CFG. Then the set of sentential forms of G is $\{ \alpha \mid S \Rightarrow^* \alpha \}$.

Intuitively, $\alpha$ is a string of terminals and nonterminals derivable from the start symbol. For G0, the sentential forms are $S$ together with $a^nA$ and $a^nb$ for $n \geq 1$.

Definition: L(G) -- the language of G

Let G = (N, T, P, S) be a CFG. Then the language defined by G is

$L(G) = \{ w \mid S \Rightarrow^* w \}$

where, following our conventions, $w$ ranges over $T^*$: only all-terminal strings are sentences.

Definition: parser for L(G)

A parser for L(G) is a function parse($G$, $w$) that returns true if $w \in L(G)$ and false otherwise.
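
For instance, since $S \Rightarrow aA \Rightarrow ab$, $S \Rightarrow aA \Rightarrow aaA \Rightarrow aab$, and so on, $L(G0) = \{ a^nb \mid n \geq 1 \}$. For a language this simple we can write a recognizer by hand; here is a minimal OCaml sketch (the name parse_g0 is ours), specializing parse to G0:

```ocaml
(* A hand-written recognizer specialized to G0, i.e., to
   L(G0) = { a^n b | n >= 1 }. Returns true iff w is in the language. *)
let parse_g0 (w : string) : bool =
  let n = String.length w in
  n >= 2
  && w.[n - 1] = 'b'
  && (let rec all_a i = i >= n - 1 || (w.[i] = 'a' && all_a (i + 1)) in
      all_a 0)

(* A few checks: ab and aaab are in L(G0); b and aba are not. *)
let () =
  assert (parse_g0 "ab");
  assert (parse_g0 "aaab");
  assert (not (parse_g0 "b"));
  assert (not (parse_g0 "aba"))
```

A practical parser returns more than a boolean: it also builds the tree structure discussed below.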

Comments

  1. L(G) is considered a set of sentences. So from this perspective, programs are sentences.

  2. It's easy to see that $L(G) \subseteq T^*$ — we've thrown out a lot (but definitely not all!) of the nonsensical stuff.

  3. Given a grammar G and a string of symbols $w$, a derivation $S \Rightarrow^* w$ can be seen as a proof that $w \in L(G)$. (Technically speaking, it's actually called a "witness".)

Conventions:

  1. If $(A ::= \alpha) \in P$ and $(A ::= \beta) \in P$ we usually write $A ::= \alpha \mid \beta$ where the vertical bar is pronounced "or".

  2. Given a set of productions P, the grammar G = (N, T, P, S) is easy to infer. So from here on, we'll write grammars by writing just their productions.

Leftmost/Rightmost Derivations

A derivation step in which the leftmost nonterminal is replaced:

$w A \beta \Rightarrow w \alpha \beta$

is called a leftmost derivation step (here $(A ::= \alpha) \in P$ and, by our conventions, $w$ contains only terminals, so $A$ is the leftmost nonterminal). A derivation:

$\alpha \Rightarrow^* \beta$

is leftmost if each step in the derivation is leftmost.

And likewise for rightmost.

Ambiguity

A grammar G is ambiguous if there exists a $w \in L(G)$ such that there are two different leftmost (rightmost) derivations of $w$.
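
For a standard example, take the productions E ::= E + E | a (the parenthesized grammar in the next section avoids this problem). The sentence a + a + a has two different leftmost derivations, one for each way of associating the +:

$E \Rightarrow E + E \Rightarrow E + E + E \Rightarrow a + E + E \Rightarrow a + a + E \Rightarrow a + a + a$

$E \Rightarrow E + E \Rightarrow a + E \Rightarrow a + E + E \Rightarrow a + a + E \Rightarrow a + a + a$

so the grammar is ambiguous.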

Derivations and Trees

Let $X_i$ range over $N \cup T$, let $A ::= X_1X_2\ldots X_n$ be a production such that $X_i \Rightarrow^* \beta_i$ with $\beta = \beta_1\ldots\beta_n$, and let $D$ be the derivation $A \Rightarrow^* \beta$. Then Tree($D$) is the tree structure with root $A$ and with subtrees

Tree($X_1 \Rightarrow^* \beta_1$) $\ldots$ Tree($X_n \Rightarrow^* \beta_n$).

Example

E ::= (E + E) | a

Let $w = (a + a)$, $D = E \Rightarrow (E + E) \Rightarrow (a + E) \Rightarrow (a + a)$. Then Tree($D$) is

         E
     / / | \ \
    ( E  +  E )
      |     |
      a     a

The tree of a derivation is called a concrete syntax tree.
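
As a data structure, a concrete syntax tree is easy to represent. Here is a minimal OCaml sketch (the type and constructor names are ours), together with the tree drawn above:

```ocaml
(* Concrete syntax trees as a datatype: interior nodes carry a
   nonterminal, leaves carry a terminal. *)
type cst =
  | Node of string * cst list
  | Leaf of string

(* Tree(D) for D = E => (E + E) => (a + E) => (a + a), as drawn above *)
let tree_d : cst =
  Node ("E",
        [ Leaf "(";
          Node ("E", [Leaf "a"]);
          Leaf "+";
          Node ("E", [Leaf "a"]);
          Leaf ")" ])
```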

#### Abstract Syntax ####

The phrase "abstract syntax tree" (AST) was coined by John McCarthy, the inventor of LISP. The abstract syntax tree has only the essential information contained in the concrete syntax tree. For the example above, the AST is:

   +
  / \
 a   a
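
In the same sketch style as above, the abstract syntax can be captured by an OCaml datatype (again, the names are ours). The parentheses disappear and + becomes a constructor:

```ocaml
(* Abstract syntax for E ::= (E + E) | a as a datatype. Grouping is
   carried by the tree structure itself, so the parentheses of the
   concrete syntax leave no trace. *)
type exp =
  | Add of exp * exp   (* E + E *)
  | A                  (* the terminal a *)

(* The AST drawn above, for the concrete sentence (a + a) *)
let ast : exp = Add (A, A)
```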