From 92c1602926cbee358571e808f379356893877231 Mon Sep 17 00:00:00 2001
From: Lee Byron
Date: Mon, 22 Jul 2019 18:49:14 -0700
Subject: [PATCH] Clarify that lexing is greedy

GraphQL syntactical grammars intend to be unambiguous. While lexical
grammars should also be - there has historically been an assumption
that lexical parsing is greedy. This is obvious for numbers and words,
but less obvious for empty block strings.

Either way, the additional clarity removes ambiguity from the spec

Partial fix for #564

Specifically addresses https://github.com/graphql/graphql-spec/pull/564#issuecomment-508714529
---
 spec/Appendix A -- Notation Conventions.md |  8 ++++---
 spec/Appendix B -- Grammar Summary.md      |  5 +++++
 spec/Section 2 -- Language.md              | 26 +++++++++++++++++-----
 3 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/spec/Appendix A -- Notation Conventions.md b/spec/Appendix A -- Notation Conventions.md
index cbb8e8a3a..d21f62fba 100644
--- a/spec/Appendix A -- Notation Conventions.md
+++ b/spec/Appendix A -- Notation Conventions.md
@@ -48,10 +48,12 @@ ListOfLetterA :
 
 The GraphQL language is defined in a syntactic grammar where terminal symbols
 are tokens. Tokens are defined in a lexical grammar which matches patterns of
-source characters. The result of parsing a sequence of source Unicode characters
-produces a GraphQL AST.
+source characters. The result of parsing a source text sequence of Unicode
+characters first produces a sequence of lexical tokens according to the lexical
+grammar which then produces abstract syntax tree (AST) according to the
+syntactical grammar.
 
-A Lexical grammar production describes non-terminal "tokens" by
+A lexical grammar production describes non-terminal "tokens" by
 patterns of terminal Unicode characters. No "whitespace" or other ignored
 characters may appear between any terminal Unicode characters in the lexical
 grammar production. A lexical grammar production is distinguished by a two colon
diff --git a/spec/Appendix B -- Grammar Summary.md b/spec/Appendix B -- Grammar Summary.md
index efdcae8f8..cd1f629be 100644
--- a/spec/Appendix B -- Grammar Summary.md
+++ b/spec/Appendix B -- Grammar Summary.md
@@ -1,5 +1,10 @@
 # B. Appendix: Grammar Summary
 
+The source text of a GraphQL document must be a sequence of {SourceCharacter}.
+The character sequence must be described by a sequence of {Token} and {Ignored}
+lexical grammars. The lexical token sequence, omitting {Ignored}, must be
+described by a single {Document} syntactical grammar.
+
 SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
 
 
diff --git a/spec/Section 2 -- Language.md b/spec/Section 2 -- Language.md
index bf31b2db7..2ae62893a 100644
--- a/spec/Section 2 -- Language.md
+++ b/spec/Section 2 -- Language.md
@@ -7,11 +7,13 @@ common unit of composition allowing for query reuse.
 
 A GraphQL document is defined as a syntactic grammar where terminal symbols are
 tokens (indivisible lexical units). These tokens are defined in a lexical
-grammar which matches patterns of source characters (defined by a
-double-colon `::`).
+grammar which matches patterns of source characters. In this document, syntactic
+grammar productions are distinguished with a colon `:` while lexical grammar
+productions are distinguished with a double-colon `::`.
 
-Note: See [Appendix A](#sec-Appendix-Notation-Conventions) for more details about the definition of lexical and syntactic grammar and other notational conventions
-used in this document.
+Note: See [Appendix A](#sec-Appendix-Notation-Conventions) for more information
+about the lexical and syntactic grammar and other notational conventions used
+throughout this document.
 
 ## Source Text
 
@@ -25,6 +27,19 @@ ASCII range so as to be as widely compatible with as many existing tools,
 languages, and serialization formats as possible and avoid display issues in
 text editors and source control.
 
+**Greedy Lexical Parsing**
+
+The source text of a GraphQL document is first converted into a sequence of
+lexical tokens, {Token}, and ignored tokens, {Ignored}. The source text is
+scanned from left to right, repeatedly taking the longest possible sequence of
+unicode characters as the next token.
+
+For example, the sequence `123` is always interpreted as a single {IntValue},
+and `""""""` is always interpreted as a single block {StringValue}.
+
+This sequence of lexical tokens are then scanned from left to right to produce
+an abstract syntax tree (AST) according to the {Document} syntactical grammar.
+
 
 ### Unicode
 
@@ -118,8 +133,7 @@ Token ::
 A GraphQL document is comprised of several kinds of indivisible lexical tokens
 defined here in a lexical grammar by patterns of source Unicode characters.
 
-Tokens are later used as terminal symbols in a GraphQL Document
-syntactic grammars.
+Tokens are later used as terminal symbols in GraphQL syntactic grammar rules.
 
 ### Ignored Tokens
 
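
Illustrative note, not part of the patch: the greedy rule described under **Greedy Lexical Parsing** can be sketched as a longest-match scanner. This is a minimal sketch under stated assumptions, not the reference implementation. The names `TOKEN_PATTERNS` and `tokenize` are hypothetical, the patterns cover only a small subset of the lexical grammar, and details such as the lookahead restrictions on numbers and full block string semantics are omitted.

```python
import re

# Hypothetical, heavily simplified subset of the GraphQL lexical grammar.
# Each entry is (kind, compiled pattern). Real GraphQL has more token kinds
# (FloatValue, full escape handling, ...) and stricter rules.
TOKEN_PATTERNS = [
    ("BlockString", re.compile(r'"""(?:\\"""|[\s\S])*?"""')),
    ("String", re.compile(r'"(?:\\.|[^"\\\n])*"')),
    ("Int", re.compile(r"-?(?:0|[1-9][0-9]*)")),
    ("Name", re.compile(r"[_A-Za-z][_0-9A-Za-z]*")),
    ("Punctuator", re.compile(r"\.\.\.|[!$&():=@\[\]{|}]")),
    ("Ignored", re.compile(r"[\ufeff\t \n\r,]+|#[^\n\r]*")),
]


def tokenize(source):
    """Scan left to right, always taking the longest match at each position."""
    tokens = []
    pos = 0
    while pos < len(source):
        best_kind, best_text = None, ""
        # Try every pattern at the current offset and keep the longest match.
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(source, pos)
            if match and len(match.group()) > len(best_text):
                best_kind, best_text = kind, match.group()
        if best_kind is None:
            raise SyntaxError(f"Unexpected character at offset {pos}")
        # Ignored tokens are consumed but omitted from the token sequence.
        if best_kind != "Ignored":
            tokens.append((best_kind, best_text))
        pos += len(best_text)
    return tokens


print(tokenize("123"))     # [('Int', '123')]
print(tokenize('""""""'))  # [('BlockString', '""""""')]
```

In this sketch both the `String` and `BlockString` patterns match at the start of `""""""`, but the block string match is longer (six characters versus two), so the input yields one block string token rather than three empty strings. That is the ambiguity the patch resolves.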