Added draft for peon's grammar
This commit is contained in:
parent
88dc610363
commit
ae819daac4
180
docs/grammar.md
180
docs/grammar.md
|
@ -1 +1,179 @@
|
|||
# TODO
|
||||
# JAPL - Formal Grammar Specification
|
||||
|
||||
__Note__: This document is currently a draft and is therefore incomplete
|
||||
|
||||
## Rationale
|
||||
The purpose of this document is to provide an unambiguous formal specification of peon's syntax for use in automated
|
||||
compiler generators (known as "compiler compilers") and parsers.
|
||||
|
||||
Our grammar is inspired by (and extended from) the Lox language as described in Bob Nystrom's book "Crafting Interpreters",
|
||||
available at https://craftinginterpreters.com, and follows the EBNF standard, but for clarity the relevant syntax will
|
||||
be explained below.
|
||||
|
||||
## Disclaimer
|
||||
----------------------------------------------
|
||||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
|
||||
"OPTIONAL" in this document are to be interpreted as described in [RFC2119](https://datatracker.ietf.org/doc/html/rfc2119).
|
||||
|
||||
Literals in this document will be often surrounded by double quotes to make it obvious they're not part of a sentence. To
|
||||
avoid ambiguity, this document will always specify explicitly if double quotes need to be considered as part of a term or not,
|
||||
which means that if it is not otherwise stated they are to be considered part of said term. In addition to quotes, literals
|
||||
may be formatted in monospace to make them stand out more in the document.
|
||||
|
||||
## EBNF Syntax & Formatting rules
|
||||
----------------------------------------------
|
||||
As a refresher to experienced users as well as to facilitate reading to newcomers, the variation of EBNF used in this document
|
||||
is detailed below:
|
||||
- The literal `"LF"` (without quotes) is a shorthand for "Line Feed". It symbolizes the end of a line and it's platform-independent
|
||||
- A pair of 2 forward-slashes (character code 47) is used to mark comments. A comment lasts until the
|
||||
the end of a line is encountered. It is RECOMMENDED to use them to clarify each rule, or a group of rules,
|
||||
to simplify human inspection of the specification
|
||||
- The name of non-terminal productions MUST be in lowercase (such as `foo`), while for terminals it MUST be in uppercase (such as `FOO`)
|
||||
- Whitespaces, tabs, newlines and form feeds (character codes 32, 9, 10 and 12 respectively) are not
|
||||
relevant to the grammar and MUST be ignored by automated parsers and parser generators
|
||||
- `"*"` (without quotes, character code 42) is used for repetition of a rule, meaning it MUST match 0 or more times
|
||||
- `"?"` (without quotes, character code 63) means a rule can match 0 or 1 times
|
||||
- `"+"` (character code 43) is used for repetition of a rule, meaning it MUST match 1 or more times
|
||||
- `"|"` (without quotes, character code 123) is used to indicate alternatives and means a rule may match either the first or
|
||||
the second rule. This operator can be chained to obtain something like `"foo" | "bar" | "baz"`, meaning that either
|
||||
the literal strings foo, bar or baz are valid matches for the rule
|
||||
- `"{x,y}"` (without quotes) is used for repetition, meaning a rule MUST match from x to y times (start to end, inclusive).
|
||||
Omitting x means the rule MUST match at least 0 times and at most x times, while omitting y means the rule
|
||||
MUST match exactly y times. Omitting both x and y is the same as using `*`
|
||||
- Production rules are terminated with an ASCII semicolon (`COLON` without quotes, character code 59)
|
||||
- Rules are listed in descending order: the last rule is the highest-precedence one. Think of it as a "more complex rules
|
||||
come first"
|
||||
- An "arrow" (character code 8594) MUST be used to separate rule names from their definition.
|
||||
A rule definition, then, looks something like this (without quotes): `"name → rule definition here; // optional comment"`
|
||||
- Literal numbers can be expressed in their decimal form (i.e. with arabic numbers). Other supported formats are
|
||||
hexadecimal using the prefix `0x`, octal using the prefix `0o`, and binary using the prefix `0b`. For example,
|
||||
the literals `0x7F`, `0b1111111` and `0o177` all represent the decimal number `127` in hexadecimal, binary and
|
||||
octal respectively
|
||||
- The literal `"EOF"` (without quotes), represents the end of the input stream and is a shorthand for "End Of File"
|
||||
- Ranges can be defined by separating the start and the end of the range with three dots (character code 46) and
|
||||
are inclusive at both ends. Both the start and the end of the range are mandatory and it is RECOMMENDED that they
|
||||
be separated by the three dots with a space for ease of reading. Ranges can define numerical sets like in `"0 ... 9"`
|
||||
(without quotes), or lexicographical ones such as `"'a' ... 'z'"` (without quotes), in which case the range should be
|
||||
interpreted as a sequence of the character codes between the start and end of the range. Ranges are inclusive at both
|
||||
ends. It is REQUIRED that the first element in the range is greater or equal to the last one: backwards ranges are illegal.
|
||||
In addition to this, although numerical ranges can use any combination of the supported number representation
|
||||
(meaning `'0 ... 0x10'` is a valid range encompassing all decimal numbers from 0 to 16) it is RECOMMENDED that
|
||||
the representation used is consistent across the start and end of the range. Finally, ranges can have a character
|
||||
and a number as either start or end of them, in which case the character is to be interpreted as its character code in decimal
|
||||
- For readability purposes, it is RECOMMENTED that the grammar text be left aligned and that spaces are used between
|
||||
operators
|
||||
- Literal strings MUST be delimited by matching pairs of double or single quotes (character code 34 and 39) and SHOULD be separated
|
||||
by any other term in the grammar by a space
|
||||
- Characters inside strings can be escaped using backslashes. For example, to add a literal double quote inside a double-quoted string, one MUST
|
||||
write `"\""` (without quotes), althoguh it is recommended to use single quotes in this case (i.e. `'"'` instead)
|
||||
|
||||
## EBNF Grammar
|
||||
----------------------------------------------
|
||||
Below you can find the EBNF specification of peon's grammar.
|
||||
|
||||
```
|
||||
// Top-level code
|
||||
program → declaration* EOF; // An entire program (Note: an empty program *is* a valid program)
|
||||
|
||||
// Declarations (rules that bind a name to an object in the current scope and produce no side effects)
|
||||
|
||||
// A program is composed by a list of declarations
|
||||
declaration → funDecl | varDecl | coroDecl | statement;
|
||||
// Function declarations
|
||||
funDecl → "fn" function;
|
||||
coroDecl → "coro" function;
|
||||
// Constants still count as "variable" declarations in the grammar
|
||||
varDecl → ("var" | "let" | "const") IDENTIFIER ( "=" expression )? COLON;
|
||||
|
||||
|
||||
// Statements (rules that produce side effects, without binding a name. Well, mostly: import, foreach and others do, but they're exceptions to the rule)
|
||||
statement → exprStmt | ifStmt | returnStmt| whileStmt| blockStmt; // The set of all statements
|
||||
// Any expression followed by a semicolon is an expression statement
|
||||
exprStmt → expression COLON;
|
||||
// Returns from a function, illegal in top-level code. An empty return statement is illegal
|
||||
// in non-void functions
|
||||
returnStmt → "return" expression? COLON;
|
||||
// Defers the evaluation of the given expression right before a function exits, illegal in top-level code.
|
||||
// Semantically and functionally equivalent to wrapping a function in a big try block and executing the
|
||||
// expression in the finally block, but less verbose
|
||||
deferStmt → "defer" expression COLON;
|
||||
// Breaks out of a loop or named block
|
||||
breakStmt → "break" IDENTIFIER? COLON;
|
||||
// Skips to the next iteration in a loop or jumps to the
|
||||
// beginning of a named block
|
||||
continueStmt → "continue" IDENTIFIER? COLON;
|
||||
importStmt -> ("from" IDENTIFIER)? "import" (IDENTIFIER ("as" IDENTIFIER)? ","?)+ COLON; // Imports one or more modules in the current scope. Creates a namespace
|
||||
assertStmt → "assert" expression COLON;
|
||||
yieldStmt → "yield" expression? COLON;
|
||||
// Pauses the execution of the calling coroutine and calls the given coroutine. Execution continues when the callee returns
|
||||
awaitStmt → "await" expression COLON;
|
||||
// Exception handling
|
||||
tryStmt → "try" "{" statement* "}" (except+ "finally" statement | "finally" statement | "else" statement | except+ "else" statement | except+ "else" statement "finally" statement);
|
||||
// Blocks create a new scope that lasts until they're closed
|
||||
blockStmt → "{" declaration* "}";
|
||||
// Named blocks are useful for breaking out of deeply nested loops
|
||||
namedBlock → "block" IDENTIFIER "{" declaration* "}";
|
||||
// If statements are conditional jumps
|
||||
ifStmt → "if" expression "{" statement* "}" ("else" "{" statement* "}")?;
|
||||
// While loops run until their condition is true
|
||||
whileStmt → "while" expression "{" statement* "}";
|
||||
// For-each loops iterate over a collection type
|
||||
foreachStmt → "foreach" "(" (IDENTIFIER ":" expression) ")" "{" statement* "}";
|
||||
|
||||
|
||||
// Expressions (rules that produce a value and may have side effects)
|
||||
|
||||
// Assignment is the highest-level expression
|
||||
expression → assignment;
|
||||
assignment → (call ".")? IDENTIFIER ASSIGNTOKENS assignment | lambdaExpr;
|
||||
lambdaExpr → "lambda" lambda; // Lambdas are anonymous functions, so they act as expressions
|
||||
yieldExpr → "yield" expression?; // Empty yield equals yield nil
|
||||
awaitExpr → "await" expression;
|
||||
logic_or → logic_and ("and" logic_and)*;
|
||||
logic_and → equality ("or" equality)*;
|
||||
equality → comparison (("!=" | "==") comparison)*;
|
||||
comparison → term ((">" | ">=" | "<" | "<=" | "as" | "is" | "of") term)*;
|
||||
term → factor (("-" | "+") factor)*; // Precedence for + and - in operations
|
||||
factor → unary (("/" | "*" | "**" | "^" | "&") unary)*; // All other binary operators have the same precedence
|
||||
unary → ("!" | "-" | "~") unary | call;
|
||||
slice → expression "[" expression (":" expression){0,2} "]"
|
||||
call → primary ("(" arguments? ")" | "." IDENTIFIER)*;
|
||||
// Below are some collection literals: lists, sets, dictionaries and tuples
|
||||
listExpr → "[" arguments* "]";
|
||||
// Note: "{}" is an empty dictionary, NOT an empty set
|
||||
setExpr → "{" arguments? "}";
|
||||
dictExpr → "{" (expression ":" expression ("," expression ":" expression)*)* "}"; // {key: value, ...}
|
||||
tupleExpr → "(" arguments* ")";
|
||||
primary → "nan" | "true" | "false" | "nil" | "inf" | NUMBER | STRING | IDENTIFIER | "(" expression ")" "." IDENTIFIER;
|
||||
|
||||
// Utility rules to avoid repetition
|
||||
function → IDENTIFIER ("(" parameters? ")")? blockStmt;
|
||||
lambda → ("(" parameters? ")")? blockStmt;
|
||||
// ident: type [, ident2: type2, ...]
|
||||
parameters → IDENTIFIER ":" IDENTIFIER ("," IDENTIFIER)*;
|
||||
arguments → expression ("," expression)*;
|
||||
except → ("except" expression? statement)
|
||||
|
||||
|
||||
// These are all the terminals (i.e. productions defined non-recursively)
|
||||
COMMENT → "#" UNICODE* LF;
|
||||
COLON → COLON;
|
||||
SINGLESTRING → QUOTE UNICODE* QUOTE;
|
||||
DOUBLESTRING → DOUBLEQUOTE UNICODE* DOUBLEQUOTE;
|
||||
SINGLEMULTI → QUOTE{3} UNICODE* QUOTE{3}; // Single quoted multi-line strings
|
||||
DOUBLEMULTI → DOUBLEQUOTE{3} UNICODE* DOUBLEQUOTE{3}; // Double quoted multi-line string
|
||||
DECIMAL → DIGIT+;
|
||||
FLOAT → DIGIT+ ("." DIGIT+)? (("e" | "E") DIGIT+)?;
|
||||
BIN → "0b" ("0" | "1")+;
|
||||
OCT → "0o" ("0" ... "7")+;
|
||||
HEX → "0x" ("0" ... "9" | "A" ... "F" | "a" ... "f")+;
|
||||
NUMBER → DECIMAL | FLOAT | BIN | HEX | OCT;
|
||||
STRING → ("r"|"b"|"f")? SINGLESTRING | DOUBLESTRING | SINGLEMULTI | DOUBLEMULTI;
|
||||
IDENTIFIER → ALPHA (ALPHA | DIGIT)*; // Valid identifiers are only alphanumeric!
|
||||
QUOTE → "'";
|
||||
DOUBLEQUOTE → "\"";
|
||||
IDENTCHARS → "a" ... "z" | "A" ... "Z" | "_";
|
||||
UNICODE → 0x00 ... 0x10FFFD; // This covers the whole unicode range
|
||||
DIGIT → "0" ... "9";
|
||||
ASSIGNTOKENS → "+=" | "-=" | "*=" | "/=" | "%=" | "&=" | "|=" | "^=" | "<<=" | ">>=" | "**=" | "//=" | "="
|
||||
```
|
Loading…
Reference in New Issue