Added draft for peon's grammar

2023-01-25 10:20:37 +01:00 · 2023-01-25 10:20:37 +01:00 · ae819daac4
parent 88dc610363
commit ae819daac4
1 changed files with 179 additions and 1 deletions
--- a/docs/grammar.md
+++ b/docs/grammar.md
@@ -1 +1,179 @@
-# TODO
+# Peon - Formal Grammar Specification
+
+__Note__: This document is currently a draft and is therefore incomplete.
+
+## Rationale
+The purpose of this document is to provide an unambiguous formal specification of peon's syntax for use in automated
+compiler generators (known as "compiler compilers") and parsers.
+
+Our grammar is inspired by (and extended from) the Lox language as described in Bob Nystrom's book "Crafting Interpreters", 
+available at https://craftinginterpreters.com, and follows the EBNF standard, but for clarity the relevant syntax will
+be explained below.
+
+## Disclaimer
+----------------------------------------------
+The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
+"OPTIONAL" in this document are to be interpreted as described in [RFC2119](https://datatracker.ietf.org/doc/html/rfc2119).
+
+Literals in this document will often be surrounded by double quotes to make it obvious they're not part of a sentence. To
+avoid ambiguity, this document will always specify explicitly whether double quotes are to be considered part of a term:
+unless stated otherwise, they are to be considered part of said term. In addition to quotes, literals
+may be formatted in monospace to make them stand out more in the document.
+
+## EBNF Syntax & Formatting rules
+----------------------------------------------
+As a refresher to experienced users as well as to facilitate reading to newcomers, the variation of EBNF used in this document
+is detailed below:
+- The literal `"LF"` (without quotes) is a shorthand for "Line Feed". It symbolizes the end of a line and is platform-independent
+- A pair of forward slashes (character code 47) is used to mark comments. A comment lasts until
+  the end of the line is encountered. It is RECOMMENDED to use them to clarify each rule, or a group of rules,
+  to simplify human inspection of the specification
+- The name of non-terminal productions MUST be in lowercase (such as `foo`), while for terminals it MUST be in uppercase (such as `FOO`)
+- Spaces, tabs, newlines and form feeds (character codes 32, 9, 10 and 12 respectively) are not
+  relevant to the grammar and MUST be ignored by automated parsers and parser generators
+- `"*"` (without quotes, character code 42) is used for repetition of a rule, meaning it MUST match 0 or more times
+- `"?"` (without quotes, character code 63) means a rule can match 0 or 1 times
+- `"+"` (without quotes, character code 43) is used for repetition of a rule, meaning it MUST match 1 or more times
+- `"|"` (without quotes, character code 124) is used to indicate alternatives and means a rule may match either the first or
+  the second rule. This operator can be chained to obtain something like `"foo" | "bar" | "baz"`, meaning that any of
+  the literal strings foo, bar or baz is a valid match for the rule
+- `"{x,y}"` (without quotes) is used for repetition, meaning a rule MUST match from x to y times (start to end, inclusive).
+  Omitting x means the rule MUST match at least 0 times and at most y times, while omitting y means the rule
+  MUST match exactly x times. Omitting both x and y is the same as using `*`
+- Production rules are terminated with an ASCII semicolon (`";"` without quotes, character code 59)
+- Rules are listed in descending order: the last rule is the highest-precedence one. Think of it as "more complex rules
+  come first"
+- An "arrow" (character code 8594) MUST be used to separate rule names from their definition.
+  A rule definition, then, looks something like this (without quotes): `"name → rule definition here; // optional comment"`
+- Literal numbers can be expressed in their decimal form (i.e. with Arabic numerals). Other supported formats are
+  hexadecimal using the prefix `0x`, octal using the prefix `0o`, and binary using the prefix `0b`. For example,
+  the literals `0x7F`, `0b1111111` and `0o177` all represent the decimal number `127` in hexadecimal, binary and
+  octal respectively
+- The literal `"EOF"` (without quotes) represents the end of the input stream and is a shorthand for "End Of File"
+- Ranges can be defined by separating the start and the end of the range with three dots (character code 46) and
+  are inclusive at both ends. Both the start and the end of the range are mandatory and it is RECOMMENDED that they
+  be separated from the three dots by a space for ease of reading. Ranges can define numerical sets like in `"0 ... 9"`
+  (without quotes), or lexicographical ones such as `"'a' ... 'z'"` (without quotes), in which case the range should be
+  interpreted as a sequence of the character codes between the start and end of the range.
+  It is REQUIRED that the first element in the range is less than or equal to the last one: backwards ranges are illegal.
+  In addition to this, although numerical ranges can use any combination of the supported number representations
+  (meaning `'0 ... 0x10'` is a valid range encompassing all decimal numbers from 0 to 16) it is RECOMMENDED that
+  the representation used is consistent across the start and end of the range. Finally, ranges can have a character
+  and a number as either start or end of them, in which case the character is to be interpreted as its character code in decimal
+ - For readability purposes, it is RECOMMENDED that the grammar text be left aligned and that spaces are used between
+   operators
+ - Literal strings MUST be delimited by matching pairs of double or single quotes (character codes 34 and 39) and SHOULD be separated
+   from any other term in the grammar by a space
+ - Characters inside strings can be escaped using backslashes. For example, to add a literal double quote inside a double-quoted string, one MUST
+   write `"\""` (without quotes), although it is recommended to use single quotes in this case (i.e. `'"'` instead)
+
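The number-literal and range rules above can be illustrated with a minimal Python sketch. The helper names `parse_bound` and `expand_range` are hypothetical and not part of peon or of this specification; the sketch only assumes the rules as stated in the list (prefixed literals, inclusive ranges, characters read as their character codes, no backwards ranges).

```python
def parse_bound(text: str) -> int:
    """Convert one range bound (a number literal or a quoted character) to an int."""
    text = text.strip()
    # A single character wrapped in matching quotes stands for its character code
    if len(text) == 3 and text[0] == text[2] and text[0] in "'\"":
        return ord(text[1])
    # Base 0 lets int() accept the 0x, 0o and 0b prefixes as well as plain decimal
    return int(text, 0)


def expand_range(spec: str) -> range:
    """Expand a range such as "0 ... 0x10" into the integers it covers."""
    start_text, separator, end_text = spec.partition("...")
    if not separator:
        raise ValueError("a range needs three dots between its start and end")
    start, end = parse_bound(start_text), parse_bound(end_text)
    if start > end:
        raise ValueError("backwards ranges are illegal")
    # Ranges are inclusive at both ends, hence the + 1
    return range(start, end + 1)
```

For instance, `expand_range("'a' ... 'z'")` covers the 26 character codes from 97 to 122, and mixing a character with a number (as in `"'a' ... 0x7A"`) resolves to the same set.
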
+## EBNF Grammar
+----------------------------------------------
+Below you can find the EBNF specification of peon's grammar.
+
+```
+// Top-level code
+program        → declaration* EOF; // An entire program (Note: an empty program *is* a valid program)
+
+// Declarations (rules that bind a name to an object in the current scope and produce no side effects)
+
+// A program is composed of a list of declarations
+declaration    → funDecl | varDecl | coroDecl | statement;
+// Function declarations
+funDecl        → "fn" function;
+coroDecl       → "coro" function;
+// Constants still count as "variable" declarations in the grammar
+varDecl        → ("var" | "let" | "const") IDENTIFIER ( "=" expression )? COLON;
+
+
+// Statements (rules that produce side effects, without binding a name. Well, mostly: import, foreach and others do, but they're exceptions to the rule)
+statement      → exprStmt | ifStmt | returnStmt | whileStmt | blockStmt;  // The set of all statements
+// Any expression followed by a semicolon is an expression statement
+exprStmt       → expression COLON;
+// Returns from a function, illegal in top-level code. An empty return statement is illegal
+// in non-void functions
+returnStmt     → "return" expression? COLON;
+// Defers the evaluation of the given expression right before a function exits, illegal in top-level code. 
+// Semantically and functionally equivalent to wrapping a function in a big try block and executing the 
+// expression in the finally block, but less verbose
+deferStmt      → "defer" expression COLON;
+// Breaks out of a loop or named block
+breakStmt      → "break" IDENTIFIER? COLON;
+// Skips to the next iteration in a loop or jumps to the
+// beginning of a named block
+continueStmt   → "continue" IDENTIFIER? COLON;
+importStmt     → ("from" IDENTIFIER)? "import" (IDENTIFIER ("as" IDENTIFIER)? ","?)+ COLON;  // Imports one or more modules in the current scope. Creates a namespace
+assertStmt     → "assert" expression COLON;
+yieldStmt      → "yield" expression? COLON;
+// Pauses the execution of the calling coroutine and calls the given coroutine. Execution continues when the callee returns
+awaitStmt      → "await" expression COLON;
+// Exception handling
+tryStmt        → "try" "{" statement* "}" (except+ "finally" statement | "finally" statement | "else" statement | except+ "else" statement | except+ "else" statement "finally" statement);
+// Blocks create a new scope that lasts until they're closed
+blockStmt      → "{" declaration* "}";
+// Named blocks are useful for breaking out of deeply nested loops
+namedBlock     → "block" IDENTIFIER "{" declaration* "}";
+// If statements are conditional jumps
+ifStmt         → "if" expression "{" statement* "}" ("else" "{" statement* "}")?;
+// While loops run for as long as their condition is true
+whileStmt      → "while" expression "{" statement* "}";
+// For-each loops iterate over a collection type
+foreachStmt    → "foreach" "(" (IDENTIFIER ":" expression) ")" "{" statement* "}";
+
+
+// Expressions (rules that produce a value and may have side effects)
+
+// Assignment is the highest-level expression
+expression     → assignment;
+assignment     → (call ".")? IDENTIFIER ASSIGNTOKENS assignment | lambdaExpr;
+lambdaExpr     → "lambda" lambda;  // Lambdas are anonymous functions, so they act as expressions
+yieldExpr      → "yield" expression?; // Empty yield equals yield nil
+awaitExpr      → "await" expression;
+logic_or       → logic_and ("or" logic_and)*;
+logic_and      → equality ("and" equality)*;
+equality       → comparison (("!=" | "==") comparison)*;
+comparison     → term ((">" | ">=" | "<" | "<=" | "as" | "is" | "of") term)*;
+term           → factor (("-" | "+") factor)*;  // Precedence for + and - in operations
+factor         → unary (("/" | "*" | "**" | "^" | "&") unary)*;  // All other binary operators have the same precedence
+unary          → ("!" | "-" | "~") unary | call;
+slice          → expression "[" expression (":" expression){0,2} "]";
+call           → primary ("(" arguments? ")" | "." IDENTIFIER)*;
+// Below are some collection literals: lists, sets, dictionaries and tuples
+listExpr       → "[" arguments? "]";
+// Note: "{}" is an empty dictionary, NOT an empty set
+setExpr        → "{" arguments? "}";
+dictExpr       → "{" (expression ":" expression ("," expression ":" expression)*)? "}"; // {key: value, ...}
+tupleExpr      → "(" arguments? ")";
+primary        → "nan" | "true" | "false" | "nil" | "inf" | NUMBER | STRING | IDENTIFIER | "(" expression ")" "." IDENTIFIER;
+
+// Utility rules to avoid repetition
+function       → IDENTIFIER ("(" parameters? ")")? blockStmt;
+lambda         → ("(" parameters? ")")? blockStmt;
+// ident: type [, ident2: type2, ...]
+parameters     → IDENTIFIER ":" IDENTIFIER ("," IDENTIFIER ":" IDENTIFIER)*;
+arguments      → expression ("," expression)*;
+except         → ("except" expression? statement);
+
+
+// These are all the terminals (i.e. productions defined non-recursively)
+COMMENT        → "#" UNICODE* LF;
+COLON          → ";";  // Despite its name, this terminal is the semicolon (character code 59)
+SINGLESTRING   → QUOTE UNICODE* QUOTE;
+DOUBLESTRING   → DOUBLEQUOTE UNICODE* DOUBLEQUOTE;
+SINGLEMULTI    → QUOTE{3} UNICODE* QUOTE{3};   // Single quoted multi-line strings
+DOUBLEMULTI    → DOUBLEQUOTE{3} UNICODE* DOUBLEQUOTE{3};  // Double quoted multi-line string
+DECIMAL        → DIGIT+;
+FLOAT          → DIGIT+ ("." DIGIT+)? (("e" | "E") DIGIT+)?;
+BIN            → "0b" ("0" | "1")+;
+OCT            → "0o" ("0" ... "7")+;
+HEX            → "0x" ("0" ... "9" | "A" ... "F" | "a" ... "f")+;
+NUMBER         → DECIMAL | FLOAT | BIN | HEX | OCT;
+STRING         → ("r" | "b" | "f")? (SINGLESTRING | DOUBLESTRING | SINGLEMULTI | DOUBLEMULTI);
+IDENTIFIER     → IDENTCHARS (IDENTCHARS | DIGIT)*;  // Valid identifiers contain only letters, digits and underscores, and cannot start with a digit
+QUOTE          → "'";
+DOUBLEQUOTE    → "\"";
+IDENTCHARS     → "a" ... "z" | "A" ... "Z" | "_"; 
+UNICODE        → 0x00 ... 0x10FFFD;  // This covers the whole unicode range
+DIGIT          → "0" ... "9";
+ASSIGNTOKENS   → "+=" | "-=" | "*=" | "/=" | "%=" | "&=" | "|=" | "^=" | "<<=" | ">>=" | "**=" | "//=" | "=";
+```
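
As a rough illustration of how the numeric terminals above could be recognized, the following Python sketch transcribes `BIN`, `OCT`, `HEX`, `DECIMAL` and `FLOAT` into regular expressions. The table `NUMBER_PATTERNS` and the function `classify_number` are hypothetical names chosen for this example and are not taken from peon's actual lexer. Because `FLOAT`'s fractional and exponent parts are both optional, a plain run of digits matches both `DECIMAL` and `FLOAT`; the sketch assumes `DECIMAL` should win and therefore tries it first.

```python
import re

# Each entry mirrors one terminal from the grammar; order matters because
# DECIMAL and FLOAT overlap on plain digit runs such as "42"
NUMBER_PATTERNS = [
    ("BIN", re.compile(r"0b[01]+")),
    ("OCT", re.compile(r"0o[0-7]+")),
    ("HEX", re.compile(r"0x[0-9A-Fa-f]+")),
    ("DECIMAL", re.compile(r"[0-9]+")),
    ("FLOAT", re.compile(r"[0-9]+(\.[0-9]+)?([eE][0-9]+)?")),
]


def classify_number(text: str):
    """Return the name of the first terminal matching the whole text, or None."""
    for name, pattern in NUMBER_PATTERNS:
        if pattern.fullmatch(text):
            return name
    return None
```

Under these assumptions `classify_number("0xFF")` yields `"HEX"` and `classify_number("1e5")` yields `"FLOAT"`, while malformed input such as `"0b2"` or `"1."` yields `None`.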