Added docs
parent
3b899ce440
commit
22181c7f78
@@ -0,0 +1,111 @@
# NimVM - Formal Grammar Specification

Our grammar is inspired by (and extended from) the Lox language as described in Bob Nystrom's book "Crafting Interpreters",
available at https://craftinginterpreters.com, and follows the EBNF standard, but for clarity the relevant syntax will
be explained below.

## Disclaimer
----------------------------------------------
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in [RFC2119](https://datatracker.ietf.org/doc/html/rfc2119).

Literals in this document will often be surrounded by double quotes to make it obvious they're not part of a sentence. To
avoid ambiguity, this document will always specify explicitly whether double quotes are to be considered part of a term or not,
which means that, unless otherwise stated, they are to be considered part of said term. In addition to quotes, literals
may be formatted in monospace to make them stand out more in the document.

## EBNF Syntax & Formatting rules
----------------------------------------------
As a refresher for experienced users, as well as to make reading easier for newcomers, the variation of EBNF used in this
document can be summarized with the following points:
- A sequence of 2 slashes (character code 47) is used to mark comments. A comment lasts until
  a CRLF sequence or an LF character (basically the end of the line) is encountered. It is RECOMMENDED to use
  comments to clarify each rule, or group of rules, to simplify human inspection of the specification
- Spaces, tabs, newlines and form feeds (character codes 32, 9, 10 and 12 respectively) are not
  relevant to the grammar and SHOULD be ignored by automated parsers and parser generators
- `"*"` (without quotes, character code 42) is used for repetition of a rule, meaning it MUST match 0 or more times
- `"+"` (without quotes, character code 43) is used for repetition of a rule, meaning it MUST match 1 or more times
- `"|"` (without quotes, character code 124) is used to indicate alternatives and means a rule may match either the first or
  the second rule. This operator can be chained to obtain something like "foo | bar | baz", meaning that any of
  foo, bar or baz is a valid match for the rule
- `"{x,y}"` (without quotes) is used for repetition, meaning a rule MUST match from x to y times (start to end, inclusive).
  Omitting x means the rule MUST match at least 0 times and at most y times, while omitting y means the rule
  MUST match exactly x times. Omitting both x and y is the same as using *
- Each rule must end with an ASCII semicolon (";" without quotes, character code 59), optionally followed by a comment
- Rules are listed from lowest to highest precedence: the last rule is the highest-precedence one. Think of it as "more complex rules
  come first"
- An "arrow" (character code 8594) MUST be used to separate rule names from their definitions.
  A rule definition then looks something like this (without quotes): "name → rule definition here; // optional comment"
- Literal numbers can be expressed in their decimal form (i.e. with Arabic numerals). Other supported formats are
  hexadecimal using the prefix 0x, octal using the prefix 0o, and binary using the prefix 0b. For example,
  the literals 0x7F, 0b1111111 and 0o177 all represent the decimal number 127 in hexadecimal, binary and
  octal respectively
- The literal "EOF" (without quotes) represents the end of the input stream and is shorthand for "End Of File"
- Ranges can be defined by separating the start and the end of the range with three dots (character code 46) and
  are inclusive at both ends. Both the start and the end of the range are mandatory and it is RECOMMENDED that they
  be separated from the three dots by a space for easier reading. Ranges can define numerical sets, as in `"0 ... 9"` (without quotes),
  or lexicographical ones such as `"'a' ... 'z'"` (without quotes), in which case the range should be interpreted as a sequence of the
  character codes between the start and end of the range. It is REQUIRED that the first element in the range is less than
  or equal to the last one: backwards ranges are illegal. In addition to this, although numerical ranges can use any
  combination of the supported number representations (meaning `'0 ... 0x10'` is a valid range encompassing all
  integers from 0 to 16) it is RECOMMENDED that the representation used is consistent across the start and end of the range.
  Finally, a range can mix a character and a number as its start and end, in which case the character is to be
  interpreted as its character code in decimal
- For readability purposes, it is RECOMMENDED that the grammar text be left aligned and that spaces are used between
  operators
- Literal strings MUST be delimited by matching pairs of double or single quotes (character codes 34 and 39 respectively) and SHOULD be separated
  from any other term in the grammar by a space
- Terminal symbols SHOULD use all-uppercase names to ease readability
- Characters inside strings can be escaped using backslashes. For example, to add a literal double quote inside a double-quoted string, one MUST
  write `"\""` (without quotes), although it is recommended to use single quotes in this case (i.e. `'"'` instead)
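
The number-literal and range rules above can be made concrete with a small Python sketch. It is only an illustration of one possible reading of those rules, not part of NimVM: the names `parse_endpoint` and `expand_range` are hypothetical, and the sketch assumes that a quoted endpoint always holds exactly one character.

```python
# Hypothetical helpers illustrating the range rules; not part of NimVM itself.
PREFIXES = {"0x": 16, "0o": 8, "0b": 2}


def parse_endpoint(text):
    """Turn one end of a range into an integer (a number or a character code)."""
    text = text.strip()
    # A quoted single character is interpreted as its character code.
    if len(text) == 3 and text[0] == text[2] and text[0] in "'\"":
        return ord(text[1])
    # Otherwise it is a number: decimal unless it carries a 0x, 0o or 0b prefix.
    base = PREFIXES.get(text[:2], 10)
    digits = text[2:] if base != 10 else text
    return int(digits, base)


def expand_range(spec):
    """Expand a range such as "0 ... 0x10" into the integers it covers."""
    start_text, dots, end_text = spec.partition("...")
    if not dots or not start_text.strip() or not end_text.strip():
        raise ValueError("both the start and the end of a range are mandatory")
    start, end = parse_endpoint(start_text), parse_endpoint(end_text)
    if start > end:
        raise ValueError("backwards ranges are illegal")
    # Ranges are inclusive at both ends.
    return range(start, end + 1)
```

For instance, `expand_range("0 ... 0x10")` covers the integers 0 through 16, and `expand_range("'a' ... 122")` mixes a character with a number and covers the codes of the letters a through z.
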
## EBNF Grammar
----------------------------------------------
Below you can find the EBNF specification of NimVM's grammar.

```
// Top-level code
program → declaration* EOF; // An entire program (Note: an empty program is a valid program)

// Declarations (rules that bind a name to an object in the current scope and produce side effects)
declaration → structDecl | funDecl | varDecl | statement; // A program is composed of a list of declarations
structDecl → "struct" IDENTIFIER "{" (varDecl)* "}"; // Declares a structure type similar to C's
funDecl → "fun" function; // Function declarations
varDecl → ( "var" | "let" | "const" ) IDENTIFIER ( "=" expression )? ";"; // Constants and immutables still count as "variable" declarations in the grammar

// Statements (rules that produce side effects but without binding a name)
statement → exprStmt | forStmt | ifStmt | returnStmt | breakStmt | continueStmt | whileStmt | blockStmt; // The set of all statements
exprStmt → expression ";"; // Any expression followed by a semicolon is technically a statement
forStmt → "for" "(" ( varDecl | exprStmt | ";" ) expression? ";" expression? ")" statement; // C-style for loops
ifStmt → "if" "(" expression ")" statement ( "else" statement )?; // If statements are conditional jumps
returnStmt → "return" expression? ";"; // Returns from a function, illegal in top-level code
breakStmt → "break" ";";
continueStmt → "continue" ";";
whileStmt → "while" "(" expression ")" statement; // While loops run as long as their condition is truthy
blockStmt → "{" declaration* "}"; // Blocks create a new scope that lasts until they're closed

// Expressions (rules that produce a value, but may also have side effects)
expression → assignment;
assignment → ( call "." )? IDENTIFIER "=" assignment | logic_or; // Assignment is the highest-level expression
logic_or → logic_and ( "||" logic_and )*;
logic_and → equality ( "&&" equality )*;
equality → comparison ( ( "!=" | "==" ) comparison )*;
comparison → term ( ( ">" | ">=" | "<" | "<=" ) term )*;
term → factor ( ( "-" | "+" ) factor )*; // Precedence for + and - in operations
factor → unary ( ( "/" | "*" | "**" | "^" | "&" ) unary )*; // All other operators have the same precedence
unary → ( "!" | "-" | "~" ) unary | call;
call → primary ( "(" arguments? ")" | "." IDENTIFIER )*;
primary → "true" | "false" | "nil" | NUMBER | STRING | IDENTIFIER | "(" expression ")"; // Property access on a grouping is handled by call

// Utility rules to avoid repetition
function → IDENTIFIER "(" parameters? ")" blockStmt;
parameters → IDENTIFIER ( "," IDENTIFIER )*;
arguments → expression ( "," expression )*;

// Lexical grammar that defines terminals in a non-recursive (aka regular) fashion
NUMBER → DIGIT+ ( ( "." | "e" | "E" ) DIGIT+ )?; // Numbers encompass integers and floats (even stuff like 1e5)
STRING → "\"" UNICODE* "\""; // Strings can contain arbitrary unicode inside them
IDENTIFIER → ALPHA ( ALPHA | DIGIT )*; // Valid identifiers contain only letters, digits and underscores, and cannot start with a digit
ALPHA → "a" ... "z" | "A" ... "Z" | "_"; // Alphabetic characters and the underscore
UNICODE → 0x00 ... 0x10FFFD; // Covers the Unicode range up to U+10FFFD (U+10FFFE and U+10FFFF are noncharacters)
DIGIT → "0" ... "9"; // Arabic digits
```
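
To show how the layered expression rules translate into operator precedence, here is a minimal recursive-descent sketch in Python covering only `term`, `factor`, `unary` and a reduced `primary`. It is an illustration under stated assumptions rather than NimVM's actual parser: the names `tokenize`, `Parser` and `parse_term` are hypothetical, `primary` is restricted to NUMBER and IDENTIFIER (no calls, groupings, strings or keywords), and NUMBER is read as one or more digits optionally followed by one of ".", "e" or "E" and more digits.

```python
import re

# One token: a NUMBER, an IDENTIFIER, or an operator ("**" is tried before "*").
TOKEN = re.compile(r"\s*(\d+(?:[.eE]\d+)?|[A-Za-z_][A-Za-z_0-9]*|\*\*|[-+*/^&!~])")


def tokenize(source):
    """Split the source text into a flat list of token strings."""
    tokens, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """Each method mirrors one grammar rule and returns a nested tuple."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def binary(self, operand, operators):
        # Shared shape of term and factor: operand ( operator operand )*
        node = operand()
        while self.peek() in operators:
            op = self.advance()
            node = (op, node, operand())
        return node

    def term(self):
        return self.binary(self.factor, {"-", "+"})

    def factor(self):
        return self.binary(self.unary, {"/", "*", "**", "^", "&"})

    def unary(self):
        if self.peek() in {"!", "-", "~"}:
            return (self.advance(), self.unary())
        return self.primary()

    def primary(self):
        # Reduced primary (an assumption of this sketch): NUMBER or IDENTIFIER only.
        token = self.advance()
        if token is None or not (token[0].isalnum() or token[0] == "_"):
            raise SyntaxError(f"expected NUMBER or IDENTIFIER, got {token!r}")
        return token


def parse_term(source):
    """Parse a whole string as a term and reject trailing tokens."""
    parser = Parser(tokenize(source))
    tree = parser.term()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

Because `term` delegates to `factor`, which in turn delegates to `unary`, rules that appear later in the grammar bind more tightly: parsing `1 + 2 * 3` with this sketch groups the multiplication first.
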