Added docs

This commit is contained in:
nocturn9x 2021-07-13 16:09:40 +02:00
parent 3b899ce440
commit 22181c7f78
2 changed files with 111 additions and 0 deletions

0
docs/bytecode.md Normal file
View File

111
docs/grammar.md Normal file
View File

@ -0,0 +1,111 @@
# NimVM - Formal Grammar Specification
Our grammar is inspired by (and extended from) the Lox language as described in Bob Nystrom's book "Crafting Interpreters",
available at https://craftinginterpreters.com, and follows the EBNF standard, but for clarity the relevant syntax will
be explained below.
## Disclaimer
----------------------------------------------
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in [RFC2119](https://datatracker.ietf.org/doc/html/rfc2119).
Literals in this document will be often surrounded by double quotes to make it obvious they're not part of a sentence. To
avoid ambiguity, this document will always specify explicitly if double quotes need to be considered as part of a term or not,
which means that if it is not otherwise stated they are to be considered part of said term. In addition to quotes, literals
may be formatted in monospace to make them stand out more in the document.
## EBNF Syntax & Formatting rules
----------------------------------------------
As a refresher to experienced users as well as to facilitate reading to newcomers, the variation of EBNF used in this
document can be summarized with the following points:
- A sequence of 2 slashes (character code 47) is used to mark comments. A comment lasts until the
a CRLF or LF character (basically the end of a line) is encountered. It is RECOMMENDED to use
them to clarify each rule, or a group of rules, to simplify human inspection of the specification
- Whitespaces, tabs, newlines and form feeds (character code 32, 9, 10 and 12 respectively) are not
relevant to the grammar and SHOULD be ignored by automated parsers and parser generators
- `"*"` (without quotes, character code 42) is used for repetition of a rule, meaning it MUST match 0 or more times
- `"+"` (character code 43) is used for repetition of a rule, meaning it MUST 1 or more times
- `"|"` (without quotes, character code 123) is used to indicate alternatives and means a rule may match either the first or
the second rule. This operator can be chained to obtain something like "foo | bar | baz", meaning that either
foo, bar or baz are valid matches for the rule
- `"{x,y}"` (without quotes) is used for repetition, meaning a rule MUST match from x to y times (start to end, inclusive).
Omitting x means the rule MUST match at least 0 times and at most x times, while omitting y means the rule
MUST match exactly y times. Omitting both x and y is the same as using *
- Lines end with an ASCII semicolon (";" without quotes, character code 59) and each rule must end with one
- Rules are listed in descending order: the last rule is the highest-precedence one. Think of it as a "more complex rules
come first"
- An "arrow" (character code 8594) MUST be used to separate rule names from their definition.
A rule definition then looks something like this (without quotes): "name → rule definition here; // optional comment"
- Literal numbers can be expressed in their decimal form (i.e. with arabic numbers). Other supported formats are
hexadecimal using the prefix 0x, octal using the prefix 0o, and binary using the prefix 0b. For example,
the literals 0x7F, 0b1111111 and 0o177 all represent the decimal number 127 in hexadecimal, binary and
octal respectively
- The literal "EOF" (without quotes), represents the end of the input stream and is a shorthand for "End Of File"
- Ranges can be defined by separating the start and the end of the range with three dots (character code 46) and
are inclusive at both ends. Both the start and the end of the range are mandatory and it is RECOMMENDED that they
be separated by the three dots with a space for easier reading. Ranges can define numerical sets like in `"0 ... 9"` (without quotes),
or lexicographical ones such as `"'a' ... 'z'"` (without quotes), in which case the range should be interpreted as a sequence of the
character codes between the start and end of the range. It is REQUIRED that the first element in the range is greater
or equal to the last one: backwards ranges are illegal. In addition to this, although numerical ranges can use any
combination of the supported number representation (meaning `'0 ... 0x10'` is a valid range encompassing all decimal
numbers from 0 to 16) it is RECOMMENDED that the representation used is consistent across the start and end of the range.
Finally, ranges can have a character and a number as either start or end of them, in which case the character is to be
interpreted as its character code in decimal
- For readability purposes, it is RECOMMENTED that the grammar text be left aligned and that spaces are used between
operators
- Literal strings MUST be delimited by matching pairs of double or single quotes (character code 34 and 39) and SHOULD be separated
by any other term in the grammar by a space
- Terminal symbols SHOULD use all-uppercase names to ease readability
- Characters inside strings can be escaped using backslashes. For example, to add a literal double quote inside a double-quoted string, one MUST
write `"\""` (without quotes), althoguh it is recommended to use single quotes in this case (i.e. `'"'` instead)
## EBNF Grammar
----------------------------------------------
Below you can find the EBNF specification of NimVM's grammar.
```
// Top-level code
program → declaration* EOF; // An entire program (Note: an empty program is a valid program)
// Declarations (rules that bind a name to an object in the current scope and produce side effects)
declaration → structDecl | funDecl | varDecl | statement; // A program is composed by a list of declarations
structDecl → "struct" IDENTIFIER "{" (varDecl)* "}"; // Declares a structure type similar to C's
funDecl → "fun" function; // Function declarations
varDecl → "var" | "let" | "const" IDENTIFIER ( "=" expression )? ";"; // Constants and immutables still count as "variable" declarations in the grammar
// Statements (rules that produce side effects but without binding a name)
statement → exprStmt | forStmt | ifStmt | returnStmt| whileStmt| blockStmt; // The set of all statements
exprStmt → expression ";"; // Any expression followed by a semicolon is technically a statement
forStmt → "for" "(" ( varDecl | exprStmt | ";" ) expression? ";" expression? ")" statement; // C-style for loops
ifStmt → "if" "(" expression ")" statement ( "else" statement )?; // If statements are conditional jumps
returnStmt → "return" expression? ";"; // Returns from a function, illegal in top-level code
breakStmt → "break" ";";
continueStmt → "continue" ";";
whileStmt → "while" "(" expression ")" statement; // While loops run until their condition is truthy
blockStmt → "{" declaration* "}"; // Blocks create a new scope that lasts until they're closed
// Expressions (rules that produce a value, but may also have side effects)
expression → assignment ;
assignment → ( call "." )? IDENTIFIER "=" assignment | logic_or; // Assignment is the highest-level expression
logic_or → logic_and ( "||" logic_and )*;
logic_and → equality ( "&&" equality )*;
equality → comparison ( ( "!=" | "==" ) comparison )*;
comparison → term ( ( ">" | ">=" | "<" | "<=" ) term )*;
term → factor ( ( "-" | "+" ) factor )*; // Precedence for + and - in operations
factor → unary ( ( "/" | "*" | "**" | "^" | "&") unary )*; // All other operators have the same precedence
unary → ( "!" | "-" | "~" ) unary | call;
call → primary ( "(" arguments? ")" | "." IDENTIFIER )*;
primary → "true" | "false" | "nil" | NUMBER | STRING | IDENTIFIER | "(" expression ")" "." IDENTIFIER;
// Utility rules to avoid repetition
function → IDENTIFIER "(" parameters? ")" blockStmt;
parameters → IDENTIFIER ( "," IDENTIFIER )*;
arguments → expression ( "," expression )*;
// Lexical grammar that defines terminals in a non-recursive (aka regular) fashion
NUMBER → DIGIT+ ( "." | "e" | "E" DIGIT+ )?; // Numbers encompass integers and floats (even stuff like 1e5)
STRING → "\"" UNICODE* "\""; // Strings can contain arbitrary unicode inside them
IDENTIFIER → ALPHA ( ALPHA | DIGIT )*; // Valid identifiers are only alphanumeric!
ALPHA → "a" ... "z" | "A" ... "Z" | "_"; // Alphanumeric characters
UNICODE → 0x00 ... 0x10FFFD; // This covers the whole unicode range
DIGIT → "0" ... "9"; // Arabic digits
```