diff --git a/docs/bytecode.md b/docs/bytecode.md
new file mode 100644
index 0000000..e69de29
diff --git a/docs/grammar.md b/docs/grammar.md
new file mode 100644
index 0000000..084d4c3
--- /dev/null
+++ b/docs/grammar.md
@@ -0,0 +1,111 @@
+# NimVM - Formal Grammar Specification
+
+Our grammar is inspired by (and extended from) the Lox language as described in Bob Nystrom's book "Crafting Interpreters",
+available at https://craftinginterpreters.com, and follows the EBNF standard, but for clarity the relevant syntax will
+be explained below.
+
+## Disclaimer
+----------------------------------------------
+The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
+"OPTIONAL" in this document are to be interpreted as described in [RFC2119](https://datatracker.ietf.org/doc/html/rfc2119).
+
+Literals in this document will often be surrounded by double quotes to make it obvious they're not part of a sentence. To
+avoid ambiguity, this document will always specify explicitly if double quotes need to be considered as part of a term or not,
+which means that if it is not otherwise stated they are to be considered part of said term. In addition to quotes, literals
+may be formatted in monospace to make them stand out more in the document.
+
+## EBNF Syntax & Formatting rules
+----------------------------------------------
+As a refresher for experienced users, and to ease reading for newcomers, the variation of EBNF used in this
+document can be summarized with the following points:
+- A sequence of 2 slashes (character code 47) is used to mark comments. A comment lasts until
+  a CRLF or LF character (basically the end of a line) is encountered.
+  It is RECOMMENDED to use them to clarify each rule, or a group of rules, to simplify human inspection of the specification
+- Spaces, tabs, newlines and form feeds (character code 32, 9, 10 and 12 respectively) are not
+  relevant to the grammar and SHOULD be ignored by automated parsers and parser generators
+- `"*"` (without quotes, character code 42) is used for repetition of a rule, meaning it MUST match 0 or more times
+- `"+"` (without quotes, character code 43) is used for repetition of a rule, meaning it MUST match 1 or more times
+- `"?"` (without quotes, character code 63) marks a rule as optional, meaning it MUST match 0 or 1 times. Parentheses
+  group several terms so that an operator applies to the whole group
+- `"|"` (without quotes, character code 124) is used to indicate alternatives and means a rule may match either the first or
+  the second rule. This operator can be chained to obtain something like "foo | bar | baz", meaning that any of
+  foo, bar or baz is a valid match for the rule
+- `"{x,y}"` (without quotes) is used for repetition, meaning a rule MUST match from x to y times (start to end, inclusive).
+  Omitting x means the rule MUST match at least 0 times and at most y times, while omitting y means the rule
+  MUST match at least x times, with no upper bound. Omitting both x and y is the same as using *
+- Each rule MUST end with an ASCII semicolon (";" without quotes, character code 59)
+- Rules are listed in descending order: the last rule is the highest-precedence one. Think of it as "more complex rules
+  come first"
+- An "arrow" (character code 8594) MUST be used to separate rule names from their definition.
+  A rule definition then looks something like this (without quotes): "name → rule definition here; // optional comment"
+- Literal numbers can be expressed in their decimal form (i.e. with arabic numbers). Other supported formats are
+  hexadecimal using the prefix 0x, octal using the prefix 0o, and binary using the prefix 0b.
+  For example, the literals 0x7F, 0b1111111 and 0o177 all represent the decimal number 127 in hexadecimal, binary and
+  octal respectively
+- The literal "EOF" (without quotes) represents the end of the input stream and is a shorthand for "End Of File"
+- Ranges can be defined by separating the start and the end of the range with three dots (character code 46) and
+  are inclusive at both ends. Both the start and the end of the range are mandatory and it is RECOMMENDED that they
+  be separated from the three dots by a space for easier reading. Ranges can define numerical sets like in `"0 ... 9"` (without quotes),
+  or lexicographical ones such as `"'a' ... 'z'"` (without quotes), in which case the range should be interpreted as a sequence of the
+  character codes between the start and end of the range. It is REQUIRED that the first element in the range is less than
+  or equal to the last one: backwards ranges are illegal. In addition to this, although numerical ranges can use any
+  combination of the supported number representations (meaning `'0 ... 0x10'` is a valid range encompassing all decimal
+  numbers from 0 to 16) it is RECOMMENDED that the representation used is consistent across the start and end of the range.
+  Finally, a range can have a character as one endpoint and a number as the other, in which case the character is to be
+  interpreted as its character code in decimal
+- For readability purposes, it is RECOMMENDED that the grammar text be left aligned and that spaces are used between
+  operators
+- Literal strings MUST be delimited by matching pairs of double or single quotes (character code 34 and 39) and SHOULD be separated
+  from any other term in the grammar by a space
+- Terminal symbols SHOULD use all-uppercase names to ease readability
+- Characters inside strings can be escaped using backslashes.
+  For example, to add a literal double quote inside a double-quoted string, one MUST
+  write `"\""` (without quotes), although it is recommended to use single quotes in this case (i.e. `'"'` instead)
+
+## EBNF Grammar
+----------------------------------------------
+Below you can find the EBNF specification of NimVM's grammar.
+
+```
+// Top-level code
+program → declaration* EOF; // An entire program (Note: an empty program is a valid program)
+
+// Declarations (rules that bind a name to an object in the current scope and produce side effects)
+declaration → structDecl | funDecl | varDecl | statement; // A program is composed of a list of declarations
+structDecl → "struct" IDENTIFIER "{" varDecl* "}"; // Declares a structure type similar to C's
+funDecl → "fun" function; // Function declarations
+varDecl → ( "var" | "let" | "const" ) IDENTIFIER ( "=" expression )? ";"; // Constants and immutables still count as "variable" declarations in the grammar
+
+// Statements (rules that produce side effects but without binding a name)
+statement → exprStmt | forStmt | ifStmt | returnStmt | breakStmt | continueStmt | whileStmt | blockStmt; // The set of all statements
+exprStmt → expression ";"; // Any expression followed by a semicolon is technically a statement
+forStmt → "for" "(" ( varDecl | exprStmt | ";" ) expression? ";" expression? ")" statement; // C-style for loops
+ifStmt → "if" "(" expression ")" statement ( "else" statement )?; // If statements are conditional jumps
+returnStmt → "return" expression? ";"; // Returns from a function, illegal in top-level code
+breakStmt → "break" ";";
+continueStmt → "continue" ";";
+whileStmt → "while" "(" expression ")" statement; // While loops run until their condition is falsey
+blockStmt → "{" declaration* "}"; // Blocks create a new scope that lasts until they're closed
+// Expressions (rules that produce a value, but may also have side effects)
+expression → assignment;
+assignment → ( call "." )?
+             IDENTIFIER "=" assignment | logic_or; // Assignment is the highest-level expression
+logic_or → logic_and ( "||" logic_and )*;
+logic_and → equality ( "&&" equality )*;
+equality → comparison ( ( "!=" | "==" ) comparison )*;
+comparison → term ( ( ">" | ">=" | "<" | "<=" ) term )*;
+term → factor ( ( "-" | "+" ) factor )*; // Precedence for + and - in operations
+factor → unary ( ( "/" | "*" | "**" | "^" | "&" ) unary )*; // All other binary operators share the same precedence
+unary → ( "!" | "-" | "~" ) unary | call;
+call → primary ( "(" arguments? ")" | "." IDENTIFIER )*;
+primary → "true" | "false" | "nil" | NUMBER | STRING | IDENTIFIER | "(" expression ")"; // Property access on a grouping is handled by call
+
+// Utility rules to avoid repetition
+function → IDENTIFIER "(" parameters? ")" blockStmt;
+parameters → IDENTIFIER ( "," IDENTIFIER )*;
+arguments → expression ( "," expression )*;
+
+// Lexical grammar that defines terminals in a non-recursive (aka regular) fashion
+NUMBER → DIGIT+ ( ( "." | "e" | "E" ) DIGIT+ )?; // Numbers encompass integers and floats (even stuff like 1e5)
+STRING → "\"" UNICODE* "\""; // Strings can contain arbitrary unicode inside them
+IDENTIFIER → ALPHA ( ALPHA | DIGIT )*; // Valid identifiers are only alphanumeric (underscores included) and cannot start with a digit!
+ALPHA → "a" ... "z" | "A" ... "Z" | "_"; // Alphabetic characters and the underscore
+UNICODE → 0x00 ... 0x10FFFD; // This covers the whole unicode range
+DIGIT → "0" ... "9"; // Arabic digits
+```
\ No newline at end of file
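
The way the `term`, `factor`, `unary` and `primary` rules encode operator precedence can be illustrated with a small recursive descent parser, where each rule becomes one function that calls the next rule down. The sketch below is in Python purely for illustration and is not NimVM's implementation: the names `tokenize`, `Parser` and `parse` are hypothetical, only a subset of the operators is handled, NUMBER is read as `DIGIT+ ( ( "." | "e" | "E" ) DIGIT+ )?`, and `primary` is assumed to allow a plain parenthesized expression.

```python
import re

# Token patterns transcribed from the lexical rules. Assumption: NUMBER is
# read as DIGIT+ ( ( "." | "e" | "E" ) DIGIT+ )?.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+(?:[.eE]\d+)?)
  | (?P<IDENTIFIER>[A-Za-z_][A-Za-z_0-9]*)
  | (?P<OP>\*\*|[-+*/()])
  | (?P<WS>\s+)
""", re.VERBOSE)


def tokenize(source):
    """Split source into (kind, text) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    tokens.append(("EOF", ""))  # EOF marks the end of the input stream
    return tokens


class Parser:
    """One method per grammar rule; lower rules bind more tightly."""

    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos][1]

    def advance(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def term(self):
        # term → factor ( ( "-" | "+" ) factor )*
        node = self.factor()
        while self.peek() in ("-", "+"):
            op = self.advance()[1]
            node = (op, node, self.factor())
        return node

    def factor(self):
        # factor → unary ( ( "/" | "*" | "**" ) unary )*  (operator subset)
        node = self.unary()
        while self.peek() in ("/", "*", "**"):
            op = self.advance()[1]
            node = (op, node, self.unary())
        return node

    def unary(self):
        # unary → "-" unary | primary  (the call rule is skipped here)
        if self.peek() == "-":
            self.advance()
            return ("neg", self.unary())
        return self.primary()

    def primary(self):
        # primary → NUMBER | IDENTIFIER | "(" expression ")"  (subset)
        kind, text = self.advance()
        if kind == "NUMBER":
            return float(text)
        if kind == "IDENTIFIER":
            return ("var", text)
        if text == "(":
            node = self.term()
            if self.advance()[1] != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {text!r}")


def parse(source):
    """Parse an arithmetic expression into a nested-tuple syntax tree."""
    parser = Parser(tokenize(source))
    tree = parser.term()
    if parser.peek() != "":
        raise SyntaxError(f"unexpected trailing token {parser.peek()!r}")
    return tree
```

With this sketch, `parse("1 + 2 * 3")` returns `("+", 1.0, ("*", 2.0, 3.0))`: `term` has to call `factor` before it can look for a `+`, so the multiplication ends up deeper in the tree and is evaluated first. The `while` loops make the binary operators left-associative, matching the `( ... )*` repetition in the rules.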