12 KiB

Raw Permalink Blame History

JAPL - Formal Grammar Specification

Note: This document is currently a draft and is therefore incomplete

Rationale

The purpose of this document is to provide an unambiguous formal specification of peon's syntax for use in automated compiler generators (known as "compiler compilers") and parsers.

Our grammar is inspired by (and extended from) the Lox language as described in Bob Nystrom's book "Crafting Interpreters", available at https://craftinginterpreters.com, and follows the EBNF standard, but for clarity the relevant syntax will be explained below.

Disclaimer

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC2119.

Literals in this document will be often surrounded by double quotes to make it obvious they're not part of a sentence. To avoid ambiguity, this document will always specify explicitly if double quotes need to be considered as part of a term or not, which means that if it is not otherwise stated they are to be considered part of said term. In addition to quotes, literals may be formatted in monospace to make them stand out more in the document.

EBNF Syntax & Formatting rules

As a refresher to experienced users as well as to facilitate reading to newcomers, the variation of EBNF used in this document is detailed below:

The literal "LF" (without quotes) is a shorthand for "Line Feed". It symbolizes the end of a line and it's platform-independent
A pair of 2 forward-slashes (character code 47) is used to mark comments. A comment lasts until the the end of a line is encountered. It is RECOMMENDED to use them to clarify each rule, or a group of rules, to simplify human inspection of the specification
The name of non-terminal productions MUST be in lowercase (such as foo), while for terminals it MUST be in uppercase (such as FOO)
Whitespaces, tabs, newlines and form feeds (character codes 32, 9, 10 and 12 respectively) are not relevant to the grammar and MUST be ignored by automated parsers and parser generators
"*" (without quotes, character code 42) is used for repetition of a rule, meaning it MUST match 0 or more times
"?" (without quotes, character code 63) means a rule can match 0 or 1 times
"+" (character code 43) is used for repetition of a rule, meaning it MUST match 1 or more times
"|" (without quotes, character code 123) is used to indicate alternatives and means a rule may match either the first or the second rule. This operator can be chained to obtain something like "foo" | "bar" | "baz", meaning that either the literal strings foo, bar or baz are valid matches for the rule
"{x,y}" (without quotes) is used for repetition, meaning a rule MUST match from x to y times (start to end, inclusive). Omitting x means the rule MUST match at least 0 times and at most x times, while omitting y means the rule MUST match exactly y times. Omitting both x and y is the same as using *
Production rules are terminated with an ASCII semicolon (COLON without quotes, character code 59)
Rules are listed in descending order: the last rule is the highest-precedence one. Think of it as a "more complex rules come first"
An "arrow" (character code 8594) MUST be used to separate rule names from their definition. A rule definition, then, looks something like this (without quotes): "name → rule definition here; // optional comment"
Literal numbers can be expressed in their decimal form (i.e. with arabic numbers). Other supported formats are hexadecimal using the prefix 0x, octal using the prefix 0o, and binary using the prefix 0b. For example, the literals 0x7F, 0b1111111 and 0o177 all represent the decimal number 127 in hexadecimal, binary and octal respectively
The literal "EOF" (without quotes), represents the end of the input stream and is a shorthand for "End Of File"
Ranges can be defined by separating the start and the end of the range with three dots (character code 46) and are inclusive at both ends. Both the start and the end of the range are mandatory and it is RECOMMENDED that they be separated by the three dots with a space for ease of reading. Ranges can define numerical sets like in "0 ... 9" (without quotes), or lexicographical ones such as "'a' ... 'z'" (without quotes), in which case the range should be interpreted as a sequence of the character codes between the start and end of the range. Ranges are inclusive at both ends. It is REQUIRED that the first element in the range is greater or equal to the last one: backwards ranges are illegal. In addition to this, although numerical ranges can use any combination of the supported number representation (meaning '0 ... 0x10' is a valid range encompassing all decimal numbers from 0 to 16) it is RECOMMENDED that the representation used is consistent across the start and end of the range. Finally, ranges can have a character and a number as either start or end of them, in which case the character is to be interpreted as its character code in decimal
For readability purposes, it is RECOMMENTED that the grammar text be left aligned and that spaces are used between operators
Literal strings MUST be delimited by matching pairs of double or single quotes (character code 34 and 39) and SHOULD be separated by any other term in the grammar by a space
Characters inside strings can be escaped using backslashes. For example, to add a literal double quote inside a double-quoted string, one MUST write "\"" (without quotes), althoguh it is recommended to use single quotes in this case (i.e. '"' instead)

EBNF Grammar

Below you can find the EBNF specification of peon's grammar.

// Top-level code
program        → declaration* EOF; // An entire program (Note: an empty program *is* a valid program)

// Declarations (rules that bind a name to an object in the current scope and produce no side effects)

// A program is composed by a list of declarations
declaration    → funDecl | varDecl | coroDecl | statement;
// Function declarations
funDecl        → "fn" function;
coroDecl       → "coro" function;
// Constants still count as "variable" declarations in the grammar
varDecl        → ("var" | "let" | "const") IDENTIFIER ( "=" expression )? COLON;


// Statements (rules that produce side effects, without binding a name. Well, mostly: import, foreach and others do, but they're exceptions to the rule)
statement      → exprStmt | ifStmt | returnStmt| whileStmt| blockStmt;  // The set of all statements
// Any expression followed by a semicolon is an expression statement
exprStmt       → expression COLON;
// Returns from a function, illegal in top-level code. An empty return statement is illegal
// in non-void functions
returnStmt     → "return" expression? COLON;
// Defers the evaluation of the given expression right before a function exits, illegal in top-level code. 
// Semantically and functionally equivalent to wrapping a function in a big try block and executing the 
// expression in the finally block, but less verbose
deferStmt      → "defer" expression COLON;
// Breaks out of a loop or named block
breakStmt      → "break" IDENTIFIER? COLON;
// Skips to the next iteration in a loop or jumps to the
// beginning of a named block
continueStmt   → "continue" IDENTIFIER? COLON;
importStmt     -> ("from" IDENTIFIER)? "import" (IDENTIFIER ("as" IDENTIFIER)? ","?)+ COLON;  // Imports one or more modules in the current scope. Creates a namespace
assertStmt     → "assert" expression COLON;
yieldStmt      → "yield" expression? COLON;
// Pauses the execution of the calling coroutine and calls the given coroutine. Execution continues when the callee returns
awaitStmt      → "await" expression COLON;
// Exception handling
tryStmt        → "try" "{" statement* "}" (except+ "finally" statement | "finally" statement | "else" statement | except+ "else" statement | except+ "else" statement "finally" statement);
// Blocks create a new scope that lasts until they're closed
blockStmt      → "{" declaration* "}";
// Named blocks are useful for breaking out of deeply nested loops
namedBlock     → "block" IDENTIFIER "{" declaration* "}";
// If statements are conditional jumps
ifStmt         → "if" expression "{" statement* "}" ("else" "{" statement* "}")?;
// While loops run until their condition is true
whileStmt      → "while" expression "{" statement* "}";
// For-each loops iterate over a collection type
foreachStmt    → "foreach" "(" (IDENTIFIER ":" expression) ")" "{" statement* "}";


// Expressions (rules that produce a value and may have side effects)

// Assignment is the highest-level expression
expression     → assignment;
assignment     → (call ".")? IDENTIFIER ASSIGNTOKENS assignment | lambdaExpr;
lambdaExpr     → "lambda" lambda;  // Lambdas are anonymous functions, so they act as expressions
yieldExpr      → "yield" expression?; // Empty yield equals yield nil
awaitExpr      → "await" expression;
logic_or       → logic_and ("and" logic_and)*;
logic_and      → equality ("or" equality)*;
equality       → comparison (("!=" | "==") comparison)*;
comparison     → term ((">" | ">=" | "<" | "<=" | "as" | "is" | "of") term)*;
term           → factor (("-" | "+") factor)*;  // Precedence for + and - in operations
factor         → unary (("/" | "*" | "**" | "^" | "&") unary)*;  // All other binary operators have the same precedence
unary          → ("!" | "-" | "~") unary | call;
slice          → expression "[" expression (":" expression){0,2} "]"
call           → primary ("(" arguments? ")" | "." IDENTIFIER)*;
// Below are some collection literals: lists, sets, dictionaries and tuples
listExpr       → "[" arguments* "]";
// Note: "{}" is an empty dictionary, NOT an empty set
setExpr        → "{" arguments? "}";
dictExpr       → "{" (expression ":" expression ("," expression ":" expression)*)* "}"; // {key: value, ...}
tupleExpr      → "(" arguments* ")";
primary        → "nan" | "true" | "false" | "nil" | "inf" | NUMBER | STRING | IDENTIFIER | "(" expression ")" "." IDENTIFIER;

// Utility rules to avoid repetition
function       → IDENTIFIER ("(" parameters? ")")? blockStmt;
lambda         → ("(" parameters? ")")? blockStmt;
// ident: type [, ident2: type2, ...]
parameters     → IDENTIFIER ":" IDENTIFIER ("," IDENTIFIER)*;
arguments      → expression ("," expression)*;
except         → ("except" expression? statement)


// These are all the terminals (i.e. productions defined non-recursively)
COMMENT        → "#" UNICODE* LF;
COLON          → COLON;
SINGLESTRING   → QUOTE UNICODE* QUOTE;
DOUBLESTRING   → DOUBLEQUOTE UNICODE* DOUBLEQUOTE;
SINGLEMULTI    → QUOTE{3} UNICODE* QUOTE{3};   // Single quoted multi-line strings
DOUBLEMULTI    → DOUBLEQUOTE{3} UNICODE* DOUBLEQUOTE{3};  // Double quoted multi-line string
DECIMAL        → DIGIT+;
FLOAT          → DIGIT+ ("." DIGIT+)? (("e" | "E") DIGIT+)?;
BIN            → "0b" ("0" | "1")+;
OCT            → "0o" ("0" ... "7")+;
HEX            → "0x" ("0" ... "9" | "A" ... "F" | "a" ... "f")+;
NUMBER         → DECIMAL | FLOAT | BIN | HEX | OCT;
STRING         → ("r"|"b"|"f")? SINGLESTRING | DOUBLESTRING | SINGLEMULTI | DOUBLEMULTI;
IDENTIFIER     → ALPHA (ALPHA | DIGIT)*;  // Valid identifiers are only alphanumeric!
QUOTE          → "'";
DOUBLEQUOTE    → "\"";
IDENTCHARS     → "a" ... "z" | "A" ... "Z" | "_"; 
UNICODE        → 0x00 ... 0x10FFFD;  // This covers the whole unicode range
DIGIT          → "0" ... "9";
ASSIGNTOKENS   → "+=" | "-=" | "*="  | "/=" | "%=" | "&=" | "|=" | "^=" | "<<=" | ">>=" | "**=" | "//=" | "="

12 KiB Raw Permalink Blame History