diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..261eeb9 --- /dev/null +++ b/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). 
+ + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. 
Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative 
Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. 
Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. 
+ + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/README.md b/README.md index 2766ff4..7c1ae5f 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,126 @@ -# peon-rewrite +# The peon programming language -Work in progress for Peon 0.2.x \ No newline at end of file +Peon is a modern, multi-paradigm, async-first programming language with a focus on correctness and speed. + +[Go to the Manual](docs/manual.md) + + +## What's peon? + +__Note__: For simplicity reasons, the verbs in this section refer to the present even though part of what's described here is not implemented yet. 
+ + +Peon is a multi-paradigm, statically-typed programming language inspired by C, Nim, Python, Rust and C++: it supports modern, high-level +features such as automatic type inference, parametrically polymorphic generic types, pure functions, closures, interfaces, single inheritance, +reference types, templates, coroutines, raw pointers and exceptions. + +The memory management model is rather simple: a Mark and Sweep garbage collector is employed to reclaim unused memory, although more garbage +collection strategies (such as generational GC or deferred reference counting) are planned to be added in the future. + +Peon features a native cooperative concurrency model designed to take advantage of the inherent waiting of typical I/O workloads, without the use of more than one OS thread (wherever possible), allowing for much greater efficiency and a smaller memory footprint. The asynchronous model used forces developers to write code that is both easy to reason about, thanks to the [Structured concurrency](https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/) model that is core to peon's async event loop implementation, and works as expected every time (without dropping signals, exceptions, or task return values). + +Other notable features are the ability to define (and overload) custom operators with ease by implementing them as language-level functions, [Universal function call syntax](https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax), [Name stropping](https://en.wikipedia.org/wiki/Stropping_(syntax)) and named scopes. + +In peon, all objects are first-class (this includes functions, iterators, closures and coroutines). + +## Disclaimers + +**Disclaimer 1**: The project is still in its very early days: lots of stuff is not implemented, a work in progress or +otherwise outright broken. Feel free to report bugs! 
+ +**Disclaimer 2**: Currently, the `std` module _always_ has to be imported explicitly for even the most basic snippets to work. This is because intrinsic types and builtin operators are defined within it: if it is not imported, peon won't even know how to parse `2 + 2` (and even if it could, it would have no idea what the type of the expression would be). You can have a look at the [peon standard library](src/peon/stdlib) to see how the builtins are defined (be aware that they heavily rely on compiler black magic to work) and can even provide your own implementation if you're so inclined. + + +### TODO List + +In no particular order, here's a list of stuff that's done/to do (might be incomplete/out of date): + - User-defined types + - Function calls ✅ + - Control flow (if-then-else, switch) ✅ + - Looping (while) ✅ + - Iteration (foreach) + - Type conversions + - Type casting + - Intrinsics ✅ + - Type unions ✅ + - Functions ✅ + - Closures + - Managed references + - Unmanaged references + - Named scopes/blocks ✅ + - Inheritance + - Interfaces + - Generics ✅ + - Automatic types ✅ + - Iterators/Generators + - Coroutines + - Pragmas ✅ + - Attribute resolution ✅ + - Universal Function Call Syntax + - Import system ✅ + - Exceptions + - Templates (_not_ like C++ templates) ✅ + - Optimizations (constant folding, branch and dead code elimination, inlining) + + +## Feature wishlist + +Here's a random list of high-level features I would like peon to have and that I think are kinda neat (some may +have been implemented already): +- Reference types are not nullable by default (must use `#pragma[nullable]`) +- The `commutative` pragma, which allows defining just one implementation of an operator + and have it become commutative +- Easy C/Nim interop via FFI +- C/C++ backend +- Nim backend +- [Structured concurrency](https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/) (must-have!) +- Simple OOP (with multiple dispatch!)
+- RTTI, with methods that dispatch at runtime based on the true (aka runtime) type of a value +- Limited compile-time evaluation (embed the Peon VM in the C/C++/Nim backend and use that to execute peon code at compile time) + + +## The name + +The name for peon comes from [Productive2's](https://git.nocturn9x.space/prod2) genius cute brain, and is a result of shortening +the name of the fastest animal on earth: the **Pe**regrine Falc**on**. I guess I wanted this to mean peon will be blazing fast (I +certainly hope so!) + +# Peon needs you. + +No, but really. I need help. This project is huge and (IMHO) awesome, but there's a lot of non-trivial work to do and doing +it with other people is just plain more fun and rewarding. If you want to get involved, definitely try [contacting](https://nocturn9x.space/contact) me +or open an issue/PR! + + +# Credits + +- Araq, for creating the amazing language that is [Nim](https://nim-lang.org) (as well as all of its contributors!) +- Guido van Rossum, aka the chad who created [Python](https://python.org) and its awesome community and resources +- The Nim community and contributors, for making Nim what it is today +- Bob Nystrom, for his amazing [book](https://craftinginterpreters.com) that inspired me + and taught me how to actually make a programming language (kinda, I'm still very dumb) +- [Njsmith](https://vorpus.org/), for his awesome articles on structured concurrency +- All the amazing people in the [r/ProgrammingLanguages](https://reddit.com/r/ProgrammingLanguages) subreddit and its [Discord](https://discord.gg/tuFCPmB7Un) server +- [Art](https://git.nocturn9x.space/art) <3 +- Everyone who listened (and still listens) to me ramble about compilers, programming languages and the like (and for giving me ideas and testing peon!) +- ... More? (I'd thank the contributors but it's just me :P) +- Me! I guess + + +## Ok, cool, how do I use it? + +Great question!
If this README somehow didn't turn you away already (thanks, by the way), then you may want to try peon +out for yourself. Fortunately, the process is quite straightforward: + +- First, you're gonna have to install [Nim](https://nim-lang.org/), the language peon is written in. I highly recommend + using [choosenim](https://github.com/dom96/choosenim) to manage your Nim installations as it makes switching between them and updating them a breeze +- Then, clone this repository and compile peon in release mode with `nim c -d:release --passC:"-flto" -o:peon src/main`, which should produce a `peon` binary + ready for you to play with (if your C toolchain doesn't support LTO then you can just omit the `--passC` option, although that would be pretty weird for + a modern linker) +- If you want to move the executable to a different directory (say, into your `PATH`), you should copy peon's standard + library (found in `/src/peon/stdlib`) into a known folder, edit the `moduleLookupPaths` variable inside `src/config.nim` + by adding said folder to it so that the peon compiler knows where to find modules when you `import std;` and then recompile + peon.
Hopefully I will automate this soon, but as of right now the work is all manual + + +__Note__: On Linux, peon will also look into `~/.local/peon/stdlib` by default, so you can just create the `~/.local/peon` folder and copy `src/peon/stdlib` there \ No newline at end of file diff --git a/docs/.vscode/settings.json b/docs/.vscode/settings.json new file mode 100644 index 0000000..65e1ec0 --- /dev/null +++ b/docs/.vscode/settings.json @@ -0,0 +1,3 @@ +{ + "makefile.extensionOutputFolder": "./.vscode" +} \ No newline at end of file diff --git a/docs/bytecode.md b/docs/bytecode.md new file mode 100644 index 0000000..974b5f9 --- /dev/null +++ b/docs/bytecode.md @@ -0,0 +1,113 @@ +# Peon - Bytecode Specification + +This document describes peon's bytecode as well as how it is (de-)serialized to/from files and +other file-like objects. Note that the segments in a bytecode dump appear in the order they are listed +in this document. + +## Code Structure + +A peon program is compiled into a tightly packed sequence of bytes that contains all the necessary information +the VM needs to execute said program. There is no dependence between the frontend and the backend outside of the +bytecode format (which is implemented in a separate serializer module) to allow for maximum modularity. + +A peon bytecode file contains the following: + +- Constants +- The program's code +- Debugging information (file and version metadata, module info.
Optional) + + +## File Headers + +A peon bytecode file starts with the header, which is structured as follows: + +- The literal string `PEON_BYTECODE` +- A 3-byte version number (the major, minor and patch version numbers of the compiler that generated the file) +- The branch name of the repository the compiler was built from, prepended with its length as a 1 byte integer +- The commit hash (encoded as a 40-byte hex string) in the aforementioned branch from which the compiler was built (particularly useful in development builds) +- An 8-byte UNIX timestamp (with Epoch 0 starting at 1/1/1970 12:00 AM) representing the exact date and time when the file was generated + +## Debug information + +The following segments contain extra information and metadata about the compiled bytecode to aid debugging, but they may be missing +in release builds. + +### Line data segment + +The line data segment contains information about each instruction in the code segment and associates them +1:1 with a line number in the original source file for easier debugging using run-length encoding. The segment's +size is fixed and is encoded at the beginning as a sequence of 4 bytes (i.e. a single 32 bit integer). The data +in this segment can be decoded as explained in [this file](../src/frontend/compiler/targets/bytecode/opcodes.nim#L29), which is quoted +below: +``` +[...] +## lines maps bytecode instructions to line numbers using Run +## Length Encoding. Instructions are encoded in groups whose structure +## follows the following schema: +## - The first integer represents the line number +## - The second integer represents the number of +## instructions on that line +## For example, if lines equals [1, 5], it means that there are 5 instructions +## at line 1, meaning that all instructions in code[0..4] belong to the same line.
+## This is more efficient than using the naive approach, which would encode +## the same line number multiple times and waste considerable amounts of space. +[...] +``` + +### Functions segment + +This segment contains details about each function in the original file. The segment's size is fixed and is encoded at the +beginning as a sequence of 4 bytes (i.e. a single 32 bit integer). The data in this segment can be decoded as explained +in [this file](../src/frontend/compiler/targets/bytecode/opcodes.nim#L39), which is quoted below: + +``` +[...] +## functions encodes the following information: +## - Function name +## - Argument count +## - Function boundaries +## The encoding for functions is the following: +## - First, the position into the bytecode where the function begins is encoded (as a 3 byte integer) +## - Second, the position into the bytecode where the function ends is encoded (as a 3 byte integer) +## - After that follows the argument count as a 1 byte integer +## - Lastly, the function's name (optional) is encoded in ASCII, prepended with +## its size as a 2-byte integer +[...] +``` + +### Modules segment + +This segment contains details about the modules that make up the original source code which produced a given bytecode dump. +The data in this segment can be decoded as explained in [this file](../src/frontend/compiler/targets/bytecode/opcodes.nim#L49), which is quoted below: +``` +[...] +## modules contains information about all the peon modules that the compiler has encountered, +## along with their start/end offset in the code. Unlike other bytecode-compiled languages like +## Python, peon does not produce a bytecode file for each separate module it compiles: everything +## is contained within a single binary blob. While this simplifies the implementation and makes +## bytecode files entirely "self-hosted", it also means that the original module information is +## lost: this segment serves to fix that. 
The segment's size is encoded at the beginning as a 4-byte +## sequence (i.e. a single 32-bit integer) and its encoding is similar to that of the functions segment: +## - First, the position into the bytecode where the module begins is encoded (as a 3 byte integer) +## - Second, the position into the bytecode where the module ends is encoded (as a 3 byte integer) +## - Lastly, the module's name is encoded in ASCII, prepended with its size as a 2-byte integer +[...] +``` + + +## Constant segment + +The constant segment contains all the read-only values that the code will need at runtime, such as hardcoded +variable initializers or constant expressions. It is similar to the `.rodata` section of Assembly files, although +the implementation is different. Constants are encoded as a linear sequence of bytes with no type information about +them whatsoever: it is the code that, at runtime, loads each constant (whose type is determined at compile time) onto +the stack accordingly. For example, a 32 bit integer constant would be encoded as a sequence of 4 bytes, which would +then be loaded by the appropriate `LoadInt32` instruction at runtime. The segment's size is fixed and is encoded at +the beginning as a sequence of 4 bytes (i.e. a single 32 bit integer). The constant segment may be empty, although in +real-world scenarios it likely won't be. + +## Code segment + +The code segment contains the linear sequence of bytecode instructions of a peon program to be fed directly to +peon's virtual machine. The segment's size is fixed and is encoded at the beginning as a sequence of 3 bytes +(i.e. a single 24 bit integer). 
All the instructions are documented [here](../src/frontend/compiler/targets/bytecode/opcodes.nim) \ No newline at end of file diff --git a/docs/design.md b/docs/design.md new file mode 100644 index 0000000..570545f --- /dev/null +++ b/docs/design.md @@ -0,0 +1,32 @@ +# Peon design scratchpad + +This is just a random doc I made to keep track of all the design changes I have +in mind for Peon: with this being my first serious attempt at making a programming +language that's actually _useful_, I want to get the design right the first time +(no one wants to make JavaScript 2.0, right? _Right?_). + + +The basic idea is: +- Some peon code comes in (from a file or as command-line input, doesn't matter) +- It gets tokenized and parsed into a typeless AST +- The compiler processes the typeless AST into a typed one +- The typed AST is passed to an optional optimizer module, which spits + out another (potentially identical) typed AST representing the optimized + program. The optimizer is always run even when optimizations are disabled, + as it takes care of performing closure conversion and other cool stuff +- The typed AST is passed to a code generator module that is specific to every + backend/platform, which actually takes care of producing the code that will + then be executed + + +The current design is fairly modular and some parts of the codebase are more final +than others: for example, the lexer and parser are more or less complete and unlikely +to undergo massive changes in the future, as opposed to the compiler, which has been subject +to many major refactoring steps as the project went along, but I digress. + +The typed AST format should ideally be serializable to binary files so that I can slot in +different optimizer/code generator modules written in different languages without the need +to use FFI.
The format will serve a similar purpose to the IR used by gcc (GIMPLE), but instead +of being an RTL-like language it'll operate on a much higher level since we don't really need to +support any other programming language other than peon itself (while gcc has to be interoperable +with FORTRAN and other stuff). \ No newline at end of file diff --git a/docs/grammar.md b/docs/grammar.md new file mode 100644 index 0000000..a6344f0 --- /dev/null +++ b/docs/grammar.md @@ -0,0 +1,179 @@ +# Peon - Formal Grammar Specification + +__Note__: This document is currently a draft and is therefore incomplete + +## Rationale +The purpose of this document is to provide an unambiguous formal specification of peon's syntax for use in automated +compiler generators (known as "compiler compilers") and parsers. + +Our grammar is inspired by (and extended from) the Lox language as described in Bob Nystrom's book "Crafting Interpreters", +available at https://craftinginterpreters.com, and follows the EBNF standard, but for clarity the relevant syntax will +be explained below. + +## Disclaimer +---------------------------------------------- +The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and +"OPTIONAL" in this document are to be interpreted as described in [RFC2119](https://datatracker.ietf.org/doc/html/rfc2119). + +Literals in this document will be often surrounded by double quotes to make it obvious they're not part of a sentence. To +avoid ambiguity, this document will always specify explicitly if double quotes need to be considered as part of a term or not, +which means that if it is not otherwise stated they are to be considered part of said term. In addition to quotes, literals +may be formatted in monospace to make them stand out more in the document. 
+ +## EBNF Syntax & Formatting rules +---------------------------------------------- +As a refresher for experienced users, and to make reading easier for newcomers, the variation of EBNF used in this document +is detailed below: +- The literal `"LF"` (without quotes) is a shorthand for "Line Feed". It symbolizes the end of a line and is platform-independent +- A pair of forward slashes (character code 47) is used to mark comments. A comment lasts until + the end of a line is encountered. It is RECOMMENDED to use them to clarify each rule, or a group of rules, + to simplify human inspection of the specification +- The name of non-terminal productions MUST be in lowercase (such as `foo`), while for terminals it MUST be in uppercase (such as `FOO`) +- Spaces, tabs, newlines and form feeds (character codes 32, 9, 10 and 12 respectively) are not + relevant to the grammar and MUST be ignored by automated parsers and parser generators +- `"*"` (without quotes, character code 42) is used for repetition of a rule, meaning it MUST match 0 or more times +- `"?"` (without quotes, character code 63) means a rule can match 0 or 1 times +- `"+"` (without quotes, character code 43) is used for repetition of a rule, meaning it MUST match 1 or more times +- `"|"` (without quotes, character code 124) is used to indicate alternatives and means a rule may match either the first or + the second rule. This operator can be chained to obtain something like `"foo" | "bar" | "baz"`, meaning that any of + the literal strings foo, bar or baz is a valid match for the rule +- `"{x,y}"` (without quotes) is used for repetition, meaning a rule MUST match from x to y times (start to end, inclusive). + Omitting x means the rule MUST match at least 0 times and at most y times, while omitting y means the rule + MUST match at least x times.
Omitting both x and y is the same as using `*` +- Production rules are terminated with an ASCII semicolon (`COLON` without quotes, character code 59) +- Rules are listed in descending order: the last rule is the highest-precedence one. Think of it as "more complex rules + come first" +- An "arrow" (character code 8594) MUST be used to separate rule names from their definition. + A rule definition, then, looks something like this (without quotes): `"name → rule definition here; // optional comment"` +- Literal numbers can be expressed in their decimal form (i.e. with Arabic numerals). Other supported formats are + hexadecimal using the prefix `0x`, octal using the prefix `0o`, and binary using the prefix `0b`. For example, + the literals `0x7F`, `0b1111111` and `0o177` all represent the decimal number `127` in hexadecimal, binary and + octal respectively +- The literal `"EOF"` (without quotes) represents the end of the input stream and is a shorthand for "End Of File" +- Ranges can be defined by separating the start and the end of the range with three dots (character code 46) and + are inclusive at both ends. Both the start and the end of the range are mandatory and it is RECOMMENDED that they + be separated from the three dots by a space for ease of reading. Ranges can define numerical sets like in `"0 ... 9"` + (without quotes), or lexicographical ones such as `"'a' ... 'z'"` (without quotes), in which case the range should be + interpreted as a sequence of the character codes between the start and end of the range. + It is REQUIRED that the first element in the range is less than or equal to the last one: backwards ranges are illegal. + In addition to this, although numerical ranges can use any combination of the supported number representations + (meaning `'0 ...
0x10'` is a valid range encompassing all decimal numbers from 0 to 16) it is RECOMMENDED that + the representation used is consistent across the start and end of the range. Finally, ranges can mix a character + and a number as their start and end, in which case the character is to be interpreted as its character code in decimal +- For readability purposes, it is RECOMMENDED that the grammar text be left-aligned and that spaces be used between + operators +- Literal strings MUST be delimited by matching pairs of double or single quotes (character codes 34 and 39 respectively) and SHOULD be separated + from any other term in the grammar by a space +- Characters inside strings can be escaped using backslashes. For example, to add a literal double quote inside a double-quoted string, one MUST + write `"\""` (without quotes), although it is RECOMMENDED to use single quotes in this case (i.e. `'"'` instead) + +## EBNF Grammar +---------------------------------------------- +Below you can find the EBNF specification of peon's grammar. + +``` +// Top-level code +program → declaration* EOF; // An entire program (Note: an empty program *is* a valid program) + +// Declarations (rules that bind a name to an object in the current scope and produce no side effects) + +// A program is composed of a list of declarations +declaration → funDecl | varDecl | coroDecl | statement; +// Function declarations +funDecl → "fn" function; +coroDecl → "coro" function; +// Constants still count as "variable" declarations in the grammar +varDecl → ("var" | "let" | "const") IDENTIFIER ( "=" expression )? COLON; + + +// Statements (rules that produce side effects, without binding a name.
Well, mostly: import, foreach and others do, but they're exceptions to the rule) +statement → exprStmt | ifStmt | returnStmt | whileStmt | blockStmt; // The set of all statements +// Any expression followed by a semicolon is an expression statement +exprStmt → expression COLON; +// Returns from a function, illegal in top-level code. An empty return statement is illegal +// in non-void functions +returnStmt → "return" expression? COLON; +// Defers the evaluation of the given expression until right before a function exits, illegal in top-level code. +// Semantically and functionally equivalent to wrapping a function in a big try block and executing the +// expression in the finally block, but less verbose +deferStmt → "defer" expression COLON; +// Breaks out of a loop or named block +breakStmt → "break" IDENTIFIER? COLON; +// Skips to the next iteration in a loop or jumps to the +// beginning of a named block +continueStmt → "continue" IDENTIFIER? COLON; +importStmt → ("from" IDENTIFIER)? "import" (IDENTIFIER ("as" IDENTIFIER)? ","?)+ COLON; // Imports one or more modules in the current scope. Creates a namespace +assertStmt → "assert" expression COLON; +yieldStmt → "yield" expression? COLON; +// Pauses the execution of the calling coroutine and calls the given coroutine.
Execution continues when the callee returns +awaitStmt → "await" expression COLON; +// Exception handling +tryStmt → "try" "{" statement* "}" (except+ "finally" statement | "finally" statement | "else" statement | except+ "else" statement | except+ "else" statement "finally" statement); +// Blocks create a new scope that lasts until they're closed +blockStmt → "{" declaration* "}"; +// Named blocks are useful for breaking out of deeply nested loops +namedBlock → "block" IDENTIFIER "{" declaration* "}"; +// If statements are conditional jumps +ifStmt → "if" expression "{" statement* "}" ("else" "{" statement* "}")?; +// While loops run as long as their condition is true +whileStmt → "while" expression "{" statement* "}"; +// For-each loops iterate over a collection type +foreachStmt → "foreach" "(" (IDENTIFIER ":" expression) ")" "{" statement* "}"; + + +// Expressions (rules that produce a value and may have side effects) + +// Assignment is the highest-level expression +expression → assignment; +assignment → (call ".")? IDENTIFIER ASSIGNTOKENS assignment | lambdaExpr; +lambdaExpr → "lambda" lambda; // Lambdas are anonymous functions, so they act as expressions +yieldExpr → "yield" expression?; // Empty yield equals yield nil +awaitExpr → "await" expression; +logic_or → logic_and ("or" logic_and)*; +logic_and → equality ("and" equality)*; +equality → comparison (("!=" | "==") comparison)*; +comparison → term ((">" | ">=" | "<" | "<=" | "as" | "is" | "of") term)*; +term → factor (("-" | "+") factor)*; // Precedence for + and - in operations +factor → unary (("/" | "*" | "**" | "^" | "&") unary)*; // All other binary operators have the same precedence +unary → ("!" | "-" | "~") unary | call; +slice → expression "[" expression (":" expression){0,2} "]"; +call → primary ("(" arguments? ")" | "."
IDENTIFIER)*; +// Below are some collection literals: lists, sets, dictionaries and tuples +listExpr → "[" arguments* "]"; +// Note: "{}" is an empty dictionary, NOT an empty set +setExpr → "{" arguments? "}"; +dictExpr → "{" (expression ":" expression ("," expression ":" expression)*)* "}"; // {key: value, ...} +tupleExpr → "(" arguments* ")"; +primary → "nan" | "true" | "false" | "nil" | "inf" | NUMBER | STRING | IDENTIFIER | "(" expression ")" "." IDENTIFIER; + +// Utility rules to avoid repetition +function → IDENTIFIER ("(" parameters? ")")? blockStmt; +lambda → ("(" parameters? ")")? blockStmt; +// ident: type [, ident2: type2, ...] +parameters → IDENTIFIER ":" IDENTIFIER ("," IDENTIFIER)*; +arguments → expression ("," expression)*; +except → ("except" expression? statement); + + +// These are all the terminals (i.e. productions defined non-recursively) +COMMENT → "#" UNICODE* LF; +COLON → ";"; +SINGLESTRING → QUOTE UNICODE* QUOTE; +DOUBLESTRING → DOUBLEQUOTE UNICODE* DOUBLEQUOTE; +SINGLEMULTI → QUOTE{3} UNICODE* QUOTE{3}; // Single quoted multi-line strings +DOUBLEMULTI → DOUBLEQUOTE{3} UNICODE* DOUBLEQUOTE{3}; // Double quoted multi-line strings +DECIMAL → DIGIT+; +FLOAT → DIGIT+ ("." DIGIT+)? (("e" | "E") DIGIT+)?; +BIN → "0b" ("0" | "1")+; +OCT → "0o" ("0" ... "7")+; +HEX → "0x" ("0" ... "9" | "A" ... "F" | "a" ... "f")+; +NUMBER → DECIMAL | FLOAT | BIN | HEX | OCT; +STRING → ("r" | "b" | "f")? (SINGLESTRING | DOUBLESTRING | SINGLEMULTI | DOUBLEMULTI); +IDENTIFIER → ALPHA (ALPHA | DIGIT)*; // Valid identifiers are only alphanumeric! +QUOTE → "'"; +DOUBLEQUOTE → "\""; +IDENTCHARS → "a" ... "z" | "A" ... "Z" | "_"; +UNICODE → 0x00 ... 0x10FFFF; // This covers the whole unicode range +DIGIT → "0" ...
"9"; +ASSIGNTOKENS → "+=" | "-=" | "*=" | "/=" | "%=" | "&=" | "|=" | "^=" | "<<=" | ">>=" | "**=" | "//=" | "=" +``` \ No newline at end of file diff --git a/docs/manual.md b/docs/manual.md new file mode 100644 index 0000000..71f5862 --- /dev/null +++ b/docs/manual.md @@ -0,0 +1,301 @@ +# Peon - Manual + +Peon is a statically typed, garbage-collected, C-like programming language with +a focus on speed and correctness, but whose main feature is the ability to natively +perform highly efficient parallel I/O operations by implementing the [structured concurrency](https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/) +paradigm. + +__Note__: Peon is currently a WIP (Work In Progress), and much of the content of this manual is purely theoretical as +of now. If you want to help make this into a reality, feel free to contribute! + + +## Table of contents + +- [Manual](#peon---manual) +- [Design Goals](#design-goals) +- [Examples](#peon-by-example) +- [Grammar](grammar.md) +- [Bytecode](bytecode.md) + +## Design Goals + +While peon is inspired from Bob Nystrom's [book](https://craftinginterpreters.com), where he describes a simple toy language +named Lox, the aspiration for it is to become a programming language that could actually be used in the real world. For that +to happen, we need: + +- Exceptions (`try/except/finally`) +- An import system (with namespaces, like Python) +- Multithreading support (with a global VM lock when GC'ing) +- Built-in collections (list, tuple, set, etc.) +- Coroutines (w/ structured concurrency) +- Generators +- Generics +- C/Nim FFI +- A C backend (for native speed) +- A package manager + +Peon ~~steals~~ borrows many ideas from Python, Nim (the the language peon itself is written in), C and many others. + +## Peon by Example + +Here follow a few examples of peon code to make it clear what the end product should look like. 
Note that +not all examples represent working functionality and some of these examples might not be up to date either. +For more up-to-date code snippets, check the [tests](../tests/) directory. + +### Variable declarations + +``` +var x = 5; # Inferred type is int64 +var y = 3'u16; # Type is specified as uint16 +x = 6; # Works: type matches +x = 3.0; # Error: cannot assign float64 to x +var x = 3.14; # Error: cannot re-declare x +const z = 6.28; # Constant declaration +let a = "hi!"; # Cannot be reassigned/mutated +var b: int32 = 5; # Explicit type declaration (TODO) +``` + +__Note__: Peon supports [name stropping](https://en.wikipedia.org/wiki/Stropping_(syntax)), meaning + that almost any ASCII sequence of characters can be used as an identifier, including language + keywords, but stropped names need to be enclosed by matching pairs of backticks (`\``) + +### Comments + +``` +# This is a single-line comment +# Peon has no specific syntax for multi-line comments. + +fn id[T: any](x: T): T { + ## Documentation comments start + ## with two hashes. They are currently + ## unused, but will be semantically + ## relevant in the future.
They can + ## be used to document types, modules + ## and functions + return x; +} +``` + +### Functions + +``` +fn fib(n: int): int { + if n < 3 { + return n; + } + return fib(n - 1) + fib(n - 2); +} + +fib(30); +``` + +### Type declarations (TODO) + +``` +type Foo = object { # Can also be "ref object" for reference types (managed automatically) + fieldOne*: int # Asterisk means the field is public outside the current module + fieldTwo*: int +} +``` + +### Enumeration types (TODO) + +``` +type SomeEnum = enum { # Can be mapped to an integer + KindOne, + KindTwo +} +``` + +### Operator overloading + +``` +operator `+`(a, b: Foo): Foo { + return Foo(fieldOne: a.fieldOne + b.fieldOne, fieldTwo: a.fieldTwo + b.fieldTwo); +} + +Foo(fieldOne: 1, fieldTwo: 3) + Foo(fieldOne: 2, fieldTwo: 3); # Foo(fieldOne: 3, fieldTwo: 6) +``` + +__Note__: Custom operators (e.g. `foo`) can also be defined. The backticks around the plus sign serve to mark it +as an identifier instead of a symbol (which is a requirement for function names, since operators are basically +functions in peon). In fact, even the built-in peon operators are implemented partially in peon (actually, just +their stubs are) and they are then specialized in the compiler to get rid of unnecessary function call overhead. 
+ +### Function calls + +``` +foo(1, 2 + 3, 3.14, bar(baz)); +``` + +__Note__: Operators can be called as functions: if their name is a symbol, just wrap it in backticks like so: +``` +`+`(1, 2) # Identical to 1 + 2 +``` + +__Note__: Code like `a.b()` is (actually, will be) desugared to `b(a)` if there exists a function + `b` whose signature is compatible with the type of `a` (assuming `a` doesn't have a field named `b`, + in which case the attribute resolution takes precedence) + + +### Generics + +``` +fn genericSum[T: Number](a, b: T): T { # Note: "a, b: T" means that both a and b are of type T + return a + b; +} + +# This allows for a single implementation to be +# re-used multiple times without any code duplication +genericSum(1, 2); +genericSum(3.14, 0.1); +genericSum(1'u8, 250'u8); +``` + +__Note__: Peon generics are implemented according to a paradigm called [parametric polymorphism](https://en.wikipedia.org/wiki/Parametric_polymorphism). In contrast to the model employed by other languages such as C++, called [ad hoc polymorphism](https://en.wikipedia.org/wiki/Ad_hoc_polymorphism), +where each time a generic function is called with a new type signature it is instantiated and +typechecked (and then compiled), peon checks generics at declaration time and only once: this +not only saves precious compilation time, but it also allows the compiler to generate a single +implementation for the function (although this is not a requirement) and catches type errors right +when they occur even if the function is never called, rather than having to wait for the function +to be called and specialized. Unfortunately, this means that some of the things that are possible +in, say, C++ templates are just not possible with peon generics.
As an example, take this code snippet: + +``` +fn add[T: any](a, b: T): T { + return a + b; +} +``` + +While the intent of this code is clear and makes sense semantically speaking, peon will refuse +to compile it because it cannot prove that the `+` operator is defined on every type (in fact, +it's only defined for numbers): this is a feature. If peon allowed it, `any` could be used to +escape the safety of the type system (for example, calling `add` with `string`s, which may or +may not be what you want). + +Since the goal for peon is to not constrain the developer into one specific programming paradigm, +it also implements a secondary, different, generic mechanism using the `auto` type. The above code +could be rewritten to work as follows: + +``` +fn add(a, b: auto): auto { + return a + b; +} +``` + +When using automatic types, peon will behave similarly to C++ (think: templates) and only specialize, +typecheck and compile the function once it is called with a given type signature. For this reason, +automatic and parametrically polymorphic types cannot be used together in peon code. + +Another noteworthy concept to keep in mind is that of type unions. For example, take this snippet: + +``` +fn foo(x: int32): int32 { + return x; +} + + +fn foo(x: int): int { + return x; +} + + +fn identity[T: int | int32](x: T): T { + return foo(x); +} +``` + +This code will, again, fail to compile: this is because as far as peon is concerned, `foo` is not +defined for both `int` and `int32` _at the same time_. In order for that to work, `foo` would need +to be rewritten with `T: int32 | int` as its generic argument type in order to avoid the ambiguity +(or `identity` could be rewritten to use automatic types instead, both are viable options). 
Obviously, +the above snippet would fail to compile if `foo` were not defined for all the types specified in the +type constraint for `identity` as well (this is because, counterintuitively, matching a generic constraint +such as `int32 | int` does _not_ mean "either of these types", but rather "_both_ of these types at +once"). + + +#### More generics + +``` +fn genericSth[T: someTyp, K: someTyp2](a: T, b: K) { # Note: no return type == void function + # code... +} + +genericSth(1, 3.0); +``` + + +#### Even more generics + +``` +type Box*[T: Number] = object { + num: T; +} + +var boxFloat = Box[float](1.0); +var boxInt = Box[int](1); +``` + +__Note__: The `*` modifier to make a name visible outside the current module must be put +__before__ the generic constraints, so only `fn foo*[T](a: T) {}` is the correct syntax. + + + +### Forward declarations + +``` +fn someF: int; # Semicolon and no body == forward declaration + +print(someF()); # Prints 42 + +fn someF: int { + return 42; +} +``` + +__Note__: A function that is forward-declared __must__ be implemented in the same module as +the forward declaration. + +### Generators + +``` +generator count(n: int): int { + while n > 0 { + yield n; + n -= 1; + } +} + +foreach n in count(10) { + print(n); +} +``` + + +### Coroutines + +``` +import concur; +import http; + + +coroutine req(url: string): string { + return (await http.AsyncClient().get(url)).content; +} + + +coroutine main(urls: list[string]) { + var pool = concur.pool(); # Creates a task pool: like a nursery in njsmith's article + foreach url in urls { + pool.spawn(req, url); + } + # The pool has internal machinery that makes the parent + # task wait until all children exit! When this function + # returns, ALL child tasks will have exited somehow. + # Exceptions and return values propagate neatly, too.
+} + + +concur.run(main, newList[string]("https://google.com", "https://debian.org")) +``` \ No newline at end of file diff --git a/nim.cfg b/nim.cfg new file mode 100644 index 0000000..7f9bd25 --- /dev/null +++ b/nim.cfg @@ -0,0 +1 @@ +path="src" \ No newline at end of file diff --git a/src/backend/bytecode/vm.nim b/src/backend/bytecode/vm.nim new file mode 100644 index 0000000..7f2dd01 --- /dev/null +++ b/src/backend/bytecode/vm.nim @@ -0,0 +1,1098 @@ +# Copyright 2022 Mattia Giambirtone & All Contributors +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+## The Peon runtime environment +import config + +# Sorry, but there only is enough space +# for one GC in this VM :( +{.push checks:enableVMChecks.} # The VM is a critical point where checks are deleterious +when defined(gcOrc): + GC_disableOrc() +when not defined(gcArc) and not defined(gcOrc): + GC_disable() + GC_disableMarkAndSweep() + + +import std/math +import std/strformat +import std/segfaults +import std/strutils +import std/sets +import std/monotimes + +when debugVM or debugMem or debugGC or debugAlloc: + import std/sequtils + import std/terminal + + +import frontend/compiler/targets/bytecode/opcodes +import frontend/compiler/targets/bytecode/util/multibyte + + +when debugVM: + proc clearerr(stream: File) {.header: "stdio.h", importc.} + + +type + ObjectKind* = enum + ## A tag for heap-allocated + ## peon objects + String, List, + Dict, Tuple, + CustomType, + HeapObject* = object + ## A tagged box for a heap-allocated + ## peon object + marked*: bool # Used in the GC phase + case kind*: ObjectKind + of String: + str*: ptr UncheckedArray[char] + len*: int + else: + discard # TODO + PeonGC* = object + ## A simple Mark&Sweep collector + ## to manage peon's heap space. + ## All heap allocation goes through + ## this system and is not handled + ## manually by the VM + bytesAllocated: tuple[total, current: int] + when debugGC or debugAlloc: + cycles: int + nextGC: int + pointers: HashSet[uint64] + PeonVM* = object + ## The Peon Virtual Machine. + ## Note how the only data + ## type we handle here is + ## a 64-bit unsigned integer: + ## This is to allow the use + ## of unboxed primitive types. + ## For more complex types, the + ## value represents a pointer to + ## some stack- or heap-allocated + ## object. 
The VM has no concept + ## of type by itself: everything + ## is lost after the compilation + ## phase + ip: uint64 # The instruction pointer + chunk: Chunk # The chunk of bytecode to execute + calls: seq[uint64] # The call stack + operands: seq[uint64] # The operand stack + cache: array[6, uint64] # The singletons cache + frames: seq[uint64] # Stores the bottom of stack frames + results: seq[uint64] # Stores function return values + gc: PeonGC # A reference to the VM's garbage collector + when debugVM: + breakpoints: seq[uint64] # Breakpoints where we call our debugger + debugNext: bool # Whether to debug the next instruction + lastDebugCommand: string # The last debugging command input by the user + + +# Implementation of peon's memory manager + +proc newPeonGC*: PeonGC = + ## Initializes a new, blank + ## garbage collector + result.bytesAllocated = (0, 0) + result.nextGC = FirstGC + when debugGC or debugAlloc: + result.cycles = 0 + + +proc collect*(self: var PeonVM) + + + +proc reallocate*(self: var PeonVM, p: pointer, oldSize: int, newSize: int): pointer = + ## Simple wrapper around realloc with + ## built-in garbage collection + self.gc.bytesAllocated.current += newSize - oldSize + try: + when debugMem: + if newSize == 0 and not p.isNil(): + if oldSize > 1: + echo &"DEBUG - MM: Deallocating {oldSize} bytes of memory" + else: + echo "DEBUG - MM: Deallocating 1 byte of memory" + if (oldSize > 0 and not p.isNil() and newSize > oldSize) or oldSize == 0: + when debugMem: + if oldSize == 0: + if newSize > 1: + echo &"DEBUG - MM: Allocating {newSize} bytes of memory" + else: + echo "DEBUG - MM: Allocating 1 byte of memory" + else: + echo &"DEBUG - M: Resizing {oldSize} bytes of memory to {newSize} bytes" + self.gc.bytesAllocated.total += newSize - oldSize + when debugStressGC: + self.collect() + else: + if self.gc.bytesAllocated.current >= self.gc.nextGC: + self.collect() + result = realloc(p, newSize) + except NilAccessDefect: + stderr.writeLine("Peon: could not 
manage memory, segmentation fault") + quit(139) # For now, there's not much we can do if we can't get the memory we need, so we exit + + +template resizeArray(self: var PeonVM, kind: untyped, p: pointer, oldCount, newCount: int): untyped = + ## Handy template to resize a dynamic array + cast[ptr UncheckedArray[kind]](reallocate(self, p, sizeof(kind) * oldCount, sizeof(kind) * newCount)) + + +template freeArray(self: var PeonVM, kind: untyped, p: pointer, size: int): untyped = + ## Frees a dynamic array + discard reallocate(self, p, sizeof(kind) * size, 0) + + +template free(self: var PeonVM, kind: typedesc, p: pointer): untyped = + ## Frees a pointer by reallocating its + ## size to 0 + discard reallocate(self, p, sizeof(kind), 0) + + +template setKind[T, K](t: var T, kind: untyped, target: K) = + ## Thanks to https://forum.nim-lang.org/t/8312 + cast[ptr K](cast[int](addr t) + offsetOf(typeof(t), kind))[] = target + + +proc allocate(self: var PeonVM, kind: ObjectKind, size: typedesc, count: int): ptr HeapObject {.inline.} = + ## Allocates an object on the heap and adds its + ## location to the internal pointer list of the + ## garbage collector + result = cast[ptr HeapObject](self.reallocate(nil, 0, sizeof(HeapObject))) + setkind(result[], kind, kind) + result.marked = false + case kind: + of String: + result.str = cast[ptr UncheckedArray[char]](self.reallocate(nil, 0, sizeof(size) * count)) + result.len = count + else: + discard # TODO + self.gc.pointers.incl(cast[uint64](result)) + when debugAlloc: + echo &"DEBUG - GC: Allocated new object: {result[]}" + echo &"DEBUG - GC: Current heap size: {self.gc.bytesAllocated.current}" + echo &"DEBUG - GC: Total bytes allocated: {self.gc.bytesAllocated.total}" + echo &"DEBUG - GC: Tracked objects: {self.gc.pointers.len()}" + echo &"DEBUG - GC: Completed GC cycles: {self.gc.cycles}" + + +proc mark(self: ptr HeapObject): bool = + ## Marks a single object + if self.marked: + return false + self.marked = true + return true + + 
+proc markRoots(self: var PeonVM): HashSet[ptr HeapObject] = + ## Marks root objects *not* to be + ## collected by the GC and returns + ## their addresses + when debugGC: + echo "DEBUG - GC: Starting mark phase" + # Unlike what Bob does in his book, we keep track + # of objects another way, mainly due to the difference + # of our respective designs. Specifically, our VM only + # handles a single type (uint64) while Lox stores all objects + # in heap-allocated structs (which is convenient, but slow). + # What we do is store the pointers to the objects we allocated in + # a hash set and then, at collection time, do a set difference + # between the reachable objects and the whole set and discard + # whatever is left; Unfortunately, this means that if a primitive + # object's value happens to collide with an active pointer the GC + # will mistakenly assume the object to be reachable, potentially + # leading to a nasty memory leak. Let's just hope a 48+ bit address + # space makes this occurrence rare enough not to be a problem + # handles a single type (uint64), while Lox has a stack + # of heap-allocated structs (which is convenient, but slow). + # What we do instead is store all pointers allocated by us + # in a hash set and then check if any source of roots contained + # any of the integer values that we're keeping track of. Note + # that this means that if a primitive object's value happens to + # collide with an active pointer, the GC will mistakenly assume + # the object to be reachable (potentially leading to a nasty + # memory leak). 
Hopefully, in a 64-bit address space, this + # occurrence is rare enough for us to ignore + var result = initHashSet[uint64](self.gc.pointers.len()) + for obj in self.calls: + if obj in self.gc.pointers: + result.incl(obj) + for obj in self.operands: + if obj in self.gc.pointers: + result.incl(obj) + var obj: ptr HeapObject + for p in result: + obj = cast[ptr HeapObject](p) + if obj.mark(): + when debugMarkGC: + echo &"DEBUG - GC: Marked object: {obj[]}" + when debugGC: + echo "DEBUG - GC: Mark phase complete" + + +proc trace(self: var PeonVM, roots: HashSet[ptr HeapObject]) = + ## Traces references to other + ## objects starting from the + ## roots. The second argument + ## is the output of the mark + ## phase. To speak in terms + ## of the tricolor abstraction, + ## this is where we blacken gray + ## objects + when debugGC: + if len(roots) > 0: + echo &"DEBUG - GC: Tracing indirect references from {len(roots)} root{(if len(roots) > 1: \"s\" else: \"\")}" + var count = 0 + for root in roots: + case root.kind: + of String: + discard # Strings hold no additional references + else: + discard # TODO: Other types + when debugGC: + echo &"DEBUG - GC: Traced {count} indirect reference{(if count != 1: \"s\" else: \"\")}" + + +proc free(self: var PeonVM, obj: ptr HeapObject) = + ## Frees a single heap-allocated + ## peon object and all the memory + ## it directly or indirectly owns. 
Note + ## that the pointer itself is not released + ## from the GC's internal table and must be + ## handled by the caller + when debugAlloc: + echo &"DEBUG - GC: Freeing object: {obj[]}" + case obj.kind: + of String: + # Strings only own their + # underlying character array + if obj.len > 0 and not obj.str.isNil(): + self.freeArray(char, obj.str, obj.len) + else: + discard # TODO + self.free(HeapObject, obj) + when debugAlloc: + echo &"DEBUG - GC: Current heap size: {self.gc.bytesAllocated.current}" + echo &"DEBUG - GC: Total bytes allocated: {self.gc.bytesAllocated.total}" + echo &"DEBUG - GC: Tracked objects: {self.gc.pointers.len()}" + echo &"DEBUG - GC: Completed GC cycles: {self.gc.cycles}" + + +proc sweep(self: var PeonVM) = + ## Sweeps unmarked objects + ## that have been left behind + ## during the mark phase. + when debugGC: + echo "DEBUG - GC: Beginning sweeping phase" + var count = 0 + var current: ptr HeapObject + var freed: HashSet[uint64] + for p in self.gc.pointers: + current = cast[ptr HeapObject](p) + if current.marked: + # Object is marked: don't touch it, + # but reset its mark so that it doesn't + # stay alive forever + when debugMarkGC: + echo &"DEBUG - GC: Unmarking object: {current[]}" + current.marked = false + else: + # Object is unmarked: its memory is + # fair game + self.free(current) + freed.incl(p) + when debugGC: + inc(count) + # Set difference + self.gc.pointers = self.gc.pointers - freed + when debugGC: + echo &"DEBUG - GC: Swept {count} object{(if count > 1: \"s\" else: \"\")}" + + +proc collect(self: var PeonVM) = + ## Attempts to reclaim some + ## memory from unreachable + ## objects onto the heap + when debugGC: + let before = self.gc.bytesAllocated.current + let time = getMonoTime().ticks().float() / 1_000_000 + echo "" + echo &"DEBUG - GC: Starting collection cycle at heap size {self.gc.bytesAllocated.current}" + echo &"DEBUG - GC: Total bytes allocated: {self.gc.bytesAllocated.total}" + echo &"DEBUG - GC: Tracked objects: 
{self.gc.pointers.len()}" + echo &"DEBUG - GC: Completed GC cycles: {self.gc.cycles}" + inc(self.gc.cycles) + self.trace(self.markRoots()) + self.sweep() + self.gc.nextGC = self.gc.bytesAllocated.current * HeapGrowFactor + if self.gc.nextGC == 0: + self.gc.nextGC = FirstGC + when debugGC: + echo &"DEBUG - GC: Collection cycle has terminated in {getMonoTime().ticks().float() / 1_000_000 - time:.2f} ms, collected {before - self.gc.bytesAllocated.current} bytes of memory in total" + echo &"DEBUG - GC: Next cycle at {self.gc.nextGC} bytes" + echo &"DEBUG - GC: Total bytes allocated: {self.gc.bytesAllocated.total}" + echo &"DEBUG - GC: Tracked objects: {self.gc.pointers.len()}" + echo &"DEBUG - GC: Completed GC cycles: {self.gc.cycles}" + +# Implementation of the peon VM + +proc initCache*(self: var PeonVM) = + ## Initializes the VM's + ## singletons cache + self.cache[0] = 0x0 # False + self.cache[1] = 0x1 # True + self.cache[2] = 0x2 # Nil + self.cache[3] = 0x3 # Positive inf + self.cache[4] = 0x4 # Negative inf + self.cache[5] = 0x5 # NaN + + +proc newPeonVM*: PeonVM = + ## Initializes a new, blank VM + ## for executing Peon bytecode + result.ip = 0 + result.initCache() + result.gc = newPeonGC() + result.frames = @[] + result.operands = @[] + result.results = @[] + result.calls = @[] + + +# Getters for singleton types +{.push inline.} + +func getNil*(self: var PeonVM): uint64 = self.cache[2] + +func getBool*(self: var PeonVM, value: bool): uint64 = + if value: + return self.cache[1] + return self.cache[0] + +func getInf*(self: var PeonVM, positive: bool): uint64 = + if positive: + return self.cache[3] + return self.cache[4] + +func getNan*(self: var PeonVM): uint64 = self.cache[5] + + +# Thanks to nim's *genius* idea of making x > y a template +# for y < x (which by itself is fine) together with the fact +# that the order of evaluation of templates with the same +# expression is fucking stupid (see https://nim-lang.org/docs/manual.html#order-of-evaluation +# and 
https://github.com/nim-lang/Nim/issues/10425 and try not to +# bang your head against the nearest wall), we need a custom operator +# that preserves the natural order of evaluation +func `!>`[T](a, b: T): auto = + b < a + + +proc `!>=`[T](a, b: T): auto {.used.} = + b <= a + + +# Stack primitives. Note that all accesses to the call stack +# that go through the (get|set|peek)c wrappers are frame-relative, +# meaning that the given index is added to the current stack frame's +# bottom to obtain an absolute stack index +func push(self: var PeonVM, obj: uint64) = + ## Pushes a value object onto the + ## operand stack + self.operands.add(obj) + + +func pop(self: var PeonVM): uint64 = + ## Pops a value off the operand + ## stack and returns it + return self.operands.pop() + + +func peekb(self: PeonVM, distance: BackwardsIndex = ^1): uint64 = + ## Returns the value at the given (backwards) + ## distance from the top of the operand stack + ## without consuming it + return self.operands[distance] + + +func peek(self: PeonVM, distance: int = 0): uint64 = + ## Returns the value at the given + ## distance from the top of the + ## operand stack without consuming it + if distance < 0: + return self.peekb(^(-int(distance))) + return self.operands[self.operands.high() + distance] + + +func pushc(self: var PeonVM, val: uint64) = + ## Pushes a value onto the + ## call stack + self.calls.add(val) + + +func popc(self: var PeonVM): uint64 = + ## Pops a value off the call + ## stack and returns it + return self.calls.pop() + + +func peekc(self: PeonVM, distance: int = 0): uint64 {.used.} = + ## Returns the value at the given + ## distance from the top of the + ## call stack without consuming it + return self.calls[self.calls.high() + distance] + + +func getc(self: PeonVM, idx: int): uint64 = + ## Getter method that abstracts + ## indexing our call stack through + ## stack frames + return self.calls[idx.uint64 + self.frames[^1]] + + +func setc(self: var PeonVM, idx: int, val: uint64) = + 
## Setter method that abstracts + ## indexing our call stack through + ## stack frames + self.calls[idx.uint + self.frames[^1]] = val + + +# Byte-level primitives to read and decode +# bytecode + +proc readByte(self: var PeonVM): uint8 = + ## Reads a single byte from the + ## bytecode and returns it as an + ## unsigned 8 bit integer + inc(self.ip) + return self.chunk.code[self.ip - 1] + + +proc readShort(self: var PeonVM): uint16 = + ## Reads two bytes from the + ## bytecode and returns them + ## as an unsigned 16 bit + ## integer + return [self.readByte(), self.readByte()].fromDouble() + + +proc readLong(self: var PeonVM): uint32 = + ## Reads three bytes from the + ## bytecode and returns them + ## as an unsigned 32 bit + ## integer. Note however that + ## the boundary is capped at + ## 24 bits instead of 32 + return uint32([self.readByte(), self.readByte(), self.readByte()].fromTriple()) + + +proc readUInt(self: var PeonVM): uint32 {.used.} = + ## Reads four bytes from the + ## bytecode and returns them + ## as an unsigned 32 bit + ## integer + return uint32([self.readByte(), self.readByte(), self.readByte(), self.readByte()].fromQuad()) + + +# Functions to read primitives from the chunk's +# constants table + +proc constReadInt64(self: var PeonVM, idx: int): int64 = + ## Reads a constant from the + ## chunk's constant table and + ## returns it as an int64 + var arr = [self.chunk.consts[idx], self.chunk.consts[idx + 1], + self.chunk.consts[idx + 2], self.chunk.consts[idx + 3], + self.chunk.consts[idx + 4], self.chunk.consts[idx + 5], + self.chunk.consts[idx + 6], self.chunk.consts[idx + 7], + ] + copyMem(result.addr, arr.addr, sizeof(arr)) + + +proc constReadUInt64(self: var PeonVM, idx: int): uint64 = + ## Reads a constant from the + ## chunk's constant table and + ## returns it as an uint64 + var arr = [self.chunk.consts[idx], self.chunk.consts[idx + 1], + self.chunk.consts[idx + 2], self.chunk.consts[idx + 3], + self.chunk.consts[idx + 4], 
self.chunk.consts[idx + 5], + self.chunk.consts[idx + 6], self.chunk.consts[idx + 7], + ] + copyMem(result.addr, arr.addr, sizeof(arr)) + + +proc constReadUInt32(self: var PeonVM, idx: int): uint32 = + ## Reads a constant from the + ## chunk's constant table and + ## returns it as an uint32 + var arr = [self.chunk.consts[idx], self.chunk.consts[idx + 1], + self.chunk.consts[idx + 2], self.chunk.consts[idx + 3]] + copyMem(result.addr, arr.addr, sizeof(arr)) + + +proc constReadInt32(self: var PeonVM, idx: int): int32 = + ## Reads a constant from the + ## chunk's constant table and + ## returns it as an int32 + var arr = [self.chunk.consts[idx], self.chunk.consts[idx + 1], + self.chunk.consts[idx + 2], self.chunk.consts[idx + 3]] + copyMem(result.addr, arr.addr, sizeof(arr)) + + +proc constReadInt16(self: var PeonVM, idx: int): int16 = + ## Reads a constant from the + ## chunk's constant table and + ## returns it as an int16 + var arr = [self.chunk.consts[idx], self.chunk.consts[idx + 1]] + copyMem(result.addr, arr.addr, sizeof(arr)) + + +proc constReadUInt16(self: var PeonVM, idx: int): uint16 = + ## Reads a constant from the + ## chunk's constant table and + ## returns it as an uint16 + var arr = [self.chunk.consts[idx], self.chunk.consts[idx + 1]] + copyMem(result.addr, arr.addr, sizeof(arr)) + + +proc constReadInt8(self: var PeonVM, idx: int): int8 = + ## Reads a constant from the + ## chunk's constant table and + ## returns it as an int8 + result = int8(self.chunk.consts[idx]) + + +proc constReadUInt8(self: var PeonVM, idx: int): uint8 = + ## Reads a constant from the + ## chunk's constant table and + ## returns it as an uint8 + result = self.chunk.consts[idx] + + +proc constReadFloat32(self: var PeonVM, idx: int): float32 = + ## Reads a constant from the + ## chunk's constant table and + ## returns it as a float32 + var arr = [self.chunk.consts[idx], self.chunk.consts[idx + 1], + self.chunk.consts[idx + 2], self.chunk.consts[idx + 3]] + copyMem(result.addr, 
arr.addr, sizeof(arr)) + + +proc constReadFloat64(self: var PeonVM, idx: int): float = + ## Reads a constant from the + ## chunk's constant table and + ## returns it as a float + var arr = [self.chunk.consts[idx], self.chunk.consts[idx + 1], + self.chunk.consts[idx + 2], self.chunk.consts[idx + 3], + self.chunk.consts[idx + 4], self.chunk.consts[idx + 5], + self.chunk.consts[idx + 6], self.chunk.consts[idx + 7]] + copyMem(result.addr, arr.addr, sizeof(arr)) + + +proc constReadString(self: var PeonVM, size, idx: int): ptr HeapObject = + ## Reads a constant from the + ## chunk's constant table and + ## returns it as a pointer to + ## a heap-allocated string + let str = self.chunk.consts[idx.. ") + stdout.flushFile() + try: + command = readLine(stdin) + except EOFError: + styledEcho(fgYellow, "Use Ctrl+C to exit") + clearerr(stdin) + break + except IOError: + styledEcho(fgRed, "An error occurred while reading command: ", fgYellow, getCurrentExceptionMsg()) + break + if command == "": + if self.lastDebugCommand == "": + command = "n" + else: + command = self.lastDebugCommand + case command: + of "n", "next": + self.debugNext = true + break + of "c", "continue": + self.debugNext = false + break + of "s", "stack": + stdout.styledWrite(fgGreen, "Call Stack: ", fgMagenta, "[") + for i, e in self.calls: + stdout.styledWrite(fgYellow, $e) + if i < self.calls.high(): + stdout.styledWrite(fgYellow, ", ") + styledEcho fgMagenta, "]" + of "o", "operands": + stdout.styledWrite(fgBlue, "Operand Stack: ", fgMagenta, "[") + for i, e in self.operands: + stdout.styledWrite(fgYellow, $e) + if i < self.operands.high(): + stdout.styledWrite(fgYellow, ", ") + styledEcho fgMagenta, "]" + of "f", "frame": + stdout.styledWrite(fgCyan, "Current Frame: ", fgMagenta, "[") + if self.frames.len() > 0: + for i, e in self.calls[self.frames[^1]..^1]: + stdout.styledWrite(fgYellow, $e) + if i < (self.calls.high() - self.frames[^1].int): + stdout.styledWrite(fgYellow, ", ") + styledEcho fgMagenta, 
"]", fgCyan + of "frames": + stdout.styledWrite(fgRed, "Live stack frames: ", fgMagenta, "[") + for i, e in self.frames: + stdout.styledWrite(fgYellow, $e) + if i < self.frames.high(): + stdout.styledWrite(fgYellow, ", ") + styledEcho fgMagenta, "]" + of "r", "results": + stdout.styledWrite(fgYellow, "Function Results: ", fgMagenta, "[") + for i, e in self.results: + stdout.styledWrite(fgYellow, $e) + if i < self.results.high(): + stdout.styledWrite(fgYellow, ", ") + styledEcho fgMagenta, "]" + of "clear": + stdout.write("\x1Bc") + else: + styledEcho(fgRed, "Unknown command ", fgYellow, &"'{command}'") + + +proc dispatch*(self: var PeonVM) {.inline.} = + ## Main bytecode dispatch loop + var instruction {.register.}: OpCode + while true: + {.computedgoto.} # https://nim-lang.org/docs/manual.html#pragmas-computedgoto-pragma + when debugVM: + if self.ip in self.breakpoints or self.debugNext: + self.debug() + instruction = OpCode(self.readByte()) + case instruction: + # Constant loading instructions + of LoadTrue: + self.push(self.getBool(true)) + of LoadFalse: + self.push(self.getBool(false)) + of LoadNan: + self.push(self.getNan()) + of LoadNil: + self.push(self.getNil()) + of LoadInf: + self.push(self.getInf(true)) + of LoadNInf: + self.push(self.getInf(false)) + of LoadInt64: + self.push(uint64(self.constReadInt64(int(self.readLong())))) + of LoadUInt64: + self.push(uint64(self.constReadUInt64(int(self.readLong())))) + of LoadUInt32: + self.push(uint64(self.constReadUInt32(int(self.readLong())))) + of LoadInt32: + self.push(uint64(self.constReadInt32(int(self.readLong())))) + of LoadInt16: + self.push(uint64(self.constReadInt16(int(self.readLong())))) + of LoadUInt16: + self.push(uint64(self.constReadUInt16(int(self.readLong())))) + of LoadInt8: + self.push(uint64(self.constReadInt8(int(self.readLong())))) + of LoadUInt8: + self.push(uint64(self.constReadUInt8(int(self.readLong())))) + of LoadString: + # Loads the string's pointer onto the stack + 
self.push(cast[uint64](self.constReadString(int(self.readLong()), int(self.readLong())))) + of LoadFloat32: + self.push(cast[uint64](self.constReadFloat32(int(self.readLong())))) + of LoadFloat64: + self.push(cast[uint64](self.constReadFloat64(int(self.readLong())))) + of Call: + # Calls a peon function. The calling convention here + # is pretty simple: the first value in the frame is + # the new instruction pointer to jump to, then a + # 64-bit return address follows. After that, all + # arguments and locals follow. Note that, due to + # how the stack works, all arguments before the call + # are in the reverse order in which they are passed + # to the function + let argc = self.readLong().int + let retAddr = self.peek(-argc - 1) # Return address + let jmpAddr = self.peek(-argc - 2) # Function address + self.ip = jmpAddr + self.pushc(jmpAddr) + self.pushc(retAddr) + # Creates a new result slot for the + # function's return value + self.results.add(self.getNil()) + # Creates a new call frame + self.frames.add(uint64(self.calls.len() - 2)) + # Loads the arguments onto the stack + for _ in 0..