diff --git a/README.md b/README.md
index 7eb6ec7..f521db5 100644
--- a/README.md
+++ b/README.md
@@ -52,25 +52,25 @@ In no particular order, here's a list of stuff that's done/to do (might be incom
 Toolchain:
- - Tokenizer (with dynamic symbol table) [x]
- - Parser (with support for custom operators, even builtins) [x]
- - Compiler [ ] (Work in Progress)
- - VM [ ] (Work in Progress)
- - Bytecode (de-)serializer [x]
- - Static code debugger [x]
- - Runtime debugger/inspection tool [ ]
+ - Tokenizer (with dynamic symbol table) -> Done
+ - Parser (with support for custom operators, even builtins) -> Done
+ - Compiler [ ] -> Being written
+ - VM [ ] -> Being written
+ - Bytecode (de-)serializer -> Done
+ - Static code debugger [x] -> Done
+ - Runtime debugger/inspection tool -> TODO
 Type system:
- - Custom types [ ]
- - Intrinsics [x]
- - Generics [ ] (Work in Progress)
- - Function calls [ ] (Work in Progress)
+ - Custom types -> TODO
+ - Intrinsics -> Done
+ - Generics -> TODO
+ - Function calls -> WIP
 Misc:
- - Pragmas [ ] (Work in Progress)
- - Attribute resolution [ ]
+ - Pragmas -> TODO
+ - Attribute resolution -> TODO
 - ... More?
 ## The name
diff --git a/docs/bytecode.md b/docs/bytecode.md
index 485ee15..0879136 100644
--- a/docs/bytecode.md
+++ b/docs/bytecode.md
@@ -16,25 +16,28 @@ A peon bytecode dump contains:
 - Debugging information
 - File and version metadata
-## Encoding
-
-### Header
+## File Headers
 A peon bytecode file starts with the header, which is structured as follows:
 - The literal string `PEON_BYTECODE`
-- A 3-byte version number (the major, minor and patch versions of the compiler that generated the file as per the SemVer versioning standard)
+- A 3-byte version number (the major, minor and patch version numbers of the compiler that generated the file)
 - The branch name of the repository the compiler was built from, prepended with its length as a 1 byte integer
-- The full commit hash (encoded as a 40-byte hex-encoded string) in the aforementioned branch from which the compiler was built from (particularly useful in development builds)
+- The commit hash (encoded as a 40-byte hex-encoded string) in the aforementioned branch from which the compiler was built from (particularly useful in development builds)
 - An 8-byte UNIX timestamp (with Epoch 0 starting at 1/1/1970 12:00 AM) representing the exact date and time of when the file was generated
 - A 32-byte, hex-encoded SHA256 hash of the source file's content, used to track file changes
-### Line data section
+## Debug information
-The line data section contains information about each instruction in the code section and associates them
+The following segments contain extra information and metadata about the compiled bytecode to aid debugging, but they may be missing
+in release builds.
+
+### Line data segment
+
+The line data segment contains information about each instruction in the code segment and associates them
 1:1 with a line number in the original source file for easier debugging using run-length encoding. The section's size is fixed and is encoded at the beginning as a sequence of 4 bytes (i.e.
 a single 32 bit integer). The data
-in this section can be decoded as explained in [this file](../src/frontend/meta/bytecode.nim#L28), which is quoted
+in this segment can be decoded as explained in [this file](../src/frontend/meta/bytecode.nim#L28), which is quoted
 below:
 ```
 [...]
@@ -54,19 +57,43 @@ below:
 [...]
 ```
-### Constant section
+### CFI segment
-The constant section contains all the read-only values that the code will need at runtime, such as hardcoded
+The CFI segment (where CFI stands for **C**all **F**rame **I**nformation), contains details about each function in
+the original file. The segment's size is fixed and is encoded at the beginning as a sequence of 4 bytes (i.e. a single 32 bit integer).
+The data
+in this segment can be decoded as explained in [this file](../src/frontend/meta/bytecode.nim#L41), which is quoted
+below:
+
+```
+[...]
+## cfi represents Call Frame Information and encodes the following information:
+## - Function name
+## - Stack bottom
+## - Argument count
+## The encoding for CFI data is the following:
+## - First, the position into the bytecode where the function begins is encoded (as a 3 byte integer)
+## - Second, the position into the bytecode where the function ends is encoded (as a 3 byte integer)
+## - Then, the frame's stack bottom is encoded as a 3 byte integer
+## - After the frame's stack bottom follows the argument count as a 1 byte integer
+## - Lastly, the function's name (optional) is encoded in ASCII, prepended with
+## its size as a 2-byte integer
+[...]
+```
+
+## Constant segment
+
+The constant segment contains all the read-only values that the code will need at runtime, such as hardcoded
 variable initializers or constant expressions. It is similar to the `.rodata` section of Assembly files, although the implementation is different.
 Constants are encoded as a linear sequence of bytes with no type information about them whatsoever: it is the code that, at runtime, loads each constant (whose type is determined at compile time) onto the stack accordingly. For example, a 32 bit integer constant would be encoded as a sequence of 4 bytes, which would
-then be loaded by the appropriate `LoadInt32` instruction at runtime. The section's size is fixed and is encoded at
-the beginning as a sequence of 4 bytes (i.e. a single 32 bit integer). The constant section may be empty, although in
-real-world scenarios it's unlikely that it would.
+then be loaded by the appropriate `LoadInt32` instruction at runtime. The segment's size is fixed and is encoded at
+the beginning as a sequence of 4 bytes (i.e. a single 32 bit integer). The constant segment may be empty, although in
+real-world scenarios likely won't.
-### Code section
+## Code segment
-The code section contains the linear sequence of bytecode instructions of a peon program. It is to be read directly
-and without modifications. The section's size is fixed and is encoded at the beginning as a sequence of 3 bytes
+The code segment contains the linear sequence of bytecode instructions of a peon program. It is to be read directly
+and without modifications. The segment's size is fixed and is encoded at the beginning as a sequence of 3 bytes
 (i.e. a single 24 bit integer).
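For illustration only, and not part of this patch: a minimal Python sketch of how the CFI entry layout and the length-prefixed segments documented above could be decoded. The diff does not state a byte order, so big-endian is assumed here. The names `CFIEntry`, `read_uint`, `decode_cfi_entry` and `read_segment` are hypothetical and do not come from the peon code base. The sketch further assumes that an absent function name is encoded with a length of zero, and that a segment's size prefix counts only the payload bytes that follow it.

```python
from dataclasses import dataclass
from typing import Tuple

BYTE_ORDER = "big"  # assumption: the documentation does not specify endianness


@dataclass
class CFIEntry:
    """One decoded Call Frame Information record (hypothetical container)."""
    start: int         # bytecode position where the function begins (3 bytes)
    end: int           # bytecode position where the function ends (3 bytes)
    stack_bottom: int  # the frame's stack bottom (3 bytes)
    arg_count: int     # argument count (1 byte)
    name: str          # ASCII name, prefixed with its size (2 bytes)


def read_uint(data: bytes, offset: int, size: int) -> Tuple[int, int]:
    """Read an unsigned `size`-byte integer; return (value, new offset)."""
    end = offset + size
    if end > len(data):
        raise ValueError("unexpected end of data")
    return int.from_bytes(data[offset:end], BYTE_ORDER), end


def decode_cfi_entry(data: bytes, offset: int = 0) -> Tuple[CFIEntry, int]:
    """Decode one CFI record starting at `offset`; return (entry, new offset)."""
    start, offset = read_uint(data, offset, 3)
    end, offset = read_uint(data, offset, 3)
    stack_bottom, offset = read_uint(data, offset, 3)
    arg_count, offset = read_uint(data, offset, 1)
    name_len, offset = read_uint(data, offset, 2)
    name_end = offset + name_len
    if name_end > len(data):
        raise ValueError("function name runs past the end of the data")
    # Assumption: a zero-length name means the optional name is absent.
    name = data[offset:name_end].decode("ascii")
    return CFIEntry(start, end, stack_bottom, arg_count, name), name_end


def read_segment(data: bytes, offset: int, size_bytes: int) -> Tuple[bytes, int]:
    """Read a segment whose payload is prefixed by a `size_bytes`-wide size."""
    size, offset = read_uint(data, offset, size_bytes)
    end = offset + size
    if end > len(data):
        raise ValueError("segment runs past the end of the data")
    return data[offset:end], end
```

Under these assumptions, `read_segment(blob, pos, 4)` would read the constant segment and `read_segment(blob, pos, 3)` the code segment, since the documentation gives them 4-byte and 3-byte size prefixes respectively.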
\ No newline at end of file
diff --git a/src/frontend/meta/bytecode.nim b/src/frontend/meta/bytecode.nim
index cf0536b..2ec0027 100644
--- a/src/frontend/meta/bytecode.nim
+++ b/src/frontend/meta/bytecode.nim
@@ -43,8 +43,8 @@ type
 ## - Stack bottom
 ## - Argument count
 ## The encoding for CFI data is the following:
- ## - First, the position into the bytecode where the function begins is encoded
- ## - Second, the position into the bytecode where the function ends is encoded
+ ## - First, the position into the bytecode where the function begins is encoded (as a 3 byte integer)
+ ## - Second, the position into the bytecode where the function ends is encoded (as a 3 byte integer)
 ## - Then, the frame's stack bottom is encoded as a 3 byte integer
 ## - After the frame's stack bottom follows the argument count as a 1 byte integer
 ## - Lastly, the function's name (optional) is encoded in ASCII, prepended with