# Peon - Bytecode Specification

This document describes peon's bytecode and how it is (de-)serialized to and from files and
other file-like objects.

## Code Structure

A peon program is compiled into a tightly packed sequence of bytes that contains all the information
the VM needs to execute said program. There is no dependence between the frontend and the backend outside of the
bytecode format (which is implemented in a separate serializer module) to allow for maximum modularity.

A peon bytecode dump contains:

- Constants
- The bytecode itself
- Debugging information
- File and version metadata

## File Headers

A peon bytecode file starts with the header, which is structured as follows:

- The literal string `PEON_BYTECODE`
- A 3-byte version number (the major, minor and patch version numbers of the compiler that generated the file)
- The branch name of the repository the compiler was built from, prepended with its length as a 1-byte integer
- The commit hash (encoded as a 40-byte hex string) in the aforementioned branch from which the compiler was built (particularly useful in development builds)
- An 8-byte UNIX timestamp (counted from the UNIX epoch, 1970-01-01 00:00 UTC) representing the date and time at which the file was generated

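The header layout above can be sketched as a small parser. This is only an illustration, not peon's serializer: the byte order of the timestamp (big-endian here), the ASCII encoding of the branch name, and the names `Header` and `parse_header` are assumptions that this document does not specify.

```python
from __future__ import annotations

import struct
from dataclasses import dataclass

MAGIC = b"PEON_BYTECODE"


@dataclass
class Header:
    version: tuple[int, int, int]  # (major, minor, patch)
    branch: str
    commit: str                    # 40 hex characters
    timestamp: int                 # UNIX timestamp


def parse_header(data: bytes) -> tuple[Header, int]:
    """Parse the header and return it with the offset of the first byte after it."""
    if not data.startswith(MAGIC):
        raise ValueError("missing PEON_BYTECODE marker")
    pos = len(MAGIC)
    # 3-byte version number: one byte each for major, minor and patch
    major, minor, patch = data[pos:pos + 3]
    pos += 3
    # Branch name, prefixed with its length as a 1-byte integer
    branch_len = data[pos]
    pos += 1
    branch = data[pos:pos + branch_len].decode("ascii")  # ASCII is an assumption
    pos += branch_len
    # Commit hash as a 40-byte hex string
    commit = data[pos:pos + 40].decode("ascii")
    pos += 40
    # 8-byte timestamp; big-endian unsigned is an assumption
    (timestamp,) = struct.unpack_from(">Q", data, pos)
    pos += 8
    return Header((major, minor, patch), branch, commit, timestamp), pos
```

The returned offset tells the caller where the data after the header begins, so the remaining segments can be read from that position.
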
## Debug information

The following segments contain extra information and metadata about the compiled bytecode to aid debugging, but they may be missing
in release builds.

### Line data segment

The line data segment maps each instruction in the code segment to a line number in the original source file
for easier debugging. The mapping is stored using run-length encoding. The segment's
size is fixed and is encoded at the beginning as a sequence of 4 bytes (i.e. a single 32-bit integer). The data
in this segment can be decoded as explained in [this file](../src/frontend/meta/bytecode.nim#L28), which is quoted
below:

```
[...]
## lines maps bytecode instructions to line numbers using Run
## Length Encoding. Instructions are encoded in groups whose structure
## follows the following schema:
## - The first integer represents the line number
## - The second integer represents the number of
## instructions on that line
## For example, if lines equals [1, 5], it means that there are 5 instructions
## at line 1, meaning that all instructions in code[0..4] belong to the same line.
## This is more efficient than using the naive approach, which would encode
## the same line number multiple times and waste considerable amounts of space.
[...]
```

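The run-length scheme quoted above can be illustrated with a short sketch. It operates on the already-deserialized list of integers (how each integer is laid out in the file is not covered here), and the function names `expand_lines` and `line_of` are hypothetical.

```python
from __future__ import annotations


def expand_lines(lines: list[int]) -> list[int]:
    """Expand [line, count, line, count, ...] into one line number per code index."""
    if len(lines) % 2 != 0:
        raise ValueError("expected (line, count) pairs")
    expanded: list[int] = []
    for i in range(0, len(lines), 2):
        line, count = lines[i], lines[i + 1]
        expanded.extend([line] * count)
    return expanded


def line_of(lines: list[int], index: int) -> int:
    """Return the source line for code[index] without expanding the whole table."""
    if index < 0:
        raise IndexError("negative code index")
    seen = 0
    for i in range(0, len(lines) - 1, 2):
        seen += lines[i + 1]  # number of code entries covered so far
        if index < seen:
            return lines[i]
    raise IndexError("no line information for this code index")
```

With `lines` equal to `[1, 5]`, `expand_lines` yields five entries of `1`, matching the quoted example in which `code[0..4]` all belong to line 1.
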
### Functions segment

This segment contains details about each function in
the original file. The segment's size is fixed and is encoded at the beginning as a sequence of 4 bytes (i.e. a single 32-bit integer).
The data in this segment can be decoded as explained in [this file](../src/frontend/meta/bytecode.nim#L41), which is quoted
below:

```
[...]
## [...] encodes the following information:
## - Function name
## - Argument count
## - Function boundaries
## The encoding for is the following:
## - First, the position into the bytecode where the function begins is encoded (as a 3 byte integer)
## - Second, the position into the bytecode where the function ends is encoded (as a 3 byte integer)
## - After that follows the argument count as a 1 byte integer
## - Lastly, the function's name (optional) is encoded in ASCII, prepended with
## its size as a 2-byte integer
[...]
```

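A minimal sketch of decoding the entries described in the quote follows. It takes the segment body (the bytes after the 4-byte size prefix) and makes several assumptions not stated in this document: integers are big-endian, entries are stored back to back, and a function without a name is stored with a name length of 0. The names `FunctionInfo` and `decode_functions` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FunctionInfo:
    start: int  # position in the bytecode where the function begins
    end: int    # position in the bytecode where the function ends
    argc: int   # argument count
    name: str   # empty if the function has no name (assumption)


def decode_functions(segment: bytes) -> list[FunctionInfo]:
    """Decode every entry in the body of the functions segment."""
    functions: list[FunctionInfo] = []
    pos = 0
    while pos < len(segment):
        # Fixed part of an entry: 3 + 3 + 1 + 2 = 9 bytes
        if pos + 9 > len(segment):
            raise ValueError("truncated function entry")
        start = int.from_bytes(segment[pos:pos + 3], "big")
        end = int.from_bytes(segment[pos + 3:pos + 6], "big")
        argc = segment[pos + 6]
        name_len = int.from_bytes(segment[pos + 7:pos + 9], "big")
        pos += 9
        if pos + name_len > len(segment):
            raise ValueError("truncated function name")
        name = segment[pos:pos + name_len].decode("ascii")
        pos += name_len
        functions.append(FunctionInfo(start, end, argc, name))
    return functions
```

Because the name is length-prefixed, entries have a variable size, so the decoder has to walk the segment sequentially rather than index into it directly.
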
## Constant segment

The constant segment contains all the read-only values that the code will need at runtime, such as hardcoded
variable initializers or constant expressions. It is similar to the `.rodata` section of assembly files, although
the implementation is different. Constants are encoded as a linear sequence of bytes with no type information
whatsoever: it is the code that, at runtime, loads each constant (whose type is determined at compile time) onto
the stack accordingly. For example, a 32-bit integer constant would be encoded as a sequence of 4 bytes, which would
then be loaded by the appropriate `LoadInt32` instruction at runtime. The segment's size is fixed and is encoded at
the beginning as a sequence of 4 bytes (i.e. a single 32-bit integer). The constant segment may be empty, although in
real-world scenarios it likely won't be.

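To illustrate how untyped constant bytes acquire a type only when they are loaded, here is a rough sketch of what an instruction like `LoadInt32` might do with the constant segment. The byte order (big-endian), the signedness, and the idea of addressing a constant by byte offset are all assumptions; `load_int32` is a hypothetical helper, not part of peon.

```python
import struct


def load_int32(constants: bytes, offset: int) -> int:
    """Interpret the 4 bytes at `offset` as a signed 32-bit integer."""
    # The bytes carry no type tag: the caller decides they form an int32.
    # ">i" means big-endian signed 32-bit, which is an assumption here.
    (value,) = struct.unpack_from(">i", constants, offset)
    return value
```

The same 4 bytes could just as well be read as a float or as part of a longer value, which is why the instruction that loads a constant, chosen at compile time, has to carry the type.
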
## Code segment

The code segment contains the linear sequence of bytecode instructions of a peon program. It is to be read directly
and without modifications. The segment's size is fixed and is encoded at the beginning as a sequence of 3 bytes
(i.e. a single 24-bit integer).

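Reading a size-prefixed segment such as the code segment can be sketched as follows. The big-endian byte order and the name `read_code_segment` are assumptions made for illustration only.

```python
from __future__ import annotations


def read_code_segment(data: bytes, pos: int) -> tuple[bytes, int]:
    """Read the 3-byte size at `pos`, then that many bytes of code.

    Returns the code bytes and the offset of the first byte after the segment.
    """
    if pos + 3 > len(data):
        raise ValueError("truncated size prefix")
    size = int.from_bytes(data[pos:pos + 3], "big")  # 24-bit size, assumed big-endian
    start = pos + 3
    end = start + size
    if end > len(data):
        raise ValueError("code segment is shorter than its declared size")
    # The instructions are returned untouched, as the segment is read without modifications
    return data[start:end], end
```

A 3-byte size limits the code segment to 2^24 - 1 bytes. The segments that use a 4-byte size could be read the same way with a wider prefix.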