# Peon - Bytecode Specification
This document describes peon's bytecode and how it is serialized to, and deserialized from, files and
other file-like objects. Note that the segments in a bytecode dump appear in the order in which they are listed
in this document.
## Code Structure
A peon program is compiled into a tightly packed sequence of bytes that contains all the information
the VM needs to execute said program. The frontend and the backend depend on each other only through the
bytecode format (which is implemented in a separate serializer module) to allow for maximum modularity.
A peon bytecode file contains the following:
- Constants
- The program's code
- Debugging information (optional: file and version metadata, module info)
## File Headers
A peon bytecode file starts with the header, which is structured as follows:
- The literal string `PEON_BYTECODE`
- A 3-byte version number (the major, minor and patch version numbers of the compiler that generated the file)
- The branch name of the repository the compiler was built from, prepended with its length as a 1 byte integer
- The commit hash (encoded as a 40-byte hex-encoded string) in the aforementioned branch from which the compiler was built (particularly useful in development builds)
- An 8-byte UNIX timestamp (counted from the epoch, 1/1/1970 12:00 AM UTC) representing the exact date and time at which the file was generated
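
As an illustration of the layout above, here is a minimal Python sketch of a header reader. It is not peon's actual deserializer: the names `Header` and `parse_header` are hypothetical, and the sketch assumes one byte per version component, an ASCII branch name, and a big-endian timestamp (this document does not specify the byte order, so it is exposed as a parameter).

```python
from dataclasses import dataclass
from datetime import datetime, timezone

MAGIC = b"PEON_BYTECODE"  # literal marker that opens every bytecode file


@dataclass
class Header:
    version: tuple  # (major, minor, patch), assumed to be one byte each
    branch: str
    commit: str
    timestamp: datetime


def parse_header(data: bytes, byteorder: str = "big"):
    """Parse the file header; return it with the offset of the first byte after it.

    The byte order of the timestamp is an assumption, not part of the spec text.
    """
    if not data.startswith(MAGIC):
        raise ValueError("missing PEON_BYTECODE marker")
    pos = len(MAGIC)
    major, minor, patch = data[pos:pos + 3]  # 3-byte version number
    pos += 3
    branch_len = data[pos]  # 1-byte length prefix of the branch name
    pos += 1
    branch = data[pos:pos + branch_len].decode("ascii")
    pos += branch_len
    commit = data[pos:pos + 40].decode("ascii")  # 40 hex characters
    pos += 40
    seconds = int.from_bytes(data[pos:pos + 8], byteorder)  # 8-byte UNIX timestamp
    pos += 8
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return Header((major, minor, patch), branch, commit, stamp), pos
```

The returned offset marks where the next segment would start, so a reader can continue from there.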
## Debug information
The following segments contain extra information and metadata about the compiled bytecode to aid debugging, but they may be missing
in release builds.
### Line data segment
The line data segment maps each instruction in the code segment to a line number in the original source file,
using run-length encoding to keep the mapping compact, which makes debugging easier. The segment's
size is fixed and is encoded at the beginning as a sequence of 4 bytes (i.e. a single 32 bit integer). The data
in this segment can be decoded as explained in [this file](../src/frontend/compiler/targets/bytecode/opcodes.nim#L29), which is quoted
below:
```
[...]
## lines maps bytecode instructions to line numbers using Run
## Length Encoding. Instructions are encoded in groups whose structure
## follows the following schema:
## - The first integer represents the line number
## - The second integer represents the number of
## instructions on that line
## For example, if lines equals [1, 5], it means that there are 5 instructions
## at line 1, meaning that all instructions in code[0..4] belong to the same line.
## This is more efficient than using the naive approach, which would encode
## the same line number multiple times and waste considerable amounts of space.
[...]
```
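The lookup implied by this schema can be sketched in Python as follows. The function name `line_for_instruction` is hypothetical, and the sketch works on the already decoded list of integers (`[line, count, line, count, ...]`) rather than on raw bytes, since the width of each integer is not stated here.

```python
def line_for_instruction(lines, index):
    """Return the source line of the instruction at `index`.

    `lines` is the decoded run-length list: [line, count, line, count, ...].
    """
    if index < 0:
        raise IndexError("instruction index must be non-negative")
    remaining = index
    # Walk the (line, count) pairs, skipping whole runs until the index falls inside one.
    for i in range(0, len(lines) - 1, 2):
        line, count = lines[i], lines[i + 1]
        if remaining < count:
            return line
        remaining -= count
    raise IndexError(f"instruction {index} has no line information")
```

With `lines` equal to `[1, 5, 2, 3]`, instructions 0 through 4 resolve to line 1 and instructions 5 through 7 resolve to line 2.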
### Functions segment
This segment contains details about each function in the original file. The segment's size is fixed and is encoded at the
beginning as a sequence of 4 bytes (i.e. a single 32 bit integer). The data in this segment can be decoded as explained
in [this file](../src/frontend/compiler/targets/bytecode/opcodes.nim#L39), which is quoted below:
```
[...]
## functions encodes the following information:
## - Function name
## - Argument count
## - Function boundaries
## The encoding for functions is the following:
## - First, the position into the bytecode where the function begins is encoded (as a 3 byte integer)
## - Second, the position into the bytecode where the function ends is encoded (as a 3 byte integer)
## - After that follows the argument count as a 1 byte integer
## - Lastly, the function's name (optional) is encoded in ASCII, prepended with
## its size as a 2-byte integer
[...]
```
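A minimal Python sketch of a reader for this segment is shown below. `FunctionInfo` and `parse_functions` are hypothetical names, not part of peon. The sketch assumes big-endian integers, that the 4-byte size prefix counts only the bytes of the entries that follow it, and that a missing (optional) name is stored with a length of zero; none of these details is confirmed by this document.

```python
from dataclasses import dataclass


@dataclass
class FunctionInfo:
    start: int  # offset into the bytecode where the function begins
    end: int    # offset into the bytecode where the function ends
    argc: int   # argument count
    name: str   # empty when the optional name is absent (assumption)


def parse_functions(data: bytes, pos: int = 0, byteorder: str = "big"):
    """Decode the functions segment starting at `pos`; return (entries, next offset)."""
    size = int.from_bytes(data[pos:pos + 4], byteorder)  # 4-byte segment size
    pos += 4
    stop = pos + size  # assumes the size excludes the prefix itself
    functions = []
    while pos < stop:
        start = int.from_bytes(data[pos:pos + 3], byteorder)      # 3-byte start
        end = int.from_bytes(data[pos + 3:pos + 6], byteorder)    # 3-byte end
        argc = data[pos + 6]                                      # 1-byte argument count
        name_len = int.from_bytes(data[pos + 7:pos + 9], byteorder)  # 2-byte name size
        pos += 9
        name = data[pos:pos + name_len].decode("ascii")
        pos += name_len
        functions.append(FunctionInfo(start, end, argc, name))
    return functions, pos
```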
### Modules segment
This segment contains details about the modules that make up the original source code which produced a given bytecode dump.
The data in this segment can be decoded as explained in [this file](../src/frontend/compiler/targets/bytecode/opcodes.nim#L49), which is quoted below:
```
[...]
## modules contains information about all the peon modules that the compiler has encountered,
## along with their start/end offset in the code. Unlike other bytecode-compiled languages like
## Python, peon does not produce a bytecode file for each separate module it compiles: everything
## is contained within a single binary blob. While this simplifies the implementation and makes
## bytecode files entirely "self-hosted", it also means that the original module information is
## lost: this segment serves to fix that. The segment's size is encoded at the beginning as a 4-byte
## sequence (i.e. a single 32-bit integer) and its encoding is similar to that of the functions segment:
## - First, the position into the bytecode where the module begins is encoded (as a 3 byte integer)
## - Second, the position into the bytecode where the module ends is encoded (as a 3 byte integer)
## - Lastly, the module's name is encoded in ASCII, prepended with its size as a 2-byte integer
[...]
```
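To show the same layout from the writing side, here is a short Python sketch that encodes a modules segment. The function name `encode_modules` and the `(name, start, end)` tuple shape are hypothetical; big-endian byte order and a size prefix that excludes its own 4 bytes are assumptions.

```python
def encode_modules(modules, byteorder: str = "big") -> bytes:
    """Encode (name, start, end) tuples into a modules segment.

    Layout per entry: 3-byte start, 3-byte end, 2-byte name size, ASCII name.
    """
    body = bytearray()
    for name, start, end in modules:
        raw_name = name.encode("ascii")
        body += start.to_bytes(3, byteorder)
        body += end.to_bytes(3, byteorder)
        body += len(raw_name).to_bytes(2, byteorder)
        body += raw_name
    # The 4-byte size prefix is assumed to count only the entries that follow it.
    return len(body).to_bytes(4, byteorder) + bytes(body)
```

Because offsets are stored in 3 bytes, `to_bytes` raises `OverflowError` for any offset above 16,777,215, which is also the largest position this layout can represent.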
## Constant segment
The constant segment contains all the read-only values that the code will need at runtime, such as hardcoded
variable initializers or constant expressions. It is similar to the `.rodata` section of assembly files, although
the implementation is different. Constants are encoded as a linear sequence of bytes with no type information about
them whatsoever: it is the code that, at runtime, loads each constant (whose type is determined at compile time) onto
the stack accordingly. For example, a 32 bit integer constant would be encoded as a sequence of 4 bytes, which would
then be loaded by the appropriate `LoadInt32` instruction at runtime. The segment's size is fixed and is encoded at
the beginning as a sequence of 4 bytes (i.e. a single 32 bit integer). The constant segment may be empty, although in
real-world scenarios it likely won't be.
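
Because the segment carries no type information, the reader has to know what it is looking for. The Python sketch below shows how a 4-byte integer constant might be recovered from the raw bytes. The helper name `read_int32` is hypothetical, and addressing constants by byte offset, treating them as signed, and using big-endian byte order are all assumptions rather than documented behavior of `LoadInt32`.

```python
def read_int32(constants: bytes, offset: int, byteorder: str = "big") -> int:
    """Interpret 4 bytes of the constant segment at `offset` as a signed 32 bit integer."""
    chunk = constants[offset:offset + 4]
    if offset < 0 or len(chunk) != 4:
        raise IndexError("constant does not fit inside the segment")
    # Signedness and byte order are assumptions; the bytes themselves carry no type.
    return int.from_bytes(chunk, byteorder, signed=True)
```

The same 4 bytes could just as well be read as a float or as part of a string, which is why the type must be fixed at compile time.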
## Code segment
The code segment contains the linear sequence of bytecode instructions of a peon program, to be fed directly to
peon's virtual machine. The segment's size is fixed and is encoded at the beginning as a sequence of 3 bytes
(i.e. a single 24 bit integer). All the instructions are documented [here](../src/frontend/compiler/targets/bytecode/opcodes.nim).