Peon - Bytecode Specification

This document describes peon's bytecode and how it is serialized to and deserialized from files and other file-like objects.

Code Structure

A peon program is compiled into a tightly packed sequence of bytes that contains all the information the VM needs to execute it. The frontend and the backend depend on each other only through the bytecode format (which is implemented in a separate serializer module), which keeps the two as modular as possible.

A peon bytecode dump contains:

  • Constants
  • The bytecode itself
  • Debugging information
  • File and version metadata

Encoding

Header

A peon bytecode file starts with the header, which is structured as follows:

  • The literal string PEON_BYTECODE
  • A 3-byte version number (the major, minor and patch versions of the compiler that generated the file as per the SemVer versioning standard)
  • The branch name of the repository the compiler was built from, prefixed with its length as a 1-byte integer
  • The full hash (a 40-byte hex-encoded string) of the commit in the aforementioned branch from which the compiler was built (particularly useful in development builds)
  • An 8-byte UNIX timestamp (counted from the epoch, 1/1/1970 12:00 AM UTC) representing the exact date and time at which the file was generated
  • A 32-byte, hex-encoded SHA256 hash of the source file's content, used to track file changes
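
The header layout above can be sketched as a small reader. This is a minimal illustration, not peon's actual deserializer: the names `parse_header`, `Header` and `BYTE_ORDER` are hypothetical, the big-endian byte order of the timestamp is an assumption (the list above does not state it), and the branch name and commit hash are assumed to be ASCII. The 32-byte SHA256 field is returned as raw bytes without further interpretation.

```python
from dataclasses import dataclass

MARKER = b"PEON_BYTECODE"
BYTE_ORDER = "big"  # assumption: the spec does not state the byte order


@dataclass
class Header:
    version: tuple      # (major, minor, patch)
    branch: str
    commit_hash: str    # 40 hex characters
    timestamp: int      # UNIX timestamp
    source_hash: bytes  # 32-byte SHA256 field, left uninterpreted


def parse_header(data: bytes):
    """Return (Header, offset of the first byte after the header)."""
    pos = 0

    def take(n: int) -> bytes:
        # Consume exactly n bytes or fail loudly on truncated input.
        nonlocal pos
        chunk = data[pos:pos + n]
        if len(chunk) != n:
            raise ValueError("truncated header")
        pos += n
        return chunk

    if take(len(MARKER)) != MARKER:
        raise ValueError("missing PEON_BYTECODE marker")
    major, minor, patch = take(3)              # one byte per version component
    branch_len = take(1)[0]                    # 1-byte length prefix
    branch = take(branch_len).decode("ascii")  # assumption: ASCII branch names
    commit_hash = take(40).decode("ascii")
    timestamp = int.from_bytes(take(8), BYTE_ORDER)
    source_hash = take(32)
    header = Header((major, minor, patch), branch, commit_hash, timestamp, source_hash)
    return header, pos
```

The returned offset marks where the header ends, so a caller can continue reading whatever follows it from that position.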

Line data section

The line data section maps each instruction in the code section 1:1 to a line number in the original source file, which makes debugging easier. The mapping is compressed using run-length encoding. The section's size is fixed and is encoded at the beginning as a sequence of 4 bytes (i.e. a single 32-bit integer). The data in this section can be decoded as explained in this file, which is quoted below:

[...]
## lines maps bytecode instructions to line numbers using Run
## Length Encoding. Instructions are encoded in groups whose structure
## follows the following schema:
## - The first integer represents the line number
## - The second integer represents the count of whatever comes after it
##  (let's call it c)
## - After c, a sequence of c integers follows
##
## A visual representation may be easier to understand: [1, 2, 3, 4]
## This is to be interpreted as "there are 2 instructions at line 1 whose values
## are 3 and 4"
## This is more efficient than using the naive approach, which would encode
## the same line number multiple times and waste considerable amounts of space.
[...]
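
The grouping scheme described in the quoted comment can be expanded with a short loop. The sketch below is only an illustration: `decode_lines` is a hypothetical name, and it works on a list of already-decoded integers, since how each integer is laid out in bytes is not covered by the quoted comment.

```python
def decode_lines(lines):
    """Expand run-length groups into (instruction_value, line_number) pairs.

    Each group is: line number, count c, then c integers.
    """
    pairs = []
    i = 0
    while i < len(lines):
        if i + 1 >= len(lines):
            raise ValueError("group is missing its count")
        line_no = lines[i]
        count = lines[i + 1]
        values = lines[i + 2:i + 2 + count]
        if len(values) != count:
            raise ValueError("group is shorter than its count")
        # Every value in the group belongs to the same source line.
        pairs.extend((value, line_no) for value in values)
        i += 2 + count
    return pairs


# The example from the quoted comment: two instructions (3 and 4) at line 1.
print(decode_lines([1, 2, 3, 4]))  # [(3, 1), (4, 1)]
```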

Constant section

The constant section contains all the read-only values that the code needs at runtime, such as hardcoded variable initializers or constant expressions. It is similar to the .rodata section of assembly files, although the implementation is different. Constants are encoded as a linear sequence of bytes that carries no type information whatsoever: it is the code that, at runtime, loads each constant (whose type is determined at compile time) onto the stack accordingly. For example, a 32-bit integer constant is encoded as a sequence of 4 bytes, which is then loaded by the appropriate LoadInt32 instruction at runtime. The section's size is fixed and is encoded at the beginning as a sequence of 4 bytes (i.e. a single 32-bit integer). The constant section may be empty, although in real-world programs that is unlikely.
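
Because the constant bytes carry no type tags, it is the loading instruction that decides how many bytes to read and how to interpret them. The sketch below illustrates that idea for a 32-bit integer. It is not the VM's implementation: `load_int32` is a hypothetical helper, and both the big-endian byte order and the signed interpretation are assumptions not stated in this document.

```python
BYTE_ORDER = "big"  # assumption: the spec does not state the byte order


def load_int32(constants: bytes, offset: int) -> int:
    """Read the 4 bytes at `offset` and interpret them as a signed 32-bit integer."""
    raw = constants[offset:offset + 4]
    if len(raw) != 4:
        raise ValueError("constant runs past the end of the section")
    return int.from_bytes(raw, BYTE_ORDER, signed=True)


# Two untyped constants stored back to back. Only the caller knows that
# each one is a 32-bit integer and where it starts.
constants = (42).to_bytes(4, BYTE_ORDER, signed=True) + (-7).to_bytes(4, BYTE_ORDER, signed=True)
print(load_int32(constants, 0))  # 42
print(load_int32(constants, 4))  # -7
```

Reading the same 4 bytes with a different instruction would give a different value, which is why the type of each constant has to be fixed at compile time.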

Code section

The code section contains the linear sequence of bytecode instructions of a peon program. It is to be read directly and without modification. The section's size is fixed and is encoded at the beginning as a sequence of 3 bytes (i.e. a single 24-bit integer).
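
All three sections share the same shape: a fixed-width size prefix followed by that many bytes of payload. A minimal sketch of a reader for this shape follows. The name `read_section` is hypothetical, the big-endian byte order of the size prefix is an assumption, and the example blob places the sections in the order they are described in this document, which is also an assumption about the file layout.

```python
BYTE_ORDER = "big"  # assumption: the spec does not state the byte order


def read_section(data: bytes, offset: int, size_width: int):
    """Read a size-prefixed section; return (payload, offset after the section)."""
    size_end = offset + size_width
    if size_end > len(data):
        raise ValueError("truncated size prefix")
    size = int.from_bytes(data[offset:size_end], BYTE_ORDER)
    body_end = size_end + size
    if body_end > len(data):
        raise ValueError("section is shorter than its declared size")
    return data[size_end:body_end], body_end


# Hypothetical blob: 2 bytes of line data, an empty constant section and
# 3 bytes of code, each preceded by its size prefix.
blob = (
    (2).to_bytes(4, BYTE_ORDER) + b"\x01\x02"
    + (0).to_bytes(4, BYTE_ORDER)
    + (3).to_bytes(3, BYTE_ORDER) + b"abc"
)
line_data, pos = read_section(blob, 0, 4)    # 4-byte size prefix
constants, pos = read_section(blob, pos, 4)  # 4-byte size prefix
code, pos = read_section(blob, pos, 3)       # 3-byte size prefix
print(line_data, constants, code)
```

The 3-byte prefix limits the code section to 2^24 - 1 bytes, while the 4-byte prefixes allow up to 2^32 - 1 bytes for the other two sections.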