Updates to the bytecode spec. Started working a bit on the type system

Nocturn9x 2022-02-03 15:14:31 +01:00
parent 6e2070899b
commit 65053b7a05
9 changed files with 76 additions and 18 deletions


@ -2,18 +2,19 @@
## Rationale
This document aims to lay down a simple, extensible and linear format for serializing and deserializing
compiled JAPL's code to a buffer (be it an actual OS file or an in-memory stream).
compiled JAPL code to a file-like buffer.
Once a JAPL source file (i.e. one with a ".jpl" extension, without quotes) has been successfully compiled to bytecode, the compiler dumps the resulting linear stream of bytes to a ".japlc" file (without quotes, which stands for __JAPL C__ompiled), which we will call "object file" (without quotes) in this document. The name of the object file will be the same as that of the original source file, and its structure is rigorously described in this document.
The main reason to serialize bytecode to a file is for porting JAPL code to other machines, but also to avoid processing the same file every time if it hasn't changed, therefore using it as a sort of cache. If this cache-like behavior is abused though, it may lead to unexpected behavior, hence we define how the JAPL toolchain will deal with local object files.
The main reason to serialize bytecode to a file is for porting JAPL code to other machines, but also to avoid processing the same file every time if it hasn't changed, therefore using it as a sort of cache. If this cache-like behavior is abused though, it may lead to unexpected behavior, hence we define how the JAPL toolchain will deal with local object files. These object files are stored inside `~/.cache/japl` under *nix systems and `C:\Windows\Temp\japl` under Windows systems.
When JAPL finds an existing object file whose name matches that of the source file to be run, it will skip processing the source file and use the existing object file only if (Note: both filenames are stripped of their respective file extension):
When JAPL finds an existing object file whose name matches that of the source file to be run (both filenames are stripped of their respective file extension), it will skip processing the source file and use the existing object file if and only if:
- The object file has been produced by the same JAPL version as the running interpreter: the 3-byte version header, the branch name and the commit hash must be the same for this check to succeed
- The object file is not older than an hour (this delay can be customized with the `--cache-delay` option)
- The SHA256 checksum of the source file matches the SHA256 checksum contained in the object file
If any of those checks fails, the object file is discarded and subsequently replaced by an updated version after the compiler is done processing the source file again. Since none of those checks is absolutely bulletproof, a `--nocache` option can be provided to the JAPL executable to instruct it to neither load nor produce any object files.
If any of those checks fails, the object file is discarded and subsequently replaced by an updated version after the compiler is done processing the source file again (unless the `--nodump` switch is used, in which case no bytecode caching occurs). Since none of those checks is absolutely bulletproof, a `--nocache` option can be provided to the JAPL executable to instruct it not to load any existing object files.
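The three checks listed above can be sketched roughly as follows. This is only an illustration: the `CacheStamp` type, its field names and `isFresh` are hypothetical, and the sketch assumes the age of an object file is derived from a timestamp, which this section does not spell out. The default `maxAge` of one hour stands in for the value of `--cache-delay`.

```nim
import std/times

type
    Version = tuple[major, minor, patch: int]
    CacheStamp = object
        ## Hypothetical summary of the metadata compared by the cache checks
        version: Version
        branch: string
        commitHash: string
        compiledAt: Time          # assumption: the object file's creation time
        sourceChecksum: string    # hex-encoded SHA256 of the source file

proc isFresh(stamp, running: CacheStamp, now: Time,
             maxAge = initDuration(hours = 1)): bool =
    ## `stamp` comes from the object file, `running` describes the
    ## interpreter and the source file as they are right now
    let sameVersion = stamp.version == running.version
    let sameBranch = stamp.branch == running.branch
    let sameCommit = stamp.commitHash == running.commitHash
    if not (sameVersion and sameBranch and sameCommit):
        return false    # produced by a different JAPL build
    if now - stamp.compiledAt > maxAge:
        return false    # older than the allowed delay
    # Last check: the source file must not have changed
    result = stamp.sourceChecksum == running.sourceChecksum
```

If `isFresh` returns false, the object file would be discarded and regenerated, as described above.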
## Disclaimer
@ -30,7 +31,7 @@ __Note__: The conventions about number literals described in the document laying
## Compile-time type specifiers
To distinguish the different kinds of values that JAPL can represent at compile time, type specifiers are prepended to a given series of bytes to tell the deserializer what kind of object that specific sequence should deserialize into. It is important that each compile-time object specifies the size of its value in bytes using a 3-byte (aka 24 bit) integer (referred to as "size specifier" from now on, without quotes), after the type specifier. The following sections about object representation assume the appropriate type and size specifiers have been used and will therefore omit them to avoid repetition. Some types (such as singletons) are encoded with a dedicated bytecode instruction rather than as a constant (booleans, nan and inf are notable examples of this).
To distinguish the different kinds of values that JAPL can represent at compile time, type specifiers are prepended to a given series of bytes to tell the deserializer what kind of object that specific sequence should deserialize into. It is important that each compile-time object specifies the size of its value in bytes using a 3-byte (24 bit) integer (referred to as "size specifier" from now on, without quotes), after the type specifier. The following sections about object representation assume the appropriate type and size specifiers have been used and will therefore omit them to avoid repetition. Some types (such as singletons) are encoded with a dedicated bytecode instruction rather than as a constant (booleans, nan and inf are notable examples of this).
Below is a list of all type specifiers:
- `0x0` -> Identifier
@ -39,7 +40,7 @@ Below a list of all type specifiers:
- `0x3` -> List literal (A heterogeneous dynamic array)
- `0x4` -> Set literal (A heterogeneous and unordered dynamic array without duplicates. Mirrors the mathematical definition of a set)
- `0x5` -> Dictionary literal (An associative array, also known as mapping)
- `0x6` -> Tuple literal (A heterogeneous, static array)
- `0x6` -> Tuple literal (A heterogeneous static array)
- `0x7` -> Function declaration
- `0x8` -> Class declaration
- `0x9` -> Variable declaration. Note that constants are replaced during compilation with their corresponding literal value, therefore they are represented as literals in the constants section and are not compiled as variable declarations.
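As a rough illustration of how a type specifier and a size specifier could frame a value, consider the sketch below. It assumes the type specifier occupies a single byte and that the 3-byte size is stored most significant byte first; neither detail is stated in this section, and `frame` is a hypothetical helper.

```nim
proc frame(typeSpecifier: uint8, payload: string): string =
    ## Prepends a 1-byte type specifier (assumed width) and a 3-byte
    ## size specifier (assumed big-endian) to the raw payload
    doAssert payload.len < (1 shl 24), "payload does not fit in 24 bits"
    result.add(char(typeSpecifier))
    result.add(char((payload.len shr 16) and 0xFF))
    result.add(char((payload.len shr 8) and 0xFF))
    result.add(char(payload.len and 0xFF))
    result.add(payload)

# An identifier (type specifier 0x0) named "foo" takes 1 + 3 + 3 bytes
let framed = frame(0x0, "foo")
doAssert framed.len == 7
```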
@ -50,7 +51,7 @@ Below a list of all type specifiers:
### Numbers
For simplicity, numbers in object files are serialized as strings of decimal digits and optionally a dot followed by one or more decimal digits (for floats). The number `2.718`, for example, would just be serialized as the string `"2.718"` (without quotes). JAPL supports scientific notation such as `2e3`, but numbers in this form are collapsed to their decimal representation before being written to a file, therefore `2e3` becomes `2000.0`. Other number representations such as hexadecimal, binary and octal are also converted to base 10 during compilation (usually during the optimization process).
For simplicity, numbers in object files are serialized as strings of decimal digits and optionally a dot followed by one or more decimal digits (for floats). The number `2.718`, for example, would just be serialized as the string `"2.718"` (without quotes). JAPL supports scientific notation such as `2e3`, but numbers in this form are collapsed to their decimal representation before being written to a file, therefore `2e3` becomes `2000.0`. Other number representations such as hexadecimal, binary and octal are also converted to base 10 during compilation (or during the optimization process, if optimizations are enabled).
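A minimal sketch of this collapsing step is shown below. It is not the compiler's actual code: `collapseNumber` is a hypothetical helper, and the `0x`, `0b` and `0o` prefixes are assumed here for illustration. Very large or very small floats would need extra care, since Nim's `$` may print them in exponent form.

```nim
import std/strutils

proc collapseNumber(literal: string): string =
    ## Hypothetical helper: rewrites a numeric literal as a base 10 string
    let lower = literal.toLowerAscii()
    if lower.startsWith("0x"):
        result = $parseHexInt(lower)
    elif lower.startsWith("0b"):
        result = $parseBinInt(lower)
    elif lower.startsWith("0o"):
        result = $parseOctInt(lower)
    elif 'e' in lower:
        # Scientific notation always collapses to a float
        result = $parseFloat(lower)
    else:
        # Already a plain decimal literal
        result = literal

doAssert collapseNumber("2e3") == "2000.0"
doAssert collapseNumber("0xff") == "255"
```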
### Strings
@ -68,19 +69,20 @@ After the modifier follows the string encoded in UTF-8, __without__ quotes.
### List-like collections (sets, lists and tuples)
List-like collections (or _sequences_), namely sets, lists and tuples, encode their length first: for lists and sets this only denotes the _starting_ size of the container, while a tuple's size is fixed once it is created. The length may be 0, in which case the sequence is empty. After the length, which expresses the __number of elements__ in the collection (just the count!), follows a number of compile-time objects equal to the specified length, each with its respective encoding.
__TODO__: Currently the compiler does not emit constant instructions for collections using only constants: it will just emit a bunch of `LoadConstant` instructions and
then a `BuildList` opcode with the length of the container as argument, so this section and the one below it are currently neither relevant nor implemented.
### Mappings (or associative arrays)
Mappings (also called _associative arrays_ or, more informally, _dictionaries_) also encode their length first, but the difference lies in the element list that follows it: instead of there being n elements, with n being the length of the map, there are n _pairs_ (hence 2n elements) of objects that represent the key-value relation in the map.
## File structure
Once a JAPL source file (i.e. one with a ".jpl" extension, without quotes) has been successfully compiled to bytecode, the compiler dumps the resulting linear stream of bytes to a ".japlc" file (without quotes, which stands for __JAPL C__ompiled), which we will call "object file" (without quotes) in this document. The name of the object file will be the same as that of the original source file, and its structure is described below.
### File headers
An object file starts with the headers, namely:
- A 13-byte constant string with the value `"JAPL_BYTECODE"` (without quotes) encoded as a sequence of integers of the ASCII encoding of each character in the string
- A 13-byte constant string with the value `"JAPL_BYTECODE"` (without quotes) encoded as a sequence of integers, each corresponding to the value of one character in the ASCII table
- A 3-byte version header composed of 3 unsigned integers representing the major, minor and patch version of the compiler used to generate the file, respectively. JAPL follows the SemVer standard for versioning
- A string representing the branch name of the git repo from which JAPL was compiled, prepended with its size represented as a single 8-bit unsigned integer. Due to this encoding the branch name can't be longer than 255 bytes, which is a length deemed appropriate for this purpose
- A 40-byte hexadecimal string, pinpointing the version of the compiler down to the exact commit hash in the JAPL repository, particularly useful when testing development versions
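The four header fields listed above could be laid out as in the sketch below. `buildHeaders` is a hypothetical helper, not the serializer's real API, and it covers only the fields shown in this hunk; any headers that follow in the full document are left out.

```nim
const BYTECODE_MARKER = "JAPL_BYTECODE"

proc buildHeaders(version: tuple[major, minor, patch: int],
                  branch, commitHash: string): string =
    ## Hypothetical helper that concatenates the header fields in order
    doAssert branch.len <= 255, "the branch length must fit in one byte"
    doAssert commitHash.len == 40, "the commit hash must be 40 characters"
    result = BYTECODE_MARKER                # 13 ASCII bytes
    result.add(char(version.major))         # 3-byte version header
    result.add(char(version.minor))
    result.add(char(version.patch))
    result.add(char(branch.len))            # 1-byte length prefix
    result.add(branch)
    result.add(commitHash)                  # 40 hexadecimal characters

let headers = buildHeaders((major: 0, minor: 4, patch: 0), "master",
                           "fdfe87ad424ec80b0dad780e5dd2c78c22feaf59")
# 13 (marker) + 3 (version) + 1 (length) + 6 (branch) + 40 (hash)
doAssert headers.len == 63
```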
@ -98,5 +100,4 @@ After the headers and the constant section follows the code section, which store
### Modules
When compiling source files, one bytecode file is produced per source file. These bytecode dumps are stored inside `~/.cache` under *nix systems and `C:\Windows\Temp` under windows systems. Since JAPL allows explicit visibility specifiers that alter the way namespaces are built at runtime (and, partially, resolved at compile-time) by selectively
(not) exporting symbols to the outside world, these directives need to be specified in the bytecode file (TODO).
When compiling source files, one object file is produced per source file. Since JAPL allows explicit visibility specifiers that alter the way namespaces are built at runtime (and, partially, resolved at compile-time) by selectively exporting (or not exporting) symbols to other modules, these directives need to be specified in the bytecode file (TODO).


@ -16,7 +16,8 @@
{.experimental: "implicitDeref".}
import iterable
import ../../memory/allocator
import base
import baseObject
import strformat
@ -32,7 +33,7 @@ type
current: int
proc newArrayList*[T](): ptr ArrayList[T] =
proc newArrayList*[T]: ptr ArrayList[T] =
    ## Allocates a new, empty array list
    result = allocateObj(ArrayList[T], ObjectType.List)
    result.capacity = 0


@ -47,7 +47,7 @@ template allocateObj*(kind: untyped, objType: ObjectType): untyped =
cast[ptr kind](allocateObject(sizeof kind, objType))
proc newObj*(): ptr Obj =
proc newObj*: ptr Obj =
    ## Allocates a generic JAPL object
    result = allocateObj(Obj, ObjectType.BaseObject)
@ -58,3 +58,5 @@ proc asObj*(self: ptr Obj): ptr Obj =
result = cast[ptr Obj](self)
proc hash*(self: ptr Obj): uint64 = 0x123FFFF # Constant hash value
proc `$`*(self: ptr Obj): string = "<object>"


@ -15,21 +15,40 @@
import ../../memory/allocator
import ../../config
import base
import baseObject
import iterable
type
    Entry = object
        ## Low-level object to store key/value pairs.
        ## Using an extra value for marking the entry as
        ## a tombstone instead of something like detecting
        ## tombstones as entries with null keys but full values
        ## may seem wasteful. The thing is, though, that since
        ## we want to implement sets on top of this hashmap and
        ## the implementation of a set is *literally* a dictionary
        ## with empty values and keys as the elements, this would
        ## confuse our findEntry method and would force us to override
        ## it to account for a different behavior.
        ## Using a third field takes up more space, but saves us
        ## from the hassle of rewriting code
        key: ptr Obj
        value: ptr Obj
        tombstone: bool
    HashMap* = object of Iterable
        ## An associative array with O(1) lookup time,
        ## similar to nim's Table type, but using raw
        ## memory to be more compatible with JAPL's runtime
        ## memory management
        entries: ptr UncheckedArray[ptr Entry]
        # This attribute counts *only* non-deleted entries
        actual_length: int
proc newHashMap*(): ptr HashMap =
proc newHashMap*: ptr HashMap =
    ## Initializes a new, empty hashmap
    result = allocateObj(HashMap, ObjectType.Dict)
    result.actual_length = 0
    result.entries = nil
@ -38,6 +57,7 @@ proc newHashMap*(): ptr HashMap =
proc freeHashMap*(self: ptr HashMap) =
    ## Frees the memory associated with the hashmap
    discard freeArray(UncheckedArray[ptr Entry], self.entries, self.capacity)
    self.length = 0
    self.actual_length = 0
@ -46,17 +66,40 @@ proc freeHashMap*(self: ptr HashMap) =
proc findEntry(self: ptr UncheckedArray[ptr Entry], key: ptr Obj, capacity: int): ptr Entry =
    ## Low-level method used to find entries in the underlying
    ## array, returns a pointer to an entry
    var capacity = uint64(capacity)
    var idx = uint64(key.hash()) mod capacity
    while true:
        result = self[idx]
        if system.`==`(result.key, nil):
            # We found an empty bucket
            break
        elif result.tombstone:
            # We found a previously deleted
            # entry. In this case, we need
            # to make sure the tombstone
            # will get overwritten when the
            # user wants to add a new value
            # that would replace it, BUT also
            # for it to not stop our linear
            # probe sequence. Hence, if the
            # key of the tombstone is the same
            # as the one we're looking for,
            # we break out of the loop, otherwise
            # we keep searching
            if result.key == key:
                break
        elif result.key == key:
            # We were looking for a specific key and
            # we found it, so we also bail out
            break
        # If none of these conditions match, we have a collision!
        # This means we can just move on to the next slot in our probe
        # sequence until we find an empty slot. The way our resizing
        # mechanism works makes the empty slot invariant easy to
        # maintain since we increase the underlying array's size
        # before we are actually full
        idx = (idx + 1) mod capacity
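
The probing behavior described in the comments above can be tried out with a self-contained sketch. It is a simplification, not the code in this commit: it uses a `seq` instead of raw memory, integer keys whose hash is the key itself, and hypothetical names (`Slot`, `findSlot`, `put`, `remove`, `has`). Like the original, it relies on at least one bucket always being empty.

```nim
type
    Slot = object
        used: bool         # false: this bucket has never held a key
        tombstone: bool    # true: the entry was deleted, its key is kept
        key: int
        value: int

proc findSlot(slots: seq[Slot], key: int): int =
    ## Returns the bucket holding `key` (live or tombstone), or the first
    ## empty bucket in its probe sequence. Assumes non-negative keys
    result = key mod slots.len
    while true:
        if not slots[result].used or slots[result].key == key:
            break
        # Collision: a different key (live or deleted) lives here
        result = (result + 1) mod slots.len

proc put(slots: var seq[Slot], key, value: int) =
    let idx = findSlot(slots, key)
    slots[idx] = Slot(used: true, tombstone: false, key: key, value: value)

proc remove(slots: var seq[Slot], key: int) =
    let idx = findSlot(slots, key)
    if slots[idx].used:
        slots[idx].tombstone = true

proc has(slots: seq[Slot], key: int): bool =
    let slot = slots[findSlot(slots, key)]
    result = slot.used and not slot.tombstone

var slots = newSeq[Slot](8)
slots.put(1, 10)
slots.put(9, 90)    # 9 mod 8 == 1: collides with key 1, lands in bucket 2
slots.remove(1)     # bucket 1 becomes a tombstone
# The tombstone does not cut the probe sequence short
doAssert slots.has(9)
doAssert not slots.has(1)
```

Because the tombstone keeps its key, re-inserting key 1 would land in the same bucket and overwrite it.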


@ -11,9 +11,10 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Implementation of iterable types and iterators in JAPL
import base
import baseObject
type


@ -16,12 +16,22 @@ import strformat
const BYTECODE_MARKER* = "JAPL_BYTECODE"
const MAP_LOAD_FACTOR* = 0.75 # Load factor for builtin hashmaps
when MAP_LOAD_FACTOR >= 1.0:
    {.fatal: "Hashmap load factor must be < 1".}
const HEAP_GROW_FACTOR* = 2 # How much extra memory to allocate for dynamic arrays and garbage collection when resizing
when HEAP_GROW_FACTOR <= 1:
    {.fatal: "Heap growth factor must be > 1".}
const MAX_STACK_FRAMES* = 800 # The maximum number of stack frames at any one time. Acts as a recursion limiter (1 frame = 1 call)
when MAX_STACK_FRAMES <= 0:
    {.fatal: "The frame limit must be > 0".}
const JAPL_VERSION* = (major: 0, minor: 4, patch: 0)
const JAPL_RELEASE* = "alpha"
const JAPL_COMMIT_HASH* = "fdfe87ad424ec80b0dad780e5dd2c78c22feaf59"
when len(JAPL_COMMIT_HASH) != 40:
    {.fatal: "The git commit hash must be exactly 40 characters long".}
const JAPL_BRANCH* = "master"
when len(JAPL_BRANCH) > 255:
    {.fatal: "The git branch name's length must be less than or equal to 255 characters".}
const DEBUG_TRACE_VM* = false # Traces VM execution
const SKIP_STDLIB_INIT* = false # Skips stdlib initialization (can be imported manually)
const DEBUG_TRACE_GC* = false # Traces the garbage collector (TODO)