Began deserializing the constants table and the code section of bytecode files. Minor fixes to debugger and bytecode.nim

2021-12-17 17:54:22 +01:00 · 2021-12-17 17:54:22 +01:00 · ed13304809
parent 195045e4f2
commit ed13304809
8 changed files with 164 additions and 42 deletions
--- a/docs/bytecode.md
+++ b/docs/bytecode.md
@@ -30,7 +30,7 @@ __Note__: The conventions about number literals described in the document laying

 ## Compile-time type specifiers

-To distinguish the different kinds of values that JAPL can represent at compile time, type specifiers are prepended to a given series of bytes to tell the deserializer what kind of object that specific sequence should deserialize into. It is important that each compile-time object specifies the size of its value in bytes (referred to as "size specifier" from now on, without quotes), after the type specifier. The following sections about object representation assume the appropriate type and size specifiers have been used and will therefore omit them to avoid repetition. Some types (such as singletons) do not need a size specifier as they're only one byte long: these cases are an exception rather than the rule and are explicitly marked as such in this document.
+To distinguish the different kinds of values that JAPL can represent at compile time, type specifiers are prepended to a given series of bytes to tell the deserializer what kind of object that specific sequence should deserialize into. It is important that each compile-time object specifies the size of its value in bytes using a 3-byte (aka 24 bit) integer (referred to as "size specifier" from now on, without quotes), after the type specifier. The following sections about object representation assume the appropriate type and size specifiers have been used and will therefore omit them to avoid repetition. Some types (such as singletons) do not need a size specifier as they're only one byte long: these cases are an exception rather than the rule and are explicitly marked as such in this document.

 Below a list of all type specifiers:

@@ -39,15 +39,16 @@ Below a list of all type specifiers:
 - `0xF` -> nil*
 - `0xA` -> nan*
 - `0xB` -> inf*
-- `0x01` -> Number
-- `0x02` -> String
-- `0x03` -> List literal (An heterogeneous dynamic array)
-- `0x04` -> Set literal  (An heterogeneous and unordered dynamic array without duplicates. Mirrors the mathematical definition of a set)
-- `0x05` -> Dictionary literal  (An associative array, also known as mapping)
-- `0x06` -> Tuple literal (An heterogeneous, static array)
-- `0x07` -> Function declaration
-- `0x08` -> Class declaration
-- `0x09` -> Variable declaration. Note that constants are replaced during compilation with their corresponding literal value, therefore they are represented as literals in the constants section and are not compiled as variable declarations.
+- `0x0` -> Identifier
+- `0x1` -> Number
+- `0x2` -> String
+- `0x3` -> List literal (An heterogeneous dynamic array)
+- `0x4` -> Set literal  (An heterogeneous and unordered dynamic array without duplicates. Mirrors the mathematical definition of a set)
+- `0x5` -> Dictionary literal  (An associative array, also known as mapping)
+- `0x6` -> Tuple literal (An heterogeneous, static array)
+- `0x7` -> Function declaration
+- `0x8` -> Class declaration
+- `0x9` -> Variable declaration. Note that constants are replaced during compilation with their corresponding literal value, therefore they are represented as literals in the constants section and are not compiled as variable declarations.
 - `0x10` -> Lambda declarations (aka anonymous functions)


@@ -57,7 +58,7 @@ __Note__: The types whose name is followed by an asterisk require no size specif

 ### Numbers

-For simplicity purposes, numbers in object files are serialized as strings of decimal digits and optionally a dot followed by 1 or more decimal digits (for floats). The number `2.718`, for example, would just be serialized as the string `"2.718"` (without quotes). JAPL supports scientific notation such as `2e3`, but numbers in this form are collapsed to their decimal representation before being written to a file, therefore `2e3` becomes `2000.0`. Other decimal number representations such as hexadecimal, binary and octal are also converted to base 10 during compilation.
+For simplicity purposes, numbers in object files are serialized as strings of decimal digits and optionally a dot followed by 1 or more decimal digits (for floats). The number `2.718`, for example, would just be serialized as the string `"2.718"` (without quotes). JAPL supports scientific notation such as `2e3`, but numbers in this form are collapsed to their decimal representation before being written to a file, therefore `2e3` becomes `2000.0`. Other number representations such as hexadecimal, binary and octal are also converted to base 10 during compilation (usually during the optimization process).

 ### Strings

@@ -92,17 +93,18 @@ An object file starts with the headers, namely:
 - A string representing the branch name of the git repo from which JAPL was compiled, prepended with its size represented as a single 8-bit unsigned integer. Due to this encoding the branch name can't be longer than 255 characters, which is a length deemed appropriate for this purpose
 - A 40 bytes hexadecimal string, pinpointing the version of the compiler down to the exact commit hash in the JAPL repository, particularly useful when testing development versions
 - An 8 byte (64 bit) UNIX timestamp (starting from the Unix Epoch of January 1st 1970 at 00:00), representing the date and time when the file was created
-- A 32 bytes SHA256 checksum of the source file's contents, used to track file changes
+- A 32 byte SHA256 checksum of the source file's contents, used to track file changes

 ### Constant section

-This section of the file follows the headers and is meant to store all constants needed upon startup by the JAPL virtual machine. For example, the code `var x = 1;` would have the number one as a constant. Constants are just an ordered sequence of compile-time types as described in the sections above.
+This section of the file follows the headers and is meant to store all constants needed upon startup by the JAPL virtual machine. For example, the code `var x = 1;` would have the number one as a constant. Constants are just an ordered sequence of compile-time types as described in the sections above. The constant section's end is marked with
+the byte `0x59`.

 ### Code section

-After the headers and the constant section follows the code section, which stores the actual bytecode instructions the compiler has emitted. They're encoded as a linear sequence of bytes.
+After the headers and the constant section follows the code section, which stores the actual bytecode instructions the compiler has emitted. They're encoded as a linear sequence of bytes. The code section's size is fixed and is encoded as a 3-byte (24 bit) integer right after the constant section's end marker, limiting the code section of a bytecode file to at most 16777215 bytes (2^24 - 1).

 ### Modules

 When compiling source files, one bytecode file is produced per source file. These bytecode dumps are stored inside `~/.cache` under *nix systems and `C:\Windows\Temp` under Windows systems. Since JAPL allows explicit visibility specifiers that alter the way namespaces are built at runtime (and, partially, resolved at compile-time) by selectively
-(not) exporting symbols to the outside world, these directives need to be specified in the bytecode file
+(not) exporting symbols to the outside world, these directives need to be specified in the bytecode file (TODO).
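The 3-byte size specifier and the `0x59` end marker described in the docs change above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from `src/util/multibyte.nim`: `toTripleLE`, `fromTripleLE` and `encodeNumber` are hypothetical names, and the little-endian byte order is an assumption (`bytesToInt` in this commit copies raw memory, so the byte order it expects follows the host).

```nim
# Hypothetical helpers for illustration only; not JAPL's own toTriple/fromTriple.

proc toTripleLE(value: int): array[3, byte] =
  ## Encodes a value in 0..16777215 as 3 bytes, least significant byte first
  doAssert value >= 0 and value <= 0xFFFFFF, "value does not fit in 24 bits"
  result = [byte(value and 0xFF),
            byte((value shr 8) and 0xFF),
            byte((value shr 16) and 0xFF)]

proc fromTripleLE(data: openArray[byte]): int =
  ## Decodes the first 3 bytes of data, least significant byte first
  doAssert data.len >= 3, "need at least 3 bytes"
  result = int(data[0]) or (int(data[1]) shl 8) or (int(data[2]) shl 16)

proc encodeNumber(lexeme: string): seq[byte] =
  ## Type specifier 0x1, then the 3-byte size specifier, then the digits
  result.add(0x1'u8)
  result.add(toTripleLE(lexeme.len))
  for c in lexeme:
    result.add(byte(c))

# A constants section holding the single number 2.718
var table = encodeNumber("2.718")
table.add(0x59'u8)  # end marker of the constants section
doAssert fromTripleLE(table[1..3]) == 5  # "2.718" is 5 bytes long
```

A 24-bit size specifier tops out at 16777215, which is where the limit on the code section's size comes from.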
--- a/src/backend/lexer.nim
+++ b/src/backend/lexer.nim
@@ -225,7 +225,7 @@ proc match(self: Lexer, what: string): bool =
 proc createToken(self: Lexer, tokenType: TokenType) =
    ## Creates a token object and adds it to the token
    ## list
-    var tok: Token
+    var tok: Token = new(Token)
    tok.kind = tokenType
    tok.lexeme = self.source[self.start..<self.current]
    tok.line = self.line
--- a/src/backend/meta/ast.nim
+++ b/src/backend/meta/ast.nim
@@ -611,6 +611,8 @@ proc newClassDecl*(name: ASTNode, body: ASTNode,


 proc `$`*(self: ASTNode): string = 
+    if self == nil:
+        return "nil"
    case self.kind:
        of intExpr, floatExpr, hexExpr, binExpr, octExpr, strExpr, trueExpr, falseExpr, nanExpr, nilExpr, infExpr:
            if self.kind in {trueExpr, falseExpr, nanExpr, nilExpr, infExpr}:
--- a/src/backend/meta/bytecode.nim
+++ b/src/backend/meta/bytecode.nim
@@ -192,7 +192,7 @@ const argumentDoubleInstructions* = {PopN, }
 # Jump instructions jump at relative or absolute bytecode offsets
 const jumpInstructions* = {JumpIfFalse, JumpIfFalsePop, JumpForwards, JumpBackwards, 
                           LongJumpIfFalse, LongJumpIfFalsePop, LongJumpForwards,
-                           LongJumpBackwards}
+                           LongJumpBackwards, JumpIfTrue, LongJumpIfTrue}

 # Collection instructions push a built-in collection type onto the stack
 const collectionInstructions* = {BuildList, BuildDict, BuildSet, BuildTuple}
--- a/src/backend/meta/token.nim
+++ b/src/backend/meta/token.nim
@@ -70,7 +70,7 @@ type
    EndOfFile


-  Token* = object
+  Token* = ref object
    ## A token object
    kind*: TokenType
    lexeme*: string
@@ -78,4 +78,8 @@ type
    pos*: tuple[start, stop: int]


-proc `$`*(self: Token): string = &"Token(kind={self.kind}, lexeme={$(self.lexeme).escape()}, line={self.line}, pos=({self.pos.start}, {self.pos.stop}))"
+proc `$`*(self: Token): string =
+  if self != nil:
+    result = &"Token(kind={self.kind}, lexeme={$(self.lexeme).escape()}, line={self.line}, pos=({self.pos.start}, {self.pos.stop}))"
+  else:
+    result = "nil"
--- a/src/backend/serializer.nim
+++ b/src/backend/serializer.nim
@@ -14,8 +14,9 @@
 import meta/ast
 import meta/errors
 import meta/bytecode
+import meta/token
 import ../config
-
+import ../util/multibyte

 import strformat
 import strutils
@@ -49,11 +50,13 @@ proc `$`*(self: Serialized): string =

 proc error(self: Serializer, message: string) =
    ## Raises a formatted SerializationError exception
-    raise newException(SerializationError, &"A fatal error occurred while serializing '{self.filename}' -> {message}")
+    raise newException(SerializationError, &"A fatal error occurred while (de)serializing '{self.filename}' -> {message}")


-proc initSerializer*(): Serializer =
+proc initSerializer*(self: Serializer = nil): Serializer =
    new(result)
+    if self != nil:
+        result = self
    result.file = ""
    result.filename = ""
    result.chunk = nil
@@ -84,6 +87,10 @@ proc bytesToInt(self: Serializer, input: array[8, byte]): int =
    copyMem(result.addr, input.unsafeAddr, sizeof(int))


+proc bytesToInt(self: Serializer, input: array[3, byte]): int =
+    copyMem(result.addr, input.unsafeAddr, sizeof(byte) * 3)
+
+
 proc extend[T](s: var seq[T], a: openarray[T]) =
    ## Extends s with the elements of a
    for e in a:
@@ -105,12 +112,13 @@ proc writeHeaders(self: Serializer, stream: var seq[byte], file: string) =
    stream.extend(self.toBytes(computeSHA256(file)))


-proc writeConstants(self: Serializer, chunk: Chunk, stream: var seq[byte]) =
-    for constant in chunk.consts:
+proc writeConstants(self: Serializer, stream: var seq[byte]) =
+    ## Writes the constants table in-place into the given stream
+    for constant in self.chunk.consts:
        case constant.kind:
            of intExpr, floatExpr:
                stream.add(0x1)
-                stream.add(byte(len(constant.token.lexeme)))
+                stream.extend(len(constant.token.lexeme).toTriple())
                stream.extend(self.toBytes(constant.token.lexeme))
            of strExpr:
                stream.add(0x2)
@@ -128,12 +136,11 @@ proc writeConstants(self: Serializer, chunk: Chunk, stream: var seq[byte]) =
                    else:
                        strip = 2
                        stream.add(0x0)
-                stream.add(byte(len(constant.token.lexeme) - offset))  # Removes the quotes from the length count as they're not written
+                stream.extend((len(constant.token.lexeme) - offset - 1).toTriple())  # Excludes the prefix and both quotes from the length count, matching the bytes written below
                stream.add(self.toBytes(constant.token.lexeme[offset..^2]))
            of identExpr:
-                stream.add(0x2)
                stream.add(0x0)
-                stream.add(byte(len(constant.token.lexeme)))
+                stream.extend(len(constant.token.lexeme).toTriple())
                stream.add(self.toBytes(constant.token.lexeme))
            of trueExpr:
                stream.add(0xC)
@@ -147,12 +154,104 @@ proc writeConstants(self: Serializer, chunk: Chunk, stream: var seq[byte]) =
                stream.add(0xB)
            else:
                self.error(&"unknown constant kind in chunk table ({constant.kind})")
+    stream.add(0x59)  # End marker


-proc writeCode(self: Serializer, chunk: Chunk, stream: var seq[byte]) =
+proc readConstants(self: Serializer, stream: seq[byte]): int =
+    ## Reads the constant table from the given stream and
+    ## adds each constant to the chunk object (note: most compile-time
+    ## information such as the original token objects and line info is lost when
+    ## serializing the data, so those fields are set to nil or some default
+    ## value). Returns the number of bytes that were processed in the stream
+    var stream = stream
+    var count: int = 0
+    while true:
+        case stream[0]:
+            of 0x59:
+                inc(count)
+                break
+            of 0x2:
+                stream = stream[1..^1]
+                var s = newStrExpr(Token(lexeme: ""))
+                # writeConstants emits the string modifier before the size specifier
+                case stream[0]:
+                    of 0x0:
+                        discard
+                    of 0x1:
+                        s.token.lexeme.add("b")
+                    of 0x2:
+                        s.token.lexeme.add("f")
+                    else:
+                        self.error(&"unknown string modifier in chunk table (0x{stream[0].toHex()})")
+                stream = stream[1..^1]
+                let size = self.bytesToInt([stream[0], stream[1], stream[2]])
+                stream = stream[3..^1]
+                s.token.lexeme.add("\"")
+                s.token.lexeme.add(self.bytesToString(stream[0..<size]))
+                s.token.lexeme.add("\"")
+                discard self.chunk.addConstant(s)
+                stream = stream[size..^1]  # Skips the string's contents
+                inc(count, size + 5)
+            of 0x1:
+                stream = stream[1..^1]
+                inc(count)
+                let size = self.bytesToInt([stream[0], stream[1], stream[2]])
+                stream = stream[3..^1]
+                inc(count, 3)
+                var tok: Token = new(Token)
+                tok.lexeme = self.bytesToString(stream[0..<size])
+                if "." in tok.lexeme:
+                    tok.kind = Float
+                    self.chunk.consts.add(newFloatExpr(tok))
+                else:
+                    tok.kind = Integer
+                    self.chunk.consts.add(newIntExpr(tok))
+                stream = stream[size..^1]
+                inc(count, size)
+            of 0x0:
+                stream = stream[1..^1]
+                let size = self.bytesToInt([stream[0], stream[1], stream[2]])
+                stream = stream[3..^1]
+                discard self.chunk.addConstant(newIdentExpr(Token(lexeme: self.bytesToString(stream[0..<size]))))
+                stream = stream[size..^1]  # Skips the identifier's name
+                inc(count, size + 4)
+            of 0xC:
+                discard self.chunk.addConstant(newTrueExpr(nil))
+                stream = stream[1..^1]
+                inc(count)
+            of 0xD:
+                discard self.chunk.addConstant(newFalseExpr(nil))
+                stream = stream[1..^1]
+                inc(count)
+            of 0xF:
+                discard self.chunk.addConstant(newNilExpr(nil))
+                stream = stream[1..^1]
+                inc(count)
+            of 0xA:
+                discard self.chunk.addConstant(newNaNExpr(nil))
+                stream = stream[1..^1]
+                inc(count)
+            of 0xB:
+                discard self.chunk.addConstant(newInfExpr(nil))
+                stream = stream[1..^1]
+                inc(count)
+            else:
+                self.error(&"unknown constant kind in chunk table (0x{stream[0].toHex()})")
+    result = count
+
+
+proc writeCode(self: Serializer, stream: var seq[byte]) =
    ## Writes the bytecode from the given chunk to the given source
    ## stream
-    stream.extend(chunk.code)
+    stream.extend(self.chunk.code.len.toTriple())
+    stream.extend(self.chunk.code)
+
+
+proc readCode(self: Serializer, stream: seq[byte]): int =
+    ## Reads the bytecode from the given stream and writes
+    ## it into self.chunk. Returns the number of bytes that
+    ## were processed in the stream
+    let size = [stream[0], stream[1], stream[2]].fromTriple()
+    var stream = stream[3..^1]
+    for i in countup(0, int(size) - 1):
+        self.chunk.code.add(stream[i])
+    assert len(self.chunk.code) == int(size)
+    return int(size) + 3  # The 3-byte size specifier was consumed too


 proc dumpBytes*(self: Serializer, chunk: Chunk, file, filename: string): seq[byte] =
@@ -162,15 +261,17 @@ proc dumpBytes*(self: Serializer, chunk: Chunk, file, filename: string): seq[byt
    self.filename = filename
    self.chunk = chunk
    self.writeHeaders(result, self.file)
-    self.writeConstants(chunk, result)
-    self.writeCode(chunk, result)
+    self.writeConstants(result)
+    self.writeCode(result)


 proc loadBytes*(self: Serializer, stream: seq[byte]): Serialized =
    ## Loads the result from dumpBytes to a Serializer object
    ## for use in the VM or for inspection
+    discard self.initSerializer()
    new(result)
    result.chunk = newChunk()
+    self.chunk = result.chunk
    var stream = stream
    try:
        if stream[0..<len(BYTECODE_MARKER)] != self.toBytes(BYTECODE_MARKER):
@@ -187,10 +288,14 @@ proc loadBytes*(self: Serializer, stream: seq[byte]): Serialized =
        result.compileDate = self.bytesToInt([stream[0], stream[1], stream[2], stream[3], stream[4], stream[5], stream[6], stream[7]])
        stream = stream[8..^1]
        result.fileHash = self.bytesToString(stream[0..<32]).toHex().toLowerAscii()
-        result.chunk = newChunk()
-
+        stream = stream[32..^1]
+        stream = stream[self.readConstants(stream)..^1]
+        stream = stream[self.readCode(stream)..^1]
+        
    except IndexDefect:
        self.error("truncated bytecode file")
+    except AssertionDefect:
+        self.error("corrupted bytecode file")
    


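The reading loop added in `readConstants` reslices the stream after every field, which makes it easy to forget one advance. A minimal sketch of the same idea with a single cursor is shown below. It is not the serializer's code: `Constant` and `readTable` are hypothetical stand-ins for JAPL's AST nodes and reader, the size specifier is assumed to be little-endian, and only identifiers (`0x0`), numbers (`0x1`) and the singletons are handled (strings are left out for brevity).

```nim
type Constant = object  # hypothetical stand-in for JAPL's AST nodes
  tag: byte             # type specifier read from the stream
  lexeme: string        # empty for singletons

proc readTable(stream: openArray[byte]): tuple[consts: seq[Constant], consumed: int] =
  ## Reads constants until the 0x59 end marker and reports how many
  ## bytes were consumed, end marker included
  var pos = 0
  while true:
    if pos >= stream.len:
      raise newException(ValueError, "truncated constants table")
    let tag = stream[pos]
    inc(pos)
    case tag
    of 0x59:  # end marker
      break
    of 0x0, 0x1:  # identifier or number: 3-byte size, then the lexeme
      if pos + 3 > stream.len:
        raise newException(ValueError, "truncated size specifier")
      let size = int(stream[pos]) or (int(stream[pos + 1]) shl 8) or
                 (int(stream[pos + 2]) shl 16)
      inc(pos, 3)
      if pos + size > stream.len:
        raise newException(ValueError, "truncated constant")
      var lexeme = newString(size)
      for i in 0 ..< size:
        lexeme[i] = char(stream[pos + i])
      inc(pos, size)
      result.consts.add(Constant(tag: tag, lexeme: lexeme))
    of 0xA, 0xB, 0xC, 0xD, 0xF:  # singletons carry no size specifier
      result.consts.add(Constant(tag: tag, lexeme: ""))
    else:
      raise newException(ValueError, "unknown constant kind")
  result.consumed = pos
```

Because every branch moves `pos` past exactly what it read, the returned count can be used directly to find the start of the code section.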
--- a/src/main.nim
+++ b/src/main.nim
@@ -106,15 +106,24 @@ proc main() =

            serialized = serializer.loadBytes(serializedRaw)
            echo "Deserialization step:"
-            echo &"\t\t- File hash: {serialized.fileHash} (matches: {computeSHA256(source).toHex().toLowerAscii() == serialized.fileHash})"
-            echo &"\t\t- JAPL version: {serialized.japlVer.major}.{serialized.japlVer.minor}.{serialized.japlVer.patch} (commit {serialized.commitHash[0..8]} on branch {serialized.japlBranch})"
-            stdout.write("\t\t")
+            echo &"\t- File hash: {serialized.fileHash} (matches: {computeSHA256(source).toHex().toLowerAscii() == serialized.fileHash})"
+            echo &"\t- JAPL version: {serialized.japlVer.major}.{serialized.japlVer.minor}.{serialized.japlVer.patch} (commit {serialized.commitHash[0..8]} on branch {serialized.japlBranch})"
+            stdout.write("\t")
            echo &"""- Compilation date & time: {fromUnix(serialized.compileDate).format("d/M/yyyy HH:mm:ss")}"""
+            stdout.write(&"\t- Reconstructed constants table: [")
+            for i, e in serialized.chunk.consts:
+                stdout.write(e)
+                if i < len(serialized.chunk.consts) - 1:
+                    stdout.write(", ")
+            stdout.write("]\n")
+            stdout.write(&"\t- Reconstructed bytecode: [")
+            for i, e in serialized.chunk.code:
+                stdout.write($e)
+                if i < len(serialized.chunk.code) - 1:
+                    stdout.write(", ")
+            stdout.write("]\n")
        except:
-            raise
            echo &"A Nim runtime exception occurred: {getCurrentExceptionMsg()}"
-            continue
-


 when isMainModule:
--- a/src/util/debugger.nim
+++ b/src/util/debugger.nim
@@ -109,9 +109,9 @@ proc jumpInstruction(instruction: OpCode, chunk: Chunk, offset: int): int =
    ## Debugs jumps
    var jump: int
    case instruction:
-        of JumpIfFalse, JumpIfFalsePop, JumpForwards, JumpBackwards:
+        of JumpIfFalse, JumpIfTrue, JumpIfFalsePop, JumpForwards, JumpBackwards:
            jump = [chunk.code[offset + 1], chunk.code[offset + 2]].fromDouble().int()
-        of LongJumpIfFalse, LongJumpIfFalsePop, LongJumpForwards, LongJumpBackwards:
+        of LongJumpIfFalse, LongJumpIfTrue, LongJumpIfFalsePop, LongJumpForwards, LongJumpBackwards:
            jump = [chunk.code[offset + 1], chunk.code[offset + 2], chunk.code[offset + 3]].fromTriple().int()
        else:
            discard  # Unreachable