diff --git a/04.internals.md b/04.internals.md index efb968084..2ea6a31eb 100644 --- a/04.internals.md +++ b/04.internals.md @@ -8,545 +8,277 @@ permalink: /internals/ {:toc} # High-Level Design -![High-Level Design]({{ site.baseurl }}/img/engines_high_level_design.jpg){: class="thumbnail center-block img-responsive" } +![High-Level Design]({{ site.baseurl }}/img/engines_high_level_design.png){: class="thumbnail center-block img-responsive" } -On the diagram above is shown interaction of major components of software system: Parser and Runtime. Parser performs translation of input ECMAScript application into the byte-code with the specified format (refer to [Bytecode](#byte-code) and [Parser](#parser) page for details). Prepared bytecode is executed by Runtime engine that performs interpretation (refer to [Virtual Machine](#virtual-machine) and [ECMA](#ecma) pages for details). +On the diagram above is shown interaction of major components of JerryScript: Parser and Virtual Machine (VM). Parser performs translation of input ECMAScript application into the byte-code with the specified format (refer to [Bytecode](#byte-code) and [Parser](#parser) page for details). Prepared bytecode is executed by the Virtual Machine that performs interpretation (refer to [Virtual Machine](#virtual-machine) and [ECMA](#ecma) pages for details). # Parser -The parser is implemented as recursive descent parser. The parser does not build any type of Abstract Syntax Tree. It converts the source JavaScript code directly into the byte-code. +The parser is implemented as a recursive descent parser. The parser converts the JavaScript source code directly into byte-code without building an Abstract Syntax Tree. The parser depends on the following subcomponents. 
-The parser consists of three major parts: -- lexer -- parser -- opcodes dumper -- syntax errors checker -- serializer +## Lexer -These four (except the parser itself) components are initialized during `parser_init` call (jerry-core/parser/js/parser.cpp). +The lexer splits input string (ECMAScript program) into sequence of tokens. It is able to scan the input string not only forward, but it is possible to move to an arbitrary position. The token structure described by structure `lexer_token_t` in `./jerry-core/parser/js/js-lexer.h`. -This initializer requires two following subsystems to be initialized: memory allocator and serializer. The need for allocator is clear. The serializer resets internal bytecode_data structure(jerry-core/parser/js/bytecode-data.h). Currently bytecode_data is singleton. During parsing it is filled by the data which is needed for the further execution: +## Scanner -* Byte-code - array of opcodes (`bytecode_data.opcodes`). -* Literals - array of literals (`bytecode_data.literals`). -* Strings buffer (`bytecode_data.strings_buffer`) - literals of type `LIT_STR` contain pointers to strings, which are located in this buffer. +Scanner (`./jerry-core/parser/js/js-parser-scanner.h`) pre-scans the input string to find certain tokens. For example, scanner determines whether the keyword `for` defines a general for or a for-in loop. Reading tokens in a while loop is not enough because a slash (`/`) can indicate the start of a regular expression or can be a division operator. -The following is brief review of the mentioned components. See more concise description in the following chapters. +## Expression Parser -* Lexer -The lexer splits input file (given as the first parameter of the parser_init call) into sequence of tokens. These tokens are then matched on demand. -* Opcodes dumper -This component does necessary checks and preparations, and dumps opcodes using serializer. 
-* Serializer -The serializer puts opcodes, prepared by the dumper, to a continuous array that represents current scope's code. Also it provides API for accessing byte-code. -* Syntax error checker -This is bunch of simple die-on-error checks. +Expression parser is responsible for parsing JavaScript expressions. It is implemented in `./jerry-core/parser/js/js-parser-expr.c`. -After initialization `parser_parse_program` (`./jerry-core/js/parser.cpp`) should be called. This function performs the following steps (so-called parsing steps) for all scopes (global code and functions): +## Statement Parser -1. Initialize a scope. -2. Do pre-parser stage. -3. Parse the scope code. +JavaScript statements are parsed by this component. It uses the [Expression parser](#expression-parser) to parse the constituent expressions. The implementation of Statement parser is located in `./jerry-core/parser/js/js-parser-statm.c`. -After every scope is processed, parser merges all scopes into the single byte-code array. +Function `parser_parse_source` carries out the parsing and compiling of the input ECMAScript source code. When a function appears in the source, `parser_parse_source` calls `parser_parse_function`, which is responsible for processing the source code of functions recursively, including argument parsing and context handling. After the parsing, function `parser_post_processing` dumps the created opcodes and returns an `ecma_compiled_code_t*` that points to the compiled bytecode sequence. -Two new entities were introduced - scopes and pre-parser. +The interactions between the major components are shown in the following figure. -* There are two types of scopes in the parser: global scope and function declaration scope. Notice that function expressions do not create a new scope in terms of the parser. The reason why is described below. Parsing process starts on global scope.
If a function declaration occurs string the process, new scope is created, this new scope is pushed to a stack of current scopes; then steps 1-3 of parsing are performed. Note, that only global scope parsing shall merge all scopes into a byte-code. All scopes are stored in a tree to represent a hierarchy of them. -* Pre-parser. This step performs hoisting of variable declarations. First, it dumps `reg_var_decl` opcodes. Then it goes through the script and looks for variable declaration lists. For every found variable in the scope (not in a sub-scope or function expression) it dumps var_decl opcode. After this step byte-code in the scope starts with optional `'use strict'` marker, then `reg_var_decl` and several (optional) `var_decls`. +![Parser dependency]({{ site.baseurl }}/img/parser_dependency.png){: class="thumbnail center-block img-responsive" } -Due to some limitations of the parser, some parsing functions take `this_arg` and/or `prop` as parameters. They are further used to dump `prop_setter` opcode. During parsing all necessary data is stored in either stacks or scope trees. After parsing of the whole program, the parser merges all scopes into a single byte-code, hoisting function declarations in process. This task, so-called post-parser, is performed by `scopes_tree_raw_data` (jerry-core/js/scopes-tree.c) function. For the further information about post-parser, check opcodes dumper section. +# Byte-code -### Lexer +This section describes the compact byte-code (CBC) byte-code representation. The key focus is reducing memory consumption of the byte-code representation without sacrificing considerable performance. Other byte-code representations often focus on performance only so inventing this representation is an original research. -The lexer splits input string into the set of tokens. 
The token structure (`./jerry-core/parser/js/lexer.h`) consists of three elements: token type, location of the token and optional data: +CBC is a CISC like instruction set which assigns shorter instructions for frequent operations. Many instructions represent multiple atomic tasks which reduces the byte code size. This technique is basically a data compression method. -{% highlight cpp %} -typedef struct -{ - locus loc; - token_type type; - literal_index_t uid; -} -token; -{% endhighlight %} +## Compiled code format -Location of token (`locus`). It is just an index of the first token's character at a string that represents the program. Token types are listed in lexer.h header file (`token_type` enum). Depending on token type, token specific data (`uid` field) has the different meaning. +The memory layout of the compiled byte code is the following. + +![CBC layout]({{ site.baseurl }}/img/CBC_layout.png){: class="thumbnail center-block img-responsive" } + +The header is a `cbc_compiled_code` structure with several fields. These fields contain the key properties of the compiled code. + +The literals part is an array of ecma values. These values can contain any EcmaScript value types, e.g. strings, numbers, function and regexp templates. The number of literals is stored in the `literal_end` field of the header. + +CBC instruction list is a sequence of byte code instructions which represents the compiled code. + +## Byte-code Format + +The memory layout of a byte-code is the following: + +![byte-code layout]({{ site.baseurl }}/img/opcode_layout.png){: class="thumbnail center-block img-responsive" } + +Each byte-code starts with an opcode. The opcode is one byte long for frequent and two byte long for rare instructions. The first byte of the rare instructions is always zero (`CBC_EXT_OPCODE`), and the second byte represents the extended opcode. The name of common and rare instructions start with `CBC_` and `CBC_EXT_` prefix respectively. 
+ +The maximum number of opcodes is 511, since 255 common (zero value excluded) and 256 rare instructions can be defined. Currently around 230 frequent and 120 rare instructions are available. + +There are three types of bytecode arguments in CBC: + + * __byte argument__: A value between 0 and 255, which often represents the argument count of call like opcodes (function call, new, eval, etc.). + + * __literal argument__: An integer index which is greater than or equal to zero and less than the `literal_end` field of the header. For further information see the next section, Literals. + + * __relative branch__: A 1-3 byte long offset. The branch argument might also represent the end of an instruction range. For example the branch argument of `CBC_EXT_WITH_CREATE_CONTEXT` shows the end of a with statement. More precisely the position after the last instruction. + +Argument combinations are limited to the following seven forms: + +* no arguments +* a literal argument +* a byte argument +* a branch argument +* a byte and a literal argument +* two literal arguments +* three literal arguments + +## Literals + +Literals are organized into groups which represent various literal types. Having these groups consumes less space than assigning flag bits to each literal. +(In the following, the mentioned ranges represent those indices which are greater than or equal to the left side and less than the right side of the range. For example a range between `ident_end` and `literal_end` fields of the byte-code header contains those indices, which are greater than or equal to `ident_end` +and less than `literal_end`. If `ident_end` equals `literal_end` the range is empty.) + +The two major groups of literals are _identifiers_ and _values_. + + * __identifier__: A named reference to a variable. Literals between zero and `ident_end` of the header belong here. All of these literals must be a string or undefined.
Undefined can only be used for those literals which cannot be accessed by a literal name. For example `function (arg,arg)` has two arguments, but the `arg` identifier only refers to the second argument. In such cases the name of the first argument is undefined. Furthermore optimizations such as *CSE* may also introduce literals without name. + + * __value__: A reference to an immediate value. Literals between `ident_end` and `const_literal_end` are constant values such as numbers or strings. These literals can be used directly by the Virtual Machine. Literals between `const_literal_end` and `literal_end` are template literals. A new object needs to be constructed each time when their value is accessed. These literals are functions and regular expressions. + +There are two other sub-groups of identifiers. *Registers* are those identifiers which are stored in the function call stack. *Arguments* are those registers which are passed by a caller function. + +There are two types of literal encoding in CBC. Both are variable length, where the length is one or two byte long. + + * __small__: maximum 511 literals can be encoded. + +One byte encoding for literals 0 - 254. + +```c +byte[0] = literal_index +``` + +Two byte encoding for literals 255 - 510. + +```c +byte[0] = 0xff +byte[1] = literal_index - 0xff +``` + + * __full__: maximum 32767 literal can be encoded. + +One byte encoding for literals 0 - 127. + +```c +byte[0] = literal_index +``` + +Two byte encoding for literals 128 - 32767. + +```c +byte[0] = (literal_index >> 8) | 0x80 +byte[1] = (literal_index & 0xff) +``` + +Since most functions require less than 255 literal, small encoding provides a single byte literal index for all literals. Small encoding consumes less space than full encoding, but it has a limited range. + +## Byte-code Categories + +Byte-codes can be placed into four main categories. + +### Push Byte-codes + +Byte-codes of this category serve for placing objects onto the stack. 
As there are many instructions representing multiple atomic tasks in CBC, there are also many instructions for pushing objects onto the stack according to the number and the type of the arguments. The following table lists a few of these opcodes with a brief description.
-Token type | 'uid' meaning -TOK_KEYWORD | Keyword id, like KW_DO, KW_CONST, etc. (see 'keyword' enum in lexer.h). -TOK_NAME, TOK_STRING, TOK_NUMBER | Literal index in the stack of literals. -TOK_BOOL | 0 - 'false'
1 - 'true' -TOK_SMALL_INT | Value of small integer (0-255). -Other (punctuators) | Not used. +| byte-code | description | +| CBC_PUSH_LITERAL | Pushes the value of the given literal argument. | +| CBC_PUSH_TWO_LITERALS | Pushes the value of the given two literal arguments. | +| CBC_PUSH_UNDEFINED | Pushes an undefined value. | +| CBC_PUSH_TRUE | Pushes a logical true. | +| CBC_PUSH_PROP_LITERAL | Pushes a property whose base object is popped from the stack, and the property name is passed as a literal argument. |
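The small and full literal encodings described in the Literals section can be sketched in C as follows. This is only an illustration of the byte layouts quoted above, not the engine's actual code: the `cbc_sketch_*` function names are hypothetical, and returning 0 for an out-of-range index is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes literal_index in the "small" encoding.
 * Returns the number of bytes written (1 or 2), or 0 if out of range. */
size_t cbc_sketch_encode_small (uint16_t literal_index, uint8_t *out)
{
  assert (out != NULL);
  if (literal_index < 0xff)
  {
    out[0] = (uint8_t) literal_index; /* literals 0 - 254 */
    return 1;
  }
  if (literal_index <= 510)
  {
    out[0] = 0xff; /* marker for the two byte form */
    out[1] = (uint8_t) (literal_index - 0xff); /* literals 255 - 510 */
    return 2;
  }
  return 0;
}

/* Hypothetical helper: reads a "small" encoded index, returns bytes consumed. */
size_t cbc_sketch_decode_small (const uint8_t *in, uint16_t *literal_index)
{
  assert (in != NULL && literal_index != NULL);
  if (in[0] != 0xff)
  {
    *literal_index = in[0];
    return 1;
  }
  *literal_index = (uint16_t) (0xff + in[1]);
  return 2;
}

/* Hypothetical helper: writes literal_index in the "full" encoding.
 * Returns the number of bytes written (1 or 2), or 0 if out of range. */
size_t cbc_sketch_encode_full (uint16_t literal_index, uint8_t *out)
{
  assert (out != NULL);
  if (literal_index < 0x80)
  {
    out[0] = (uint8_t) literal_index; /* literals 0 - 127 */
    return 1;
  }
  if (literal_index <= 0x7fff)
  {
    out[0] = (uint8_t) ((literal_index >> 8) | 0x80); /* high bits plus flag */
    out[1] = (uint8_t) (literal_index & 0xff); /* low byte */
    return 2;
  }
  return 0;
}

/* Hypothetical helper: reads a "full" encoded index, returns bytes consumed. */
size_t cbc_sketch_decode_full (const uint8_t *in, uint16_t *literal_index)
{
  assert (in != NULL && literal_index != NULL);
  if ((in[0] & 0x80) == 0)
  {
    *literal_index = in[0];
    return 1;
  }
  *literal_index = (uint16_t) (((in[0] & 0x7f) << 8) | in[1]);
  return 2;
}
```

In the small encoding, index 254 still fits into one byte while index 255 needs the `0xff` marker byte. In the full encoding the top bit of the first byte selects the two byte form.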
-Token matching algorithm is straightforward - look at the first character of the new token, recognize the type, and then just match the rest. Comments and space characters (except new line) are ignored, so they produce no token. The algorithm uses two pointers: buffer and token_start. The first one points to the next character of the input, the other one points to the first character of token, being matched, so-called current token. +### Call Byte-codes -The lexer remembers two tokens during scan: current and previously seen. It also allows buffering one token to be rescanned (`lexer_save_token`) and setting scan position to any location in the file (`lexer_seek`). +The byte-codes of this category perform calls in different ways. -The parser uses lexer to scan file two times - during pre-parsing and parsing stages. +
-Currently the lexer does not support any encoding except ASCII. Also the lexer does not support regular expressions. - -### Opcodes dumper - -It is a quite high level wrapper for the serializer. It was introduced to split functionality of parsing and dumping opcodes. To understand how opcodes dumper works, one should be acquainted with the byte-code layout (see the corresponding description). - -The main data structure of the dumper is an operand (jerry-core/parser/js/opcodes-dumper.h). Operand can represent either variable (i.e. literal) or temporary register (tmp). The most annoying thing of the dumper is a difference between these types. - -Byte-code is divided into blocks of fixed size (`BLOCK_SIZE` in jerry-core/parser/js/bytecode-data.h) and each block has independent encoding of variable names, which are represented by 8 bit numbers - uids. -Operands are encoded as uids in each opcode (see the `opcode_t` structure). -As byte-code decomposition into blocks is not possible until parsing is finished, uids can't be calculated on the fly. Therefore literal operands are encoded by literal indexes (`literal_index_t` - index in the global literals array) during parsing. In the post-parser stage these indexes are converted to block specific uids. - -During parsing scopes tree structure is constructed (see `scopes_tree_int` in the jerry-core/parser/js/scopes-tree.h). Each tree node comprises of its byte-code and list of child scopes. While final byte-code is the plain array of `opcode_t` structures, byte-code in tree nodes is represented by the list of `op_meta` structures. Op\_meta structure wraps `opcode_t` with an array of 3 values (result, operand_1 and operand_2), which holds literal indexes, so that literal operands could be encoded. - -In each dump\_\* function (jerry-core/parser/js/opcodes-dumper.h) the dumper checks for the operand type and dumps appropriate op\_meta to the scopes tree using serializer. 
The dumper also keeps opcode counters of rewritable opcodes inside a bunch of stacks. It dumps an op\_meta and pushed an opcodes counter of the op\_meta to a stack in functions with a name like dump\_\*\_for\_rewrite, then pops an opcode counter from the stack, retrieves op\_meta by the dematerializer and rewrites necessary fields of opcodes in functions with names like rewrite\_\*. - -The post-parser merges scopes into a single byte-code. For each scope it first dumps a header of the scope, which consists of optional func_decl with function_end opcode pair, optional ‘use strict’ marker, `reg_var_decl` and optional `var_decls`. Then it recursively dumps sub-scopes. Finally, it dumps the remainder of opcodes. The byte-code is split into blocks with fixed size; each block has its own counter of literals. While dumping opcodes the post-parser replaces LITERAL_TO_REWRITE markers with this counter’s value. - -### Serializer - -Serializer dumps literals collected by the lexer to bytecode_data, is used by the dumper to dump or rewrite op_metas to a current scope. - -### Syntax Errors Checker - -This component is just checks for syntax errors defined in the specification. It uses stacks to store necessary data, for example arguments names. - -# Byte-code -Every instruction of bytecode consists of opcode and up to three operands. Operand (idx) can be either a "register" or a string -literal, specifying identifier to evaluate (i.e. `var //Storage idx`). General structure of instruction is shown on the picture. - -
- -| opcode | idx | idx | idx | +| byte-code | description | +| CBC_CALL0 | Calls a function without arguments. The return value won't be pushed onto the stack. | +| CBC_CALL1 | Calls a function with one argument. The return value won't be pushed onto the stack. | +| CBC_CALL | Calls a function with n arguments. n is passed as a byte argument. The return value won't be pushed onto the stack. | +| CBC_CALL0_PUSH_RESULT | Calls a function without arguments. The return value will be pushed onto the stack. | +| CBC_CALL1_PUSH_RESULT | Calls a function with one argument. The return value will be pushed onto the stack. | +| CBC_CALL2_PROP | Calls a property function with two arguments. The base object, the property name, and the two arguments are on the stack. |
-
Special kinds of instructions are described below. +### Arithmetic, Logical, Bitwise and Assignment Byte-codes -## Arithmetic/bitwise-logic/logic/comparison/shift -Arithmetic instruction can have the following structure: +The opcodes of this category perform arithmetic, logical, bitwise and assignment operations according to the different -
+
-|opcode|dst|left|right| - -
-
- -|opcode|dst|value|-| +| byte-code | description | +| CBC_LOGICAL_NOT | Negates the logical value that is popped from the stack. The result is pushed onto the stack. | +| CBC_LOGICAL_NOT_LITERAL | Negates the logical value that is given as a literal argument. The result is pushed onto the stack. | +| CBC_ADD | Adds two values that are popped from the stack. The result is pushed onto the stack. | +| CBC_ADD_RIGHT_LITERAL | Adds two values. The left one is popped from the stack, the right one is given as a literal argument. | +| CBC_ADD_TWO_LITERALS | Adds two values. Both are given as literal arguments. | +| CBC_ASSIGN | Assigns a value to a property. It has three arguments: base object, property name, value to assign. | +| CBC_ASSIGN_PUSH_RESULT | Assigns a value to a property. It has three arguments: base object, property name, value to assign. The result will be pushed onto the stack. |
-where `dst`/`left`/`right`/`value` identify an operand. +### Branch Byte-codes -## Control (jumps) -Control instructions utilize two bytes to encode jump location. Destination offset is contained inside `offset_high` and `offset_low` fields. +Branch byte-codes are used to perform conditional and unconditional jumps in the byte-code. The arguments of these instructions are 1-3 byte long relative offsets. The number of bytes is part of the opcode, so each byte-code with a branch argument has three forms. The direction (forward, backward) is also defined by the opcode since the offset is an unsigned value. Thus, certain branch instructions have six forms. Some examples can be found in the following table. -
+
-|opcode|offset-high|offset-low|-| - -
-
- -|opcode|cond value|offset-high|offset-low| +| byte-code | description | +| CBC_JUMP_FORWARD | Jumps forward by the 1 byte long relative offset argument. | +| CBC_JUMP_FORWARD_2 | Jumps forward by the 2 byte long relative offset argument. | +| CBC_JUMP_FORWARD_3 | Jumps forward by the 3 byte long relative offset argument. | +| CBC_JUMP_BACKWARD | Jumps backward by the 1 byte long relative offset argument. | +| CBC_JUMP_BACKWARD_2 | Jumps backward by the 2 byte long relative offset argument. | +| CBC_JUMP_BACKWARD_3 | Jumps backward by the 3 byte long relative offset argument. | +| CBC_BRANCH_IF_TRUE_FORWARD | Jumps if the value on the top of the stack is true by the 1 byte long relative offset argument. |
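How a 1-3 byte unsigned branch offset could be read and applied is sketched below. The text above only states the offset length and that the direction comes from the opcode, so two things here are assumptions made for illustration: the bytes are read most significant first, and the offset is taken relative to the position of the branch instruction itself. The `cbc_sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reads an unsigned offset stored in `length` bytes
 * (1, 2 or 3). Big-endian byte order is an assumption of this sketch. */
uint32_t cbc_sketch_read_branch_offset (const uint8_t *bytes, size_t length)
{
  assert (bytes != NULL);
  assert (length >= 1 && length <= 3);

  uint32_t offset = 0;
  for (size_t i = 0; i < length; i++)
  {
    offset = (offset << 8) | bytes[i];
  }
  return offset;
}

/* Hypothetical helper: computes the jump target. The direction is not part
 * of the offset; it is supplied by the opcode (forward or backward form). */
size_t cbc_sketch_branch_target (size_t instr_pos, uint32_t offset, bool forward)
{
  if (!forward)
  {
    assert (offset <= instr_pos); /* a backward jump cannot pass position 0 */
    return instr_pos - offset;
  }
  return instr_pos + offset;
}
```

Because the offset is unsigned, a `_BACKWARD` opcode subtracts the same kind of value that a `_FORWARD` opcode adds, which is why no sign bit is needed in the argument.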
-Condition jump checks `cond value` field, which identifies an operand, and performs a jump if the operand has `true` value. +# Virtual Machine -## Assignment +The virtual machine is an interpreter which executes byte-code instructions one by one. The function that starts the interpretation is `vm_run` in `./jerry-core/vm/vm.c`. `vm_loop` is the main loop of the virtual machine, which has the peculiarity that it is *non-recursive*. This means that on a function call it does not call itself recursively but returns, which has the benefit that it does not burden the stack the way a recursive implementation would. -Assignment instructions perform assignment of immediate value (contained inside instruction) to the operand, which is marked as `idx` on the picture. - -
- -|op_assignment|dst|type|value| - -
- -where -`dst` - "storage idx", identifies where to store the value -`type` - specifies value type -`value` - depends on type field - -Type of the immediate value is encoded in the `type` field of instruction. The following values are supported: -- "simple value" (see ECMA types encoding) -- small integer/negative small integer -- number literal/negative number literal -- sring value, initialized by string literal ("literal idx") -- "Srorage idx" - -## Exit - -Exit instruction serves to stop the execution and exit with a specified status. - -
- -|op_exit|status(0/1)|-|-| - -
- -Exit instruction is employed in following cases: -- at script end (exit with "succesful" stats); -- in script assertion fail handling code (exit with "fail" status) - -## Native call (intrinsic call) - -Native call instruction is used to call intrinsics. Arguments are not encoded directly inside this instruction, instead they follow it as special "meta" instructions (see the according section). Id of desired intrinsic is encoded in the `intrinsic id` field. - -
- -|op_native_call|dst|intrinsic_id|arg_list| - -
- -where -`dst` - "storage idx" -`arg_list` - number of arguments - -## Function call/Constructor call - -Function/constructor call are utilized to perform calls to functions and constructors. Destination operand is encoded in `dst` field. Operand `name_idx` specifies the name of the function to call. Arguments are encoded the same way as in the native call instruction. - -
- -|opcode|dst|name_idx|arg_list| - -
- -where -`dst` - "storage idx" -`name_idx` - "storage idx" (which value to call) -`arg_list` - number of arguments - -## Function declaration - -Function declarations are represented by the special kind of instructions. Function name and number of arguments are located in `name_idx` and `arg_list` fields respectively. - -
- -|opcode|name_idx|arg_list|-| - -
- -where -`name_idx` - literal idx -`arg_list` - number of arguments - -## Function expression - -Very similar to function declaration. But additionally contains destination (`dst`) field and `name` operand is optional, because anonymous functions are possible. - -
- -|opcode|dst|name_idx|arg_list| - -
- -where -`dst` - "storage idx" -`name_idx` - literal idx (can be unspecified for anonymos function expression) -`arg_list` - number of arguments - -## Return from function/eval - -Return instructions perfrom unconditional return from function/eval code. Return value can be specified (`idx` field). - -
- -|op_ret|-|-|-| - -
- -
- -|op_retval|idx|-|-| - -
- -where -`idx` - "storage idx" - -## "Meta" (special marker opcode) - -
- -|op_meta|type|arg1|arg2| - -
- -Meta instructions are usually utilized as continuations of other instructions. Depending on `type` field, meta instruction can have the following meaning: - -- 'this' argument (for calls in a.f() form, a = this), put right after call opcode -- `varg` (encodes an argument for calls and array declarations (`arg1` - storage idx) / parameters name for function decl/expr (`arg1` - literal idx, i.e. string)) -- carg_prop_data / varg_prop_getter / varg_prop_setter - name (literal idx) and value/getter/setter (storage idx) of a property (see also: object declaration) -- end_with / function_end / end_of_try_catch_finally - end offset of 'with' block/function/try_catch_finally sequence -- catch / finally - start of catch/finally block and offset to the end of the block -- strict code - placed at the start of a scope's code if the source code contains 'use strict' at the beginning - -## Delete - -JavaScript delete operator is represented with delete instruction in the bytecode. There are two types of delete instruction, applied either to element of lexical environment or to object's property. - -
- -|op_delete_var|dst|name|-| - -
-
- -|op_delete_prop|dst|base_value|name| - -
- -where -`dst` - "storage idx" -`name` - literal idx -`base_value` - "storage idx" - -## This binding (evaluate "this") -This binding instruction writes value of "this" to the `dst` operand. - -
- -|op_this|dst|-|-| - -
- -where -`dst` - "storage idx" - -## typeof (typeof operation) - -Typeof instruction executes JavaScript operator with the same name. Result is written to the `dst` operand. - -
- -|op_typeof|dst|value|-| - -
- -where -`dst` and `value` - "storage idx" - -## with block - -To specify bounds of "with" block, a pair of instructions is used. "With" instruction specifies its start. - -
- -| op_with | value | - | - | - -
- -where -`value` - "storage idx" (evaluated expression - argument of with) - -Followed by a number of arbitrary instructions, the block ends with `end_with` meta instruction. - -
- -|op_with| -|| -|| -|...| -|op_meta (end_with)| - -
- - -## try block - -Try block consists of try instruction, followed by a number of arbitrary instructions, meta instruction `catch` or `finally` or both of them, separating catch and finally blocks respectively and meta instruction `end_try_catch_finally`, which finishes the whole construction. - -
- -| op_try_block | offset_high | offset_low | - | - -
- -where -`offset_high` and `offset_low` - offset of the end of try block - -
- -|op_try_block| -|...| -|op_meta (catch)| -|...| -|op_meta (finally)| -|...| -|op_meta (end_try_catch_finally)| - -
- -## Object declaration - -Obect declaration instruction represents object literal in JavaScript specification. It consists of `op_obj_decl` instruction, followed by the list of `prop_data`, `prop_getter` and `prop_setter` meta instructions. A series of instructions which evaluate property values can precede meta instructions. Number of meta instructions, e.g. number of properties, is specified in the `prop_num` field. - -
- -| op_obj_decl | dst | prop_num | - | - -
- -where -`dst` - "storage idx" (where to save the created object) -`prop_num` - number of properties - -
- -|op_obj_decl| -|...
(intermediate evaluation of value/function expression, etc.)| -|op_meta (prop_data/ prop_getter/ prop_setter)| - -
- -## Arguments and array declarartion - -The strategy descibed in previous section is also used for encoding of arguments in function/constructor calls and elements in array declarations. -See the according pictures. - -
- -| op_with | value | - | - | - -
- -where -`value` - "storage idx" (evaluated expression - argument of with) - -
- -|op_with| -|| -|| -|...| -|op_meta (end_with)| - -
- -# Virtual machine - -Virtual machine executes bytecode by interpreting instructions one by one. Bytecode is a continuous array of instructions, divided into blocks of fixed size. Main loop of interpreter calls `opfunc_*` for every instruction. This function returns completion value and position of the next instruction. - -![Bytecode storage]({{ site.baseurl }}/img/bytecode_storage.jpg){: class="thumbnail center-block img-responsive" } - -Instruction can have up to three operands which are represented by `idx` values. Meaning of `idx` value depends on opcode and can be the following: - -- id of a temporary variable (register) -- id of literal (quiried form serializer, specific to every block of bytecode) -- type of assigned value, id of number/string literal or simple value in `op_assignment` -- type of meta and corresponding arguments in `op_meta` -- idx pair may represent opcode position - -During the execution every function of the source code has associated -interpreter context, which consists of the following items: - -- current position (byte-code instruction to execute) -- 'this' binding (ecma-value) -- lexical environment -- `is_strict` flag (is current execution code strict) -- `is_eval_code_lag` (is current execution mode eval) -- `min_reg_num`, `max_reg_num` - range of `idx`'s used for "registers" -- stack frame (array of "register" values) - -Main routines of the virtual machine are: - -- `run_int` - starts execution of Global code (main program). -- `run_int_from_pos` - executes specified code scope - (global/function/eval), expects the following arguments: starting - position, 'this' binding, lexical environment. -- `run_int_loop` - interpretation loop. # ECMA ECMA component of the engine is responsible for the following notions: -- Data representation -- Runtime representation -- GC + +* Data representation +* Runtime representation +* Garbage collection (gc) ## Data representation -The major structure for data representation is `ECMA_value`. 
Lower two bits of this structure encode value tag, which determines the type of the value: +The major structure for data representation is `ECMA_value`. The lower two bits of this structure encode value tag, which determines the type of the value: * simple * number * string * object -![ECMA value representation]({{ site.baseurl }}/img/ecma_value.jpg){: class="thumbnail center-block img-responsive" } +![ECMA value representation]({{ site.baseurl }}/img/ecma_value.png){: class="thumbnail center-block img-responsive" } -The immediate value is placed in higher bits. "Simple value" is an enumeration, which consists of the following elements: -- undefined -- null -- true -- false -- empty -- array_redirect (implementation defined, currently unused, for array storage optimization) +In case of number, string and object the value contains an encoded pointer. +Simple value is a pre-defined constant which can be: -For other value types higher bits of `ECMA_value` structure contain compressed pointer to the real value. +* undefined +* null +* true +* false +* empty (uninitialized value) + +For other value types the higher bits of `ECMA_value` structure contains compressed pointer to the real value. ### Compressed pointers -Compressed pointers were introduced to save heap space. They are possible because heap size is currently limited by 256 KB, which requires 18 bits to cover it. ECMA values in heap are aligned by 8 bytes and this allows to save three more bits, so that compressed pointer consumes 15 bits only. - -![Heap and ECMA elements]({{ site.baseurl }}/img/ecma_compressed.jpg){: class="thumbnail center-block img-responsive" } +Compressed pointers were introduced to save heap space. +![Compressed Pointer]({{ site.baseurl }}/img/ecma_compressed.png){: class="thumbnail center-block img-responsive" } ECMA data elements are allocated in pools (pools are allocated on heap) Chunk size of the pool is 8 bytes (reduces fragmentation). 
### Number -There are two possible representation of numbers: -- 4-byte (float, compact profile - no memory consumption, but hardware limitations) -- 8-byte (double, full profile) +There are two possible representations of numbers according to the IEEE 754 standard: + +* 4-byte (float, compact profile) +* 8-byte (double, full profile) + +![Number]({{ site.baseurl }}/img/number.png){: class="thumbnail center-block img-responsive" } Several references to single allocated number are not supported. Each reference holds its own copy of a number. ### String -String values are encoded by 8-byte structure, which contains the following fields: - -- references counter - each stack (and non_stack) reference is counted (upon overflow, string is duplicated) -- is_stack_allocated - some temporary strings are stack_allocated to reduce loading of memory (perf) -- container - type of actual string storage/encoding -- hash - hash, calculated from two last characters (for faster comparison (perf)) -- literal identifier - actual string is in the literal storage -- magic_string_id - string is equal to one of engine's magic strings -- uint32 - string is represented with unsigned integers (useful for array indexing) -- number_cp (compressed pointer to number) - string is represented with floating point number -- collection_cp - string is stored in one or several pool's chunks (see also: chars collection, collection header, collection chunk) -- concatenation_1_cp, concatenation_2_cp - pointers to two strings (parts of concatenation) - ### Object / Lexical environment Object and lexical environment structures, 8 bytes each, have common (GC) header: -- Stack refs counter -- Next object/lexical environment in list of objects/lexical environments -- GC's visited flag -- is_lexenv flag + * Stack refs counter + * Next object/lexical environment in list of objects/lexical environments + * GC's visited flag + * is_lexenv flag -Remaining fields of these structures are different and are shown on the
picture. +Remaining fields of these structures are different and are shown on the figure below. ![Object/Lexicat environment structures]({{ site.baseurl }}/img/ecma_object.jpg){: class="thumbnail center-block img-responsive" } ### Property of an object / description of a lexical environment variable While objects comprise of properties, lexical environments consist of variables. Both of these units are tied up into lists. Unit types could be different: -- named data (property or variable) -- named accessor (property) -- internal (implementation defined) + * named data (property or variable) + * named accessor (property) + * internal (implementation defined) All these units occupy 8 bytes and have common header: -- type - 2 bit -- next property/variable in the object/lexical environment (compressed pointer) + * type - 2 bit + * next property/variable in the object/lexical environment (compressed pointer) The remaining parts are differnt: ![Object property/lexcial environment variable]({{ site.baseurl }}/img/ecma_object_property.jpg){: class="thumbnail center-block img-responsive" } @@ -556,51 +288,49 @@ The remaining parts are differnt: ECMA runtime utilizes collections for intermediate calculations. Collection consists of a header and a number of linked chunks, which hold collection values. 
Header occupies 8 bytes and consists of: - -- compressed pointer to the next chunk -- number of elements -- rest space, aligned down to byte, is for the first chunk of data in collection + * compressed pointer to the next chunk + * number of elements + * rest space, aligned down to byte, is for the first chunk of data in collection Chunk's layout is following: - -- compressed pointer to the next chunk -- rest space, aligned down to byte, is for data stored in corresponding part of the collection + * compressed pointer to the next chunk + * rest space, aligned down to byte, is for data stored in corresponding part of the collection ### Internal properties: -- [[Class]] - class of the object (ECMA-defined) -- [[Prototipe]] - is stored in object description -- [[Extensible]] - is stored in object description -- [[CScope]] - lexical environment (function's variable space) -- [[ParametersMap]] - arguments object -0 code of the function -- [[Code]] - where to find bytecode of the function -- native code - where to find code of native unction -- native handle - some uintptr_t assosiated with the objec -- [[FormalParameters]] - collection of pointers to ecma_string_t (the list of formal parameters of the function) -- [[PrimitiveValue]] for String - for String object -- [[PrimitiveValue]] for Number - for Number object -- [[PrimitiveValue]] for Boolean - for Boolean object -- built-in related: - - built-in id - id of built-in object - - built-in routine id - id of built-in routine - - "non-instantiated" mask - what built-in properties where notinstantiated yet (lazy instantiation) - - extention object identifier +* [[Class]] - class of the object (ECMA-defined) +* [[Prototipe]] - is stored in object description +* [[Extensible]] - is stored in object description +* [[CScope]] - lexical environment (function's variable space) +* [[ParametersMap]] - arguments object -0 code of the function +* [[Code]] - where to find bytecode of the function +* native code - where to find 
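code of native unction -- native handle - some uintptr_t assosiated with the objec -- [[FormalParameters]] - collection of pointers to ecma_string_t (the list of formal parameters of the function) -- [[PrimitiveValue]] for String - for String object -- [[PrimitiveValue]] for Number - for Number object -- [[PrimitiveValue]] for Boolean - for Boolean object -- built-in related: - - built-in id - id of built-in object - - built-in routine id - id of built-in routine - - "non-instantiated" mask - what built-in properties where notinstantiated yet (lazy instantiation) - - extention object identifier +* [[Class]] - class of the object (ECMA-defined) +* [[Prototype]] - is stored in object description +* [[Extensible]] - is stored in object description +* [[CScope]] - lexical environment (function's variable space) +* [[ParametersMap]] - arguments object -0 code of the function +* [[Code]] - where to find bytecode of the function +* native code - where to find code of native function +* native handle - some uintptr_t associated with the object +* [[FormalParameters]] - collection of pointers to ecma_string_t (the list of formal parameters of the function) +* [[PrimitiveValue]] for String - for String object +* [[PrimitiveValue]] for Number - for Number object +* [[PrimitiveValue]] for Boolean - for Boolean object +* built-in related: + * built-in id - id of built-in object + * built-in routine id - id of built-in routine + * "non-instantiated" mask - what built-in properties were not instantiated yet (lazy instantiation) + * extension object identifier ### LCache LCache is a cache for property variable search requests. -![LCache]({{ site.baseurl }}/img/ecma_lcache.jpg){: class="thumbnail center-block img-responsive"} +![LCache]({{ site.baseurl }}/img/ecma_lcache.png){: class="thumbnail center-block img-responsive"} -Entry of LCache has the following layout: -- object pointer -- property name (pointer to string) -- property pointer +The entries of LCache have the following layout: + * object (pointer to object) + * property name (pointer to string) + * property (pointer to property) -Caches's row is defined by string's hash. When a property access occurs, all row's entries are searched by comparing object pointer and property name according entry's fields, full comparison is used for property name. +Each row holds several entries with this layout. The rows of LCache are indexed by the hash of the property name. When a property access occurs, all entries of the row are searched by comparing the object pointer and the property name against the entry's fields; a full comparison is used for the property name. If corresponding entry was found, its property pointer is returned (may be NULL - in case when there is no property with specified name in given object). -Otherwise, object's property set is iterated fully and corresponding record is registered in LCache (with property pointer if it was found or NULL otherwise).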
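The LCache lookup described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: all `demo_*` / `DEMO_*` names, the row count, the row length, the hash function and the replacement policy are hypothetical and not taken from the engine; plain C strings stand in for the engine's string type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_LCACHE_ROWS    8 /* hypothetical number of rows */
#define DEMO_LCACHE_ROW_LEN 2 /* hypothetical number of entries per row */

typedef struct
{
  const void *object;   /* pointer to object */
  const char *name;     /* property name (pointer to string) */
  const void *property; /* pointer to property; NULL caches "no such property" */
  int in_use;           /* demo-only flag marking an occupied entry */
} demo_lcache_entry_t;

static demo_lcache_entry_t demo_lcache[DEMO_LCACHE_ROWS][DEMO_LCACHE_ROW_LEN];

/* Stand-in hash: the row is selected by the property name only. */
static unsigned demo_name_hash(const char *name)
{
  assert(name != NULL);
  unsigned hash = 0;
  while (*name != '\0')
  {
    hash += (unsigned char) *name++;
  }
  return hash % DEMO_LCACHE_ROWS;
}

/* Returns 1 on a cache hit and stores the cached property pointer
 * (which may be NULL) in *property_out; returns 0 on a miss. */
int demo_lcache_lookup(const void *object, const char *name, const void **property_out)
{
  demo_lcache_entry_t *row = demo_lcache[demo_name_hash(name)];
  for (size_t i = 0; i < DEMO_LCACHE_ROW_LEN; i++)
  {
    /* Object pointers are compared directly, names by full comparison. */
    if (row[i].in_use && row[i].object == object && strcmp(row[i].name, name) == 0)
    {
      *property_out = row[i].property;
      return 1;
    }
  }
  return 0;
}

/* Registers the result of a full property search, including a NULL result. */
void demo_lcache_insert(const void *object, const char *name, const void *property)
{
  demo_lcache_entry_t *row = demo_lcache[demo_name_hash(name)];
  size_t slot = 0; /* assumption: overwrite the first entry when the row is full */
  for (size_t i = 0; i < DEMO_LCACHE_ROW_LEN; i++)
  {
    if (!row[i].in_use)
    {
      slot = i;
      break;
    }
  }
  row[slot] = (demo_lcache_entry_t) { object, name, property, 1 };
}
```

On a miss the caller would iterate the object's property list and then call `demo_lcache_insert` with whatever it found, so that a later lookup of a missing property is also answered from the cache.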
+Otherwise, the property set of the considered object is iterated over and the corresponding record is registered in LCache (with property pointer if it was found or NULL otherwise). ## Runtime @@ -619,12 +349,12 @@ Many algorithms/routines described in ECMA return a value of "completion" type, ![ECMA completion]({{ site.baseurl }}/img/ecma_completion.jpg){: class="thumbnail center-block img-responsive" } Jerry introduces two additional completion types: -- exit - produced by `exitval` opcode, indicates request to finish execution -- meta - produced by meta instruction, used to catch meta opcodes in interpreter loop without explicit comparison on every iteration (for example: meta 'end_with') + * exit - produced by `exitval` opcode, indicates request to finish execution + * meta - produced by meta instruction, used to catch meta opcodes in interpreter loop without explicit comparison on every iteration (for example: meta 'end_with') ### Value management and ownership -Every value stored by engine is associated with virtual "ownership" (that is responsibility to manage the value and free it when is is not needed, or pass ownership of the value to somewhere) +Every value stored by the engine is associated with a virtual "ownership" (that is, the responsibility to manage the value and either free it when it is no longer needed or pass ownership of the value elsewhere)
@@ -660,11 +390,11 @@ In this case execution is continued after corresponding FINALIZE mark, and compl ## Exception handling Operations that could produce exceptions should be performed in one of the following ways: -- wrapped into ECMA_TRY_CATCH block: - `ECMA_TRY_CATCH (value_returned_from_op, op (... ),` - `ret_value_of_the_whole_routine_handler)`` - `...` - `ECMA_FINALIZE(value_returned_from_op);` - `return ret_value;` -- `ret_value = op(...);` -- manual handling (for special cases like interpretation of opfunc_try_block). + * wrapped into ECMA_TRY_CATCH block: + * `ECMA_TRY_CATCH (value_returned_from_op, op (... ),` + * `ret_value_of_the_whole_routine_handler)` + * `...` + * `ECMA_FINALIZE(value_returned_from_op);` + * `return ret_value;` + * `ret_value = op(...);` + * manual handling (for special cases like interpretation of opfunc_try_block). diff --git a/_includes/head.html b/_includes/head.html index d83e45a4f..604be28c2 100644 --- a/_includes/head.html +++ b/_includes/head.html @@ -14,6 +14,7 @@ +