diff --git a/04.internals.md b/04.internals.md index efb968084..2ea6a31eb 100644 --- a/04.internals.md +++ b/04.internals.md @@ -8,545 +8,277 @@ permalink: /internals/ {:toc} # High-Level Design -{: class="thumbnail center-block img-responsive" } +{: class="thumbnail center-block img-responsive" } -On the diagram above is shown interaction of major components of software system: Parser and Runtime. Parser performs translation of input ECMAScript application into the byte-code with the specified format (refer to [Bytecode](#byte-code) and [Parser](#parser) page for details). Prepared bytecode is executed by Runtime engine that performs interpretation (refer to [Virtual Machine](#virtual-machine) and [ECMA](#ecma) pages for details). +The diagram above shows the interaction of the major components of JerryScript: the Parser and the Virtual Machine (VM). The Parser translates the input ECMAScript application into byte-code in the specified format (refer to the [Bytecode](#byte-code) and [Parser](#parser) pages for details). The prepared byte-code is executed by the Virtual Machine, which interprets it (refer to the [Virtual Machine](#virtual-machine) and [ECMA](#ecma) pages for details). # Parser -The parser is implemented as recursive descent parser. The parser does not build any type of Abstract Syntax Tree. It converts the source JavaScript code directly into the byte-code. +The parser is implemented as a recursive descent parser. The parser converts the JavaScript source code directly into byte-code without building an Abstract Syntax Tree. The parser depends on the following subcomponents. -The parser consists of three major parts: -- lexer -- parser -- opcodes dumper -- syntax errors checker -- serializer +## Lexer -These four (except the parser itself) components are initialized during `parser_init` call (jerry-core/parser/js/parser.cpp). +The lexer splits the input string (an ECMAScript program) into a sequence of tokens. 
It can scan the input string not only forward, but can also move to an arbitrary position. The token structure is described by `lexer_token_t` in `./jerry-core/parser/js/js-lexer.h`. -This initializer requires two following subsystems to be initialized: memory allocator and serializer. The need for allocator is clear. The serializer resets internal bytecode_data structure(jerry-core/parser/js/bytecode-data.h). Currently bytecode_data is singleton. During parsing it is filled by the data which is needed for the further execution: +## Scanner -* Byte-code - array of opcodes (`bytecode_data.opcodes`). -* Literals - array of literals (`bytecode_data.literals`). -* Strings buffer (`bytecode_data.strings_buffer`) - literals of type `LIT_STR` contain pointers to strings, which are located in this buffer. +The scanner (`./jerry-core/parser/js/js-parser-scanner.h`) pre-scans the input string to find certain tokens. For example, the scanner determines whether the keyword `for` introduces a general `for` loop or a `for-in` loop. Reading tokens in a simple while loop is not enough because a slash (`/`) can indicate either the start of a regular expression or a division operator. -The following is brief review of the mentioned components. See more concise description in the following chapters. +## Expression Parser -* Lexer -The lexer splits input file (given as the first parameter of the parser_init call) into sequence of tokens. These tokens are then matched on demand. -* Opcodes dumper -This component does necessary checks and preparations, and dumps opcodes using serializer. -* Serializer -The serializer puts opcodes, prepared by the dumper, to a continuous array that represents current scope's code. Also it provides API for accessing byte-code. -* Syntax error checker -This is bunch of simple die-on-error checks. +The expression parser is responsible for parsing JavaScript expressions. It is implemented in `./jerry-core/parser/js/js-parser-expr.c`. 
-After initialization `parser_parse_program` (`./jerry-core/js/parser.cpp`) should be called. This function performs the following steps (so-called parsing steps) for all scopes (global code and functions): +## Statement Parser -1. Initialize a scope. -2. Do pre-parser stage. -3. Parse the scope code. +JavaScript statements are parsed by this component. It uses the [Expression parser](#expression-parser) to parse the constituent expressions. The implementation of the statement parser is located in `./jerry-core/parser/js/js-parser-statm.c`. -After every scope is processed, parser merges all scopes into the single byte-code array. +The function `parser_parse_source` parses and compiles the input ECMAScript source code. When a function appears in the source, `parser_parse_source` calls `parser_parse_function`, which processes the source code of functions recursively, including argument parsing and context handling. After parsing, the function `parser_post_processing` dumps the created opcodes and returns an `ecma_compiled_code_t*` that points to the compiled byte-code sequence. -Two new entities were introduced - scopes and pre-parser. +The interactions between the major components are shown in the following figure. -* There are two types of scopes in the parser: global scope and function declaration scope. Notice that function expressions do not create a new scope in terms of the parser. The reason why is described below. Parsing process starts on global scope. If a function declaration occurs string the process, new scope is created, this new scope is pushed to a stack of current scopes; then steps 1-3 of parsing are performed. Note, that only global scope parsing shall merge all scopes into a byte-code. All scopes are stored in a tree to represent a hierarchy of them. -* Pre-parser. This step performs hoisting of variable declarations. First, it dumps `reg_var_decl` opcodes. 
Then it goes through the script and looks for variable declaration lists. For every found variable in the scope (not in a sub-scope or function expression) it dumps var_decl opcode. After this step byte-code in the scope starts with optional `'use strict'` marker, then `reg_var_decl` and several (optional) `var_decls`. +{: class="thumbnail center-block img-responsive" } -Due to some limitations of the parser, some parsing functions take `this_arg` and/or `prop` as parameters. They are further used to dump `prop_setter` opcode. During parsing all necessary data is stored in either stacks or scope trees. After parsing of the whole program, the parser merges all scopes into a single byte-code, hoisting function declarations in process. This task, so-called post-parser, is performed by `scopes_tree_raw_data` (jerry-core/js/scopes-tree.c) function. For the further information about post-parser, check opcodes dumper section. +# Byte-code -### Lexer +This section describes the compact byte-code (CBC) representation. The key focus is reducing the memory consumption of the byte-code representation without sacrificing considerable performance. Other byte-code representations often focus on performance only, so inventing this representation is original research. -The lexer splits input string into the set of tokens. The token structure (`./jerry-core/parser/js/lexer.h`) consists of three elements: token type, location of the token and optional data: +CBC is a CISC-like instruction set which assigns shorter instructions to frequent operations. Many instructions represent multiple atomic tasks, which reduces the byte-code size. This technique is basically a data compression method. -{% highlight cpp %} -typedef struct -{ - locus loc; - token_type type; - literal_index_t uid; -} -token; -{% endhighlight %} +## Compiled code format -Location of token (`locus`). It is just an index of the first token's character at a string that represents the program. 
Token types are listed in lexer.h header file (`token_type` enum). Depending on token type, token specific data (`uid` field) has the different meaning. +The memory layout of the compiled byte-code is the following. + +{: class="thumbnail center-block img-responsive" } + +The header is a `cbc_compiled_code` structure with several fields. These fields contain the key properties of the compiled code. + +The literals part is an array of ecma values. These values can contain any ECMAScript value type, e.g. strings, numbers, function and regexp templates. The number of literals is stored in the `literal_end` field of the header. + +The CBC instruction list is a sequence of byte-code instructions which represents the compiled code. + +## Byte-code Format + +The memory layout of a byte-code is the following: + +{: class="thumbnail center-block img-responsive" } + +Each byte-code starts with an opcode. The opcode is one byte long for frequent instructions and two bytes long for rare instructions. The first byte of a rare instruction is always zero (`CBC_EXT_OPCODE`), and the second byte represents the extended opcode. The names of common and rare instructions start with the `CBC_` and `CBC_EXT_` prefixes, respectively. + +The maximum number of opcodes is 511, since 255 common (zero value excluded) and 256 rare instructions can be defined. Currently around 230 frequent and 120 rare instructions are available. + +There are three types of byte-code arguments in CBC: + + * __byte argument__: A value between 0 and 255, which often represents the argument count of call-like opcodes (function call, new, eval, etc.). + + * __literal argument__: An integer index which is greater than or equal to zero and less than the `literal_end` field of the header. For further information see the next section, Literals. + + * __relative branch__: A 1-3 byte long offset. The branch argument might also represent the end of an instruction range. 
For example, the branch argument of `CBC_EXT_WITH_CREATE_CONTEXT` marks the end of a `with` statement, or more precisely the position after its last instruction. + +Argument combinations are limited to the following seven forms: + +* no arguments +* a literal argument +* a byte argument +* a branch argument +* a byte and a literal argument +* two literal arguments +* three literal arguments + +## Literals + +Literals are organized into groups which represent various literal types. Having these groups consumes less space than assigning flag bits to each literal. +(In the following, the mentioned ranges represent those indices which are greater than or equal to the left side and less than the right side of the range. For example, a range between the `ident_end` and `literal_end` fields of the byte-code header contains those indices which are greater than or equal to `ident_end` +and less than `literal_end`. If `ident_end` equals `literal_end`, the range is empty.) + +The two major groups of literals are _identifiers_ and _values_. + + * __identifier__: A named reference to a variable. Literals between zero and `ident_end` of the header belong here. All of these literals must be a string or undefined. Undefined can only be used for those literals which cannot be accessed by a literal name. For example, `function (arg,arg)` has two arguments, but the `arg` identifier only refers to the second argument. In such cases the name of the first argument is undefined. Furthermore, optimizations such as *CSE* may also introduce literals without a name. + + * __value__: A reference to an immediate value. Literals between `ident_end` and `const_literal_end` are constant values such as numbers or strings. These literals can be used directly by the Virtual Machine. Literals between `const_literal_end` and `literal_end` are template literals. A new object needs to be constructed each time their value is accessed. These literals are functions and regular expressions. 
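The half-open ranges described above can be illustrated with a short sketch. The field names `ident_end`, `const_literal_end` and `literal_end` come from the text; the structure, enum and function names are hypothetical and do not reflect the actual layout of the JerryScript header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the byte-code header: only the three range
 * boundaries named in the text are modelled here. */
typedef struct
{
  uint16_t ident_end;         /* identifiers:       [0, ident_end) */
  uint16_t const_literal_end; /* constant values:   [ident_end, const_literal_end) */
  uint16_t literal_end;       /* template literals: [const_literal_end, literal_end) */
} example_literal_ranges_t;

/* Hypothetical names for the literal groups described in the text. */
typedef enum
{
  EXAMPLE_LITERAL_IDENTIFIER,
  EXAMPLE_LITERAL_CONSTANT,
  EXAMPLE_LITERAL_TEMPLATE,
  EXAMPLE_LITERAL_OUT_OF_RANGE
} example_literal_group_t;

/* Map a literal index to its group by comparing it with the range ends.
 * An empty range (equal boundaries) never matches any index. */
example_literal_group_t
example_classify_literal (const example_literal_ranges_t *ranges_p,
                          uint16_t literal_index)
{
  /* The boundaries are assumed to be ordered. */
  assert (ranges_p->ident_end <= ranges_p->const_literal_end);
  assert (ranges_p->const_literal_end <= ranges_p->literal_end);

  if (literal_index < ranges_p->ident_end)
  {
    return EXAMPLE_LITERAL_IDENTIFIER;
  }
  if (literal_index < ranges_p->const_literal_end)
  {
    return EXAMPLE_LITERAL_CONSTANT;
  }
  if (literal_index < ranges_p->literal_end)
  {
    return EXAMPLE_LITERAL_TEMPLATE;
  }
  return EXAMPLE_LITERAL_OUT_OF_RANGE;
}
```

With boundaries of 2, 5 and 7, for instance, indices 0-1 would be identifiers, 2-4 constants and 5-6 templates. No per-literal flag bits are needed because the group follows from the index alone.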
+ +There are two other sub-groups of identifiers. *Registers* are those identifiers which are stored in the function call stack. *Arguments* are those registers which are passed by a caller function. + +There are two types of literal encoding in CBC. Both are variable length, with a length of one or two bytes. + + * __small__: maximum 511 literals can be encoded. + +One byte encoding for literals 0 - 254. + +```c +byte[0] = literal_index +``` + +Two byte encoding for literals 255 - 510. + +```c +byte[0] = 0xff +byte[1] = literal_index - 0xff +``` + + * __full__: maximum 32767 literals can be encoded. + +One byte encoding for literals 0 - 127. + +```c +byte[0] = literal_index +``` + +Two byte encoding for literals 128 - 32767. + +```c +byte[0] = (literal_index >> 8) | 0x80 +byte[1] = (literal_index & 0xff) +``` + +Since most functions require fewer than 255 literals, small encoding provides a single-byte literal index for all of their literals. Small encoding consumes less space than full encoding, but it has a limited range. + +## Byte-code Categories + +Byte-codes can be placed into four main categories. + +### Push Byte-codes + +Byte-codes of this category serve for placing objects onto the stack. As there are many instructions representing multiple atomic tasks in CBC, there are also many instructions for pushing objects onto the stack according to the number and the type of the arguments. The following table lists a few of these opcodes with a brief description.