<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A fast in-place interpreter for WebAssembly</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>10/31/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10403756</idno>
					<idno type="doi">10.1145/3563311</idno>
					<title level='j'>Proceedings of the ACM on Programming Languages</title>
<idno>2475-1421</idno>
<biblScope unit="volume">6</biblScope>
<biblScope unit="issue">OOPSLA2</biblScope>					

					<author>Ben L. Titzer</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[WebAssembly (Wasm) is a compact, well-specified bytecode format that offers a portable compilation target with near-native execution speed. The bytecode format was specifically designed to be fast to parse, validate, and compile, positioning itself as a portable alternative to native code. It was pointedly not designed to be interpreted directly. Instead, design considerations at the time focused on competing with native code, utilizing optimizing compilers as the primary execution tier. Yet, in JIT scenarios, compilation time and memory consumption critically impact application startup, leading many Wasm engines to later deploy faster single-pass (baseline) compilers. Though faster, baseline compilers still take time and waste code space for infrequently executed code. A typical interpreter being infeasible, some engines resort to compiling Wasm not to machine code, but to a more compact, but easy to interpret format. This still takes time and wastes memory. Instead, we introduce in this article a fast in-place interpreter for WebAssembly, where no rewrite and no separate format is necessary. Our evaluation shows that in-place interpretation of Wasm code is space-efficient and fast, achieving performance on-par with interpreting a custom-designed internal format. This fills a hole in the execution tier space for Wasm, allowing for even faster startup and lower memory footprint than previous engine configurations.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Emerging first for the Web in 2017 <ref type="bibr">[Haas et al. 2017]</ref>, WebAssembly is a portable, low-level compilation target supported in all major browsers. Originally designed as a successor to asm.js <ref type="bibr">[Zakai 2013</ref>], which allowed C/C++ to be compiled to JavaScript, Wasm has supplanted other technologies such as Native Client <ref type="bibr">[Yee et al. 2010]</ref> as the new best target for native compilation to the Web. Since that time, WebAssembly has seen rapid uptake in a number of new spaces, including cloud computing <ref type="bibr">[Nurul-Hoque and Harras 2021]</ref>, digital contracts, edge computing <ref type="bibr">[Nieke et al. 2021</ref><ref type="bibr">][Fas 2020]</ref>, IOT <ref type="bibr">[Li et al. 2021]</ref>, and embedded systems [Gurdeep Singh and <ref type="bibr">Scholliers 2019]</ref>.</p><p>A key design criteria for Wasm was offering performance competitive with native code. Initially, top-tier performance was considered paramount, and approaching native code performance to compete with technologies like Native Client was realized by reusing the optimizing JIT compiler infrastructure in browsers. Yet Wasm bytecode was also designed to be fast to parse, validate, and compile&#347;criteria validated during the design process by building single-pass validators and single-pass decoding to SSA compiler IR to minimize upfront costs in the compilation pipeline.</p><p>However, despite minimizing bytecode parsing work by careful design, optimizing compilers inescapably take considerable time and memory to produce good native code, penalizing application startup in JIT scenarios. To address startup time problems, browsers prototyped separate, faster compilers during Wasm's design phase, validating that the same choices that enabled single-pass validation enabled single-pass compilation. Such often-termed &#322;baseline&#382; compilers spend far less compilation time, often 10&#215;-20&#215; less, but produce code that typically runs 1.5&#215; to 3&#215; slower than an optimizing compiler. This represents a classic tradeoff space known to VMs for decades; more compilation time means better code quality. Today, all browser engines employ multiple Wasm compiler tiers to strive for both good startup time and high throughput.</p><p>1.1 Whither the interpreter? Seemingly overlooked in this development arc is the obvious choice of using an interpreter to execute bytecode. After all, traditionally, virtual machines are developed with an interpreter first. There are a lot of advantages to interpreters.</p><p>(1) Since interpreters are easier to write, understand, and maintain, they allow more rapid experimentation in bytecode design.</p><p>(2) Since they need no translation or rewriting step, startup is fast.</p><p>(3) Since bytecode is usually more compact than machine code, interpeters generally use less memory than compilers. (4) Since the interpreter loop can be stopped at any instruction and program state inspected, altered, and resumed, debugging application code is easier. (5) Since there is a fixed amount of code, interpreters are easier to audit and usually have fewer security vulnerabilities. (6) Since dynamic code generation is sometimes impossible, either because is not allowed on the platform, like iOS, or code space is limited, an interpreter might be the only choice. For all these reasons, nearly all virtual machines, from pioneering work on Lisp to Smalltalk to Self, to today's broadly-accepted VMs such as the Java Virtual Machine, the CLR, Python, Ruby, and JavaScript, have an interpreter.</p><p>Why then, is Wasm any different? The answer is simply that efficient interpretation was explicitly not in the design criteria<ref type="foot">foot_0</ref> . But some Wasm engines do indeed employ interpreters, such as JavaScriptCore, Chakra<ref type="foot">foot_1</ref> , and Wasm3. These engines use interpreters for exactly the advantages listed above. Yet none of these interpreters work directly on the original bytecode; all of them rewrite Wasm bytecode to a different internal format. Rewriting Wasm bytecode is effectively a compilation step that takes time and memory, impacting startup and memory consumption contrary to the primary advantages of interpreters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">The final tier is shed</head><p>So far, there is an important point missing in the Wasm virtual machine design tradeoff space. An in-place interpreter operating on the original bytes of a binary module occupies a uniquely advantageous point in the tradeoff space. For cold or never executed code, where the interpreter's much slower execution speed is outweighed by avoiding translation time, it should offer the best startup time and lowest memory consumption. Employed in concert with compilation tiers for hot code, such an interpreter would serve the role it does in other mature systems like JVMs. Not to be overlooked is debugging; not needing a location mapping between the original code and the internal code is a major simplification. To date, we have found only two examples of in-place interpreters for Wasm: the testing interpreter in V8 and the &#322;classic&#382; interpreter in the WebAssembly Micro-Runtime[ <ref type="bibr">Wam 2021]</ref>, referred to here as wamr-classic.</p><p>The testing interpreter in V8, implemented in C++, was originally intended to support debugging only. Never employed as a production tier, its performance is nearly 1000&#215; slower than the other tiers, and it is now relegated to testing<ref type="foot">foot_2</ref> . That leaves only wamr-classic as the only production in-place interpreter. As we detail in 4.5.4, it employs a dynamic control stack with caching that can suffer pathological slowdowns as programs grow large. Aware of this vulnerability, the developers have deprecated this design in favor of a rewriting interpreter and a JIT compiler.</p><p>To date, the only high-performance interpreter designs for Wasm have employed rewriting or suffered from pathological cases. Is it possible to interpret Wasm in-place efficiently? In this work, we solve this open problem: the first fast in-place interpreter for Wasm without pathological behavior. As empirical measurements in Figure <ref type="figure">1&amp;2</ref> show, our interpreter design occupies a new region in the tradeoff space of execution tiers for Wasm.</p><p>We first identify the interpreter-crucial information that is missing in Wasm's bytecode design. This information, namely key control-flow and value stack information, is not actually missing, but rather implicit. Our insight is that the validation algorithm for Wasm bytecode already computes this information in its modeling the control and value stack during typechecking. All that remains is to distill a few key numbers into a compact side-table that is used during interpretation. The side-table is organized so that all accesses occur in constant (&#119874; (1)) time, and no searches of the table are necessary. Thus the interpreter always has relevant information directly at hand and behaves like a standard interpreter. Other details are important for making a fast interpreter as well, such as hand-coding key parts in assembly and combining exactly the right layout of value stack and virtual memory protections to robustly handle application stack overflow without needing any explicit checks.</p><p>With these new techniques, we have finally achieved an in-place interpreter for Wasm with execution time on-par with state-of-the-art rewriting interpreters. This paper completes the triad of basic tier designs (interpreter, fast compiler, optimizing compiler) for Wasm. In a comical twist of fate, Wasm's tiers have arrived exactly backward!</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Organization</head><p>The remainder of this paper is organized as follows. Section 2 recaps the design of Wasm's functions, stack machine, and control flow constructs, which are key to understanding why interpretation has been challenging until now. Section 3.1 shows the design of the side table used for the interpreter and how the validation algorithm already contains the key information necessary to emit the side table in a single pass. Section 3 details the interpreter implementation, including key assembly techniques to achieve the best-performing dispatch loop, and the design of the data structures necessary to make a fast operand and execution stack. Section 4 evaluates the interpreter on standard benchmarks and compares translation time, memory consumption, and execution time to JITs (4.5.1) and other interpreters (4.5.2) for Wasm. Prior work related to optimizing interpreters is summarized in Section 5, followed by the conclusion.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">WASM DESIGN</head><p>Wasm provides a low-level execution model consistent with its original goal of a minimal, highperformance abstraction over hardware. Its principal features include: The Wasm binary format is designed to be compact yet fast to decode and validate in a single pass. This includes not just bytecode, but all constructs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Modules and instances</head><p>Wasm code is organized into modules which are in turn organized into a list of sections. Sections in a module declare functions, memories, tables, global variables and static data. Bytecode is grouped into functions with statically-typed parameters, results, and local variables. All operations in core Wasm manipulate only a module's own internal state. Modules must import functions (and memories, tables, etc) in order to access platform capabilities or state outside the module. Imports may be provided by the &#322;host&#382; environment, such as JavaScript and the Web, or from other modules.</p><p>A module is akin to an executable file, or part of one, rather than an executing program. To run, a module must be instantiated, supplying bindings for its imports. At instantiation time, a Wasm engine creates the state (tables, globals, and memories) declared by the module, with the result being called an instance. An instance may export its own functions, memories, tables, etc. to other modules or the host environment.</p><p>The primary dynamic storage of a Wasm program is typically one large, bounds-checked, byteaddressable memory, but global variables and tables of opaque host references can also be used. Future proposals will add first-class function references and garbage-collected objects to Wasm. These too are forms of local state and must be shared explicitly with other instances.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Bytecode design</head><p>Functions. All code 4 in Wasm is organized into functions. Functions each have a signature with a fixed number of parameter and result types, such as</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Execution of a Wasm program entails executing functions that may call each other, maintaining an</head><p>A Fast In-Place Interpreter for WebAssembly 148:5 execution stack<ref type="foot">foot_4</ref> that stores their local variables and operands, and running their internal code. In a binary module, the body of a Wasm function begins by declaring the number and type of their additional local variables, followed immediately by the bytecode.</p><p>Stack machine. As is common for many bytecode designs, Wasm is a stack machine, meaning individual bytecode instructions take their operand values from an operand stack and push their results back. Local variables are separate. To be used, a local must be explicitly loaded onto or stored back from the operand stack. Implementations typically store them internally as a prefix of the operand stack, together referred to here and throughout as the value stack. The arguments of an outgoing function call become the first locals of the callee function.</p><p>Structured control flow. Unlike most bytecode designs, however, Wasm has structured control flow constructs such as blocks, ifs, loops, and switches that are encoded inline in the bytecode. We refer to them as structured, since they must be properly nested. This was a deliberate choice for compactness and to ensure that bytecode validation can be done in a single pass with minimal data structures<ref type="foot">foot_5</ref> . In contrast, a typical bytecode design with jumps usually requires more bytes to store and two passes to verify.</p><p>Direct interpretation not straightforward. Most bytecode formats can be interpreted directly in their binary form (i.e. in-place), with an instruction pointer stepping through the bytes of the original code. Jumps typically have an offset or instruction number of the target instruction directly in the bytes, allowing a constant-time adjustment of the instruction pointer. But Wasm is unusual in that a branch instruction specifies a target construct by relative nesting depth, transferring control to the beginning (in case of loop), middle (in the case of if) or end (for block, if, else) of the construct. Wasm is also unusual in that branches can also copy and pop values off the operand stack<ref type="foot">foot_6</ref> . Thus a direct Wasm interpreter faces two unusual problems when executing a branch:</p><p>&#8226; can't quickly find the target bytecode offset, e.g. start of a loop or end of a block &#8226; can't determine how many values to pop off the operand stack To understand how this paper efficiently solves this open problem, we must first journey deep into how Wasm bytecode validation is done in production-quality engines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Bytecode validation</head><p>Wasm, though low-level, is typed. A number of static checks ensure that a module is well-formed. Within a function, all instructions, including arithmetic, calls, control flow, etc. must be applied to the correct number and type of operands. All control flow constructs must be properly nested with no invalid branches. In the specification, this validation is expressed in a standard type system formalism. In engines, the algorithm is implemented as an abstract interpreter that models an abstract control and value stack. We describe such an implementation here to later make it clear how our modifications affect an already very efficient validation implementation.</p><p>Figure <ref type="figure">3</ref> illustrates the operation of a production quality Wasm code validator. Three primary data structures are used.</p><p>&#8226; The module environment models the types of functions, tables, globals, and number of memories of the enclosing module. The module environment is not mutated during validation of a function's bytecode.  &#8226; An abstract control stack models the nested control-flow constructs, keep tracking of where each starts and its expected parameter and result types<ref type="foot">foot_7</ref> . This is used to properly match block, if, and loop starts with their else and/or end. &#8226; An abstract value stack tracks abstract values for stack slots and local variables. In current Wasm, abstract values are simply value types, i.e. i32, i64, externref, etc. If a future proposal introduces flow-sensitive validation, the abstract values for locals would need to be extended to include initialization information.</p><p>The validation algorithm proceeds in a single forward pass over the bytecode, never needing to backtrack. For simple instructions like arithmetic or calls which only pop their operands from the value stack and push results back, the algorithm pops and checks required input types and pushes resulting output types. Control-flow instructions are validated using the control stack. For example, a br (branch) instruction references the target block or loop by relative nesting depth; the validator matches the opening construct by indexing into the control stack. Since, in Wasm block, if, and loop can have parameter and result types, the validator must check that the value stack contains the appropriate types expected for the target construct.</p><p>Importantly, an end bytecode closes a control-flow construct, either a block, if, loop, or the whole function. At end, all branches that can legally target the construct have been seen. Thus the implicit target bytecode offset of all branches to this construct are known, because that offset must be either the start of a loop or, for any other construct, the end, i.e. here. This information just needs to be saved somewhere easily accessible for the interpreter. After processing end, the algorithm pops the control stack entry and can reuse its storage space, which is optimally efficient.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">INTERPRETER DESIGN</head><p>In this section, we present our fast in-place Wasm interpreter design.</p><p>The key enabling techniques are:</p><p>&#8226; an innovative side-table design which allows efficient access to missing branch information A Fast In-Place Interpreter for WebAssembly 148:7</p><p>&#8226; a highly efficient value stack organization for &#119874; (1) local variable and operand access as well as zero-copy function calls</p><p>Additionally, we chose to implement the core logic of the interpreter in hand-written assembly language<ref type="foot">foot_8</ref> , which allows for near-perfect register allocation and unlocks all possible dispatch and organization techniques. We discuss the rationale for hand-written assembly at the end of this section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Sidetable design</head><p>As we've seen, Wasm control-flow bytecodes represent nested control constructs, rather than low-level jumps. However, an interpreter needs the bytecode offset of where to go if a branch is taken, ideally in &#119874; (1) time. This includes not only explicit branches like br, br_if, br_table, but also the implicit branch in else. Note that block, loop, end and return are not branching bytecodes, since they either fall through or exit the function.</p><p>To supply the in-place interpreter with the missing information, the validation algorithm saves a portion of its work into a per-function side-table data structure separate from the original bytecode. This side-table is a compact, highly-efficient mapping from branch origin to target offset, plus some additional stack manipulation information. To make the data structure time-and space-efficient, it consists of entries sorted by branch origin and omits non-branch instructions. It is emitted as a side-effect of the single-pass validation algorithm above. Because the validation algorithm already visits bytecodes in forward order, it simply emits branch entries as it goes, obviating a separate sorting step. Since only branches need entries, the sidetable is very small, often empty. Empirically, most Wasm functions are small with no control flow, so they have no side-table at all.</p><p>Every entry in the side-table is a 4-tuple of the form &#10216;&#916;ip, &#916;stp, vals, pop &#10217;, where:</p><p>&#8226; &#916;ip the amount to adjust the instruction pointer by if the branch is taken &#8226; &#916;stp the amount to adjust the side-table pointer by if the branch is taken</p><p>&#8226; vals the number of values that will be copied if the branch is taken</p><p>&#8226; pop the number of values that will be popped if the branch is taken 3.1.1 The sidetable pointer. Like most interpreters, our in-place interpreter maintains an instruction pointer (IP) into the bytecode during execution. It also maintains an end instruction pointer (EIP), which is used to check if the program falls off the end of the function, which is a legal implicit return in Wasm. To use the side-table, the interpreter simply maintains another pointer, the side-table pointer (STP), consulted when executing branches. Figure <ref type="figure">4</ref> illustrates the interpreter state for the bytecode and sidetable, and Listing 1 illustrates how side-table entries are used by the interpreter during execution. The instruction pointer (IP) is initialized to point directly into the bytecode and the sidetable pointer (STP) points at the first side-table entry. Branch instructions make use of the doControlTransferFromSTP() subroutine which adjusts both the IP and STP based on the entry to which it points, as well as adjusting the value stack. Notice that a conditional branch that is not taken still must update the STP so that it points at the next entry for the next branch, if any, in the code. Note that br_table works exactly as intended, as a jump table, computing an index into the side-table and using the corresponding entry in &#119874; (1) time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Value stack design</head><p>Since Wasm bytecode constitutes a stack machine, nearly all instructions access the value stack, making it crucial for interpretation speed. A single indirection to local variables and the top of the operand stack is ideal. Wasm's numbering scheme for locals is inspired by JVM bytecode. Parameter #0 is local #0, followed by declared locals, then the operand stack. Outgoing arguments of a call become the first local variables of the callee function's value stack. The JVM chose this design so interpreters could overlap the value stack of the callee frame with the operand stack of the caller, avoiding copying any argument values. This never panned out for Wasm until now. Our design succeeds in using the JVM trick to avoid copying arguments, but requires separating the value stack from the execution stack.</p><p>Figure <ref type="figure">5</ref> illustrates the value stack design. We separate the storage of Wasm program values from the control information of the interpreter itself. That is, the value stack contains only Wasm values, while the interpreter's call stack contains only control information, organized into one execution frame per Wasm call frame. As such, the value stack is a contiguous array of Wasm values which increases in size towards higher addresses. Though invisible to Wasm programs, and orthogonal to our design here, the interpreter frames are on the native stack and use the native stack pointer (e.g. %rsp on x86-64) which grows towards lower addresses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1">Value tagging.</head><p>Wasm values can be 32-bit or 64-bit integers, 32-bit or 64-bit floats, 128-bit SIMD vectors, or external references. We choose to store all Wasm values in the value stack as unboxed, so that the interpreter never needs to implicitly allocate a heap object<ref type="foot">foot_9</ref> . This obviously makes sense for all primitive values, since they can share storage as raw bytes in the memory of the value stack. However, an engine may need to find externref values in a value stack during garbage collection. Natively-compiling Wasm engines use stackmaps <ref type="bibr">[Diwan et al. 1992]</ref>. But an interpreter is different, and typically cannot afford either space or time to precompute stackmaps. Instead, we chose to tag value stack entries with a 1-byte type tag, inflating entries to be 32 bytes, i.e. really fat<ref type="foot">foot_10</ref> . Type tags are written into the value stack only when necessary (i.e. when initializing locals in the function prologue, not at all for primitive arithmetic, etc). To be clear, type tags are never needed for any dynamic check, since Wasm code is statically typechecked. They can be omitted if there is no garbage collector or it uses conservative stack scanning <ref type="bibr">[Demers et al. 1989</ref>].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2">Stack overflow.</head><p>Wasm engines must be robust to call stack overflow. The Wasm specification includes one test with unbounded recursive function calls. Every engine must fail gracefully, though the exact point where overflow is detected is not specified.</p><p>Checking for stack overflow should not ruin performance. An overflow should not crash the engine uncontrollably either, but be handled gracefully like other Wasm traps. A check on every value stack push would be prohibitive. Instead, we use a guard page at the end of the value stack and rely on an OS-level signal upon fault to catch and report stack overflow. A single guard page suffices, as the interpreter cannot inadvertantly stride over it; by design it never accesses arbitrarily far ahead. Similarly, interpreter execution frames (on the native stack) are all fixed size and another guard page<ref type="foot">foot_11</ref> will cause a fault if the native execution stack overflows before the value stack.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3.2.3</head><p>Putting it all together. At this point, we've designed two critical data structures necessary to make an efficient in-place interpreter. Figure <ref type="figure">6</ref> completes the set of 9 state registers used by our interpreter implementation. In addition to the 3 &#322;control&#382; registers pointing into the bytecode and sidetable and the two &#322;data&#382; pointers into the value stack, the interpreter also requires:</p><p>&#8226; MEM a pointer to the start of the wasm memory <ref type="foot">13</ref>&#8226; FUNCTION a reference to the current function &#8226; INSTANCE a reference to the current instance &#8226; DISPATCH for dynamically enabling per-instruction probing</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Interpreter implementation in assembly</head><p>We chose to implement our interpreter in a new Wasm research engine, wizard <ref type="bibr">[Titzer 2021</ref>]. While most of the engine is written in a portable, safe, high-level language <ref type="bibr">[Titzer 2013</ref>], we use custom hand-written assembly for the fast interpreter. Using assembly or a custom code generation facility is relatively common for production interpreters. For example, the Java HotSpot virtual machine generates its interpreter <ref type="bibr">[Hot 1998</ref>] using a macro assembler (at startup), V8's Ignition [ <ref type="bibr">Ign 2022]</ref> interpreter is generated from hand-written TurboFan compiler IR (at build time), and JSC's LLint is written in a macro assembler language (build time). Several factors impinge on our decision.</p><p>(1) an interpreter is a small, important piece of code (2) the bytecode format and semantics of Wasm are very stable (3) compilers have trouble generating optimal code for interpreter loops (4) we wish to study interpreter performance in detail ( <ref type="formula">5</ref>) key dispatch techniques are difficult to obtain from compilers (6) developing and debugging assembly code is relatively time consuming<ref type="foot">foot_13</ref> Key advantages of using assembly to implement an interpreter are 1) its code is not perturbed by changing compiler optimizations, 2) key interpreter state can be fixed to specific architectural registers, 3) threaded <ref type="bibr">[Bell 1973</ref><ref type="bibr">], indirect-threaded [Dewar 1975]</ref>, and other dispatch techniques can be used, 4) handlers can be ordered and aligned for optimal I-cache behavior, 5) fast-and slow-paths can be organized inline or out-of-line, 6) all hardware instructions can be used, 7) self-modifying code and dispatch table swap techniques can be used, 8) error handling and hard cases can be factored from handlers, 9) very small resultant code footprint. The interpreter that we present in this paper makes use of nearly all of these techniques. It is implemented using a macro assembler that generates x86-64 machine code, has many switches to enable different features, and provides an instrumentation interface for profiling and debugging. It consists of 2800 source lines of code which generate approximately 14KiB of machine code and 7KiB of dispatch tables.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1">Dispatch tables and handlers.</head><p>Wasm is a bytecode in the true sense of the word. An instruction is encoded as a byte-sized opcode, followed by zero or more immediates. Longer instructions start with a prefix byte. It is natural therefore to design a software interpreter around 256-entry dispatch tables that contain pointers to handlers that implement each bytecode. Figure <ref type="figure">7</ref> illustrates the dispatch table and handler organization for our fast Wasm interpreter.</p><p>The fast interpreter uses multiple dispatch tables, each of which points to a sequence of machine code called a handler. A dispatch through a table consists of loading the address at a particular index and indirectly jumping to that address. The first dispatch table, the main dispatch table, is used for the first byte of an instruction. Since the most important bytecodes in Wasm were assigned a non-prefixed opcode, the first dispatch through the main table normally lands in a handler that will directly execute the bytecode.</p><p>Prefix dispatch tables handle the tricky but rare corner cases. If the first byte of an instruction is a prefix, then the target address in the main table will be a special handler that loads the next byte in the instruction stream and dispatches through the appropriate table. There is still one more wrinkle, however. Wasm is unusual in that the opcode after a prefix can be a variable-length integer. Like nearly all integers in the Wasm binary format, this is a LEB128 [ <ref type="bibr">LEB 2022]</ref> encoding that uses the uppermost bit of each byte to indicate if continuation bytes follow. Thus, in prefix dispatch tables, entries for the upper 128 opcodes (i.e. where the upper bit is 1) point to another special handler that fully decodes the LEB and finally dispatches to an actual bytecode handler, using yet another dispatch table. Of current Wasm bytecodes, only the SIMD opcodes occupy the upper part of their prefix space (0xFD). Thus, in practice, Wasm opcodes normally require just one dispatch, two if prefixed, and maximum three if the LEB opcode is longer than 1 byte.</p><p>3.3.2 Direct vs threaded dispatch. Each bytecode handler consists of machine instructions that manipulate the value stack or module instance or perform control flow. A handler usually simply leaves the interpreter registers ready to start the next bytecode pointed to by IP, except for instructions like unreachable and return, which will terminate execution or return from the current interpreter frame, respectively.</p><p>Conceptually an interpreter has a single loop that repeatedly dispatches one instruction at a time, each handler jumping back to the loop header. Nowadays, a technique known as threaded dispatch <ref type="bibr">[Shi et al. 2005</ref>] is often used, where instead of a jump, a copy of the dispatch code is inlined at the end of each handler. This is difficult to do in a high-level language<ref type="foot">foot_14</ref> , but easy in assembly. This saves a jump instruction and typically makes better use of the CPU's indirect branch predictor.</p><p>3.3.3 Debugging and instrumenting with probes. One of the motivating advantages for an interpreter tier in any virtual machine is ease of debugging and instrumentation. In particular, an interpreter implements an exact bytecode-by-bytecode emulation of the machine, offering the possibility of stopping before or after any bytecode and inspecting the virtual machine state. This is key to support both high and low-level debugging of a program as well as instrumentation such as profiling.</p><p>Our fast interpreter design has provisions for general instrumentation at the bytecode level. It offers the ability to insert probes <ref type="bibr">[Titzer and Palsberg 2005</ref>] at either local locations in a program, or global, the interpreter loop itself. Both have different use cases. A probe is simply a callback to engine-level code that fires when the given bytecode is executed, or each time the interpreter loop is executed. In a probe callback, one can inspect virtual machine state through an engine API and then indicate if the program should resume normally or do something else (typically, terminate). Probes are primitives from which debugging support (e.g. breakpoints), tracing support (e.g. logging), or profiling support (e.g. counters), are built.</p><p>Figure <ref type="figure">7</ref> shows how the fast interpreter supports probes.</p><p>&#8226; Local probes. For a probe inserted at a particular instruction, the original bytecode of the function containing the location is copied and the bytecode is overwritten with a special, normally-illegal bytecode, PROBE. Since illegal bytecodes will be rejected by the code validator, they will never appear in valid programs. Dispatching on PROBE will land in a special handler that looks up the user's callback associated with this bytecode, calls it, and then after, loads the original bytecode and dispatches through the main table. &#8226; Global probes. For a probe inserted into the global interpreter loop, the interpreter switches modes, using the probe dispatch table for every instruction. Similar to the local probes, the interpreter looks up the global instrumentation, calls it, and then after, dispatches through the main table.</p><p>It's important to note that this design allows probes to be inserted and removed dynamically. This allows maximum flexibility to instrumentation code while allowing the interpreter to run at full speed otherwise. For example, suppose a user wants to trace the execution of just one particular function in a module. They could insert global instrumentation into the main interpreter loop and filter out all callbacks where the function of interest is not on the top of the stack. But this is inefficient; the interpreter will go through the probed table every time, issue calls into the runtime, into the user code, which inspects state, etc. A more efficient way is to insert a local probe into the interesting function. When the local probe is fired, it dynamically inserts the global probe, thereby getting called for every subsequent bytecode. When the function calls another, or when it returns, the global probe can be disabled, and everything goes back to full speed. The interpreter is careful to switch back to the main dispatch table whenever it detects global instrumentation is disabled, so it will always run fast when it has opportunity to do so.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Tuning</head><p>Considerable work has gone into designing efficient data structures and dispatch tables for the fast interpreter presented here. We'd be remiss in not enumerating the many other tuning strategies applied here and what we've learned. This was made easier by the fact that the fast interpreter presented here was implemented not in textual assembly language, but in a macro assembler framework we built in a high-level language. Thus configuration options were easy to introduce into the code that generates the interpreter, rather than relying on macro facilities in a textual assembly language.</p><p>&#8226; Manual register allocation. The fast interpreter state consists of 9 registers (Figure <ref type="figure">6</ref>).</p><p>All of today's 64 bit architectures have enough architectural registers that it is simply a matter of assigning interpreter registers to architectural registers. There are enough left-over registers to implement all bytecode handlers without spilling. Registers are only spilled when the interpreter performs a Wasm call or calls into the engine runtime. Our implementation against the macro assembler uses symbolic registers throughout, allowing the mapping to change easily. For x86-64, we chose register assignments carefully to eliminate REX prefixes for the most commonly-occurring registers. We did not measure alternative assignments, but suspect that CPU register renaming makes additional tuning moot. &#8226; Minimal dispatch sequence. In addition to the choice between threaded and non-threaded dispatch, we experimented with dispatch table designs where entries were 2, 4, and 8 bytes.</p><p>In the 2 byte design, entries are not direct addresses, but deltas that are applied to the start of a code region, requiring an additional add instruction in the dispatch sequence. In the 4 and 8 byte alternatives, the entries are direct addresses, constrained to be in the lower 4gb of address space in the former case. Of these alternatives, the 4 byte sequence is fastest, often as much as 10% faster. Unsurprisingly, we found threaded dispatch, where the dispatch sequence is duplicated at the end of every handler to be on average 14%, maximum 29% faster. &#8226; Hardware checking of memory bounds. Like most Wasm implementations on 64-bit machines, our engine uses virtual memory protection to catch out-of-bounds memory accesses by the program, reserving 8GiB of address space for a memory and protecting the out-ofbounds address ranges. We did not evaluate the cost of explicit memory bounds checks, but anecdotal data from other engines suggests this is on the order of 20-30%, even in JITed code.</p><p>It may be proportionately less for an interpreter. &#8226; Value tagging. Our interpreter design uses value tags to find GC roots when necessary. To evaluate the performance cost, we measure with and without tags and found that tags reduce performance of the interpreter on average 8%, maximum 15%. We did not pursue design alternatives that would require computing a reference stack map (as done in JITs), as the space cost negates the primary advantage of the in-place interpreter design. &#8226; Really fat values. SIMD values are 128 bits (16 bytes) wide. When coupled with value tagging, values occupy 32 bytes of memory each in the value stack. Uniform size for values in the value stack (Figure <ref type="figure">5</ref>) is effectively required because, unlike the JVM, Wasm bytecode doesn't use multiple &#322;slots&#382; for wider values; the numbering and operations like drop treat each value as a single slot. While our experimental engine fully supports the v128 type in its type system and validation, we did not yet implement the 236 Wasm SIMD instructions in the assembly interpreter, thus we did not measure their performance. &#8226; Inline/out-of-line LEB decoding. The interpreter must dynamically decode the many immediates in Wasm that are encoded as variable-length LEBs, e.g. the index of a local.get.</p><p>Often these variable-length immediates are just a single byte. Moving the multi-byte case out-of-line (leaving the inline 1-byte case as three instructions&#347;&#322;load, test upper bit, branch out-of-line&#382;) improves performance as much as 5%. &#8226; Memory #0 base pointer. Wasm programs access memory very frequently. As described earlier, our implementation caches a pointer to the base of Wasm memory #0 in an architectural register. This saves one or two loads for every memory access (&#322;load memory tables from instance, load memory start&#382;), depending if the base for memory #0 is directly in the instance, rather than in an array of memory bases. We did not evaluate the performance impact of this optimization. &#8226; Handler alignment. Bytecode handlers are short but critically important sequences of code.</p><p>We suspected they may be subject to microarchitectural effects of instruction and trace caches arising from code alignment. We experimented with 1, 2, 4, 8, 16, and 32 byte alignment of handlers and concluded that 8 byte alignment was consistently fastest, but not by much (0-5%); variance is high and very application-dependent. &#8226; Handler order. We suspect that handler code order may introduce significant additional microarchitectural effects <ref type="bibr">[Mytkowicz et al. 2009] [Curtsinger and</ref><ref type="bibr">Berger 2013]</ref>. However, we did not heavily optimize the handler order beyond simply emitting common bytecode handlers first, clustering them near the beginning of the interpreter code. &#8226; Error case sharing. Several Wasm bytecodes trap on error cases, like divide by zero, unrepresentable floats, etc. We factored the error handling paths in order to save code space. Since traps usually terminate the program, the performance does not matter. &#8226; Call/entry/exit sharing. We exploit commonalities among the call, call_indirect and tail call<ref type="foot">foot_15</ref> bytecodes, return/fall-through, and branches in order to save code space. This may have a small effect on performance, but we did not measure this. &#8226; Handler sharing. We exploit the fact that some instructions end up with identical handler code and share the handlers (e.g. block and loop, some memory stores). We did not measure alternatives but suspect the performance effect is neglible.</p><p>The above conclusions are supported by a performance evaluation of alternatives that is beyond the scope of this paper. To summarize those experiments, the overall difference between the best (tuned) and worst (untuned) interpreter performance is 20% to 60% across the benchmark suite. Interestingly, as we will see in the next section, this difference is enough to make our interpreter design meet and exceed existing (re-writing) interpreters in comparative performance tests. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">EVALUATION</head><p>In this section, we evaluate our Wasm interpreter against many state-of-the-art Wasm engines. The goal is to assess our claim that an in-place interpreter supplies the missing point in the execution tier tradeoff space between translation time, space overhead, and execution time. In particular, an in-place interpreter should offer superior translation time and space overhead compared to other tiers. Ideally, such an interpreter should also have comparable execution time to existing interpreter tiers. Of course we expect that interpreters should be handily outclassed by JIT compilers for long-running programs, so we shouldn't expect to replace them. Our experiments quantify the tradeoff space empirically.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Wasm engines in 2022</head><p>Today, five years after support was announced in 4 major browsers, engines are significantly different, and many new competitors have appeared. In particular, Web engines have evolved significantly from what has been reported in the literature to date. Today, all Web engines are sophisticated multi-tier systems. In addition to the rapid evolution of Web engines, a number of non-Web Wasm engines have appeared.</p><p>&#8226; wasmtime [Was 2021b] -A standalone runtime implemented in Rust.</p><p>One compiler tier. The optimizing Cranelift compiler can be used in JIT or AOT modes. Three compiler tiers, one at a time. Packages two compilers from other projects: Cranelift (from wasmtime) or LLVM, and offers its own one-pass compiler.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Benchmark setup</head><p>Our tier choices. We chose only to compare against execution tiers that perform dynamic translation, intentionally omitting those engines performing static (AOT) translation. Further, this paper focuses on interpreter design, which is a crucial point in the tradeoff space of execution tiers. There we include several experiments that compare engines with multiple tiers to understand their current tradeoffs. We chose to measure these engine/tiering configurations:</p><p>&#8226; wizard-Our engine with its in-place interpreter that uses a sidetable.</p><p>&#8226; wamr-classic -The WAMR engine with its &#322;classic&#382; in-place interpreter.</p><p>&#8226; wamr-fast -The WAMR engine with its (now default) rewriting interpreter.</p><p>&#8226; wasm3 -Default configuration of a rewriting interpreter.</p><p>&#8226; v8-liftoff -V8 with only the Liftoff baseline JIT.</p><p>&#8226; v8-turbofan -V8 with only the TurboFan optimizing JIT.</p><p>&#8226; sm-base -Spidermonkey with only the baseline compiler.</p><p>&#8226; sm-opt -Spidermonkey with only the optimizing compiler.</p><p>&#8226; jsc-int -JSC with JITs disabled, i.e. only the rewriting interpreter.</p><p>&#8226; jsc-bbq -JSC with the rewriting interpreter and BBQ quick JIT only.</p><p>&#8226; jsc-omg -JSC with the rewriting interpreter and OMG optimizing JIT only.</p><p>Evaluation methodology. All tests were performed on a Linux 4.15 kernel on a machine with 32GiB of RAM and one Intel Core i7-8700K CPU @ 3.7GHz and the CPU governor set to &#322;performance&#382;. Benchmarks used are the PolyBenchC-4.2.1 suite, with the MEDIUM dataset. As included in the artifact accompanying this paper, all engines were cloned from source on roughly 2022-07-10. That implies V8 version 10.5.0, JSC version 614.1.20, and Spidermonkey version C104.0a1. All engines were built as "release" from source. Data was collected in two experiments; 100 uninstrumented runs of 10 engine configurations on the 24 benchmarks to gather execution time, and 100 instrumented runs of the same to collect translation time and space statistics. Every run represents a separate OS process. Execution times are of the complete process (i.e. not internally timed), while translation times are the sum over all Wasm functions translated in the run. All timing results are the average<ref type="foot">foot_17</ref> over the 100 runs, and when error bars are shown, they represent the 5 th and 95 th percentiles of the distribution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Translation time</head><p>All of our chosen execution tiers apply some form of translation to the input bytecode, except wizard and wamr-classic (the only other in-place interpreter of which we are aware). Rewriters either generate an internal bytecode (interpreters) or machine code (compilers). Thus we measure the time taken for the respective translation by instrumenting the source code of each engine by  adding time and space measurements. In the case of wamr-fast, the custom bytecode is generated as a side-effect of validation, so we measure the additional time for translation by subtracting the baseline validation time obtained from wasmr-classic, which has no translation time. wizard doesn't translate, but instead the reported time is the sidetable tracking and construction time, measured as the difference between validation time with and without sidetable tracking. We plot the average translation time, divided by the number of input bytes translated. The ratio of translation time to input bytes normalizes differences in tiering strategy (e.g. lazy compilation) and benchmark size. Figure <ref type="figure">8</ref> gives the results (note Y-axis is logarithmic).</p><p>Here we focus on configurations that isolate individual tiers, rather than multi-tier adaptive configurations. The experimental results show a dramatic difference in translation time for these tiers, nearly 3 orders of magnitude. Though baseline compilers often differ from optimizing compilers by more than 10&#215;, they are still 10&#215; more expensive than rewriting interpreters. Yet there is overlap, as the rewrite time of jsc-int, an interpreter, is almost on-par with v8-liftoff, a baseline compiler.</p><p>Similarly, the jsc-bbq quick compiler is closer to an optimizing compiler, as it uses an IR, unlike v8-liftoff. The winner is clearly our in-place interpreter design, as the sidetable generation in wizard is a full order of magnitude cheaper than the cheapest rewriting interpreters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Translation space</head><p>Translation not only consumes time, but space. We measure the memory overhead of translation in terms of bytes generated per input byte of code. Here again, the ratio of output bytes to input bytes normalizes differences in tiering strategy (e.g. lazy compilation) and benchmark size. Note that we only measure the size of generated code (whether internal bytecode or machine code) and not additional metadata such as code objects or a source position table, which all rewriters need for debugging. We also do not measure per-module space costs. Thus this experiment underestimates space costs of rewriters when debugging is involved. Figure <ref type="figure">9</ref> gives the results. There are several somewhat surprising results here. We find that some rewriting interpreters (such as wamr) can consume up to 4&#215; as much space as the original bytecode, while others (such as jsc-int) consume about the same amount of space as the original bytecode. We believe this is because jsc-int uses an internal bytecode similar to JSC's JavaScript bytecode, which has been heavily tuned to reduce memory consumption on webpages. Also somewhat surprising is that JIT compilers, which generate machine code, do not necessarily consume more code space, in general, than rewriting interpreters, though jsc-bbq is an outlier. We discovered that wasm3, like wamr, often trades space for time&#347;nearly all of its bytecode quantities are word-sized. Yet during rewriting, wasm3 does a number of peephole-like optimizations, globally canonicalizes constants, and uses a register machine internally, all of which save space.</p><p>These measurements show the sidetable in wizard takes far less space, only about 30% additional space compared to the original bytecode, as it only requires entries for control-flow<ref type="foot">foot_18</ref> , and many sidetables are empty. That is a full order of magnitude more space-efficient than the other tiers, which typically cost 2&#215; to 4&#215; the original bytecode's space.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Execution time</head><p>Figure <ref type="figure">10</ref> gives the absolute execution times of the benchmarks for the v8-turbofan and wasm3 execution tiers. Execution times reported in other figures in this section are normalized to either of these two baselines. On the horizontal axis, we sorted the benchmarks by their execution time on wasm3 (same as in the table). As we can see, the shortest-running benchmarks on the left side of the graph do not run long enough to benefit from the work spent by the optimizing compiler, and all baseline compilers are faster, with the interpreter fastest. There is also a fixed cost of starting a relatively heavyweight JSVM, which contributes to the interpreter being fastest for the shortest 4 benchmarks. Moving to the right, as execution time increases with longer-running benchmarks, the fixed cost of startup and the cost of compilation are increasingly amortized. Thus the middle of the graph shows more balanced results; the interpreter falls behind, but the baseline compilers, particularly v8-liftoff, remain competitive. Continuing on, the gap between optimizing compilers (particularly v8-turbofan and sm-opt) and the rest increases; execution time is now dominated by code quality. Baseline compilers level off, with v8-liftoff and sm-base around 2&#215; slower and jsc-bbq closer to 3&#215;. The common rule of thumb that interpreters are 10&#215; slower than JITs turns out to be roughly accurate in the end, as there is a clear trend towards roughly 10&#215; for wasm3 vs v8-turbofan here. These results match our intuition, but better, quantify it, giving us fairly round numbers to reason with. They also reaffirm the need to have a broad picture of execution time:</p><p>&#8226; Each of (interpreter, baseline compiler, optimizing compiler) runs some benchmark fastest.   4.5.2 Interpreter showdown. Of course, this paper focuses on making a fast, in-place interpreter. Using similar benchmarking methodology, on the same benchmarks, we evaluated the five interpreter tiers. Figure <ref type="figure">12</ref> gives the results. As we found that wasm3 was consistently the fastest interpreter across the board, we chose to normalize all execution times in Figure <ref type="figure">12</ref> to it. Here we find that wamr-fast, the configuration of the wamr engine with its rewriting interpreter, is consistently 2 nd fastest, nearly always 1.5&#215; to 1.7&#215; slower than wasm3, while the others are typically 2&#215;-3&#215; slower. We attribute this performance advantage to wasm3's threaded code dispatch technique (where the IR is simply a list of function pointers with immediates), its several bytecode-level optimizations, and its register machine design which also incorporates global constants. In the first 4 (shortest running) benchmarks on the left, we see the amortization effect of wizard's lower translation cost; it is faster than jsc-int and wamr-classic and nearly on par with wamr-fast. Moving to the right, difference in translation time becomes amortized, and interpreter speed starts to dominate, though the difference is not nearly as dramatic as the comparison of JITs, and levels off quickly. We see that our design performs better than the other in-place interpreter, wasm-classic in most situations, and generally on-par with jsc-int. 4.5.3 Multi-tier configurations in context. Our experiments in this section focused on isolating individual execution tiers in order to study their tradeoffs. In production configurations, all web engines run in multi-tier configurations, as described in Section 4.1. We ran additional experiments for engines in their default configuration, but we found that the default configurations do not always yield the best performance results. The design space for tiering strategies is vast, with many variables, such as laziness, concurrency, thresholds for tiering up, on-disk caching, etc. So far, our measured results are complex enough to merit a more detailed study in their own right, which is unfortunately beyond the scope of this paper. Thus, for now, we cannot characterize the performance of Wasm interpreters in complex multi-tier configurations. 4.5.4 Avoiding pathological behavior. The only other production Wasm interpreter to use an inplace design is wamr-classic. It is written in C and uses gcc extensions to construct a jump table for threaded dispatch. The jump table is crucial for performance; disabling it via a configuration variable reduces performance by more than 2&#215;, which reaffirms the need for this crucial optimization in interpreter design. Of course, we tested wamr-classic in its fastest configuration, as 2&#215; would have put it handily out of the running.</p><p>The wamr-classic design uses both a dynamic control stack and a cache of control entries that help it find branch targets during runtime. The cache is a fixed-size, 128-element, 2-way set associative cache indexed by branch address. That amounts to a fixed cost per module, rather than per function (we did not include its space cost in Figure <ref type="figure">9</ref>). A cache miss results in a slow path where the entire function may be rescanned to the end, repopulating the cache with new entries, including an entry for the current branch. Thus the rescan overhead is bounded only by the function size, and the miss rate is determined by the program's temporal and spatial locality of branch instructions, regardless of their outcome (taken/not taken). The performance could be pathological for a large program with many branches, a fact obscured by the relatively small benchmark programs in our suite. To verify the pathological behavior, we constructed a large function consisting of a few thousand branches with poor locality and observed slowdowns of as much as 8&#215;. It could be even worse on larger programs with millions of instructions that have hundreds of thousands of branches. This vulnerability was known to its authors and is one of the motivations for wamr to now ship the faster rewriting interpreter (wamr-fast) by default.</p><p>Given the potential pathological behavior of wamr-classic, we believe wizard represents the first viable fast in-place design.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">RELATED WORK</head><p>Interpreters are as old as the hills. From the first popular interpreted language, Lisp <ref type="bibr">[McCarthy 1960</ref>] in 1960, to today's modern scripting and data manipulation languages like Python, Ruby, R, PHP, and MatLab, interpreter performance has been a key subject of interest. Many languages that are now fast through dynamic compilation were once primarily interpreted, such as Smalltalk, Self, Java, C#, OCaml, and JavaScript. Python is still most-often interpreted, and its performance is still of key concern <ref type="bibr">[Barany 2014</ref><ref type="bibr">] [Zhang et al. 2022]</ref>. Because of their advantages, new interpreters continue to appear, even for languages previously compiled <ref type="bibr">[Hu et al. 2021]</ref>.</p><p>Fixed or flexible format? Research into interpreter techniques either assumes the code format to be fixed, such as standardized bytecode formats like the JVM, the CLR, Dalvik, and WebAssembly, where binary programs arrive metaphorically chiselled into stone, or flexible, where the format can be changed to suit the needs of a specific language or implementation. Clearly the larger design space of flexible formats affords more techniques, though key lessons hopefully improve the design of future fixed formats. For example, a key question of interest is whether a stack machine (such as the JVM or WebAssembly), or a register machine (such as the CLR or Dalvik bytecode), is inherently more efficient <ref type="bibr">[Shi et al. 2005</ref><ref type="bibr">] [Davis et al. 2003]</ref>.</p><p>Interestingly, though it was not a research question at the outset of this work, our experimental results lend some credence to the conclusion that interpreters for register-based VMs are faster than stack-based ones, since the rewriting interpreters that are faster than wizard are both register machines. Though it is too late to change the JVM, CLR, Dalvik, or Wasm, research here informs the next proposed format. Yet note also that the side-table strategy employed here can also inform the next proposed format, as a bytecode not designed for fast interpretation can be augmented with additional metadata to improve interpretation speed.</p><p>Interpreter dispatch techniques. As our own experimental results reaffirm, the dispatch sequence is critical to interpreter performance. If the format is flexible, e.g. if entirely in-memory, then threaded code <ref type="bibr">[Bell 1973</ref>] or more compact indirect threaded code <ref type="bibr">[Dewar 1975</ref>] can be used. A large amount of work has aimed to improve the predictability of indirect branches <ref type="bibr">[Rohou et al. 2015]</ref> by exploiting the microarchitectural details of branch target buffers (BTBs) <ref type="bibr">[Chang et al. 1997</ref>]. Some of the latter techniques may also be applicable to wizard's interpreter design.</p><p>Superinstructions. For flexible formats, the number of dispatches can be reduced with superinstructions <ref type="bibr">[Proebsting 1995]</ref>, replacing small opcodes with larger, combined opcodes. This can be done online <ref type="bibr">[Ertl and Gregg 2004a]</ref>, offline, or a mixture of both. Super-instructions can be formed from combinations of simpler instructions, often in embedded Java VMs <ref type="bibr">[Zilli et al. 2015b]</ref>, or by defining language-specific high-level operations like in CPython and JavaScript VMs. Superinstructions might not necessarily apply to a fixed format, but given that many Wasm instructions are a single byte, an intriguing possibility might be to dispatch on two bytes at a time, using a table of 64K entries that has some superinstruction entries for common patterns. At the least, the superinstruction could eliminate the need to check for multi-byte LEBs, e.g. in local.get/set, two very frequent Wasm instructions.</p><p>Radically different interpreter IRs. Bytecode is not the only interpretable format. Abstract syntax trees or other intermediate representations can be interpreted directly in memory. Usually, speed is not the goal, though recently a fast AST-walking interpreter has been described for R <ref type="bibr">[Kalibera et al. 2014]</ref>. Instead, interpreting IR has other benefits, e.g. proving the correctness of the interpreter, partial evaluation, and collapsing multiple levels of interpretation <ref type="bibr">[Amin and Rompf 2017]</ref>. Truffle/Graal <ref type="bibr">[W&#252;rthinger et al. 2012</ref>] uses ASTs, for example <ref type="bibr">[Niephaus et al. 2018]</ref>, as a way to express an interpreter for the Futamura <ref type="bibr">[Futamura 1983</ref>] projection. Other graph-based IRs have been explored <ref type="bibr">[Williams et al. 2010</ref>]. Some compilers have IR interpreters [Lli 2015] for testing. A standard compilers course <ref type="bibr">[Palsberg 2014</ref>] includes interpreters for each IR during translation. None seem to compete with bytecode interpreters in speed.</p><p>Optimizing fixed format interpreters. A number of interpreter optimizations can still apply to fixed formats. The most common is duplicating the dispatch sequence at the end of every handler (referred to in this paper and elsewhere, if somewhat imprecisely, as a threaded interpreter or threaded dispatch), and is used in all the interpreters we tested. Threaded dispatch has even been applied to hosted interpreters on the JVM <ref type="bibr">[Savrun-Yeni&#231;eri et al. 2014]</ref>. For stack machines, stack caching <ref type="bibr">[Peng et al. 2004</ref><ref type="bibr">] [Ertl and Gregg 2004b</ref><ref type="bibr">] [Ogata et al. 2002]</ref> VMs try to keep the top-of-stack cached in a register, reducing loads and stores to the value stack. It is possible to further duplicate <ref type="bibr">[Prokopski and Verbrugge 2008]</ref> handlers to get more BTB entries <ref type="bibr">[Casey et al. 2007]</ref>. Recent work has also proposed new hardware support for indirect speculation <ref type="bibr">[Zilli et al. 2015a</ref><ref type="bibr">] [Williams et al. 2008]</ref> or to directly address BTB entries <ref type="bibr">[Kim et al. 2016</ref>]. Many JVMs mutate bytecode in-place <ref type="bibr">[Brunthaler 2010</ref>] to replace symbolic references with indexes and offsets.</p><p>Interpreter generators. Writing a highly efficient interpreter remains a black art. It is challenging, sometimes impossible, to convince compilers for high-level languages to emit perfect interpreter code. A number of interpreter generation frameworks have been proposed, including Tiger <ref type="bibr">[Casey et al. 2005]</ref>, VMGen <ref type="bibr">[Ertl et al. 2002]</ref>, and a JavaScript VM generator <ref type="bibr">[Kataoka et al. 2018]</ref>, that automate much of the tedious bookkeeping and broadly apply best practices.</p><p>Interplay with JITs and debugging. If the virtual machine is allowed to generate new machine code, then the entire VM design space is unlocked. Context threaded code <ref type="bibr">[Zaleski et al. 2005]</ref>, where code is represented as a sequence of calls to handlers, can be used. From there, selective inlining <ref type="bibr">[Piumarta and Riccardi 1998</ref>] of handlers can be applied to improve performance. Discussion of JIT designs is beyond the scope of the paper, yet their interactions with interpreters is of note. Trace compilation <ref type="bibr">[Gal et al. 2009</ref>] is fed by collecting execution traces from an interpreter. <ref type="bibr">Metatracing [Bolz et al. 2009]</ref>, i.e. tracing through the handler implementation, has been employed by the PyPy dynamic optimizer. Dynamic adaptive optimization with deoptimization is now standard in many virtual machines. The design of interpreter stack frames determines the cost and complexity of on-stack replacement. In-place interpretation has been shown to ease debugging support, since no mapping need be maintained from rewriting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">CONCLUSION</head><p>In-place interpretation has long been considered when designing fixed code formats such as the JVM, CLR, and Dalvik, since it is the most memory-efficient. Wasm is unique in that it was explicitly designed with near-native performance as the highest priority and engines shipped with optimizing compiler tiers first. Interpretation was not thought necessary, and in-place interpretation, stymied by structured control flow, was dismissed as either impossible or at least unneeded. Yet in the intervening five years, startup time and memory consumption have increased in priority, and interpretation of Wasm (by rewriting to an internal format) has become more widespread.</p><p>This paper restores the missing execution tier for Wasm, regaining the key property of efficient in-place interpretation that was thought lost. We presented the design and implementation of the first fast in-place interpreter for Wasm that utilizes a compact sidetable easily generated as a side-effect of the code validation algorithm. We measured an order of magnitude improvement in memory consumption and processing time over rewriting Wasm. With this open problem now solved, we believe that Wasm engines in the future will employ new interpreter tiers for improved startup time, reduced memory consumption, and improved debugging support.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">DATA AVAILABILITY STATEMENT</head><p>The experiments in this paper were conducted on open-source software which has been packaged and made available as an accompanying artifact accessible on GitHub <ref type="bibr">[Titzer 2022a</ref>] and on Zenodo <ref type="bibr">[Titzer 2022b</ref>].</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>In fact, in a smoky back room, I probably declared, &#322;Interpreters don't matter here. &#382;</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>Though discontinued, ChakraCore was the first Wasm engine to feature a rewriting interpreter.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>Today, V8 supports debugging of Wasm code in Liftoff, its baseline compiler tier. Proc. ACM Program. Lang., Vol. 6, No. OOPSLA2, Article 148. Publication date: October 2022.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_3"><p>Proc. ACM Program. Lang., Vol. 6, No. OOPSLA2, Article 148. Publication date: October 2022.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>Note that the execution stack is not aliased by Wasm memory, thus not vulnerable to stack smashing</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5"><p>Provably minimal, if a CFG is restructured from the known algorithm for optimal interval analysis, ensuring the validation metadata per control flow construct is discarded and reused as promptly as possible.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6"><p>Other stack machine designs, like JVM bytecode, allow values on the operand stack while branching, but stack heights and contents must match. Thus JVM branches cannot implicitly pop values. Proc. ACM Program. Lang., Vol. 6, No. OOPSLA2, Article 148. Publication date: October 2022.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7"><p>In Wasm, all control constructs can have data parameters and results, represented as a block signature. Thus a block can be seen as a super-instruction; it pops values off the stack, and every path to its end pushes the same number and type of results. This allows for very compact code, often obviating local variables. In practice, most blocks have an empty block signature.Proc. ACM Program. Lang., Vol. 6, No. OOPSLA2, Article 148. Publication date: October 2022.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8"><p>More precisely, a macro assembler API in a high-level language that has methods to generate individual instructions, allowing it to be configurable in ways that typical textual assembly languages are not.Proc. ACM Program. Lang., Vol. 6, No. OOPSLA2, Article 148. Publication date: October 2022.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_9"><p>Boxing is a major overhead in dynamic language implementations and would be prohibitively expensive for Wasm, outweighing all optimizations described in this paper.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_10"><p>Aligning value stack entries on a power-of-two boundary allows for shift-based arithmetic when indexing.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_11"><p>Using sigaltstack on POSIX platforms for signal handling. Proc. ACM Program. Lang., Vol. 6, No. OOPSLA2, Article 148. Publication date: October 2022.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_12"><p>Though not shown, Wasm memories also have a guard region, obviating the need for an explicit bounds check.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="14" xml:id="foot_13"><p>This deserves some careful consideration. As Wasm is a standardized wire format that is likely to change only by extension, the interpreter is subject to less change pressure, except to adapt to internal changes in engine organization (e.g. layout of internal data structures). We feel its warranted to hand-optimize assembly in this limited case. Further, much work on the interpreter can be reused in a baseline compiler, as the code sequences emitted by a single pass compiler are nearly identical to interpreter bytecode handlers.Proc. ACM Program. Lang., Vol. 6, No. OOPSLA2, Article 148. Publication date: October 2022.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="15" xml:id="foot_14"><p>Compilers generally will not transform a straight-forward loop-over-switch into threaded dispatch, as the necessary transformation, tail-duplicating the loop header for many hundreds of cases, is highly specific. Instead, we must arduously fill out manual dispatch tables and either use tail calls or, in C, the non-standard gcc &#322;labels as values&#382; [Gcc 2022] extension. The resulting control flow is complex, with multiple irreducible loops and hundreds of indirect branch targets, and may lead to spills across important instructions. To use tail calls, we must rewrite the entire interpreter as individual handler functions that end with a tail-call to the next handler (as is done in Ignition [Ign 2022]), passing the entire interpreter state forward as arguments. Yet only some C compilers support a non-standard tail-call optimization. In any case, without a custom calling convention, we run out of registers and spill some interpreter state on the stack unnecessarily.Proc. ACM Program. Lang., Vol. 6, No. OOPSLA2, Article 148. Publication date: October 2022.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="16" xml:id="foot_15"><p>wizard implements a few Wasm extension proposals such tail-call. Proc. ACM Program. Lang., Vol. 6, No. OOPSLA2, Article 148. Publication date: October 2022.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="17" xml:id="foot_16"><p>It is; we confirm their measurements in our experiments.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="18" xml:id="foot_17"><p>Arithmetic mean, including all measurements. We report the average to include the effect of outliers, while reporting the 95% confidence interval to characterize the distribution.Proc. ACM Program. Lang., Vol. 6, No. OOPSLA2, Article 148. Publication date: October 2022.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="19" xml:id="foot_18"><p>Note that here, we measure wizard sidetable entries compressed to 2 bytes where possible, though this is not the default.Proc. ACM Program. Lang., Vol. 6, No. OOPSLA2, Article 148. Publication date: October</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2022" xml:id="foot_19"><p/></note>
		</body>
		</text>
</TEI>
