<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Going beyond the Limits of SFI: Flexible and Secure Hardware-Assisted In-Process Isolation with HFI</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>03/25/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10442364</idno>
					<idno type="doi">10.1145/3582016.3582023</idno>
					<title level='j'>Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Shravan Narayan</author><author>Tal Garfinkel</author><author>Mohammadkazem Taram</author><author>Joey Rudek</author><author>Daniel Moghimi</author><author>Evan Johnson</author><author>Chris Fallin</author><author>Anjo Vahldiek-Oberwagner</author><author>Michael LeMay</author><author>Ravi Sahita</author><author>Dean Tullsen</author><author>Deian Stefan</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We introduce Hardware-assisted Fault Isolation (HFI), a simple extension to existing processors to support secure, flexible, and efficient in-process isolation. HFI addresses the limitations of existing software-based isolation (SFI) systems including: runtime overheads, limited scalability, vulnerability to Spectre attacks, and limited compatibility with existing code. HFI can seamlessly integrate with current SFI systems (e.g., WebAssembly), or directly sandbox unmodified native binaries. To ease adoption, HFI relies only on incremental changes to the data and control path of existing high-performance processors. We evaluate HFI for x86-64 using the gem5 simulator and compiler-based emulation on a mix of real and synthetic workloads.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>WebAssembly (Wasm) <ref type="bibr">[28]</ref> has made in-process isolation ubiquitous. In the browser, it powers applications used by billions of people daily <ref type="bibr">[49,</ref><ref type="bibr">64,</ref><ref type="bibr">79]</ref>. Beyond the browser, it enables isolation in places that existing hardware-based protection can't - from hyperconsolidated FaaS platforms <ref type="bibr">[32,</ref><ref type="bibr">76]</ref> and high-performance data planes <ref type="bibr">[59]</ref>, to data streaming platforms <ref type="bibr">[61]</ref>.</p><p>Wasm makes these novel use cases possible by enforcing isolation in software - using software-based isolation (SFI) - which avoids the high overheads imposed by existing hardware-based isolation primitives <ref type="bibr">[54,</ref><ref type="bibr">71]</ref> (e.g., processes, containers, and VMs). This approach to isolation enables several unique properties.</p><p>To start, Wasm context switches are very fast - in the low tens of cycles <ref type="bibr">[38]</ref>, roughly the same as a function call - and orders of magnitude cheaper than a hardware context switch <ref type="bibr">[30]</ref>, let alone IPC. These fast context switches let Wasm provide extensibility in high-performance data planes <ref type="bibr">[59]</ref>, data streaming platforms <ref type="bibr">[61]</ref>, and SaaS applications <ref type="bibr">[56]</ref>; they also enable fine-grain isolation of vulnerable libraries in latency-sensitive browser renderers <ref type="bibr">[10,</ref><ref type="bibr">52]</ref>.</p><p>Wasm context creation is also very fast - production FaaS systems can spin up a new Wasm instance in 30 µs <ref type="bibr">[20]</ref>, instead of the tens to hundreds of milliseconds it takes to spin up a container or VM <ref type="bibr">[20,</ref><ref type="bibr">48,</ref><ref type="bibr">69]</ref>. 
Along with low context-switch overheads, this has enabled a new class of high-concurrency, low-latency edge computing platforms from Fastly <ref type="bibr">[32]</ref>, Cloudflare <ref type="bibr">[76]</ref>, Akamai <ref type="bibr">[2]</ref>, etc., that would not have been possible with containers or VMs.</p><p>Unfortunately, the power of software-based isolation also comes with limitations: Performance - even the fastest Wasm implementations can easily impose a 40% overhead on code execution <ref type="bibr">[36,</ref><ref type="bibr">88]</ref>, limiting Wasm's ability to support more demanding workloads; Scaling - Wasm relies on an ad-hoc system of guard regions for memory isolation ( &#167;2) which consumes huge amounts of virtual memory and limits efficiency in high-scale settings like FaaS platforms; Backwards compatibility - Wasm cannot run unmodified binaries (e.g., system libraries), code that directly accesses hardware (e.g., SIMD intrinsics, assembly language), or dynamically generated code (e.g., from just-in-time compilers); Spectre safety - processors can speculate past security checks in Wasm, making it vulnerable to Spectre attacks <ref type="foot">1</ref>. These limitations are the result of trying to bridge, in software, the gap between past models of hardware protection and the current needs of software systems.</p><p>To overcome them, we developed hardware-assisted fault isolation (HFI) - a simple ISA extension that brings support for in-process isolation to modern processors.</p><p>HFI takes a two-track approach to support in-process isolation (aka sandboxing). First, it provides hardware assistance to eliminate the limitations Wasm inherits from SFI - essentially replacing SFI with efficient hardware primitives. 
Second, HFI provides in-process isolation that is backwards compatible, allowing it to sandbox unmodified native binaries and dynamically generated code.</p><p>HFI offers primitives that systematically eliminate typical hardware (and software) overheads by design: it imposes near-zero overhead on sandbox setup, tear-down, and resizing; it can support an arbitrary number of concurrent sandboxes; it offers context switch overheads on the same order as a function call; it can share memory between sandboxes at near-zero cost; it provides flexible low-cost mitigations for Spectre, and near-zero cost system call interposition (for native binaries).</p><p>We made a few key choices that enable this. First, HFI does everything in userspace; thus, there are no overheads from ring transitions or system calls when changing memory restrictions, or entering and leaving a sandbox. Second, HFI does not rely on the MMU for in-process isolation - instead, sandboxing is enforced via a new, orthogonal mechanism called regions ( &#167;3.2); regions enable coarse-grain isolation (e.g., heaps) and fine-grain sharing (e.g., objects) in the process's address space. Third, HFI only keeps on-chip state for the currently executing sandbox; thus, it can scale to an arbitrary number of concurrent sandboxes - in contrast, many other systems hit a hard limit as they keep on-chip state for all active sandboxes <ref type="bibr">[9,</ref><ref type="bibr">25,</ref><ref type="bibr">27,</ref><ref type="bibr">60,</ref><ref type="bibr">66,</ref><ref type="bibr">75]</ref>.</p><p>Beyond this, there are many seemingly small overheads in common operations whose cumulative impact can be large; we sought to minimize these costs. Our design eliminates unnecessary context switches for Wasm sandboxes ( &#167;3.1), and lets software choose the most efficient mechanism for implementing context switches ( &#167;3.3). 
Spectre mitigations can add overhead by serializing instructions, so we devised a flexible mechanism to minimize the need to serialize ( &#167;3.4). Finally, system call interposition can be complex and expensive when dealing with many concurrent sandboxes; thus, we developed a simple low-cost mechanism to enable this ( &#167;6.4.1).</p><p>HFI reduces the barriers to in-process isolation, allowing it to be deployed more pervasively and at scale. It achieves this with minimal additional hardware, and minor changes to the control and data paths of existing processors.</p><p>To evaluate our design, we implemented HFI twice for x86-64 ( &#167;5.2). First, we developed a gem5 simulation to enable detailed performance analysis. Next, we built a compiler-based emulator that approximates HFI overheads in target workloads; this allows us to run larger and more complex workloads than would be possible in gem5. To ensure accuracy, we validated the precision of our emulator using our gem5 simulation.</p><p>We integrated HFI emulation into the Wasm2c ahead-of-time compiler and the Wasmtime just-in-time compiler ( &#167;5.1), and evaluated its performance on the SPEC CPU 2006 benchmarks ( &#167;6.1), sandboxed font and image rendering in Firefox ( &#167;6.2), and a simplified FaaS setting ( &#167;6.3). We also applied it to sandboxing native code (OpenSSL) in the NGINX webserver ( &#167;6.4.2) <ref type="foot">2</ref>.</p><p>Our results show that HFI-assisted Wasm achieves strictly better performance than stock Wasm, and offers noticeable speedups on real workloads, such as a 14%-37% improvement for image rendering in Firefox ( &#167;6.2). Our native workloads run at bare-metal speed - modulo the optional ( &#167;3.4) added cost of serializing sandbox entries and exits for Spectre protection.</p><p>In our FaaS workloads, adding Spectre protection with HFI leads to a 0%-2% increase in tail latency as compared to unsafe native execution. 
In contrast, adding Spectre protection using Swivel <ref type="bibr">[53]</ref>, the fastest known software-based mitigation, leads to a 9%-42% increase in tail latency (Table <ref type="table">1</ref>).</p><p>Most of Wasm's limitations stem from its reliance on software techniques for memory isolation; we explore these in the next section. We then present HFI, and show how it lets us go beyond these limitations ( &#167;3). We explore how HFI accomplishes this securely, and with minimal hardware at a microarchitectural level ( &#167;4). Finally, we evaluate HFI ( &#167;6), survey related work ( &#167;7), and offer conclusions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">LIMITATIONS OF SFI</head><p>The limitations of modern page-based protection architectures for fine-grain isolation are well known <ref type="bibr">[54]</ref>. They include expensive context switches due to protection ring transitions, heavyweight saves and restores <ref type="bibr">[30]</ref>, increased TLB flushes and contention as concurrency scales, etc. SFI <ref type="bibr">[78]</ref> - and by extension, Wasm - avoids these costs by instead relying on compiler-added instrumentation to enforce isolation. However, simply adding conditional bounds checks to each memory load/store and instruction fetch can easily slow down code by a factor of 2&#215; <ref type="bibr">[78,</ref><ref type="bibr">88]</ref>. Past SFI compilers instead relied on the (somewhat faster) technique of applying a bit mask to the addresses used by loads/stores/etc., prior to their use, thus forcing data access and control flow into the appropriate portion of address space. However, as a side-effect, this converts out-of-bounds memory accesses into (seemingly random) memory corruption.</p><p>Consequently, Wasm and other modern SFI systems <ref type="bibr">[68,</ref><ref type="bibr">86]</ref> rely on a faster and safer technique - using the MMU to enforce memory bounds implicitly - as a poor man's version of segmentation.</p><p>To accomplish this, a Wasm runtime sets aside an 8 GiB memory region per sandbox: 4 GiB for the sandbox address space, followed by a 4 GiB guard region (unmapped address space). It reserves this space by mmap()'ing the entire 8 GiB without permissions.</p><p>Additionally, the Wasm compiler restricts the format of memory operations to load(address, offset), where address is a register with a 32-bit value, and offset is a 32-bit immediate (constant). On a memory access, the Wasm compiler adds the address and offset, resulting in a maximum value of 2^33 - 2. 
It then adds this result to the base address of the address space (the "heap base"), and performs the load. Since 8 GiB (2^33) has been reserved after the heap base, the load either lands in the sandbox address space and continues, or in the guard region and traps.</p><p>Finally, a Wasm program can only access the portion of address space it has explicitly requested from the runtime (e.g., using memory_grow() - similar to sbrk()). To enforce this constraint, the runtime grants access to the accessible portions of memory by setting page permissions using mprotect(). Thus, any memory access beyond the end of the heap will trap.</p><p>To isolate control flow, Wasm does not rely on any of the above tricks, and instead relies on software control flow integrity <ref type="bibr">[28]</ref>.</p><p>Despite this clever design, Wasm still has many limitations, some fundamental to SFI, and others specific to Wasm's design: 32-bit address spaces. The approach above only supports 32-bit address spaces on 64-bit architectures - supporting larger Wasm sandboxes <ref type="bibr">[3]</ref>, or smaller processors, requires old-school SFI masking or conditionals <ref type="bibr">[78]</ref>; masking is out (Wasm requires precise trap semantics), and again conditionals are easily a 2&#215; slowdown <ref type="bibr">[71,</ref><ref type="bibr">88]</ref>. Performance overheads. Wasm can easily impose performance overheads of 40% - sometimes less, and sometimes a lot more <ref type="bibr">[23,</ref><ref type="bibr">36]</ref>. Some costs are fundamental to SFI, such as restrictions on the formats of memory instructions and added register pressure <ref type="bibr">[71]</ref>. Some are specific to Wasm, such as the cost of software CFI, and limited access to SIMD instructions.</p><p>Another source of overhead that shows up at scale is the cost of creating and destroying sandboxes. In particular, unmapping memory incurs a TLB shootdown. 
In FaaS platforms, where sandboxes are constantly being created and destroyed on every incoming network request, this can significantly harm performance. Spectre. Wasm cannot protect itself against Spectre attacks without performance penalties - to wit, software-based mitigations add an additional 62% to 100% overhead <ref type="bibr">[53]</ref> (mitigations relying on CPU re-design to eliminate Spectre fare better <ref type="bibr">[45,</ref><ref type="bibr">82,</ref><ref type="bibr">85,</ref><ref type="bibr">87]</ref>, but entail overheads and implementation complexity that make them unlikely to see deployment). Virtual memory consumption. Wasm's guard pages have a large virtual memory footprint that results in several challenges.</p><p>8 GiB is a lower bound - previously we noted that every Wasm instance consumes 8 GiB of virtual address space - even if it uses just a few megabytes. However, this is actually a lower bound. Popular Wasm runtimes support multiple memories per instance <ref type="bibr">[4]</ref> (e.g., for sharing data between instances) - and these can increase an instance's resource footprint by another 8 GiB per memory. The next generation of Wasm standards <ref type="bibr">[26]</ref> promises to further increase this footprint by supporting Wasm applications composed of multiple components, where each library in the application would be a separate component, each with its own memories.</p><p>Virtual address space is finite - typical Intel x86-64 CPUs provide 2^47 (128 TiB) worth of user-level virtual address space<ref type="foot">3</ref> - which seems like a lot. However, it can be used up surprisingly quickly. For example, if we assume the best case - that each Wasm instance only consumes 8 GiB - then we can run at most 16K (2^14) Wasm instances concurrently. For FaaS platforms that spin up a new sandbox in tens of microseconds <ref type="bibr">[20]</ref> for every incoming network request, 16K instances is not a large number. 
Worse, FaaS functions may not finish immediately<note place="foot" n="3">Intel supports 52/57-bit address spaces in certain high-end server CPUs.</note> - for example, they might make HTTP requests (and block). Thus, an address space can fill up quite quickly.</p><p>Operating systems are slow - FaaS systems could of course handle scaling limits by spinning up multiple processes and load balancing requests between them - relying on the OS to context switch threads between processes, and context switch processes once these exceed the number of physical cores, and so on. However, the main reason FaaS providers use Wasm is to avoid these overheads in the first place. FaaS providers would rather schedule more instances in fewer processes - ideally one. If used efficiently, 128 TiB really does support a lot of Wasm instances; not only is this more efficient, it makes systems easier to understand, which in turn makes them easier to deploy, debug, and optimize.</p><p>Finally, applications that use FaaS platforms don't always consist of just one function; they can be multiple functions that want to communicate (function chaining). In a single address space, this communication is as fast as a function call; however, it is easily 1000&#215; to 10000&#215; slower across process boundaries (IPC) <ref type="bibr">[30,</ref><ref type="bibr">38]</ref>.</p><p>In the next section, we explore HFI, our hardware extension that addresses these limitations.</p></div>
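The guard-region scheme and the address-space arithmetic described in this section can be sketched in a few lines of Python. This is our own illustrative model, not runtime code: HEAP_BASE is a hypothetical mmap() address, and the constants simply restate the 8 GiB reservation and 2^47-byte user address space from the text.

```python
# Sketch (assumed constants) of Wasm's guard-region bounds argument:
# any load(address, offset) with 32-bit address and offset lands inside
# the 8 GiB reservation that follows the heap base.
GIB = 1 << 30
HEAP_BASE = 0x7000_0000_0000   # hypothetical base of the mmap()'d reservation
RESERVATION = 8 * GIB          # 4 GiB address space + 4 GiB guard region

def effective_address(address: int, offset: int) -> int:
    """address: 32-bit register value; offset: 32-bit immediate."""
    assert 0 <= address < 2**32 and 0 <= offset < 2**32
    return HEAP_BASE + address + offset

# Worst case: (2**32 - 1) + (2**32 - 1) = 2**33 - 2, which is still inside
# the 8 GiB (2**33) reservation -- so every access either hits mapped
# sandbox memory or traps in the unmapped guard region.
assert effective_address(2**32 - 1, 2**32 - 1) < HEAP_BASE + RESERVATION

# Address-space math from the text: with 2**47 bytes of user virtual
# memory and 8 GiB per instance, at most 2**14 = 16384 instances fit.
assert 2**47 // RESERVATION == 16384
```

The last assertion makes the scaling limit concrete: the 8 GiB-per-instance reservation, not physical memory, is what caps concurrency at 16K instances per address space.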
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">THE HFI DESIGN AND INTERFACE</head><p>Hardware-assisted fault isolation (HFI) is our extension for modern processors that supports flexible in-process isolation, and offers the following properties:</p><p>(1) Security. HFI provides all the capabilities needed for secure sandboxing of Wasm and native binaries, including data and control flow isolation ( &#167;3.2), complete mediation of the OS interface including system calls ( &#167;3.3), and Spectre mitigation ( &#167;3.4).</p><p>(2) Efficiency. HFI imposes minimal overhead for critical operations. Memory isolation with HFI imposes no overhead - all memory bounds and permission checks execute in parallel with TLB lookups ( &#167;4.2); Context switches are software managed; thus Wasm can exploit zero-cost techniques <ref type="bibr">[38]</ref> to optimize them ( &#167;3.3.1); System call interposition is near zero cost - HFI converts system calls into jumps ( &#167;4.4); HFI's overhead for creating and destroying sandboxes, and sharing and resizing sandbox memory, is near zero - only requiring a few HFI instructions to update region registers ( &#167;4.4) <ref type="foot">4</ref>. Spectre protections are flexible and configurable, allowing developers to avoid unnecessary serialization ( &#167;3.4).</p><p>(3) Scalability. HFI imposes no limit on the number of concurrent sandboxes a program can run - it achieves this by keeping the amount of on-chip state constant, regardless of the total number of sandboxes ( &#167;4).</p><p>(4) Compatibility. HFI supports the unique requirements of Wasm and native binaries. For Wasm, HFI offers precise memory fault semantics (i.e., out-of-bound memory accesses trap), granular heap growth (64K increments), code and data separation, and direct access to system calls for Wasm sandbox runtimes (via sandbox types ( &#167;3.3)). 
For unmodified native binaries, HFI eschews any changes to compilers, standard libraries, or binary formats (i.e., ABI changes), and supports simple and efficient system call interposition.</p><p>(5) Adoptability. For easy adoption, HFI minimizes changes to existing operating systems and processors. OS kernels only need to add a small amount of per-process storage to save HFI's registers during a process's context switch. To support the OS, HFI extends the processor instructions that save and restore a process's registers (xsave and xrstor on x86) to include HFI's registers ( &#167;3.3.3). Existing processors require only a small amount of additional hardware, and minimal changes to data and control paths, to support HFI ( &#167;4).</p><p>We start with a high-level overview of HFI in &#167;3.1. We then take a deep dive into regions - HFI's mechanism for controlling access to memory - in &#167;3.2. In &#167;3.3, we explore HFI's other features in the context of implementing a sandboxing system, and see how HFI mitigates Spectre in &#167;3.4. For reference, the complete HFI interface is listed in the appendix.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">HFI Overview</head><p>HFI allows developers to build sandboxing runtimes that can efficiently create multiple in-process sandboxes, each with its own view of process memory.</p><p>HFI's interface is accessible entirely in user space - there is no kernel component and there are no ring transitions. Thus, a runtime can rapidly instantiate sandboxes, share memory, etc. HFI can support a wide range of use cases, from sandboxing untrusted libraries in large applications <ref type="bibr">[52]</ref>, to supporting Wasm sandboxes in a FaaS platform.</p><p>We refer to a runtime managing sandboxes alternately as the runtime or trusted runtime, since in our HFI threat model, we assume this code is trusted, and the sandboxed code is untrusted - although things get slightly more nuanced with hybrid sandboxes (see below). HFI builds on a few central concepts: HFI mode. Each CPU core has its own HFI state, stored in registers. If HFI is enabled, code running on that core is "sandboxed", i.e., execution is restricted according to: (a) a set of region registers (that grant access to memory), (b) a register with the sandbox exit handler (where system calls and sandbox exits are redirected to), and (c) a register with sandbox option flags (e.g., the sandbox type). With a few exceptions, HFI is enabled when the trusted runtime executes an hfi_enter instruction, and disabled when sandboxed code executes an hfi_exit instruction - which transfers control back to the trusted runtime. The runtime is responsible for saving and restoring context appropriately, and can use HFI to multiplex many sandboxes across cores, scheduling them as it sees fit. Interposition. HFI supports interposition on all paths out of the sandbox, including system calls (and by extension, signals), and sandbox exits (hfi_exit). Thus, the runtime can completely mediate <ref type="bibr">[63]</ref> the interaction of sandboxed code with the operating system and enclosing process. 
Supporting interposition directly in HFI, rather than relying on existing OS mechanisms <ref type="bibr">[65,</ref><ref type="bibr">75]</ref>, offers excellent flexibility, reduces complexity (which benefits security <ref type="bibr">[14]</ref>), improves performance, and eases deployment. Sandbox types. HFI supports two sandbox types: native to support standard in-process isolation, and hybrid to support Wasm and other SFI systems. The key difference between these two is their trust model. In native sandboxes, HFI assumes sandboxed code is untrusted; in hybrid sandboxes, HFI assumes sandboxed code is trusted - or more specifically, that it was built with a trusted compiler (or checked with a trusted verifier <ref type="bibr">[37,</ref><ref type="bibr">50,</ref><ref type="bibr">86]</ref>). Because HFI knows hybrid sandboxes are trusted, it allows privileged operations such as system calls and sensitive register updates. This allows hybrid sandboxes to avoid sandbox exits (with the resulting cost of context switches) entirely, modulo whatever the runtime is doing to multiplex HFI. We explore this further in &#167;3.3.</p><p>Regions. In HFI mode, all memory access is controlled using regions. Conceptually, these can be thought of as an address range described by a base (start address for the range), a bound (the size of the range), and a set of permissions (read, write, and execute). In general, every sandbox will have a set of data regions (e.g., for its heap, stack, and shared data) and code regions. Regions are an attractive representation as they require minimal state (which is easy to save/restore for fast context switches) and can be enforced with simple hardware. As regions are a central feature of HFI, we will explore them first.</p></div>
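The per-core HFI state sketched above (region registers, exit handler register, option flags, and the hfi_enter/hfi_exit transitions) can be summarized in a small Python model. All names here are our own shorthand, not the actual register layout; the paper's appendix lists the real interface.

```python
# Illustrative model (invented field names) of the per-core HFI state
# described in the overview: regions, an exit handler, and option flags.
from dataclasses import dataclass, field
from typing import Callable, Optional

NATIVE, HYBRID = "native", "hybrid"   # the two sandbox types from the text

@dataclass
class Region:
    base: int
    bound: int
    read: bool = False
    write: bool = False
    execute: bool = False

@dataclass
class HfiCoreState:
    """One core's HFI state; only the *current* sandbox is kept on-chip."""
    enabled: bool = False
    regions: list = field(default_factory=list)
    exit_handler: Optional[Callable] = None
    sandbox_type: str = NATIVE

    def hfi_enter(self, sandbox_type, exit_handler=None):
        # Runtime enters HFI mode; subsequent execution is sandboxed.
        self.sandbox_type = sandbox_type
        self.exit_handler = exit_handler
        self.enabled = True

    def hfi_exit(self):
        # Sandbox exits; control transfers back to the trusted runtime.
        self.enabled = False
        if self.exit_handler is not None:
            self.exit_handler()

state = HfiCoreState()
state.hfi_enter(HYBRID)        # hybrid sandboxes may omit the exit handler
assert state.enabled
state.hfi_exit()
assert not state.enabled
```

Because this state is constant-size per core (not per sandbox), a runtime can save and restore it cheaply, which is what lets HFI multiplex an arbitrary number of sandboxes.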
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Isolating Data &amp; Control Flow with Regions</head><p>By default, a sandbox has no access to memory - it cannot read data or run code. A runtime grants access to memory using regions, by configuring the region registers prior to sandbox entry. HFI offers two types of regions, implicit and explicit, each specialized to different tasks: Implicit regions. Implicit regions apply checks to every memory access, and grant access on a first-match basis. For example, if sandboxed code executes an instruction - load address X into register Y - HFI will check, in parallel, whether any region register has a range that includes X, then apply the permissions from the first matching region. If the first match has read permission, the operation will proceed; if it does not, HFI will trap.</p><p>Implicit regions are essential for isolating unmodified native code, and similarly, for isolating control flow - situations where explicit regions (described next) would be impossible to use.</p><p>Implicit regions perform efficient bounds checks based on prefix matching ( &#167;4). Concretely, each region specifies a base_prefix (the region's base address) and an lsb_mask. To check if an address is in bounds, HFI uses the lsb_mask to remove the least significant bits of the address, and compares the remaining prefix to base_prefix. With this approach, implicit regions must be power-of-two sized and aligned - thus, they trade granularity for efficient checking.</p><p>HFI discriminates implicit regions into code and data regions, to keep the control and data pipelines simpler and more efficient. Thus, data region checks apply only to reads and writes, while code region checks apply only to instruction fetches.</p><p>HFI provides six implicit regions per sandbox: four data regions (e.g., for the heap), and two code regions (e.g., for the application and shared library code)<ref type="foot">5</ref>. 
Implicit region checks are not applied to operations on explicit regions, which we discuss next.</p><p>Explicit regions. An explicit region acts as a handle to a memory range, and follows the normal (base, bound) style of addressing. Thus, addressing is always relative to the base of a specified region. For example, suppose a sandbox executes the instruction: read address X in region1 into register Y. If X &lt; region1[bound] is true, and reads are permitted, HFI will store the contents of region1[base_address + X] into Y; otherwise, it will trap. HFI provides four explicit regions, and two different region sizes (large/small), with different granularities. Large regions can address up to 256 TiB (2^48) and are sized and aligned to multiples of 64K (2^16). Small regions, in contrast, can only address up to 4 GiB (2^32), but are byte granular in size and alignment. Small regions have one additional restriction: they cannot span addresses which are multiples of 4 GiB.</p><p>We note that, while allowing regions which support arbitrary address ranges at any grain is conceptually simpler than specialized large and small regions, our restrictions allow bounds checking with very simple hardware. HFI's large and small region constraints can be checked with a single 32-bit comparator, rather than the multiple 64-bit comparators needed to check arbitrary region bounds ( &#167;4.2).</p><p>Explicit regions' added granularity is critical for supporting Wasm heaps <ref type="foot">6</ref>, which grow in 64K increments <ref type="bibr">[28]</ref> - while byte granularity is critical for efficiently sharing individual memory objects and sandboxing legacy code, as existing buffers can be shared in place without changing code or allocators.</p><p>Explicit regions are accessed using the hmov instruction. There are four hmov instructions, hmov{0-3}, one to access each of the explicit regions. For example, hmov0 is used to access the region specified by the first region register. 
To ease adoption in existing compilers, hmov offers the same semantics as the normal x86 mov instruction - with the following caveat.</p><p>Unlike the normal mov, hmov ensures that only positive offsets of explicit regions are accessed - a guarantee necessary for a simple implementation in the hardware ( &#167;4). To elaborate, the normal x86 mov takes multiple operands which are added to generate the effective address for a memory operation. The hmov instruction modifies this in the following ways: (1) the first operand is always ignored and replaced with the specified region's base address, (2) hmov traps if a negative value is used for the remaining operands, and (3) hmov traps if the effective address computation overflows.</p><p>While some aspects of these restrictions may seem onerous at first glance, they only rule out patterns that compilers do not rely on in practice, or can easily work around; for instance, overflows in effective address computations are undefined behavior in C/C++.</p></div>
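The two region checks described in this section - the implicit prefix-match and the explicit (base, bound) check behind hmov - can be modeled in Python as follows. This is our own sketch of the semantics described in the text; the hardware performs these checks in parallel with the TLB lookup, and the example addresses are invented.

```python
# Sketch of HFI's two region checks (our modeling of the text's semantics).

def implicit_match(addr: int, base_prefix: int, lsb_mask: int) -> bool:
    """Implicit-region check: mask off the low bits covered by the
    (power-of-two sized and aligned) region, then compare the prefix."""
    return (addr & ~lsb_mask) == base_prefix

# A 64 KiB implicit region at 0x7f0000 covers [0x7f0000, 0x800000):
assert implicit_match(0x7f1234, base_prefix=0x7f0000, lsb_mask=0xffff)
assert not implicit_match(0x801000, base_prefix=0x7f0000, lsb_mask=0xffff)

class HfiTrap(Exception):
    pass

def hmov_address(region_base: int, region_bound: int, offset: int) -> int:
    """Explicit-region (hmov) addressing: offsets are relative to the
    region base, must be non-negative, and must satisfy offset < bound."""
    if offset < 0:
        raise HfiTrap("negative offset")   # hmov rule (2) from the text
    if offset >= region_bound:
        raise HfiTrap("out of bounds")     # the X < region1[bound] check
    return region_base + offset            # rule (1): base replaces operand

assert hmov_address(0x10000, 0x1000, 0xfff) == 0x10fff
try:
    hmov_address(0x10000, 0x1000, 0x1000)
    raise AssertionError("expected trap")
except HfiTrap:
    pass
```

The prefix check is a single mask-and-compare, which is why implicit regions must be power-of-two sized; the explicit check is a single bounded comparison, which is why the large/small size split lets the hardware get away with one 32-bit comparator.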
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Sandboxing with HFI</head><p>Next, we explore how HFI's features are used to instantiate and run sandboxes. As a motivating example, we will assume we are a function-as-a-service (FaaS) provider building a trusted runtime to sandbox client applications. In our example, our FaaS can support both Wasm applications and native applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1">Entering a Sandbox.</head><p>Let's assume we've gotten a network request, and our runtime is ready to start a sandbox - application code is in memory, heap space is allocated, and inputs (e.g., the headers and body of an HTTP request to our FaaS service) are in a buffer in memory. Our runtime can now take the following steps:</p><p>Setting up regions. To start, our runtime sets up access to the code, heap, and input memory so our application has everything it needs once the sandbox starts. The runtime does this by using the hfi_set_region instruction, which stores these initial region mappings into the specified region's registers. Notably, if no code regions are mapped, HFI will immediately trap after hfi_enter is called, as the processor will not be able to fetch instructions.</p><p>Selecting a sandbox type. Our runtime selects a sandbox type (hybrid/native), which it passes as a flag to hfi_enter. If the runtime is running untrusted code, it chooses the native sandbox, causing HFI to lock all region registers from when hfi_enter is called until the sandbox exits, and redirect all system calls to our runtime's exit handler (see below).</p><p>If it is running a Wasm application, it chooses the hybrid sandbox, which leaves the region registers unlocked. So, when the Wasm runtime inside the sandbox starts up, it sets up an explicit region that points to the Wasm heap using hfi_set_region. As the Wasm code runs, the runtime inside the sandbox can make any system calls it needs to directly. It can also resize the heap, or multiplex HFI's (finite) registers among a larger number of multi-memories <ref type="bibr">[4]</ref>. Consequently, scheduling reasons aside, there should be no need to hand control back to the (external) trusted runtime until the application exits.</p><p>Saving context. Our runtime must protect its own execution context, such as its stack and the contents of CPU registers, before it switches to sandbox code. 
Unlike previous systems <ref type="bibr">[15]</ref>, HFI leaves this mechanism entirely up to software - this flexibility is important for efficiency. For example, if our runtime is running untrusted native code, it will have to use springboards and trampolines <ref type="bibr">[86]</ref> - lightweight assembly routines that (1) clear registers and switch to a separate stack prior to executing the sandboxed code and (2) restore these registers after the sandboxed code is executed. However, if it is running Wasm code, it could opt to use zero-cost transitions <ref type="bibr">[38]</ref> that rely on the compiler to ensure that the sandbox code cannot misuse the stack or scratch registers.</p><p>Setting up an exit handler. If our runtime is using a native sandbox, it will install an exit handler to take control when hfi_exit is called, or when system calls are made in a native sandbox. To install an exit handler, our runtime specifies a function pointer, to be invoked on sandbox exit, as a parameter to hfi_enter. When the exit handler is called after the sandbox exits, it will transfer control to our runtime, which will check a model-specific register (MSR) to identify the cause of the exit, and respond appropriately. If HFI is running a Wasm (hybrid) sandbox, our runtime typically will not install an exit handler (though it optionally can), as it is not interposing on system calls and hfi_exit instructions; this is safe since the code in the sandbox is trusted.</p><p>Having taken all these steps, our runtime is ready to start the sandbox. Once it calls hfi_enter, HFI mode is enabled, and the next instruction that runs will be inside a sandbox.</p></div>
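The entry sequence above - set up regions, pick a sandbox type, install an exit handler, then hfi_enter - might look like the following Python sketch. The hfi_* methods are hypothetical stand-ins for the instructions, and the locking behavior models the native-vs-hybrid distinction from the text.

```python
# Toy model (invented wrappers) of the sandbox entry steps in this section.
NATIVE, HYBRID = "native", "hybrid"

class Core:
    def __init__(self):
        self.regions = {}           # index -> (base, bound, perms)
        self.regions_locked = False
        self.exit_handler = None
        self.sandbox_type = None
        self.in_sandbox = False

    def hfi_set_region(self, index, base, bound, perms):
        # Native sandboxes lock region registers between enter and exit.
        if self.in_sandbox and self.regions_locked:
            raise PermissionError("region registers locked in native sandbox")
        self.regions[index] = (base, bound, perms)

    def hfi_enter(self, sandbox_type, exit_handler=None):
        # With no executable region mapped, the first fetch would trap.
        if not any("x" in perms for (_, _, perms) in self.regions.values()):
            raise RuntimeError("no code region mapped: trap on first fetch")
        self.sandbox_type = sandbox_type
        self.regions_locked = (sandbox_type == NATIVE)
        self.exit_handler = exit_handler
        self.in_sandbox = True

core = Core()
core.hfi_set_region(0, base=0x10000, bound=0x4000, perms="rx")   # code
core.hfi_set_region(1, base=0x20000, bound=0x8000, perms="rw")   # heap
core.hfi_enter(NATIVE, exit_handler=lambda: None)
assert core.regions_locked        # untrusted code cannot remap regions
```

A hybrid sandbox would pass HYBRID instead, leaving regions_locked False so the trusted Wasm runtime inside the sandbox can keep calling hfi_set_region to grow its heap or multiplex multi-memories.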
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2">Leaving a Sandbox.</head><p>After the execution of sandboxed code, HFI is disabled, and control is returned to our trusted runtime through a few dierent paths: hfi_exit and system calls. If sandboxed code calls hfi_exit, HFI will record the reason for the exit in an MSR, atomically disable HFI mode, and transfer control to our trusted runtime. For native sandboxes, our runtime's exit handler is invoked on exit; for hybrid sandboxes, our runtime will not use an exit handler and instead allows hfi_exit to fall through to other trusted code placed directly after the exit (allowing an exit handler code to be inlined when using a trusted compiler). The native sandbox's interposition of system calls is nearly identical to an hfi_exit with a handler; system calls are simply converted into a jump to the exit handler by HFI, resulting in very ecient interposition. Again, the cause of the exit, including which system call and type of call (e.g., int 0x80 vs. sysenter) is recorded in an MSR that can be read by the exit handler.</p><p>Access violations and hardware faults. If sandboxed code causes a hardware trap (e.g., when dereferencing a null pointer), or an HFI bounds check violation (accessing memory outside of the HFI regions) -HFI disables the sandbox mode, records the cause of the fault in a model specic register (MSR), and generates a hardware trap -which the OS delivers as an OS signal to our trusted runtime. For example, upon encountering an HFI bounds check violation, HFI disables the sandbox and generates a fault that is delivered as a SIGSEGV signal to a signal handler that our runtime registered. The signal handler can examine the MSR to disambiguate the cause of the SIGSEGV, and take the next appropriate action.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.3">OS Support. Multiple processes can use HFI concurrently.</head><p>To enable this, the OS must save the contents of HFI registers (along with the general-purpose registers) when switching between processes. To support this, HFI adds a ag (save-hfi-regs) to the x86 xsave and xrstor instructions, that are used to save and restore process context. Enabling this ag in the kernel is a simple and minimal change. Since this ag modies the HFI registers, allowing code in a native sandbox to execute xrstor with this ag could break sandboxing; thus HFI will traps if this occurs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Mitigating Spectre</head><p>By construction, HFI prevents whole classes of attacks that could be used to speculatively read data outside the sandbox, since HFI's region checks are applied uniformly to all memory accesses (speculative and non-speculative) once HFI is enabled. However, an attacker could still try to trick the trusted runtime into speculatively running malicious code without HFI enabled -or to speculatively enable HFI in an inconsistent state; we address these risks next.</p><p>Serializing hfi_enter and hfi_exit. A simple way to mitigate the previously mentioned attacks is to fully serialize hfi_enter and hfi_exit. Serializing hfi_enter ensures that when we enter a sandbox, our conguration is in a consistent state -for example, that some region register does not (speculatively) contain unsafe parameters due to speculative execution, data value prediction, etc. Serializing hfi_exit ensures that malicious code cannot speculatively disable HFI, and then speculatively execute a code path that would never happen under non-speculative execution.</p><p>To serialize hfi_enter and hfi_exit, a runtime can set the is-serialized ag on sandbox entry. We expect this to add &#8673; 30 60 cycles on x86-64 based on the cost of similar serializing instructions <ref type="bibr">[30,</ref><ref type="bibr">75]</ref>; this cost is amortized in many workloads, as we see in &#167;6. However, for applications that don't want to pay this cost, we oer an optional extension to HFI called switch-on-exit, that avoids most of this overhead, but still oers Spectre protection.</p><p>The switch-on-exit extension. Often there is no need to Spectre isolate a collection of sandboxes -or the same sandbox -across invocations. For example, multiple invocations of a sandbox working on the same data (e.g., for image rendering as discussed in &#167;6.2) are a common occurrence. 
Other times, a developer knows there are no secrets among sandboxes that need Spectre protection, e.g., among components <ref type="bibr">[26]</ref> in an application or FaaS invocations of the same tenant. In these cases, serializing every entry and exit is needlessly expensive. To avoid this overhead, HFI optionally provides the switch-on-exit extension, which enables runtimes to restrict speculative control flow to a common set of sandboxes, and to ensure that any entry and exit from this set is serialized.</p><p>To use switch-on-exit, a trusted runtime must start by running itself in a hybrid sandbox with the is-serialized flag set; thus, both its entry and exit will be serialized. This ensures that any control flow into the sandbox cannot speculate beyond its (serialized) hfi_exit. Once this foundation is set, the trusted runtime can run other sandboxes by invoking hfi_enter (unserialized) with the switch-on-exit flag set; doing so will save the trusted runtime's HFI registers, and atomically switch to the registers of the new sandbox. Sandboxes started this way cannot disable HFI when they exit (e.g., with hfi_exit); instead, HFI will atomically switch back to the trusted sandbox by restoring its registers to the state prior to hfi_enter. Thus, we can run multiple sandboxes without serialization. Ultimately, serialization will take place when we leave the trusted sandbox; thus, an attacker can never speculatively run code outside this collection of sandboxes with HFI disabled.</p><p>Securing the runtime. Sandbox runtimes execute in close proximity to untrusted code. In the case of Wasm, the runtime even executes within the same sandbox, making it acutely vulnerable to Spectre attacks. Consequently, runtimes need a way to prevent themselves from being tricked into leaking privileged data <ref type="bibr">[53]</ref>.</p><p>Implicit regions are the tool that runtimes in hybrid sandboxes can use to solve this problem. 
Implicit regions provide a "safety net" for runtime code, such that even if runtimes are under the speculative inuence of an adversary, they can secure themselves by constraining their memory accesses to a safe portion of the address space. Runtimes of native sandboxes can also leverage this approach by executing within a dedicated hybrid sandbox.</p><p>A nal important caveat is that using HFI still requires a holistic approach to Spectre security. Running code in a sandbox does not change the inuence that malicious code exerts on prediction hardware state, which could inuence other software on the system. As a corollary, a runtime that is sandboxing with HFI must ensure that it does not allow a speculative bypass of hfi_enter altogether, followed by speculative execution of untrusted code 4 `ARCHITECTURE DESIGN HFI's `-architectural design is guided by four overarching goals:</p><p>(1) Fast. Minimizing overhead is a central goal of HFI.</p><p>(2) Secure. HFI must be robust to Spectre-style attacks, and free from Meltdown-style aws that could compromise existing software. To avoid this, HFI must not update microarchitectural state (such as the dtb, branch predictor, and data/instructions caches) based on data that is secret (i.e., data outside the sandboxed region).  (3) Scalable. HFI must not constrain the number of concurrent sandboxes. Thus, we eschew designs that store per-sandbox state (such as region details) for multiple sandboxes in on-chip caches such as the TLB. Hardware extensions that do this impose scaling limits either by restricting the total number of concurrent sandboxes <ref type="bibr">[66,</ref><ref type="bibr">75]</ref>, or requiring expensive state spills on overow <ref type="bibr">[55]</ref>, resulting in performance that scales down as concurrency increases.</p><p>(4) Minimal. HFI should not signicantly change the processor design or add expensive components. 
This helps generality, as minimal designs are easier to implement in small power-conservative chips, and also benefits security by presenting fewer features to verify and fewer potential attack surfaces. HFI's µ-architecture must also conform to more specific low-level constraints that are critical for practical adoption:</p><p>(1) Low power: Power consumption is a critical metric in small devices and datacenters; in light of this, HFI eschews the use of power-hungry components such as large caches (HFI's on-chip storage is measured in bytes, not kilobytes).</p><p>(2) Zero impact on non-sandbox code: Most code is not sandboxed today, and vendors are not likely to embrace a design that slows down regular (non-sandboxed) code. Thus, we cannot add extra pipeline stages that are seen by regular instructions, nor can we add timing delays to any existing stages that are on the critical timing path of the processor (and would therefore impact the CPU's maximum frequency).</p><p>(3) Low impact on circuit area: HFI enforces region bounds checks in the neighborhood of critical pipeline structures, such as the register file, address generation unit, and the TLB. Thus, large structures, even if they are not on the critical timing path, can exacerbate timing delays between critical structures that are now further apart. For example, using 64-bit comparators to check bounds would be a natural design choice, but would challenge our power and area goals.</p><p>Additional components: As discussed, HFI seeks to minimize the number of new components it requires. Of course, it's not just the component count but where they are added that matters; even a handful of gates on critical paths like memory lookup would increase the cycle time of the entire CPU. 
To account for this, we worked with architects at a major CPU vendor to refine our design to try to minimize impact on CPU pipelines.</p><p>In total, HFI's architecture adds: 8 instructions, 22 internal 64-bit registers (10 regions specified by 2 registers each, 1 exit handler register, and 1 configuration register), an additional 22 internal 64-bit registers for the optional switch-on-exit support, one 32-bit comparator for bounded regions, four 64-bit AND gates for masking, four 64-bit equality checks for prefix-checked regions, and five 2-bit muxes (region lookup, negative offset checks, etc.).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Implicit Regions</head><p>Implicit regions enforce memory protection for control (code regions) and data (data regions), including speculative safety.</p><p>Data regions. HFI supports four data regions, and allocates two registers per-region, for the lsb_mask and base_prefix. If HFI is enabled, data region checks are applied to all load and store operations on the data path (except those that employ hmov).</p><p>Concretely, checks work by prex matching. To implement this, the region's lsb_mask is rst used to remove the least signicant bits of the eective address (with an AND operation), and then the base_prefix is compared for equality with the remaining bits of the address. To ensure eciency, these checks occur in parallel with the dtb and cache index lookup.</p><p>If prex-checking fails for all regions, or the rst matched region does not have adequate permissions (e.g., no read permission for a load operation), HFI triggers a segmentation fault (using the same circuit paths that would handle an access to unmapped memory), and saves the reason for the fault in a MSR.</p><p>Since bounds checking, dtb lookup, and cache index lookups happen in parallel (as shown in Figure <ref type="figure">1</ref>), it may appear that cache state could be modied as a result of secret (out-of-bounds) data. However, HFI is designed to prevent this sort of side-channel attack by construction: all bounds checks occur before the processor can resolve the physical address of a memory access. This is secure because the processor can update cache metadata like the LRU bits (for hits) or fetch new data blocks (for misses) only after resolving the physical address. HFI can therefore strictly prevent any metadata updates if there has been a fault.</p><p>Note that we cannot say the same for the dtb or the i-cache; here an out-of-bounds address can aect metadata -e.g., LRU bits. 
However, the invariant we guarantee, that no secret (data stored outside the boundaries of the region) ever affects architectural state, is still not violated, since we do not allow the result of an out-of-bounds memory operation to propagate into any of these structures.</p><p>Code regions. Code regions enforce bounds checks on control flow (both speculative and non-speculative). HFI allocates four internal registers to store region metadata for the two code regions. HFI uses prefix-checking to bound the program counter, employing the same technique used for data regions.</p><p>To ensure security, prefix-checking is applied in parallel with the decode stage. If the check finds a matching region with execute permissions, it succeeds, and decode carries on normally. If the check fails, it prevents the decoded micro-ops from entering the pipeline, and instead translates all instructions into a faulting NOP micro-op. This ensures that instructions that are out-of-bounds are not executed during committed execution, and are also not executed speculatively.</p><p>To summarize, HFI's data pipeline is Spectre-safe, since the data cache is not updated prior to bounds checks being completed; HFI's control pipeline is safe as bounds checks finish prior to instruction decode, and thus prior to the execution of instructions. This approach also helps to guarantee that any code executed as the result of PHT, BTB, and RSB (speculative) predictions is checked prior to execution.</p></div>
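The prefix-matching check described above can be modeled in a few lines of C. This is a software sketch of the hardware check: the lsb_mask and base_prefix fields follow the text, while the struct layout, permission bits, and helper name are our assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one HFI implicit data region. */
typedef struct {
    uint64_t lsb_mask;     /* low bits ignored inside the region    */
    uint64_t base_prefix;  /* required value of the remaining bits  */
    bool     readable, writable;
} hfi_data_region;

/* Prefix match: mask off the least significant bits of the effective
 * address (AND with ~lsb_mask), then compare the remaining bits for
 * equality with the region's base prefix. */
static bool hfi_region_matches(const hfi_data_region *r, uint64_t addr)
{
    return (addr & ~r->lsb_mask) == r->base_prefix;
}
```

In hardware this comparison runs in parallel with the dTLB and cache index lookups; here it is just a mask and an equality test per region.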
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Explicit Regions</head><p>Explicit regions ( &#167;3.2) oer granularity that is necessary to support Wasm heaps as well as ne-grain object sharing. They are accessed with hmov instructions, which performs bounds checks to ensure only memory specied by the accessed.</p><p>At a high level, the hmov instructions (hmov0, hmov1, hmov2, hmov3) follow a similar format to the standard x86 move instructions; it supports all variations and addressing modes of x86, including the complex addressing mode where scale, index, base and displacement operands are combined to form the eective address.</p><p>However, hmov has additional steps that: (1) choose an HFI region, (2) replace the base operand with the base address of the chosen HFI region, and (3) perform checks on the remaining operands and the resulting eective address of hmov to ensure that the memory access remains within the region. Each of these steps can be implemented with small modications to the existing x86 data path: Instruction decode. The decode pipeline stage is responsible for translating x86 instructions into simpler and easier to execute operations called micro-ops. HFI extends the decode stage with a new micro-op that diers slightly from the load or store micro-op by adding a region number and a single bit indicating that this is an hmov rather than a standard mov. Register read. As shown in Figure <ref type="figure">1</ref>, during the register read stage, hmov substitutes the base register (which would otherwise be read from the register le) with a base value read from one of the four sets of range registers.</p><p>Bounds checking on hmov. During memory operations, HFI performs bounds checking in parallel with the processor's address translation (dtb lookup). 
In the case of hmov, the HFI comparator unit (Figure <ref type="figure">1</ref>) ensures that the effective address is within bounds.</p><p>This could be naively accomplished with two (expensive) 64-bit comparators. HFI, however, exploits the fact that the base has been precomputed, and uses this value along with three cheaper checks that require only a single 32-bit compare. Specifically, HFI checks that: (1) the 32 most significant bits of the effective address are smaller than the upper bound specified in the HFI region metadata registers, (2) the displacement and index sign bits are non-negative, and (3) the effective address calculation does not cause an overflow. The second and third checks ensure it is impossible to generate an effective address lower than the base. Thus, we check both base and bound with a single compare (and three trivial bit checks). Next, we discuss why checking only the 32 most significant bits is secure for the two types of explicit regions HFI supports: large and small ( &#167;3.2).</p><p>Bounds checking large and small regions. Large regions require the base and bounds to be aligned to 64 KiB (2^16). This means that the bottom 16 bits of the effective address can be safely ignored, as their contents cannot cause an out-of-bounds access. Additionally, x86 CPUs typically support a 48-bit virtual address space; HFI can thus ignore the top 16 bits of our address for comparison. On Intel server CPUs that support 52/57-bit address spaces, a larger 36/41-bit comparator would be necessary.</p><p>Small regions support arbitrary bounds up to 32 bits (4 GiB) in size, as long as the small region does not cross a 4 GiB boundary. Because of these restrictions, addresses in small regions cannot affect the top 32 bits of the effective address. HFI thus only checks the bottom 32 bits of the effective address of small regions.</p></div>
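As a software model of the three cheap checks above, the following C sketch validates a small-region hmov access. The struct layout, operand types, and helper name are assumptions; real HFI performs these checks in hardware, in parallel with the dTLB lookup, on the effective-address bits rather than on a computed offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the hmov check for a "small" explicit region (up to
 * 4 GiB, not crossing a 4 GiB boundary). Names are assumptions. */
typedef struct {
    uint64_t base;   /* substituted for the base operand by hmov */
    uint32_t bound;  /* region size in bytes                     */
} hfi_small_region;

/* index, scale, and disp are the remaining x86 address operands.
 * The three cheap checks from the text:
 *   (2) index and displacement sign bits are non-negative,
 *   (3) the offset sum does not overflow 32 bits,
 *   (1) a single 32-bit compare against the region bound. */
static bool hmov_small_ok(const hfi_small_region *r,
                          int64_t index, uint8_t scale, int32_t disp)
{
    if (index < 0 || disp < 0)            /* (2) sign bits           */
        return false;
    uint64_t off = (uint64_t)index * scale + (uint64_t)disp;
    if (off > UINT32_MAX)                 /* (3) no 32-bit overflow  */
        return false;
    return (uint32_t)off < r->bound;      /* (1) one 32-bit compare  */
}
```

Checks (2) and (3) are what make it impossible to reach below the precomputed base, so only the upper bound needs a real comparator.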
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Region Updates</head><p>Region registers state can be modied by several instructions: hfi_set_region, hfi_clear_region, and hfi_get_region, either when HFI is disabled -or while HFI is enabled in a hybrid sandbox. While the semantics of these operations are quite simple (e.g., writing metadata to a region register), there are several nuances to how they are implemented to ensure performance and safety.</p><p>To start, these operations do not serialize when not in HFI mode, as they are always followed by an hfi_enter (that can be serialized) before the HFI region checks take eect. However, they do serialize when executed in a hybrid sandbox, to ensure the correctness of in-ight instructions and memory operations. Additionally, hfi_set_region(code,...) ushes any pending memory operations, and is serialized to ensure that all in-ight instructions are retired prior to applying the new bounds. These serialization costs can be avoided if we employ register renaming on HFI metadata registers (similar to that used in general-purpose registers), in essence, trading complexity for improved performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Enters, Exits, Sandbox Types, &amp; System Calls</head><p>As discussed in &#167;3.3, HFI provides hfi_enter and hfi_exit instructions to enter and leave a sandbox. On sandbox entry, the hfi_enter instruction saves its parameters -the exit handler and ags (switch-on-exit, is-serialized, is-hybrid) to internal conguration registers, and HFI mode is enabled. On exit, HFI disables the sandboxing, records the reason for the exit (e.g., executed an hfi_exit instruction, executed a syscall, traps) in an MSR, and nally jumps to an exit handler if one is specied. The semantics of sandbox entry with respect to serialization are described in &#167;3.4. Sandbox types. Hybrid and native sandboxes dier in how they deal with region updates, privileged instructions (system calls), and the semantics of hfi_exit. However, there is no special component that implements a sandbox type. Rather, other features modify their behavior based on the sandbox type as discussed in &#167;3.</p><p>System call interposition. Native sandboxes require HFI to redirect syscalls when executing sandboxed code. To do this, HFI modies the decode stage of the syscall instruction (and all variations including sysenter, int 0x80, etc.) to rst perform a microcode check if HFI is in native mode (is-hybrid is false), and redirect control ow to the HFI exit handler if this is the case. The syscall instruction is otherwise unmodied. While this approach imposes a single cycle penalty on syscall instructions for the additional check, this overhead is unlikely to impact application performance due to the relatively sparse nature of system call invocations compared to other instructions such as loads and stores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Mitigating Spectre</head><p>HFI's enter, exit, and region update instructions can be serialized when necessary for Spectre protections. However, HFI also oers another way to mitigate Spectre, that minimizes serialization overhead for common use cases. As discussed in &#167;3.4, the switch-on-exit ag allows speculative safety without the need to serialize on every entry and exit. This is safe because any speculatively run hfi_exits would atomically switch to the runtime's sandbox.</p><p>To support this, we extend HFI in three ways. First, we double the number of HFI metadata registers, so that we can store the metadata for two sandboxes -the runtime sandbox and the child sandbox. Second, we modify the hfi_enter instruction (when executed with the switch-on-exit) to preserve the trusted runtime's sandbox metadata (currently in the HFI registers), before loading the child sandbox's metadata. Finally, the hfi_exit instruction, upon execution with the switch-on-exit ag enabled, atomically switches back to using the trusted runtime sandbox metadata instead of disabling HFI.</p><p>These changes to hfi_enter and hfi_exit can be implemented in a straightforward manner in microcode. While this feature does add some additional cost in the form of internal registers, it allows us to support speculative safety for common use cases, while removing most of the cost of serialization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EXPERIMENTAL METHODOLOGY</head><p>This section documents our experimental framework and approach (and experimental evaluation thereof) to achieving accurate hardware simulation of long-running benchmarks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Integrating HFI into Wasm</head><p>We modify two compilers used in production applications to test the use of HFI with Wasm toolchains: Wasm2c, an ahead-of-time compiler used to run untrusted libraries in Firefox, and Wasmtime, a just-in-time compiler used in Fastly's FaaS environment.</p><p>We added HFI support to Wasm2c, and secured the Wasm heap by using a explicit region, accessed by hmov. We removed mprotect() calls to setup guard regions, and replaced existing heap growth code with hfi_set_region. We added hfi_enter and hfi_exit to sandbox transitions. We omit HFI support for the Wasm stack, indirect function tables, and the code section as these would not impact our performance results. However, this would be necessary for complete Spectre mitigation.</p><p>We modify Wasmtime to use HFI for lifecycle operations (Wasm sandbox creation and growth) to understand the benets of HFI. Thus, we integrated hfi_enter, hfi_set_region, support but did not add hmov support for the Wasm heap (as this would not aect lifecycle costs). We use HFI to optimize Wasmtime's teardown of multiple sandboxes to more eciently reclaim memory on sandbox exits. We discuss this in more detail below.</p><p>Optimizing Wasmtime with HFI. Wasmtime today deallocates or tears down sandboxes using the madvise(MADV_DONTNEED) system call, which discards an old sandbox memory and replaces it with a lazy copy-on-write mapping of the next executing sandbox's memory image. These madvise() calls can be slow as they incur a cost proportional to the size of region being discarded, and worse can even perform a TLB shootdown in concurrent environments. HFI allows us to optimize these madvise() syscalls by batching multiple such calls when discarding memories of sandboxes that are adjacent in memory. This is possible as HFI eliminates the need for large regions of guard pages between dierent sandbox memories. 
By eliminating these guard pages, we can trivially run madvise() across immediately adjacent heaps, without paying a significant penalty for unnecessarily discarding guard pages. Furthermore, the elimination of guard pages also reduces address-space pressure, which means we can afford to wait longer before discarding old sandbox memories. We thus modified Wasmtime to take advantage of batching during sandbox destruction.</p></div>
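The batched-teardown idea above reduces to merging adjacent heap spans so one madvise(MADV_DONTNEED) call can cover several dead sandboxes. A minimal sketch, assuming dead heaps are tracked as sorted (start, len) spans; the span type and coalesce helper are hypothetical, not Wasmtime's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t start, len; } span;

/* Coalesce sorted, possibly adjacent sandbox heap spans in place.
 * Without guard pages between heaps, neighbors really are adjacent,
 * so runs of dead sandboxes collapse into one span (and thus one
 * madvise() call). Returns the number of merged spans. */
static size_t coalesce(span *spans, size_t n)
{
    if (n == 0) return 0;
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (spans[i].start == spans[out].start + spans[out].len)
            spans[out].len += spans[i].len;   /* extend current run */
        else
            spans[++out] = spans[i];          /* start a new run    */
    }
    return out + 1;
}
```

Each merged span would then be released with a single madvise() call, instead of one call per sandbox.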
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Hardware Simulation</head><p>Our approach to accurately modeling HFI on complex, long-running applications is twofold. We model all of the low-level costs of HFI using detailed cycle-accurate pipeline simulation, but can only produce results for relatively short-running applications. Thus, we supplement this approach with a faster software emulation model and validate the correlation between the two on some representative, small benchmarks that stress key features of HFI.</p><p>Cycle accurate pipeline model. We congure the gem5 simulator <ref type="bibr">[46]</ref> to resemble the Intel Skylake CPU on which we natively run the majority of our benchmarks. To support HFI, we add metadata registers and instructions. We add the new hmov instruction by using a new prex for x86's mov, and we modify the syscall instruction's microcode to invoke sandbox state handlers. Further details about the simulator's implementation are in the appendix.</p><p>Software emulation. The gem5 simulator is several thousand times slower than native execution. We thus primarily gauge HFI's performance by emulating the cost of HFI using available instructions, described in detail in the appendix. All benchmarks with a single exception (described below) are run on an Intel Core i7-6700K (with Skylake architecture) (4 GHz) with 64 GiB of RAM,  We also pin benchmarks to a single CPU that is isolated from other processes with CPU shield. We record execution time on benchmarks using the Hyperne timing utility, which accounts for warmup runs and averages across several subsequent runs. For the hybrid sandbox (with Wasm) evaluations, we compile source les using stock wasi-clang (i.e., clang with Wasm as a target) and Wasm2c set to employ dierent protection mechanisms -guard pages, bounds-checking, or HFI emulation. We use a native C compiler (clang or GCC) for native sandbox evaluations. Cross validation. 
To vet our emulation, we compare gem5 HFI simulation against our emulation of HFI performance using the Sightglass benchmark suite. Sightglass consists of various short Wasm-friendly programs, mainly primitives from cryptography, mathematics, string manipulation, and control flow. We exclude those Sightglass benchmarks which are incompatible with Wasm2c, or which require over a day to execute on gem5. Figure <ref type="figure">2</ref> shows the comparison between HFI and its emulation. Across the suite, benchmarks in software emulation have cycle counts between 98% and 108% of the simulation. The geometric mean difference in runtime is 1.62%.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Security Evaluation</head><p>To ensure that out-of-bounds memory accesses trap, we employ a collection of unit tests on our HFI gem5 simulation. To ensure our simulation's Spectre resistance, we use exploits from the Transient-Fail <ref type="bibr">[11]</ref> and Google SafeSide <ref type="bibr">[24]</ref> test suites. We run the in-place Spectre-PHT attack from Google SafeSide in the gem5 simulator to demonstrate that sandboxed code can speculatively access secret data (stored in a global variable in the host application for this example) when executed without HFI. We then check that HFI prevents this attack when the host application protects this global variable using HFI's regions (the memory range containing the global variable is in an HFI region without read or write permissions). In Figure <ref type="figure">7</ref> (in the appendix), we plot the memory access latencies to ensure that HFI is able to prevent speculative access of secrets and the subsequent cache-based exltration. We similarly run the in-place Spectre-BTB attack from the TransientFail test suite, and check that this is also mitigated<ref type="foot">foot_4</ref> .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">HFI PERFORMANCE</head><p>We integrate HFI into standard benchmarks and real-world software, and evaluate its performance. We examine four important use cases: long-running applications (SPEC), library sandboxing in a browser, a JIT-based FaaS, and native sandboxing in a server workload (NGINX). Finally, we examine HFI as a Spectre mitigation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">SPEC 2006 Benchmark Suite</head><p>SPEC CPU 2006 is a suite of integer and oating-point workloads.</p><p>We use SPEC06 instead of SPEC17, as many of SPEC17's benchmarks require more memory than the 4 GiB that Wasm supports. We evaluate the impact of HFI on the subset of SPEC06 that is compatible with Wasm and the WASI SDK fork of clang. Figure <ref type="figure">3</ref> shows the performance of bounds checking and HFI compared to guard pages. These are long-running applications that do not test HFI's fast transitions, but do show its low cost in steady state.</p><p>In our results, we see that HFI (geomean speedup of 3.2% over guard pages) is far less costly than bounds checking (geomean slowdown of 34.7% over guard pages), and HFI on average is modestly faster than guard pages. The 445.gobmk benchmark takes a little longer with HFI as it puts heavy pressure on the instruction cache, and in this case, we see that the hmov instructions for which we used longer encodings, impacts HFI performance. We emphasize, however, that HFI is the only scheme of these three that also offers Spectre protections. Securing Wasm compilers against Spectre, without HFI, incurs a 62% to 100% hit <ref type="bibr">[53]</ref>.</p><p>Finally, this benchmark only invokes hfi_enter and hfi_exit once each, at the beginning and end of each benchmark; thus the serialization overhead added to this benchmark is negligible. To understand why HFI improved performance, we dug into two eects more deeply: heap growth overhead and register pressure. Heap growth is an expensive operation in a sandbox that uses guard pages, as it requires a call to mprotect(), while HFI can just update a region's bound registers. To measure this dierence, we ran a simple benchmark in Wasmtime that grows the Wasm heap from a single page to 4 GiB in 64 KiB increments. 
In total, the mprotect() method takes 10.92 seconds, while HFI takes 370 ms, a difference of ∼30×, reflecting HFI's impact on heap growth after being amortized in the context of the Wasmtime runtime.</p><p>HFI also removes the need to load Wasm's memory base and bounds into general-purpose registers for software-based checks, reducing register pressure. To approximate the impact of this, we ran Wasmtime's Spidermonkey benchmark, first reserving one register, then reserving two registers. We find that reserving one register incurs an overhead of 2.25%, while reserving two registers incurs an overhead of 2.40%. Thus, we have a rough approximation of the improvements HFI can offer in this dimension.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Wasm Sandboxing in Firefox</head><p>To understand the end-to-end performance impact of HFI, we measure the performance of Wasm sandboxed font and image rendering in Firefox, similar to Narayan et. al <ref type="bibr">[52]</ref>, with and without HFI. The font rendering benchmark reows the text on a page ten times via the sandboxed libgraphite, using multiple font sizes to avoid any eects from font caches. When using Wasm with guard pages, libgraphite renders this in 1823 ms; using boundschecking, 2022 ms; and using HFI emulation, 1677 ms.</p><p>We also test Firefox's performance using a Wasm-sandboxed libjpeg. For this, we measure decode time for JPEG-format test images from the Image Compression benchmark suite. We use images of three resolutions and three compression levels. Figure <ref type="figure">4</ref> shows the median decode times for each conguration out of 1000 runs. As expected, HFI emulation oers the fastest sandboxing compared to the typical software-based enforcement of Wasm. HFI is faster than software bounds checks by design: at a hardware level, memory access validation happens in parallel with the memory access itself. HFI is also faster than guard pages, because it can elide calls to mprotect() (which is needed by guard pages) in favor of relying on hardware to enforce access safety.</p><p>In the font rendering benchmark, HFI outperforms guard pages by 8.7%. In the libjpeg tests, the speedup of HFI over guard pages is between 14% and 37%. Even though this benchmark has a sandbox invocation per line of pixels (a 1080x720 image requires &#8673; 720 &#8677; 2 serialized enters/exits), Figure <ref type="figure">4</ref> shows that even with this added overhead, HFI's performance benets are able to amortize this cost.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">HFI's Impact on Wasm-based FaaS Platforms</head><p>One of the primary uses of Wasm (outside the browser) is to isolate tenants from one another on FaaS platforms like Fastly's Compute@Edge <ref type="bibr">[20]</ref> or Cloudflare Workers <ref type="bibr">[76]</ref>. We evaluate the performance and scaling impact of using HFI for such systems.</p><p>6.3.1 Cost of Sandbox Setup and Teardown. One of the bottlenecks in FaaS platforms is the setup and teardown of sandboxes. §5.1 describes how HFI allows us to coalesce many sandbox teardowns into one large teardown. To evaluate the impact of this optimization, we use a custom FaaS benchmark that creates 2000 sandboxes, executes a trivial short-lived workload on each (writing some constant data to the sandbox's memory), and then tears down the sandboxes.</p><p>We run this on three versions of Wasmtime: (1) stock Wasmtime, which invokes madvise() once per sandbox on teardown; (2) HFI-Wasmtime, which batches madvise() teardowns; and (3) non-HFI Wasmtime, which batches madvise() but does so without eliding guard pages. We find that stock Wasmtime has a per-sandbox teardown cost of 25.7 µs, HFI-Wasmtime takes 23.1 µs (a 10.1% improvement), and non-HFI batched teardown takes 31.1 µs. Thus, coalescing calls to madvise() during teardown improves performance, but only when using HFI, as it lets the runtime elide guard pages. 6.3.2 Scalability of Sandbox Creation. HFI's ability to let Wasm runtimes elide guard pages also impacts scalability, i.e., the number of sandboxes that can be concurrently executed by Wasm runtimes. We test this by measuring the number of 1 GiB Wasm sandboxes that can be created by Wasmtime when it is allowed to elide guard pages (by using HFI). When eliding guard pages, we find that Wasmtime can create up to 256,000 1 GiB sandboxes in a single process, i.e., the application can make full use of its address space. This 256,000 limit would be even larger for smaller sandboxes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Performance of HFI's Native Sandbox</head><p>We also examine HFI's native sandbox, which sandboxes code without recompilation. This has two costs: trapping syscalls and switching protection domains. We evaluate these below.</p><p>6.4.1 Performance of Trapping Syscalls. Sandboxing systems that isolate native code restrict its access to system resources by interposing on system calls. State-of-the-art native code isolation systems like ERIM <ref type="bibr">[75]</ref> rely on Seccomp-bpf filters for this, whereas HFI has direct microarchitectural support for system call interposition. To compare the overhead of these two techniques, we ran a custom syscall benchmark that opens a file, reads it, and closes it 100,000 times, using Seccomp-bpf and HFI in turn to interpose on the syscalls. We found that the Seccomp-bpf version imposes an overhead of 2.1% over the HFI version.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.2">Switching Costs of Sandboxing OpenSSL in NGINX</head><p>We modify the NGINX web server to estimate the performance of sandboxing crypto functions and session keys in OpenSSL, similar to ERIM <ref type="bibr">[75]</ref>. We use this to measure the costs of integrating HFI in existing applications versus the benefits (e.g., blocking attacks like Heartbleed <ref type="bibr">[29]</ref> and Spectre). In particular, NGINX switches into and out of sandboxed code regions rapidly when a web client accesses encrypted content. Thus, we use this to understand the impact of the serialization added by hfi_enter and hfi_exit. HFI's native sandbox by design does not impose any execution overhead, as there is no modification of the instruction stream and region checks execute in parallel with address translation. Instead, overheads only appear during sandbox enters and exits, metadata manipulation (e.g., hfi_set_metadata), and traps.</p><p>Figure <ref type="figure">5</ref> compares the throughput of the unmodified NGINX web server delivering content with unprotected session keys versus the throughput when protecting session keys with HFI and MPK respectively. Following the experimental setup of ERIM <ref type="bibr">[75]</ref>, we used apache-bench to send millions of requests of various sizes to the NGINX web server running on a single isolated CPU core. For each file size, we sent 2,000,000 session requests from 100 clients for 60 seconds and measured the throughput. We observe that HFI's native sandbox has a low overhead that ranges from 2.9% to 6.1%. HFI's overhead is slightly larger than MPK-based protection, which ranges from 1.9% to 5.3%. This is because HFI takes a few cycles to move metadata from memory to HFI registers on each transition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.5">Cost of Spectre Protections</head><p>We compared HFI's Spectre protections to the performance of Swivel <ref type="bibr">[53]</ref>, the fastest known compiler-based approach to mitigating Spectre in Wasm. We evaluated this by running several common Wasm workloads in the Rocket web server and securing these workloads against Spectre. We compared Rocket with: the Lucet Wasm compiler without Spectre protections, Lucet+Swivel-SFI protections, and Lucet+HFI using the native sandbox. We also record the workload binaries' sizes.</p><p>Table <ref type="table">1</ref> shows that HFI guards against Spectre with a very small drop in tail latency and no noticeable binary bloat, while Swivel incurs noticeable overheads for the same protection. In fact, the only overheads imposed by native-sandbox HFI are due to region construction and sandbox state transitions (two per connection), and these costs are amortized by the cost of the workload.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">RELATED WORK</head><p>In the last few decades, diverse approaches to fine-grain isolation have been explored, through software-based, OS-based, and hardware-based techniques.</p><p>Software-based isolation. SFI systems <ref type="bibr">[78,</ref><ref type="bibr">86]</ref> such as Wasm <ref type="bibr">[28]</ref> offer software-enforced isolation, leveraging compilers to restrict memory accesses to a linear region of memory, restrict control transfers, and interpose on system calls. As discussed in §2, this approach has broad compatibility with existing code due to its simplicity, but incurs both a substantial virtual memory footprint due to guard pages and performance overheads when trying to isolate JIT code <ref type="bibr">[5]</ref> or prevent Spectre attacks <ref type="bibr">[53,</ref><ref type="bibr">67]</ref> in software. Systems that isolate computations using safe languages such as Singularity <ref type="bibr">[19]</ref> or Erlang <ref type="bibr">[6]</ref> offer fast context switching and good scalability, but cannot isolate existing code written in systems languages like C or C++. HFI is not intended to replace these systems, but instead complements these software-based approaches with hardware primitives for performant Spectre-safe isolation.</p><p>OS-based isolation. Operating systems, most notably microkernels such as L4 <ref type="bibr">[31]</ref>, have long pushed the boundaries of isolation with page-based protection hardware. Systems have experimented with offering process-style isolation primitives to an application through the interface of a "sandboxed thread" <ref type="bibr">[13,</ref><ref type="bibr">33]</ref>. Lightweight contexts <ref type="bibr">[43]</ref> go further and allow an existing application thread to switch between sandboxed contexts.
These systems have the benefit of not relying on hardware changes, but suffer performance overheads due to expensive protection ring crossings (calls into the OS kernel) to switch the active sandbox, as well as restrictions on scaling due to limits on the total number of hardware-supported process contexts (ASIDs): 4096 on Intel CPUs <ref type="bibr">[35]</ref>.</p><p>Hardware-assisted isolation with page metadata. Diverse systems have proposed extensions to page table metadata to store a per-sandbox ID checked by hardware, for example Donky-x86 <ref type="bibr">[66]</ref> and others <ref type="bibr">[22]</ref>, or have cleverly reused the existing checked page-level metadata (for x86 rings <ref type="bibr">[40]</ref>, virtual machines <ref type="bibr">[9,</ref><ref type="bibr">25,</ref><ref type="bibr">44,</ref><ref type="bibr">60,</ref><ref type="bibr">81</ref>], ARM's memory domains <ref type="bibr">[89]</ref>, trusted execution environments <ref type="bibr">[7,</ref><ref type="bibr">8]</ref>) to provide isolation at a page level. Unlike HFI, these approaches do not require a sandbox's memory to be in a contiguous range. These approaches, however, inherit the limitations of page-based isolation (see Norton <ref type="bibr">[54]</ref>). For example, they require expensive calls to kernel code to update the page-level metadata (to grow the sandbox-accessible memory) or switch the active sandbox; worse, keeping this metadata consistent across CPU cores often becomes a performance bottleneck, due to the need for expensive TLB clears or shootdowns.
Such systems also typically rely on separate tools to interpose on syscalls that may break isolation <ref type="bibr">[14,</ref><ref type="bibr">34,</ref><ref type="bibr">58,</ref><ref type="bibr">65,</ref><ref type="bibr">66,</ref><ref type="bibr">77]</ref>, and are incompatible with zero-cost transitions <ref type="bibr">[38]</ref> (fast sandbox entries/exits that do not have to save and restore registers), since they do not distinguish between stack and heap protection.</p><p>Intel MPK, while largely similar to the above approaches, allows switching of the active MPK domain (the current sandbox) in userspace (i.e., ring 3). MPK-based techniques <ref type="bibr">[30,</ref><ref type="bibr">34,</ref><ref type="bibr">70,</ref><ref type="bibr">75,</ref><ref type="bibr">80]</ref> have thus been explored to reduce context switch overheads and to sandbox native binaries; however, these tools still face the other limitations of page-based approaches described above. MPK also only supports 15 domains/sandboxes efficiently, making it unsuitable for server-side applications, which handle many thousands of unique requests, or even client-side applications that require dozens or hundreds of unique contexts <ref type="bibr">[52]</ref>. Techniques that attempt to scale MPK domains do so by falling back on other page-level techniques (e.g., page permissions <ref type="bibr">[57]</ref>, virtualization <ref type="bibr">[27]</ref>), incurring their associated performance costs.</p><p>Hardware-assisted isolation with range checks.
Isolation systems have also been built around existing hardware such as x86 segmentation <ref type="bibr">[21,</ref><ref type="bibr">86]</ref> and Intel MPX <ref type="bibr">[39]</ref>, which rely on explicit bounds checks on addresses in memory operations.</p><p>Segmented memory architectures are some of the oldest protection mechanisms <ref type="bibr">[42]</ref>, and have been used by VMMs <ref type="bibr">[1]</ref>, by operating systems like Multics <ref type="bibr">[62]</ref>, and for both OS components and applications in systems like OS/2 and AS/400. Earlier SFI implementations such as Vx32 <ref type="bibr">[21]</ref> and NaCl <ref type="bibr">[86]</ref> leveraged x86 segmentation on 32-bit platforms for fast isolation, an approach that has similarities to HFI. Unfortunately, x86-64 dropped support for segmentation, so this technique has limited utility on current hardware. HFI offers some primitives that are similar to x86 segmentation (for example, the segment/region-relative addressing of explicit regions) but pairs this with primitives adapted to the flat memory model that can isolate unmodified applications (implicit regions). Additionally, HFI's abstractions and implementation are much more minimal and tailored to in-process isolation, avoiding complex segmentation features such as call gates or automatic privilege-level switching when changing regions/segments.</p><p>Intel MPX (Memory Protection Extensions), a hardware feature to support fine-grain memory safety, has also been re-purposed by systems like MemSentry <ref type="bibr">[39]</ref> for sandboxing. While this approach would work well in theory, practical implementations incurred substantial overheads comparable to software-based sandboxing, in part due to the data dependencies created by multiple range checks prior to memory accesses <ref type="bibr">[39]</ref>.
HFI, in contrast, carefully avoids multiple range checks by design, relying instead on a single range check for explicit regions and masking for implicit regions. HFI also accounts for other requirements of isolation systems such as system call interposition and Spectre resistance.</p><p>Newer efforts like the J-extension <ref type="bibr">[47]</ref> on RISC-V have proposed hardware for address masking (similar to the software-based approach of Wahbe et al. <ref type="bibr">[78]</ref>); however, this inherits all the drawbacks of software-based masking, such as converting out-of-bounds memory accesses into random memory corruption (§2).</p><p>Hardware-assisted isolation with capabilities. Capability-based addressing has a long history <ref type="bibr">[12,</ref><ref type="bibr">17,</ref><ref type="bibr">18,</ref><ref type="bibr">51,</ref><ref type="bibr">74]</ref>, and is seen most recently in CHERI <ref type="bibr">[84]</ref>. CHERI provides a powerful security model that can represent an unlimited number of byte-granular protected regions, which enables not just compartmentalization but also object-level memory safety. However, it also requires extensive modification to nearly every layer of the software and hardware stack, including the OS <ref type="bibr">[73]</ref>, ABI <ref type="bibr">[16]</ref>, and compiler <ref type="bibr">[72]</ref>, as well as extensive hardware support. CHERI's use of 128-bit fat pointers to track capability metadata also comes at the cost of an increased memory and cache footprint that affects performance <ref type="bibr">[83]</ref>. Alternative capability systems leveraging cryptography <ref type="bibr">[41]</ref> have also been proposed. In contrast, HFI takes a far more minimalist approach, focusing exclusively on supporting SFI and in-process sandboxing, and thus only requires small modifications to existing processors and little to no change to existing software.</p><p>Hardware changes for Spectre-resistant CPUs.
Several works, such as speculative taint tracking <ref type="bibr">[87]</ref> and others <ref type="bibr">[45,</ref><ref type="bibr">82,</ref><ref type="bibr">85]</ref>, have proposed redesigning the CPU's approach to speculative execution to prevent Spectre attacks. These approaches are complementary to HFI: they provide Spectre protection for general programs, i.e., programs that do not run sandboxed code, but do so at the cost of adding complexity to the CPU design. In contrast, HFI is designed to provide isolation and Spectre safety for sandboxing, and is explicitly designed to minimize changes to the CPU to allow easy adoption.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">CONCLUSION</head><p>Modern page-based protection architectures are remarkably powerful. They have served us for decades, through massive changes in processor speeds, compute form factors, and workloads. However, one thing they are not good at is the kind of fine-grain isolation where software excels. This lack of support for fine-grain isolation is a problem and limits the way we think about building systems.</p><p>Wasm has opened the door to a new world of in-process sandboxing, safe extensibility, and high-scale concurrency. However, being software-based also brings a host of constraints that limit its potential.</p><p>HFI offers a path beyond these constraints by supporting in-process isolation for Wasm and native binaries that preserves the benefits of software-based isolation (low context-switch overheads, fast instantiation, etc.) while offering a new level of safety, scalability, and performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENT</head><p>Thanks to Dan Gohman and Luke Wagner from Fastly and the architects from Intel for their insightful discussions, and to the anonymous reviewers and shepherd for their valuable comments, which improved the quality of this paper. This work was supported in part by a Sloan Research Fellowship; by the NSF under Grant Numbers CNS-2155235, CNS-2120642, and CAREER CNS-2048262; by gifts from Intel, Google, and Mozilla; and by DARPA HARDEN under contract N66001-22-9-4017. And finally, thanks to our families, without whose support this work would not be possible.</p></div>
		<div xmlns="http://www.tei-c.org/ns/1.0"><head>A APPENDIX</head><p>A.2 HFI Hardware Simulation. Gem5 simulation. We create a gem5 instance configured as in Table <ref type="table">2</ref>. (1) hfi_set_region: We move data from memory into new hardware registers we add to gem5 to store HFI metadata. This metadata includes information such as region masks, the exit handler, etc. (2) hfi_enter and hfi_exit: These instructions write to the enforcement-bit register to enable/disable sandboxing. We serialize the pipeline upon these instructions. hfi_exit additionally jumps to an exit handler if one is specified. (3) hmov: We add hmov as a variant of the x86 mov instruction that takes a prefix. The data and code masks are applied in the pipeline to mov instructions and the program counter without introducing any delays. (4) syscall microcode: We modify the microcode of privilege-escalating instructions like syscall to invoke the HFI handler (if specified) while inside an HFI sandbox. Software emulation. We emulate HFI's overheads using currently available instructions. (1) hfi_set_region: We emulate its cost by moving the HFI region metadata from memory to general-purpose registers. (2) hfi_enter and hfi_exit: We use cpuid, an instruction known to serialize the pipeline, to capture their serialization cost <ref type="bibr">[35]</ref>. hfi_exit also checks whether a syscall handler is specified, to capture the cost that would normally occur in the microcode implementation. (3) hmov: We emulate it with a regular mov instruction that uses a fixed region with a base of 0x7ffff000, i.e., one page prior to 2 GiB. This is the largest page-aligned address the x86 mov instruction can refer to via its constant field (without a register). We use this to emulate costs, as (1) it reserves one of the inputs to the x86 mov, consistent with hmov, and (2) it captures the host program's speedups from not having to use a general-purpose register to store the base.</p><p>This hmov emulation choice is deliberate, as this is the largest address x86 can address while still allowing compilers to provide two arguments to the mov instruction.
It also captures the HFI benefit of reduced register pressure, as it allows the host program to not consume a register for the region base.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Attempts to mitigate Spectre today require prohibitively expensive compiler techniques <ref type="bibr">[53]</ref> or complex workarounds that move the Wasm VM into a separate process <ref type="bibr">[67]</ref>; unsurprisingly, neither approach is used in mainstream Wasm engines.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>All artifacts of our evaluation are available at https://github.com/PLSysSec/hfi-root</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>To be clear, HFI only does isolation, not resource management; thus, additional sandbox creation overheads, like memory allocation, are up to the developer.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3"><p>Regions make resizing heaps orders of magnitude faster than current Wasm implementations, as regions can be resized with just a register update. In contrast, current Wasm systems use mprotect() to limit access to only what the sandbox has requested; thus, a system call is always required to resize memory.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4"><p>For completeness, we note that gem5 does not model BTB speculation with sufficient detail to perform a real Spectre-BTB attack. Thus, in our tests, we instead model a Spectre-BTB attack using concrete control flow that leaks secret data via a cache side channel.</p></note>
		</body>
		</text>
</TEI>
