<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Nu: Achieving Microsecond-Scale Resource Fungibility with Logical Processes</title></titleStmt>
			<publicationStmt>
				<publisher>USENIX</publisher>
				<date>04/17/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10506268</idno>
					<idno type="doi"></idno>
					<title level='j'>20th USENIX Symposium on Networked Systems Design and Implementation (NSDI'23)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Zhenyuan Ruan</author><author>Seo Jin Park</author><author>Marcos K. Aguilera</author><author>Adam Belay</author><author>Malte Schwarzkopf</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Datacenters waste significant compute and memory resources today because they lack resource fungibility: the ability to reassign resources quickly and without disruption. We propose logical processes, a new abstraction that splits the classic UNIX process into units of state called proclets. Proclets can be migrated quickly within datacenter racks, to provide fungibility and adapt to the memory and compute resource needs of the moment. We prototype logical processes in Nu, and use it to build three different applications: a social network application, a MapReduce system, and a scalable key-value store. We evaluate Nu with 32 servers. Our evaluation shows that Nu achieves high efficiency and fungibility: it migrates proclets in ≈100μs; under intense resource pressure, migration causes small disruptions to tail latency—the 99.9th percentile remains below or around 1ms—for a duration of 0.54–2.1s, or a modest disruption to throughput (<6%) for a duration of 24–37ms, depending on the application.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Compute and memory are valuable and expensive resources in datacenters today, but they are inefficiently utilized <ref type="bibr">[46,</ref><ref type="bibr">76]</ref>.</p><p>A key reason for this inefficiency is a lack of fungibility-the ability to reassign resources quickly and without disruption between different users and across different machines. Without fungibility, resources are stranded and over-provisioned for fear of running short, even as resource consumption naturally fluctuates in datacenter applications <ref type="bibr">[2,</ref><ref type="bibr">7,</ref><ref type="bibr">18,</ref><ref type="bibr">34,</ref><ref type="bibr">39]</ref>.</p><p>Existing systems fail to provide fungibility because current abstractions for compute work and memory state (VMs, containers, processes) are too coarse-grained ( &#167;2). To address this problem, we introduce the abstraction of a logical process. Logical processes provide fungibility, while retaining a familiar programming model similar to traditional processes. A logical process consists of many smaller proclets, atomic units of state and compute that can be independently migrated under resource pressure to achieve fungibility. Like a traditional process, a logical process has its own address space, isolated from other processes. But unlike a traditional process, a logical process can spread across many machines in datacenter racks as a result of the migration of its proclets. Intuitively, logical processes break down the monolithic nature of traditional processes into many proclets. A proclet consists of a heap (state) and a set of user-level threads and their execution contexts (stacks and register values). A runtime system that manages the logical process responds to spikes in load by migrating proclets quickly to a machine with spare resources.</p><p>To realize logical processes and proclets, we had to address three challenges. First, proclet migration must be fast and react to resource pressure before resources are exhausted. Second, communication between proclets and migration of proclets must impose little overhead or disruption on the application, especially if migration itself consumes resources when they are short. Third, the programming model of logical processes and proclets must support practical datacenter applications.</p><p>We respond to these challenges as follows. First, we divide process state into proclets, which are small relative to an entire process, so they can be migrated orders of magnitude faster than VMs or processes. Second, we optimize our software stack to take full advantage of modern datacenter networks (at 100-400 Gbit/s). This pushes performance far enough for proclets to migrate in &#8776;100&#181;s. We also scale proclets across machines with minimal communication overheads by using a single program image across machines and an optimized RPC stack. Third, we use a global address space to provide a programming model that is process-like and intuitive. This makes it possible to statically check types, and enables computation shipping by passing function pointers between proclets.</p><p>We prototyped logical processes and proclets in Nu, a system that provides a C++ class API and a Caladan-based userlevel threading and kernel-bypass networking runtime <ref type="bibr">[28]</ref>. Nu targets environments with tens of racks: hundreds of machines connected with an overprovisioned network that provides high full-bisection bandwidth (100-400 Gbit/s) and low latency <ref type="bibr">(10-20 &#181;s)</ref>. We implemented three applications using Nu. The first is a version of the DeathStarBench social network application <ref type="bibr">[29]</ref>, originally implemented using microservices. The Nu version of this application is simpler, shorter, and has an order of magnitude better performance than the microservice version, while preserving scalability. The second application is k-means clustering on Phoenix MapReduce <ref type="bibr">[63]</ref>, which represents a compute-intensive workload with high parallelism. Phoenix MR originally supported thread parallelism in a single NUMA machine, but the Nu version scales across multiple machines while also delivering comparable single-machine performance. The third application is a scalable key-value store implemented in Nu as a hash table whose buckets are distributed across multiple proclets.</p><p>We evaluate Nu in a setup of 32 servers with 100 GbE NICs that are connected through a top-of-rack switch. Our evaluation shows that Nu achieves high efficiency and fungibility: it reacts quickly to resource pressure and migrates (b) Proclets permit tighter packing, which fits more tasks (orange, purple) that can be migrated away quickly under resource pressure.</p><p>proclets without disruption to the application workload. Nu migrates proclets in &#8776;100&#181;s and its migration exceeds the rate at which the Linux kernel can allocate memory (&#8776; 7GB/s), so Nu handles even intense resource pressure. Under this memory pressure, the social network app adds 122&#181;s to the 99.9 thpercentile client latency for a period of 0.86s; the key-value store app adds 52&#181;s for 2.1s; and k-means loses 2.9% throughput for 37ms. Under intense compute pressure, disruption is higher, but still short-lived: the key-value store adds 1,053&#181;s for 0.54s; and k-means loses 5.8% throughput for 24ms. Finally, Nu's logical processes are efficient in the absence of resource pressure, and match or exceed the performance of strong baselines on one or more servers.</p><p>Nu has some limitations. First, logical processes require developers to structure applications as proclets. This may not be feasible for every application ( &#167;3.3), but we have shown that it is feasible for three very different applications. Second, Nu currently considers only two resources, memory capacity and compute load. We expect other resources can be added (network, caches, memory bandwidth, etc.), but that remains future work. Third, Nu targets deployments where network bandwidth is plentiful and latencies low.</p><p>Nu is available as open-source software <ref type="bibr">[66]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Motivation: Resource Fungibility</head><p>Cloud computing originally promised to deliver utility computing, with fine-grained, pay-per-use sharing of compute resources, rather than fixed-size machines that customers must purchase and own <ref type="bibr">[4,</ref><ref type="bibr">31]</ref>. But, almost two decades later, the operational reality is different: although end-users can readily rent resources, cloud providers still provision and offer these resources in fixed-size units and over long time horizons. We argue that a key problem in this setting is the lack of fungibility-the ability to reassign resources quickly and without disruption between different users and across different machines. Users today submit requests for fixed allocations (number of cores, memory, etc.) as determined by so-called "instances" (or "slots", "tasks"). These allocations tend to over-estimate actual resource use, which fluctuates at sub-second time scales. Providers bin-pack instances onto the available servers <ref type="bibr">[33,</ref><ref type="bibr">35,</ref><ref type="bibr">71,</ref><ref type="bibr">76,</ref><ref type="bibr">77]</ref>. This is inefficient because users must size instances for peak rather than typical usage, leaving substantial resources idle most of the time. Providers can reclaim some of these wasted resources by overbooking and scheduling best-effort instances in them <ref type="bibr">[2,</ref><ref type="bibr">44,</ref><ref type="bibr">76,</ref><ref type="bibr">84]</ref>. But this practice is disruptive, as machines can get intermittently overloaded, leading to performance degradation (e.g., high tail latencies), which is particularly problematic for latencysensitive workloads <ref type="bibr">[28,</ref><ref type="bibr">44]</ref>. In response, the cluster manager must kill some best-effort instances to free up resources. But doing so is also disruptive because the work done by a killed instance can be wasted and may need to be redone. Moving the instance usually is not an option as it requires an expensive VM or process migration that can take seconds or minutes because the state to be moved is large, and it requires the cluster manager to find a destination machine that has sufficient resources to take over the entire (indivisible) instance.</p><p>In other words, today's cloud is not fungible (Figure <ref type="figure">1</ref>(a)). Resources can only be reassigned on fairly long timescales, larger than the timescales over which resource consumption fluctuates. The underlying reason for this problem is that current abstractions for compute work and memory state-VMs, containers, and processes-are too coarse-grained.</p><p>A more efficient design would avoid disruption and reassign resources quickly and at fine granularity. This would make it easy for providers to increase utilization by densely packing instances across machines while rebalancing and migrating work as necessary, instead of killing instances under resource pressure. In addition, this would eliminate the burden on users to predict and specify peak per-machine resource usage for each instance, allowing them to instead pay for resources as they are used.</p><p>Our approach to fungibility. To provide fungibility, we revisit the process, a core OS abstraction that dates back to the 1960s. Traditionally, a process is an instance of a computer program that runs on one machine, consisting of memory and a set of threads. Our work extends this idea across machines to provide a similar abstraction called a logical process.</p><p>Logical processes are inspired by logical disks <ref type="bibr">[59,</ref><ref type="bibr">74]</ref>. Much like a logical disk, a logical process combines together disparate physical resources-in this case, machines rather than disks. A logical process automatically scales to use additional machines when more capacity is needed, and can recover from machine failures.</p><p>A logical process achieves fungibility through two key ideas (Figure <ref type="figure">1(b)</ref>). First, a logical process divides program state into fine-grained partitions called proclets. Second, proclets are migrated quickly between machines in response to memory or compute resource pressure. Each proclet runs on one machine at a time, and proclets communicate with each other through efficient message passing.</p><p>Because proclets are fine-grained, migrations complete quickly, causing minimal performance disruption. Inspired by prior work that shows that decomposition into small units simplifies placement <ref type="bibr">[51,</ref><ref type="bibr">57]</ref>, proclets' fine granularity makes packing them onto machines simple and avoids complex and time-consuming bin-packing on allocation or migration.</p><p>Alternative approaches. There are a few other approaches to improving fungibility, but they have drawbacks. One can migrate VMs <ref type="bibr">[20,</ref><ref type="bibr">78]</ref>, containers <ref type="bibr">[22]</ref>, or processes <ref type="bibr">[49]</ref>, but migration is slow due to their state size. An alternative that maintains the process abstraction is to use distributed shared memory (DSM) to spread a normal process across machines <ref type="bibr">[83]</ref>.</p><p>But DSM systems experience high coherence overheads with shared memory, leading to poor performance. PGAS <ref type="bibr">[5,</ref><ref type="bibr">17,</ref><ref type="bibr">54]</ref> is a type of DSM that can avoid such overheads, but its applicability is limited to parallel applications. Another approach to fungibility is to adopt new programming models to distribute the application into smaller units, as with distributed objects <ref type="bibr">[6,</ref><ref type="bibr">13,</ref><ref type="bibr">21,</ref><ref type="bibr">30,</ref><ref type="bibr">70,</ref><ref type="bibr">82]</ref>, microservices, and serverless functions. These models depart significantly from the familiar process abstraction, and they are built on top of traditional, coarse-grained instances that limit their fungibility. They also have high RPC messaging overheads (and cold start delays for serverless functions <ref type="bibr">[72]</ref>) that grow in cost as their units become smaller. Alternatively, parallel programming frameworks <ref type="bibr">[8,</ref><ref type="bibr">15,</ref><ref type="bibr">23,</ref><ref type="bibr">26,</ref><ref type="bibr">50,</ref><ref type="bibr">81]</ref> partition work via rigid compute patterns (e.g., partition-aggregate, actors). This constrains the programming model and requires data to be statically placed on machines.</p><p>Finally, some techniques provide fungibility, but in limited form. Far memory systems <ref type="bibr">[32,</ref><ref type="bibr">67,</ref><ref type="bibr">79]</ref> can incrementally extend the memory of a process, but they perform well only when the remote memory is cold. Request load balancing can make compute fungible, but it is mostly suited for stateless or read-only services. These two techniques are complementary to logical processes and can be combined with them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">The Logical Process Abstraction</head><p>A logical process exists across one or several machines and contains a collection of proclets. Proclets are fine-grained partitions of program state that form units of migration. Proclets can be individually migrated between machines to relieve resource pressure (Figure <ref type="figure">2</ref>). The address space layout of a logical process running on two machines. Read-only code and data is mapped everywhere, while proclets are mapped in exactly one machine at a time.</p><p>A proclet consists of a heap and a set of threads that can access the heap concurrently via shared memory. A proclet never shares its heap memory directly with other proclets. Instead, each proclet has an associated root object, which defines a remote method interface that other proclets use to access its state. This approach allows developers to build full programs from proclets in a natural, object-oriented way. The root object may store references (pointers) to ordinary local objects stored on the proclet's heap.</p><p>The number of machines allocated to a logical process can change over time in response to shifts in the resources available on each machine. Each machine handling logical processes runs a separate runtime instance. The runtime provides location-transparent communication between proclets, detects resource pressure, migrates proclets between machines, and cleanly handles failures.</p><p>Developing software for logical processes is similar to normal UNIX processes. Code can spawn threads, use synchronization primitives to coordinate access to shared memory, and allocate memory using standard APIs like malloc or new. But there are two major differences. First, developers must partition their program state into proclets. Second, in most cases, developers must use runtime APIs instead of making system calls or performing I/O directly. We describe the logical process abstraction in more detail in the following.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Address Spaces and Cache Coherence</head><p>A logical process uses an identical address space layout on each machine. This simplifies migration, as pointers remain valid across machines without swizzling. Runtime instances coordinate to keep their layout synchronized during initialization and whenever new proclets are created. Figure <ref type="figure">3</ref> shows an example address space layout for a logical process running on two machines. Code and shared data segments are mapped read-only on all machines. Consequently, the machines must be binary-compatible, but not necessarily identical architectures (e.g., AMD and Intel x86 CPUs). Read-only data can store large static arrays, tables, and other inputs that all proclets might need. Proclets' heaps, on the other hand, are readable and writable, only mapped on one machine at a time, and only ever accessible by the owning proclet. (This contrasts with distributed shared memory <ref type="bibr">[3,</ref><ref type="bibr">10,</ref><ref type="bibr">25,</ref><ref type="bibr">40,</ref><ref type="bibr">52,</ref><ref type="bibr">68,</ref><ref type="bibr">69]</ref>, which typically provides cache coherence across machines.) In other words, no proclet can share  memory with another proclet. Instead, proclets communicate via remote method invocation, which passes arguments by copying if the proclets are co-located on a machine or by network transfer if they are on different machines.</p><p>This design avoids writable shared memory across machines and aligns well with current datacenter networks, which provide high throughput and low latency, but lack hardware support for cache-coherent memory across machines. Additionally, this design enables fault isolation, as it allows one proclet to fail independently from others on different machines. A failure can cause a proclet's memory to disappear at any time, and these errors can be cleanly reported via return codes of remote methods. This allows us to use standard distributed systems techniques (e.g., replication) to make critical proclets fault-tolerant.</p><p>Proclet migrations occur atomically and each proclet runs on exactly one machine at a time. Consequently, cache coherence is available within proclets, but not across proclets. This design allows for a normal programming environment inside proclets, including synchronization across threads (via spinlocks, mutexes, etc.) when they access shared memory within a single proclet's heap.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Programming Model</head><p>Developers write an application as a set of proclet root classes. As in traditional object-oriented programming, each class defines methods and fields. Methods implement the proclet's application logic and expose the API for the proclet to be invoked by other proclets. Fields specify state internal to the proclet, although additional state can be allocated dynamically in the heap at runtime. Figure <ref type="figure">4</ref> shows a running example in C++. <ref type="foot">1</ref> Lines 1-6 define Accumulator as the root class for a simple proclet that keeps a value val_ and exposes two methods Add and Get to increment and retrieve the value. Here, the methods are one-liners, but in real applications they constitute most of the code.</p><p>When a logical process starts up, the runtime launches a main proclet. This proclet typically creates other proclets by calling function make_proclet with their root classes and constructor parameters. In the example, the main proclet invokes function mainfunc (for brevity we do not show the main proclet, only mainfunc), which in lines 10-11 creates two proclets with root class Accumulator.</p><p>Proclets communicate only via remote method invocations and closures. With remote method invocations, a proclet calls the methods of the root object of another proclet, either synchronously using function Run(), or asynchronously using function RunAsync(), which returns a future. Lines 13 and 17-18 show a synchronous and two asynchronous invocations of Get on proclets. The two asynchronous invocations run concurrently to hide latency.</p><p>With closures, a proclet can implement function shipping <ref type="bibr">[36,</ref><ref type="bibr">38,</ref><ref type="bibr">65,</ref><ref type="bibr">67,</ref><ref type="bibr">79,</ref><ref type="bibr">80]</ref>, and ship a function that invokes methods-interspersed with its own processing logic-on the root object of another proclet. Line 15 shows a closure that invokes Add and Get on the same proclet. This execution incurs a single roundtrip to the server hosting the proclet, even though it invokes two methods. Shipping code to data in this manner can greatly improve efficiency.</p><p>The remote runtime may execute methods and closures on the same proclet concurrently on different threads. Hence, the example uses a mutex mu_ to protect val_ against concurrent execution of Add and Get.</p><p>Naming and reference counting. Proclets need to know about each other before they can communicate. We adopt a proclet naming scheme based on smart proclet pointers. These pointers provide safety, convenience, and reference counting through a interface similar to C++'s shared_ptr.</p><p>Unlike standard RPC frameworks, remote methods or closures can take proclet pointers as arguments. Thus, code can pass handles to proclets to other proclets by passing them as parameters, similar to delegating capabilities. This feature permits a remote method or closure to chain together the execution of multiple proclets while performing computation in between. For example, line 22 shows a closure on proclet p1 that takes proclet p2 as a parameter; the closure first calls p2's Get method, followed by p1's Add method.</p><p>Proclet pointers are valid within the entire logical process, even across machines, and the runtime frees a proclet when it loses its last reference. In the example, proclets p1 and p2 are freed automatically when mainfunc() goes out of scope.</p><p>We considered using global strings as proclet names, but never needed them in building applications. A logical process is tightly coupled, and we found that passing smart pointers is more convenient than hard-coding strings. Typically, initialization code creates several proclets, and passes around their smart pointers, so the code hands over access directly.</p><p>Type checking. Because a logical process uses an identical program image across machines, static type checking of argument types is sufficient for remote invocations. This contrasts with standard RPC frameworks (e.g., Thrift or gRPC), which additionally have to perform dynamic type checking, incurring runtime overheads and requiring extra error handling. Line 24 thus fails to compile because of too many arguments.</p><p>Raw pointers into a proclet's heap are never allowed as arguments; we made this choice to discourage incorrect code that attempts to share memory between proclets. On the other hand, smart pointers are supported, and passing them as arguments causes the objects they own to be copied.</p><p>Unlike standard RPC frameworks, proclet invocations allows remote methods and closures to take function pointers and closures as arguments. This is possible as all machines in the logical process map the code segment at the same address.</p><p>Network I/O outside a logical process. Logical processes perform their I/O through abstractions provided by the runtime, rather than POSIX syscalls and I/O abstractions. This allows proclets to be machine-independent and migrate between machines without having to move hard-to-migrate local kernel state (e.g., the TCP state machine). In particular, the runtime maintains TCP network connections to clients, which can be either other logical processes or normal processes. These connections allow clients to communicate with specific proclets inside a logical process-or to spread load across groups of stateless proclets-and will forward client requests if the destination proclet has migrated. Similar to existing libraries for distributed request routing <ref type="bibr">[1,</ref><ref type="bibr">53]</ref>, the runtime informs client libraries about the proclet's new location, so that the client knows to expect the response on another network connection and to send future requests there. In our datacenter setting, client and server code are under the control of the same entity, and custom I/O libraries (e.g., for request routing and load balancing) are commonplace <ref type="bibr">[1,</ref><ref type="bibr">11,</ref><ref type="bibr">73]</ref>.</p><p>In rare cases, developers can pin proclets that need to use local resources directly to a machine. Such proclets lose their ability to migrate and reduce resource fungibility, so developers should pin proclets only if absolutely necessary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Porting Applications to Logical Processes</head><p>In principle, any application that can partition its state into fine-grained units can be ported to a logical process (each unit becomes a proclet). This aligns well with existing cloud applications that already partition their state (e.g., microservices, FaaS, distributed frameworks, etc.), though sometimes at a coarser granularity than proclets. There are two main considerations when dividing a logical process into proclets: the proclet granularity and its scope.</p><p>Proclet granularity. Choosing the right size for proclets is important. If proclets are too large, resource fungibility suffers. If they are too small, communication overheads increase as remote invocations become more frequent. Developers must choose a sweet spot that provides sufficient fungibility without significant overheads. &#167;6.4.2 shows empirical measurements of proclet performance at different state sizes and invocation compute intensities; in practice, proclets of a few MiB state size work well.</p><p>Proclet scope. The next consideration is how to decide what functionality goes into a proclet. One approach is functional splitting, which equates a proclet to a logical functional unit in the application (a module, a microservice, a package, etc). Well-known software engineering practices suggest how to choose appropriate units <ref type="bibr">[47,</ref><ref type="bibr">56]</ref>: the unit should include functionality that is intuitively related, that can be described simply, and that can be encapsulated through a compact and easy-to-understand API. The latter property ensures that the interface between proclets is also compact. Another approach is to use sharding. Since a functional unit may be much larger than the ideal proclet size, it may help to shard (partition) the unit. For example, consider a large chaining hash table. Each hash bucket of this data structure becomes a separate proclet and stores the proclet pointer in the hash array. To operate on a key in the hash table, the code makes the appropriate method invocation to the corresponding proclet. This results in a distributed key-value store, as proclets are spread across machines, but maintains the hashtable API.</p><p>Limitations. Some applications are hard to decompose into proclets, such as applications that manipulate large amounts of state that is not easily divisible (e.g., video encoders, architecture simulation, sorting, or graph processing). For these examples, decomposition may still be possible, but it requires new algorithmic approaches <ref type="bibr">[27,</ref><ref type="bibr">41,</ref><ref type="bibr">45,</ref><ref type="bibr">48,</ref><ref type="bibr">64]</ref>.</p><p>Other applications may require functionality that is tied to physical hardware resources, such as a GPU or an FPGA. In these cases, proclets that interact with the hardware may need to be pinned, thus reducing the logical process's fungibility.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Security and Threat Model</head><p>A logical process has the same isolation properties as a UNIX process-viz., its memory is isolated from other processes, but its threads share an address space-but applies this model across multiple machines. Even though proclets lack shared memory, there is no hardware memory isolation (e.g., via the MMU) between the proclets within a logical process to enforce this. We made this choice for performance reasons and because it matches the isolation model of UNIX processes. On the other hand, memory isolation is guaranteed across different logical processes: each local logical process instance runs in a different UNIX process and is isolated from other logical process instances on the machine.</p><p>Address space layout randomization (ASLR) and stack ca-naries are important defenses against buffer overflow attacks.</p><p>Although ASLR might at first glance seems incompatible with logical processes' global address space, it works as long as the loader maintains the same randomized address space layout on each machine. Stack canaries also work, as proclets cannot share stack memory and the implementation can maintain a different secret canary value for each proclet. Finally, logical processes trust the network to provide confidentiality and integrity. This is necessary to make remote method invocation and migration efficient by sending raw data and pointers. Modern datacenter NICs have hardware encryption engines that ensure these properties.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">Fault Tolerance</head><p>A proclet may be replicated to tolerate failures. Replication creates backup copies of the proclet's heap, which the runtime places at the same virtual address in different machines. To keep the backup heaps in sync, the primary replica serializes the invocation requests and forwards them to the backup replicas. (This requires proclet operations to be deterministic.) Operations on a replica are totally ordered without overlap within each proclet-a choice that trades off some performance for strong consistency. To reduce replication latency, the primary overlaps execution with the backups, but the primary only returns from an invocation once the backups finish.</p><p>When the system detects the failure of the primary (e.g., due to an RPC time out), it atomically promotes a backup to the primary. To keep the same replication factor, it also adds a new backup by pausing the proclet and copies its heap from the new primary to the new backup replica.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">The Nu Runtime System</head><p>We built Nu, a prototype runtime that provides the logical process abstraction and runs inside a normal Linux environment. Nu shares some architectural and implementation building blocks with Caladan <ref type="bibr">[28]</ref>. Caladan was a good fit for Nu because it provides a user-level threading package with overheads low enough to hide microsecond-scale latency. For example, if a thread blocks waiting for a remote proclet invocation to return, the runtime can quickly context switch to another runnable thread with little overhead. Caladan uses work-stealing to balance these threads across cores, which reduces tail latency <ref type="bibr">[62]</ref>. Caladan also provides an optimized kernel-bypass, user-level TCP/IP networking stack to further reduce proclet communication and migration costs.</p><p>Nu adds &#8776;10,000 lines of C++ code to Caladan. This includes efficient communication infrastructure, a new memory management layer to handle multiple heaps, a well-optimized proclet migration system, and a controller to track the location of proclets. In the following, we describe these components.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Serialization and Communication</head><p>Nu serializes arguments to remote invocations using cereal, an efficient, header-only library for serialization <ref type="bibr">[16]</ref>. Cereal has a compact binary serialization format that supports most STL types, but prohibits raw pointers and references (shared_ptr and unique_ptr are still supported). We modified cereal so that it can serialize function and proclet pointers. To optimize use of cereal, Nu maintains a buffer pool for serialized outputs and eliminates extra data copies.</p><p>Nu uses C++ templates to internally produce code at compile time for serialization and deserialization of remote method arguments. This contrasts with RPC frameworks like Thrift, which require code generation and an interface description language. As a result, developers call remote methods without boilerplate, and they benefit from static type checking.</p><p>We took several steps to optimize remote method invocations. First, Nu opens one TCP connection on each core for each outgoing machine. These connections use specific 5tuples, so they have flow-level affinity matched with the core they are associated with, enabling cache-aware steering <ref type="bibr">[42,</ref><ref type="bibr">60]</ref>. This design increases the number of open connections, but Caladan easily scales to 10, 000 connections, much more than needed for our target environment. Second, Nu applies adaptive batching to combine remote method invocation payloads (requests and responses) into larger TCP transfers without impacting latency <ref type="bibr">[9]</ref>. We modified Caladan to use jumbo frames to increase the benefit of this batching. Third, each connection operates as a closed queuing system, limiting the maximum number of requests in flight. This provides flow control and prevents unbounded memory consumption under overload. Finally, when the caller and callee proclets are in the same machine, Nu substitutes the RPC with a fastpath: a local call without any RPC overheads.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Memory Management</head><p>Nu uses a custom slab allocator to manage each proclet's heap. It includes a per-core object cache to increase scalability, similar to most modern multicore memory allocators <ref type="bibr">[12,</ref><ref type="bibr">14]</ref>. C++ allows a custom definition of operator new() that Nu uses to override memory allocations. Nu keeps track of which proclet each thread is associated with and directs allocations to the correct heap. In the future, we plan to explore specialized proclet allocators too. For example, an arena allocator could benefit short-lived proclets because it need not free objects until the proclet terminates, reducing overheads.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Migration</head><p>Nu migrates proclets across binary-compatible machines under resource pressure. Nu separates migration mechanism from policy.</p><p>Mechanism. To migrate a proclet, the runtime first sets a migration flag, causing method invocations to the migrating proclet to be rejected and retried. Next, it preemptively pauses and saves register state for all the proclet's running threads to ensure that the data is not mutated during migration. Then, it moves proclet data, including heap, stack, and register state, to the new destination. Finally, the runtime clears the migration flag and contacts the controller to update the location of the proclet, ensuring pending and future method invocations are routed to the new destination ( &#167;4.4). We co-designed Nu's RPC layer with migration, and it routes the results of method invocations on migrated proclets back to the caller.</p><p>We optimized Nu's migration datapath. To improve TCP throughput, we use parallel connections and jumbo frames. We found that Linux's mmap (used for creating the proclet space at the destination machine) was a bottleneck, so we modified the Linux kernel to pre-zero freed pages. After this optimization, Nu can migrate at line rate on 100GbE. When we tried 200GbE, mmap again became a bottleneck-in this case due to the Linux kernel's physical frame allocation speed. As a workaround, Nu instead uses mmap to pre-fault a small pool of memory at the destination server. Then, on migration, Nu performs mremap on that memory to reuse prior frame allocations. Future Linux kernel optimizations might avoid the need for this remapping.</p><p>The CPU overhead of migration is moderate in our current prototype: it takes three hyperthreads to saturate 100GbE and five hyperthreads to saturate 200GbE.</p><p>Policy. Nu provides an extensible migration policy interface that dictates which proclets to move and where to move them under resource pressure. Many sophisticated policies are possible, including policies that react to several types of resource pressure (e.g., CPU load, cache pressure, memory capacity, memory bandwidth, network bandwidth, etc.), and policies that co-locate frequently communicating proclets to improve locality. Currently, our prototype ignores locality and focuses only on CPU load and memory capacity, two resources often subject to pressure, but we plan to extend it in the future.</p><p>Because Nu's migration is fast, we found that even the simplest policies work well ( &#167;6). In particular, Nu needs no sophisticated algorithms to predict future resource use, but rather simply migrates proclets at the last moment, when resources are nearly exhausted. To determine when migrations are needed, a monitoring thread in the runtime polls resource use. For memory, it monitors the amount of free memory and begins migrating once it falls below a threshold (e.g., 1 GiB). For CPU, it monitors system core utilization and begins migrating when a threshold of cores are busy. A better alternative might track the queueing delay of runnable threads, allowing Nu to distinguish actual overload from cases where all cores are busy but not overloaded <ref type="bibr">[19]</ref>. We plan to investigate this in the future.</p><p>Nu migrates one proclet at a time until resource pressure is eliminated. To determine which proclet to migrate, Nu uses this formula (where P is the set of proclets on the machine): arg max p&#8712;P RESOURCE_USE(p) MIGRATION_TIME(p) RESOURCE_USE() measures a proclet's use of the resource under pressure, and MIGRATION_TIME() models the migration time of a proclet by considering the size of its heap, as well as the number of threads it must pause and transfer, and the size of thread stacks. This maximizes the pressure alleviation rate and helps Nu optimize for response speed. Nu's runtime collects metrics in real time to estimate this rate.</p><p>To determine the migration destination, Nu queries a global cluster controller, which monitors resource use across servers and returns possible destinations (described next).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Controller</head><p>Nu has a controller that makes cluster-wide decisions, such as proclet placement and virtual address allocation, and tracks information, such as proclet location and resource use. Nu assumes that the controller is highly available. Although our prototype controller is centralized, high availability can be achieved through primary-backup replication or simple recovery: the controller keeps only soft state, so it can always restore its state by querying the servers. Placing proclets. The controller periodically probes servers' available resources. It uses this information to decide where to place a proclet on creation or migration. Currently, it uses a simple policy that spreads proclets evenly across machines. Allocating virtual address segments. Proclets must use nonoverlapping virtual addresses. Therefore, Nu divides the virtual address space into an array of segments. These segments are large enough (4 GiB by default) to leave room for a proclet's heap to grow. The controller keeps lists of allocated and unallocated segments. On proclet allocation, the local runtime contacts the controller to obtain an unallocated segment. Resolving proclet location. The controller keeps a location map from the starting logical address of each proclet to the IP of the machine hosting the proclet. Each local runtime maintains a cache of the location map that contains the proclets it has recently accessed. This eliminates the need for method invocations to communicate with the controller in the common case, moving the controller off the critical path for the steady-state application traffic. When a proclet migrates, the controller updates the map. This causes caches to become stale, so a local runtime may send a method invocation to the wrong machine. When this happens, the remote machine returns an error. The local machine then handles the error by invalidating its cache entry and contacting the controller to find the new machine location.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Replication</head><p>Nu optionally provides traditional primary-backup replication for proclets. This works by forwarding proclet operations from a primary to backup replicas, akin to traditional state machine replication (SMR). One challenge specific to Nu is that a proclet operation can invoke sub-operations on other proclets. The backup replicas will invoke the same suboperations as the primary, but side-effect causing invocations must occur only once, and replicas must see the same results as the primary's operations. Nu supports a variant <ref type="bibr">[58]</ref> of RIFL <ref type="bibr">[37]</ref>  ID of the form proclet id + epoch + sequence number to each proclet-to-proclet invocation. Primaries forward their sub-operation invocation results to the replicas, and replicas reuse the results (identified by the unique ID). Returning the saved results instead of re-executing sub-operations ensures all replicas have the same heap state. As with an unreplicated system, if a primary crashes in the middle of an operation, its sub-operations are re-executed if the unfinished operation is retried. When Nu's controller detects a failed primary, it promotes a backup to be the new primary and updates its location map with the new primary. However, runtimes may have the old primary in their caches, which could cause a "split brain" situation if the old primary continues to serve requests. A standard epoch-based approach <ref type="bibr">[43,</ref><ref type="bibr">55]</ref> can help Nu avoid this problem: each reconfiguration increments an epoch counter and backup proclets reject operations with outdated epochs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Limitations</head><p>Our Nu prototype has some limitations. It requires the use of C++, and though the runtime provides many OS services (timers, external and internal network I/O, synchronization, threads, memory allocation, etc.), it does not yet support all services. Despite these limitations, we ported three very different applications to run on Nu.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Application Case Studies</head><p>We implemented three applications on Nu, which cover a range of proclet sizes, communication patterns, and compute intensities (Figure <ref type="figure">5</ref>). All applications use a Nu-enabled hashtable library. The hashtable partitions the key space with a hash function and uses proclets as data shards. A root proclet has a vector of proclet pointers to these shards and shares them with client proclets to allow direct communication. SocialNetwork (from the DeathStarBench suite <ref type="bibr">[29]</ref>) is a multi-tier, interactive web service, originally built as 12 microservices. Its overall complexity is high, with a fan-out communication pattern and many microservices that have low compute intensity, making it sensitive to both tail latency and RPC overheads. We ported SocialNetwork to a logical process, turning each microservice into a proclet. However, we found that its compute intensity was sometimes too low and that it lacked autoscaling support; both limit its overall scalability. Therefore, we also built a version of SocialNetwork that is better structured for a logical process: this version merges SocialNetwork's small, stateless microservices into a single root class, and scales by spawning it as proclets across machines. Both versions have &#8776;1,000 LOC, compared to 6,843 LOC in the original, which highlights the simplifications afforded by logical processes. Our implementation replaces the external stores used by microservices (Memcached and Redis) with a backend based on our hashtable, and leverages proclet closures to support Redis-like local operations. We also modified our external I/O subsystem to interact with unmodified Thrift-based clients. This is possible because any root proclet can handle any request, as root proclets are stateless. KV Store is a key-value store composed of the Nu-enabled hashtable library and an additional 200 LOC. It is a stateful application that is latency-sensitive and uses significant memory, making it hard to migrate. On each machine, the Nu runtime's external I/O subsystem receives requests from external clients and steers them to the right proclets. The keyvalue store has low compute intensity (1&#181;s/invocation), but large proclet state (2 MiB/proclet). K-means is a workload from Phoenix MapReduce <ref type="bibr">[63]</ref>. Phoenix MR is a NUMA-oriented, shared-memory MapReduce framework designed for single-machine operation. We run k-means-an algorithm that requires multiple iterationsin a Nu-based Phoenix MR port, using proclets to scale across machines. We modified Phoenix's task scheduler to replace worker threads with worker proclets, ship closures to the workers, and shuffle data between mappers and reducers via our hashtable (changing 548 out of Phoenix's 3,066 LOC). K-means is compute-intensive (0.7-4.6ms/invoc.), but has smaller proclet state (31 KiB-19 MiB/proclet). Overall, we found it easy to modify Phoenix MR to work in a distributed setting. Our version follows the same partition-aggregate communication pattern that makes distributed MapReduce frameworks sensitive to stragglers in k-means.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Evaluation</head><p>We evaluate Nu with these three applications, as well as microbenchmarks that measure the impact of specific design decisions. Our evaluation seeks to answer four questions:</p><p>1. Can migration in Nu prevent performance disruption during intense resource pressure? ( &#167;6.1) 2. How does porting applications to Nu impact their performance? ( &#167;6.2) 3. How well does Nu scale with the number of servers? ( &#167;6.3) 4. What is the effect of compute intensity, as well as that of our key design decisions, on Nu's performance? ( &#167;6.4)</p><p>Setup. Except &#167;6.4.2, all other experiments run on a cluster of 32 physical servers in CloudLab <ref type="bibr">[24]</ref>. The servers are c6525-100g instances (24-core AMD 7402P at 2.80GHz, 128 GB RAM, Mellanox ConnectX-5 NIC), connected by a 100 GbE network. We run &#167;6.4.2 on c6525-100g servers, c6525-25g servers (i.e., the variant with 25GbE NICs), and our local servers with a 200 GbE network. Servers run Ubuntu Linux 20.04 with kernel v5.10 patched to pre-zero free pages ( &#167;4.3). We disable ASLR, as Nu does not support it yet.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Application Performance under Resource Pressure</head><p>Nu's proclet-centric design enables fine-grained, rapid migration. The key goal of this design is to achieve high application performance even under resource pressure. To evaluate this, we expose Nu and our applications ( &#167;5) to compute and memory resource pressure and measure the application performance as proclets migrate to other machines. We skip SocialNetwork for compute pressure, as this application can handle it with a standard front-end load balancer. We run experiments using 32 machines, one of which serves as the controller ( &#167;4.4). The remaining machines are either proclet servers or clients, and we partition them appropriately for the application (Figure <ref type="figure">6</ref>). To evaluate Nu's ability to manage disruptions under demanding load conditions, we generate enough client load to use &#8776;70% of CPU capacity across all proclet servers. Then, we induce resource pressure on one proclet server, causing it to migrate its proclets to the other servers.</p><p>In these experiments, memory pressure comes from an antagonist process that allocates memory as fast as Linux's virtual memory subsystem permits (&#8776; 7 GB/s measured in our machine with 4K pages). Once the memory usage of the machine goes above the threshold, Nu starts to migrate proclets to free memory. A good result would show Nu migrating proclets sufficiently quickly to keep up with the allocation rate of the antagonist, without disrupting application performance. To assess the benefit of rapid migration, we compare Nu against a baseline that emulates MigrOS <ref type="bibr">[61]</ref>, a recent RDMA-based live migration system. To emulate MigrOS, we throttle Nu's migration speed to 600 MB/s on average with a 200 ms initial delay. Since migration speed is slower than the antagonist's memory allocation, the machine starts swapping. We swap to a fast device: Linux brd, a block device backed by RAM. (The common alternative-killing processes-is even more disruptive, wastes work, and yields no meaningful baseline.)</p><p>Figure <ref type="figure">7a</ref> shows the 99.9 th percentile latency of client requests in the SocialNetwork application. At t=3.9s, the antagonist starts allocating memory, and once Nu's runtime detects that the free memory size goes below 1 GiB (a con-figurable threshold) at 4.9s, it starts migrating proclets to another machine. During the migration, client-perceived latency increases by less than 19%. At t=5.7s, all proclets have migrated and latency recovers. Figure <ref type="figure">7b</ref> shows the same experiment with the baseline (Nu emulating MigrOS's migration speed). Since it migrates memory slower than the antagonist requests, memory runs out at t=5s and Linux starts swapping. Thus, the 99.9 th latency increases from 639&#181;s to 206ms. The antagonist finishes at t=10s, and latency eventually recovers as memory use drops. Figure <ref type="figure">8</ref> summarizes the results for the same experiment on KVS and k-means, which show a similar trend (graphs in &#167;A.1).</p><p>Compute pressure is harder to handle well than memory pressure as the CPU use can spike instantly. Figure <ref type="figure">9</ref> shows that Nu experiences a higher performance impact when faced with an antagonist that suddenly uses half the available CPU cores. However, disruption is still short-lived as Nu resolves pressure rapidly through fast migration. By contrast, the performance impact on the baseline lasts &#8776;15&#215; longer.</p><p>These results show that Nu frees resources quickly under pressure, migrating proclets faster than Linux can allocate memory. Consequently, the pressuring workload (here, the antagonist) neither runs out of resources nor slows down, and the applications experience only modest tail latency increases. This means that Nu-based applications can use spare resources without risk: Nu can always migrate proclets if other workloads need the resources.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Comparison with Existing Implementations</head><p>Nu seeks to provide logical processes that match or exceed the performance of current architectures even in the absence of resource pressure. Although Nu allows distributed operation, local proclet invocations would ideally match the performance of computing on a single machine. We therefore compare the performance of Nu-based applications to baseline implementations without logical processes on a single machine. We measure tail latency under varying load for longrunning services (SocialNetwork and KVS), and throughput for k-means. A good result would show Nu matching the baseline on NUMA-optimized, compute-intensive applications (e.g., Phoenix k-means), and it would outperform the baseline on RPC-based applications because Nu's fastpath avoids RPC overheads.</p><p>Figure <ref type="figure">10</ref> shows the results. Nu matches or exceeds the baseline's performance in all cases. For SocialNetwork (Figure <ref type="figure">10a</ref>), Nu serves about 850k requests/second with submillisecond 99.9 th percentile latency. The baseline implementation, which runs microservices in Docker containers and uses Thrift RPCs, scales to only 8,000 operations/second, with a 9-60 ms 99.9 th -ile latency (very left of the graph). Nu outperforms the baseline because its fastpath avoids the overheads of loopback RPCs (serialization and network syscalls) with a single machine. For KV Store, Nu outperforms memcached on Linux by 15&#215;, serving 12M operations/second to mem- Figure <ref type="figure">7</ref>: SocialNetwork runs alongside a memory antagonist that starts at 3.9s. When the memory usage reaches the high watermark, Nu starts migrating proclets rapidly (the gray window). By matching the allocation speed of the antagonist, Nu keeps the memory usage flat and resolves the pressure in 0.86s. SocialNetwork's 99.9 th -ile latency is unaffected. By contrast, the baseline fails to migrate fast enough and starts swapping, which leads to 206ms latency (322&#215;). After the antagonist finishes, the memory usage and the latency gradually return to normal.  cached's 800k at sub-millisecond latency (Figure <ref type="figure">10b</ref>). Crucially, Nu performs as well as the same KV Store running on Caladan <ref type="bibr">[28]</ref>, which also uses kernel-bypass networking and a user-level threading runtime. Finally, Nu matches Phoenix MR's performance (Figure <ref type="figure">10c</ref>). Phoenix MR is designed for scalability on a single NUMA machine, and exploits shared memory for performance, so it is a strong baseline. The kmeans workload requires sharing the intermediate clustering result across all workers. In a shared-memory setting, this shared state can be a global variable (as in the baseline), but in a distributed framework would involve per-worker copies. Since Nu supports migration, it must be prepared to oper-ate distributedly and keep per-worker (per-proclet) copies, which amplifies the application's cache footprint on a single machine. We therefore compare two Nu setups: per-worker copies (label Nu) and global state (Nu G ), and add a modified baseline with per-worker states (Baseline P ). Nu G measures the overhead of Nu's infrastructure with pinned (unmigratable) proclets. The overall results show that Nu's proclet invocations on a single machine are fast enough to match the performance of single-machine baselines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Workload</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Scalability</head><p>Next, we consider how Nu scales as the proclets of a logical process are spread across many machines. For each of our three applications, we run an experiment where the runtime assigns its proclets round-robin across servers. We consider two versions of the SocialNetwork application: the one from &#167;6.2, which we wrote with logical processes and proclet decomposition in mind; and a second version that mirrors the exact microservice decomposition in DeathStarBench. We measure throughput for equal-sized input, i.e., a strong-scaling setup. Because Nu's local proclet invocation is faster than remote invocation, the single-machine setup has a substantial efficiency advantage, which makes linear scalability difficult to achieve. An ideal result would therefore show scalability close to linear as the number of machines increases.</p><p>We show the results in Figure <ref type="figure">11</ref>. Nu scales well in all three applications, and achieves nearly linear scalability for KV Store and k-means. The SocialNetwork application is the most challenging to scale (Figure <ref type="figure">11a</ref>). A direct port from the original microservice architecture to Nu (where each microservice becomes one proclet) results in many proclets with methods that have low compute intensity. When invoked remotely, calls to these methods can be costly, while the additional resources of a remote machine speed up the more compute-intensive invocations. On balance, Nu's throughput  still increases with the number of machines-from 786k to 1.37M ops/sec-but less so than when the application is decomposed into proclets that have sufficient compute intensity (786k to 8.44M ops/second). However, both Nu-based implementations perform one to two orders of magnitude better than the DeathStarBench baseline (45k ops/sec). The KVS implementation on Nu scales very well (Figure <ref type="figure">11b</ref>) as it relies on client-side request steering (in response to hints from Nu's runtime) to direct clients' requests to the right machine, which then makes a local proclet invocation. K-means (Figure <ref type="figure">11c</ref>) has high compute intensity, which makes scaling easy. We conclude from these results that Nu's logical processes scale well when proclets are distributed across machines if a proclet's methods have sufficient compute intensity. &#167;6.4.1 evaluates the impact of compute intensity on Nu's efficiency.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Design Drill-Down</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.1">Impact of Compute Intensity</head><p>We now examine the efficiency of Nu's mechanism for proclet method invocation ( &#167;3.2). Intuitively, the more compute an invocation does, the easier it is to amortize the overheads of the invocation (serialization, networking, and threading); yet, the lower these overheads are, the better Nu's performance becomes. Our experiment is a sensitivity analysis in which we vary the compute duration in a proclet's method between 0.1 and 50&#181;s, and we measure the aggregate invocation throughput. We use sufficient threads to maximize throughput, saturating the machine that runs the proclet. We consider two cases for Nu: two proclets in the same machine (local), and proclets in different machines (remote). We compare the performance of Nu against three common mechanisms to invoke a task: (i) a function call in a Linux process; (ii) an RPC using Thrift, a popular open-source RPC framework <ref type="bibr">[75]</ref>; and (iii) an RPC using a modified Thrift that uses Caladan <ref type="bibr">[28]</ref> to reduce TCP and threading overheads. We measure throughput in a closedloop setting. A good result would show performance of Nu close to local function calls for local invocations, and at least as good as Thrift for remote invocations.</p><p>The results in Figure <ref type="figure">12</ref> show that when the invocation is local, Nu's performance tracks closely that of Linux function calls. This happens because of Nu's fastpath for local invocations. When invocation is on a remote proclet, compute intensity (invocation duration) matters. For short invocations (0.1&#181;s), Nu is &#8776;13&#215; worse than local function calls, but 2.4&#215; better than Caladan-based Thrift (and 29.4&#215; better than Thrift on Linux). As the invocation becomes more computeintensive, these gaps close: for a 10&#181;s task, Nu's remote invocation achieves 85% of local function call throughput. We conclude that locality matters for remote proclet invocations with low compute intensity, but that Nu delivers near-single- machine performance for tasks with compute intensity as low as 10&#181;s.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.2">Migration Time and Bandwidth</head><p>We now measure the time it takes to migrate a proclet in Nu. The experiment migrates proclets of varying sizes to another machine and measures the migration time. We vary the test proclet's memory size by adjusting its heap size, from 64 KiB to 16 MiB. The proclet has a single thread with a small stack (64 bytes). A good migration latency would be &#8776;100&#181;s for modest-sized proclets-orders of magnitude faster than traditional resource balancing mechanisms. For larger proclets, we expect the latency to approach network transfer time.  Figure <ref type="figure">13</ref> shows the results. With 100 GbE (i.e., the network setting used for all other experiments), Nu migrates small proclets (up to 1 MiB) in under 125&#181;s. This corresponds to a bandwidth of 3-9 GB/s. For larger proclets (2 MiB-16 MiB), the latency varies from 200&#181;s to 1,500&#181;s, which corresponds to a bandwidth of &#8776;11 GB/s, close to the 100 GbE line rate. The results of 25 GbE and 200 GbE show similar trends. Proclet migration benefits from higher network bandwidth; for example, with 200 GbE, Nu only takes &#8776;100&#181;s to migrate a 2 MiB proclet. We conclude that Nu migrates proclets quickly and that its migration uses the network efficiently.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.3">Controller Performance</head><p>To understand whether Nu's controller can become a performance bottleneck, we benchmark it as a standalone component to measure its capacity. Depending on the type of control message, the controller achieves 0.79-0.96 million msg/s. This is two to three orders of magnitude higher than the real load demand (542-21,450 msg/s) we measured in the end-to-end experiments ( &#167;6.1). This makes sense as Nu's runtime caches the proclet location resolution result, thereby moving the controller off the critical path of steady-state application traffic. The controller is only involved in the control plane of initial proclet location resolution and migration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.4">Proclet Replication</head><p>Nu allows replicating proclets for fault-tolerant operation. Replication imposes overhead because it forwards all invocations of a proclet to a backup in a different machine ( &#167;3.5). We measure the invocation throughput of calling 8,192 remote replicated proclets, as we vary the compute intensity as in &#167;6.4.1. These invocations do not have sub-operations. The baseline is the same setup without replication. A good result would show a modest loss of throughput with replication.  Figure <ref type="figure">14</ref> shows the results. Throughput drops by 37% with a 0.1&#181;s compute intensity, but this drop gradually shrinks to 18% as compute intensity grows to 30&#181;s. Replication adds &#8776;1.2&#181;s to each operation to invoke the backup proclet, an overhead that gets amortized at larger compute intensities. This result shows that fault-tolerance for critical proclets is feasible and need not come at severe performance cost.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>We presented logical processes, a new abstraction that decomposes an application into proclets, which are small units of state and compute. Logical processes and proclets solve a key hindrance to increasing datacenter resource utilization: the lack of microsecond-granularity fungibility in resource use.</p><p>We found that logical processes and our Nu prototype improve fungibility by making applications granular and migrating proclets quickly under resource pressure. For three applications, Nu matches the performance of strong baselines, scales well, and migrates their proclets within hundreds of microseconds with little disruption to application performance.</p><p>Nu is available as open-source software <ref type="bibr">[66]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Appendix</head><p>In this appendix, we include the end-to-end performance results under resource pressure that were not included in &#167;6.1 due to the space constraint.</p><p>A.1 Application Performance Under Memory Pressure Figure <ref type="figure">15</ref> presents the 99.9 th tail latency of KV store under memory pressure. The results are similar to the SocialNetwork results (Figure <ref type="figure">7</ref>). Figure <ref type="figure">16</ref> shows the K-Means performance under memory pressure using the throughput metrics as it is a batch application. Nu achieves 97% throughput during migration, whereas the baseline only achieves 67% throughput.</p><p>Here we do not show the memory utilization as K-Means has a tiny per-machine memory footprint. The baseline has lower performance mainly because of the long task pausing time caused by slow migration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 Application Performance Under Compute Pressure</head><p>In the next experiment, we evaluate Nu's performance under compute pressure. Compute pressure is harder to handle well than memory pressure, since Nu's solution to resource pressure-proclet migration-itself consumes compute resources. The antagonist process in this experiment is a syn-thetic CPU-spinning workload that occupies half the CPU cores on the machine, reducing the compute resources available both to the application and to Nu's proclet migrations. In addition, the CPU load of the antagonist can spike instantly; this is different from the memory load which only increases gradually. Therefore, we would expect a higher impact on application performance than when Nu migrates proclets under memory pressure. A good result for Nu would show that the application still achieves acceptable performance, even if degraded for some (ideally short) time.</p><p>Figure <ref type="figure">17a</ref> shows Nu's results. At t=4.9s, the compute pressure starts on one machine, taking away half of the application cores, and Nu immediately starts migrating proclets to reduce load on the machine. 99.9 th latency increases from 33 &#181;s to 1086 &#181;s. This latency spike makes sense as the machine's compute resources are degraded by 50% and Nu needs additional compute to migrate proclets. The latency gradually decreases as proclets migrate and the other machine starts serving client requests, and soon recovers back to 33 &#181;s as the migration ends at t=5.44s. In contrast, for the baseline, the latency disruption lasts 7.96s, which is 15X of the Nu's 0.54s duration. Figure <ref type="figure">18</ref> shows the result of K-means. Nu takes 24ms to resolve the pressure and achieves 94.2% throughput  For KV store under memory pressure, Nu is able to maintain 99.9 th tail latency under 85&#181;s as it migrates proclets faster than the memory allocation speed of the antagonist. In contrast, the baseline suffers from poor tail latency (&#8776; 2&#215;10 6 &#181;s) since it cannot keep up with the allocation speed and has to swap memory.      during migration. In contrast, the baseline takes 0.65s and only achieves 49.7% throughput. These results demonstrate that Nu's logical processes react quickly to CPU pressure. Some performance degradation is unavoidable, but the impact is short-lived: after a sub-second delay, proclet migrations relieve the resource pressure.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>A logical process can be implemented in other languages too.</p></note>
		</body>
		</text>
</TEI>
