<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Remote Direct Memory Introspection</title></titleStmt>
			<publicationStmt>
				<publisher>USENIX Security 2023</publisher>
				<date>08/15/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10524160</idno>
					<idno type="doi"></idno>
					
					<author>Hongyi Liu</author><author>Jiarong Xing</author><author>Yibo Huang</author><author>Danyang Zhuo</author><author>Srinivas Devadas</author><author>Ang Chen</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Hypervisors have played a critical role in cloud security, but they introduce a large trusted computing base (TCB) and incur a heavy performance tax. As of late, hypervisor offloading has become an emerging trend, where privileged functions are sunk into specially-designed hardware devices (e.g., Amazon’s Nitro, AMD’s Pensando) for better security with closer-to-baremetal performance. In light of this trend, this project rearchitects a classic security task that is often relegated to the hypervisor, memory introspection, while only using widely-available devices. Remote direct memory introspection (RDMI) couples two types of commodity programmable devices in a novel defense platform. It uses RDMA NICs for efficient memory access and programmable network devices for efficient computation, both operating at ASIC speeds. RDMI also provides a declarative language for users to articulate the introspection task, and its compiler automatically lowers the task to the hardware substrate for execution. Our evaluation shows that RDMI can protect baremetal machines without requiring a hypervisor, introspecting kernel state and detecting rootkits at high frequency and zero CPU overhead.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Security and performance are both first-order objectives at cloud scale, yet today's hypervisors struggle in both aspects. Hypervisor-based virtualization introduces a large TCB with millions of lines of low-level code, where CVEs (Common Vulnerabilities and Exposures) are continuously unearthed <ref type="bibr">[55,</ref><ref type="bibr">109]</ref>. Meanwhile, hypervisors also incur a heavy “virtualization tax,” consuming significant CPU cycles, taking away resources from revenue-generating VMs, and creating performance contention with tenant workloads <ref type="bibr">[82]</ref>. To achieve better security and performance, cloud providers are increasingly invested in hypervisor offloading, using tailor-made hardware devices <ref type="bibr">[4,</ref><ref type="bibr">8,</ref><ref type="bibr">14,</ref><ref type="bibr">19,</ref><ref type="bibr">37,</ref><ref type="bibr">39]</ref> with closed-source designs.</p><p>These emerging devices are particularly important in baremetal-as-a-service (BMaaS) offerings, where entire installations are provided to a single tenant without a software hypervisor. BMaaS not only provides ideal performance and isolation for the tenant, but also increases the provider's revenue as 100% of CPU cycles are for sale; it has gained a foothold in all major clouds <ref type="bibr">[2,</ref><ref type="bibr">7,</ref><ref type="bibr">9,</ref><ref type="bibr">10,</ref><ref type="bibr">16,</ref><ref type="bibr">33]</ref>. Owing to the absence of the hypervisor, security tasks are often anchored in these customized devices, with Amazon's Nitro <ref type="bibr">[8]</ref>, Intel's IPU <ref type="bibr">[15]</ref>, AMD's Pensando <ref type="bibr">[4]</ref>, and Microsoft's Fungible <ref type="bibr">[14]</ref> vying for the market. 
These devices are attached to host servers as PCIe peripherals, akin to network interface cards (NICs), but their execution environments are shielded from host CPUs. They run protection tasks with a higher privilege than the OS or hypervisor. Since this mode of execution operates beneath what is traditionally known as Dom0, we will henceforth call this paradigm “Dom(-1) security.” Remote direct memory introspection, or RDMI, rethinks a classic security task in light of the Dom(-1) paradigm. Kernel memory introspection <ref type="bibr">[61,</ref><ref type="bibr">64,</ref><ref type="bibr">65,</ref><ref type="bibr">89]</ref> is an important forensic technique. It enables the cloud provider to perform security telemetry and detect signs of malice (e.g., rootkits), while staying transparent to the tenants by virtue of operating underneath their workloads. Traditionally, this is relegated to the hypervisor, which periodically acquires memory snapshots from guest VMs for security analysis (e.g., reconstructing task_struct lists from raw memory). In contrast to this conventional approach, RDMI sinks memory introspection tasks into the hardware layer by a whole-stack redesign; moreover, it only uses COTS (commercial-off-the-shelf) devices available to everyone instead of closed-source devices. The key enabler for RDMI is the increasing deployment of commodity programmable hardware in the cloud. In particular, RDMA NICs (RNICs) that enable remote direct memory access <ref type="bibr">[50,</ref><ref type="bibr">58]</ref>, and P4 programmable switches that can realize hardware control loops <ref type="bibr">[81,</ref><ref type="bibr">97]</ref>, form its Dom(-1) substrate:</p><p>&#8226; Memory datapaths: RNICs expose host memory to remote clients, providing a telemetry channel to acquire memory snapshots over a network in a granular manner. 
The conventional use of RDMA is to accelerate application-layer cloud workloads <ref type="bibr">[50,</ref><ref type="bibr">58]</ref>, whereas we use it as a vantage point for kernel memory visibility. RDMA datapaths are simple and fast, and remote accesses are fully transparent to the introspected host. &#8226; Control loops: Kernel introspection is a complex task that goes beyond individual memory accesses; e.g., it might need to fetch the Linux process linked list starting at init_task, parse its next pointers, and traverse the entire list of task_structs. This requires a more expressive programming model than RDMA. We observe that introspection control loops are an ideal fit for programmable switches, which serve as a platform for realizing new control protocols at hardware speeds <ref type="bibr">[81,</ref><ref type="bibr">97]</ref>.</p><p>Combined, RDMI executes in Dom(-1) with hardware-based memory accesses (using RDMA NICs) and inspection (using programmable switches) at ASIC speeds. This operating regime also enables RDMI to introspect baremetal installations without a hypervisor. It can be deployed to a ToR (Top-of-Rack) switch that serves a set of baremetal servers equipped with RNICs, offering security protection with novel properties not found in hypervisor-based introspection:</p><p>&#8226; Baremetal: It introspects kernel memory in baremetal installations, without requiring a hypervisor. &#8226; Remote: The introspection engine is disaggregated from host CPUs and executes over the network. &#8226; Efficient: Both the datapath (memory operations) and the control path (introspection logic) are realized in ASICs. &#8226; Commodity: It relies on widely available, COTS hardware technologies without any modification. &#8226; Programmable: Introspection tasks can be programmed in a declarative language with a few lines of code. On the last point, RDMI abstracts away the complexities of the ASICs and the intricacies of kernel introspection from the user. 
Instead of directly asking the user to program low-level RDMA and P4 ASICs, which would be burdensome and error-prone, RDMI exposes a set of functional operators to specify a variety of introspection tasks. Users program against a uniform “kernel graph” abstraction, where vertices are the kernel data structures and edges are the pointer relations. Supporting this abstraction are the RDMI compiler and runtime that facilitate its Dom(-1) execution. The RDMI compiler maps a query through an intermediate representation (IR) that manipulates an abstract introspection machine for kernel traversal, with its instruction set instantiated in RDMA and P4 ASICs. The RDMI runtime provides auxiliary utilities (e.g., driving the introspection to different kernel locations and fetching kernel memory with RDMA operations), also shared across tasks. This design enables runtime programmability <ref type="bibr">[104,</ref><ref type="bibr">107]</ref>: queries can be reconfigured in a live manner by the compiler generating different control plane configurations without downtime.</p><p>We have developed an RDMI prototype and conducted a comprehensive evaluation, with the following findings. The RDMI defense platform is capable of introspection frequencies that are orders-of-magnitude higher than hypervisor solutions, and it effectively performs memory telemetry and detects rootkits in baremetal machines. Moreover, security protection does not require any CPU involvement and incurs minimal performance disturbance to tenant workloads. We have released our prototype in open source <ref type="bibr">[34]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Rethinking Memory Introspection</head><p>Memory introspection is an important security task for the cloud <ref type="bibr">[57,</ref><ref type="bibr">64,</ref><ref type="bibr">76,</ref><ref type="bibr">90,</ref><ref type="bibr">91,</ref><ref type="bibr">110]</ref>. The long body of work can be abridged as follows. Since its inception two decades ago <ref type="bibr">[61]</ref>, researchers have shown that rich security insight can be gleaned from memory snapshots. This works by having the hypervisor scan important kernel data structures in the VMs to detect signs of malice (e.g., rootkits). Operating in Dom0, introspection code executes with higher privilege than tenant VMs, which reside in DomU. Thus, it has a full view of VM memory and can inspect any kernel locations. Sinking security protection from VMs into the hypervisor also means that introspection can be provided “as-a-service” transparently to the tenants. The only setup parameters required by the hypervisor are metadata about the guest kernels (e.g., kernel versions, ASLR offsets), and from there on, the introspection code navigates the kernel state by itself. To conquer the “semantic gap” <ref type="bibr">[65]</ref>, introspection programs must ingeniously piece together disparate data from raw memory bytes; e.g., they may enumerate Linux's task_struct process descriptors from raw memory.</p><p>Our project rethinks memory introspection in light of the trend of hypervisor offloading to Dom(-1) hardware. Isolating tenants with hypervisor software has been the de-facto cloud paradigm, but Dom(-1) technologies are chipping away at this abstraction. This is not only due to the decline of Moore's law necessitating better use of CPU cycles, but also a desire for stronger security due to leaner hardware TCBs. 
Historically, the desire for baremetal execution was reflected in various virtualization hardware features (e.g., DPDK <ref type="bibr">[13]</ref>, SPDK <ref type="bibr">[35]</ref>, VT-x <ref type="bibr">[20]</ref>, MPK <ref type="bibr">[17,</ref><ref type="bibr">102]</ref>), but the recent rise of baremetal-as-a-service and Dom(-1) hardware devices marks a more significant milestone. Industry vendors have developed tailor-made devices <ref type="bibr">[1,</ref><ref type="bibr">4,</ref><ref type="bibr">8,</ref><ref type="bibr">14,</ref><ref type="bibr">19,</ref><ref type="bibr">39]</ref>, where virtualization functions are not only offloaded to hardware but also gated from host CPUs via an “airgap” for security. Implementing security functions in Dom(-1) reduces the TCB, minimizes interference with the tenants, provides stronger protection in case of host compromises, and saves operational costs as more server CPU cycles are made available to tenants. RDMI rearchitects memory introspection to operate in Dom(-1), but does so only with COTS devices.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Remote direct memory introspection</head><p>Introspection datapath: The RDMI memory datapaths are built upon one-sided RDMA “verbs.” For instance, RDMA READ operations take memory addresses and sizes as parameters, and the requests are encapsulated as Ethernet packets and sent over the wire. Once the requests arrive at the remote machine, the Ethernet packets are translated by the NIC hardware into DMA requests over PCIe, eliminating remote CPU involvement from the datapath and achieving ASIC speeds for memory accesses. In the context of introspection, the RDMA READs might fetch a task_struct or several of its fields. RDMA connections are established using queue pairs (QPs) with unique identifiers (QPNs, or queue pair numbers) at both the sender and the receiver sides. By default, RDMA uses virtual memory addresses and thus requires address translation at the NIC hardware, but it is configurable to use physical addresses directly as well <ref type="bibr">[88,</ref><ref type="bibr">101]</ref>. This is important for RDMI as it manipulates kernel objects, most of which are directly (i.e., linearly) mapped to the physical addresses.</p><p>Introspection control: While one-sided RDMA is a promising start, memory introspection goes far beyond individual memory operations. Introspection tasks require various types of kernel traversals (e.g., traversing a linked list of task_structs), which in turn involve pointer arithmetic, range checks, and complex control flow. One na&#239;ve approach is to implement the control path in software (e.g., on another server under the same rack). However, many downsides of driving RDMA operations in software have been noted by prior work, such as CPU cycle wastage due to polling <ref type="bibr">[46]</ref> and longer latency due to software processing <ref type="bibr">[42]</ref>. More importantly, this undercuts our goal of Dom(-1) execution in hardware ASICs. 
Our solution is inspired by recent projects that use P4 programmable switches to implement control logic that drives RDMA tasks <ref type="bibr">[73,</ref><ref type="bibr">77,</ref><ref type="bibr">105]</ref>. RDMI executes the control loop at hardware speeds inside a programmable switch, driving the memory introspection datapath.</p></div>
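To make the preceding point concrete, the control loop that the switch must drive can be sketched in ordinary software terms. This is our illustration, not RDMI code; the memory layout, field offsets, and addresses are invented:

```python
# Sketch of the introspection control loop RDMI offloads to the switch:
# chase tasks.next pointers from init_task until the list wraps around.
# Memory is mocked as a dict of address -> word; offsets are hypothetical.
OFF_NEXT, OFF_PID = 0x0, 0x8  # invented field offsets within task_struct

def traverse_pids(mem: dict, init_task: int) -> list:
    """Collect the pid of every task on the circular list."""
    pids, cursor = [], init_task
    while True:
        pids.append(mem[cursor + OFF_PID])   # one RDMA READ: fetch pid
        cursor = mem[cursor + OFF_NEXT]      # one RDMA READ: pointer chase
        if cursor == init_task:              # wraparound: traversal done
            return pids
```

Each loop iteration corresponds to a round of RDMA READs (one for the pid field, one for the next pointer); RDMI's contribution is realizing this loop inside the switch ASIC rather than on a CPU.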
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Overview</head><p>Figures 1(a)-(b) depict the Dom(-1) substrate and (c) shows the key components of RDMI. To the best of our knowledge, RDMI is the first defense platform that a) exposes a declarative interface for users to articulate introspection tasks; b) compiles a wide range of such tasks to a reconfigurable set of hardware engines; c) executes introspection tasks at ASIC speeds with zero CPU overheads.</p><p>Threat model. Our threat model is that of a fully untrusted kernel (e.g., OS compromises due to kernel-level rootkits), thus the OS may exhibit arbitrary behaviors and is capable of removing or tainting any software security agent running on host CPUs (e.g., kernel modules). However, we assume that the kernel can boot into a known-good state (e.g., by leveraging trusted boot [38] hardware) and compromises only occur after that during runtime. This trusted setup also initializes RDMI: upon boot, we create a number of RDMA connections between the switch and the introspected machine, and grant these connections physical access to the host memory.</p><table rows="7" cols="2"><head>Table 1: Introspection operators. Highlighted are new operators or those that take a different meaning from Gremlin [95].</head><row role="label"><cell>Operator</cell><cell>Description</cell></row><row><cell>kgraph(addr)</cell><cell>Initialize traversal at kernel addr</cell></row><row><cell>traverse(ptr_nxt, ptr_end, type)</cell><cell>Traverse ptr_nxt until ptr_end with type</cell></row><row><cell>in(ptr)</cell><cell>Dereference ptr into a different data structure</cell></row><row><cell>iterate(array, n, type)</cell><cell>Iterate an array of type for n steps</cell></row><row><cell>values(f1, .., fn)</cell><cell>Acquire values from current address</cell></row><row><cell>assert(pred1, .., predn)</cell><cell>Assertion on acquired values</cell></row></table><p>After the setup, we assume that the Dom(-1) substrate, as well as the introspection programs that it hosts, are trusted. Notably, at runtime, the trusted computing base excludes software hypervisors, which is a sizable reduction.</p><p>Non-goals. There are several worthwhile goals that are nevertheless beyond the scope of RDMI. 
Although RDMI enables expressive policies to be developed, our goal is not to propose new introspection policies that are more adept at detecting kernel compromise. Similarly, improving detection accuracy with new analysis algorithms is also not our focus. Section 8 also describes a few other limitations in detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Programming Introspection Queries</head><p>RDMI exposes a declarative interface for users to specify introspection tasks, so that they are not burdened with low-level operations on baremetal RDMA and P4 ASICs. We observe that a custom programming model is possible because RDMI tasks are highly specialized, essentially treating the kernel data structures as a graph and traversing the graph following pointers. Thus, we propose a domain-specific language (DSL) drawing inspiration from a widely-used graph query language, Gremlin <ref type="bibr">[95]</ref>. Table <ref type="table">1</ref> includes the key operators, and we showcase their expressiveness with concrete tasks.</p><p>Q1: Task list traversal <ref type="bibr">[40]</ref>. Let us start with a “hello world” example, where the user wishes to query all active processes and their IDs. This functionality is akin to the 'pslist' plugin in the Volatility toolkit <ref type="bibr">[40]</ref> for memory dump analysis. Linux uses struct task_struct as the data structure for a process, organized in a linked list with the global kernel symbol init_task as the entry. Each task_struct contains key process attributes, e.g., process IDs (int pid) and credentials (struct cred; used in Q2). 
We depict the data structures and key variables.</p><p>
/* init_task.c */
struct task_struct init_task;

/* sched.h */
struct task_struct {
    /* definition in type.h */
    struct list_head {
        struct list_head *next;
        struct list_head *prev;
    } tasks;
    int pid;
    /* definition in cred.h */
    struct cred {
        kuid_t uid;
        kgid_t gid;
    } *creds;
};
</p><p>[Figure: the circular task_struct list rooted at init_task, linked via tasks.next; each element's creds pointer leads to a struct cred holding (uid, gid), where (0, 0) means root privilege; a malicious task may carry cred (0, 0).]</p><p>RDMI articulates this traversal in three lines of code below. A useful mental model for an RDMI query is that of a “cursor” pointing to a specific kernel address, which moves about across the kernel graph based on the introspection logic:
1 /* Traverse all tasks and acquire pids */
2 kgraph(init_task) // traversal source
3 // traverse init_task.tasks.next until wraparound
4 .traverse(tasks.next, &amp;init_task.tasks, task_struct)
5 .values(pid) // query pid for each task</p><p>A query always starts with a kgraph operator (Line 2), which initializes the cursor to some predefined address, such as the global symbol init_task whose address is statically determined upon boot. Line 4 defines the footprint of the traversal, performing a sequence of pointer chasing operations with the ptr_nxt field (see Table <ref type="table">1</ref> for operator arguments; in this case the argument is the tasks.next field in task_struct) until it encounters ptr_end (in this case set to init_task.tasks). In other words, the traversal halts when it wraps around and revisits init_task. The type is set to task_struct, so RDMI understands the data structure type of each traversed element. Finally, Line 5 uses values(pid) to acquire the process ID field in each visited element. Query results are forwarded to a logging server for further analysis.</p><p>Q2: Privilege escalation analysis <ref type="bibr">[5]</ref>. 
This query traverses each task_struct as in Q1, but it takes an excursion from the linked list to another data structure struct cred, which stores process credentials (user ID uid, group ID gid). Non-root processes have uid and gid values larger than 1000; a rootkit may maliciously modify these values to zero for some user process (e.g., a shell) to escalate its privilege to root access <ref type="bibr">[5]</ref>. This query is a four-liner. Line 4 zooms in on the external data structure struct cred, which hangs off of the linked list. (Note that cursor movements in Q1 do not require in, as the visited fields are contained in the current data structure (i.e., struct task_struct) for traversal.)</p><p>Q3: Virtual filesystem hook detection <ref type="bibr">[76]</ref>. A rootkit may modify function pointers to divert execution to its malicious code. Q3 asserts that VFS function hooks must be within a known-good range. As depicted below, /proc is a virtual filesystem providing administrative utilities, e.g., user-level forensic tools such as ps rely on information from /proc. Its root inode is represented by the proc_root data structure. Filesystem operations eventually invoke read, llseek, and other file operations, which are specified as function pointers in struct file_operations as contained in proc_root.</p><p>
/* fs/proc/root.c */
/* root node for /proc */
struct proc_dir_entry proc_root = {
    .low_ino = PROC_ROOT_INO,
    .namelen = 5,
    .proc_fops = &amp;proc_root_operations,
    .name = "/proc",
    &#8230;
};
static const struct file_operations proc_root_operations = {
    .read = generic_read_dir,
    .llseek = generic_file_llseek,
    &#8230;
};
</p><p>[Figure: `ls -l /proc` triggers a VFS read; the rootkit diverts the .read pointer from its legitimate target (e.g., at 0xffffffff9fde0000) to malicious code (e.g., at 0xffffffff9fde1000), falsifying the dr-xr-xr-x listings shown to administrative tools.]</p><p>By modifying the function pointers, a rootkit can manipulate the forensic outputs and hide certain processes from such administrative tools <ref type="bibr">[76]</ref>. 
This RDMI query checks that the read function pointer must be within a known-good range (e.g., kernel text from 0xffffffff9fc00000 to 0xffffffffa08031d1). The assertion in Line 4 evaluates a predicate and triggers notifications upon failure.</p><p>
1 kgraph(proc_root)
2 .in(proc_fops)
3 .values(read)
4 .assert(KERN_TXT_BEGIN &lt; read &lt; KERN_TXT_END)
</p><p>Q4: Network filter hijacking detection <ref type="bibr">[25]</ref>. Netfilter <ref type="bibr">[26]</ref> is a framework within the network stack, allowing registered callback functions upon packet events. Rootkits commonly inject adversarial callbacks to intercept network traffic. For instance, a rootkit may watch for port-knocking packet sequences as a command-and-control signal for triggering an attack, and then drop these packets immediately <ref type="bibr">[64]</ref>. As shown below, netfilter hooks are retrieved at the init_net symbol where struct netns_nf is stored. Inside struct netns_nf, hooks holds a two-dimensional array, and each array element points to a struct nf_hook_entries. This query requires a two-dimensional, nested traversal.</p><p>
/* net_namespace.c */
struct net init_net;

/* net_namespace.h */
struct net {
    struct netns_nf {
        struct nf_hook_entries *hooks[13][8];
    } nf;
};

/* netfilter.h */
struct nf_hook_entries {
    u16 num_hook_entries;
    struct nf_hook_entry hooks[];
};
struct nf_hook_entry {
    nf_hookfn *hook;
    void *priv;
};
</p><p>[Figure: init_net leads to netns_nf, whose two-dimensional array holds nf_hook_entries pointers; each nf_hook_entries in turn holds an array of nf_hook_entry elements whose hook fields point to callbacks func1, func2, ..., funcn.]</p><p>Lines 2+6 denote the nested traversal, where iterate operates on array elements instead of linked lists. Line 4 dereferences the value contained at the current array element as a pointer, resulting in an excursion to a different data structure struct nf_hook_entries. This struct contains a second array of registered hooks, and num_hook_entries is the number of entries. 
Iterating through this dynamically-allocated array, RDMI checks each of the hook functions.
1 kgraph(init_net)
2 .iterate(nf.nf_hooks, 13 * 8, ptr_t)
3 //NFPROTO_NUMPROTO=13, NF_MAX_HOOKS=8
4 .in(this) // deref current value, see appendix
5 .values(num_hook_entries)
6 .iterate(hooks, num_hook_entries, nf_hook_entry)
7 .values(hook)
8 .assert(KERN_TXT_BEGIN &lt; hook &lt; KERN_TXT_END)</p><p>Q5-Q11. RDMI is expressive enough to support a range of introspection queries, summarized in Table 2. The Appendix contains the detailed queries and descriptions.</p></div>
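The assertion pattern shared by Q3 and Q4 reduces to a range check over every fetched hook value. Here is a plain-Python rendering of that check (our illustration, not RDMI code; the bounds are the example kernel text range from Q3):

```python
# Sketch of the Q4-style check: every registered hook must point into the
# known-good kernel text range, otherwise it is flagged as suspicious.
KERN_TXT_BEGIN = 0xFFFFFFFF9FC00000  # example range borrowed from Q3
KERN_TXT_END   = 0xFFFFFFFFA08031D1

def find_bad_hooks(hook_table):
    """Return hooks outside (KERN_TXT_BEGIN, KERN_TXT_END)."""
    bad = []
    for entries in hook_table:       # outer iterate: the 13 x 8 slots
        for hook in entries:         # inner iterate: nf_hook_entry array
            if not (KERN_TXT_BEGIN < hook < KERN_TXT_END):
                bad.append(hook)     # a failed assert triggers a notification
    return bad
```

In RDMI a failing assertion triggers a notification at ASIC speed; here the offending hook is simply collected.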
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Abstract Introspection Machine</head><p>We now describe how the RDMI compiler decomposes operator-level introspection logic into hardware-level implementations through an intermediate language. We draw inspiration from existing projects that compile functional operators to P4 programmable switches <ref type="bibr">[63,</ref><ref type="bibr">72,</ref><ref type="bibr">108]</ref>; however, unlike existing compilers that directly lower the policy onto the hardware layer, RDMI introduces an indirection layer, which is an intermediate language that manipulates an abstract introspection machine (AIM). The key benefit provided by the AIM layer is to support runtime programmability <ref type="bibr">[107]</ref>, that is, the ability to perform live query reprogramming without taking down the deployment. Existing work <ref type="bibr">[63,</ref><ref type="bibr">72,</ref><ref type="bibr">108]</ref> compiles each security task into a different P4 program, so deploying a new query requires reflashing the switch with a different program. This incurs downtime and cannot be performed in a live manner <ref type="bibr">[104]</ref>, so the switch is fixed to specific queries and cannot be reprogrammed with a new task on demand. In RDMI, the AIM layer exposes a minimalistic set of five instructions, which are instantiated in hardware and shared across RDMI operators for runtime programmability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Designing the AIM</head><p>RDMI achieves runtime programmability by designing a “master” P4 program that provides several introspection engines in hardware, corresponding to five AIM instructions: LOAD, MOV, PUSH, POP, JMP. Since all instructions are embedded in the master program, query changes do not require program modifications. Rather, deploying a new query only requires generating a new stream of AIM instructions, which in turn produces a different set of control plane configurations to the master program. Configurations are installed and removed from the switch control plane software without reflashing the hardware, so introspection tasks can be reprogrammed on demand. In addition, the AIM layer also enables resource sharing across introspection primitives; e.g., traverse and iterate have shared logic for pointer chasing (e.g., MOV) and memory acquisition (e.g., LOAD), which can be supported by the same underlying introspection engines. The five instructions operate on the AIM (virtual) registers and stack.</p><p>AIM registers: Registers store temporary introspection state (e.g., memory addresses, loop bounds) and enable arithmetic operations (e.g., pointer arithmetic, bound checking). We use R_i to denote the i-th register allocated by the compiler. The LOAD(R, ADDR, SZ) instruction fetches a chunk of memory that starts at address ADDR with size SZ, and its variant LOAD(R, $CONST) assigns a compile-time constant to the register. Predicates over register values are used to implement control flow branches: the conditional jump JMP(PRED, L, L&#8242;) checks the predicate PRED (e.g., R &lt; LOOP_MAX) and branches to the L/L&#8242; labels in the instruction-level program (see &#167;4.2). 
We use R_B to denote a special AIM register that holds the current introspection base address, such as the starting address of a task_struct under introspection, and R_B is used in conjunction with relative offsets within the data structure to fetch data. The MOV(ADDR) instruction rebases introspection to a new address by fetching the pointer stored at ADDR (i.e., pointer chasing), and its variant MOV($ADDR) sets the base to ADDR. All ADDR fields in the LOAD/MOV instructions are compiled into base addresses and offsets.</p><p>AIM stack: The stack is manipulated in last-in-first-out order for traversal loops, and each stack frame contains a previously used R_B. For instance, when traversing the task_struct linked list, the PUSH instruction pushes the current task_struct base to the stack top. This may be further followed by a MOV to rebase introspection to the next element. The POP instruction pops the stack top to R_B, restoring the previous introspection base and resuming work from there (e.g., returning from an inner traversal to the outer layer). Every nested traversal produces exactly one stack frame, so stack depth can be analyzed statically by the compiler.</p></div>
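To make the register and stack semantics concrete, the following toy interpreter models the five AIM instructions over mocked memory. This is our reconstruction for illustration only; the instruction encodings are invented, the LOAD/MOV variants are simplified, and the real engines are switch match/action tables, not software:

```python
# Toy model of the abstract introspection machine: five instructions
# operating on registers, a base register rb, and a stack over mock memory.
def run_aim(program, mem, rb):
    regs, stack, pc = {}, [], 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "LOAD":              # LOAD(R, OFF): fetch mem[rb + OFF]
            regs[args[0]] = mem[rb + args[1]]
        elif op == "MOV":             # MOV(OFF): pointer-chase to rebase
            rb = mem[rb + args[0]]
        elif op == "PUSH":            # save base for a nested traversal
            stack.append(rb)
        elif op == "POP":             # restore the outer traversal base
            rb = stack.pop()
        elif op == "JMP":             # JMP(PRED, L): branch if PRED holds
            pred, target = args
            if pred(regs, rb):
                pc = target
                continue
        pc += 1
    return regs, rb
```

Under this model, a Q1-style loop becomes something like `[("LOAD", "R0", 0x8), ("MOV", 0x0), ("JMP", lambda regs, rb: rb != init_task_addr, 0)]`: load pid, chase tasks.next, and repeat until the base register wraps back to init_task.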
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Compiling to the AIM</head><p>RDMI compiles introspection operators to the AIM instructions enabled by the master program. kgraph(addr) initializes the introspection by setting R_B to the specified addr. in(ptr) is realized by a MOV instruction that rebases to a different data structure. values(f) is realized by a LOAD instruction to fetch the value. assert(pred) further performs predicate checking on the LOADed results. traverse and iterate are the most complex as they involve loops, and the loop body could further vary based on the query. RDMI compiles them into AIM instructions that implement a loop skeleton, but with loop bodies initialized to placeholders; they are later filled by the compiler when processing the operators within the traversals. We show the skeleton for traverse(ptr_nxt, ptr_end, type): Lines 1+10 successively move the cursor across a linked list of elements. To support nested traversals, Lines 3+9 use PUSH and POP to maintain base addresses in the stack. Line 11 checks for loop termination conditions. The loop body is left as a placeholder denoted by Lines 4-8, and it will be generated by the compiler when processing subsequent introspection primitives. For instance, the compiler may generate LOAD instructions if the operator nested in the loop is values(f).</p><p>iterate follows a similar compilation strategy to traverse. The Appendix includes more details for reference.</p></div>
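The traverse skeleton that &#167;4.2 references did not survive extraction; below is our reconstruction from the surrounding description (the line numbers match the references to Lines 1, 3, 4-8, 9, 10, and 11, but the exact operand forms are assumptions):

```
 1  MOV($addr_first)        ; enter the list: set R_B to the first element
 2  L_loop:
 3  PUSH                    ; save this element's base for nested traversals
 4  ...                     ;
 5  ...                     ; placeholder loop body, later filled by the
 6  ...                     ; compiler from the operators nested in the
 7  ...                     ; traversal (e.g., LOADs for values(f))
 8  ...
 9  POP                     ; restore this element's base
10  MOV(ptr_nxt)            ; pointer-chase to the next element
11  JMP(R_B != ptr_end, L_loop, L_exit)
```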
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Reconfigurable Introspection Engines</head><p>We now describe how the AIM instructions are instantiated in hardware engines, which can be reconfigured to implement different AIM instruction streams. At a high level, reconfiguration is achieved by installing control entries generated from different AIM instructions onto the match/action tables from the control plane. To explain this design, we first provide more background information on the Dom(-1) substrate.</p><p>In their most popular models (i.e., Intel Tofino), P4 programmable switches consist of a sequence of hardware stages. A P4 program is a pipeline of match/action tables that are allocated on the stages, which select specific packet headers using match fields and activate processing actions. The tables have access to stateful registers, which are persistent memory that keeps state across packets, as well as ALUs (arithmetic logical units) that are capable of performing arithmetic operations with registers. A packet can only access each stage and its resources (e.g., registers, ALUs) once, and each ALU supports at most two distinct arithmetic operations. The AIM instructions are instantiated by a set of match/action tables, and the table entries are generated by the RDMI compiler and populated by the switch control plane to realize different introspection policies. 
This is the control path for introspection.</p><p>(Figure: the introspection engines span the hardware stages: packet parser, endianness conversion, program counter, address translation, AIM instruction engine, RDMA retrieval, and packet deparser; control plane entries specify PC transition behaviors, and a packet's queue pair number denotes its PC.)</p><p>RDMA NICs transform Ethernet packets received over the wire into DMA transactions over the PCIe bus for memory access, and vice versa. We depict the RoCEv2 <ref type="bibr">[36]</ref> format, a commonly used RDMA protocol, with select fields: a packet carries a queue pair number (QPN) that uniquely identifies an RDMA connection. Other header fields include the memory address to read from, the read size, and the read opcode itself. RDMA packets are encapsulated in UDP, IP, and Ethernet protocols, and they are recognized by the host and the switch by their distinct destination port number (i.e., 4791). This is the datapath for memory introspection.</p><p>Introspection engines are depicted in Figure <ref type="figure">2</ref>. The AIM instruction engine matches on the program counter (PC) carried in the packet header (&#167;5.2), and executes different instruction streams based on the match/action entries. These entries map PCs to their corresponding AIM instructions. The compiler generates these entries from the AIM instructions, and the control plane software installs them into the match/action tables to realize different programs. RDMI also has a runtime system, with four engines for endianness conversion, program counter, RDMA retrieval, and address translation, respectively. 
The lifetime of a typical introspection packet in RDMI is as follows. When the switch receives an RDMA packet, it first extracts specific headers (e.g., memory content) from the packet and performs endianness conversion. Next, the PC engine advances the program execution based on PC transition rules, which are also compiled from the AIM instructions as match/action entries. As needed, the packet also triggers address translation and a page table walk. The AIM instruction engine then executes a batch of instructions, and triggers the RDMA retrieval engine for the next step of introspection. Thus, an introspection task requires several rounds of RDMA requests, each of which triggers an iteration of switch execution over the next several instructions.</p></div>
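For intuition, the packet lifetime above can be sketched in software; the helper names and dictionary-based tables below are hypothetical stand-ins for the match/action stages, not RDMI's implementation:

```python
# Illustrative sketch of one iteration of switch execution triggered by
# an RDMA response packet; all helper names are hypothetical.

def to_host_order(payload: bytes) -> int:
    # Endianness conversion: host memory content arrives little-endian.
    return int.from_bytes(payload, "little")

def process_rdma_response(qpn, payload, pc_table, programs):
    """Extract, convert, advance the PC, execute, and emit the next read."""
    value = to_host_order(payload)   # 1. endianness conversion
    pc = qpn                         # 2. the QPN encodes the program counter
    next_pc = pc_table[pc]           # 3. PC engine: transition rule
    block = programs[next_pc]        # 4. AIM engine: next instruction block
    # 5. RDMA retrieval: compute the outgoing read's header fields.
    addr, size = block(value)
    return {"qpn": next_pc, "addr": addr, "size": size}

pc_table = {1: 2}
programs = {2: lambda v: (v + 0x530, 8)}  # e.g., chase a pointer at offset 0x530
out = process_rdma_response(1, (0x1000).to_bytes(8, "little"),
                            pc_table, programs)
print(out)
```

Each such round produces one outgoing RDMA read, matching the text's observation that an introspection task takes several rounds of requests.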
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Reconfigurable AIM instruction engines</head><p>We now describe how the introspection engines are instantiated, deferring the runtime system description to &#167;5.2. Push, Pop, Mov: These instructions operate on the base register and the stack: pushing the base register value onto the stack, popping the base off the stack, and modifying it, respectively. The stack is created using an array of stateful registers in P4, with additional designs to overcome the constraints imposed by the sequential hardware stages. PUSH transfers data from the base register onto the stack, and POP transfers it in the other direction, requiring bidirectional data flow. However, if we allocate the base register at stage n and the stack at stage n + 1, backward access will incur heavy overhead; reversing their layout raises similar problems.</p><p>(Figure: keeping the stack and base register separate requires bidirectional flow, which is infeasible; merging them into one register array, indexed by stack_top_idx and base_idx, needs only unidirectional flow.)</p><p>Our design addresses this by observing that the stack depth can be statically analyzed by the compiler. Thus, we integrate the stack and base into a single register array. Two packet metadata variables, stack_top_idx and base_idx, record the logical stack top and the base register, although physically they reside in the same register array. PUSH increments stack_top_idx by one, subsuming the current base without copying data. POP decrements the stack top index by one and updates the base to the popped value, again with unidirectional data flow. Further, the RDMI compiler statically computes the values of stack_top_idx and base_idx at each point of the program execution (as denoted by the program counter/PC; details in &#167;5.2). It produces match/action entries that match against the PC values and retrieve the current indexes for stack and base register operations; different policies result in different table entries. 
MOV only produces unidirectional data flow, modifying the base register to a new value. When rebasing introspection to a new address, the RDMA retrieval engine fetches the data from kernel memory.</p><p>Load, Jmp: These instructions operate on the AIM registers. The compiler allocates a stateful register in P4 hardware for each AIM (virtual) register, with an optimization that statically analyzes whether a value fetched via LOAD will be used in subsequent instructions. For instance, a LOAD(R1,ADDR,SZ) instruction in conjunction with a JMP(R1,L1,L2) will result in a stateful register allocated for R1. On the other hand, if a LOADed value is only used in the current round and never reused (e.g., if the LOADed data is inspected but does not trigger additional pointer chasing), the compiler allocates a P4 metadata variable instead of a stateful register for resource savings. (Metadata variables are akin to packet headers: temporary, and discarded after the packet leaves the switch.) Recall that the LOAD instruction supplies a memory address and read size; the match/action table modifies the current packet's headers to transform it into a proper RDMA read packet, and emits it from the RDMA retrieval engine. When the response packet arrives, the P4 switch distinguishes the response based on the queue pair number, parses the value, stores it into the stateful register or metadata variable, and the LOAD instruction retires. JMP is supported by match/action tables that use an ALU to check the predicate over the register value, producing exactly two branches as required by the ALU constraints. As before, control plane entries are generated by the compiler to determine the specific checks and branching locations. If a branch is taken, the PC is modified to reflect the control flow transfer.</p></div>
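The unified stack/base register array of this section can be mimicked in a few lines; the class and field names are illustrative (the real state lives in P4 stateful registers and packet metadata):

```python
# Sketch of the unified stack/base register array: PUSH and POP only
# move indexes (carried as packet metadata), so data flows in one
# direction through the hardware stages. Names are illustrative.

class UnifiedStack:
    def __init__(self, depth):
        self.regs = [0] * depth   # single stateful register array
        self.stack_top_idx = 0    # logical stack top (packet metadata)
        self.base_idx = 0         # logical base register (packet metadata)

    def push(self):
        # Subsume the current base as a stack slot: no data is copied.
        self.stack_top_idx += 1
        self.base_idx = self.stack_top_idx

    def pop(self):
        # The popped value becomes the base again, still unidirectional.
        self.stack_top_idx -= 1
        self.base_idx = self.stack_top_idx

    def mov(self, addr):
        self.regs[self.base_idx] = addr   # rebase to a new address

s = UnifiedStack(depth=4)
s.mov(0xffff0000)   # base points at an outer structure
s.push()            # enter a nested traversal
s.mov(0xffff1000)   # inner base
s.pop()             # back to the outer structure
print(hex(s.regs[s.base_idx]))  # → 0xffff0000
```

Because the compiler knows the stack depth statically, the indexes here would in practice be looked up from PC-keyed match/action entries rather than mutated at runtime.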
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Reconfigurable introspection runtime</head><p>Supporting the instruction engines is the RDMI runtime system, with four components depicted in Figure <ref type="figure">2</ref>. Endianness conversion. LOAD and MOV instructions fetch data from host memory. Since host data uses little-endian encoding, we develop a translation engine to convert the endianness of the RDMA read output. Its match/action tables are configured with entries that match on the read size (i.e., number of bytes) and perform bytewise conversion as actions. RDMA retrieval. LOAD and MOV instructions supply kernel addresses to the retrieval engine, which handles interactions with the kernel memory via RDMA. The match/action tables for the retrieval engine are reconfigurable to edit the packet header with the RDMA read opcode, address, size, and the queue pair number before sending it out to the host. Program counter. To keep track of instruction execution, we need a program counter that specifies the next AIM instruction to be executed. Execution may either fall through to the next instruction (for non-JMP instructions) or branch to a different location (for JMPs). The PC value is encoded as the queue pair number (QPN), which is an RDMA packet header. The PC transition logic is realized by match/action tables that match on the current PC value as the key and compute the next PC as the action. Compared to a na&#239;ve design that uses a P4 stateful register to record the PC, carrying PC values in packet headers is a judicious design choice as it enables better support for concurrent queries. As each RDMA packet comes in, RDMI locates the execution context (i.e., the query it belongs to as well as the instruction executed) based on its QPN without ambiguity. 
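As a concrete illustration of the endianness engine described above, a match on the read size followed by bytewise reversal might look like the following sketch (the set of supported sizes is our assumption):

```python
# The endianness engine matches on the read size (in bytes) and applies
# a bytewise reversal as its action; sizes shown are illustrative.

def little_to_big(payload: bytes, size: int) -> bytes:
    # Match field: the read size; action: reverse that many bytes.
    assert size in (1, 2, 4, 8), "illustrative supported sizes"
    return payload[:size][::-1]

print(little_to_big(bytes([0x78, 0x56, 0x34, 0x12]), 4).hex())  # → 12345678
```

In hardware, each supported size would be one match/action entry rather than a branch.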
Consider some example match/action entries that implement the PC transition for a batch of AIM instructions. The match/action table uses the current PC (the QPN field) as the key and computes the next PC as the action; for example: CurPC 1 (a JMP) transitions to NxtPC 2 if its predicate is true and to 3 if it is false; CurPC 2 transitions unconditionally to 3; CurPC 3 (another JMP) transitions to 4 if its predicate is false and to 10 if it is true; PC 10 marks the end of the policy. First, notice that the PC is not per-instruction but counts blocks of AIM instructions; each block ends with either a MOV or a LOAD instruction. This is because MOV/LOAD sends the packet out, and the PC (as a packet header) will disappear from the switch. Eventually, its response packet comes back asynchronously, and we need to determine where to resume the execution. This, in turn, requires us to update the QPNs of outgoing packets with the next PC values, so that match/action processing will resume based on the incoming packets' QPNs. Moreover, we can see that JMP instructions are executed based on a predicate evaluation as part of the match/action processing (e.g., PC 1 may transition to 2 or 3 depending on the branching condition). Finally, a default label (e.g., PC 10) represents policy termination.</p><p>Address translation. Linear kernel addresses (e.g., the direct mapping area of kernel text) are translated by applying a fixed offset to obtain the physical address. Non-linear addresses (e.g., kernel modules) require a page table walk. Thus, the RDMI address translation engine is configured with two parameters: a) the translation offset, which remains fixed for a single boot, for linear address translation, and b) the global kernel symbol address for init_top_pgt, which resides in the linear address space and holds the entry to the kernel page table. Linear address translation works by extracting the virtual page number from an address and then applying an offset in the ALU. 
Non-linear translation requires several steps. A typical four-level page table segments a memory address into several components: a) bits [47..39] as the index into the first-level PGD (page global directory) table, which is located via the global kernel symbol, b) bits [38..30] for the second-level PUD (page upper directory), c) bits [29..21] for the third-level PMD (page middle directory), and d) bits [20..12] for the last-level PTE (page table entry) directory. Successive RDMA packets are sent to fetch the respective translation entries to compute the physical frame number (PFN). The page offset (i.e., bits [11..0]) is then concatenated with the PFN to form the actual physical address.</p></div>
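The walk described above can be sketched as follows, assuming x86-64 4KB paging; read64 stands in for one RDMA fetch per level, and the toy page tables are fabricated for illustration:

```python
# Sketch of the four-level page table walk; each read64() call models
# one RDMA fetch of a translation entry. Masks follow x86-64 4KB paging.

PAGE_MASK = 0x000F_FFFF_FFFF_F000  # physical-address bits in an entry

def translate(vaddr, pgd_base, read64):
    entry = pgd_base
    for shift in (39, 30, 21, 12):        # PGD, PUD, PMD, PTE levels
        index = (vaddr >> shift) & 0x1FF  # 9 index bits per level
        entry = read64((entry & PAGE_MASK) + index * 8)
    pfn = (entry & PAGE_MASK) >> 12
    return (pfn << 12) | (vaddr & 0xFFF)  # concatenate PFN and page offset

# Toy page tables: each level's sole entry points at the next table.
mem = {0x1000: 0x2000, 0x2000: 0x3000, 0x3000: 0x4000, 0x4008: 0x7000}
print(hex(translate(0x1ABC, 0x1000, mem.__getitem__)))  # → 0x7abc
```

On the switch, each loop iteration corresponds to one round trip through the RDMA retrieval engine rather than a local memory read.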
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Reconfiguring queries at runtime</head><p>Each of the introspection engines is reconfigurable from the switch control plane. Thus, adding, removing, or colocating queries is a seamless operation without downtime. To support co-existing queries, RDMI simply needs to configure the execution context (i.e., PC values and their QPNs) and PC-to-instruction mappings for each of the queries individually.</p></div>
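One plausible encoding of per-query execution contexts (our illustration; the text does not prescribe this exact scheme) is to give each query a disjoint QPN range, so that any incoming QPN decodes unambiguously to a (query, PC) pair:

```python
# Hypothetical per-query QPN namespace: query i owns QPNs
# [i * MAX_PC, (i + 1) * MAX_PC), so concurrent queries never collide.

MAX_PC = 64  # assumed per-query PC space

def encode_qpn(query_id, pc):
    return query_id * MAX_PC + pc

def decode_qpn(qpn):
    return divmod(qpn, MAX_PC)  # (query_id, pc)

assert decode_qpn(encode_qpn(3, 10)) == (3, 10)
```

Adding or removing a query then amounts to installing or deleting the match/action entries keyed on its QPN range, with no changes to other queries' entries.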
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Security analysis</head><p>Trusted boot. RDMI relies on a trusted boot process, where the RDMA NIC is initialized and appropriate QPNs (which denote the various PCs) are assigned to RDMI so that it can perform subsequent introspection. This is a practical assumption, as the boot process can be protected by initializing the system with a known image and relying on hardware support available in modern CPUs <ref type="bibr">[38]</ref>.</p><p>Runtime TCB reduction. Hypervisor-based introspection has a large TCB: the virtualization layer often exceeds several million lines of C code <ref type="bibr">[109]</ref>. In RDMI, after the trusted boot, the software TCB includes the P4 master program and its control plane entries generated by the compiler, totaling less than 3K lines. Even the RDMI compiler itself can be excluded from the software TCB, as eventually only its outputs are deployed to the switch. If desired, one could even formally verify the correctness of the P4 programs <ref type="bibr">[59,</ref><ref type="bibr">80]</ref> as a further step for assurance in RDMI. For hypervisor-based solutions, this is much harder to achieve. Dom(-1) hardware encloses vendor-provided firmware, which RDMI relies on for correct execution. The size of device firmware varies; an example NIC (Netronome Agilio <ref type="bibr">[29]</ref>) contains 52K lines of code in its firmware.</p><p>Runtime tampering. We now consider adversaries that specifically attempt to tamper with RDMI operations. Since RDMA bypasses kernel and CPU software, this already provides a degree of stealth by virtue of executing in Dom(-1). 
Nevertheless, a powerful attacker may attempt to guess RDMA configurations (e.g., queue pair and sequence numbers) and launch attacks in the following scenarios.</p><p>The adversary can launch (1) a disconnection attack, where she forcibly shuts down RDMI's queue pairs (or even the entire RNIC hardware itself) so that operations on these queue pairs will fail. RDMI detects such attacks by constantly monitoring the packet-level behavior for each of its queue pairs, and by raising alarms when certain queue pairs are unresponsive despite read requests. In addition, the adversary can launch (2) an injection attack without shutting down queue pairs, where she attempts to guess the correct RDMA sequence number used for introspection and inject spoofed data (e.g., incorrect task_struct addresses) to confound RDMI. Our defense relies on the fact that the RNIC hardware will execute RDMI requests and produce responses with the same sequence numbers as the adversary's injected packets. (Otherwise, if the adversary injects packets with incorrect sequence numbers, they will be rejected.) Therefore, RDMI keeps track of the correct sequence numbers and raises alarms when it detects duplicate packets, which is a sign of a spoofing attack. We have developed both defenses as part of the P4 master program, and Figure <ref type="figure">3</ref> shows the results for two introspection experiments, where around t=2s and t=3s, respectively, an adversary launches the disconnection and spoofing attacks. In both cases, the P4 switch is able to detect the malicious behaviors and raise alarms to the operator. The throughput drop also shows that the effect of disrupting the RNIC operations is noticeably different from the normal RDMI operations.</p><p>Policy | LoC (vs. LibVMI) | #instr. | #entries
P1 | 3 (225) | 8 | 109
P2 | 4 (217) | 10 | 135
P3 | 4 (145) | 8 | 161
P4 | 7 (167) | 16 | 135
P5 | 11 (200) | 20 | 234
P6 | 4 (135) | 8 | 91
P7 | 7 (252) | 18 | 195
P8 | 5 (151) | 8 | 123
P9 | 4 (104) | 8 | 119
P10 | 6 (138) | 9 | 173
P11 | 11 (231) | 20 | 226</p><p>Table <ref type="table">3</ref>: RDMI is expressive for a range of introspection policies and is much more concise than LibVMI implementations.</p></div>
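The injection defense can be sketched as a per-queue-pair duplicate detector; the class below is an illustrative software model (a real switch would use compact register state rather than Python sets):

```python
# Sketch of the spoofing defense: track RDMA sequence numbers per queue
# pair and flag duplicates, which indicate a spoofed response arriving
# alongside the RNIC's genuine one. Illustrative model only.

class SeqMonitor:
    def __init__(self):
        self.seen = {}  # queue pair -> sequence numbers observed

    def on_response(self, qp, seq):
        """Return True if the packet looks spoofed (duplicate seq)."""
        numbers = self.seen.setdefault(qp, set())
        if seq in numbers:
            return True   # duplicate: raise an alarm
        numbers.add(seq)
        return False

m = SeqMonitor()
assert m.on_response(qp=7, seq=100) is False  # genuine response
assert m.on_response(qp=7, seq=100) is True   # duplicate => spoofing sign
```

The disconnection defense is complementary: it watches for queue pairs that stop producing responses at all despite outstanding reads.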
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Evaluation</head><p>We present a comprehensive evaluation to answer three research questions: a) how well can RDMI support diverse introspection policies? b) how effective is RDMI in detecting rootkits? c) what are the introspection overheads of RDMI?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Prototype and setup</head><p>We have implemented RDMI in 5200 lines of code. The RDMI compiler (2700 lines of C++) ingests an introspection policy in our domain-specific language and emits control plane configurations for the master program. The master program implements the introspection engines in a P4 program with 2500 lines of code. We deploy RDMI to an Intel Tofino Wedge 100BF-32X P4 programmable hardware switch, with 32x100Gbps ports connected to a set of servers. All servers come with six Intel Xeon E5-2643 quad-core 3.40 GHz CPUs (24 cores), 128 GB RAM, and run Ubuntu 18.04 as the OS. Each server also has a Mellanox CX-4 RDMA NIC operating at 25Gbps. Our primary baseline defense is LibVMI [22], a state-of-the-art hypervisor-based introspection engine that obtains and analyzes guest memory snapshots in software. We use the KVM/QEMU (v4.2.1) hypervisor, and have manually implemented policies P1-P11 using LibVMI (v0.13.0). A LibVMI program runs inside the hypervisor, establishes a KVMI (KVM introspection) socket with a guest VM on the same physical server, and issues acquisition requests via this channel. Since LibVMI cannot support remote or baremetal introspection, we have created another defense system by stripping RDMI of its hardware control loop in the P4 switch. The introspection program runs on a dedicated server that implements the control loop in software, and the introspected machine is connected via the same P4 switch, which only runs a basic forwarding program; the memory datapath is still implemented using RDMA NICs. We call this baseline "Software RDMI," as it can be viewed as an intermediate step toward full RDMI, assuming that the top-of-rack switch is not P4 programmable. We have implemented policies P1-P11 in software, and further integrated them with RDMA for remote memory acquisition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">RDMI language and compiler</head><p>We start by evaluating the domain-specific language and compiler on expressiveness, TCB size, compilation speed, compiled configurations, and switch resource utilization. Expressiveness. Table <ref type="table">3</ref> shows the lines of code for each of the 11 policies in RDMI and in LibVMI implementations. RDMI supports the policies in at most 11 lines of code, whereas LibVMI implementations are one to two orders of magnitude larger, with 104-252 lines of code across policies. Further, LibVMI programs are developed inside the hypervisor, requiring low-level programming skills from the developer. We also show the number of AIM instructions compiled from the functional operators, as well as the control plane entry counts for each policy, which range from 91 to 234. Thus, the RDMI compiler successfully hides the task complexity and shifts a substantial amount of work inside itself, while automatically configuring different introspection tasks. TCB reduction. The P4 master program has 2500 lines of code, and the control plane entries generated by the compiler are less than 210 lines across all policies. Thus, the runtime TCB is smaller than 3K lines of code and configurations, a significant reduction compared to the size of a hypervisor.</p><p>Resource | ALU (%) | Hash unit (%) | SRAM (%) | TCAM (%)
Endian conversion | 0 | 1.68 | 0.21 | 0
Program counter | 2.08 | 3.61 | 2.6 | 0
Address translation | 8.33 | 6.61 | 4.69 | 3.47
AIM instructions | 4.17 | 4.81 | 3.12 | 6.94
RDMA retrieval | 4.17 | 4.09 | 2.81 | 1.39
Overall | 22.92 | 26.68 | 16.15 | 11.8
Table 4: Switch resource utilization with 11 policies.</p><p>(Figure 5: iPerf throughput (Gbps) over time as policies P3-P7 are inserted and the Adore-ng, Netfilter, TTY, Diamorphine, Sutekh, and Spy rootkits are detected.)</p><p>Compilation speed. 
Next, we measure the turnaround time for the RDMI compiler to generate the control plane configurations for each policy. Figure <ref type="figure">4</ref> shows the turnaround time: across all policies, RDMI spends 10 to 38 milliseconds to produce the compiled configurations, which is very efficient. The turnaround time is also correlated with the number of AIM instructions and control plane configuration entries that the compiler needs to generate: more complex configurations tend to have higher compilation times (e.g., P11 &gt; P10 &gt; P9). RDMI supports policy composition naturally, as each policy results in its own set of entries that are installed to the same set of introspection engines in the master program. Compiling multiple policies is equivalent to compiling each of them one by one, and then installing all the resulting entries to the switch (not shown, but see Figure <ref type="figure">5</ref> for concurrent queries). Switch resources. Table <ref type="table">4</ref> measures the switch resource utilization of the RDMI master program (installed with all 11 policies) and decomposes the usage across several components. In RDMI, header operations are performed in ALUs and hash units, stateful registers are supported in SRAM, and match/action entries reside in SRAM and TCAM. Across all resource types, the switch has 11.8%-26.68% utilization, leaving plenty of room for other types of switch programs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3">Detecting rootkits, remotely</head><p>We now evaluate the effectiveness of RDMI at detecting rootkits remotely, over the network, in baremetal installations. We collected four rootkits <ref type="bibr">[23,</ref><ref type="bibr">24,</ref><ref type="bibr">27,</ref><ref type="bibr">28]</ref> that are commonly used in kernel security evaluation, and added two more by implementing attack mechanisms from existing projects <ref type="bibr">[25,</ref><ref type="bibr">64,</ref><ref type="bibr">87]</ref>. RDMI is configured with all 11 policies in the switch. Adore-ng <ref type="bibr">[23]</ref> is a rootkit that has been evaluated in several existing detectors <ref type="bibr">[64,</ref><ref type="bibr">67,</ref><ref type="bibr">76,</ref><ref type="bibr">91,</ref><ref type="bibr">94]</ref>. It hooks itself to function pointers in the kernel virtual file system, such as the inode lookup and file iterate operations. After the hooks are installed, the rootkit collects parameters passed to the inode lookup function and matches them against predefined secrets to trigger privilege escalation of a requested process. Further, this rootkit covers its tracks by hiding information from administrative tools: the file iterate function hides data about malicious files, and its hook on tcp_seq_afinfo pointers hides network connections from netstat. RDMI successfully detected this rootkit via three policies: P2 detected a userland process whose privilege had been escalated; P3 and P10 detected function pointer values in the virtual file system and TCP stack that fall outside the regular kernel text. Sutekh <ref type="bibr">[28]</ref> is a rootkit that hooks the execve and umask functions in the system call table (i.e., sys_call_table, an array of syscall pointers) <ref type="bibr">[69]</ref>. Invoking the hooked syscalls results in modification of the credential data structures of specified userland processes. 
RDMI detected this rootkit using P2 (which detects escalation) and P6 (which detects system call table tampering). Diamorphine <ref type="bibr">[24]</ref> is a rootkit that also targets system call hooks, manipulating kill, getdents, and getdents64 <ref type="bibr">[67,</ref><ref type="bibr">74,</ref><ref type="bibr">84,</ref><ref type="bibr">92]</ref>. For instance, the kill syscall is repurposed as a communication mechanism between the rootkit and a userland process; e.g., kill -sig -para sends signals from userland to the rootkit, and the signal numbers further trigger the rootkit's information hiding or privilege escalation capabilities on behalf of certain userland processes. RDMI detected this rootkit with policies P2 and P6. Spy <ref type="bibr">[27]</ref> is a keyboard logging rootkit, which calls register_keyboard_notifier in the kernel to add itself to the set of consoles (i.e., keyboard_notifier_blocks) that receive notifications upon keystrokes <ref type="bibr">[92]</ref>. The rootkit then writes the keystrokes into a buffer maintained by the debugfs virtual filesystem. Our system detected this rootkit using policy P8, which checks keyboard logging functions. TTY rootkit is a rootkit that we have implemented using the techniques proposed in a related project <ref type="bibr">[87]</ref>. It targets Linux tty units and manipulates the receive_buf function, which is called by the tty driver to send characters to the tty line discipline. By doing this, it can hijack and eavesdrop on any data typed into a terminal. RDMI detected this rootkit with policy P5, which monitors TTY activities. Netfilter rootkit implements techniques utilized by two projects <ref type="bibr">[25,</ref><ref type="bibr">64]</ref> that target Linux Netfilter. It registers a Netfilter handler via nf_register_net_hook, specifically at the location NF_IP_LOCAL_IN. 
It thus obtains control when receiving network packets, and can take arbitrary actions, including monitoring port-knocking traffic [32] for activation and then dropping such traffic to avoid suspicion. RDMI detected this rootkit with policy P4, which checks the Netfilter system.</p><p>(Figure 6: introspection time per policy; RDMI query times for P1-P11 are 2.675 ms, 4.477 ms, 0.0155 ms, 0.299 ms, 0.304 ms, 0.934 ms, 150.093 ms, 0.049 ms, 1.79 ms, 0.017 ms, and 85.94 ms, with slowdowns shown for Software RDMI and LibVMI.)</p><p>Dynamic, concurrent queries. To demonstrate RDMI's flexibility to deploy new queries at runtime, we start with the master program under an empty configuration, and then gradually add five policies to detect the above rootkits installed inside a server. Figure <ref type="figure">5</ref> shows the throughput of an iperf client during the reconfiguration, and labels the respective policies and the detected rootkits. We can see that query reprogramming does not disrupt the network transfer or impact service availability. In all cases, the added policies were able to detect the respective rootkits effectively. Without this capability, deploying new policies would require reflashing the switch and taking down the cluster for reconfiguration.</p></div>
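Several of the hook-detecting policies above boil down to one invariant: legitimate kernel function pointers fall inside the kernel text region. A minimal sketch of that check, with fabricated address bounds:

```python
# Sketch of a function-pointer sanity check: hooked pointers (e.g., in
# sys_call_table or VFS operations) land outside kernel text, typically
# in module space. The bounds below are made up for illustration.

KTEXT_START = 0xffffffff81000000  # hypothetical _text
KTEXT_END   = 0xffffffff82000000  # hypothetical _etext

def pointer_ok(ptr):
    return KTEXT_START <= ptr < KTEXT_END

def scan_table(entries):
    """Flag entries (e.g., sys_call_table slots) that appear hooked."""
    return [i for i, ptr in enumerate(entries) if not pointer_ok(ptr)]

table = [0xffffffff81001000, 0xffffffffc0a00000, 0xffffffff81ffff00]
print(scan_table(table))  # → [1]: the second entry points into module space
```

In RDMI, such a predicate would be an assert(pred) over LOADed pointer values, executed as JMP checks in the switch rather than in Python.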
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4">Benefits of baremetal security</head><p>Next, we showcase the benefits of baremetal security as enabled by RDMI. As discussed, LibVMI only supports virtualized environments, and "Software RDMI" is an approximation of RDMI that relegates the introspection logic to remote server software (but still uses RDMA to fetch kernel memory). Introspection time. Figure <ref type="figure">6</ref> shows the time it takes to complete an introspection policy across the three defenses: RDMI executes 9-58 times faster than the state-of-the-art LibVMI solution. "Software RDMI," as an approximation, is also much faster than LibVMI, but it still falls behind full RDMI, which outperforms it by roughly two times. This is not only because "Software RDMI" still involves remote CPU overheads, but also because the introspection and introspected machines are necessarily located farther away from each other, connected by an intermediate switch. RDMI, on the other hand, is a switch-resident defense and has immediate reach to all servers under the same rack. The introspection time also varies across policies; e.g., P3 is the fastest (15.5&#181;s) and P7 is the slowest (150.1ms) for RDMI. We further measure the capture rates across the three defenses by introducing a "cat-and-mouse" game, where a kernel rootkit rapidly modifies the credential data structures of specific processes and then modifies them back. 
By setting the attacker to different modification frequencies, we compare how well the defenses can capture the modifications by measuring their capture rates.</p><p>(Figure 7: capture rate (%) versus modifications per second for RDMI, Software RDMI, and LibVMI. Figure 8: CPU utilization (number of cores) for policies P1-P11 with Software RDMI under the same introspection throughput as LibVMI.)</p><p>Figure <ref type="figure">7</ref> shows the results for up to 500 modifications per second, with RDMI consistently achieving the highest capture rates. When the attack goes beyond 50 modifications per second, LibVMI drops to about a 20% capture rate; when it increases to 200 modifications per second, even Software RDMI drops to 57%. At both frequencies, RDMI stays at 100%, and it only drops for much faster attacks. Introspection CPU overheads. As motivated before, hypervisor offloading aims to reduce software CPU overheads on the servers. RDMI achieves introspection entirely in programmable hardware, without CPU software overheads. Thus, we measure the CPU overhead of the other defenses to understand the cost of security. We configure LibVMI to introspect varying numbers of VMs with each of the policies, and measure the number of CPU cores required for the introspection tasks. Each KVMI introspection socket is attached to a single introspection program, which is dedicated to a policy and executes it repeatedly.</p><p>(Figure 9: number of CPU cores required by LibVMI for 2, 4, 8, 16, and 32 VMs across policies P1-P11, annotated with per-policy query rates. Figure 10: throughput reduction of local Redis (Set, Get, Range, Push, mSet) and Nginx (512KB-16MB transfers) under simple (P3), medium (P1), and complex (P11) introspection tasks with LibVMI and RDMI, normalized to their respective baselines.)</p><p>As shown in Figure <ref type="figure">9</ref>, for 32 VMs, LibVMI requires 10-14 CPU cores (out of 24) just for the introspection tasks. This is a significant cost, as the resulting CPU overheads take away valuable cycles that are no longer provisioned for the tenants. Next, we configure Software RDMI to execute at the same query speeds that are achievable by LibVMI with 32 VMs, and measure the CPU costs of the introspection machine. As Figure <ref type="figure">8</ref> shows, the CPU cost is significantly lower: roughly on par with what would be needed to introspect 4-8 VMs with the LibVMI solution. Nevertheless, Software RDMI still requires 1.3-6.2 CPU cores to drive the introspection task, whereas RDMI removes the CPU overhead entirely by executing in Dom(-1). Summary. Concretely, the performance gain of RDMI comes from two factors. 
First, RDMI executes entirely in programmable ASICs at hardware speeds, both for memory retrieval and for introspection computation, whereas the hypervisor solutions execute in CPU software with high overhead. Moreover, LibVMI requires the presence of a hypervisor, and as shown in &#167;7.5, this by itself incurs a large footprint and takes resources away from any colocated tasks.</p><p>In contrast, RDMI executes in a baremetal setting off the host, so introspection tasks and tenant workloads cause minimal interference to each other. In addition, our "cat-and-mouse" experiment shows that this performance gain translates into concrete security benefits in capture rates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.5">Introspection interference</head><p>We have seen that introspection is a heavyweight task with high CPU overheads. Next, we quantify the interference that introspection causes to tenant workloads. For LibVMI, we use the 32-VM setting where one of the VMs executes tenant services, and measure the performance overhead that hypervisor introspection imposes on these workloads. For RDMI, we measure the same workloads on a baremetal machine and quantify the performance degradation of the introspected machine. We omit Software RDMI from this measurement, as its overhead on the introspected machine is the same as RDMI's; the downside of Software RDMI comes from overheads on the remote introspection machine, which is dedicated to introspection tasks and does not run tenant workloads. We use two common workloads, key/value operations (Redis) and web transfers (Nginx), and test them both locally and via the network.</p><p>Redis key/value workloads (local). Figure <ref type="figure">10</ref> shows the throughput reduction of key/value workloads, with the Redis client and server colocated on the same machine, using key/value operations common for Redis benchmarking (i.e., Set, Get, Range-100, Push, mSet). For fairness of comparison, we normalize the LibVMI-enabled throughput against the Redis throughput without LibVMI (both in VMs), and the RDMI-enabled throughput against the case without RDMI (both on baremetal servers). Further, we choose three representative introspection policies based on their complexity in the LibVMI implementations (simple: P3; medium: P1; complex: P11).</p><p>We can see that RDMI incurs 0.1%-4% throughput overhead across all key/value workloads, whereas LibVMI's overhead is 11-486 times higher, with reductions ranging from 22%-56%. This is because LibVMI introspection incurs high CPU overheads and creates severe contention. RDMI, on the other hand, does not involve the remote CPUs.
Its overhead comes from the RDMA reads that are converted into memory accesses, which introduce a small amount of memory contention.</p><p>Web server workloads (local). Using a similar methodology, we have measured the introspection interference of LibVMI and RDMI with Nginx web server workloads. In this experiment, we generate HTTP requests to download files of varying sizes from the web server, and measure the throughput in requests served per second. As Figure <ref type="figure">10</ref> shows, RDMI only introduces 0.4%-5.1% throughput overheads across all workloads, whereas LibVMI incurs an overhead ranging from 25%-51%, which is 7-711 times higher.</p><p>Remote workloads. Next, we measure the same workloads when the requests come through the network. Whereas the local experiments are designed to measure CPU and memory overheads due to introspection interference, this setup additionally accounts for the impact of the network traffic generated by RDMI. LibVMI only performs local operations, so it does not incur any network IO. Figure <ref type="figure">11</ref> shows the Redis and Nginx throughputs over the network. Even accounting for network overheads, RDMI only incurs a throughput degradation between 0.1%-5.6%. In comparison, when LibVMI is handling remote client requests, the degradation ranges from 20%-60% across the workloads, which is 5-864 times higher. Therefore, these results show that, with requests coming through the network, RDMI is still able to perform security tasks with minimal performance interference, unlike state-of-the-art hypervisor solutions.</p><p>[Figure 12: Network bandwidth usage (Gbps) of remote introspection across policies P1-P11.]</p><p>Figure <ref type="figure">12</ref> further shows the amount of network bandwidth overhead under different introspection policies. We set RDMI to introspect the baremetal machine at the same speed (in terms of queries per second) as what LibVMI achieves at its peak throughput with 32 VMs locally.
Across all queries, the network bandwidth due to remote introspection ranges from 0.3-1.91 Gbps; over a 100 Gbps link, this translates to 0.3%-1.91% network overhead.</p></div>
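The two metrics used throughout this section reduce to simple arithmetic; the following is a minimal sketch of that arithmetic (our own reconstruction, not the paper's measurement harness):

```python
# Sketch (assumed form): throughput reduction is normalized to each setup's
# own baseline, and introspection traffic is reported as a share of the
# 100 Gbps link capacity.

def tput_reduction_pct(baseline_ops: float, introspected_ops: float) -> float:
    """Throughput reduction (%) relative to the matching baseline."""
    return 100.0 * (baseline_ops - introspected_ops) / baseline_ops

def link_overhead_pct(introspection_gbps: float, link_gbps: float = 100.0) -> float:
    """Share of link capacity (%) consumed by remote introspection traffic."""
    return 100.0 * introspection_gbps / link_gbps

# A policy driving 1.91 Gbps of RDMA reads over a 100 Gbps fabric uses
# about 1.91% of the link, the upper end of the range reported above.
print(round(link_overhead_pct(1.91), 2))    # 1.91
print(tput_reduction_pct(100_000, 96_000))  # 4.0
```

Normalizing each system against its own baseline (VMs for LibVMI, baremetal for RDMI) is what makes the two reduction percentages directly comparable.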
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Discussions</head><p>Cost of deploying RDMI. Since RDMA NICs and programmable switches are already in use at major cloud providers <ref type="bibr">[44,</ref><ref type="bibr">47,</ref><ref type="bibr">60,</ref><ref type="bibr">78]</ref>, we believe that the barrier to deploying RDMI is reasonably low. Nevertheless, a cloud provider that does not already use RDMA NICs and programmable switches would incur an extra cost for upgrading these devices. We provide some data points for understanding the CapEx and OpEx costs based on available prices. Our programmable switch costs $10,060 <ref type="bibr">[41]</ref>, and a non-programmable switch operating at the same port speeds (32x100Gbps) costs $9,399 <ref type="bibr">[11]</ref>. Our RDMA NIC costs $488 <ref type="bibr">[30]</ref>, and a non-RDMA NIC at the same speed costs $355 <ref type="bibr">[18]</ref>. The extra cost for upgrading a single switch and a single RDMA NIC is therefore $794. As we have shown in §7.4, hypervisor-based introspection requires 11.2-14 CPU cores on average with 32 VMs across policies P1-P11. These core counts are similar to what is provisioned in Amazon's m5zn.3xlarge EC2 instances, sold at $0.991/hour <ref type="bibr">[3,</ref><ref type="bibr">6]</ref>; hence, RDMI becomes more cost-effective after 33.4 days of operation. Although device and VM costs change over time and across vendors, we believe that this back-of-the-envelope calculation paints a representative picture. Also, we note again that cloud providers that have already invested in these devices would not incur additional capital cost.</p><p>Attacks on RDMA and P4 systems. RDMA NICs and P4 programmable switches are part of our TCB, and they are assumed to be trustworthy. However, existing work has identified security issues with both types of devices <ref type="bibr">[71,</ref><ref type="bibr">96]</ref>.
As some examples, adversaries could launch side-channel <ref type="bibr">[100]</ref>, exfiltration, injection, and denial-of-service attacks <ref type="bibr">[96]</ref> against RDMA deployments. P4 programmable devices may also exhibit corner-case behaviors under traffic patterns carefully crafted by an attacker <ref type="bibr">[71]</ref>. However, these generic attacks are not specific to RDMI, and known defenses exist <ref type="bibr">[105]</ref>.</p><p>Cache coherence, consistency, registers. RDMI performs introspection in an out-of-band (OOB) manner, and OOB introspection <ref type="bibr">[90]</ref> has several limitations that RDMI shares.</p><p>(i) Cache coherence: RDMA memory accesses are not cache coherent with host CPUs unless more advanced interconnects (e.g., CXL <ref type="bibr">[12]</ref>) become available. Thus, when RDMI acquires a data structure from main memory, it is not guaranteed to be the latest version, as modified data may still reside in the CPU cache. (ii) Consistency: Kernel state is in constant flux, and this leads to another degree of asynchrony between RDMI's view and the true system state. For instance, while RDMI traverses the task_struct linked list, processes could be added to or removed from the data structure, and these changes may not be reflected in RDMI's view. In fact, even hypervisor-based solutions can observe inconsistent views unless guest VMs are paused during introspection, which would lead to significant overheads. (iii) Registers: OOB solutions cannot introspect CPU state such as register values <ref type="bibr">[76,</ref><ref type="bibr">90,</ref><ref type="bibr">98]</ref>. Previous work <ref type="bibr">[66]</ref> has shown that advanced attackers could manipulate the CR3 register so that the actual page tables used by the OS differ from those seen by OOB introspection. Despite these limitations, OOB solutions have been shown to be effective in existing work <ref type="bibr">[90]</ref>, and RDMI corroborates these findings.
Importantly, in baremetal settings, the host machine does not have a hypervisor to perform in-band introspection, so OOB solutions like RDMI are necessary to protect baremetal kernels.</p><p>Introspection capability. Our current RDMI experiments disable the IOMMU; when it is enabled, DMA requests from PCIe devices may be further translated by the IOMMU.</p><p>In this case, an attacker could create incorrect IOMMU mappings to confound security mechanisms like RDMI <ref type="bibr">[45,</ref><ref type="bibr">98]</ref>. As a potential mitigation, PCIe devices that implement the ATS (Address Translation Service) feature can tag DMA requests so that they are not further translated by the IOMMU, thus ensuring trusted memory acquisition <ref type="bibr">[45]</ref>. Recent RNICs have been built with ATS features <ref type="bibr">[31]</ref>. In terms of introspection tasks, RDMI supports tasks that traverse the kernel graph within a well-defined footprint to detect attacks. However, not all memory forensic tasks fall into this category; e.g., performing a cryptographic hash over kernel text to ensure integrity <ref type="bibr">[90,</ref><ref type="bibr">91]</ref>, or regular expression matching over memory content <ref type="bibr">[51,</ref><ref type="bibr">56,</ref><ref type="bibr">99]</ref>, would go beyond RDMI's current DSL and hardware capabilities. With an imperative language (e.g., C programs that use LibVMI), one could also write introspection tasks that walk kernel pointers in arbitrary patterns; such tasks also pose challenges for the current DSL. Nevertheless, we have demonstrated that RDMI is sufficiently expressive for a range of tasks and can detect real-world rootkits.</p></div>
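The deployment-cost break-even discussed at the start of this section can be reproduced with a short script; the prices and EC2 rate are as quoted in the text, while the arithmetic itself is our reconstruction:

```python
# Back-of-the-envelope break-even for upgrading one switch and one NIC,
# compared against renting the CPU cores that hypervisor-based
# introspection would consume.

switch_extra = 10_060 - 9_399   # programmable vs. fixed-function switch ($)
nic_extra = 488 - 355           # RDMA vs. non-RDMA NIC ($)
capex_delta = switch_extra + nic_extra   # one-time upgrade cost: $794

ec2_rate = 0.991                # $/hour for m5zn.3xlarge (the saved cores)
break_even_days = capex_delta / ec2_rate / 24

print(capex_delta)                # 794
print(round(break_even_days, 1))  # 33.4
```

After roughly a month of operation, the one-time hardware premium is paid for by the CPU cores that no longer need to be set aside for introspection.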
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">Related work</head><p>Memory introspection. The art of introspecting memory snapshots to detect malice dates back two decades <ref type="bibr">[61]</ref>. Since then, many techniques have been developed to improve the accuracy of kernel memory analysis <ref type="bibr">[21,</ref><ref type="bibr">48,</ref><ref type="bibr">49,</ref><ref type="bibr">52,</ref><ref type="bibr">54,</ref><ref type="bibr">57,</ref><ref type="bibr">79,</ref><ref type="bibr">91,</ref><ref type="bibr">93,</ref><ref type="bibr">98,</ref><ref type="bibr">103]</ref> and narrow the semantic gap <ref type="bibr">[43,</ref><ref type="bibr">53,</ref><ref type="bibr">62,</ref><ref type="bibr">86]</ref>. Hypervisor-based systems, such as ImEE <ref type="bibr">[110]</ref> and Livewire <ref type="bibr">[61]</ref>, use software solutions to introspect guest VMs. Our project is particularly related to out-of-band (OOB) introspection techniques that leverage hardware assistance. In this space, KI-Mon <ref type="bibr">[76]</ref> and Vigilare <ref type="bibr">[85]</ref> add a special security module that snoops the memory bus and detects kernel object modifications. Copilot <ref type="bibr">[90]</ref> contributes a system prototype for an Intel StrongARM EBSA-285 evaluation board that can acquire memory over PCIe. In contrast to existing OOB platforms, RDMI (a) only uses COTS devices for baremetal introspection; (b) contributes a domain-specific language, compiler, and runtime for introspection tasks; and (c) executes these tasks efficiently in programmable hardware without CPU involvement.</p><p>P4 and RDMA. P4 and RDMA programmable devices have been used for performance acceleration in cloud systems, both separately <ref type="bibr">[68,</ref><ref type="bibr">70]</ref> and in conjunction <ref type="bibr">[73,</ref><ref type="bibr">75]</ref>.
The security community has developed a line of work using P4 programmable switches for network protection <ref type="bibr">[83,</ref><ref type="bibr">106]</ref>, including for RDMA vulnerabilities <ref type="bibr">[105]</ref>. RDMI demonstrates their use in a novel setting for kernel security.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10">Conclusion</head><p>Hypervisor offloading has gained popularity in datacenters. We rearchitect a classic security task that is usually relegated to the hypervisor, memory introspection, to enable introspection of baremetal servers entirely in programmable ASICs. RDMI leverages recent hardware advances in RDMA NICs and P4 programmable switches, and contributes a domain-specific language, compiler, and runtime system. RDMI incurs no CPU overhead for introspection tasks, outperforming state-of-the-art hypervisor-based solutions, and detects a variety of rootkits even in baremetal installations.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>/* Credential telemetry for all processes */</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>kgraph(init_task)</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>.traverse(tasks.next, &amp;init_task.tasks, task_struct)</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>.in(creds)</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>.values(uid, gid)   </p></note>
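The DSL program in the footnotes above amounts to a bounded walk over the kernel's circular task list. A hypothetical sketch of that traversal follows, with `read(addr)` standing in for one RDMA read; the struct offsets are invented for illustration and do not reflect real kernel layout:

```python
# Hypothetical sketch: walk the circular task list from &init_task.tasks
# and read (uid, gid) from each task's credential struct, mirroring
# kgraph(init_task).traverse(...).in(creds).values(uid, gid).

TASKS_NEXT_OFF = 0       # offset of tasks.next inside task_struct (assumed)
CRED_OFF = 8             # offset of the creds pointer (assumed)
UID_OFF, GID_OFF = 0, 4  # offsets inside the cred struct (assumed)

def introspect_creds(read, init_task_tasks, max_steps=1024):
    """Return a list of (uid, gid) per process; max_steps bounds the walk
    so a corrupted list cannot trap the introspector in an endless loop."""
    out = []
    ptr = read(init_task_tasks + TASKS_NEXT_OFF)
    for _ in range(max_steps):
        if ptr == init_task_tasks:    # wrapped around: list exhausted
            break
        task = ptr - TASKS_NEXT_OFF   # container_of(ptr, task_struct, tasks)
        cred = read(task + CRED_OFF)
        out.append((read(cred + UID_OFF), read(cred + GID_OFF)))
        ptr = read(ptr + TASKS_NEXT_OFF)
    return out
```

Running this against a toy memory image (a dict mapping address to value) yields one (uid, gid) pair per simulated process; in RDMI, the same reads would be issued as RDMA verbs and the checking logic would execute on the programmable switch.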
		</body>
		</text>
</TEI>
