NSF PAR Search | NSF Public Access Repository

POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference

https://doi.org/10.1145/3676641.3715996

Kamath, Aditya K; Prabhu, Ramya; Mohan, Jayashree; Peter, Simon; Ramjee, Ramachandran; Panwar, Ashish (March 2025, ACM)

Free, publicly-accessible full text available March 30, 2026
EMPower: The Case for a Cloud Power Control Plane

Park, Jonggyu; Stavrinos, Theano; Peter, Simon; Anderson, Thomas (December 2024, ACM SIGENERGY Energy Informatics Review)

Full Text Available
(MC)^2: Lazy MemCopy at the Memory Controller

Kamath, Aditya K; Peter, Simon (June 2024, ACM)

(MC)^2 is a lazy memory copy mechanism that can be used within memcpy-like functions to significantly reduce the CPU overhead for copies that are sparsely accessed. It can also hide copy latencies by enhancing the CPU’s ability to execute them asynchronously. (MC)^2’s lazy memcpy avoids copying data at the time of invocation. Instead, (MC)^2 tracks prospective copies. If copied data is later accessed by a CPU or the cache, (MC)^2 uses the tracking information to lazily execute a copy, when necessary. Placing (MC)^2 at the memory controller puts it at the perfect vantage point to eliminate the largest source of memcpy overhead (CPU stalls due to cache misses in the critical path) while imposing minimal overhead itself. (MC)^2 consists of three main components: memory controller extensions that implement a lazy memcpy operation, a new instruction exposing the lazy memcpy, and a flexible software wrapper with semantics identical to memcpy. We implement and evaluate (MC)^2 in the gem5 simulator using a variety of microbenchmarks and workloads, including Google’s Protobuf, where (MC)^2 provides a 43% speedup, and Linux huge page copy-on-write faults, where (MC)^2 provides 250× lower latency.
Full Text Available
EMPower: The Case for a Cloud Power Control Plane

Park, Jonggyu; Stavrinos, Theano; Peter, Simon; Anderson, Thomas (July 2024, HotCarbon; ACM Energy Informatics Review)

Escalating application demand and the end of Dennard scaling have put energy management at the center of cloud operations. Because of the huge cost and long lead time of provisioning new data centers, operators want to squeeze as much use out of existing data centers as possible, often limited by power provisioning fixed at the time of construction. Workload demand spikes and the inherent variability of renewable energy, as well as increased power unreliability from extreme weather events and natural disasters, make the data center power management problem even more challenging. We believe it is time to build a power control plane to provide fine-grained observability and control over data center power to operators. Our goal is to help make data centers substantially more elastic with respect to dynamic changes in energy sources and application needs, while still providing good performance to applications. There are many use cases for cloud power control, including increased power oversubscription and use of green energy, resilience to power failures, large-scale power demand response, and improved energy efficiency.
Full Text Available
Can Storage Devices be Power Adaptive?

https://doi.org/10.1145/3655038.3665945

Xie, Dedong; Stavrinos, Theano; Zhu, Kan; Peter, Simon; Kasikci, Baris; Anderson, Thomas (July 2024, ACM)

Power is becoming a scarce resource for data centers, raising the need for power adaptive system design—the ability to dynamically change power consumption—to match available power. Storage makes up an increasing fraction of total data center power consumption. As such, it holds great potential to contribute to data center power adaptivity. To this end, we conduct a measurement study of power control mechanisms on a variety of modern data center storage devices. By changing device power states and shaping IO, we achieve a power dynamic range of up to 59.4% of the device’s maximum operating power. We also study power control trade-offs, including throughput and latency. Based on our observations, we construct storage device power-throughput models and discuss the implications on power adaptive storage system design.
Full Text Available
ScaleDB: A Scalable, Asynchronous In-Memory Database

Mehdi, Syed; Hwang, Deukyeon; Peter, Simon; Alvisi, Lorenzo (July 2023, USENIX)

ScaleDB is a serializable in-memory transactional database that achieves excellent scalability on multi-core machines by asynchronously updating range indexes. We find that asynchronous range index updates can significantly improve database scalability by applying updates in batches, reducing contention on critical sections. To avoid stale reads, ScaleDB uses small hash indexlets to hold delayed updates. We use indexlets to design ACC, an asynchronous concurrency control protocol providing serializability. With ACC, it is possible to delay range index updates without adverse performance effects on transaction execution in the common case. ACC delivers scalable serializable isolation for transactions, with high throughput and low abort rate. Evaluation on a dual-socket server with 36 cores shows that ScaleDB achieves 9.5× better query throughput than Peloton on the YCSB benchmark and 1.8× better transaction throughput than Cicada on the TPC-C benchmark.
Full Text Available
ScaleDB: A Scalable, Asynchronous In-Memory Database

Mehdi, Syed Akbar; Hwang, Deukyeon; Peter, Simon; Alvisi, Lorenzo (July 2023, Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation)

ScaleDB is a serializable in-memory transactional database that achieves excellent scalability on multi-core machines by asynchronously updating range indexes. We find that asynchronous range index updates can significantly improve database scalability by applying updates in batches, reducing contention on critical sections. To avoid stale reads, ScaleDB uses small hash indexlets to hold delayed updates. We use indexlets to design ACC, an asynchronous concurrency control protocol providing serializability. With ACC, it is possible to delay range index updates without adverse performance effects on transaction execution in the common case. ACC delivers scalable serializable isolation for transactions, with high throughput and low abort rate. Evaluation on a dual-socket server with 36 cores shows that ScaleDB achieves 9.5× better query throughput than Peloton on the YCSB benchmark and 1.8× better transaction throughput than Cicada on the TPC-C benchmark.
Full Text Available
RDMA is Turing complete, we just did not know it yet!

Reda, Waleed; Kostić, Dejan; Canini, Marco; Peter, Simon (January 2022, 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI))

It is becoming increasingly popular for distributed systems to exploit offload to reduce load on the CPU. Remote Direct Memory Access (RDMA) offload, in particular, has become popular. However, RDMA still requires CPU intervention for complex offloads that go beyond simple remote memory access. As such, the offload potential is limited and RDMA-based systems usually have to work around such limitations. We present RedN, a principled, practical approach to implementing complex RDMA offloads, without requiring any hardware modifications. Using self-modifying RDMA chains, we lift the existing RDMA verbs interface to a Turing complete set of programming abstractions. We explore what is possible in terms of offload complexity and performance with a commodity RDMA NIC. We show how to integrate these RDMA chains into applications, such as the Memcached key-value store, allowing us to offload complex tasks such as key lookups. RedN can reduce the latency of key-value get operations by up to 2.6× compared to state-of-the-art KV designs that use one-sided RDMA primitives (e.g., FaRM-KV), as well as traditional RPC-over-RDMA approaches. Moreover, compared to these baselines, RedN provides performance isolation and, in the presence of contention, can reduce latency by up to 35× while providing applications with failure resiliency to OS and process crashes.
Full Text Available
FlexTOE: Flexible TCP Offload with Fine-Grained Parallelism

Shashidhara, Rajath; Stamler, Timothy; Kaufmann, Antoine; Peter, Simon (January 2022, 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI))

FlexTOE is a flexible, yet high-performance TCP offload engine (TOE) to SmartNICs. FlexTOE eliminates almost all host data-path TCP processing and is fully customizable. FlexTOE interoperates well with other TCP stacks, is robust under adverse network conditions, and supports POSIX sockets. FlexTOE focuses on data-path offload of established connections, avoiding complex control logic and packet buffering in the NIC. FlexTOE leverages fine-grained parallelization of the TCP data-path and segment reordering for high performance on wimpy SmartNIC architectures, while remaining flexible via a modular design. We compare FlexTOE on an Agilio-CX40 to host TCP stacks Linux and TAS, and to the Chelsio Terminator TOE. We find that Memcached scales up to 38% better on FlexTOE versus TAS, while saving up to 81% host CPU cycles versus Chelsio. FlexTOE provides competitive performance for RPCs, even with wimpy SmartNICs. FlexTOE cuts 99.99th-percentile RPC RTT by 3.2× and 50% versus Chelsio and TAS, respectively. FlexTOE's data-path parallelism generalizes across hardware architectures, improving single connection RPC throughput up to 2.4× on x86 and 4× on BlueField. FlexTOE supports C and XDP programs written in eBPF. It allows us to implement popular data center transport features, such as TCP tracing, packet filtering and capture, VLAN stripping, flow classification, firewalling, and connection splicing.
Full Text Available
zIO: Accelerating IO-Intensive Applications with Transparent Zero-Copy IO

Stamler, Timothy; Hwang, Deukyeon; Raybuck, Amanda; Zhang, Wei; Peter, Simon (January 2022, Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation)

Full Text Available
