NSF PAR Search | NSF Public Access Repository

High-throughput and Flexible Host Networking for Accelerated Computing

Skiadopoulos, Athinagoras; Xie, Zhiqiang; Zhao, Mark; Cai, Qizhe; Agarwal, Saksham; Adelman, Jacob; Ahern, David; Contavalli, Carlo; Goldflam, Michael; Mayatskikh, Vitaly; et al (July 2024, OSDI)
Efficiently Programming Large Language Models using SGLang

Zheng, Lianmin; Yin, Liangsheng; Xie, Zhiqiang; Huang, Jeff; Sun, Chuyue; Yu, Cody_Hao; Cao, Shiyi; Kozyrakis, Christos; Stoica, Ion; Gonzalez, Joseph E; et al (December 2023, arXiv)

RAIL: Predictable, Low Tail Latency for NVMe Flash

https://doi.org/10.1145/3465406

Litz, Heiner; Gonzalez, Javier; Klimovic, Ana; Kozyrakis, Christos (February 2022, ACM Transactions on Storage)

Flash-based storage is replacing disk for an increasing number of data center applications, providing orders of magnitude higher throughput and lower average latency. However, applications also require predictable storage latency. Existing Flash devices fail to provide low tail read latency in the presence of write operations. We propose two novel techniques to address SSD read tail latency, including Redundant Array of Independent LUNs (RAIL) which avoids serialization of reads behind user writes as well as latency-aware hot-cold separation (HC) which improves write throughput while maintaining low tail latency. RAIL leverages the internal parallelism of modern Flash devices and allocates data and parity pages to avoid reads getting stuck behind writes. We implement RAIL in the Linux Kernel as part of the LightNVM Flash translation layer and show that it can reduce read tail latency by 7× at the 99.99th percentile, while reducing relative bandwidth by only 33%.
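The read path described in this abstract can be illustrated with a minimal sketch. It assumes a RAID-5-style XOR parity page per stripe, which the abstract does not spell out; the names `xor_pages`, `read_page`, and `busy_luns` are hypothetical and are not taken from the LightNVM implementation.

```python
from functools import reduce

def xor_pages(pages):
    """XOR a list of equal-length byte strings, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))

def read_page(stripe, parity, index, busy_luns):
    """Return data page `index` of a stripe.

    Assumption: page i lives on LUN i. If that LUN is busy with a write,
    rebuild the page from the other data pages plus parity instead of waiting.
    """
    if index not in busy_luns:
        return stripe[index]  # fast path: LUN is idle, read directly
    others = [page for i, page in enumerate(stripe) if i != index]
    return xor_pages(others + [parity])

# Three data pages (one per LUN) and their XOR parity page.
stripe = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_pages(stripe)

# LUN 1 is serving a write, so page 1 is reconstructed from the others.
rebuilt = read_page(stripe, parity, 1, busy_luns={1})
```

The reconstruction costs extra reads on the other LUNs, which is consistent with the bandwidth reduction the abstract reports in exchange for lower tail latency.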
RackSched: A Microsecond-Scale Scheduler for Rack-Scale Computers

Zhu, Hang; Kaffes, Kostis; Chen, Zixu; Liu, Zhenming; Kozyrakis, Christos; Stoica, Ion (January 2021, OSDI 2021)
RackSched: A Microsecond-Scale Scheduler for Rack-Scale Computers

Zhu, Hang; Kaffes, Kostis; Chen, Zixu; Liu, Zhenming; Kozyrakis, Christos; Stoica, Ion; Jin, Xin (November 2020, 14th USENIX Symposium on Operating Systems Design and Implementation)
Low-latency online services have strict Service Level Objectives (SLOs) that require datacenter systems to support high throughput at microsecond-scale tail latency. Dataplane operating systems have been designed to scale up multi-core servers with minimal overhead for such SLOs. However, as application demands continue to increase, scaling up is not enough, and serving larger demands requires these systems to scale out to multiple servers in a rack. We present RackSched, the first rack-level microsecond-scale scheduler that provides the abstraction of a rack-scale computer (i.e., a huge server with hundreds to thousands of cores) to an external service with network-system co-design. The core of RackSched is a two-layer scheduling framework that integrates inter-server scheduling in the top-of-rack (ToR) switch with intra-server scheduling in each server. We use a combination of analytical results and simulations to show that it provides near-optimal performance as centralized scheduling policies, and is robust for both low-dispersion and high-dispersion workloads. We design a custom switch data plane for the inter-server scheduler, which realizes power-of-k-choices, ensures request affinity, and tracks server loads accurately and efficiently. We implement a RackSched prototype on a cluster of commodity servers connected by a Barefoot Tofino switch. End-to-end experiments on a twelve-server testbed show that RackSched improves the throughput by up to 1.44x, and scales out the throughput near linearly, while maintaining the same tail latency as one server until the system is saturated.
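The power-of-k-choices policy named in this abstract can be sketched in a few lines. This is a host-side illustration only, not the switch data plane from the paper; `pick_server` and the `loads` list are hypothetical names, and load is assumed to be a per-server count of outstanding requests.

```python
import random

def pick_server(loads, k, rng=random):
    """Sample k distinct servers at random and return the least-loaded one."""
    candidates = rng.sample(range(len(loads)), k)
    return min(candidates, key=lambda server: loads[server])

# Outstanding requests per server (illustrative values).
loads = [5, 0, 3, 7]

# With k equal to the number of servers, the policy degenerates to
# "pick the globally least-loaded server".
best = pick_server(loads, k=len(loads))

# With k = 2, only two random servers are compared per request.
choice = pick_server(loads, k=2)
loads[choice] += 1  # the chosen server now has one more request in flight
```

Comparing only k sampled servers keeps the per-request work constant while still avoiding the most heavily loaded servers most of the time.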
Classifying Memory Access Patterns for Prefetching

https://doi.org/10.1145/3373376.3378498

Ayers, Grant; Litz, Heiner; Kozyrakis, Christos; Ranganathan, Parthasarathy (March 2020, ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems)

Prefetching is a well-studied technique for addressing the memory access stall time of contemporary microprocessors. However, despite a large body of related work, the memory access behavior of applications is not well understood, and it remains difficult to predict whether a particular application will benefit from a given prefetcher technique. In this work we propose a novel methodology to classify the memory access patterns of applications, enabling well-informed reasoning about the applicability of a certain prefetcher. Our approach leverages instruction dataflow information to uncover a wide range of access patterns, including arbitrary combinations of offsets and indirection. These combinations or prefetch kernels represent reuse, strides, reference locality, and complex address generation. By determining the complexity and frequency of these access patterns, we enable reasoning about prefetcher timeliness and criticality, exposing the limitations of existing prefetchers today. Moreover, using these kernels, we are able to compute the next address for the majority of top-missing instructions, and we propose a software prefetch injection methodology that is able to outperform state-of-the-art hardware prefetchers.
AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers

https://doi.org/10.1109/MM.2020.2986212

Nagendra, Nayana Prasad; Ayers, Grant; August, David I.; Cho, Hyoun Kyu; Kanev, Svilen; Kozyrakis, Christos; Krishnamurthy, Trivikram; Litz, Heiner; Moseley, Tipp; Ranganathan, Parthasarathy (May 2020, IEEE Micro)

From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers

Fouladi, Sadjad; Romero, Francisco; Iter, Dan; Li, Qian; Chatterjee, Shuvo; Kozyrakis, Christos; Zaharia, Matei; Winstein, Keith (July 2019, 2019 USENIX Annual Technical Conference (USENIX ATC 19))

We present gg, a framework and a set of command-line tools that helps people execute everyday applications—e.g., software compilation, unit tests, video encoding, or object recognition—using thousands of parallel threads on a cloud-functions service to achieve near-interactive completion time. In the future, instead of running these tasks on a laptop, or keeping a warm cluster running in the cloud, users might push a button that spawns 10,000 parallel cloud functions to execute a large job in a few seconds from start. gg is designed to make this practical and easy. With gg, applications express a job as a composition of lightweight OS containers that are individually transient (lifetimes of 1–60 seconds) and functional (each container is hermetically sealed and deterministic). gg takes care of instantiating these containers on cloud functions, loading dependencies, minimizing data movement, moving data between containers, and dealing with failure and stragglers. We ported several latency-sensitive applications to run on gg and evaluated its performance. In the best case, a distributed compiler built on gg outperformed a conventional tool (icecc) by 2–5×, without requiring a warm cluster running continuously. In the worst case, gg was within 20% of the hand-tuned performance of an existing tool for video encoding (ExCamera).