NSF PAR Search | NSF Public Access Repository


CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

https://doi.org/10.1145/3651890.3672274

Liu, Yuhan; Li, Hanchen; Cheng, Yihua; Ray, Siddhant; Huang, Yuyang; Zhang, Qizheng; Du, Kuntai; Yao, Jiayi; Lu, Shan; Ananthanarayanan, Ganesh; et al (August 2024, ACM)

Full Text Available
A Foundation for Real-time Applications on Function-as-a-Service

Nguyen, Hai Duc; Chien, Andrew A. (February 2024, A Foundation for Real-time Applications on Function-as-a-Service)

The Serverless (or Function-as-a-Service) compute model enables new applications with dynamic scaling. However, all current Serverless systems are best-effort and, as we prove, this means they cannot guarantee hard real-time deadlines, rendering them unsuitable for such real-time applications. We analyze a proposed extension of the Serverless model, called Real-time Serverless, that adds a guaranteed invocation rate. This approach aims to meet real-time deadlines with dynamically allocated function invocations. We first prove that the Serverless model does not support real-time guarantees. Next, we analyze Real-time Serverless, showing it can guarantee application real-time deadlines for rate-monotonic real-time workloads. Further, we derive bounds on the required invocation rate to meet any set of workload runtimes and periods. Subsequently, we explore an application technique, pre-invocation, and show that it can reduce the required guaranteed invocation rate. We derive bounds for the feasible rate guarantee reduction and the corresponding overhead in wasted compute resources. Finally, we apply the theoretical results to improve the experience quality of a distributed virtual reality/augmented reality application as well as to simplify the application design and resource management.
Full Text Available
OneAdapt: Fast Adaptation for Deep Learning Applications via Backpropagation

https://doi.org/10.1145/3620678.3624653

Du, Kuntai; Liu, Yuhan; Hao, Yitian; Zhang, Qizheng; Wang, Haodong; Huang, Yuyang; Ananthanarayanan, Ganesh; Jiang, Junchen (October 2023, SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing)
Storm-RTS: Stream Processing with Stable Performance for Multi-Cloud and Cloud-edge

https://doi.org/10.1109/CLOUD60044.2023.00015

Nguyen, Hai Duc; Chien, Andrew A. (July 2023, IEEE)

Stream Processing Engines (SPEs) traditionally deploy applications on a set of shared workers (e.g., threads, processes, or containers) requiring complex performance management by SPEs and application developers. We explore a new approach that replaces workers with Rate-based Abstract Machines (RBAMs). This allows SPEs to translate stream operations into FaaS invocations, and exploit guaranteed invocation rates to manage performance. This approach enables SPE applications to achieve transparent and predictable performance. We realize the approach in the Storm-RTS system. Exploring 36 stream processing scenarios over 5 different hardware configurations, we demonstrate several key advantages. First, Storm-RTS provides stable application performance and can enable flexible reconfiguration across cloud resource configurations. Second, SPEs built on RBAM can be resource-efficient and scalable. Finally, Storm-RTS allows the stream-processing paradigm to be extended from the cloud to the edge, using its performance stability to hide edge heterogeneity and resource competition. An experiment with 4 cloud and edge sites over 300 cores shows how Storm-RTS can support flexible reconfiguration and simple high-level declarative policies that optimize resource cost or other criteria.
Full Text Available
Reducing the Carbon Impact of Generative AI Inference (today and in 2035)

Chien, Andrew A.; Lin, Liuzixuan; Nguyen, Hai; Rao, Varsha; Sharma, Tristan; Wijayawardana, Rajini (July 2023, ACM Hot Carbon 2023)

Generative AI, exemplified by ChatGPT, Dall-E 2, and Stable Diffusion, is an exciting new class of applications consuming growing quantities of computing. We study the compute, energy, and carbon impacts of generative AI inference. Using ChatGPT as an exemplar, we create a workload model and compare request direction approaches (Local, Balance, CarbonMin), assessing their power use and carbon impacts. Our workload model shows that for ChatGPT-like services, inference dominates emissions, in one year producing 25x the carbon emissions of training GPT-3. The workload model characterizes user experience, and experiments show that carbon emissions-aware algorithms (CarbonMin) can both maintain user experience and reduce carbon emissions dramatically (35%). We also consider a future scenario (2035 workload and power grids), and show that CarbonMin can reduce emissions by 56%. In both cases, the key is intelligent direction of requests to locations with low-carbon power. Combined with hardware technology advances, CarbonMin can keep emissions increase to only 20% compared to 2022 levels for 55x greater workload. Finally, we consider datacenter headroom to increase effectiveness of shifting. With headroom, CarbonMin reduces 2035 emissions by 71%.
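The idea of directing requests to locations with low-carbon power can be illustrated with a minimal sketch. This is not the paper's CarbonMin algorithm: the greedy policy, the `Region` fields, the `energy_per_request_kwh` parameter, and all numeric values are hypothetical assumptions chosen only to show the shape of such a policy.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    carbon_intensity: float  # gCO2 per kWh (hypothetical value)
    capacity: int            # max requests the region can absorb (hypothetical)


def direct_requests_greedy(regions, n_requests):
    """Assign requests to the lowest-carbon regions first, up to capacity.

    A simple greedy stand-in for a carbon-aware direction policy.
    """
    assignment = {r.name: 0 for r in regions}
    remaining = n_requests
    for region in sorted(regions, key=lambda r: r.carbon_intensity):
        take = min(remaining, region.capacity)
        assignment[region.name] = take
        remaining -= take
        if remaining == 0:
            break
    if remaining > 0:
        raise ValueError("total capacity is smaller than the request count")
    return assignment


def total_emissions_g(regions, assignment, energy_per_request_kwh):
    """Emissions in grams CO2: requests x energy per request x grid intensity."""
    by_name = {r.name: r for r in regions}
    return sum(
        count * energy_per_request_kwh * by_name[name].carbon_intensity
        for name, count in assignment.items()
    )


# Hypothetical regions and workload, for illustration only.
regions = [
    Region("A", carbon_intensity=100.0, capacity=50),
    Region("B", carbon_intensity=400.0, capacity=100),
    Region("C", carbon_intensity=250.0, capacity=50),
]
plan = direct_requests_greedy(regions, 120)
emissions = total_emissions_g(regions, plan, energy_per_request_kwh=0.5)
```

With these made-up inputs, the lowest-carbon regions fill first and only the overflow reaches the highest-carbon region. A realistic policy would also have to respect latency and user-experience constraints, which this sketch ignores.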
Full Text Available
Adapting Datacenter Capacity for Greener Datacenters and Grid

https://doi.org/10.1145/3575813.3595197

Lin, Liuzixuan; Chien, Andrew A (June 2023, ACM Symposium on Future Energy Systems (E-Energy 2023))

Cloud providers are adapting datacenter (DC) capacity to reduce carbon emissions. With hyperscale datacenters exceeding 100 MW individually, and in some grids exceeding 15% of power load, DC adaptation is large enough to harm power grid dynamics, increasing carbon emissions and power prices or reducing grid reliability. To avoid harm, we explore coordination of DC capacity change, varying its scope in space and time. In space, coordination scope spans a single datacenter, a group of datacenters, and datacenters with the grid. In time, scope ranges from online to day-ahead. We also consider what DC and grid information is used (e.g., real-time and day-ahead average carbon, power price, and compute backlog). For example, in our proposed PlanShare scheme, each datacenter uses day-ahead information to create a capacity plan and shares it, allowing global grid optimization (over all loads, over the entire day). We evaluate DC carbon emissions reduction. Results show that local coordination scope fails to reduce carbon emissions significantly (3.2%–5.4% reduction). Expanding coordination scope to a set of datacenters improves slightly (4.9%–7.3%). PlanShare, with grid-wide coordination and full-day capacity planning, performs the best. PlanShare reduces DC emissions by 11.6%–12.6%, 1.56x–1.26x better than the best local, online approach’s results. PlanShare also achieves lower cost. We expect these advantages to increase as renewable generation in power grids increases. Further, a known full-day DC capacity plan provides a stable target for DC resource management.
Full Text Available
Privid: Practical, Privacy-Preserving Queries on Public Video

Cangialosi, Frank; Agarwal, Neil; Arun, Venkat; Jiang, Junchen; Narayana, Srinivas; Sarwate, Anand; Netravali, Ravi (April 2022, 19th USENIX Symposium on Networked Systems Design and Implementation)

Full Text Available
Understanding the potential of server-driven edge video analytics

https://doi.org/10.1145/3508396.3512872

Zhang, Qizheng; Du, Kuntai; Agarwal, Neil; Netravali, Ravi; Jiang, Junchen (January 2022, HotMobile '22: Proceedings of the 23rd Annual International Workshop on Mobile Computing Systems and Applications)

Full Text Available
Motivating High Performance Serverless Workloads

Nguyen, Hai Duc; Yang, Zhifei; Chien, Andrew A. (June 2021, The 1st Workshop on High Performance Serverless Computing)
Foster, Ian; Chard, Kyle; Babuji, Yadu (Ed.)
The historical motivation for serverless comes from internet-of-things, smartphone client server, and the objective of simplifying programming (no provisioning) and scale-down (pay-for-use). These applications are generally low-performance and best-effort. However, the serverless model enables flexible software architectures suitable for a wide range of applications that demand high performance and guaranteed performance. We have studied three such applications: scientific data streaming, virtual/augmented reality, and document annotation. We describe how each can be cast in a serverless software architecture and how the application performance requirements translate into high performance requirements (invocation rate, low and predictable latency) for the underlying serverless system implementation. These applications can require invocation rates as high as millions per second (40 MHz) and latency deadlines below a microsecond (300 ns), and furthermore require performance predictability. All of these capabilities are far in excess of today's commercial serverless offerings and represent interesting research challenges.
Full Text Available
Evaluating Coupling Models for Cloud Datacenters and Power Grids

https://doi.org/10.1145/3447555.3464868

Lin, Liuzixuan; Zavala, Victor M.; Chien, Andrew A. (June 2021, In The Twelfth ACM International Conference on Future Energy Systems (e-Energy ’21))
Ardakanian, Omid; Niesse, Astrid (Ed.)
The rapid growth of datacenter (DC) loads can be leveraged to help meet renewable portfolio standard (RPS, renewable fraction) targets in power grids. The ability to manipulate DC loads over time (shifting) provides a mechanism to deal with temporal mismatch between non-dispatchable renewable generation (e.g. wind and solar) and overall grid loads, and this flexibility ultimately facilitates the absorption of renewables and grid decarbonization. To this end, we study DC-grid coupling models, exploring their impact on grid dispatch, renewable absorption, power prices, and carbon emissions. With a detailed model of grid dispatch, generation, topology, and loads, we consider three coupling approaches: fixed, datacenter-local optimization (online dynamic programming), and grid-wide optimization (optimal power flow). Results show that understanding the effects of dynamic DC load management requires studies that model the dynamics of both load and power grid. Dynamic DC-grid coupling can produce large improvements: (1) reduce grid dispatch cost (-3%), (2) increase grid renewable fraction (+1.58%), and (3) reduce DC power cost (-16.9%). It also has negative effects: (1) increase cost for both DCs and non-DC customers, (2) differentially increase prices for non-DC customers, and (3) create large power-level changes that may harm DC productivity.
Full Text Available
