NSF PAR Search | NSF Public Access Repository

Boki: Towards Data Consistency and Fault Tolerance with Shared Logs in Stateful Serverless Computing

https://doi.org/10.1145/3653072

Jia, Zhipeng; Witchel, Emmett (September 2024, ACM Transactions on Computer Systems)

Boki is a new serverless runtime that exports a shared log API to serverless functions. Boki shared logs enable stateful serverless applications to manage their state with durability, consistency, and fault tolerance. Boki shared logs achieve high throughput and low latency. The key enabler is the metalog, a novel mechanism that allows Boki to address ordering, consistency and fault tolerance independently. The metalog orders shared log records with high throughput and it provides read consistency while allowing service providers to optimize the write and read path of the shared log in different ways. To demonstrate the value of shared logs for stateful serverless applications, we build Boki support libraries that implement fault-tolerant workflows, durable object storage, and message queues. Our evaluation shows that shared logs can speed up important serverless workloads by up to 4.2×.
Full Text Available
The Key Ideas Behind Boki's Shared Logs

https://doi.org/10.1145/3689051.3689054

Jia, Zhipeng; Witchel, Emmett (August 2024, ACM SIGOPS Operating Systems Review)

The shared log approach has emerged as an attractive state management option for distributed systems. A shared log not only serves as persistent, strongly consistent, and fault-tolerant storage, but its ability to provide a total order also enables fine-grained state machine replication. Boki is a recent shared log system that includes an intuitive LogBook abstraction and novel shared log design choices. Although Boki was designed as storage for serverless functions, its design principles are applicable to other distributed systems that disaggregate storage from compute.
Full Text Available
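
The abstract above notes that a shared log's total order enables state machine replication. A minimal sketch of that general idea follows, assuming a toy in-memory log; the names `SharedLog`, `Replica`, and `apply_op` are hypothetical and are not Boki's LogBook API. Each replica replays the same ordered records, so replicas that have read the same prefix hold the same state.

```python
from dataclasses import dataclass, field


@dataclass
class SharedLog:
    """Toy in-memory log (hypothetical, not Boki's API): position = sequence number."""
    records: list = field(default_factory=list)

    def append(self, record) -> int:
        self.records.append(record)
        return len(self.records) - 1  # sequence number assigned to this record

    def read_from(self, seqnum: int):
        """Return (seqnum, record) pairs at or after the given position, in log order."""
        return list(enumerate(self.records))[seqnum:]


def apply_op(state: dict, op: tuple) -> None:
    """Apply one order-sensitive operation to a key-value state."""
    kind, key, value = op
    if kind == "set":
        state[key] = value
    elif kind == "add":
        state[key] = state.get(key, 0) + value
    elif kind == "mul":
        state[key] = state.get(key, 0) * value


@dataclass
class Replica:
    """Replica that rebuilds its state by replaying the log from where it left off."""
    state: dict = field(default_factory=dict)
    next_seqnum: int = 0

    def sync(self, log: SharedLog) -> None:
        for seqnum, op in log.read_from(self.next_seqnum):
            apply_op(self.state, op)
            self.next_seqnum = seqnum + 1


log = SharedLog()
a, b = Replica(), Replica()
log.append(("set", "x", 1))
a.sync(log)                      # a catches up early
log.append(("mul", "x", 3))
log.append(("add", "x", 2))
a.sync(log)                      # a replays only the two new records
b.sync(log)                      # b replays all three from the start
```

Because `mul` and `add` do not commute, the replicas agree only because both apply the records in log order; a real system would also need durability and failure handling, which this sketch omits.
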
Disaggregated GPU Acceleration for Serverless Applications

https://doi.org/10.1145/3606557.3606560

Fingler, Henrique; Zhu, Zhiting; Yoon, Esther; Jia, Zhipeng; Witchel, Emmett; Rossbach, Christopher J. (June 2023, ACM SIGOPS Operating Systems Review)

Serverless platforms have been attracting applications from traditional platforms because infrastructure management responsibilities are shifted from users to providers. Many applications well-suited to serverless environments could leverage GPU acceleration to enhance their performance. Unfortunately, current serverless platforms do not expose GPUs to serverless applications.
Full Text Available
DGSF: Disaggregated GPUs for Serverless Functions

https://doi.org/10.1109/IPDPS53621.2022.00077

Fingler, Henrique; Zhu, Zhiting; Yoon, Esther; Jia, Zhipeng; Witchel, Emmett (April 2022, IEEE International Parallel and Distributed Processing Symposium)

Ease of use and transparent access to elastic resources have attracted many applications away from traditional platforms toward serverless functions. Many of these applications, such as machine learning, could benefit significantly from GPU acceleration. Unfortunately, GPUs remain inaccessible from serverless functions in modern production settings. We present DGSF, a platform that transparently enables serverless functions to use GPUs through general purpose APIs such as CUDA. DGSF solves provisioning and utilization challenges with disaggregation, serving the needs of a potentially large number of functions through virtual GPUs backed by a small pool of physical GPUs on dedicated servers. Disaggregation allows the provider to decouple GPU provisioning from other resources, and enables significant benefits through consolidation. We describe how DGSF solves GPU disaggregation challenges including supporting API transparency, hiding the latency of communication with remote GPUs, and load-balancing access to heavily shared GPUs. Evaluation of our prototype on six workloads shows that DGSF’s API remoting optimizations can improve the runtime of a function by up to 50% relative to unoptimized DGSF. Such optimizations, which aggressively remove GPU runtime and object management latency from the critical path, can enable functions running over DGSF to have a lower end-to-end time than when running on a GPU natively. By enabling GPU sharing, DGSF can reduce function queueing latency by up to 53%. We use DGSF to augment AWS Lambda with GPU support, showing similar benefits.
Full Text Available
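
The abstract above describes serving many functions through virtual GPUs backed by a small pool of physical GPUs, with load balancing across them. As a rough illustration of that idea only, the sketch below places each function on the least-loaded physical GPU. The `GpuPool` class and its least-loaded policy are assumptions made for this example; the abstract does not specify DGSF's actual placement policy or API.

```python
class GpuPool:
    """Hypothetical pool that maps function IDs onto a few physical GPUs."""

    def __init__(self, num_physical: int):
        # Number of functions currently attached to each physical GPU.
        self.load = {gpu: 0 for gpu in range(num_physical)}
        # Which physical GPU backs each function's virtual GPU.
        self.placement = {}

    def attach(self, function_id: str) -> int:
        # Pick the least-loaded GPU; break ties by the lowest GPU id.
        gpu = min(self.load, key=lambda g: (self.load[g], g))
        self.load[gpu] += 1
        self.placement[function_id] = gpu
        return gpu

    def detach(self, function_id: str) -> None:
        # Release the function's share of its physical GPU.
        gpu = self.placement.pop(function_id)
        self.load[gpu] -= 1


pool = GpuPool(num_physical=2)
assigned = [pool.attach(f"fn-{i}") for i in range(5)]  # five functions, two GPUs
pool.detach("fn-0")
```

Counting attached functions is a deliberately crude load metric; a real provider would likely also weigh GPU memory and kernel activity.
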
Boki: Stateful Serverless Computing with Shared Logs

https://doi.org/10.1145/3477132.3483541

Jia, Zhipeng; Witchel, Emmett (October 2021, Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles)

Boki is a new serverless runtime that exports a shared log API to serverless functions. Boki shared logs enable stateful serverless applications to manage their state with durability, consistency, and fault tolerance. Boki shared logs achieve high throughput and low latency. The key enabler is the metalog, a novel mechanism that allows Boki to address ordering, consistency and fault tolerance independently. The metalog orders shared log records with high throughput and it provides read consistency while allowing service providers to optimize the write and read path of the shared log in different ways. To demonstrate the value of shared logs for stateful serverless applications, we build Boki support libraries that implement fault-tolerant workflows, durable object storage, and message queues. Our evaluation shows that shared logs can speed up important serverless workloads by up to 4.7x.
Full Text Available
Nightcore: efficient and scalable serverless computing for latency-sensitive, interactive microservices

https://doi.org/10.1145/3445814.3446701

Jia, Zhipeng; Witchel, Emmett (April 2021, Architectural Support for Programming Languages and Operating Systems)
The microservice architecture is a popular software engineering approach for building flexible, large-scale online services. Serverless functions, or function as a service (FaaS), provide a simple programming model of stateless functions which are a natural substrate for implementing the stateless RPC handlers of microservices, as an alternative to containerized RPC servers. However, current serverless platforms have millisecond-scale runtime overheads, making them unable to meet the strict sub-millisecond latency targets required by existing interactive microservices. We present Nightcore, a serverless function runtime with microsecond-scale overheads that provides container-based isolation between functions. Nightcore’s design carefully considers various factors having microsecond-scale overheads, including scheduling of function requests, communication primitives, threading models for I/O, and concurrent function executions. Nightcore currently supports serverless functions written in C/C++, Go, Node.js, and Python. Our evaluation shows that when running latency-sensitive interactive microservices, Nightcore achieves 1.36×–2.93× higher throughput and up to 69% reduction in tail latency.
Full Text Available
Telekine: Secure Computing with Cloud GPUs

Hunt, Tyler; Jia, Zhipeng; Miller, Vance; Szekely, Ariel; Hu, Yige; Rossbach, Christopher J; Witchel, Emmett (February 2020, 17th USENIX Symposium on Networked Systems Design and Implementation)

GPUs have become ubiquitous in the cloud due to the dramatic performance gains they enable in domains such as machine learning and computer vision. However, offloading GPU computation to the cloud requires placing enormous trust in providers and administrators. Recent proposals for GPU trusted execution environments (TEEs) are promising but fail to address very real side-channel concerns. To illustrate the severity of the problem, we demonstrate a novel attack that enables an attacker to correctly classify images from ImageNet by observing only the timing of GPU kernel execution, rather than the images themselves. Telekine enables applications to use GPU acceleration in the cloud securely, based on a novel GPU stream abstraction that ensures execution and interaction through untrusted components are independent of any secret data. Given a GPU with support for a TEE, Telekine employs a novel variant of API remoting to partition application-level software into components to ensure secret-dependent behaviors occur only on trusted components. Telekine can securely train modern image recognition models on MXNet with 10%–22% performance penalty relative to an insecure baseline with a locally attached GPU. It runs graph algorithms using Galois on one and two GPUs with 18%–41% overhead.
Full Text Available
