NSF PAR Search | NSF Public Access Repository

ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications

Liu, Yuhan; Wan, Chengcheng; Du, Kuntai; Hoffmann, Henry; Jiang, Junchen; Lu, Shan; Maire, Michael (July 2024, Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation)

ML APIs have greatly relieved application developers of the burden of designing and training their own neural network models: classifying objects in an image can now be as simple as one line of Python code to call an API. However, these APIs offer the same pre-trained models regardless of how their output is used by different applications. This can be suboptimal, as not all ML inference errors cause application failures, and the distinction between inference errors that can or cannot cause failures varies greatly across applications. To tackle this problem, we first study 77 real-world applications, which collectively use six ML APIs from two providers, to reveal common patterns of how ML API output affects applications' decision processes. Inspired by the findings, we propose ChameleonAPI, an optimization framework for ML APIs, which takes effect without changing the application source code. ChameleonAPI provides application developers with a parser that automatically analyzes the application to produce an abstract of its decision process, which is then used to devise an application-specific loss function that only penalizes API output errors critical to the application. ChameleonAPI uses the loss function to efficiently train a neural network model customized for each application and deploys it to serve API invocations from the respective application via the existing interface. Compared to a baseline that selects the best-of-all commercial ML API, we show that ChameleonAPI reduces incorrect application decisions by 43%.
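As a rough illustration of the idea of a loss that penalizes only errors critical to the application, the sketch below scores predictions per decision branch instead of per label. It is not the loss used by ChameleonAPI; the function name `decision_aware_loss`, the `decision_groups` mapping, and the max-over-labels scoring are assumptions made for this example.

```python
import math

def decision_aware_loss(probs, true_labels, decision_groups, eps=1e-7):
    """Binary cross-entropy over decision branches (hypothetical sketch).

    probs: dict label -> predicted probability
    true_labels: set of ground-truth labels
    decision_groups: dict branch name -> set of labels that trigger the branch
    Labels that appear in no branch never contribute to the loss.
    """
    loss = 0.0
    for labels in decision_groups.values():
        # Branch score: the most confident label that would trigger this branch.
        score = max(probs.get(label, 0.0) for label in labels)
        score = min(max(score, eps), 1.0 - eps)
        # The branch should fire only if a ground-truth label belongs to it.
        target = 1.0 if true_labels & labels else 0.0
        loss -= target * math.log(score) + (1.0 - target) * math.log(1.0 - score)
    return loss / len(decision_groups)

# Hypothetical application: route an item to "recycle" or "compost".
groups = {"recycle": {"bottle", "can"}, "compost": {"apple", "banana"}}
truth = {"bottle"}
same_branch_error = {"bottle": 0.1, "can": 0.9, "apple": 0.05}   # wrong label, same decision
cross_branch_error = {"bottle": 0.1, "apple": 0.9}               # wrong label, wrong decision

loss_same = decision_aware_loss(same_branch_error, truth, groups)
loss_cross = decision_aware_loss(cross_branch_error, truth, groups)
```

Under these assumptions, confusing "bottle" with "can" costs little because the application still takes the same branch, while confusing it with "apple" is penalized heavily.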
Full Text Available
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Liu, Yuhan; Li, Hanchen; Cheng, Yihua; Ray, Siddhant; Huang, Yuyang; Zhang, Qizheng; Du, Kuntai; Yao, Jiayi; Lu, Shan; Ananthanarayanan, Ganesh; et al. (August 2024, Association for Computing Machinery, New York, NY, United States)

As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging, as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause high extra network delays. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder that leverages the KV cache's distributional properties to encode a KV cache into more compact bitstream representations with negligible decoding overhead, saving bandwidth. Second, CacheGen adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth, in order to maintain low context-loading delay and high generation quality. We test CacheGen on popular LLMs and datasets. Compared to recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5–4.3× and the total delay in fetching and processing contexts by 3.2–3.7×, with negligible impact on the LLM response quality. Our code is at: https://github.com/UChi-JCL/CacheGen.
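The bandwidth-adaptive part of this description can be pictured with a small sketch: for each chunk of the KV cache, pick the least-compressed encoding that still fits the remaining time budget at the currently observed bandwidth. This is only an illustration of the general idea, not CacheGen's actual controller; the function names, the chunk sizes, and the even split of the remaining budget are all assumptions.

```python
def pick_level(level_sizes, bandwidth, time_budget):
    """Return the index of the highest-quality level that fits the budget.

    level_sizes: encoded sizes in bytes, ordered from highest quality
                 (largest) to lowest quality (smallest).
    bandwidth: observed throughput in bytes per second.
    Falls back to the most compressed level if nothing fits.
    """
    for index, size in enumerate(level_sizes):
        if size / bandwidth <= time_budget:
            return index
    return len(level_sizes) - 1

def plan_stream(chunks, bandwidth_trace, total_budget):
    """Choose one compression level per chunk under a total delay budget."""
    remaining = total_budget
    choices = []
    for k, (sizes, bandwidth) in enumerate(zip(chunks, bandwidth_trace)):
        # Split whatever time is left evenly across the chunks still to send.
        per_chunk_budget = remaining / (len(chunks) - k)
        level = pick_level(sizes, bandwidth, per_chunk_budget)
        remaining -= sizes[level] / bandwidth
        choices.append(level)
    return choices

# Hypothetical trace: bandwidth drops for the second chunk, then recovers.
chunks = [[400, 200, 100]] * 3          # bytes per level, per chunk
bandwidth_trace = [400, 100, 400]       # bytes per second, per chunk
levels = plan_stream(chunks, bandwidth_trace, total_budget=3.0)
```

In this made-up trace the planner sends the first and third chunks at the highest quality and falls back to the most compressed level only while bandwidth is low.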
Full Text Available
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

https://doi.org/10.1145/3651890.3672274

Liu, Yuhan; Li, Hanchen; Cheng, Yihua; Ray, Siddhant; Huang, Yuyang; Zhang, Qizheng; Du, Kuntai; Yao, Jiayi; Lu, Shan; Ananthanarayanan, Ganesh; et al. (August 2024, ACM)

Full Text Available
Run-Time Prevention of Software Integration Failures of Machine Learning APIs

https://doi.org/10.1145/3622806

Wan, Chengcheng; Liu, Yuhan; Du, Kuntai; Hoffmann, Henry; Jiang, Junchen; Maire, Michael; Lu, Shan (October 2023, Proceedings of the ACM on Programming Languages)

Because ML API interfaces are under-specified, developers face challenges in correctly integrating machine learning (ML) APIs into software. Even when the ML API and the software are well designed on their own, the resulting application misbehaves when the API output is incompatible with the software. It is desirable to have an adapter that converts ML API output at run time to better fit the software's needs and prevent integration failures. In this paper, we conduct an empirical study to understand ML API integration problems in real-world applications. Guided by this study, we present SmartGear, a tool that automatically detects and converts mismatching or incorrect ML API output at run time, serving as a middle layer between the ML API and the software. Our evaluation on a variety of open-source applications shows that SmartGear detects 70% of incompatible API outputs and prevents 67% of potential integration failures, outperforming alternative solutions.
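One way to picture such a run-time adapter is a thin layer that rewrites API labels into the vocabulary the application actually checks for. The sketch below is a toy version of that idea and not SmartGear's implementation; `adapt_labels`, the `hypernyms` table, and the example labels are hypothetical.

```python
def adapt_labels(api_labels, expected_labels, hypernyms):
    """Map each API label to a label the application expects, if possible.

    api_labels: labels returned by the ML API.
    expected_labels: set of lowercase labels the application compares against.
    hypernyms: dict mapping a lowercase label to its more general parent.
    Labels with no expected ancestor are passed through unchanged.
    """
    adapted = []
    for label in api_labels:
        current, seen = label.lower(), set()
        # Walk up the (assumed) hierarchy until an expected label is found.
        while current is not None and current not in seen:
            if current in expected_labels:
                break
            seen.add(current)
            current = hypernyms.get(current)
        matched = current is not None and current in expected_labels
        adapted.append(current if matched else label)
    return adapted

# Hypothetical mismatch: the application tests for "dog", the API says "Golden Retriever".
hypernyms = {"golden retriever": "retriever", "retriever": "dog"}
result = adapt_labels(["Golden Retriever", "Grass"], {"dog", "cat"}, hypernyms)
```

Here the adapter converts the overly specific label into the one the application's branch condition expects and leaves unrelated labels untouched.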
Full Text Available
OneAdapt: Fast Adaptation for Deep Learning Applications via Backpropagation

https://doi.org/10.1145/3620678.3624653

Du, Kuntai; Liu, Yuhan; Hao, Yitian; Zhang, Qizheng; Wang, Haodong; Huang, Yuyang; Ananthanarayanan, Ganesh; Jiang, Junchen (October 2023, SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing)

AccMPEG: Optimizing Video Encoding for Video Analytics

Du, Kuntai; Zhang, Qizheng; Arapin, Anton; Wang, Haodong; Xia, Zhengxu; Jiang, Junchen (July 2022, Proceedings of the 5th MLSys Conference)

Full Text Available
Understanding the potential of server-driven edge video analytics

https://doi.org/10.1145/3508396.3512872

Zhang, Qizheng; Du, Kuntai; Agarwal, Neil; Netravali, Ravi; Jiang, Junchen (January 2022, HotMobile '22: Proceedings of the 23rd Annual International Workshop on Mobile Computing Systems and Applications)

Full Text Available
Server-Driven Video Streaming for Deep Learning Inference

https://doi.org/10.1145/3387514.3405887

Du, Kuntai; Pervaiz, Ahsan; Yuan, Xin; Chowdhery, Aakanksha; Zhang, Qizheng; Hoffmann, Henry; Jiang, Junchen (August 2020, Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication (SIGCOMM ’20))

Full Text Available