NSF PAR Search | NSF Public Access Repository


PerfEstimator: A Generic and Extensible Performance Estimator for Data Parallel DNN Training

Yang, Chengru; Li, Zhehao; Ruan, Chaoyi; Xu, Guanbin; Li, Cheng; Chen, Ruichuan; Yan, Feng (May 2021, ICSE21 Workshop on Cloud Intelligence (ICSE Workshops 2021))

Full Text Available
The Age of Correlated Features in Supervised Learning based Forecasting

Shisher, Md Kamran; Qin, Heyang; Yang, Lei; Yan, Feng; Sun, Yin (May 2021, IEEE INFOCOM Age of Information Workshop (INFOCOM Workshops 2021))
Full Text Available
SciChain: Blockchain-enabled Lightweight and Efficient Data Provenance for Reproducible Scientific Computing

https://doi.org/10.1109/ICDE51399.2021.00166

Al-Mamun, Abdullah; Yan, Feng; Zhao, Dongfang (April 2021, 2021 IEEE 37th International Conference on Data Engineering (ICDE))
Full Text Available
Lunule: An Agile and Judicious Metadata Load Balancer for CephFS

Wang, Yiduo; Li, Cheng; Shao, Xinyang; Chen, Youxu; Yan, Feng; Xu, Yinlong (January 2021, 2021 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2021))
Full Text Available
AutoGR: Automated Geo-Replication with Fast System Performance and Preserved Application Semantics

Wang, Jiawei; Li, Cheng; Ma, Kai; Huo, Jingze; Yan, Feng; Feng, Xinyu; Xu, Yinlong (January 2021, The 47th International Conference on Very Large Data Bases (VLDB 2021))
Full Text Available
BAASH: Lightweight, Efficient, and Reliable Blockchain-As-A-Service for HPC Systems

Al-Mamun, Abdullah; Yan, Feng; Zhao, Dongfang (January 2021, 2021 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2021))
Full Text Available
Gradient Compression Supercharged High-Performance Data Parallel DNN Training

Bai, Youhui; Li, Cheng; Zhou, Quan; Yi, Jun; Gong, Ping; Yan, Feng; Chen, Ruichuan; Xu, Yinlong (January 2021, The 28th ACM Symposium on Operating Systems Principles (SOSP 2021))

Full Text Available
CEDULE+: Resource Management for Burstable Cloud Instances Using Predictive Analytics

https://doi.org/10.1109/TNSM.2020.3039942

Pinciroli, Riccardo; Ali, Ahsan; Yan, Feng; Smirni, Evgenia (November 2020, IEEE Transactions on Network and Service Management)
Nearly all principal cloud providers now provide burstable instances in their offerings. The main attraction of this type of instance is that it can boost its performance for a limited time to cope with workload variations. Although burstable instances are widely adopted, it is not clear how to efficiently manage them to avoid waste of resources. In this paper, we use predictive data analytics to optimize the management of burstable instances. We design CEDULE+, a data-driven framework that enables efficient resource management for burstable cloud instances by analyzing the system workload and latency data. CEDULE+ selects the most profitable instance type to process incoming requests and controls CPU, I/O, and network usage to minimize the resource waste without violating Service Level Objectives (SLOs). CEDULE+ uses lightweight profiling and quantile regression to build a data-driven prediction model that estimates system performance for all combinations of instance type, resource type, and system workload. CEDULE+ is evaluated on Amazon EC2, and its efficiency and high accuracy are assessed through real-case scenarios. CEDULE+ predicts application latency with errors less than 10%, extends the maximum performance period of a burstable instance up to 2.4 times, and decreases deployment costs by more than 50%.
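The quantile regression mentioned in this abstract can be illustrated with the pinball (quantile) loss, which penalizes under-prediction and over-prediction asymmetrically so that the minimizer is a chosen latency quantile rather than the mean. The sketch below is a generic illustration and not CEDULE+'s model: the function names and the latency samples are assumptions made up for this example.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss at quantile level tau, with 0 < tau < 1."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction (diff >= 0) costs tau per unit;
        # over-prediction costs (1 - tau) per unit.
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def best_constant_quantile(samples, tau):
    """Return the sample value that minimizes pinball loss as a constant predictor."""
    n = len(samples)
    return min(sorted(samples), key=lambda q: pinball_loss(samples, [q] * n, tau))


# Hypothetical latency samples in milliseconds (illustrative, not from the paper).
latencies_ms = [100, 110, 120, 130, 500]
median_estimate = best_constant_quantile(latencies_ms, 0.5)
tail_estimate = best_constant_quantile(latencies_ms, 0.9)
```

A high quantile level such as 0.9 pulls the estimate toward the slow outlier, which is why a quantile-based loss suits objectives stated on tail latency rather than on average latency.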
Full Text Available
BATCH: Machine Learning Inference Serving on Serverless Platforms with Adaptive Batching

https://doi.org/10.1109/SC41405.2020.00073

Ali, A.; Pinciroli, R. (November 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Atlanta, GA, US, pp. 972-986)

Serverless computing is a new pay-per-use cloud service paradigm that automates resource scaling for stateless functions and can potentially facilitate bursty machine learning serving. Batching is critical for latency performance and cost-effectiveness of machine learning inference, but unfortunately it is not supported by existing serverless platforms due to their stateless design. Our experiments show that without batching, machine learning serving cannot reap the benefits of serverless computing. In this paper, we present BATCH, a framework for supporting efficient machine learning serving on serverless platforms. BATCH uses an optimizer to provide inference tail latency guarantees and cost optimization and to enable adaptive batching support. We prototype BATCH atop AWS Lambda and popular machine learning inference systems. The evaluation verifies the accuracy of the analytic optimizer and demonstrates performance and cost advantages over the state-of-the-art method MArk and the state-of-the-practice tool SageMaker.
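The batching idea in this abstract can be sketched as a buffer that dispatches requests either when it is full or when the oldest buffered request has waited too long. This is a generic size-or-timeout policy, not BATCH's implementation: `form_batches`, `max_batch_size`, and `timeout` are hypothetical names, and in an adaptive scheme the two parameters would be chosen by an optimizer rather than fixed by hand.

```python
def form_batches(arrival_times, max_batch_size, timeout):
    """Group sorted request arrival times (seconds) into batches.

    A batch is dispatched when it holds max_batch_size requests, or when
    at least `timeout` seconds have passed since its first request arrived.
    """
    batches = []
    current = []
    for t in arrival_times:
        # The open batch timed out before this request arrived: dispatch it.
        if current and t - current[0] >= timeout:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full: dispatch it immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        # Flush whatever is still buffered at the end of the trace.
        batches.append(current)
    return batches


# Hypothetical arrival trace: a burst, a gap, then a second burst.
trace = [0.0, 0.1, 0.2, 0.9, 1.0, 1.05, 1.1]
batches = form_batches(trace, max_batch_size=3, timeout=0.5)
```

A larger batch size amortizes per-invocation cost over more requests, while a shorter timeout bounds how long the first request in a batch waits. Balancing the two is the trade-off between cost and tail latency.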
Full Text Available
Memory Scaling of Cloud-based Big Data Systems: A Hybrid Approach

https://doi.org/10.1109/TBDATA.2020.3035522

Wang, Xinying; Xu, Cong; Wang, Ke; Yan, Feng; Zhao, Dongfang (November 2020, IEEE Transactions on Big Data)
Full Text Available
