Apollo: An ML-assisted Real-Time Storage Resource Observer

Rajesh, Neeraj; Devarajan, Hariharan; Garcia, Jaime Cernuda; Bateman, Keith; Logan, Luke; Ye, Jie; Kougkas, Anthony; Sun, Xian-He

doi:10.1145/3431379.3460640

Citation Details


Applications and middleware services, such as data placement engines, I/O schedulers, and prefetching engines, require low-latency access to telemetry data in order to make optimal decisions. However, typical monitoring services store their telemetry data in a database that applications must then query, which incurs significant latency penalties. This work presents Apollo, a low-latency monitoring service that aims to provide applications and middleware libraries with direct access to relational telemetry data. Monitoring the system can create interference and overhead, slowing down the raw performance of the resources available to the job. However, a current view of the system can help middleware services make better decisions, which can ultimately improve overall performance. Apollo has been designed from the ground up to provide low latency, using Publish–Subscribe (Pub-Sub) semantics, and low overhead, using two techniques: adaptive intervals, which change the length of time between polls of a resource for telemetry data, and machine learning, which predicts changes to the telemetry data between actual resource polls. This work also provides high-level abstractions called I/O curators, which can further help middleware libraries and applications make optimal decisions. Evaluations show that Apollo can achieve sub-millisecond latency for acquiring complex insights, with a memory overhead of ~57 MB and a CPU overhead only 7% higher than existing state-of-the-art systems.
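The adaptive-interval idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how Apollo chooses its intervals, so everything below is an assumption made for illustration: the class name `AdaptivePoller`, the halve-on-change and grow-by-1.5x rule, the 5% change threshold, and the interval bounds are all hypothetical and are not taken from the paper. The machine-learning prediction between polls is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptivePoller:
    """Hypothetical adaptive-interval poller; NOT Apollo's actual algorithm."""

    min_interval: float = 0.1   # seconds; assumed lower bound
    max_interval: float = 10.0  # seconds; assumed upper bound
    threshold: float = 0.05     # relative change treated as significant (assumed)
    interval: float = 1.0       # current gap between polls, in seconds
    last_value: Optional[float] = None

    def observe(self, value: float) -> float:
        """Record a polled value and return the wait before the next poll."""
        if self.last_value is not None:
            # Relative change since the previous poll, guarding against division by zero.
            base = max(abs(self.last_value), 1e-9)
            change = abs(value - self.last_value) / base
            if change > self.threshold:
                # The metric is changing quickly: poll more often.
                self.interval = max(self.min_interval, self.interval / 2)
            else:
                # The metric is stable: back off to reduce monitoring overhead.
                self.interval = min(self.max_interval, self.interval * 1.5)
        self.last_value = value
        return self.interval
```

In this sketch a stable metric lets the interval grow toward `max_interval`, which lowers polling overhead, while a volatile metric shrinks it toward `min_interval`, which keeps the view of the resource current.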

Award ID(s): 1835764 1730488 1814872

PAR ID: 10295039

Author(s) / Creator(s): Rajesh, Neeraj; Devarajan, Hariharan; Garcia, Jaime Cernuda; Bateman, Keith; Logan, Luke; Ye, Jie; Kougkas, Anthony; Sun, Xian-He

Publisher / Repository: ACM

Date Published: 2021-06-21

Journal Name: The 30th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '21)

Page Range / eLocation ID: 147 to 159

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1145/3431379.3460640
