Hyperparameter tuning is essential to achieving state-of-the-art accuracy in machine learning (ML), but requires substantial compute resources to perform. Existing systems primarily focus on effectively allocating resources for a hyperparameter tuning job under fixed resource constraints. We show that the available parallelism in such jobs changes dynamically over the course of execution and, therefore, presents an opportunity to leverage the elasticity of the cloud. In particular, we address the problem of minimizing the financial cost of executing a hyperparameter tuning job, subject to a time constraint. We present RubberBand---the first framework for cost-efficient, elastic execution of hyperparameter tuning jobs in the cloud. RubberBand utilizes performance instrumentation and cloud pricing to model job completion time and cost prior to runtime, and generate a cost-efficient, elastic resource allocation plan. RubberBand is able to efficiently execute this plan and realize a cost reduction of up to 2x in comparison to static allocation baselines.
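The planning idea described above can be illustrated with a toy cost model. The sketch below is not RubberBand's actual algorithm: the price constant, the Amdahl-style efficiency model, and the brute-force search over worker counts are all illustrative assumptions standing in for the paper's performance instrumentation and planner.

```python
from itertools import product

PRICE = 3.0   # hypothetical $ per worker-hour
ALPHA = 0.1   # hypothetical parallelization-overhead factor

def phase_time(work, workers):
    # Amdahl-style model: speedup degrades as workers are added,
    # so adding workers cuts time but raises total dollar cost.
    speedup = workers / (1 + ALPHA * (workers - 1))
    return work / speedup

def plan(phase_work, deadline, max_workers=8):
    """Brute-force a per-phase worker allocation minimizing dollar cost
    subject to a total completion-time constraint."""
    best = None
    for alloc in product(range(1, max_workers + 1), repeat=len(phase_work)):
        times = [phase_time(w, n) for w, n in zip(phase_work, alloc)]
        if sum(times) > deadline:
            continue  # misses the deadline: infeasible
        cost = PRICE * sum(n * t for n, t in zip(alloc, times))
        if best is None or cost < best[0]:
            best = (cost, alloc)
    return best

# e.g. three tuning phases with shrinking parallel work as trials are pruned
print(plan([8.0, 4.0, 1.0], deadline=4.0))
```

Because the available parallelism shrinks as trials are pruned, the cheapest feasible plan typically allocates many workers early and few late, which is the elasticity the paper exploits.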
This content will become publicly available on May 1, 2026
ArrayMorph: Optimizing Hyperslab Queries on the Cloud for Machine Learning Pipelines
Cloud storage services such as Amazon S3, Azure Blob Storage, and Google Cloud Storage are widely used to store raw data for machine learning applications. When the data is later processed, the analysis predominantly focuses on regions of interest (such as a small bounding box in a larger image) and discards uninteresting regions. Machine learning applications can significantly accelerate their I/O if they push this data filtering step to the cloud. Prior work has proposed different methods to partially read array (tensor) objects, such as chunking, reading a contiguous byte range, and evaluating a lambda function. No single method is optimal in all cases; estimating the total time and cost of a data retrieval requires an understanding of the data serialization order, the chunk size, and platform-specific properties. This paper introduces ArrayMorph, a cloud-based array data storage system that automatically determines the best method for retrieving regions of interest from data on the cloud. ArrayMorph formulates data accesses as hyperslab queries and optimizes them using a multi-phase, cost-based approach. ArrayMorph seamlessly integrates with Python/PyTorch-based ML applications, and is experimentally shown to transfer up to 9.8X less data than existing systems. This makes ML applications run up to 1.7X faster and 9X cheaper than prior solutions.
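The cost-based choice among retrieval methods can be sketched with a simple analytical model. The constants below are illustrative stand-ins for real cloud latency, bandwidth, and pricing, and the three plans are a simplification of the methods the paper compares; this is not ArrayMorph's optimizer.

```python
# Hypothetical S3-like performance/pricing constants (illustrative only)
LATENCY_S  = 0.05    # per-request round-trip latency, seconds
BW_BPS     = 100e6   # sustained transfer throughput, bytes/second
REQ_PRICE  = 4e-7    # $ per GET request
BYTE_PRICE = 9e-11   # $ per byte transferred

def estimate(n_requests, n_bytes):
    """Estimated wall-clock time and dollar cost of a retrieval plan."""
    time = n_requests * LATENCY_S + n_bytes / BW_BPS
    cost = n_requests * REQ_PRICE + n_bytes * BYTE_PRICE
    return time, cost

def choose_method(object_bytes, roi_chunks, chunk_bytes, span_bytes):
    """Pick the cheapest plan among a full-object read, per-chunk reads,
    and one contiguous byte range covering the region of interest."""
    plans = {
        "full":   estimate(1, object_bytes),
        "chunks": estimate(roi_chunks, roi_chunks * chunk_bytes),
        "range":  estimate(1, span_bytes),
    }
    return min(plans, key=lambda k: plans[k][1])  # minimize dollar cost

# a small, chunk-aligned region of a 1 GB object: per-chunk reads win
print(choose_method(object_bytes=1e9, roi_chunks=4,
                    chunk_bytes=1e6, span_bytes=5e7))  # -> chunks
```

With many tiny chunks and a compact byte span, the same function flips to the range read, which is why a cost model rather than a fixed method is needed.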
- Award ID(s):
- 2112606
- PAR ID:
- 10640808
- Publisher / Repository:
- Proceedings of the VLDB Endowment
- Date Published:
- Journal Name:
- Proceedings of the VLDB Endowment
- Volume:
- 18
- Issue:
- 9
- ISSN:
- 2150-8097
- Page Range / eLocation ID:
- 3189 to 3202
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Hybrid storage systems are prevalent in most large-scale enterprise storage systems because they balance storage performance, capacity, and cost. The goal of such systems is to serve the majority of I/O requests from high-performance devices and to store less frequently used data on low-performance devices. A large data-migration volume between tiers can impose significant overhead in practical hybrid storage systems, so balancing the trade-off between migration cost and potential performance gain is a challenging and critical issue. In this paper, we focus on the data migration problem in hybrid storage systems with two classes of storage devices. We propose a machine learning-based migration algorithm called K-Means assisted Support Vector Machine (K-SVM) migration, which classifies data more precisely and migrates it more efficiently between the performance and capacity tiers. Moreover, the K-SVM migration algorithm uses K-Means clustering to dynamically select a proper training dataset, significantly reducing the volume of migrated data. Finally, results from a real implementation indicate that the ML-based algorithm reduces the migration data volume by about 40% and achieves 70% lower latency than other algorithms.
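The K-SVM idea can be sketched with scikit-learn as a stand-in: K-Means picks representative blocks near each centroid as the SVM's training set, shrinking the data the classifier must fit. The synthetic features, cluster count, and per-cluster sample size below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# synthetic per-block features: [access frequency, recency score]
hot  = rng.normal([50.0, 0.9], [5.0, 0.05], size=(200, 2))
cold = rng.normal([ 2.0, 0.1], [1.0, 0.05], size=(200, 2))
X = np.vstack([hot, cold])
y = np.array([1] * 200 + [0] * 200)   # 1 = promote to performance tier

# K-Means selects the blocks closest to each centroid as training data,
# so the SVM trains on a small representative subset (the K-SVM idea).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
train_idx = []
for c in range(2):
    members = np.where(km.labels_ == c)[0]
    dist = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    train_idx.extend(members[np.argsort(dist)[:50]])  # 50 closest per cluster

clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
accuracy = clf.score(X, y)
print(f"tier-classification accuracy: {accuracy:.2f}")
```

Training on 100 representative blocks instead of all 400 mirrors how the paper's dynamic training-set selection cuts the data that must be processed before each migration decision.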
-
Real-time applications such as autonomous and connected cars, surveillance, and online learning have to train on streaming data. They require low-latency, high-throughput machine learning (ML) functions resident in the network and in the cloud to perform learning and inference. NFV on edge cloud platforms can support these applications through heterogeneous computing, including GPUs and other accelerators, to offload ML-related computation. GPUs provide the speedup needed for learning and inference to meet the demands of these latency-sensitive real-time applications. Supporting ML inference and learning efficiently on streaming data in NFV platforms poses several challenges. In this paper, we present NetML, a framework that runs existing ML applications on a heterogeneous NFV platform that includes both CPUs and GPUs. NetML efficiently transfers the appropriate packet payload to the GPU, minimizing overheads, avoiding locks, and avoiding CPU-based data copies. Additionally, NetML minimizes latency by maximizing overlap between data movement and GPU computation. We evaluate the efficiency of our approach for training and inference using popular object detection algorithms on our platform. NetML reduces image inference latency by more than 20% and increases training throughput by 30% while reducing CPU utilization compared to other state-of-the-art alternatives.
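The overlap NetML maximizes can be illustrated with a generic producer-consumer pipeline: while one batch is being processed, the next is already in flight. This is a plain-Python sketch with simulated latencies standing in for payload transfers and GPU kernels; it is not NetML's implementation.

```python
import threading, queue, time

def producer(batches, q):
    # stands in for moving packet payloads toward the accelerator
    for b in batches:
        time.sleep(0.01)          # simulated transfer latency
        q.put(b)
    q.put(None)                   # sentinel: no more batches

def consumer(q, results):
    # stands in for inference on each transferred batch
    while (b := q.get()) is not None:
        time.sleep(0.01)          # simulated compute latency
        results.append(sum(b))

q = queue.Queue(maxsize=2)        # bounded queue = double buffering
results = []
t1 = threading.Thread(target=producer, args=([[1, 2], [3, 4], [5, 6]], q))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)  # [3, 7, 11]
```

With transfer and compute overlapped, total time approaches the slower of the two stages rather than their sum, which is the latency win the abstract describes.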
-
Poole, Steve; Hernandez, Oscar; Baker, Matthew; Curtis, Tony (Ed.) SHMEM-ML is a domain-specific library for distributed array computations and machine learning model training and inference. Like other projects at the intersection of machine learning and HPC (e.g., dask, Arkouda, Legate Numpy), SHMEM-ML aims to leverage the performance of the HPC software stack to accelerate machine learning workflows, but it differs in a number of ways. First, SHMEM-ML targets the full machine learning workflow, not just model training. It supports the general-purpose nd-array abstraction commonly used in Python machine learning applications, and efficiently distributes transformation and manipulation of nd-arrays across the full system. Second, SHMEM-ML uses OpenSHMEM as its underlying communication layer, enabling high-performance networking across hundreds or thousands of distributed processes. While most past work in high-performance machine learning has leveraged HPC message-passing communication models to efficiently exchange model gradient updates, SHMEM-ML's focus on the full machine learning lifecycle demands a more flexible and adaptable communication model supporting both fine- and coarse-grain communication. Third, SHMEM-ML interoperates with the broader Python machine learning software ecosystem. While some frameworks aim to rebuild that ecosystem from scratch on top of the HPC software stack, SHMEM-ML is built on top of Apache Arrow, an in-memory standard for data formatting and data exchange between libraries. This enables SHMEM-ML to share data with other libraries without creating copies. This paper describes the design, implementation, and evaluation of SHMEM-ML, demonstrating a general-purpose system for data transformation and manipulation while achieving up to a 38× speedup in distributed training performance relative to the industry-standard Horovod framework, without a regression in model metrics.