We present the vision of LiveDataLab and discuss the new research directions and application opportunities it opens up. LiveDataLab is envisioned to be a cloud-based open lab infrastructure where research, education, and application development in big data can be integrated in one unified platform, thus accelerating research, technology transfer, and workforce development in big data.
OneDataShare - A Vision for Cloud-hosted Data Transfer Scheduling and Optimization as a Service
Fast, reliable, and efficient data transfer across wide-area networks is a predominant bottleneck for data-intensive cloud applications. This paper introduces OneDataShare, which is designed to eliminate the issues plaguing effective cloud-based data transfers of varying file sizes and across incompatible transfer end-points. The vision of OneDataShare is to achieve high-speed data transfer, interoperability between multiple transfer protocols, and accurate estimation of delivery time for advance planning, thereby maximizing user profit through improved and faster data analysis for business intelligence. The paper elaborates on the desirable features of OneDataShare as a cloud-hosted data transfer scheduling and optimization service, and how it is aligned with the vision of harnessing the power of the cloud and distributed computing. Experimental evaluation and comparison with existing real-life file transfer services show that the transfer throughput achieved by OneDataShare is up to 6.5 times greater than that of other approaches.
- Award ID(s): 1724898
- PAR ID: 10074014
- Date Published:
- Journal Name: Proceedings of the 8th International Conference on Cloud Computing and Services Science
- Volume: 1
- Page Range / eLocation ID: 616 to 625
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
The emergence of big data has created new challenges for researchers transmitting big data sets across campus networks to local (HPC) cloud resources, or over wide area networks to public cloud services. Unlike conventional HPC systems where the network is carefully architected (e.g., a high speed local interconnect, or a wide area connection between Data Transfer Nodes), today's big data communication often occurs over shared network infrastructures with many external and uncontrolled factors influencing performance. This paper describes our efforts to understand and characterize the performance of various big data transfer tools such as rclone, cyberduck, and other provider-specific CLI tools when moving data to/from public and private cloud resources. We analyze the various parameter settings available on each of these tools and their impact on performance. Our experimental results give insights into the performance of cloud providers and transfer tools, and provide guidance for parameter settings when using cloud transfer tools. We also explore performance when coming from HPC DTN nodes as well as researcher machines located deep in the campus network, and show that emerging SDN approaches such as the VIP Lanes system can deliver excellent performance even from researchers' machines.
-
Vision-language models (VLMs) have transformed Generative AI by enabling systems to interpret and respond to multi-modal data in real time. While advancements in edge computing have made it possible to deploy smaller Large Language Models (LLMs) on smartphones and laptops, deploying competent VLMs on edge devices remains challenging due to their high computational demands. Furthermore, cloud-only deployments fail to utilize the evolving processing capabilities at the edge and limit responsiveness. This paper introduces a distributed architecture for VLMs that addresses these limitations by partitioning model components between edge devices and central servers. In this setup, vision components run on edge devices for immediate processing, while language generation of the VLM is handled by a centralized server, resulting in up to 33% improvement in throughput over traditional cloud-only solutions. Moreover, our approach enhances the computational efficiency of off-the-shelf VLMs without the need for model compression techniques. This work demonstrates the scalability and efficiency of a hybrid architecture for VLM deployment and contributes to the discussion on how distributed approaches can improve VLM performance. Index Terms: vision-language models (VLMs), edge computing, distributed computing, inference optimization, edge-cloud collaboration.
-
Sea ice acts as both an indicator and an amplifier of climate change. High spatial resolution (HSR) imagery is an important data source in Arctic sea ice research for extracting sea ice physical parameters and calibrating/validating climate models. HSR images are difficult to process and manage due to their large data volume, heterogeneous data sources, and complex spatiotemporal distributions. In this paper, an Arctic Cyberinfrastructure (ArcCI) module is developed that allows reliable and efficient on-demand image batch processing on the web. For this module, available associated datasets are collected and presented through an open data portal. The ArcCI module offers an architecture based on cloud computing and big data components for HSR sea ice images, including functionalities of (1) data acquisition through File Transfer Protocol (FTP) transfer, front-end uploading, and physical transfer; (2) data storage based on the Hadoop distributed file system and a mature operational relational database; (3) distributed image processing including object-based image classification and parameter extraction of sea ice features; and (4) 3D visualization of the dynamic spatiotemporal distribution of extracted parameters with flexible statistical charts. Arctic researchers can search for and find Arctic sea ice HSR images and relevant metadata in the open data portal, obtain extracted ice parameters, and conduct visual analytics interactively. Users with a large number of images can leverage the service to process their images in a high-performance manner on the cloud, and manage and analyze results in one place. The ArcCI module will assist domain scientists in investigating polar sea ice, and can be easily transferred to other HSR image processing research projects.
-
This paper introduces a novel LiDAR point cloud data encoding solution that is compact, flexible, and fully supports distributed data storage within the Hadoop distributed computing environment. The proposed data encoding solution is developed based on Sequence File and Google Protocol Buffers. Sequence File is a generic splittable binary file format built into the Hadoop framework for storage of arbitrary binary data. The key challenge in adopting the Sequence File format for LiDAR data is in the strategy for effectively encoding the LiDAR data as binary sequences in a way that the data can be represented compactly, while allowing necessary mutation. For that purpose, a data encoding solution based on Google Protocol Buffers (a language-neutral, cross-platform, extensible data serialisation framework) was developed and evaluated. Since neither of the underlying technologies is sufficient to completely and efficiently represent all necessary point formats for distributed computing, an innovative fusion of them was required to provide a viable data storage solution. This paper presents the details of such a data encoding implementation and rigorously evaluates the efficiency of the proposed data encoding solution. Benchmarking was done against a straightforward, naive text encoding implementation using a high-density aerial LiDAR scan of a portion of Dublin, Ireland. The results demonstrated a 6-times reduction in data volume, a 4-times reduction in database ingestion time, and up to a 5-times reduction in querying time.

