NSF PAR Search | NSF Public Access Repository

The Art of Sparsity: Mastering High-Dimensional Tensor Storage

https://doi.org/10.1109/IPDPSW63119.2024.00094

Dong, Bin; Wu, Kesheng; Byna, Suren (May 2024, IEEE)

Full Text Available
Preparing Spectral Data for Machine Learning: A Study of Geological Classification from Aerial Surveys

Chung, Jun Woo; Sim, Alex; Quiter, Brian; Wu, Yuxin; Zhao, Weijie; Wu, Kesheng (December 2023, Machine Learning and the Physical Sciences Workshop, NeurIPS 2023)

This study focuses on improving the preparation of spectral data for machine learning. It does so through a case study that matches an airborne gamma-ray spectral survey of the San Francisco Bay area to geological classifications provided by the United States Geological Survey (Graymer et al., 2006). Our investigation revealed three key approaches for enhancing accuracy in this task: 1) eliminating extraneous data segments unrelated to the main task, 2) augmenting minority classes to improve class balance, and 3) merging inconsistent classes. By incorporating these methods, we increased classification accuracy from an initial 40.8% to approximately 72.7%. We plan to continue our work to further enhance performance, with the goal of extending the applicability of these methods to other data types and tasks. One potential future application is the detection of rare earth elements from aerial surveys.
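As a rough illustration of the second and third approaches in the abstract above, the sketch below merges classes through a relabeling map and balances classes by random duplication. Both helpers (`merge_classes`, `oversample_minority`) and the toy labels are hypothetical; the abstract does not say how the authors augment minority classes, so random oversampling is swapped in here purely as an assumption.

```python
import random
from collections import Counter


def merge_classes(labels, merge_map):
    """Relabel classes according to merge_map; unlisted labels are unchanged."""
    return [merge_map.get(y, y) for y in labels]


def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class is as large as
    the largest one. This is an assumed stand-in for 'augmentation'."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


# Toy data: three samples of class "a", one each of "b" and "c".
labels = ["a", "a", "a", "b", "c"]
samples = [1, 2, 3, 4, 5]

merged = merge_classes(labels, {"c": "b"})          # treat "c" as "b"
bal_x, bal_y = oversample_minority(samples, merged)
class_counts = Counter(bal_y)
```

Merging before oversampling keeps the duplicated samples from reinforcing a label distinction that is later discarded.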
Full Text Available
Serving Deep Learning Models from Relational Databases

https://doi.org/10.48786/edbt.2024.61

Zhou, Lixi; Lin, Qi; Chowdhury, Kanchan; Masood, Saif; Eichenberger, Alexandre; Min, Hong; Sim, Alexander; Wang, Jie; Wang, Yida; Wu, Kesheng; et al. (January 2024, OpenProceedings.org)

Serving deep learning (DL) models on relational data has become a critical requirement across diverse commercial and scientific domains, sparking growing interest recently. In this visionary paper, we embark on a comprehensive exploration of representative architectures to address the requirement. We highlight three pivotal paradigms: The state-of-the-art DL-centric architecture offloads DL computations to dedicated DL frameworks. The potential UDF-centric architecture encapsulates one or more tensor computations into User Defined Functions (UDFs) within the relational database management system (RDBMS). The potential relation-centric architecture aims to represent a large-scale tensor computation through relational operators. While each of these architectures demonstrates promise in specific use scenarios, we identify urgent requirements for seamless integration of these architectures and for the middle ground between them. We delve into the gaps that impede the integration and explore innovative strategies to close them. We present a pathway to establish a novel RDBMS for enabling a broad class of data-intensive DL inference applications.
Automatic Data Transformation Using Large Language Model - An Experimental Study on Building Energy Data

https://doi.org/10.1109/BigData59044.2023.10386931

Sharma, Ankita; Li, Xuanmao; Guan, Hong; Sun, Guoxin; Zhang, Liang; Wang, Lanjun; Wu, Kesheng; Cao, Lei; Zhu, Erkang; Sim, Alexander; et al. (December 2023, Proceedings of 2023 IEEE International Conference on Big Data (IEEE BigData 2023))

Existing approaches to automatic data transformation are insufficient to meet the requirements in many real-world scenarios, such as the building sector. First, there is no convenient interface for domain experts to provide domain knowledge easily. Second, they require significant training data collection overheads. Third, the accuracy suffers from complicated schema changes. To address these shortcomings, we present a novel approach that leverages the unique capabilities of large language models (LLMs) in coding, complex reasoning, and zero-shot learning to generate SQL code that transforms the source datasets into the target datasets. We demonstrate the viability of this approach by designing an LLM-based framework, termed SQLMorpher, which comprises a prompt generator that integrates the initial prompt with optional domain knowledge and historical patterns in external databases. It also implements an iterative prompt optimization mechanism that automatically improves the prompt based on flaw detection. The key contributions of this work include (1) pioneering an end-to-end LLM-based solution for data transformation, (2) developing a benchmark dataset of 105 real-world building energy data transformation problems, and (3) conducting an extensive empirical evaluation where our approach achieved 96% accuracy across all 105 problems. SQLMorpher demonstrates the effectiveness of utilizing LLMs in complex, domain-specific challenges, highlighting their potential to drive sustainable solutions.
Analyzing Transatlantic Network Traffic over Scientific Data Caches

https://doi.org/10.1145/3589012.3594897

Deng, Ziyue; Sim, Alex; Wu, Kesheng; Guok, Chin; Hazen, Damian; Monga, Inder; Andrijauskas, Fabio; Würthwein, Frank; Weitzel, Derek (July 2023, Proceedings of the 2023 on Systems and Network Telemetry and Analytics)

Full Text Available
Studying Scientific Data Lifecycle in On-demand Distributed Storage Caches

https://doi.org/10.1145/3526064.3534111

Bellavita, Julian; Sim, Alex; Wu, Kesheng; Monga, Inder; Guok, Chin; Würthwein, Frank; Davila, Diego (June 2022, SNTA '22: Fifth International Workshop on Systems and Network Telemetry and Analytics)

Full Text Available
Access Trends of In-network Cache for Scientific Data

https://doi.org/10.1145/3526064.3534110

Han, Ruize; Sim, Alex; Wu, Kesheng; Monga, Inder; Guok, Chin; Würthwein, Frank; Davila, Diego; Balcas, Justas; Newman, Harvey (June 2022, SNTA '22: Fifth International Workshop on Systems and Network Telemetry and Analytics)

Full Text Available
Analyzing Scientific Data Sharing Patterns for In-network Data Caching

https://doi.org/10.1145/3452411.3464441

Copps, Elizabeth; Zhang, Huiyi; Sim, Alex; Wu, Kesheng; Monga, Inder; Guok, Chin; Würthwein, Frank; Davila, Diego; Fajardo, Edgar (June 2021, SNTA '21: Proceedings of the 2021 on Systems and Network Telemetry and Analytics)
The volume of data moving through a network increases with new scientific experiments and simulations. Network bandwidth requirements also increase proportionally to deliver data within a certain time frame. We observe that a significant portion of the popular dataset is transferred multiple times to different users as well as to the same user for various reasons. In-network data caching for the shared data has been shown to reduce redundant data transfers and consequently save network traffic volume. In addition, overall application performance is expected to improve with in-network caching because access to locally cached data results in lower latency. This paper shows how much data was shared over the study period, how much network traffic volume was consequently saved, and how much the temporary in-network caching increased scientific application performance. It also analyzes data access patterns in applications and the impacts of caching nodes on the regional data repository. From the results, we observed that the network bandwidth demand was reduced by nearly a factor of 3 over the study period.
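To make the notion of a traffic reduction factor concrete, here is a minimal sketch of how such a figure could be derived from an access log. It is not the authors' method: the `traffic_reduction` helper, the log format, and the idealized cache (unbounded, never evicting, so each file crosses the wide-area network only on first access) are all assumptions made for illustration.

```python
def traffic_reduction(accesses):
    """Summarize an access log of (file_id, size_bytes) pairs.

    Assumes an idealized cache with unlimited capacity and no eviction:
    the first access to a file is a miss that fetches it over the network,
    and every later access is served locally.

    Returns (requested_bytes, transferred_bytes, reduction_factor), where
    reduction_factor is None when nothing was transferred.
    """
    cached = set()
    requested = 0
    transferred = 0
    for file_id, size in accesses:
        requested += size
        if file_id not in cached:      # cache miss: fetch over the network
            cached.add(file_id)
            transferred += size
    factor = requested / transferred if transferred else None
    return requested, transferred, factor


# Toy log: file "a" is read three times, file "b" once.
log = [("a", 100), ("b", 50), ("a", 100), ("a", 100)]
requested, transferred, factor = traffic_reduction(log)
```

With a real cache of finite size, evictions would turn some repeat accesses back into misses, so a figure computed this way is an upper bound on the achievable savings.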
Full Text Available
ArrayBridge: Interweaving Declarative Array Processing in SciDB with Imperative HDF5-Based Programs

https://doi.org/10.1109/ICDE.2018.00092

Xing, Haoyuan; Floratos, Sofoklis; Blanas, Spyros; Byna, Suren; Prabhat, M.; Wu, Kesheng; Brown, Paul (April 2018, IEEE 34th International Conference on Data Engineering (ICDE) 2018)

Full Text Available
