NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Submarine: A subscription‐based data streaming framework for integrating large facilities and advanced cyberinfrastructure

https://doi.org/10.1002/cpe.5256

Zamani, Ali Reza; AbdelBaky, Moustafa; Balouek‐Thomert, Daniel; Villalobos, J. J.; Rodero, Ivan; Parashar, Manish (April 2019, Concurrency and Computation: Practice and Experience)

Summary Large scientific facilities provide researchers with instrumentation, data, and data products that can accelerate scientific discovery. However, increasing data volumes coupled with limited local computational power prevents researchers from taking full advantage of what these facilities can offer. Many researchers looked into using commercial and academic cyberinfrastructure (CI) to process these data. Nevertheless, there remains a disconnect between large facilities and CI that requires researchers to be actively part of the data processing cycle. The increasing complexity of CI and data scale necessitates new data delivery models, those that can autonomously integrate large‐scale scientific facilities and CI to deliver real‐time data and insights. In this paper, we present our initial efforts using the Ocean Observatories Initiative project as a use case. In particular, we present a subscription‐based data streaming service for data delivery that leverages the Apache Kafka data streaming platform. We also show how our solution can automatically integrate large‐scale facilities with CI services for automated data processing.
more » « less
Leveraging user access patterns and advanced cyberinfrastructure to accelerate data delivery from shared-use scientific observatories

https://doi.org/10.1016/j.future.2021.03.004

Qin, Yubo; Rodero, Ivan; Simonet, Anthony; Meertens, Charles; Reiner, Daniel; Riley, James; Parashar, Manish (September 2021, Future Generation Computer Systems)
null (Ed.)
Full Text Available
Harnessing the Computing Continuum for Urgent Science

https://doi.org/10.1145/3439602.3439618

Balouek-Thomert, Daniel; Rodero, Ivan; Parashar, Manish (November 2020, ACM SIGMETRICS Performance Evaluation Review)

Full Text Available
An edge-aware autonomic runtime for data streaming and in-transit processing

https://doi.org/10.1016/j.future.2020.03.037

Zamani, Ali Reza; Balouek-Thomert, Daniel; Villalobos, J.J.; Rodero, Ivan; Parashar, Manish (September 2020, Future Generation Computer Systems)
null (Ed.)
Full Text Available
A Distributed Multi-Sensor Machine Learning Approach to Earthquake Early Warning

https://doi.org/10.1609/aaai.v34i01.5376

Fauvel, Kevin; Balouek-Thomert, Daniel; Melgar, Diego; Silva, Pedro; Simonet, Anthony; Antoniu, Gabriel; Costan, Alexandru; Masson, Véronique; Parashar, Manish; Rodero, Ivan; et al (June 2020, Proceedings of the AAAI Conference on Artificial Intelligence)

Our research aims to improve the accuracy of Earthquake Early Warning (EEW) systems by means of machine learning. EEW systems are designed to detect and characterize medium and large earthquakes before their damaging effects reach a certain location. Traditional EEW methods based on seismometers fail to accurately identify large earthquakes due to their sensitivity to the ground motion velocity. The recently introduced high-precision GPS stations, on the other hand, are ineffective to identify medium earthquakes due to its propensity to produce noisy data. In addition, GPS stations and seismometers may be deployed in large numbers across different locations and may produce a significant volume of data consequently, affecting the response time and the robustness of EEW systems.In practice, EEW can be seen as a typical classification problem in the machine learning field: multi-sensor data are given in input, and earthquake severity is the classification result. In this paper, we introduce the Distributed Multi-Sensor Earthquake Early Warning (DMSEEW) system, a novel machine learning-based approach that combines data from both types of sensors (GPS stations and seismometers) to detect medium and large earthquakes. DMSEEW is based on a new stacking ensemble method which has been evaluated on a real-world dataset validated with geoscientists. The system builds on a geographically distributed infrastructure, ensuring an efficient computation in terms of response time and robustness to partial infrastructure failures. Our experiments show that DMSEEW is more accurate than the traditional seismometer-only approach and the combined-sensors (GPS and seismometers) approach that adopts the rule of relative strength.
more » « less
Full Text Available
Exploring the Potential of Elastic Computing Clusters in Geo-Distributed Data Centers with Fast Fabric Interconnection

https://doi.org/10.1109/HPCC/SmartCity/DSS.2019.00135

Chen, Shouwei; Wang, Wensheng; Rodero, Ivan (August 2019, 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS))

Large-scale enterprise computing systems are growing rapidly, to address the increasing demand for data processing; however, in many cases, the computing resources in a single data center may not be sufficient for critical data-centric workloads, and important factors, such as space limitations, power availability, or company policies, limit the possibilities of expanding the data center's resources. In this paper, we explore the potential of harvesting spare computing resources across geo-distributed data centers with fast fabric interconnection for real-world enterprise applications. We specifically characterize the computing resource utilization of four large-scale production data centers, and we show how to efficiently combine local storage and computing clusters with remote and elastic computation resources. The primary challenge is incorporating the available remote computing resources efficiently. To achieve this goal, we propose leveraging the capabilities of Kubernetes-based elastic computing clusters to utilize the spare computing resources across geo-distributed data centers for Big Data applications. We also provide an experimental performance evaluation based on real-use case scenarios via an empirical execution and a simulation, which shows that the proposed system can accelerate Big Data services by employing existing computing resources more efficiently across geo-distributed data centers.
more » « less
Full Text Available
Towards a Smart, Internet-Scale Cache Service for Data Intensive Scientific Applications

https://doi.org/10.1145/3322795.3331464

Qin, Yubo; Simonet, Anthony; Davis, Philip E.; Nouri, Azita; Wang, Zhe; Qin, Manish Parashar; Rodero, Ivan (June 2019, ScienceCloud '19: Proceedings of the 10th Workshop on Scientific Cloud Computing)

Full Text Available
Optimizing Performance and Computing Resource Management of In-memory Big Data Analytics with Disaggregated Persistent Memory

https://doi.org/10.1109/CCGRID.2019.00012

Chen, Shouwei; Wang, Wensheng; Wu, Xueyang; Fan, Zhen; Huang, Kunwu; Zhuang, Peiyu; Li, Yue; Rodero, Ivan; Parashar, Manish; Weng, Dennis (May 2019, 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID))

The performance of modern Big Data frameworks, e.g. Spark, depends greatly on high-speed storage and shuffling, which impose a significant memory burden on production data centers. In many production situations, the persistence and shuffling intensive applications can suffer a major performance loss due to lack of memory. Thus, the common practice is usually to over-allocate the memory assigned to the data workers for production applications, which in turn reduces overall resource utilization. One efficient way to address the dilemma between the performance and cost efficiency of Big Data applications is through data center computing resource disaggregation. This paper proposes and implements a system that incorporates the Spark Big Data framework with a novel in-memory distributed file system to achieve memory disaggregation for data persistence and shuffling. We address the challenge of optimizing performance at affordable cost by co-designing the proposed in-memory distributed file system with large-volume DIMM-based persistent memory (PMEM) and RDMA technology. The disaggregation design allows each part of the system to be scaled independently, which is particularly suitable for cloud deployments. The proposed system is evaluated in a production-level cluster using real enterprise-level Spark production applications. The results of an empirical evaluation show that the system can achieve up to a 3.5- fold performance improvement for shuffle-intensive applications with the same amount of memory, compared to the default Spark setup. Moreover, by leveraging PMEM, we demonstrate that our system can effectively increase the memory capacity of the computing cluster with affordable cost, with a reasonable execution time overhead with respect to using local DRAM only.
more » « less
Full Text Available
Enabling Data Streaming-based Science Gateways through Federated Cyberinfrastructure

Rodero, I; Qin, Y; Valls, J; Simonet, A; Villalobos, J.J.; Parashar, M; Youn, C; Wang, C; Thareja, K; Ruth, P; et al (January 2019, Gateways 2019)

Large scientific facilities are unique and complex infrastructures that have become fundamental instruments for enabling high quality, world-leading research to tackle scientific problems at unprecedented scales. Cyberinfrastructure (CI) is an essential component of these facilities, providing the user community with access to data, data products, and services with the potential to transform data into knowledge. However, the timely evolution of the CI available at large facilities is challenging and can result in science communities requirements not being fully satisfied. Furthermore, integrating CI across multiple facilities as part of a scientific workflow is hard, resulting in data silos. In this paper, we explore how science gateways can provide improved user experiences and services that may not be offered at large facility datacenters. Using a science gateway supported by the Science Gateway Community Institute, which provides subscription-based delivery of streamed data and data products from the NSF Ocean Observatories Initiative (OOI), we propose a system that enables streaming-based capabilities and workflows using data from large facilities, such as the OOI, in a scalable manner. We leverage data infrastructure building blocks, such as the Virtual Data Collaboratory, which provides data and comput- ing capabilities in the continuum to efficiently and collaboratively integrate multiple data-centric CIs, build data-driven workflows, and connect large facilities data sources with NSF-funded CI, such as XSEDE. We also introduce architectural solutions for running these workflows using dynamically provisioned federated CI.
more » « less
Full Text Available
Runtime Management of Data Quality for Scientific Observatories Using Edge and In-Transit Resources

https://doi.org/10.1109/SBAC-PAD.2018.00053

Zamani, Ali Reza; Balouek-Thomert, Daniel; Villalobos, J. J.; Rodero, Ivan; Parashar, Manish (January 2018, 2018 30th International Symposium on Computer Architecture and High Performance Computing)

Full Text Available

« Prev Next »

Search for: All records