Summary Large scientific facilities provide researchers with instrumentation, data, and data products that can accelerate scientific discovery. However, increasing data volumes coupled with limited local computational power prevent researchers from taking full advantage of what these facilities can offer. Many researchers have looked into using commercial and academic cyberinfrastructure (CI) to process these data. Nevertheless, there remains a disconnect between large facilities and CI that requires researchers to take an active part in the data processing cycle. The increasing complexity of CI and data scale necessitates new data delivery models that can autonomously integrate large‐scale scientific facilities and CI to deliver real‐time data and insights. In this paper, we present our initial efforts using the Ocean Observatories Initiative project as a use case. In particular, we present a subscription‐based data streaming service for data delivery that leverages the Apache Kafka data streaming platform. We also show how our solution can automatically integrate large‐scale facilities with CI services for automated data processing.
OpenMSIStream: A Python package for facilitating integration of streaming data in diverse laboratory environments
OpenMSIStream provides seamless connection of scientific data stores with streaming infrastructure to allow researchers to leverage the power of decoupled, real-time data streaming architectures. Data streaming is the process of transmitting, ingesting, and processing data continuously rather than in batches. Access to streaming data has revolutionized many industries in the past decade and created entirely new standards of practice and types of analytics. While not yet commonly used in scientific research, data streaming has the potential to become a key technology to drive rapid advances in scientific data collection (e.g., Brookhaven National Lab (2022)). This paucity of streaming infrastructures linking complex scientific systems is due to a lack of tools that facilitate streaming in the diverse and distributed systems common in modern research. OpenMSIStream closes this gap between underlying streaming systems and common scientific infrastructure. Closing this gap empowers novel streaming applications for scientific data including automation of data curation, reduction, and analysis; real-time experiment monitoring and control; and flexible deployment of AI/ML to guide autonomous research. Streaming data generally refers to data continuously generated from multiple sources and passed in small packets (termed messages). Streaming data messages are typically organized in groups called topics and persist for periods of time conducive to processing for multiple uses either sequentially or in small groups. The resulting flows of raw data, metadata, and processing results form “ecosystems” that automate varied data-driven tasks. A strength of data streaming ecosystems is the use of publish-subscribe (“pub/sub”) messaging backbones that decouple data senders (publishers) and recipients (subscribers). 
Popular message-focused middleware solutions such as RabbitMQ (VMware, 2022), Apache Pulsar (Apache Software Foundation, 2022b), and Apache Kafka (Apache Software Foundation, 2022a) all provide differing capabilities as backbones. OpenMSIStream provides robust and efficient, yet easy, access to the rich data streaming systems of Apache Kafka.
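The topic and publish-subscribe ideas described above can be illustrated with a small in-memory sketch. It does not use Apache Kafka or the OpenMSIStream API: the `InMemoryBroker` class, its `publish` and `poll` methods, the subscriber names, and the topic name are all hypothetical and exist only to show how persisted messages let several subscribers read the same topic independently.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a pub/sub backbone (hypothetical; not Kafka's API)."""

    def __init__(self):
        # Each topic is an append-only list; messages persist after being read.
        self._topics = defaultdict(list)
        # Read position for each (subscriber, topic) pair.
        self._offsets = defaultdict(int)

    def publish(self, topic, message):
        """Append a message to a topic; the publisher knows nothing about readers."""
        self._topics[topic].append(message)

    def poll(self, subscriber, topic, max_messages=10):
        """Return the next unread messages for this subscriber, in order."""
        start = self._offsets[(subscriber, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(subscriber, topic)] = start + len(batch)
        return batch


broker = InMemoryBroker()
for reading in (20.1, 20.4, 20.9):
    broker.publish("furnace_temperature", reading)

# Two subscribers consume the same topic at their own pace.
archived = broker.poll("archiver", "furnace_temperature")
first_for_monitor = broker.poll("monitor", "furnace_temperature", max_messages=2)
rest_for_monitor = broker.poll("monitor", "furnace_temperature", max_messages=2)
```

Because each subscriber keeps its own read position, adding a new consumer requires no change to the publisher, which is the decoupling the abstract refers to. A real deployment would delegate storage, ordering, and delivery to the broker software rather than to a Python list.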
- PAR ID: 10481032
- Publisher / Repository: Journal of Open Source Software
- Date Published:
- Journal Name: Journal of Open Source Software
- Volume: 8
- Issue: 83
- ISSN: 2475-9066
- Page Range / eLocation ID: 4896
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- IoT devices influence many different spheres of society and are predicted to have a huge impact on our future. Extracting real-time insights from diverse sensor data and dealing with the underlying uncertainty of sensor data are two main challenges of the IoT ecosystem. In this paper, we propose a data processing architecture, M-DB, to effectively integrate and continuously monitor uncertain and diverse IoT data. M-DB consists of three components: (1) model-based operators (MBO) as data management abstractions for IoT application developers to integrate data from diverse sensors. Model-based operators can support event-detection and statistical aggregation operators, (2) M-Stream, a dataflow pipeline that combines model-based operators to perform computations reflecting the uncertainty of underlying data, and (3) M-Store, a storage layer separating the computation of application logic from physical sensor data management, to effectively deal with missing or delayed sensor data. M-DB is designed and implemented over Apache Storm and Apache Kafka, two open-source distributed event processing systems. The application examples throughout the paper and our evaluation results show that M-DB provides a real-time data-processing architecture that can cater to the diverse needs of IoT applications.
- As data analytics applications become increasingly important in a wide range of domains, the ability to develop large-scale and sustainable platforms and software infrastructure to support these applications has significant potential to drive research and innovation in both science and business domains. This paper characterizes performance and power-related behavior trends and tradeoffs of the two predominant frameworks for Big Data analytics (i.e., Apache Hadoop and Spark) for a range of representative applications. It also evaluates system design knobs, such as storage and network technologies and power capping techniques. Experimental results from empirical executions provide meaningful data points for exploring the potential of software-defined infrastructure for Big Data processing systems through simulation. The results provide better understanding of the design space to build multi-criteria application-centric models as well as show significant advantages of software-defined infrastructure in terms of execution time, energy and cost. It motivates further research focused on in-memory processing formulations regarding systems with deeper memory hierarchies and software-defined infrastructure.
- Background The digitization of biological specimens has revolutionized morphology, generating massive 3D datasets such as microCT scans. While open-source platforms like 3D Slicer and SlicerMorph have democratized access to advanced visualization and analysis software, a significant “compute gap” persists. Processing high-resolution 3D data requires high-end GPUs and substantial RAM, resources that are frequently unavailable at Primarily Undergraduate Institutions (PUIs) and other educational settings. This “digital divide” prevents many researchers and students from utilizing the very data and software that have been made open to them. Methods We present MorphoCloud, a platform designed to bridge this hardware barrier by providing on-demand, research-grade computing environments via a web browser. MorphoCloud utilizes an “IssuesOps” architecture, where users manage their remote workstations entirely through GitHub Issues using natural-language commands (e.g., /create, /unshelve). The technology stack leverages GitHub Issues and Actions for front-end and orchestration respectively, JetStream2 for backend compute, and Apache Guacamole to deliver a high-performance, GPU-accelerated desktop experience to any modern browser. Results The platform enables a streamlined lifecycle for remote instances, which come pre-configured with the SlicerMorph ecosystem, R/RStudio, and AI-assisted segmentation tools like NNInteractive and MEMOs. Users have access to a persistent storage volume that is decoupled from the instance. For educational purposes, MorphoCloud supports “Workshop” instances that allow for bulk provisioning and stay online continuously for short-term events. This identical environment ensures that instructors can conduct complex 3D workflows without the typical troubleshooting delays caused by heterogeneous student hardware. Conclusion MorphoCloud demonstrates that true scientific accessibility requires not just open data and software, but also open infrastructure. By abstracting the complexities of cloud administration into a simple, command-driven interface, MorphoCloud empowers researchers at under-resourced institutions to engage in high-performance morphological analysis and AI-assisted segmentation.
- Big data systems have evolved beyond scalable storage and rudimentary processing to supporting complex data analytics in near real-time, with examples such as Apache Spark Streaming [31], Comet [14], Incremental Hadoop [17], MapReduce Online [7], Apache Storm [28], StreamScope [19], and IBM Streams [1]. These systems are particularly challenging to build owing to two requirements: low latency and fault tolerance. Many of the above systems evolved from a batch processing design and are thus architected to break down a steady stream of input events into a series of micro-batches and then perform batch-like computations on each successive micro-batch as a micro-batch job. In terms of latency, the systems are expected to respond to each micro-batch in seconds with an output. The constant operation further entails that the systems must be robust to hardware, software and network-level failures. To incorporate fault-tolerance, the common approach is to use checkpointing and rollback recovery, whereby a streaming application periodically saves its in-memory state to persistent storage.
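The micro-batching and checkpoint-with-rollback approach described in the last abstract above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the design of any of the systems named there: the function names, the JSON checkpoint file, the running-sum "computation", and the simulated crash are all hypothetical, and the sketch checkpoints after every micro-batch rather than on a timer.

```python
import json
import os
import tempfile


def micro_batches(events, batch_size):
    """Split a list of events into consecutive micro-batches."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]


def save_checkpoint(path, state):
    """Persist state by writing a temp file, then renaming it over the old one."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_offset": 0, "running_sum": 0}


def run(events, path, batch_size=3, fail_at_offset=None):
    """Process events in micro-batches, checkpointing after each batch."""
    state = load_checkpoint(path)  # rollback recovery: resume from saved state
    remaining = events[state["next_offset"]:]
    for batch in micro_batches(remaining, batch_size):
        if fail_at_offset is not None and state["next_offset"] >= fail_at_offset:
            raise RuntimeError("simulated crash")
        state["running_sum"] += sum(batch)
        state["next_offset"] += len(batch)
        save_checkpoint(path, state)
    return state


events = list(range(1, 11))
with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "state.json")
    try:
        run(events, checkpoint, fail_at_offset=6)
    except RuntimeError:
        pass  # in-memory state is lost; only the checkpoint survives
    recovered = load_checkpoint(checkpoint)
    final = run(events, checkpoint)  # restart resumes at the saved offset
```

After the simulated crash, the restarted run reads the saved offset and running sum, so already-processed micro-batches are neither lost nor counted twice. Checkpointing less often would lower the write overhead at the cost of reprocessing more events after a failure.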

