There is great value embedded in reusing scientific data for secondary discoveries. However, it is challenging to find and reuse the large amount of existing scientific data distributed across the web and data repositories. Some of the challenges reside in the volume and complexity of scientific data; others pertain to the current practices and workflows of research data management. AIDR 2019 (Artificial Intelligence for Data Discovery and Reuse) is a new conference that brings together researchers across a broad range of disciplines, computer scientists, tool developers, data providers, and data curators to share innovative solutions that apply artificial intelligence to scientific data discovery and reuse, and to discuss how various stakeholders can work together to create a healthy data ecosystem. This editorial summarizes the main themes and takeaways from the inaugural AIDR '19 conference.
Arctic Ice
Integrating field data, remote satellite imagery, scientific analysis, and multimedia visual representation to document Arctic ice that is disappearing due to climate change, this artwork is the outcome of a four-year collaboration involving art, design, and polar science between artist Cy Keener, landscape researcher Justine Holzman, climatologist Ignatius Rigor, and scientist John Woods. With this work, Keener and Holzman’s goal is to make scientific data tangible, visceral, and experiential. They ask how artistic and creative practices can contribute to scientific endeavors while making scientific research visible to the public.
- Award ID(s): 1951762
- PAR ID: 10415781
- Date Published:
- Journal Name: Issues in Science and Technology
- Volume: 39
- Issue: 1
- ISSN: 0748-5492
- Page Range / eLocation ID: 48-53
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Holme, Thomas (Ed.) Reading and understanding scientific literature is an essential skill for any scientist to learn. While students’ scientific literacy can be improved by reading research articles, an article’s technical language and structure can hinder students’ understanding of the scientific material. Furthermore, many students struggle with interpreting graphs and other models of data commonly found in scientific literature. To introduce students to scientific literature and promote improved understanding of data and graphs, we developed a guided-inquiry activity adapted from a research article on snow chemistry and implemented it in a general chemistry laboratory course. Here, we describe how we adapted figures from the primary literature source and developed questions to scaffold the guided-inquiry activity. Results from semi-structured qualitative interviews suggest that students learn about snow chemistry processes and engage in scientific practices, including data analysis and interpretation, through this activity. This activity is applicable in other introductory science courses as educators can adapt most scientific articles into a guided-inquiry activity.
- Modern scientific workflows couple simulations with AI-powered analytics by frequently exchanging data to accelerate time-to-science and reduce the complexity of the simulation planes. However, this data exchange is limited in performance and portability due to a lack of support for scientific data formats in AI frameworks. We need a cohesive mechanism to effectively integrate, at scale, complex scientific data formats such as HDF5, PnetCDF, ADIOS2, GNCF, and Silo into popular AI frameworks such as TensorFlow, PyTorch, and Caffe. To this end, we designed Stimulus, a data management library for ingesting scientific data effectively into the popular AI frameworks. We utilize the StimOps functions along with the StimPack abstraction to enable the integration of scientific data formats with any AI framework. The evaluations show that Stimulus outperforms several large-scale applications with different use cases, such as Cosmic Tagger (consuming an HDF5 dataset in PyTorch), Distributed FFN (consuming an HDF5 dataset in TensorFlow), and CosmoFlow (converting HDF5 into TFRecord and then consuming that in TensorFlow), by 5.3×, 2.9×, and 1.9× respectively, with ideal I/O scalability up to 768 GPUs on the Summit supercomputer. Through Stimulus, we can portably extend existing popular AI frameworks to cohesively support any complex scientific data format and efficiently scale the applications on large-scale supercomputers.
- OpenMSIStream provides seamless connection of scientific data stores with streaming infrastructure to allow researchers to leverage the power of decoupled, real-time data streaming architectures. Data streaming is the process of transmitting, ingesting, and processing data continuously rather than in batches. Access to streaming data has revolutionized many industries in the past decade and created entirely new standards of practice and types of analytics. While not yet commonly used in scientific research, data streaming has the potential to become a key technology to drive rapid advances in scientific data collection (e.g., Brookhaven National Lab (2022)). This paucity of streaming infrastructures linking complex scientific systems is due to a lack of tools that facilitate streaming in the diverse and distributed systems common in modern research. OpenMSIStream closes this gap between underlying streaming systems and common scientific infrastructure. Closing this gap empowers novel streaming applications for scientific data including automation of data curation, reduction, and analysis; real-time experiment monitoring and control; and flexible deployment of AI/ML to guide autonomous research. Streaming data generally refers to data continuously generated from multiple sources and passed in small packets (termed messages). Streaming data messages are typically organized in groups called topics and persist for periods of time conducive to processing for multiple uses either sequentially or in small groups. The resulting flows of raw data, metadata, and processing results form “ecosystems” that automate varied data-driven tasks. A strength of data streaming ecosystems is the use of publish-subscribe (“pub/sub”) messaging backbones that decouple data senders (publishers) and recipients (subscribers). Popular message-focused middleware solutions such as RabbitMQ (VMware, 2022), Apache Pulsar (Apache Software Foundation, 2022b), and Apache Kafka (Apache Software Foundation, 2022a) all provide differing capabilities as backbones. OpenMSIStream provides robust and efficient, yet easy, access to the rich data streaming systems of Apache Kafka.
- Modern science’s ability to produce, store, and analyze big datasets is changing the way that scientific research is practiced. Philosophers have only begun to comprehend the changed nature of scientific reasoning in this age of “big data.” We analyze data-focused practices in biology and climate modeling, identifying distinct species of data-centric science: phenomena-laden in biology and phenomena-agnostic in climate modeling, each better suited for its own domain of application, though each entails trade-offs. We argue that data-centric practices in science are not monolithic because the opportunities and challenges presented by big data vary across scientific domains.