Scientific workflow management systems (WfMS) provide a systematic way to streamline the processes of scientific research. The demand for FAIR (Findable, Accessible, Interoperable, and Reusable) workflows is increasing in the scientific community, particularly in GIScience, where data is not just an output but an integral part of iterative analytical processes. Traditional WfMS often cannot ensure the transparency of geospatial data and processes, leading to challenges in the reproducibility and replicability of research findings. This paper proposes the conceptualization and development of a FAIR-oriented GIScience WfMS, aiming to incorporate the FAIR principles into the entire lifecycle of geospatial data processing and analysis. To enhance the findability and accessibility of workflows, the WfMS uses Harvard Dataverse to share all workflow-related digital resources, organized into workflow datasets, nodes, and case studies; each resource is assigned a unique DOI (Digital Object Identifier), ensuring easy access and discovery. More importantly, the WfMS complies with the Common Workflow Language (CWL) standard to guarantee the interoperability and reproducibility of workflows, and it enables the integration of diverse tools and software, supporting complex analyses that require multiple processing steps. This paper demonstrates a prototype of the GIScience WfMS and illustrates two geospatial science case studies, reflecting its flexibility in selecting appropriate techniques for various datasets and research goals. The user-friendly workflow designer makes it accessible to users with different levels of technical expertise, promoting reusable, reproducible, and replicable GIScience studies.
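Because this abstract leans on CWL for interoperability, a brief hedged illustration may help: a CWL workflow is a declarative description that any compliant runner can execute. The sketch below is not drawn from the paper's Dataverse material; it simply invokes the reference runner `cwltool` from Python, and the file names `workflow.cwl` and `job.yml` are hypothetical placeholders.

```python
# Minimal sketch of running a CWL-described workflow with the reference
# runner, cwltool (https://github.com/common-workflow-language/cwltool).
# "workflow.cwl" and "job.yml" are placeholder names, not files from the
# paper's Dataverse collections.
import subprocess

result = subprocess.run(
    ["cwltool", "workflow.cwl", "job.yml"],  # runner, workflow description, input bindings
    capture_output=True,
    text=True,
    check=True,  # raise if the workflow fails, so errors are not silently ignored
)
print(result.stdout)  # cwltool prints a JSON object describing the outputs
```

Because the workflow description is runner-agnostic, the same `.cwl` file retrieved via its DOI can be re-executed by any CWL-compliant engine, which is what makes the standard attractive for reproducibility.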
Geoweaver: Advanced Cyberinfrastructure for Managing Hybrid Geoscientific AI Workflows
AI (artificial intelligence)-based analysis of geospatial data has gained considerable attention. Geospatial datasets are multi-dimensional; have spatiotemporal context; exist in disparate formats; and require sophisticated AI workflows that include not only AI algorithm training and testing but also data preprocessing and result post-processing. This complexity poses a huge challenge for full-stack AI workflow management, as researchers often use an assortment of time-intensive manual operations to manage their projects. However, none of the existing workflow management software provides a satisfying solution for hybrid resources, full file access, data flow, code control, and provenance. This paper introduces a new system named Geoweaver to improve the efficiency of full-stack AI workflow management. It supports linking all the preprocessing, AI training and testing, and post-processing steps into a single automated workflow. To demonstrate its utility, we present a use case in which Geoweaver manages end-to-end deep learning for in-time crop mapping using Landsat data. We show how Geoweaver effectively removes the tedium of managing various scripts, code, libraries, Jupyter Notebooks, datasets, servers, and platforms, greatly reducing the time, cost, and effort researchers must spend on such AI-based workflows. The concepts demonstrated through Geoweaver serve as an important building block in the future of cyberinfrastructure for AI research.
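For readers unfamiliar with what "linking preprocessing, AI training and testing, and post-processing into a single automated workflow" means in practice, here is a minimal generic sketch. It is not Geoweaver's API (Geoweaver is operated through its web interface); every script name below is hypothetical.

```python
# Generic sketch (not Geoweaver's API): a full-stack AI workflow expressed as
# an ordered list of steps, each of which could live on a different host.
# All step names and commands below are hypothetical.
from subprocess import run

steps = [
    ("preprocess",  ["python", "preprocess_landsat.py"]),   # e.g. tiling, cloud masking
    ("train",       ["python", "train_crop_model.py"]),     # AI model training and testing
    ("postprocess", ["python", "mosaic_predictions.py"]),   # assemble the final crop map
]

for name, cmd in steps:
    print(f"running step: {name}")
    run(cmd, check=True)  # fail fast, so a recorded run reflects a clean ordered history
```

The point of a system like Geoweaver is to replace this kind of ad hoc glue with managed execution, so that the code, servers, and run history of each step are tracked rather than scattered across terminals and notebooks.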
- PAR ID: 10193367
- Date Published:
- Journal Name: ISPRS International Journal of Geo-Information
- Volume: 9
- Issue: 2
- ISSN: 2220-9964
- Page Range / eLocation ID: 119
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
            Abstract. Processing Earth observation data modelled in a time-series of raster format is critical to solving some of the most complex problems in geospatial science ranging from climate change to public health. Researchers are increasingly working with these large raster datasets that are often terabytes in size. At this scale, traditional GIS methods may fail to handle the processing, and new approaches are needed to analyse these datasets. The objective of this work is to develop methods to interactively analyse big raster datasets with the goal of most efficiently extracting vector data over specific time periods from any set of raster data. In this paper, we describe RINX (Raster INformation eXtraction) which is an end-to-end solution for automatic extraction of information from large raster datasets. RINX heavily utilises open source geospatial techniques for information extraction. It also complements traditional approaches with state-of-the- art high-performance computing techniques. This paper discusses details of achieving big temporal data extraction with RINX, implemented on the use case of air quality and climate data extraction for long term health studies, which includes methods used, code developed, processing time statistics, project conclusions, and next steps.more » « less
            Computational science today depends on complex, data-intensive applications operating on datasets from a variety of scientific instruments. A major challenge is the integration of data into the scientist's workflow. Recent advances in dynamic, networked cloud resources provide the building blocks to construct reconfigurable, end-to-end infrastructure that can increase scientific productivity. However, applications have not adequately taken advantage of these advanced capabilities. In this work, we have developed a novel network-centric platform that enables high-performance, adaptive data flows and coordinated access to distributed cloud resources and data repositories for atmospheric scientists. We demonstrate the effectiveness of our approach by evaluating time-critical, adaptive weather sensing workflows, which utilize advanced networked infrastructure to ingest live weather data from radars and compute data products used for timely response to weather events. The workflows are orchestrated by the Pegasus workflow management system and were chosen because of their diverse resource requirements. We show that our approach results in timely processing of Nowcast workflows under different infrastructure configurations and network conditions. We also show how workflow task clustering choices affect throughput of an ensemble of Nowcast workflows with improved turnaround times. Additionally, we find that using our network-centric platform powered by advanced layer2 networking techniques results in faster, more reliable data throughput, makes cloud resources easier to provision, and the workflows easier to configure for operational use and automation.more » « less
            Abstract Despite the proliferation of computer‐based research on hydrology and water resources, such research is typically poorly reproducible. Published studies have low reproducibility due to incomplete availability of data and computer code, and a lack of documentation of workflow processes. This leads to a lack of transparency and efficiency because existing code can neither be quality controlled nor reused. Given the commonalities between existing process‐based hydrologic models in terms of their required input data and preprocessing steps, open sharing of code can lead to large efficiency gains for the modeling community. Here, we present a model configuration workflow that provides full reproducibility of the resulting model instantiations in a way that separates the model‐agnostic preprocessing of specific data sets from the model‐specific requirements that models impose on their input files. We use this workflow to create large‐domain (global and continental) and local configurations of the Structure for Unifying Multiple Modeling Alternatives (SUMMA) hydrologic model connected to the mizuRoute routing model. These examples show how a relatively complex model setup over a large domain can be organized in a reproducible and structured way that has the potential to accelerate advances in hydrologic modeling for the community as a whole. We provide a tentative blueprint of how community modeling initiatives can be built on top of workflows such as this. We term our workflow the “Community Workflows to Advance Reproducibility in Hydrologic Modeling” (CWARHM; pronounced “swarm”).more » « less
            Constructing and executing reproducible workflows is fundamental to performing research in a variety of scientific domains. Many of the current commercial and open source solutions for workflow en- gineering impose constraints—either technical or budgetary—upon researchers, requiring them to use their limited funding on expensive cloud platforms or spend valuable time acquiring knowledge of software systems and processes outside of their domain expertise. Even though many commercial solutions offer free-tier services, they often do not meet the resource and architectural requirements (memory, data storage, compute time, networking, etc) for researchers to run their workflows effectively at scale. Tapis Workflows abstracts away the complexities of workflow creation and execution behind a web-based API with a simplified workflow model comprised of only pipelines and tasks. This paper will de- tail how Tapis Workflows approaches workflow management by exploring its domain model, the technologies used, application architecture, design patterns, how organizations are leveraging Tapis Workflows to solve unique problems in their scientific workflows, and this projects’s vision for a simple, open source, extensible, and easily deployable workflow engine.more » « less