Biologists often set out to find relevant data in an ever-changing landscape of interesting databases. While leading journals publish descriptions of databases, they are usually not recent and do not frequently update the list that discards defunct or poor-quality databases. These indices usually include databases that are proactively requested to be included by their authors. The challenge for individual biologists, then, is to discover, explore, and select databases of interest from a large unorganized collection and effectively use them in their analysis without too large of an investment. The advocation of the FAIR data principle to improve searching, finding, accessing, and inter-operating among these diverse information sources in order to increase usability is proving to be a difficult proposition and consequently, a large number of data sources are not FAIR-compliant. Since linked open data do not guarantee FAIRness, biologists are now left to individually search for information in open networks. In this paper, we propose SoDa, for intelligent data foraging on the internet by biologists. SoDa helps biologists to discover resources based on analysis requirements and generate resource access plans, as well as storing cleaned data and knowledge for community use. SoDa includes a natural language-powered resource discovery tool, a tool to retrieve data from remote databases, organize and store collected data, query stored data, and seek help from the community when things do not work as anticipated. A secondary search index is also supported for community members to find archived information in a convenient way to enable its reuse. The features supported in SoDa endows biologists with data integration capabilities over arbitrary linked open databases and construct powerful computational pipelines using them, capabilities that are not supported in most contemporary biological workflow systems, such as Taverna or Galaxy. 
                        more » 
                        « less   
                    This content will become publicly available on December 4, 2025
                            
                            Supporting Data Foragers in Scientific Computing Community Ecosystems for Life Sciences
                        
                    
    
            Biology today is heavily data-driven and knowledge-centric that are stored across the linked open web in numerous heterogeneous deep web databases. To improve searching, finding, accessing, and inter-operating among these diverse information sources to increase usability, the FAIR data principle has been proposed. Unfortunately, FAIR compliance is extremely low and linked open data does not guarantee FAIRness, leaving biologists on a solo hunt for information on the open network. In this paper, we propose {\em SoDa}, for intelligent data foraging on the internet. SoDa helps biologists discover resources based on analysis requirements, generate resource access plans, and store cleaned data and knowledge for community use. A secondary search index is also supported for community members to find archived information conveniently. 
        more » 
        « less   
        
    
                            - Award ID(s):
- 2410668
- PAR ID:
- 10631966
- Publisher / Repository:
- Springer Nature Switzerland
- Date Published:
- Page Range / eLocation ID:
- 118 to 123
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            null (Ed.)Access and reuse of authoritative phylogenetic knowledge have been a longstanding challenges in the evolutionary biology community — leading to a number of research efforts (e.g. focused on interoperation, standardization of formats, and development of minimum reporting requirements). The Phylotastic project was launched to provide an answer to such challenges — as an architectural concept collaboratively designed by evolutionary biologists and computer scientists. This paper describes the first comprehensive implementation of the Phylotastic architecture, based on an open platform for Web services composition. The implementation provides a portal, which composes Web services along a fixed collection of workflows, as well as an interface to allow users to develop novel workflows. The Web services composition is guided by automated planning algorithms and built on a Web services registry and an execution monitoring engine. The platform provides resilience through seamless automated recovery from failed services.more » « less
- 
            A vast proportion of scientific data remains locked behind dynamic web interfaces, often called the deep web—inaccessible to conventional search engines and standard crawlers. This gap between data availability and machine usability hampers the goals of open science and automation. While registries like FAIRsharing offer structured metadata describing data standards, repositories, and policies aligned with the FAIR (Findable, Accessible, Interoperable, and Reusable) principles, they do not enable seamless, programmatic access to the underlying datasets. We present FAIRFind, a system designed to bridge this accessibility gap. FAIRFind autonomously discovers, interprets, and operationalizes access paths to biological databases on the deep web, regardless of their FAIR compliance. Central to our approach is the Deep Web Communication Protocol (DWCP), a resource description language that represents web forms, HyperText Markup Language (HTML) tables, and file-based data interfaces in a machine-actionable format. Leveraging large language models (LLMs), FAIRFind combines a specialized deep web crawler and web-form comprehension engine to transform passive web metadata into executable workflows. By indexing and embedding these workflows, FAIRFind enables natural language querying over diverse biological data sources and returns structured, source-resolved results. Evaluation across multiple open-source LLMs and database types demonstrates over 90% success in structured data extraction and high semantic retrieval accuracy. FAIRFind advances existing registries by turning linked resources from static references into actionable endpoints, laying a foundation for intelligent, autonomous data discovery across scientific domains.more » « less
- 
            Abstract Graph databases capture richly linked domain knowledge by integrating heterogeneous data and metadata into a unified representation. Here, we present the use of bespoke, interactive data graphics (bar charts, scatter plots, etc.) for visual exploration of a knowledge graph. By modeling a chart as a set of metadata that describes semantic context (SPARQL query) separately from visual context (Vega-Lite specification), we leverage the high-level, declarative nature of the SPARQL and Vega-Lite grammars to concisely specify web-based, interactive data graphics synchronized to a knowledge graph. Resources with dereferenceable URIs (uniform resource identifiers) can employ the hyperlink encoding channel or image marks in Vega-Lite to amplify the information content of a given data graphic, and published charts populate a browsable gallery of the database. We discuss design considerations that arise in relation to portability, persistence, and performance. Altogether, this pairing of SPARQL and Vega-Lite—demonstrated here in the domain of polymer nanocomposite materials science—offers an extensible approach to FAIR (findable, accessible, interoperable, reusable) scientific data visualization within a knowledge graph framework.more » « less
- 
            Making the most of biodiversity data requires linking observations of biological species from multiple sources both efficiently and accurately (Bisby 2000, Franz et al. 2016). Aggregating occurrence records using taxonomic names and synonyms is computationally efficient but known to experience significant limitations on accuracy when the assumption of one-to-one relationships between names and biological entities breaks down (Remsen 2016, Franz and Sterner 2018). Taxonomic treatments and checklists provide authoritative information about the correct usage of names for species, including operational representations of the meanings of those names in the form of range maps, reference genetic sequences, or diagnostic traits. They increasingly provide taxonomic intelligence in the form of precise description of the semantic relationships between different published names in the literature. Making this authoritative information Findable, Accessible, Interoperable, and Reusable (FAIR; Wilkinson et al. 2016) would be a transformative advance for biodiversity data sharing and help drive adoption and novel extensions of existing standards such as the Taxonomic Concept Schema and the OpenBiodiv Ontology (Kennedy et al. 2006, Senderov et al. 2018). We call for the greater, global Biodiversity Information Standards (TDWG) and taxonomy community to commit to extending and expanding on how FAIR applies to biodiversity data and include practical targets and criteria for the publication and digitization of taxonomic concept representations and alignments in taxonomic treatments, checklists, and backbones. As a motivating case, consider the abundantly sampled North American deer mouse— Peromyscus maniculatus (Wagner 1845)—which was recently split from one continental species into five more narrowly defined forms, so that the name P. maniculatus is now only applied east of the Mississippi River (Bradley et al. 2019, Greenbaum et al. 2019). That single change instantly rendered ambiguous ~7% of North American mammal records in the Global Biodiversity Information Facility (n=242,663, downloaded 2021-06-04; GBIF.org 2021) and ⅓ of all National Ecological Observatory Network (NEON) small mammal samples (n=10,256, downloaded 2021-06-27). While this type of ambiguity is common in name-based databases when species are split, the example of P. maniculatus is particularly striking for its impact upon biological questions ranging from hantavirus surveillance in North America to studies of climate change impacts upon rodent life-history traits. Of special relevance to NEON sampling is recent evidence suggesting deer mice potentially transmit SARS-CoV-2 (Griffin et al. 2021). Automating the updating of occurrence records in such cases and others will require operational representations of taxonomic concepts—e.g., range maps, reference sequences, and diagnostic traits—that are FAIR in addition to taxonomic concept alignment information (Franz and Peet 2009). Despite steady progress, it remains difficult to find, access, and reuse authoritative information about how to apply taxonomic names even when it is already digitized. It can also be difficult to tell without manual inspection whether similar types of concept representations derived from multiple sources, such as range maps or reference sequences selected from different research articles or checklists, are in fact interoperable for a particular application. The issue is therefore different from important ongoing efforts to digitize trait information in species circumscriptions, for example, and focuses on how already digitized knowledge can best be packaged to inform human experts and artifical intelligence applications (Sterner and Franz 2017). We therefore propose developing community guidelines and criteria for FAIR taxonomic concept representations as "semantic artefacts" of general relevance to linked open data and life sciences research (Le Franc et al. 2020).more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
