Biology today is heavily data-driven and knowledge-centric, and its data and knowledge are stored across the linked open web in numerous heterogeneous deep web databases. The FAIR data principle has been proposed to improve searching, finding, accessing, and interoperating among these diverse information sources and thereby increase their usability. Unfortunately, FAIR compliance is extremely low, and linked open data do not guarantee FAIRness, leaving biologists on a solo hunt for information on the open network. In this paper, we propose SoDa, a system for intelligent data foraging on the internet. SoDa helps biologists discover resources based on analysis requirements, generate resource access plans, and store cleaned data and knowledge for community use. A secondary search index is also supported so that community members can conveniently find archived information.
Ad Hoc Data Foraging in a Life Sciences Community Ecosystem Using SoDa
Biologists often set out to find relevant data in an ever-changing landscape of interesting databases. While leading journals publish descriptions of databases, these descriptions are rarely current, and the lists are seldom updated to discard defunct or poor-quality databases. These indices usually include only databases whose authors proactively request inclusion. The challenge for individual biologists, then, is to discover, explore, and select databases of interest from a large unorganized collection and use them effectively in their analyses without too large an investment. The advocacy of the FAIR data principle to improve searching, finding, accessing, and interoperating among these diverse information sources in order to increase usability is proving to be a difficult proposition, and consequently a large number of data sources are not FAIR-compliant. Since linked open data do not guarantee FAIRness, biologists are left to search individually for information in open networks. In this paper, we propose SoDa, a system for intelligent data foraging on the internet by biologists. SoDa helps biologists discover resources based on analysis requirements, generate resource access plans, and store cleaned data and knowledge for community use. SoDa includes a natural-language-powered resource discovery tool as well as tools to retrieve data from remote databases, organize and store collected data, query stored data, and seek help from the community when things do not work as anticipated. A secondary search index is also supported so that community members can conveniently find and reuse archived information. The features supported in SoDa endow biologists with data integration capabilities over arbitrary linked open databases and let them construct powerful computational pipelines, capabilities that most contemporary biological workflow systems, such as Taverna or Galaxy, do not support.
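For readers unfamiliar with the foraging workflow the abstract outlines (discover resources from an analysis requirement, generate a resource access plan, archive cleaned data, and index it for community reuse), a minimal sketch is given below. All class and function names are hypothetical and are not SoDa's actual API; simple keyword overlap stands in here for SoDa's natural-language resource discovery.

```python
# Illustrative model of the foraging loop described in the abstract.
# Names and data structures are assumptions, not SoDa's real interface.
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    url: str
    keywords: set[str]


@dataclass
class CommunityArchive:
    records: dict[str, dict] = field(default_factory=dict)
    index: dict[str, set[str]] = field(default_factory=dict)  # secondary search index

    def store(self, key: str, cleaned: dict, tags: set[str]) -> None:
        """Archive a cleaned record and register it under searchable tags."""
        self.records[key] = cleaned
        for tag in tags:
            self.index.setdefault(tag, set()).add(key)

    def search(self, tag: str) -> list[dict]:
        """Let community members retrieve archived records by tag."""
        return [self.records[k] for k in self.index.get(tag, set())]


def discover(requirement: str, catalog: list[Resource]) -> list[Resource]:
    """Rank catalog entries by keyword overlap with the analysis requirement."""
    terms = set(requirement.lower().split())
    return sorted(catalog, key=lambda r: len(terms & r.keywords), reverse=True)


def access_plan(resources: list[Resource]) -> list[str]:
    """A resource access plan, reduced here to an ordered list of fetch steps."""
    return [f"GET {r.url}" for r in resources]


catalog = [
    Resource("UniProt", "https://www.uniprot.org", {"protein", "sequence", "annotation"}),
    Resource("GBIF", "https://www.gbif.org", {"occurrence", "species", "biodiversity"}),
]
archive = CommunityArchive()
hits = discover("protein sequence annotation", catalog)
print(access_plan(hits))
archive.store("example:record-1", {"source": hits[0].name, "payload": "..."}, {"protein"})
print(archive.search("protein"))
```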
- Award ID(s): 2410668
- PAR ID: 10631914
- Publisher / Repository: MDPI
- Date Published:
- Journal Name: Applied Sciences
- Volume: 15
- Issue: 2
- ISSN: 2076-3417
- Page Range / eLocation ID: 621
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
A vast proportion of scientific data remains locked behind dynamic web interfaces, often called the deep web, inaccessible to conventional search engines and standard crawlers. This gap between data availability and machine usability hampers the goals of open science and automation. While registries like FAIRsharing offer structured metadata describing data standards, repositories, and policies aligned with the FAIR (Findable, Accessible, Interoperable, and Reusable) principles, they do not enable seamless, programmatic access to the underlying datasets. We present FAIRFind, a system designed to bridge this accessibility gap. FAIRFind autonomously discovers, interprets, and operationalizes access paths to biological databases on the deep web, regardless of their FAIR compliance. Central to our approach is the Deep Web Communication Protocol (DWCP), a resource description language that represents web forms, HyperText Markup Language (HTML) tables, and file-based data interfaces in a machine-actionable format. Leveraging large language models (LLMs), FAIRFind combines a specialized deep web crawler and web-form comprehension engine to transform passive web metadata into executable workflows. By indexing and embedding these workflows, FAIRFind enables natural language querying over diverse biological data sources and returns structured, source-resolved results. Evaluation across multiple open-source LLMs and database types demonstrates over 90% success in structured data extraction and high semantic retrieval accuracy. FAIRFind advances existing registries by turning linked resources from static references into actionable endpoints, laying a foundation for intelligent, autonomous data discovery across scientific domains.
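The abstract does not spell out the DWCP schema, but the general idea of a machine-actionable resource descriptor that can be operationalized into a concrete request can be sketched as follows. Every field name and the example endpoint below are assumptions for illustration only; the paper defines the actual protocol.

```python
# Hypothetical descriptor in the spirit of DWCP: a web-form interface described
# in a machine-actionable way, turned into an executable HTTP request.
import requests

descriptor = {
    "resource": "ExampleGeneDB",                      # hypothetical deep-web database
    "interface": "web_form",                          # e.g. web_form | html_table | file
    "endpoint": "https://example.org/genes/search",   # placeholder URL
    "method": "GET",
    "parameters": {"gene": {"type": "string", "required": True}},
    "result": {"format": "html_table", "record_selector": "table#results tr"},
}


def operationalize(desc: dict, **values) -> requests.PreparedRequest:
    """Combine a descriptor with user-supplied parameter values into a request."""
    missing = [p for p, spec in desc["parameters"].items()
               if spec.get("required") and p not in values]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    return requests.Request(desc["method"], desc["endpoint"], params=values).prepare()


prepared = operationalize(descriptor, gene="NF-YA1")
print(prepared.method, prepared.url)  # GET https://example.org/genes/search?gene=NF-YA1
```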
Current barriers hindering data-driven discoveries in deep-time Earth (DE) include: substantial volumes of DE data are not digitized; many DE databases do not adhere to FAIR (findable, accessible, interoperable and reusable) principles; we lack a systematic knowledge graph for DE; existing DE databases are geographically heterogeneous; a significant fraction of DE data is not in open-access formats; tailored tools are needed. These challenges motivate the Deep-Time Digital Earth (DDE) program initiated by the International Union of Geological Sciences and developed in cooperation with national geological surveys, professional associations, academic institutions and scientists around the world. DDE's mission is to build on previous research to develop a systematic DE knowledge graph, a FAIR data infrastructure that links existing databases and makes dark data visible, and tailored tools for DE data, which are universally accessible. DDE aims to harmonize DE data, share global geoscience knowledge and facilitate data-driven discovery in the understanding of Earth's evolution.
This beginner’s guide is intended for plant biologists new to network analysis. Here, we introduce key concepts and resources for researchers interested in incorporating network analysis into research, either as a stand-alone component for generating hypotheses or as a framework for examining and visualizing experimental results. Network analysis provides a powerful tool to predict gene functions. Advances in and reduced costs for systems biology techniques, such as genomics, transcriptomics, and proteomics, have generated abundant -omics data for plants; however, the functional annotation of plant genes lags. Therefore, predictions from network analysis can be a starting point to annotate genes and ultimately elucidate genotype-phenotype relationships. In this paper, we introduce networks and compare network-building resources available for plant biologists, including databases and software for network analysis. We then compare four databases available for plant biologists in more detail: AraNet, GeneMANIA, ATTED-II, and STRING. AraNet and GeneMANIA are functional association networks, ATTED-II is a gene coexpression database, and STRING is a protein-protein interaction database. AraNet and ATTED-II are plant-specific databases that can analyze multiple plant species, whereas GeneMANIA builds networks for Arabidopsis thaliana and non-plant species, and STRING for multiple species. Finally, we compare the performance of the four databases in predicting known and probable gene functions of the A. thaliana Nuclear Factor-Y (NF-Y) genes. We conclude that plant biologists have an invaluable resource in these databases and discuss how users can decide which type of database to use depending on their research question.
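As a concrete illustration of querying one of these databases programmatically, the snippet below retrieves an interaction network for a few A. thaliana NF-Y genes from STRING's public REST API. The endpoint, parameter names, and response fields follow STRING's documented API at the time of writing and may change; the gene list is a small illustrative subset of the NF-Y family, not the set analyzed in the paper.

```python
# Query STRING's network endpoint for a few A. thaliana NF-Y genes
# (NCBI taxonomy ID 3702 = Arabidopsis thaliana).
import requests

genes = ["NF-YA1", "NF-YB1", "NF-YC1"]  # illustrative subset of the NF-Y family

resp = requests.get(
    "https://string-db.org/api/json/network",
    params={
        "identifiers": "\r".join(genes),   # STRING expects %0D-separated identifiers
        "species": 3702,
        "caller_identity": "example_script",
    },
    timeout=30,
)
resp.raise_for_status()

# Each edge carries the two interacting proteins and a combined confidence score.
for edge in resp.json():
    print(edge["preferredName_A"], "-", edge["preferredName_B"], "score:", edge["score"])
```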
All life on earth is linked by a shared evolutionary history. Even before Darwin developed the theory of evolution, Linnaeus categorized types of organisms based on their shared traits. We now know these traits are derived from these species’ shared ancestry. This evolutionary history provides a natural framework to harness the enormous quantities of biological data being generated today. The Open Tree of Life project is a collaboration developing tools to curate and share evolutionary estimates (phylogenies) covering the entire tree of life (Hinchliff et al. 2015, McTavish et al. 2017). The tree is viewable at https://tree.opentreeoflife.org, and the data is all freely available online. The taxon identifiers used in the Open Tree unified taxonomy (Rees and Cranston 2017) are mapped to identifiers across biological informatics databases, including the Global Biodiversity Information Facility (GBIF), NCBI, and others. Linking these identifiers allows researchers to easily unify data from across these different resources (Fig. 1). Leveraging a unified evolutionary framework across the diversity of life provides new avenues for integrative, wide-scale research. Downstream tools, such as R packages developed by the R OpenSci foundation (rotl, rgbif) (Michonneau et al. 2016, Chamberlain 2017) and other tools (Revell 2012), make accessing and combining this information straightforward for students as well as researchers (e.g. https://mctavishlab.github.io/BIO144/labs/rotl-rgbif.html).
Figure 1. Example linking phylogenetic relationships accessed from the Open Tree of Life with specimen location data from the Global Biodiversity Information Facility.
For example, a recent publication by Santorelli et al. 2018 linked evolutionary information from Open Tree with species locality data gathered from a local field study as well as GBIF species location records to test a river-barrier hypothesis in the Amazon. By combining these data, the authors were able to test a widely held biogeographic hypothesis across 1952 species in 14 taxonomic groups, and found that a river that had been postulated to drive endemism was in fact not a barrier to gene flow. However, data provenance and taxonomic name reconciliation remain key hurdles to applying data from these large digital biodiversity and evolution community resources to answering biological questions. In the Amazonian river analysis, while the authors leveraged GBIF records as a secondary check on their species records, they relied on an intensive local field study for their major conclusions, and preferred taxon-specific phylogenetic resources over Open Tree where they were available (Santorelli et al. 2018). When Li et al. 2018 assessed large-scale phylogenetic approaches, including Open Tree, for measuring community diversity, they found that synthesis phylogenies were less resolved than purpose-built phylogenies, but also found that these synthetic phylogenies were sufficient for community-level phylogenetic diversity analyses. Nonetheless, data quality concerns have limited the adoption of data from centralized resources for such analyses (McTavish et al. 2017). Taxonomic name recognition and reconciliation across databases also remain a hurdle for large-scale analyses, despite several ongoing efforts to improve taxonomic interoperability and unify taxonomies, such as Catalogue of Life+ (Bánki et al. 2018).
In order to support innovative science, large-scale digital data resources need to facilitate data linkage between resources and address researchers' data quality and provenance concerns. I will present the model that the Open Tree of Life is using to provide evolutionary data at the scale of the entire tree of life, while maintaining traceable provenance to the publications and taxonomies these evolutionary relationships are inferred from. I will discuss the hurdles to adoption of these large-scale resources by researchers, as well as the opportunities for new research avenues provided by the connections between evolutionary inferences and biodiversity digital databases.
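The identifier linkage this abstract describes (Open Tree taxonomy IDs mapped to GBIF, NCBI, and other databases) can be exercised directly against the public web services. The sketch below, in Python rather than the R packages (rotl, rgbif) mentioned above, resolves species names through the Open Tree TNRS service and then counts GBIF occurrence records for the same names. Endpoint paths and response fields follow the public Open Tree v3 and GBIF v1 APIs as documented at the time of writing and may change; the species names are illustrative.

```python
# Resolve names to Open Tree taxonomy (OTT) identifiers, then look up GBIF
# occurrence counts for the same names.
import requests

names = ["Pan troglodytes", "Gorilla gorilla"]  # illustrative species

tnrs = requests.post(
    "https://api.opentreeoflife.org/v3/tnrs/match_names",
    json={"names": names},
    timeout=30,
).json()

for result in tnrs.get("results", []):
    if not result["matches"]:
        continue  # name could not be resolved to an OTT taxon
    taxon = result["matches"][0]["taxon"]
    name = taxon["name"]
    gbif = requests.get(
        "https://api.gbif.org/v1/occurrence/search",
        params={"scientificName": name, "limit": 0},  # limit=0: return only the count
        timeout=30,
    ).json()
    print(f"{name}: OTT id {taxon['ott_id']}, GBIF occurrence records {gbif['count']}")
```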
