Title: The Deep-Time Digital Earth program: data-driven discovery in geosciences
Abstract

Current barriers hindering data-driven discoveries in deep-time Earth (DE) include: substantial volumes of DE data are not digitized; many DE databases do not adhere to FAIR (findable, accessible, interoperable and reusable) principles; we lack a systematic knowledge graph for DE; existing DE databases are geographically heterogeneous; a significant fraction of DE data is not in open-access formats; and tailored tools for DE data are needed. These challenges motivate the Deep-Time Digital Earth (DDE) program, initiated by the International Union of Geological Sciences and developed in cooperation with national geological surveys, professional associations, academic institutions and scientists around the world. DDE’s mission is to build on previous research to develop a systematic DE knowledge graph, a FAIR data infrastructure that links existing databases and makes dark data visible, and universally accessible tools tailored to DE data. DDE aims to harmonize DE data, share global geoscience knowledge and facilitate data-driven discovery in the understanding of Earth's evolution.
Award ID(s):
1835717
NSF-PAR ID:
10299877
Author(s) / Creator(s):
Date Published:
Journal Name:
National Science Review
Volume:
8
Issue:
9
ISSN:
2095-5138
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Data‐driven discovery in geoscience requires an enormous amount of FAIR (findable, accessible, interoperable and reusable) data derived from a multitude of sources. Many geology resources include data based on the geologic time scale, a system of dating that relates layers of rock (strata) to times in Earth history. The terminology of this geologic time scale, including the names of the strata and time intervals, is heterogeneous across data resources, hindering effective and efficient data integration. To address that issue, we created a deep‐time knowledge base that consists of knowledge graphs correlating international and regional geologic time scales, an online service of the knowledge graphs, and an R package to access the service. The knowledge base uses temporal topology to enable comparison and reasoning between various intervals and points in the geologic time scale. This work unifies and allows the querying of age‐related geologic information across the entirety of Earth history, resulting in a platform from which researchers can address complex deep‐time questions spanning numerous types of data and fields of study.
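The temporal-topology idea above can be sketched in a few lines: if each named interval of the geologic time scale carries its boundary ages, containment and overlap between intervals become simple comparisons, which is what enables reasoning across correlated time scales. This is an illustrative sketch, not the knowledge base's actual API; the boundary ages are taken from the international chronostratigraphic chart, but the class and method names are invented here.

```python
# Minimal sketch: geologic time intervals as age ranges (Ma = millions of
# years ago) with temporal-topology predicates. Illustrative only -- the
# real knowledge base models these relations as a knowledge graph.
from dataclasses import dataclass

@dataclass(frozen=True)
class GeoInterval:
    name: str
    start_ma: float  # older boundary, in Ma
    end_ma: float    # younger boundary, in Ma

    def contains(self, other: "GeoInterval") -> bool:
        # True if `other` lies entirely within this interval.
        return self.start_ma >= other.start_ma and self.end_ma <= other.end_ma

    def overlaps(self, other: "GeoInterval") -> bool:
        # True if the two intervals share a span of time.
        return self.start_ma > other.end_ma and other.start_ma > self.end_ma

jurassic = GeoInterval("Jurassic", 201.4, 145.0)
toarcian = GeoInterval("Toarcian", 184.2, 174.7)

print(jurassic.contains(toarcian))  # True: the Toarcian is a Jurassic stage
```

A regional stage correlated to an international one can then be compared to any other interval through these predicates, which is the kind of cross-scale query the R package exposes as a service.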

     
It takes great effort to manually or semi-automatically convert free-text phenotype narratives (e.g., morphological descriptions in taxonomic works) to a computable format before they can be used in large-scale analyses. We argue that neither a manual curation approach nor an information extraction approach based on machine learning is a sustainable solution to produce computable phenotypic data that are FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016). This is because these approaches do not scale to all biodiversity, and they do not stop the publication of free-text phenotypes that would need post-publication curation. In addition, both manual and machine learning approaches face great challenges: the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other) in manual curation, and keyword-to-ontology-concept translation in automated information extraction, make it difficult for either approach to produce data that are truly FAIR. Our empirical studies show that inter-curator variation in translating phenotype characters to Entity-Quality statements (Mabee et al. 2007) is as high as 40% even within a single project. With this level of variation, curated data integrated from multiple curation projects may still not be FAIR. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardized vocabularies (ontologies). We argue that the authors describing characters are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of the descriptions from the moment of publication.
In this presentation, we will introduce the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists, which consists of three components: a web-based, ontology-aware software application called 'Character Recorder,' which features a spreadsheet as the data entry platform and provides authors with the flexibility of using their preferred terminology in recording characters for a set of specimens (this application also facilitates semantic clarity and consistency across species descriptions); a set of services that produces RDF graph data, collects terms added by authors, detects potential conflicts between terms, dispatches conflicts to the third component and updates the ontology with resolutions; and an Android mobile application, 'Conflict Resolver,' which displays ontological conflicts and accepts solutions proposed by multiple experts. Fig. 1 shows the system diagram of the platform.
The presentation will consist of: a report on the findings from a recent survey of 90+ participants on the need for a tool like Character Recorder; a methods section that describes how we provide semantics to an existing vocabulary of quantitative characters through a set of properties that explain where and how a measurement (e.g., length of perigynium beak) is taken, and how a custom color palette of RGB values obtained from real specimens or high-quality specimen images can be used to help authors choose standardized color descriptions for plant specimens; and a software demonstration, where we show how Character Recorder and Conflict Resolver can work together to construct both human-readable descriptions and RDF graphs using morphological data derived from species in the plant genus Carex (sedges).
The key difference of this system from other ontology-aware systems is that authors can directly add needed terms to the ontology as they wish and can update their data according to ontology updates. The software modules currently incorporated in Character Recorder and Conflict Resolver have undergone formal usability studies. We are actively recruiting Carex experts to participate in a 3-day usability study of the entire system of the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists. Participants will use the platform to record 100 characters about one Carex species. In addition to usability data, we will collect the terms that participants submit to the underlying ontology and the data related to conflict resolution. Such data allow us to examine the types and the quantities of logical conflicts that may result from the terms added by the users and to use Discrete Event Simulation models to understand if and how term additions and conflict resolutions converge. We look forward to a discussion on how the tools (Character Recorder is online at http://shark.sbs.arizona.edu/chrecorder/public) described in our presentation can contribute to producing and publishing FAIR data in taxonomic studies. 
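The inter-curator variation discussed above can be made concrete with a small sketch: if each curator renders the same phenotype characters as Entity-Quality (EQ) statements, the simplest variation metric is the fraction of characters on which two curators disagree. This is a hedged illustration with made-up data; the ontology IDs are of the form used by real ontologies (UBERON, PO, PATO) but the specific pairings are invented, and real studies would use ontology-aware similarity rather than exact matching.

```python
# Sketch: exact-match disagreement rate between two curators' EQ statements.
# Data are invented for illustration; a 2/5 mismatch reproduces the ~40%
# variation level reported in the text.

def eq_disagreement(curator_a, curator_b):
    """Fraction of characters for which two curators produced different EQ statements."""
    assert len(curator_a) == len(curator_b), "curators must annotate the same characters"
    mismatches = sum(1 for x, y in zip(curator_a, curator_b) if x != y)
    return mismatches / len(curator_a)

# Each EQ statement is an (entity ID, quality ID) pair -- IDs illustrative.
a = [("UBERON:0004709", "PATO:0000574"), ("PO:0025117", "PATO:0000587"),
     ("PO:0025117", "PATO:0001591"), ("UBERON:0004709", "PATO:0000574"),
     ("PO:0009046", "PATO:0000970")]
b = [("UBERON:0004709", "PATO:0000574"), ("PO:0025117", "PATO:0000587"),
     ("PO:0025117", "PATO:0000587"), ("UBERON:0004709", "PATO:0001617"),
     ("PO:0009046", "PATO:0000970")]

print(f"disagreement: {eq_disagreement(a, b):.0%}")  # disagreement: 40%
```

Exact matching is deliberately strict: two semantically equivalent statements using different but related ontology terms count as a mismatch, which is precisely the keyword-to-concept translation problem the text identifies.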
  3. Abstract

    Graph databases capture richly linked domain knowledge by integrating heterogeneous data and metadata into a unified representation. Here, we present the use of bespoke, interactive data graphics (bar charts, scatter plots, etc.) for visual exploration of a knowledge graph. By modeling a chart as a set of metadata that describes semantic context (SPARQL query) separately from visual context (Vega-Lite specification), we leverage the high-level, declarative nature of the SPARQL and Vega-Lite grammars to concisely specify web-based, interactive data graphics synchronized to a knowledge graph. Resources with dereferenceable URIs (uniform resource identifiers) can employ the hyperlink encoding channel or image marks in Vega-Lite to amplify the information content of a given data graphic, and published charts populate a browsable gallery of the database. We discuss design considerations that arise in relation to portability, persistence, and performance. Altogether, this pairing of SPARQL and Vega-Lite—demonstrated here in the domain of polymer nanocomposite materials science—offers an extensible approach to FAIR (findable, accessible, interoperable, reusable) scientific data visualization within a knowledge graph framework.
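The chart-as-metadata design described above can be sketched as a plain data structure: the semantic context is a SPARQL query whose projected variables supply the fields, and the visual context is a Vega-Lite specification whose encodings bind to those variables. The query body, class and property names, and field names below are hypothetical, not drawn from the actual nanocomposite database.

```python
# Illustrative sketch: a chart modeled as metadata pairing a SPARQL query
# (semantic context) with a Vega-Lite spec (visual context). Names are
# invented; the structure, not the vocabulary, is the point.
import json

chart = {
    "query": """
        SELECT ?sample ?filler_fraction ?modulus WHERE {
            ?sample a :NanocompositeSample ;
                    :fillerVolumeFraction ?filler_fraction ;
                    :tensileModulus ?modulus .
        }
    """,
    "spec": {  # Vega-Lite: a mark plus encodings bound to the SPARQL variables
        "mark": "point",
        "encoding": {
            "x": {"field": "filler_fraction", "type": "quantitative"},
            "y": {"field": "modulus", "type": "quantitative"},
            # a dereferenceable URI feeds the hyperlink encoding channel
            "href": {"field": "sample"},
        },
    },
}

# Running the query yields rows whose columns match the encoded fields,
# so the spec renders against the result set with no further transformation.
print(json.dumps(chart["spec"]["encoding"], indent=2))
```

Because both halves are declarative text, the pair is itself FAIR metadata: it can be stored in the knowledge graph, retrieved by URI, and re-rendered against the live database, which is what populates the browsable gallery.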

     
  4. Abstract Motivation

Biodiversity in many areas is rapidly declining because of global change. As such, there is an urgent need for new tools and strategies to help identify, monitor and conserve biodiversity hotspots. This is especially true for frugivores, species consuming fruit, because of their important role in seed dispersal and maintenance of forest structure and health. One way to identify these areas is by quantifying functional diversity, which measures the unique roles of species within a community and is valuable for conservation because of its relationship with ecosystem functioning. Unfortunately, the functional trait information required for these studies can be sparse for certain taxa and specific traits and difficult to harmonize across disparate data sources, especially in biodiversity hotspots. To help fill this need, we compiled Frugivoria, a trait database containing ecological, life‐history, morphological and geographical traits for mammals and birds exhibiting frugivory. Frugivoria encompasses species in contiguous moist montane forests and adjacent moist lowland forests of Central and South America—the latter specifically focusing on the Andean states. Compared with existing trait databases, Frugivoria harmonizes existing trait databases, adds new traits, extends traits originally available only for mammals to birds as well and fills gaps in trait categories from other databases. Furthermore, we create a cross‐taxa subset of shared traits to aid in analysis of mammals and birds. In total, Frugivoria adds 8,662 new trait values for mammals and 14,999 for birds and includes a total of 45,216 trait entries, of which only 11.37% are imputed. Frugivoria also contains an open workflow that harmonizes trait and taxonomic data from disparate sources and enables users to analyse traits in space.
As such, this open‐access database, which aligns with FAIR data principles, fills a major knowledge gap, enabling more comprehensive trait‐based studies of species in this ecologically important region.

    Main Types of Variable Contained

    Ecological, life‐history, morphological and geographical traits.

    Spatial Location and Grain

    Neotropical countries (Mexico, Guatemala, Costa Rica, Panama, El Salvador, Belize, Nicaragua, Ecuador, Colombia, Peru, Bolivia, Argentina, Venezuela and Chile) with contiguous montane regions.

    Time Period and Grain

    IUCN spatial data: obtained February 2023, spanning range maps collated from 1998 to 2022. IUCN species data: obtained June 2019–September 2022. Newly included traits: span 1924 to 2023.

    Major Taxa and Level of Measurement

    Classes Mammalia and Aves; 40,074 species‐level traits; 5142 imputed traits for 1733 species (mammals: 582; birds: 1147) and 16 sub‐species (mammals).

    Software Format

    .csv; R.
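Since Frugivoria is distributed as .csv and analysed in R, the kind of check its open workflow supports (for example, tallying the share of imputed entries against the 11.37% reported above) can be sketched with nothing but a CSV reader. The rows, column names and flag convention below are invented for illustration; they are not Frugivoria's actual schema.

```python
# Minimal sketch (invented rows): read a long-format trait table from .csv
# and compute the share of imputed entries. Hypothetical schema -- consult
# the Frugivoria metadata for the real column names.
import csv
import io

csv_text = """species,class,trait,value,imputed
Ateles geoffroyi,Mammalia,body_mass_g,7535,FALSE
Ramphastos toco,Aves,body_mass_g,592,FALSE
Ateles geoffroyi,Mammalia,diet_frugivory_pct,72,TRUE
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
imputed = sum(r["imputed"] == "TRUE" for r in rows)
print(f"{imputed}/{len(rows)} trait entries imputed ({imputed / len(rows):.1%})")
```

A long format like this (one row per species-trait pair with an imputation flag) is what makes the cross-taxa subset straightforward: filtering to traits shared by Mammalia and Aves is a single group-and-intersect over the `trait` column.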

     
A series of international workshops held in 2014, 2017, 2019, and 2022 focused on improving tephra studies from field collection through publication and encouraging FAIR (findable, accessible, interoperable, reusable) data practices for tephra data and metadata. Two consensus needs for tephra studies emerged from the 2014 and 2017 workshops: (a) standardization of tephra field data collection, geochemical analysis, correlation, and data reporting, and (b) development of next generation computer tools and databases to facilitate information access across multidisciplinary communities. To achieve (a), we developed a series of recommendations for best practices in tephra studies, from sample collection through analysis and data reporting (https://zenodo.org/record/3866266). A 4-part virtual workshop series (https://tephrochronology.org/cot/Tephra2022/) was held in February and March, 2022, to update the tephra community on these developments, to get community feedback, to learn of unmet needs, and to plan a future roadmap for open and FAIR tephra data. More than 230 people from 25 nations registered for the workshop series. The community strongly emphasized the need for better computer systems, including physical infrastructure (repositories and servers), digital infrastructure (software and tools) and human infrastructure (people, training, and professional assistance), to store, manage and serve global tephra datasets. Some desired attributes of improved computer systems include: 1) user friendliness; 2) ability to easily ingest multiparameter tephra data (using best practice recommended data fields); 3) interoperability with existing data repositories; 4) development of tool add-ons (plotting and statistics); 5) improved searchability; 6) development of a tephra portal with access to distributed data systems; and 7) commitments to long-term support from funding agencies, publishers and the cyberinfrastructure community.