

Title: A FAIR and modular image‐based workflow for knowledge discovery in the emerging field of imageomics
Abstract

Image‐based machine learning tools are an ascendant 'big data' research avenue. Citizen science platforms, such as iNaturalist, and museum‐led initiatives provide researchers with an abundance of images from which data and knowledge can be extracted, including metadata, species identifications and phenomic data. Ecological and evolutionary biologists increasingly apply complex, multi‐step processes to these data. Such processes often incorporate machine learning techniques, frequently built by others, that are difficult for other members of a collaboration to reuse.

We present a conceptual workflow model for machine learning applications using image data to extract biological knowledge in the emerging field of imageomics. For a specific imageomics application, we derive an implementation of this conceptual workflow that adheres to FAIR principles: a formal workflow definition that allows fully automated, reproducible execution and consists of reusable workflow components.

We outline technologies and best practices for creating an automated, reusable and modular workflow, and we show how they promote the reuse of machine learning models and their adaptation for new research questions. This conceptual workflow can be adapted: it can be semi‐automated, contain different components than those presented here, or have parallel components for comparative studies.

We encourage researchers—both computer scientists and biologists—to build upon this conceptual workflow that combines machine learning tools on image data to answer novel scientific questions in their respective fields.
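The modular, reusable workflow the abstract describes can be sketched as a pipeline of named components with a provenance log. This is an illustrative stand-in, not the authors' implementation; the component names and outputs (metadata extraction, species identification, trait measurement) are hypothetical placeholders for real imageomics tools.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Step:
    name: str                      # identifier recorded for provenance
    run: Callable[[Any], Any]      # the reusable component itself

@dataclass
class Workflow:
    steps: list[Step]
    log: list[str] = field(default_factory=list)

    def execute(self, data: Any) -> Any:
        # Run each component in order, logging its name so the exact
        # sequence of tools can be reported and the run reproduced.
        for step in self.steps:
            data = step.run(data)
            self.log.append(step.name)
        return data

# Hypothetical components standing in for real machine learning tools.
wf = Workflow(steps=[
    Step("extract_metadata", lambda img: {**img, "meta": "exif"}),
    Step("identify_species", lambda d: {**d, "species": "Danaus plexippus"}),
    Step("measure_traits",   lambda d: {**d, "wing_mm": 48.0}),
])
result = wf.execute({"image": "specimen_001.jpg"})
```

Because each `Step` is independent, components can be swapped for new research questions or run in parallel variants for comparative studies, as the abstract suggests.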

 
Award ID(s):
2217817 2118240 2022042
PAR ID:
10502004
Publisher / Repository:
Wiley-Blackwell
Date Published:
Journal Name:
Methods in Ecology and Evolution
ISSN:
2041-210X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Methods for inferring geographic origin from the stable isotope composition of animal tissues are widely used in movement ecology, but few computational tools and standards for data interpretation are available.

We introduce the assignR R package, which provides a structured, flexible toolkit for isotope‐based migration data analysis and interpretation using a widely adopted semi‐parametric Bayesian inversion method.

assignR bundles data resources and functions that support data interpretation, hypothesis testing and quality assessment, allowing end‐to‐end data analysis with only a few lines of code. Tools for post hoc analysis offer robust, standardized methods for aggregating information from multiple individuals, assignment of individuals to a sub‐region of the study area and comparison of potential regions of origin using odds ratios. Assessment tools quantify the quality and power of the isotopic assignments and can be used to test prototype study designs.

The assignR package should increase the accessibility of isotopic geolocation methods. assignR supports flexible data sources and analysis decisions, making it suitable for a wide range of applications, but also promotes standardization that will help foster increased consistency and comparability among studies and a more holistic understanding of animal migration. Lastly, assignR can help make isotope‐based geolocation research more efficient by helping researchers plan projects to be optimally aligned with their research questions.
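The core of the Bayesian inversion the abstract refers to can be illustrated in a few lines: each cell of an isoscape predicts a mean and standard deviation for the tissue isotope value, a normal likelihood is evaluated for the observed value, and normalizing over cells gives a posterior surface of origin. This is a minimal sketch with a flat spatial prior and made-up isoscape values, not the assignR implementation, which adds calibration and uncertainty propagation.

```python
import math

def assign_origin(observed, isoscape):
    """Posterior probability of origin for one individual, assuming a
    normal likelihood at each grid cell and a flat spatial prior."""
    like = {
        cell: math.exp(-0.5 * ((observed - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        for cell, (mu, sd) in isoscape.items()
    }
    total = sum(like.values())
    # Normalize likelihoods into a posterior that sums to 1 over cells.
    return {cell: l / total for cell, l in like.items()}

# Toy isoscape: predicted mean and sd of a tissue isotope value per
# region (hypothetical numbers, three cells instead of a raster grid).
isoscape = {"north": (-120.0, 8.0), "central": (-90.0, 8.0), "south": (-60.0, 8.0)}
posterior = assign_origin(-88.0, isoscape)
```

Post hoc tools such as odds ratios or sub-region assignment then operate on this posterior surface rather than on the raw likelihoods.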

     
  2. Abstract

Recent advances in generative artificial intelligence (AI) and multimodal learning analytics (MMLA) have allowed for new and creative ways of leveraging AI to support K12 students' collaborative learning in STEM+C domains. To date, there is little evidence of AI methods supporting students' collaboration in complex, open‐ended environments. AI systems are known to underperform humans in (1) interpreting students' emotions in learning contexts, (2) grasping the nuances of social interactions and (3) understanding domain‐specific information that was not well represented in the training data. As such, combined human and AI (i.e. hybrid) approaches are needed to overcome the current limitations of AI systems. In this paper, we take a first step towards investigating how a human‐AI collaboration between teachers and researchers using an AI‐generated multimodal timeline can guide and support teachers' feedback while addressing students' STEM+C difficulties as they work collaboratively to build computational models and solve problems. In doing so, we present a framework characterizing the human component of our human‐AI partnership as a collaboration between teachers and researchers. To evaluate our approach, we present our timeline to a high school teacher and discuss the key insights gleaned from our discussions. Our case study analysis reveals the effectiveness of an iterative approach to using human‐AI collaboration to address students' STEM+C challenges: the teacher can use the AI‐generated timeline to guide formative feedback for students, and the researchers can leverage the teacher's feedback to help improve the multimodal timeline. Additionally, we characterize our findings with respect to two events of interest to the teacher: (1) when the students cross a difficulty threshold, and (2) the point of intervention, that is, when the teacher (or system) should intervene to provide effective feedback.
It is important to note that the teacher explained that there should be a lag between (1) and (2) to give students a chance to resolve their own difficulties. Typically, such a lag is not implemented in computer‐based learning environments that provide feedback.
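The threshold-plus-lag policy the teacher describes can be made concrete: flag the point of intervention only after difficulty has stayed at or above the threshold for a fixed number of observations. This is a hypothetical sketch of that policy, not the authors' system; the difficulty scores and parameter values are invented.

```python
def point_of_intervention(difficulty, threshold, lag):
    """Return the index at which feedback should be given: the first time
    difficulty stays at or above `threshold` for `lag` consecutive
    observations, giving students a chance to resolve difficulties on
    their own. Returns None if students recover before the lag elapses."""
    run = 0
    for i, d in enumerate(difficulty):
        run = run + 1 if d >= threshold else 0
        if run >= lag:
            return i
    return None

# Hypothetical per-interval difficulty scores for one student group.
scores = [0.2, 0.5, 0.8, 0.9, 0.85, 0.4]
when = point_of_intervention(scores, threshold=0.7, lag=3)
```

With no lag (`lag=1`) the system would intervene at the first threshold crossing, which is exactly the immediate-feedback behaviour of typical computer-based learning environments that the teacher argued against.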

     
  3. Abstract

Biodiversity studies rely heavily on estimates of species' distributions, often obtained through ecological niche modelling. Numerous software packages exist that allow users to model ecological niches using machine learning and statistical methods. However, no existing package with a graphical user interface allows users to perform model calibration and selection based on convex forms such as ellipsoids, which may better match fundamental ecological niche shapes, while also incorporating tools for exploring, modelling and evaluating niches and distributions that are intuitive for both novice and proficient users.

Here we describe an R package, NicheToolBox (ntbox), that allows users to conduct all processing steps involved in ecological niche modelling: downloading and curating occurrence data, obtaining and transforming environmental data layers, selecting environmental variables, exploring relationships between geographic and environmental spaces, calibrating and selecting ellipsoid models, evaluating models using binomial and partial ROC tests, assessing extrapolation risk, and performing geographic information system operations via a graphical user interface. A summary of the entire workflow is produced for use as a stand‐alone algorithm or as part of research reports.

The method is explained in detail and tested via modelling the threatened feline species Leopardus wiedii. Georeferenced occurrence data for this species are queried to display both point occurrences and the IUCN extent of occurrence polygon (IUCN, 2007). This information is used to illustrate tools available for accessing, processing and exploring biodiversity data (e.g. number of occurrences and chronology of collecting) and transforming environmental data (e.g. a summary PCA for 19 bioclimatic layers). Visualizations of three‐dimensional ecological niches modelled as minimum volume ellipsoids are developed with ancillary statistics. This niche model is then projected to geographic space, to represent a corresponding potential suitability map.

Using ntbox provides a fast and straightforward means to retrieve and manipulate occurrence and environmental data, which can then be used in model calibration, projection and evaluation for assessing distributions of species in geographic space and their corresponding environmental combinations.
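The idea behind ellipsoid niche models is that suitability declines with Mahalanobis distance from the centroid of the occurrences in environmental space. The sketch below fits a centroid and covariance to two hypothetical environmental variables; it is a bare-bones, two-variable stand-in for the minimum volume ellipsoids ntbox fits, with invented occurrence values, not the package's algorithm.

```python
import math

def ellipsoid_model(occ):
    """Centroid and inverse covariance of 2-D occurrence points in
    environmental space (sample covariance, closed-form 2x2 inverse)."""
    n = len(occ)
    mx = sum(p[0] for p in occ) / n
    my = sum(p[1] for p in occ) / n
    sxx = sum((p[0] - mx) ** 2 for p in occ) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in occ) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in occ) / (n - 1)
    det = sxx * syy - sxy ** 2
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inv

def suitability(point, centroid, inv):
    """Suitability declines with squared Mahalanobis distance from the
    niche centroid; 1.0 at the centroid itself."""
    dx, dy = point[0] - centroid[0], point[1] - centroid[1]
    d2 = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.exp(-0.5 * d2)

# Hypothetical occurrences as (temperature, precipitation) pairs.
occurrences = [(24.0, 1500.0), (25.0, 1600.0), (23.0, 1400.0), (26.0, 1550.0)]
centroid, inv = ellipsoid_model(occurrences)
```

Projecting the niche to geographic space amounts to evaluating `suitability` at the environmental values of every map cell, yielding the potential suitability map the abstract describes.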

     
  4. Abstract

    Conceptual models are necessary to synthesize what is known about a topic, identify gaps in knowledge and improve understanding. The process of developing conceptual models that summarize the literature using ad hoc approaches has high potential to be incomplete due to the challenges of tracking information and hypotheses across the literature.

    We present a novel, systematic approach to conceptual model development through qualitative synthesis and graphical analysis of hypotheses already present in the scientific literature. Our approach has five stages: researchers explicitly define the scope of the question, conduct a systematic review, extract hypotheses from prior studies, assemble hypotheses into a single network model and analyse trends in the model through network analysis.

    The resulting network can be analysed to identify shifts in thinking over time, variation in the application of ideas over different axes of investigation (e.g. geography, taxonomy, ecosystem type) and the most important hypotheses based on the network structure. To illustrate the approach, we present examples from a case study that applied the method to synthesize decades of research on the effects of forest fragmentation on birds.

    This approach can be used to synthesize scientific thinking across any field of research, guide future research to fill knowledge gaps efficiently and help researchers systematically build conceptual models representing alternative hypotheses.
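The assemble-and-analyse stage above can be illustrated with a toy example: hypotheses extracted from the literature become directed cause-to-effect edges in one network, and simple structural measures such as degree identify the most connected concepts. The edges below are invented placeholders loosely echoing the forest-fragmentation case study, and degree is only one of many network measures one could use.

```python
from collections import Counter

def hypothesis_network(edges):
    """Assemble extracted hypotheses (cause, effect) pairs into a single
    network and count each concept's degree (in + out), a simple proxy
    for its structural importance in the synthesis."""
    degree = Counter()
    for cause, effect in edges:
        degree[cause] += 1
        degree[effect] += 1
    return degree

# Hypothetical hypotheses extracted from a systematic review.
edges = [
    ("fragmentation", "patch_size"),
    ("fragmentation", "edge_effects"),
    ("fragmentation", "isolation"),
    ("patch_size", "bird_richness"),
    ("edge_effects", "nest_predation"),
]
ranking = hypothesis_network(edges).most_common()
```

Tagging each edge with its publication year or study region would let the same network reveal shifts in thinking over time and variation across axes of investigation, as the abstract describes.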

     
  5. Abstract

The Molecular Sciences Software Institute's (MolSSI) Quantum Chemistry Archive (QCArchive) project is an umbrella name that covers both a central server hosted by MolSSI for community data and the Python‐based software infrastructure that powers automated computation and storage of quantum chemistry (QC) results. The MolSSI‐hosted central server provides the computational molecular sciences community a location to freely access tens of millions of QC computations for machine learning, methodology assessment, force‐field fitting, and more through a Python interface. Facile, user‐friendly mining of the centrally archived quantum chemical data also can be achieved through web applications found at https://qcarchive.molssi.org. The software infrastructure can be used as a standalone platform to compute, structure, and distribute hundreds of millions of QC computations for individuals or groups of researchers at any scale. The QCArchive Infrastructure is open‐source (BSD‐3C), code repositories can be found at https://github.com/MolSSI, and releases can be downloaded via PyPI and Conda.

    This article is categorized under:

    Electronic Structure Theory > Ab Initio Electronic Structure Methods

    Software > Quantum Chemistry

    Data Science > Computer Algorithms and Programming
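The compute-once, query-many pattern behind an archive like QCArchive can be sketched as a record store keyed by molecule and computational specification. This toy stand-in is not the qcportal client API; the class, method names and the energy value are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Spec:
    """A computational specification: method and basis set."""
    method: str
    basis: str

class RecordStore:
    """Toy archive keyed by (molecule, specification), illustrating how
    results are computed once, stored, and served to many queries."""
    def __init__(self):
        self._records = {}

    def add(self, molecule: str, spec: Spec, energy: float) -> None:
        # Store the result under its unique (molecule, spec) key.
        self._records[(molecule, spec)] = energy

    def query(self, molecule: str, spec: Spec):
        # Return the archived result, or None if not yet computed.
        return self._records.get((molecule, spec))

store = RecordStore()
store.add("H2O", Spec("b3lyp", "def2-svp"), -76.33)  # illustrative energy
hit = store.query("H2O", Spec("b3lyp", "def2-svp"))
```

At archive scale the same keying idea is what lets tens of millions of stored computations be reused for machine learning or force-field fitting instead of being recomputed.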

     