skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Data reuse in agricultural genomics research: challenges and recommendations
Abstract The scientific community has long benefited from the opportunities provided by data reuse. Recognizing the need to identify the challenges and bottlenecks to reuse in the agricultural research community and propose solutions for them, the data reuse working group was started within the AgBioData consortium framework. Here, we identify the limitations of data standards, metadata deficiencies, data interoperability, data ownership, data availability, user skill level, resource availability, and equity issues, with a specific focus on agricultural genomics research. We propose possible solutions stakeholders could implement to mitigate and overcome these challenges and provide an optimistic perspective on the future of genomics and transcriptomics data reuse.  more » « less
Award ID(s):
2126334
PAR ID:
10566196
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
GigaScience
Volume:
14
ISSN:
2047-217X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The scientific community has long benefited from the opportunities provided by data reuse. Recognizing the need to identify the challenges and bottlenecks to reuse in the agricultural research community and propose solutions for them, the data reuse working group was started within the AgBioData consortium (https://www.agbiodata.org/) framework. Here, we identify the limitations of data standards, metadata deficiencies, data interoperability, data ownership, data availability, user skill level, resource availability, and equity issues, with a specific focus on agricultural genomics research. We propose possible solutions stakeholders could implement to mitigate and overcome these challenges and provide an optimistic perspective on the future of genomics and transcriptomics data reuse. 
    more » « less
  2. Harris, T (Ed.)
    Abstract The rapid increase in the number of reference-quality genome assemblies presents significant new opportunities for genomic research. However, the absence of standardized naming conventions for genome assemblies and annotations across datasets creates substantial challenges. Inconsistent naming hinders the identification of correct assemblies, complicates the integration of bioinformatics pipelines, and makes it difficult to link assemblies across multiple resources. To address this, we developed a specification for standardizing the naming of reference genome assemblies, to improve consistency across datasets and facilitate interoperability. This specification was created with FAIR (Findable, Accessible, Interoperable, and Reusable) practices in mind, ensuring that reference assemblies are easier to locate, access, and reuse across research communities. Additionally, it has been designed to comply with primary genomic data repositories, including members of the International Nucleotide Sequence Database Collaboration consortium, ensuring compatibility with widely used databases. While initially tailored to the agricultural genomics community, the specification is adaptable for use across different taxa. Widespread adoption of this standardized nomenclature would streamline assembly management, better enable cross-species analyses, and improve the reproducibility of research. It would also enhance natural language processing applications that depend on consistent reference assembly names in genomic literature, promoting greater integration and automated analysis of genomic data. This is a good time to consider more consistent genomic data nomenclature as many research communities and data resources are now finding themselves juggling multiple datasets from multiple data providers. 
    more » « less
  3. The opaque relationship between biology and behavior is an intractable problem for psychiatry, and it increasingly challenges longstanding diagnostic categorizations. While various big data sciences have been repeatedly deployed as potential solutions, they have so far complicated more than they have managed to disentangle. Attending to categorical misalignment, this article proposes one reason why this is the case: Datasets have to instantiate clinical categories in order to make biological sense of them, and they do so in different ways. Here, I use mixed methods to examine the role of the reuse of big data in recent genomic research on autism spectrum disorder (ASD). I show how divergent regimes of psychiatric categorization are innately encoded within commonly used datasets from MSSNG and 23andMe, contributing to a rippling disjuncture in the accounts of autism that this body of research has produced. Beyond the specific complications this dynamic introduces for the category of autism, this paper argues for the necessity of critical attention to the role of dataset reuse and recombination across human genomics and beyond. 
    more » « less
  4. IntroductionAdvancements in machine learning (ML) algorithms that make predictions from data without being explicitly programmed and the increased computational speeds of graphics processing units (GPUs) over the last decade have led to remarkable progress in the capabilities of ML. In many fields, including agriculture, this progress has outpaced the availability of sufficiently diverse and high-quality datasets, which now serve as a limiting factor. While many agricultural use cases appear feasible with current compute resources and ML algorithms, the lack of reusable hardware and software components, referred to as cyberinfrastructure (CI), for collecting, transmitting, cleaning, labeling, and training datasets is a major hindrance toward developing solutions to address agricultural use cases. This study focuses on addressing these challenges by exploring the collection, processing, and training of ML models using a multimodal dataset and providing a vision for agriculture-focused CI to accelerate innovation in the field. MethodsData were collected during the 2023 growing season from three agricultural research locations across Ohio. The dataset includes 1 terabyte (TB) of multimodal data, comprising Unmanned Aerial System (UAS) imagery (RGB and multispectral), as well as soil and weather sensor data. The two primary crops studied were corn and soybean, which are the state's most widely cultivated crops. The data collected and processed from this study were used to train ML models to make predictions of crop growth stage, soil moisture, and final yield. ResultsThe exercise of processing this dataset resulted in four CI components that can be used to provide higher accuracy predictions in the agricultural domain. These components included (1) a UAS imagery pipeline that reduced processing time and improved image quality over standard methods, (2) a tabular data pipeline that aggregated data from multiple sources and temporal resolutions and aligned it with a common temporal resolution, (3) an approach to adapting the model architecture for a vision transformer (ViT) that incorporates agricultural domain expertise, and (4) a data visualization prototype that was used to identify outliers and improve trust in the data. DiscussionFurther work will be aimed at maturing the CI components and implementing them on high performance computing (HPC). There are open questions as to how CI components like these can best be leveraged to serve the needs of the agricultural community to accelerate the development of ML applications in agriculture. 
    more » « less
  5. Oceanography is inherently an interdisciplinary science capable of producing highly complex, heterogeneous data that pose unique challenges for data management and reuse. Evolving instrumentation and new research methodologies are increasingly taxing current strategies and technologies for management and reuse of data. Data-related publisher and funder requirements are relatively new demands that researchers must learn to navigate. These are just some of the stressors that repositories experience in their role of curating and publishing FAIR marine-related data. In response, oceanographic repositories are adapting by leveraging community data standards, engaging in the development of new technologies and the usage of novel tools to improve data discovery and interoperability. Additionally, they are collaborating with data-related stakeholders to help shape data-related policy, and fill an education role to promote good data hygiene and bring awareness of concepts like FAIR to the oceanographic research community. This presentation will highlight some of the activities of the BCO-DMO repository that are aimed at advancing the availability and reuse of Open oceanographic data. 
    more » « less