Title: Deploying Big Data to Crack the Genotype to Phenotype Code
Synopsis Mechanistically connecting genotypes to phenotypes is a longstanding and central mission of biology. Deciphering these connections will unite questions and datasets across all scales from molecules to ecosystems. Although high-throughput sequencing has provided a rich platform on which to launch this effort, tools for deciphering mechanisms further along the genome to phenome pipeline remain limited. Machine learning approaches and other emerging computational tools hold the promise of augmenting human efforts to overcome these obstacles. This vision paper is the result of a Reintegrating Biology Workshop, bringing together the perspectives of integrative and comparative biologists to survey challenges and opportunities in cracking the genotype to phenotype code and thereby generating predictive frameworks across biological scales. Key recommendations include promoting the development of minimum “best practices” for the experimental design and collection of data; fostering sustained and long-term data repositories; promoting programs that recruit, train, and retain a diversity of talent; and providing funding to effectively support these highly cross-disciplinary efforts. We follow this discussion by highlighting a few specific transformative research opportunities that will be advanced by these efforts.
Award ID(s):
1927470
PAR ID:
10196274
Journal Name:
Integrative and Comparative Biology
Volume:
60
Issue:
2
ISSN:
1540-7063
Page Range / eLocation ID:
385 to 396
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Understanding ecosystem processes on our rapidly changing planet requires integration across spatial, temporal, and biological scales. We propose that spectral biology, using tools that enable near‐ to far‐range sensing by capturing the interaction of energy with matter across domains of the electromagnetic spectrum, will increasingly enable ecological insights across scales from cells to continents. Here, we focus on advances using spectroscopy in the visible to short‐wave infrared, chlorophyll fluorescence‐detecting systems, and optical laser scanning (light detection and ranging, LiDAR) to introduce the topic and special feature. Remote sensing using these tools, in conjunction with in situ measurements, can powerfully capture ecological and evolutionary processes in changing environments. These tools are amenable to capturing variation in life processes across biological scales that span physiological, evolutionary, and macroecological hierarchies. We point out key areas of spectral biology with high potential to advance understanding and monitoring of ecological processes across scales—particularly at large spatial extents—in the face of rapid global change. These include: the detection of plant and ecosystem composition, diversity, structure, and function as well as their relationships; detection of the causes and consequences of environmental stress, including disease and drought, for ecosystems; and detection of change through time in ecosystems over large spatial extents to discern variation in and mechanisms underlying their resistance, recovery, and resilience in the face of disturbance. We discuss opportunities for spectral biology to discover previously unseen variation and novel processes and to prepare the field of ecology for novel computational tools on the horizon with vast new capabilities for monitoring the ecology of our changing planet. 
  2. Abstract Despite efforts to integrate research across different subdisciplines of biology, the scale of integration remains limited. We hypothesize that future generations of Artificial Intelligence (AI) technologies specifically adapted for biological sciences will help enable the reintegration of biology. AI technologies will allow us not only to collect, connect and analyze data at unprecedented scales, but also to build comprehensive predictive models that span various subdisciplines. They will make possible both targeted (testing specific hypotheses) and untargeted discoveries. AI for biology will be the cross-cutting technology that will enhance our ability to do biological research at every scale. We expect AI to revolutionize biology in the 21st century much like statistics transformed biology in the 20th century. The difficulties, however, are many, including data curation and assembly, development of new science in the form of theories that connect the subdisciplines, and new predictive and interpretable AI models that are more suited to biology than existing machine learning and AI techniques. Development efforts will require strong collaborations between biological and computational scientists. This white paper provides a vision for AI for Biology and highlights some challenges.
  3. Abstract Estimating multiple sequence alignments (MSAs) and inferring phylogenies are essential for many aspects of comparative biology. Yet, many bioinformatics tools for such analyses have focused on specific clades, with greatest attention paid to plants, animals, and fungi. The rapid increase in high-throughput sequencing (HTS) data from diverse lineages now provides opportunities to estimate evolutionary relationships and gene family evolution across the eukaryotic tree of life. At the same time, these types of data are known to be error-prone (e.g., substitutions, contamination). To address these opportunities and challenges, we have refined a phylogenomic pipeline, now named PhyloToL, to allow easy incorporation of data from HTS studies, to automate production of both MSAs and gene trees, and to identify and remove contaminants. PhyloToL is designed for phylogenomic analyses of diverse lineages across the tree of life (i.e., at scales of >100 My). We demonstrate the power of PhyloToL by assessing stop codon usage in Ciliophora, identifying contamination in a taxon- and gene-rich database and exploring the evolutionary history of chromosomes in the kinetoplastid parasite Trypanosoma brucei, the causative agent of African sleeping sickness. Benchmarking PhyloToL’s homology assessment against that of OrthoMCL and a published paper on superfamilies of bacterial and eukaryotic organellar outer membrane pore-forming proteins demonstrates the power of our approach for determining gene family membership and inferring gene trees. PhyloToL is highly flexible and allows users to easily explore HTS data, test hypotheses about phylogeny and gene family evolution and combine outputs with third-party tools (e.g., PhyloChromoMap, iGTP). 
  4. Abstract Deciphering gene regulatory networks (GRNs) is both a promise and challenge of systems biology. The promise lies in identifying key transcription factors (TFs) that enable an organism to react to changes in its environment. The challenge lies in validating GRNs that involve hundreds of TFs with hundreds of thousands of interactions with their genome-wide targets experimentally determined by high-throughput sequencing. To address this challenge, we developed ConnecTF, a species-independent, web-based platform that integrates genome-wide studies of TF–target binding, TF–target regulation, and other TF-centric omic datasets and uses these to build and refine validated or inferred GRNs. We demonstrate the functionality of ConnecTF by showing how integration within and across TF–target datasets uncovers biological insights. Case study 1 uses integration of TF–target gene regulation and binding datasets to uncover TF mode-of-action and identify potential TF partners for 14 TFs in abscisic acid signaling. Case study 2 demonstrates how genome-wide TF–target data and automated functions in ConnecTF are used in precision/recall analysis and pruning of an inferred GRN for nitrogen signaling. Case study 3 uses ConnecTF to chart a network path from NLP7, a master TF in nitrogen signaling, to direct secondary TF2s and to its indirect targets in a Network Walking approach. The public version of ConnecTF (https://ConnecTF.org) contains 3,738,278 TF–target interactions for 423 TFs in Arabidopsis, 839,210 TF–target interactions for 139 TFs in maize (Zea mays), and 293,094 TF–target interactions for 26 TFs in rice (Oryza sativa). The database and tools in ConnecTF will advance the exploration of GRNs in plant systems biology applications for model and crop species.
  5. ABSTRACT The physical organization of DNA within the nucleus is fundamental to a wide range of biological processes. The experimental investigation of the structure of genomic DNA remains challenging due to its large size and hierarchical arrangement. These challenges present considerable opportunities for combined experimental and modeling approaches. Physics‐based computational models, in particular, have emerged as essential tools for probing chromatin structure and dynamics across a wide range of length scales. Such models must necessarily be capable of bridging scales, and each scale presents its own subtleties and intricacies. This review discusses recent methodological advances in genomic structural modeling, emphasizing the need for multiscale integration to capture the hierarchical organization and molecular mechanisms that underlie chromatin structure and function. We present an analysis of state‐of‐the‐art methods, as well as a perspective on challenges and future opportunities across length scales ranging from bare DNA to nucleosomes and chromatin fibers, up to TAD and chromosome‐scale models. We emphasize models that connect genome organization to gene expression, models that leverage emerging machine learning capabilities, and models that develop multiscale approaches. We examine gaps in experimental data that computational models are poised to address and propose directions for future research that bridge theory and experiment in DNA structural biology. 