Title: Discovering genotype–phenotype relationships with machine learning and the Visual Physiology Opsin Database ( VPOD )
Abstract
Background: Predicting phenotypes from genetic variation is foundational for fields as diverse as bioengineering and global change biology, highlighting the importance of efficient methods to predict gene functions. Linking genetic changes to phenotypic changes has been a goal of decades of experimental work, especially for some model gene families, including light-sensitive opsin proteins. Opsins can be expressed in vitro to measure light absorption parameters, including λmax (the wavelength of maximum absorbance), which strongly affects organismal phenotypes like color vision. Despite extensive research on opsins, the data remain dispersed, uncompiled, and often challenging to access, thereby precluding systematic and comprehensive analyses of the intricate relationships between genotype and phenotype.
Results: Here, we report a newly compiled database of all heterologously expressed opsin genes with λmax phenotypes that we call the Visual Physiology Opsin Database (VPOD). VPOD_1.0 contains 864 unique opsin genotypes and corresponding λmax phenotypes collected across all animals from 73 separate publications. We use VPOD data and deepBreaks to show regression-based machine learning (ML) models often reliably predict λmax, account for nonadditive effects of mutations on function, and identify functionally critical amino acid sites.
Conclusion: The ability to reliably predict functions from gene sequences alone using ML will allow robust exploration of molecular-evolutionary patterns governing phenotype, will inform functional and evolutionary connections to an organism’s ecological niche, and may be used more broadly for de novo protein design. Together, our database, phenotype predictions, and model comparisons lay the groundwork for future research applicable to families of genes with quantifiable and comparable phenotypes.
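As a rough illustration of what "regression-based ML on sequence" can mean, the sketch below one-hot encodes aligned amino acid sequences and fits a ridge regression to λmax values. It is not the deepBreaks pipeline and does not use VPOD data: the four toy sequences, their λmax values, and the helper names (one_hot, fit_ridge, predict) are all hypothetical, and ridge regression by gradient descent is simply one convenient stand-in for the family of models the abstract mentions.

```python
# Minimal sketch: predict lambda-max (nm) from aligned sequences.
# All sequences and phenotype values are invented for illustration;
# none of them come from VPOD, and this is not the deepBreaks method.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"  # 20 residues plus an alignment gap


def one_hot(seq):
    """Flatten an aligned sequence into a 0/1 vector, one slot per (site, residue)."""
    vec = []
    for residue in seq:
        vec.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vec


def fit_ridge(X, y, alpha=0.1, lr=0.05, steps=2000):
    """Fit ridge regression by plain gradient descent; returns (weights, intercept)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = sum(y) / n  # start the (unpenalized) intercept at the mean phenotype
    for _ in range(steps):
        residuals = [
            sum(wj * xj for wj, xj in zip(w, row)) + b - target
            for row, target in zip(X, y)
        ]
        grad_w = [
            sum(r * row[j] for r, row in zip(residuals, X)) / n + alpha * w[j]
            for j in range(d)
        ]
        grad_b = sum(residuals) / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(seq, w, b):
    """Predict lambda-max for one aligned sequence."""
    return sum(wj * xj for wj, xj in zip(w, one_hot(seq))) + b


# Hypothetical training set: two variable sites (positions 1 and 4).
training = {"ASTY": 500.0, "ASTF": 495.0, "GSTY": 510.0, "GSTF": 505.0}
X = [one_hot(s) for s in training]
y = list(training.values())
w, b = fit_ridge(X, y)

for seq in training:
    print(seq, round(predict(seq, w, b), 1))
```

Because this toy model is purely additive over sites, it cannot capture the nonadditive (epistatic) effects the abstract refers to; doing so would require interaction features or a nonlinear learner. Large fitted weights do hint at which sites matter, which is the intuition behind identifying functionally critical positions.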
Award ID(s):
2109688
PAR ID:
10552180
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
GigaScience
Volume:
13
ISSN:
2047-217X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Background: Predicting phenotypes from genetic variation is foundational for fields as diverse as bioengineering and global change biology, highlighting the importance of efficient methods to predict gene functions. Linking genetic changes to phenotypic changes has been a goal of decades of experimental work, especially for some model gene families including light-sensitive opsin proteins. Opsins can be expressed in vitro to measure light absorption parameters, including λmax (the wavelength of maximum absorbance), which strongly affects organismal phenotypes like color vision. Despite extensive research on opsins, the data remain dispersed, uncompiled, and often challenging to access, thereby precluding systematic and comprehensive analyses of the intricate relationships between genotype and phenotype. Results: Here, we report a newly compiled database of all heterologously expressed opsin genes with λmax phenotypes called the Visual Physiology Opsin Database (VPOD). VPOD_1.0 contains 864 unique opsin genotypes and corresponding λmax phenotypes collected across all animals from 73 separate publications. We use VPOD data and deepBreaks to show regression-based machine learning (ML) models often reliably predict λmax, account for non-additive effects of mutations on function, and identify functionally critical amino acid sites. Conclusion: The ability to reliably predict functions from gene sequences alone using ML will allow robust exploration of molecular-evolutionary patterns governing phenotype, will inform functional and evolutionary connections to an organism’s ecological niche, and may be used more broadly for de novo protein design. Together, our database, phenotype predictions, and model comparisons lay the groundwork for future research applicable to families of genes with quantifiable and comparable phenotypes.
     Key Points: We introduce the Visual Physiology Opsin Database (VPOD_1.0), which includes 864 unique animal opsin genotypes and corresponding λmax phenotypes from 73 separate publications. We demonstrate that regression-based ML models can reliably predict λmax from gene sequence alone, predict non-additive effects of mutations on function, and identify functionally critical amino acid sites. We provide an approach that lays the groundwork for future robust exploration of molecular-evolutionary patterns governing phenotype, with potential broader applications to any family of genes with quantifiable and comparable phenotypes.
  2. Introduction: Throughout domestication, crop plants have gone through strong genetic bottlenecks, dramatically reducing the genetic diversity in today’s available germplasm. This has also reduced the diversity in traits necessary for breeders to develop improved varieties. Many strategies have been developed to improve both genetic and trait diversity in crops, from backcrossing with wild relatives, to chemical/radiation mutagenesis, to genetic engineering. However, even with recent advances in genetic engineering, we still face the rate-limiting step of identifying which genes and mutations we should target to generate diversity in specific traits. Methods: Here, we apply a comparative evolutionary approach, pairing phylogenetic and expression analyses to identify potential candidate genes for diversifying soybean (Glycine max) canopy cover development via the nuclear auxin signaling gene families, while minimizing pleiotropic effects in other tissues. In soybean, rapid canopy cover development is correlated with yield and also suppresses weeds in organic cultivation. Results and discussion: Using principal component analysis, we identified the genes most specifically expressed during early canopy development from the TIR1/AFB auxin receptor, Aux/IAA auxin co-receptor, and ARF auxin response factor gene families in soybean. We defined Arabidopsis thaliana and model legume species orthologs for each soybean gene in these families, allowing us to speculate about potential soybean phenotypes based on well-characterized mutants in these model species. In future work, we aim to connect genetic and functional diversity in these candidate genes with phenotypic diversity in planta, allowing for improvements in soybean rapid canopy cover, yield, and weed suppression. Further development of this and similar algorithms for defining and quantifying tissue- and phenotype-specificity in gene expression may allow expansion of diversity in valuable phenotypes in important crops.
  3. Many marine organisms have a biphasic life cycle that transitions between a swimming larva and a more sedentary adult form. At the end of the first phase, larvae must identify suitable sites to settle and undergo a dramatic morphological change. Environmental factors, including photic and chemical cues, appear to influence settlement, but the sensory receptors involved are largely unknown. We targeted the protein receptor opsin, which belongs to a large superfamily of transmembrane receptors that detect environmental stimuli, hormones, and neurotransmitters. While opsins are well known for light sensing, including vision, a growing number of studies have demonstrated light-independent functions. We therefore examined opsin expression in the Pteriomorphia, a large, diverse clade of marine bivalves that includes commercially important species, such as oysters, mussels, and scallops. Methods: Genomic annotations combined with phylogenetic analysis show great variation in opsin abundance among pteriomorphian bivalves, including surprisingly high genomic abundance in many species that are eyeless as adults, such as mussels. Therefore, we investigated the diversity of opsin expression from the perspective of larval development. We collected opsin gene expression data in four families of Pteriomorphia, across three distinct larval stages (trochophore, veliger, and pediveliger), and compared those to adult tissues. Results: We found that larvae express all opsin types in these bivalves, but opsin expression patterns are largely species-specific across development. Few opsins are expressed in the adult mantle, but many are highly expressed in adult eyes. Intriguingly, opsin genes such as retinochrome, xenopsins, and Go-opsins have higher levels of expression in the later larval stages, such as the pediveliger, when substrates for settlement are being tested. Conclusion: Investigating opsin gene expression during larval development provides crucial insights into larval interactions with the surroundings, which may shed light on how the opsin receptors of these organisms respond to the various environmental cues that play a pivotal role in their settlement process.
  4. For many species, vision is one of the most important sensory modalities for mediating essential tasks that include navigation, predation and foraging, predator avoidance, and numerous social behaviors. The vertebrate visual process begins when photons of light interact with rod and cone photoreceptors that are present in the neural retina. Vertebrate visual photopigments are housed within these photoreceptor cells and are sensitive to a wide range of wavelengths, with a peak sensitivity that is a function of the type of chromophore used and how it interacts with specific amino acid residues found within the opsin protein sequence. Minor differences in the amino acid sequences of the opsins are known to lead to large differences in the spectral peak of absorbance (i.e., the λmax value). In our prior studies, we developed a new approach that combined homology modeling and molecular dynamics simulations to gather structural information associated with chromophore conformation, then used it to generate statistical models for the accurate prediction of λmax values for photopigments derived from Rh1 and Rh2 amino acid sequences. In the present study, we test this approach by predicting the λmax of phylogenetically distant Sws2 cone opsins. To build a model that can predict the λmax using the approach presented in our prior studies, we selected a spectrally diverse set of 11 teleost Sws2 photopigments for which both amino acid sequence information and experimentally measured λmax values are known. The final first-order regression model, consisting of three terms associated with chromophore conformation, was sufficient to predict the λmax of Sws2 photopigments with high accuracy. This study further highlights the breadth of our approach in reliably predicting λmax values of Sws2 cone photopigments, which are evolutionarily more distant from the template bovine RH1, and provides mechanistic insights into the role of known spectral tuning sites.
  5. In animals, opsins and cryptochromes are major protein families that transduce light signals when bound to light-absorbing chromophores. Opsins are involved in various light-dependent processes, like vision, and have been co-opted for light-independent sensory modalities. Cryptochromes are important photoreceptors in animals, generally regulating circadian rhythm; they belong to a larger protein family with photolyases, which repair UV-induced DNA damage. Mollusks are great animals in which to explore questions about light sensing, as eyes have evolved multiple times across, and within, taxonomic classes. We used molluscan genome assemblies from 80 species to predict protein sequences and examine gene family evolution using phylogenetic approaches. We found extensive opsin family expansion and contraction, particularly in bivalve xenopsins and gastropod Go-opsins, while other opsins, like retinochrome, rarely duplicate. Bivalve and gastropod lineages exhibit fluctuations in opsin repertoire, while cephalopods have the fewest opsins and have lost at least 2 major opsin types. Interestingly, opsin expansions are not limited to eyed species, and the highest opsin content was seen in eyeless bivalves. The dynamic nature of opsin evolution contrasts with the general lack of diversification in mollusk cryptochromes, though some taxa, including cephalopods and terrestrial gastropods, have reduced repertoires of both protein families. We also found complete loss of opsins and cryptochromes in multiple, but not all, deep-sea species. These results help set the stage for connecting genomic changes, including opsin family expansion and contraction, with differences in environmental and biological features across Mollusca.