Abstract
Background: Predicting phenotypes from genetic variation is foundational for fields as diverse as bioengineering and global change biology, highlighting the importance of efficient methods to predict gene functions. Linking genetic changes to phenotypic changes has been a goal of decades of experimental work, especially for some model gene families, including light-sensitive opsin proteins. Opsins can be expressed in vitro to measure light absorption parameters, including λmax (the wavelength of maximum absorbance), which strongly affects organismal phenotypes like color vision. Despite extensive research on opsins, the data remain dispersed, uncompiled, and often challenging to access, thereby precluding systematic and comprehensive analyses of the intricate relationships between genotype and phenotype.
Results: Here, we report a newly compiled database of all heterologously expressed opsin genes with λmax phenotypes that we call the Visual Physiology Opsin Database (VPOD). VPOD_1.0 contains 864 unique opsin genotypes and corresponding λmax phenotypes collected across all animals from 73 separate publications. We use VPOD data and deepBreaks to show regression-based machine learning (ML) models often reliably predict λmax, account for nonadditive effects of mutations on function, and identify functionally critical amino acid sites.
Conclusion: The ability to reliably predict functions from gene sequences alone using ML will allow robust exploration of molecular-evolutionary patterns governing phenotype, will inform functional and evolutionary connections to an organism’s ecological niche, and may be used more broadly for de novo protein design. Together, our database, phenotype predictions, and model comparisons lay the groundwork for future research applicable to families of genes with quantifiable and comparable phenotypes.
Discovering genotype-phenotype relationships with machine learning and the Visual Physiology Opsin Database (VPOD)
Key Points
- We introduce the Visual Physiology Opsin Database (VPOD_1.0), which includes 864 unique animal opsin genotypes and corresponding λmax phenotypes from 73 separate publications.
- We demonstrate that regression-based ML models can reliably predict λmax from gene sequence alone, predict non-additive effects of mutations on function, and identify functionally critical amino acid sites.
- We provide an approach that lays the groundwork for future robust exploration of molecular-evolutionary patterns governing phenotype, with potential broader applications to any family of genes with quantifiable and comparable phenotypes.
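As a rough illustration of what "regression-based prediction of λmax from sequence" can look like, the sketch below one-hot encodes short aligned amino acid sequences and fits a ridge regression by gradient descent. It is not the deepBreaks pipeline and does not use VPOD data: the four-residue sequences, the λmax values, and the function names (`one_hot`, `fit_ridge`, `predict`) are all hypothetical, and the model is purely additive, so it cannot capture the non-additive effects mentioned above.

```python
# Toy sketch only: sequences, lambda-max values, and helper names are invented
# for illustration and do not come from VPOD or deepBreaks.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq):
    """Flatten an aligned sequence into a 0/1 vector (20 slots per site)."""
    vec = []
    for residue in seq:
        vec.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vec


def fit_ridge(X, y, alpha=0.001, lr=0.1, epochs=5000):
    """Fit weights and an intercept by gradient descent on the ridge loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = sum(y) / n  # start the intercept at the mean phenotype
    for _ in range(epochs):
        # Residuals of the current model on every training sequence.
        errors = [
            b + sum(wi * xi for wi, xi in zip(w, row)) - yi
            for row, yi in zip(X, y)
        ]
        for j in range(d):
            grad = sum(e * row[j] for e, row in zip(errors, X)) / n + alpha * w[j]
            w[j] -= lr * grad
        b -= lr * sum(errors) / n  # the intercept is not penalized
    return w, b


def predict(w, b, seq):
    """Predict lambda-max (nm) for one aligned sequence."""
    return b + sum(wi * xi for wi, xi in zip(w, one_hot(seq)))


# Hypothetical training set: site 2 (S -> T) adds 10 nm, site 4 (Y -> F) removes 10 nm.
train = [("ASFY", 500.0), ("ASFF", 490.0), ("ATFY", 510.0)]
X = [one_hot(seq) for seq, _ in train]
y = [lmax for _, lmax in train]

w, b = fit_ridge(X, y)
pred = predict(w, b, "ATFF")  # held-out double mutant
print(f"predicted lambda-max for ATFF: {pred:.1f} nm")
```

Because the toy effects are additive, the held-out double mutant should land near 500 nm. Real opsin data would need a proper multiple sequence alignment, many more sequences, and cross-validation before any prediction could be trusted.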
- Award ID(s):
- 2109688
- PAR ID:
- 10540242
- Publisher / Repository:
- bioRxiv
- Date Published:
- Format(s):
- Medium: X
- Institution:
- bioRxiv
- Sponsoring Org:
- National Science Foundation
More Like this
For many species, vision is one of the most important sensory modalities for mediating essential tasks that include navigation, predation and foraging, predator avoidance, and numerous social behaviors. The vertebrate visual process begins when photons of light interact with rod and cone photoreceptors in the neural retina. Vertebrate visual photopigments are housed within these photoreceptor cells and are sensitive to a wide range of wavelengths; the wavelength of peak sensitivity is a function of the type of chromophore used and how it interacts with specific amino acid residues in the opsin protein sequence. Minor differences in the amino acid sequences of the opsins are known to lead to large differences in the spectral peak of absorbance (i.e. the λmax value). In our prior studies, we developed an approach that combined homology modeling and molecular dynamics simulations to gather structural information associated with chromophore conformation, then used it to generate statistical models for the accurate prediction of λmax values for photopigments derived from Rh1 and Rh2 amino acid sequences. In the present study, we test this approach on phylogenetically distant Sws2 cone opsins. To build the model, we selected a spectrally diverse set of 11 teleost Sws2 photopigments for which both amino acid sequences and experimentally measured λmax values are known. The final first-order regression model, consisting of three terms associated with chromophore conformation, was sufficient to predict the λmax of Sws2 photopigments with high accuracy. This study further highlights the breadth of our approach in reliably predicting λmax values of Sws2 cone photopigments, which are evolutionarily more distant from the template bovine RH1, and provides mechanistic insights into the role of known spectral tuning sites.
Summary: A prevailing hypothesis posits that achieving higher maximum rates of leaf carbon gain and water loss is constrained by geometry and/or selection to limit the allocation of epidermal area to stomata (fS). Under this ‘stomatal‐area minimization hypothesis’, higher gs,max is associated with greater numbers of smaller stomata because this trait combination increases gs,max with minimal increase in fS, leading to relative conservation of fS semi‐independent of gs,max due to coordination in stomatal size, density, and pore depth. An alternative hypothesis is that the evolution of higher gs,max can be enabled by a greater epidermal area allocated to stomata, leading to positive covariation between fS and gs,max; we call this the ‘stomatal‐area adaptation hypothesis’. Under this hypothesis, the interspecific scaling between gs,max, stomatal density, and stomatal size is a by‐product of selection on a moving optimal gs,max. We integrated biophysical and evolutionary quantitative genetic modeling with phylogenetic comparative analyses of a global data set of stomatal density and size from 2408 vascular forest species. The models present specific assumptions of both hypotheses and deduce predictions that can be evaluated with our empirical analyses of forest plants. There are three main results. First, neither the stomatal‐area minimization nor adaptation hypothesis is sufficient to be supported. Second, estimates of interspecific scaling from common regression methods cannot reliably distinguish between hypotheses when stomatal size is bounded. Third, we reconcile both hypotheses with the data by including an additional assumption that stomatal size is bounded by a wide range and under selection; we refer to this synthetic hypothesis as the ‘stomatal adaptation + bounded size’ hypothesis. This study advances our understanding of scaling between stomatal size and density by mathematically describing specific assumptions of competing hypotheses, demonstrating that existing hypotheses are inconsistent with observations, and reconciling these hypotheses with phylogenetic comparative analyses by postulating a synthetic model of selection on gs,max, fS, and stomatal size.
Summary: The timing of reproduction is a critical developmental decision in the life cycle of many plant species. Fine mapping of a rapid‐flowering mutant was done using whole‐genome sequence data from bulked DNA from a segregating F2 mapping population. The causative mutation maps to a gene orthologous with the third subunit of DNA polymerase δ (POLD3), a previously uncharacterized gene in plants. Expression analyses of POLD3 were conducted via real‐time qPCR to determine when and in what tissues the gene is expressed. To better understand the molecular basis of the rapid‐flowering phenotype, transcriptomic analyses were conducted in the mutant vs wild‐type. Consistent with the rapid‐flowering mutant phenotype, a range of genes involved in floral induction and flower development are upregulated in the mutant. Our results provide the first characterization of the developmental and gene expression phenotypes that result from a lesion in POLD3 in plants.
Summary: Isogenic individuals can display seemingly stochastic phenotypic differences, limiting the accuracy of genotype‐to‐phenotype predictions. The extent of this phenotypic variation depends in part on genetic background, raising questions about the genes involved in controlling stochastic phenotypic variation. Focusing on early seedling traits in Arabidopsis thaliana, we found that hypomorphs of the cuticle‐related gene LIPID TRANSFER PROTEIN 2 (LTP2) greatly increased variation in seedling phenotypes, including hypocotyl length, gravitropism and cuticle permeability. Many ltp2 hypocotyls were significantly shorter than wild‐type hypocotyls while others resembled the wild‐type. Differences in epidermal properties and gene expression between ltp2 seedlings with long and short hypocotyls suggest a loss of cuticle integrity as the primary determinant of the observed phenotypic variation. We identified environmental conditions that reveal or mask the increased variation in ltp2 hypomorphs and found that increased expression of its closest paralog LTP1 is necessary for ltp2 phenotypes. Our results illustrate how decreased expression of a single gene can generate starkly increased phenotypic variation in isogenic individuals in response to an environmental challenge.

