Partial least squares regression has been an alternative to ordinary least squares for handling multicollinearity in several areas of scientific research since the 1960s. It has recently gained much attention in the analysis of high dimensional genomic data. We show that known asymptotic consistency of the partial least squares estimator for a univariate response does not hold with the very large p and small n paradigm. We derive a similar result for a multivariate response regression with partial least squares. We then propose a sparse partial least squares formulation which aims simultaneously to achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors. We provide an efficient implementation of sparse partial least squares regression and compare it with well-known variable selection and dimension reduction approaches via simulation experiments. We illustrate the practical utility of sparse partial least squares regression in a joint analysis of gene expression and genomewide binding data.
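As an illustration of the idea, a sparse PLS direction can be obtained by soft-thresholding the ordinary PLS weight vector so that most loadings are exactly zero. The sketch below is a minimal, self-contained example of that first-component step; the thresholding rule, the `eta` parameter, and the simulated p >> n data are assumptions for illustration and not the estimator defined in the paper.

```python
# Minimal sketch: a sparse first PLS direction via soft-thresholding.
# eta and the threshold rule are illustrative assumptions, not the paper's exact estimator.
import numpy as np

def sparse_pls_first_component(X, y, eta=0.5):
    """Return a sparse weight vector w and the latent score t = X @ w.

    eta in [0, 1): larger values zero out more predictor loadings.
    """
    Xc = X - X.mean(axis=0)            # center predictors
    yc = y - y.mean()                  # center response
    z = Xc.T @ yc                      # ordinary PLS direction (covariances with y)
    thresh = eta * np.max(np.abs(z))   # data-driven threshold
    w = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)  # soft-threshold
    if not np.any(w):
        return w, np.zeros(X.shape[0])
    w /= np.linalg.norm(w)             # unit-norm sparse direction
    return w, Xc @ w

# Toy p >> n example: only the first 5 of 500 predictors relate to y
rng = np.random.default_rng(0)
n, p = 40, 500
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.ones(5) + rng.normal(scale=0.5, size=n)
w, t = sparse_pls_first_component(X, y, eta=0.8)
print("nonzero loadings:", np.count_nonzero(w))
```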
The breeder’s equation, Δz̄ = Gβ, allows us to understand how genetics (the genetic covariance matrix, G) and the vector of linear selection gradients β interact to generate evolutionary trajectories. Estimation of β using multiple regression of trait values on relative fitness revolutionized the way we study selection in laboratory and wild populations. However, multicollinearity, or correlation of predictors, can lead to very high variances of and covariances between elements of β, posing a challenge for the interpretation of the parameter estimates. This is particularly relevant in the era of big data, where the number of predictors may approach or exceed the number of observations. A common approach to multicollinear predictors is to discard some of them, thereby losing any information that might be gained from those traits. Using simulations, we show how, on the one hand, multicollinearity can result in inaccurate estimates of selection, and, on the other, how the removal of correlated phenotypes from the analyses can provide a misguided view of the targets of selection. We show that regularized regression, which places data-validated constraints on the magnitudes of individual elements of β, can produce more accurate estimates of the total strength and direction of multivariate selection in the presence of multicollinearity and limited data, and often has little cost when multicollinearity is low. We also compare standard and regularized regression estimates of selection in a reanalysis of three published case studies, showing that regularized regression can improve fitness predictions in independent data. Our results suggest that regularized regression is a valuable tool that can be used as an important complement to traditional least-squares estimates of selection. In some cases, its use can lead to improved predictions of individual fitness, and improved estimates of the total strength and direction of multivariate selection.
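To make the estimation problem concrete, the sketch below simulates correlated phenotypes, estimates β by ordinary least squares and by ridge regression (one of several possible regularized estimators), and pushes both through Δz̄ = Gβ. The simulated traits, the toy G matrix, and the cross-validated ridge penalty are illustrative assumptions, not the analyses reported in the paper.

```python
# Hedged sketch: OLS vs. ridge estimates of selection gradients (beta) under
# multicollinearity, then the breeder's equation delta_zbar = G @ beta.
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV

rng = np.random.default_rng(1)
n, k = 60, 5                                        # individuals, traits
# Strongly correlated phenotypes (multicollinearity)
L = np.linalg.cholesky(0.1 * np.eye(k) + 0.9 * np.ones((k, k)))
Z = rng.normal(size=(n, k)) @ L.T
beta_true = np.array([0.3, 0.0, 0.0, 0.0, 0.0])     # selection acts on trait 1 only
w = 1.0 + Z @ beta_true + rng.normal(scale=0.5, size=n)
w_rel = w / w.mean()                                # relative fitness

beta_ols = LinearRegression().fit(Z, w_rel).coef_
beta_ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(Z, w_rel).coef_

G = 0.4 * np.eye(k) + 0.1 * np.ones((k, k))         # toy additive genetic covariance
print("OLS   predicted response:", np.round(G @ beta_ols, 3))
print("ridge predicted response:", np.round(G @ beta_ridge, 3))
print("true  predicted response:", np.round(G @ beta_true, 3))
```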
- NSF-PAR ID: 10537676
- Publisher / Repository: Oxford University Press
- Date Published:
- Journal Name: Evolution Letters
- Volume: 8
- Issue: 3
- ISSN: 2056-3744
- Format(s): Medium: X
- Size(s): p. 361-373
- Sponsoring Org: National Science Foundation
More Like this
Social interactions with conspecifics can dramatically affect an individual’s fitness. The positive or negative consequences of interacting with social partners typically depend on the value of traits that they express. These pathways of social selection connect the traits and genes expressed in some individuals to the fitness realized by others, thereby altering the total phenotypic selection on and evolutionary response of traits across the multivariate phenotype. The downstream effects of social selection are mediated by the patterns of phenotypic assortment between focal individuals and their social partners (the interactant covariance, C_ij′, or the multivariate form, C_I). Depending on the sign and magnitude of the interactant covariance, the direction of social selection can be reinforced, reversed, or erased. We report estimates of C_ij′ from a variety of studies of forked fungus beetles to address the largely unexplored questions of consistency and plasticity of phenotypic assortment in natural populations. We found that phenotypic assortment of male beetles based on body size or horn length was highly variable among subpopulations, but that those differences also were broadly consistent from year to year. At the same time, the strength and direction of C_ij′ changed quickly in response to experimental changes in resource distribution and social properties of populations. Generally, interactant covariances were more negative in contexts in which the number of social interactions was greater in both field and experimental situations. These results suggest that patterns of phenotypic assortment could be important contributors to variability in multilevel selection through their mediation of social selection gradients.
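For readers unfamiliar with the quantity, the interactant covariance can be thought of as the covariance between a focal individual's trait and the (mean) trait value of its social partners. The short sketch below computes such a covariance from simulated focal/partner data; the group sizes, trait names, and degree of assortment are assumptions for illustration only.

```python
# Illustrative computation of an interactant covariance C_ij': the covariance
# between a focal trait i (body size) and the mean of trait j (horn length)
# among that individual's social partners. All values are simulated.
import numpy as np

rng = np.random.default_rng(2)
n_focal, n_partners = 100, 3
body_size = rng.normal(size=n_focal)                          # focal trait i
# Positive assortment: partners tend to resemble their focal individual
partner_horn = 0.4 * body_size[:, None] + rng.normal(size=(n_focal, n_partners))
partner_horn_mean = partner_horn.mean(axis=1)                 # partner trait j

C_ij_prime = np.cov(body_size, partner_horn_mean)[0, 1]
print("interactant covariance C_ij':", round(C_ij_prime, 3))
```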
The measurement of uncharacterized pools of biological molecules through techniques such as metabarcoding, metagenomics, metatranscriptomics, metabolomics, and metaproteomics produces large, multivariate datasets. Analyses of these datasets have successfully been borrowed from community ecology to characterize the molecular diversity of samples (α-diversity) and to assess how these profiles change in response to experimental treatments or across gradients (β-diversity). However, sample preparation and data collection methods generate biases and noise which confound molecular diversity estimates and require special attention. Here, we examine how technical biases and noise that are introduced into multivariate molecular data affect the estimation of the components of diversity (i.e., total number of different molecular species, or entities; total number of molecules; and the abundance distribution of molecular entities). We then explore under which conditions these biases affect the measurement of α- and β-diversity and highlight how novel methods commonly used in community ecology can be adopted to improve the interpretation and integration of multivariate molecular data.
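As a concrete example of the borrowed community-ecology machinery, the sketch below computes a per-sample α-diversity (Shannon index) and a pairwise β-diversity (Bray-Curtis dissimilarity) from a small samples-by-entities abundance matrix. These particular metrics and the toy counts are assumptions for illustration; they are not necessarily the measures examined in the paper.

```python
# Illustrative alpha- and beta-diversity from an abundance matrix
# (rows = samples, columns = molecular entities). Toy data only.
import numpy as np

def shannon_alpha(counts):
    """Shannon diversity of one sample."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def bray_curtis_beta(a, b):
    """Bray-Curtis dissimilarity between two samples."""
    return np.abs(a - b).sum() / (a + b).sum()

abundance = np.array([[10, 5, 0, 2],
                      [ 8, 6, 1, 0],
                      [ 0, 1, 9, 12]], dtype=float)

alphas = [shannon_alpha(row) for row in abundance]
print("alpha per sample:", np.round(alphas, 3))
print("beta (sample 1 vs 2):", round(bray_curtis_beta(abundance[0], abundance[1]), 3))
print("beta (sample 1 vs 3):", round(bray_curtis_beta(abundance[0], abundance[2]), 3))
```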
Neighborhood models have allowed us to test many hypotheses regarding the drivers of variation in tree growth, but require considerable computation due to the many empirically supported non-linear relationships they include. Regularized regression represents a far more efficient neighborhood modeling method, but it is unclear whether such an ecologically unrealistic model can provide accurate insights on tree growth. Rapid computation is becoming increasingly important as ecological datasets grow in size, and may be essential when using neighborhood models to predict tree growth beyond sample plots or into the future. We built a novel regularized regression model of tree growth and investigated whether it reached the same conclusions as a commonly used neighborhood model, regarding hypotheses of how tree growth is influenced by the species identity of neighboring trees. We also evaluated the ability of both models to interpolate the growth of trees not included in the model fitting dataset. Our regularized regression model replicated most of the classical model’s inferences in a fraction of the time without using high-performance computing resources. We found that both methods could interpolate out-of-sample tree growth, but the method making the most accurate predictions varied among focal species. Regularized regression is particularly efficient for comparing hypotheses because it automates the process of model selection and can handle correlated explanatory variables. This feature means that regularized regression could also be used to select among potential explanatory variables (e.g., climate variables) and thereby streamline the development of a classical neighborhood model. Both regularized regression and classical methods can interpolate out-of-sample tree growth, but future research must determine whether predictions can be extrapolated to trees experiencing novel conditions. Overall, we conclude that regularized regression methods can complement classical methods in the investigation of tree growth drivers and represent a valuable tool for advancing this field toward prediction.
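The sketch below shows the general shape of such a regularized neighborhood model: focal-tree growth is regressed on per-species neighbor abundance with a lasso penalty that automatically drops uninformative neighbor species. The simulated basal-area predictors, species count, and penalty selection by cross-validation are illustrative assumptions rather than the model fitted in the study.

```python
# Hedged sketch of a lasso neighborhood model: log growth ~ focal size +
# per-species neighbor basal area, with cross-validated penalty selection.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n_trees, n_species = 200, 12
neighbor_ba = rng.gamma(shape=2.0, scale=1.0, size=(n_trees, n_species))
focal_size = rng.lognormal(mean=2.0, sigma=0.3, size=n_trees)

# Growth declines with two competing neighbor species and scales with size
log_growth = (0.5 * np.log(focal_size)
              - 0.30 * neighbor_ba[:, 0]
              - 0.15 * neighbor_ba[:, 1]
              + rng.normal(scale=0.2, size=n_trees))

X = StandardScaler().fit_transform(np.column_stack([np.log(focal_size), neighbor_ba]))
fit = LassoCV(cv=5).fit(X, log_growth)
print("estimated neighbor-species effects:", np.round(fit.coef_[1:], 2))
```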
We propose a multivariate sparse group lasso variable selection and estimation method for data with high-dimensional predictors as well as high-dimensional response variables. The method is carried out through a penalized multivariate multiple linear regression model with an arbitrary group structure for the regression coefficient matrix. It suits many biology studies well in detecting associations between multiple traits and multiple predictors, with each trait and each predictor embedded in some biological functional groups such as genes, pathways or brain regions. The method is able to effectively remove unimportant groups as well as unimportant individual coefficients within important groups, particularly for large p, small n problems, and is flexible in handling various complex group structures such as overlapping or nested or multilevel hierarchical structures. The method is evaluated through extensive simulations with comparisons to the conventional lasso and group lasso methods, and is applied to an eQTL association study.
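The core mechanism of a sparse group lasso is a two-level shrinkage of the coefficient matrix: individual entries are soft-thresholded, and each predictor group is then shrunk (possibly to zero) as a block. The sketch below shows one such proximal step; the group definitions, penalty weights, and step size are illustrative assumptions, not the estimator or algorithm developed in the paper.

```python
# Illustrative group-wise soft-thresholding step for a sparse group lasso on a
# coefficient matrix B (predictors x responses). Groups index rows of B.
import numpy as np

def sparse_group_prox(B, groups, lam_group, lam_indiv, step=1.0):
    """One proximal step: element-wise then group-wise shrinkage."""
    B = np.sign(B) * np.maximum(np.abs(B) - step * lam_indiv, 0.0)
    for g in groups:
        norm = np.linalg.norm(B[g])
        if norm > 0:
            B[g] *= max(0.0, 1.0 - step * lam_group * np.sqrt(len(g)) / norm)
    return B

rng = np.random.default_rng(4)
B = rng.normal(scale=0.3, size=(6, 4))      # 6 predictors, 4 responses
B[:3] += 1.0                                # first group carries real signal
groups = [[0, 1, 2], [3, 4, 5]]             # e.g., two genes or pathways
print(np.round(sparse_group_prox(B, groups, lam_group=0.4, lam_indiv=0.1), 2))
```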