

Title: A simple yet powerful test for assessing goodness‐of‐fit of high‐dimensional linear models

We evaluate the validity of a projection‐based test for checking linear models when the number of covariates tends to infinity, and analyze two gene expression datasets. We show that the test is still consistent and derive the asymptotic distributions under the null and alternative hypotheses. The asymptotic properties are almost the same as those when the number of covariates is fixed, as long as p/n → 0, with additional mild assumptions. The test dramatically gains dimension reduction, and its numerical performance is remarkable.

 
NSF-PAR ID:
10453598
Author(s) / Creator(s):
 ;  ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Statistics in Medicine
Volume:
40
Issue:
13
ISSN:
0277-6715
Page Range / eLocation ID:
p. 3153-3166
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    We propose new tests for assessing whether covariates in a treatment group and matched control group are balanced in observational studies. The tests exhibit high power under a wide range of multivariate alternatives, some of which existing tests have little power for. The asymptotic permutation null distributions of the proposed tests are studied, and the P‐values calculated through the asymptotic results work well in simulation studies, facilitating the application of the tests to large data sets. The tests are illustrated in a study of the effect of smoking on blood lead levels. The proposed tests are implemented in an R package, BalanceCheck.

     
  2. Abstract Aim

    To test the latitudinal gradient in plant species diversity for self‐similarity across taxonomic scales and amongst taxa.

    Location

    North America.

    Methods

    We used species richness data from 245 local vascular plant floras to quantify the slope and shape of the latitudinal gradients in species diversity (LGSD) across all plant species as well as within each family and order. We calculated the contribution of each family and order to the empirical LGSD.

    Results

    We observed the canonical LGSD when all plants were considered, with floras at the lowest latitudes having, on average, 451 more species than floras at the highest latitudes. When considering slope alone, most orders and families showed the expected negative slope, but 31.7% of families and 27.7% of orders showed either no significant relationship between latitude and diversity or a reverse LGSD. Latitudinal patterns of family diversity account for at least 14% of this LGSD. Most orders and families did not show the negative slope and concave‐down quadratic shape expected by the pattern for all plant species. A majority of families did not make a significant contribution in species to the LGSD, with 53% of plant families contributing little to nothing to the overall gradient. Ten families accounted for more than 70% of the gradient. Two families, the Asteraceae and Fabaceae, contributed a third of the LGSD.

    Main Conclusions

    The empirical LGSD we describe here is a consequence of a gradient in the number of families and diversification within relatively few plant families. Macroecological studies typically aim to generate models that are general across taxa with the implicit assumption that the models are general within taxa. Our results strongly suggest that models of the latitudinal gradient in plant species richness that rely on environmental covariates (e.g. temperature, energy) are likely not general across plant taxa.

     
  3. Abstract

    The time compression (or time condensation) approximation (TCA) is commonly used in conjunction with an infiltration capacity equation for predicting the postponding infiltration rate, or, more generally, infiltration under time‐varying precipitation. In this paper a power function relationship for TCA between infiltration capacity and its time derivative is proposed for infiltration in the presence of a shallow water table. The results show that the exponent (β) in the power function relationship is not a constant but decreases as infiltration proceeds. The change of β indicates that the TCA relationship changes during infiltration and further suggests the necessity of using different TCA relationships for predicting infiltration rate during different stages after ponding. We argue that the change of β is due to the gradual dynamic change of the relative role of gravity and capillarity during infiltration. A Péclet number (Pe) is proposed for measuring the relative effect of gravity and capillarity. In the early times of infiltration when Pe < 1, with the increase of Pe, β decreases roughly from 3.5 to 2 for clay, silty clay loam, and silty loam, and from 3 to 2 for sandy loam and sand; during the longer times when Pe > 1, β has a linear relationship with Pe. The relationship between Pe and β provides an objective approach to select the suitable TCA function during different infiltration stages after ponding.

     
  4. Abstract Aim

    This paper assesses the relative importance of environmental filtering and dispersal limitations as controls on the western range limit of Fagus grandifolia, a common mesic late‐successional tree species in the eastern United States. We also test for differences in species–environment relationships between range‐edge populations of F. grandifolia in eastern Wisconsin and core populations in Michigan. Because environmental conditions between the states differ moderately, while in Michigan dispersal presumably no longer limits F. grandifolia distributions, F. grandifolia offers a classic case study for biogeographers, foresters, and palaeoecologists interested in understanding processes governing species range limits.

    Location

    Wisconsin and Michigan, USA.

    Taxon

    Fagus grandifolia.

    Methods

    This study combines historical datasets of F. grandifolia from the Public Land Survey, environmental covariates from soil maps and historical climate data, three spatial scenarios of dispersal limitation, and five species distribution models (SDMs). We test dispersal limitation and environmental filtering hypotheses by assessing SDM transferability between core and edge populations, measuring the importance of dispersal and environmental predictors, and using a residual autocovariate model to test for spatial processes not represented by these predictors.

    Results

    Fagus grandifolia presence was best predicted by total snowfall in Michigan and by dispersal, summer precipitation, and potential evapotranspiration (PET) in Wisconsin. Following the addition of dispersal as a predictor, most Wisconsin models improved and spatial autocorrelation effects largely disappeared. Transferability between core and edge populations was moderate to low.

    Main conclusions

    Both environmental and dispersal limitations appear to govern the western range limit of F. grandifolia. Species–environment relationships differ between range‐edge and core populations, suggesting either stronger environmental filtering at the range edge or fine‐scale, spatially varying interactions between environmental factors governing moisture availability in core populations. Although lakes, like Lake Michigan, both moderate regional climates and act as dispersal barriers, these effects can be disentangled through the joint analysis of SDMs and historic observational datasets.

     
  5. Abstract

    We focus on the all‐pairs minimum cut (APMC) problem, a graph partitioning problem whose solution requires finding the minimum cut for every pair of nodes in a given graph. While it is solved for undirected graphs, a solution for APMC in directed graphs still requires an O(n²) brute force approach. We show that the empirical number of distinct minimum cuts in randomly generated strongly connected directed graphs is proportional to n rather than the theoretical value of n², suggesting the possibility of an algorithm which finds all minimum cuts in less than O(n²) time. We also provide an example of the strict upper bound on the number of cuts in graphs with three nodes. We model the distributions with the generalized extreme value (GEV) distribution and enable the possibility of using a GEV distribution to predict the probability of achieving a certain number of minimum cuts, given the number of nodes and edges. Finally, we contribute to the notion of symmetric cuts by showing that there can be O(n²) symmetric cuts in graphs when node replication is allowed.
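As a rough illustration of the brute force approach mentioned in this abstract, the sketch below computes a minimum cut value for every ordered pair of nodes in a small directed graph, using one max-flow computation per pair. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `max_flow` and `all_pairs_min_cut`, the dictionary graph format, the choice of the Edmonds-Karp algorithm, and the three-node example graph are all hypothetical and chosen only for illustration. It reports cut values only, not the cut sets themselves.

```python
from collections import deque
from itertools import permutations


def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow. By the max-flow/min-cut theorem the result
    equals the value of a minimum source-sink cut.

    `capacity` maps each node to a dict {neighbor: edge capacity};
    every node must appear as a key (hypothetical input format).
    """
    # Residual capacities; copy so the caller's graph is left untouched.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            # Make sure a reverse edge exists for every forward edge.
            residual.setdefault(v, {}).setdefault(u, 0)

    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount of flow along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


def all_pairs_min_cut(capacity):
    """Brute force: one max-flow run for each of the n(n-1) ordered pairs."""
    return {(s, t): max_flow(capacity, s, t)
            for s, t in permutations(capacity, 2)}


# Hypothetical example: a strongly connected directed 3-cycle.
graph = {"a": {"b": 3}, "b": {"c": 2}, "c": {"a": 1}}
cuts = all_pairs_min_cut(graph)
distinct_values = sorted(set(cuts.values()))
```

In this toy graph the six ordered pairs yield only three distinct cut values, which loosely mirrors the abstract's observation that the number of distinct minimum cuts can be much smaller than the number of node pairs.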

     