Search for: All records

Creators/Authors contains: "Yang, Lu"

  1. ABSTRACT

    Semicontinuous outcomes commonly arise in a wide variety of fields, such as insurance claims, healthcare expenditures, rainfall amounts, and alcohol consumption. Regression models, including Tobit, Tweedie, and two-part models, are widely employed to understand the relationship between semicontinuous outcomes and covariates. Given the potentially detrimental consequences of model misspecification, it is of prime importance to check the adequacy of a regression model after fitting it. However, due to the point mass at zero, standard diagnostic tools for regression models (e.g., deviance and Pearson residuals) are not informative for semicontinuous data. To bridge this gap, we propose a new type of residual for semicontinuous outcomes that is applicable to general regression models. Under the correctly specified model, the proposed residuals converge to being uniformly distributed, and when the model is misspecified, they significantly depart from this pattern. In addition to in-sample validation, the proposed methodology can also be employed to evaluate predictive distributions. We demonstrate the effectiveness of the proposed tool using health expenditure data from the US Medical Expenditure Panel Survey.
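
The point-mass problem described in this abstract can be illustrated with a minimal sketch. It is not the residual proposed in the paper; it only shows why a naive probability integral transform, U = F(Y), fails to be uniform for a semicontinuous outcome even when the model is correct. The Tobit-type data-generating model (Y = max(0, Y*) with Y* ~ N(mu, 1)), the function name `naive_pit_tobit`, and the parameter values are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist


def naive_pit_tobit(mu, n, seed=0):
    """Simulate Y = max(0, Y*), Y* ~ N(mu, 1), and return F(Y) under the true model.

    For y >= 0 the true CDF of Y is Phi(y - mu), so every censored
    observation (Y = 0) maps to the same value Phi(-mu).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    values = []
    for _ in range(n):
        latent = rng.gauss(mu, 1.0)   # hypothetical latent outcome Y*
        y = max(0.0, latent)          # observed semicontinuous outcome
        values.append(std_normal.cdf(y - mu))
    return values


mu = 0.0
pit = naive_pit_tobit(mu=mu, n=10_000)
atom = NormalDist().cdf(0.0 - mu)     # P(Y = 0) under the true model

# Share of transformed values sitting exactly on the atom, and share below it.
share_at_atom = sum(u == atom for u in pit) / len(pit)
share_below_atom = sum(u < atom for u in pit) / len(pit)
print(f"share at atom: {share_at_atom:.3f}, share below atom: {share_below_atom:.3f}")
```

With mu = 0, roughly half of the transformed values pile up at a single point and none fall below it, so the naive transform cannot be compared against a uniform reference distribution. This is the gap that the proposed residuals are designed to address.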

     
  2. The assessment of regression models with discrete outcomes is challenging and has many fundamental issues. With discrete outcomes, standard regression model assessment tools such as Pearson and deviance residuals do not follow the conventional reference distribution (normal) under the true model, calling into question the legitimacy of model assessment based on these tools. To fill this gap, we construct a new type of residuals for regression models with general discrete outcomes, including ordinal and count outcomes. The proposed residuals are based on two layers of probability integral transformation. When at least one continuous covariate is available, the proposed residuals closely follow a uniform distribution (or a normal distribution after transformation) under the correctly specified model. One can construct visualizations such as QQ plots to check the overall fit of a model straightforwardly, and the shape of QQ plots can further help identify possible causes of misspecification such as overdispersion. We provide theoretical justification for the proposed residuals by establishing their asymptotic properties. Moreover, in order to assess the mean structure and identify potential covariates, we develop an ordered curve as a supplementary tool, which is based on the comparison between the partial sum of outcomes and of fitted means. Through simulation, we demonstrate empirically that the proposed tools outperform commonly used residuals for various model assessment tasks. We also illustrate the workflow of model assessment using the proposed tools in data analysis. Supplementary materials for this article are available online. 
    Free, publicly-accessible full text available February 14, 2025
  3. Abstract

    Motivation

    Emerging omics technologies have introduced a two-way grouping structure in multiple testing, as seen in single-cell omics data, where the features can be grouped by either genes or cell types. Traditional multiple testing methods have limited ability to exploit such two-way grouping structure, leading to potential power loss.

    Results

    We propose a new 2D Group Benjamini–Hochberg (2dGBH) procedure to harness the two-way grouping structure in omics data, extending the traditional one-way adaptive GBH procedure. Using both simulated and real datasets, we show that 2dGBH effectively controls the false discovery rate across biologically relevant settings, and it is more powerful than the BH or q-value procedure and more robust than the one-way adaptive GBH procedure.

    Availability and implementation

    2dGBH is available as an R package at: https://github.com/chloelulu/tdGBH. The analysis code and data are available at: https://github.com/chloelulu/tdGBH-paper.
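
For orientation, the classical Benjamini–Hochberg (BH) step-up procedure, which 2dGBH is compared against, can be sketched as follows. This is a minimal illustration of plain BH only; it does not implement the group weighting of GBH or 2dGBH, and it is not the API of the tdGBH R package. The function name `benjamini_hochberg` and the example p-values are assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses BH rejects.

    Step-up rule: find the largest rank k (1-based, p-values sorted in
    ascending order) with p_(k) <= alpha * k / m, then reject the k
    hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values: only the two smallest pass their thresholds.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(example, alpha=0.05)

# Step-up behaviour: the largest p-value passes its threshold, so all are rejected,
# even though the second and third smallest fail their own thresholds.
step_up = benjamini_hochberg([0.01, 0.04, 0.03, 0.04], alpha=0.05)
```

Because plain BH treats all features exchangeably, it cannot use information such as which gene or cell type a feature belongs to; exploiting that grouping is the motivation the abstract gives for 2dGBH.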

     
  4. Abstract

    Background

    Single-cell RNA-sequencing (scRNA-seq) has become a widely used tool for both basic and translational biomedical research. In scRNA-seq data analysis, cell type annotation is an essential but challenging step. In the past few years, several annotation tools have been developed. These methods require either labeled training/reference datasets, which are not always available, or a list of predefined cell subset markers, which are subject to biases. Thus, a user-friendly and precise annotation tool is still critically needed.

    Results

    We curated a comprehensive cell marker database named scMayoMapDatabase and developed a companion R package scMayoMap, an easy-to-use single-cell annotation tool, to provide fast and accurate cell type annotation. The effectiveness of scMayoMap was demonstrated in 48 independent scRNA-seq datasets across different platforms and tissues. Additionally, the scMayoMapDatabase can be integrated with other tools and further improve their performance.

    Conclusions

    scMayoMap and scMayoMapDatabase will help investigators to define the cell types in their scRNA-seq data in a streamlined and user-friendly way.

     
  5. Abstract

    Differential abundance analysis (DAA) is one central statistical task in microbiome data analysis. A robust and powerful DAA tool can help identify highly confident microbial candidates for further biological validation. Current microbiome studies frequently generate correlated samples from different microbiome sampling schemes such as spatial and temporal sampling. In the past decade, a number of DAA tools for correlated microbiome data (DAA-c) have been proposed. Disturbingly, different DAA-c tools could sometimes produce quite discordant results. To recommend the best practice to the field, we performed the first comprehensive evaluation of existing DAA-c tools using real data-based simulations. Overall, the linear model-based methods LinDA, MaAsLin2 and LDM are more robust than methods based on generalized linear models. The LinDA method is the only method that maintains reasonable performance in the presence of strong compositional effects.

     
  6. Abstract

    Background

    Differential abundance analysis (DAA) is one central statistical task in microbiome data analysis. A robust and powerful DAA tool can help identify highly confident microbial candidates for further biological validation. Numerous DAA tools have been proposed in the past decade to address the special characteristics of microbiome data, such as zero inflation and compositional effects. Disturbingly, different DAA tools can sometimes produce quite discordant results, opening up the possibility of cherry-picking the tool in favor of one’s own hypothesis. To recommend the best DAA tool or practice to the field, a comprehensive evaluation that covers as many biologically relevant scenarios as possible is critically needed.

    Results

    We performed by far the most comprehensive evaluation of existing DAA tools using real data-based simulations. We found that DAA methods explicitly addressing compositional effects, such as ANCOM-BC, ALDEx2, metagenomeSeq (fitFeatureModel), and DACOMP, did have improved performance in false-positive control. But they are still not optimal: type 1 error inflation or low statistical power was observed in many settings. The recent LDM method generally had the best power, but its false-positive control in the presence of strong compositional effects was not satisfactory. Overall, none of the evaluated methods is simultaneously robust, powerful, and flexible, which makes the selection of the best DAA tool difficult. To meet the analysis needs, we designed an optimized procedure, ZicoSeq, drawing on the strengths of the existing DAA methods. We show that ZicoSeq generally controlled false positives across settings, and its power was among the highest. Application of the DAA methods to a large collection of real datasets revealed patterns similar to those observed in the simulation studies.

    Conclusions

    Based on the benchmarking study, we conclude that none of the existing DAA methods evaluated can be applied blindly to any real microbiome dataset. The applicability of an existing DAA method depends on specific settings, which are usually unknown a priori. To circumvent the difficulty of selecting the best DAA tool in practice, we designed ZicoSeq, which addresses the major challenges in DAA and remedies the drawbacks of existing DAA methods. ZicoSeq can be applied to microbiome datasets from diverse settings and is a useful DAA tool for robust microbiome biomarker discovery.
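
The compositional effects mentioned in this abstract can be made concrete with a small worked example. It is only an arithmetic illustration, not part of ZicoSeq or of any tool named above; the taxon counts and the helper name `relative_abundance` are hypothetical.

```python
from fractions import Fraction


def relative_abundance(counts):
    """Convert absolute counts to exact relative abundances that sum to 1."""
    total = sum(counts)
    return [Fraction(c, total) for c in counts]


# Hypothetical absolute abundances of four taxa in two conditions.
# Only the first taxon truly changes (100 -> 400); the rest are constant.
control_abs = [100, 100, 100, 100]
case_abs = [400, 100, 100, 100]

control_rel = relative_abundance(control_abs)  # each taxon: 1/4
case_rel = relative_abundance(case_abs)        # 4/7, 1/7, 1/7, 1/7

for taxon, (before, after) in enumerate(zip(control_rel, case_rel), start=1):
    print(f"taxon {taxon}: {float(before):.3f} -> {float(after):.3f}")
```

Sequencing data carry only relative information, so the three unchanged taxa appear to drop from 1/4 to 1/7. A method that ignores this effect may flag them as differentially abundant, which is one way the false positives discussed in the abstract can arise.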
  7. Abstract

    We are concerned with the incompressible limit of global-in-time strong solutions with arbitrarily large initial velocity for the three-dimensional compressible viscoelastic equations. The incompressibility is achieved by a large value of the volume viscosity, which is different from the low Mach number limit. To obtain the uniform estimates, we establish estimates for the potential part and the divergence-free part of the velocity. As the volume viscosity goes to infinity, the dispersion associated with the pressure waves tends to disappear, but the large volume viscosity provides strong dissipation on the potential part of the velocity, forcing the flow to be almost incompressible.

     
  8. Malik, Harmit S. (Ed.)
    A growing body of theoretical and experimental evidence suggests that intramolecular epistasis is a major determinant of rates and patterns of protein evolution and imposes a substantial constraint on the evolution of novel protein functions. Here, we examine the role of intramolecular epistasis in the recurrent evolution of resistance to cardiotonic steroids (CTS) across tetrapods, which occurs via specific amino acid substitutions to the α-subunit family of Na,K-ATPases (ATP1A). After identifying a series of recurrent substitutions at two key sites of ATP1A that are predicted to confer CTS resistance in diverse tetrapods, we performed protein engineering experiments to test the functional consequences of introducing these substitutions onto divergent species backgrounds. In line with previous results, we find that substitutions at these sites can have substantial background-dependent effects on CTS resistance. Globally, however, these substitutions also have pleiotropic effects that are consistent with additive rather than background-dependent effects. Moreover, the magnitude of a substitution’s effect on activity does not depend on the overall extent of ATP1A sequence divergence between species. Our results suggest that epistatic constraints on the evolution of CTS-resistant forms of Na,K-ATPase likely depend on a small number of sites, with little dependence on overall levels of protein divergence. We propose that dependence on a limited number of sites may account for the convergent CTS resistance substitutions observed among taxa with highly divergent Na,K-ATPases (see S1 Text for Spanish translation).
  9. Pears (Pyrus sp.) are widely cultivated in China, and their yield accounts for more than 60% of global pear production. The fungal pathogen Valsa pyri is a major causal agent of pear canker disease, which results in enormous losses of pear production in northern China. In this study, we characterized VpxlnR, a Zn2Cys6 transcription factor that contains one GAL4 domain and one fungal-trans domain. Expression of the vpxlnR gene was upregulated in the invasion stage of V. pyri. To investigate its functions, we constructed gene deletion mutants and complementary strains. We observed that the growth of the vpxlnR mutants was reduced on potato dextrose agar (PDA) and on Czapek medium plus glucose or sucrose compared with that of the wild-type strain. Additionally, vpxlnR mutants lost the ability to form fruiting bodies. Moreover, vpxlnR mutants were more susceptible to hydrogen peroxide (H2O2) and salicylic acid (SA) and showed reduced virulence at the early infection stage. Based on a previous study, we searched the V. pyri genome for VpxlnR-interacting motifs containing NRHKGNCCGM and obtained 354 target genes, of which 148 had Clusters of Orthologous Groups (COG) terms. PHI-BLAST was used to identify virulence-related genes, and we found 28 hits. Eight of the 28 PHI-BLAST hits were further assessed by yeast one-hybrid (Y1H) assays, and five target genes, salicylate hydroxylase (VP1G_09520), serine/threonine-protein kinase (VP1G_03128), alpha-xylosidase (VP1G_06369), G-protein beta subunit (VP1G_02856), and acid phosphatase (VP1G_03782), could interact with VpxlnR in vivo. Their transcript levels were reduced in one or two vpxlnR mutants. Taken together, these findings imply that VpxlnR is a key regulator of growth, development, stress, and virulence through controlling genes involved in signaling pathways and extracellular enzyme activities in V. pyri. The motifs interacting with VpxlnR also provide new insights into the molecular mechanism of xlnR proteins.
  10.