


Search for: All records

Creators/Authors contains: "Ji, Xiang"


  1. Abstract

    Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. As these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. Thus, we also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of sampling-based inference, namely Bayesian inference via Hamiltonian Monte Carlo, under random-effects substitution models across large trees and state-spaces.

    Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random-effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is a more adequate model than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. Simulations reveal that random-effects substitution models can accommodate both negligible and radical departures from the underlying base substitution model. We show that our gradient-based inference approach is over an order of magnitude more time efficient than conventional approaches.
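To make the idea of a random-effects extension concrete, here is a minimal sketch in Python. It assumes the random effects enter multiplicatively on the log scale, so each off-diagonal rate is the HKY base rate times `exp(effect)`; the abstract does not state the exact parameterization, so treat this form as an assumption. The function names (`hky_rates`, `add_random_effects`, `kolmogorov_gap`) and all numeric values are hypothetical. The reversibility check uses Kolmogorov's cycle criterion on three-state cycles, which is a standard test for a chain whose rates are all positive and is not necessarily the diagnostic used in the paper.

```python
import math
from itertools import combinations

STATES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky_rates(kappa, freqs):
    """Off-diagonal HKY rates: kappa * pi_j for transitions, pi_j for transversions."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(STATES):
        for j, b in enumerate(STATES):
            if i != j:
                q[i][j] = (kappa if (a, b) in TRANSITIONS else 1.0) * freqs[j]
    return q

def add_random_effects(base, effects):
    """Assumed form: scale each off-diagonal base rate by exp(effect), then set the diagonal."""
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                q[i][j] = base[i][j] * math.exp(effects[i][j])
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so rows sum to zero
    return q

def kolmogorov_gap(q):
    """Largest |log(forward cycle product) - log(backward cycle product)| over 3-cycles.

    A gap of zero (up to rounding) means the chain is reversible.
    """
    gap = 0.0
    for i, j, k in combinations(range(4), 3):
        forward = math.log(q[i][j]) + math.log(q[j][k]) + math.log(q[k][i])
        backward = math.log(q[i][k]) + math.log(q[k][j]) + math.log(q[j][i])
        gap = max(gap, abs(forward - backward))
    return gap

# Hypothetical values: base frequencies, kappa, and one nonzero C->T effect.
freqs = [0.3, 0.2, 0.2, 0.3]
base = hky_rates(2.0, freqs)
no_effects = [[0.0] * 4 for _ in range(4)]
one_effect = [[0.0] * 4 for _ in range(4)]
one_effect[1][3] = 0.8

reversible_q = add_random_effects(base, no_effects)
nonreversible_q = add_random_effects(base, one_effect)
```

With all effects at zero the sketch reduces to plain HKY, which is reversible; a single asymmetric effect is enough to break the cycle criterion, which is the kind of departure the random effects are meant to capture.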

     
    Free, publicly-accessible full text available May 7, 2025
  2. Abstract

    Following a duplication, the resulting paralogs tend to diverge. While mutation and natural selection can accelerate this process, they can also slow it. Here, we quantify the paralog homogenization that is caused by point mutations and interlocus gene conversion (IGC). Among 164 duplicated teleost genes, the median percentage of postduplication codon substitutions that arise from IGC rather than point mutation is estimated to be between 7% and 8%. By differentiating between the nonsynonymous codon substitutions that homogenize the protein sequences of paralogs and the nonhomogenizing nonsynonymous substitutions, we estimate the homogenizing nonsynonymous rates to be higher for 163 of the 164 teleost data sets as well as for all 14 data sets of duplicated yeast ribosomal protein-coding genes that we consider. For all 14 yeast data sets, the estimated homogenizing nonsynonymous rates exceed the synonymous rates.

     
  3. Chekouo, Thierry (Ed.)

    Inferring dependencies between mixed-type biological traits while accounting for evolutionary relationships between specimens is of great scientific interest yet remains infeasible when trait and specimen counts grow large. The state-of-the-art approach uses a phylogenetic multivariate probit model to accommodate binary and continuous traits via a latent variable framework, and utilizes an efficient bouncy particle sampler (BPS) to tackle the computational bottleneck: integrating many latent variables from a high-dimensional truncated normal distribution. This approach breaks down as the number of specimens grows and fails to reliably characterize conditional dependencies between traits. Here, we propose an inference pipeline for phylogenetic probit models that greatly outperforms BPS. The novelty lies in 1) a combination of the recent Zigzag Hamiltonian Monte Carlo (Zigzag-HMC) with linear-time gradient evaluations and 2) a joint sampling scheme for highly correlated latent variables and correlation matrix elements. In an application exploring HIV-1 evolution from 535 viruses, the inference requires joint sampling from an 11,235-dimensional truncated normal and a 24-dimensional covariance matrix. Our method yields a 5-fold speedup compared to BPS and makes it possible to learn partial correlations between candidate viral mutations and virulence. Computational speedup now enables us to tackle even larger problems: we study the evolution of influenza H1N1 glycosylations on around 900 viruses. For broader applicability, we extend the phylogenetic probit model to incorporate categorical traits, and demonstrate its use to study Aquilegia flower and pollinator co-evolution.

     
  4. Yang, Ziheng (Ed.)
    Abstract

    Divergence time estimation is crucial to provide temporal signals for dating biologically important events from species divergence to viral transmissions in space and time. With the advent of high-throughput sequencing, recent Bayesian phylogenetic studies have analyzed hundreds to thousands of sequences. Such large-scale analyses challenge divergence time reconstruction by requiring inference on highly correlated internal node heights that often become computationally infeasible. To overcome this limitation, we explore a ratio transformation that maps the original $N-1$ internal node heights into a space of one height parameter and $N-2$ ratio parameters. To make the analyses scalable, we develop a collection of linear-time algorithms to compute the gradient and Jacobian-associated terms of the log-likelihood with respect to these ratios. We then apply Hamiltonian Monte Carlo sampling with the ratio transform in a Bayesian framework to learn the divergence times in 4 pathogenic viruses (West Nile virus, rabies virus, Lassa virus, and Ebola virus) and the coralline red algae. Our method both resolves a mixing issue in the West Nile virus example and improves inference efficiency by at least 5-fold for the Lassa and rabies virus examples as well as for the algae example. Our method now also makes it computationally feasible to incorporate mixed-effects molecular clock models for the Ebola virus example, confirms the findings from the original study, and reveals clearer multimodal distributions of the divergence times of some clades of interest.
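The ratio transformation described above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's algorithm: it assumes all tips are sampled at height 0, so each non-root internal node's ratio is simply its height divided by its parent's height. With serially sampled tips the definition would need an offset that the abstract does not spell out. The function names, the example tree, and its heights are hypothetical.

```python
def to_ratios(heights, parent, root):
    """Map internal node heights to (root height, {node: height / parent height})."""
    ratios = {n: heights[n] / heights[parent[n]] for n in heights if n != root}
    return heights[root], ratios

def from_ratios(root_height, ratios, parent, root):
    """Invert the transform: each node's height is its ratio times its parent's height."""
    heights = {root: root_height}

    def resolve(node):
        # Recurse toward the root until a node with a known height is reached.
        if node not in heights:
            heights[node] = ratios[node] * resolve(parent[node])
        return heights[node]

    for node in ratios:
        resolve(node)
    return heights

# Hypothetical 5-tip tree: N - 1 = 4 internal nodes named root, a, b, c.
# root has children a and c; a has child b plus one tip; b and c each join two tips.
parent = {"a": "root", "b": "a", "c": "root"}
heights = {"root": 10.0, "a": 6.0, "b": 3.0, "c": 4.0}

root_height, ratios = to_ratios(heights, parent, "root")
recovered = from_ratios(root_height, ratios, parent, "root")
```

The four heights become one root height and three ratios, matching the $N-1 \to 1 + (N-2)$ count in the abstract. Under the contemporaneous-tip assumption each ratio lies strictly between 0 and 1 regardless of the other ratios, which removes the ordering constraints that tie the raw node heights together.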

     
  5. Memristors are promising candidates for constructing neural networks. However, their working mechanism differs from that of the addressing transistors, which can result in a scaling mismatch that may hinder efficient integration. Here, we demonstrate two-terminal MoS2 memristors that work with a charge-based mechanism similar to that in transistors, which enables homogeneous integration with MoS2 transistors to realize one-transistor-one-memristor addressable cells for assembling programmable networks. The homogeneously integrated cells are implemented in a 2×2 network array to demonstrate the enabled addressability and programmability. The potential for assembling a scalable network is evaluated in a simulated neural network using the realistic device parameters obtained, which achieves over 91% pattern recognition accuracy. This study also reveals a generic mechanism and strategy that can be applied to other semiconducting devices for the engineering and homogeneous integration of memristive systems.
  6. Abstract The fluorescent glutamate indicator iGluSnFR enables imaging of neurotransmission with genetic and molecular specificity. However, existing iGluSnFR variants exhibit low in vivo signal-to-noise ratios, saturating activation kinetics and exclusion from postsynaptic densities. Using a multiassay screen in bacteria, soluble protein and cultured neurons, we generated variants with improved signal-to-noise ratios and kinetics. We developed surface display constructs that improve iGluSnFR’s nanoscopic localization to postsynapses. The resulting indicator iGluSnFR3 exhibits rapid nonsaturating activation kinetics and reports synaptic glutamate release with decreased saturation and increased specificity versus extrasynaptic signals in cultured neurons. Simultaneous imaging and electrophysiology at individual boutons in mouse visual cortex showed that iGluSnFR3 transients report single action potentials with high specificity. In vibrissal sensory cortex layer 4, we used iGluSnFR3 to characterize distinct patterns of touch-evoked feedforward input from thalamocortical boutons and both feedforward and recurrent input onto L4 cortical neuron dendritic spines. 
  7. Abstract Summary

    Mutations sometimes increase contagiousness for evolving pathogens. During an epidemic, scientists use viral genome data to infer a shared evolutionary history and connect this history to geographic spread. We propose a model that directly relates a pathogen’s evolution to its spatial contagion dynamics—effectively combining the two epidemiological paradigms of phylogenetic inference and self-exciting process modeling—and apply this phylogenetic Hawkes process to a Bayesian analysis of 23 421 viral cases from the 2014 to 2016 Ebola outbreak in West Africa. The proposed model is able to detect individual viruses with significantly elevated rates of spatiotemporal propagation for a subset of 1610 samples that provide genome data. Finally, to facilitate model application in big data settings, we develop massively parallel implementations for the gradient and Hessian of the log-likelihood and apply our high-performance computing framework within an adaptively pre-conditioned Hamiltonian Monte Carlo routine.

    Supplementary information

    Supplementary data are available at Bioinformatics online.
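For readers unfamiliar with self-exciting processes, the following Python sketch shows the conditional intensity of a basic Hawkes process. It is purely temporal and uses an exponential kernel, both of which are simplifying assumptions: the model summarized above is spatiotemporal and ties case-specific rates to the phylogeny, and none of that is represented here. The function name and parameter names (`background`, `weight`, `decay`) are hypothetical.

```python
import math

def hawkes_intensity(t, event_times, background, weight, decay):
    """Conditional intensity at time t for a temporal Hawkes process.

    lambda(t) = background + weight * sum_i decay * exp(-decay * (t - t_i)),
    summed over past events t_i < t. Each past event raises the rate of
    future events, and that boost fades at rate `decay`.
    """
    excitation = sum(
        decay * math.exp(-decay * (t - ti)) for ti in event_times if ti < t
    )
    return background + weight * excitation

# Hypothetical event times (in days) and parameters.
events = [0.0, 0.5, 0.6]
soon_after = hawkes_intensity(0.7, events, background=0.5, weight=0.8, decay=1.0)
much_later = hawkes_intensity(10.0, events, background=0.5, weight=0.8, decay=1.0)
```

Shortly after a cluster of events the intensity sits well above the background rate, and long afterwards it decays back toward it. In a model of the kind described above, a case with an elevated rate of propagation would correspond to a larger individual contribution to this sum.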

     
  8. Ndeffo Mbah, Martial L (Ed.)
    The SARS-CoV-2 pandemic led to closure of nearly all K-12 schools in the United States of America in March 2020. Although reopening K-12 schools for in-person schooling is desirable for many reasons, officials understand that risk reduction strategies and detection of cases are imperative in creating a safe return to school. Furthermore, consequences of reclosing recently opened schools are substantial and impact teachers, parents, and ultimately educational experiences in children. To address competing interests in meeting educational needs with public safety, we compare the impact of physical separation through school cohorts on SARS-CoV-2 infections against policies acting at the level of individual contacts within classrooms. Using an age-stratified Susceptible-Exposed-Infected-Removed model, we explore influences of reduced class density, transmission mitigation, and viral detection on cumulative prevalence. We consider several scenarios over a 6-month period including (1) multiple rotating cohorts in which students cycle through in-person instruction on a weekly basis, (2) parallel cohorts with in-person and remote learning tracks, (3) the impact of a hypothetical testing program with ideal and imperfect detection, and (4) varying levels of aggregate transmission reduction. Our mathematical model predicts that reducing the number of contacts through cohorts produces a larger effect than diminishing transmission rates per contact. Specifically, the latter approach requires dramatic reduction in transmission rates in order to achieve a comparable effect in minimizing infections over time. Further, our model indicates that surveillance programs using less sensitive tests may be adequate in monitoring infections within a school community by both keeping infections low and allowing for a longer period of instruction. 
    Lastly, we underscore the importance of factoring in infection prevalence when deciding whether a local outbreak is serious enough to require reverting to remote learning.
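As a rough illustration of the compartmental structure behind such an analysis, here is a single-group Susceptible-Exposed-Infected-Removed sketch in Python using forward-Euler steps. It is deliberately much simpler than the model described above: it has no age strata, cohorts, or testing, and every parameter value is hypothetical. In this simplified form the transmission parameter `beta` lumps together contact rate and per-contact transmission, so it cannot by itself reproduce the comparison between cohort-based and per-contact interventions.

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, r0, days, dt=0.1):
    """Forward-Euler integration of a single-group SEIR model.

    beta: transmission rate, sigma: rate of leaving the exposed class,
    gamma: removal rate. Returns the final (S, E, I, R) counts.
    """
    s, e, i, r = float(s0), float(e0), float(i0), float(r0)
    n = s + e + i + r
    for _ in range(int(round(days / dt))):
        new_exposed = beta * s * i / n * dt
        new_infectious = sigma * e * dt
        new_removed = gamma * i * dt
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
    return s, e, i, r

# Hypothetical school of 1000 people with one initial infection, over about 6 months.
baseline = simulate_seir(0.5, 0.2, 0.1, 999, 0, 1, 0, days=180)
mitigated = simulate_seir(0.25, 0.2, 0.1, 999, 0, 1, 0, days=180)

# Cumulative infections = everyone who has left the susceptible class.
cumulative_baseline = 1000 - baseline[0]
cumulative_mitigated = 1000 - mitigated[0]
```

Halving `beta` lowers the cumulative infection count in this sketch. Distinguishing fewer contacts from lower per-contact transmission, as the abstract does, requires the explicit cohort and contact structure that this version omits.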