

Title: Inferring HIV transmission patterns from viral deep-sequence data via latent typed point processes
ABSTRACT

Viral deep-sequencing data play a crucial role in understanding disease transmission network flows, providing higher resolution than standard Sanger sequencing. To more fully utilize these rich data and account for the uncertainties in the outcomes of phylogenetic analyses, we propose a spatial Poisson process model to uncover human immunodeficiency virus (HIV) transmission flow patterns at the population level. We represent pairings of individuals with viral sequence data as typed points, with coordinates representing covariates such as gender and age and point types representing the unobserved transmission statuses (linkage and direction). Points are associated with observed scores on the strength of evidence for each transmission status, obtained through standard deep-sequence phylogenetic analysis. Our method jointly infers the latent transmission statuses for all pairings and the transmission flow surface on the source-recipient covariate space. In contrast to existing methods, our framework does not require preclassification of the transmission statuses of data points, and instead learns them probabilistically through a fully Bayesian inference scheme. By directly modeling continuous spatial processes with smooth densities, our method enjoys significant computational advantages over previous methods that rely on discretization of the covariate space. We demonstrate that our framework can capture age structures in HIV transmission at high resolution, yielding valuable insights in a case study on viral deep-sequencing data from Southern Uganda.
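As a rough, schematic sketch of the modeling idea (our simplified notation, not the paper's exact specification), each candidate pairing i can be viewed as a point x_i in source-recipient covariate space (for example, the two individuals' ages) carrying a latent transmission status z_i and an observed vector of deep-sequence phylogenetic scores s_i:

    % Schematic sketch only; symbols below are illustrative, not the paper's notation.
    % Points of each latent type k are modeled as a Poisson process with a smooth intensity:
    \[
      \{x_i : z_i = k\} \sim \operatorname{PoissonProcess}\!\big(\lambda_k(\cdot)\big),
      \qquad k \in \{\text{a} \to \text{b},\; \text{b} \to \text{a},\; \text{unlinked}\},
    \]
    % the phylogenetic scores are modeled conditionally on the latent type,
    \[
      s_i \mid z_i = k \;\sim\; f_k(\cdot),
    \]
    % so the conditional posterior for each pairing's status combines both sources of evidence,
    \[
      \Pr(z_i = k \mid x_i, s_i) \;\propto\; \lambda_k(x_i)\, f_k(s_i),
    \]
    % and the normalized directed intensities define the transmission flow surface over the covariate space.

In the full Bayesian scheme described in the abstract, the intensities and the latent statuses are inferred jointly, rather than with the intensities held fixed as in this simplified conditional view.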

 
NSF-PAR ID: 10491330
Publisher / Repository: Oxford University Press
Journal Name: Biometrics
Volume: 80
Issue: 1
ISSN: 0006-341X
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. We introduce the Operational Genomic Unit (OGU), a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent of taxonomic classification, granting the possibility of maximal resolution of community composition, and it organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance, and supervised learning, while also supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldom applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in one synthetic and two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome datasets, and such patterns remain detectable at very low metagenomic sequencing depths. Compared with the taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and with the analysis of 16S rRNA gene amplicon sequence variants, this method is superior in yielding biologically relevant insights, including stronger correlations with body environment and host sex in the Human Microbiome Project dataset, and more accurate prediction of human age from gut microbiomes in the Finnish population. We provide Woltka, a bioinformatics tool that implements this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate OGU adoption in future metagenomics studies. Importance: Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities. However, current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution compared to 16S rRNA amplicon sequence variant analysis. To address these challenges, we introduce Operational Genomic Units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition while (ii) permitting the use of phylogeny-aware tools. Our analysis of real-world datasets shows several advantages over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We therefore propose the adoption of OGUs as standard practice in metagenomic studies.
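As a minimal illustration of the OGU idea above (a sketch under our own simplifying assumptions, not Woltka's actual implementation; the sample and genome identifiers are made up), alignment hits can be aggregated per reference genome to form a sample-by-OGU feature table:

    from collections import Counter

    # Hypothetical alignment hits: sample -> list of (read_id, reference_genome_id).
    # In practice these would be parsed from SAM/BLAST-like alignment output.
    hits = {
        "sample_A": [("r1", "G000123"), ("r2", "G000123"), ("r3", "G000456")],
        "sample_B": [("r1", "G000456"), ("r2", "G000789")],
    }

    # Count hits per reference genome (the OGU) within each sample,
    # with no taxonomic assignment step in between.
    ogu_table = {sample: Counter(genome for _read, genome in pairs)
                 for sample, pairs in hits.items()}

    # Arrange as a dense sample-by-OGU matrix suitable for standard community
    # ecology tooling (diversity, ordination, differential abundance, etc.).
    all_ogus = sorted({g for counts in ogu_table.values() for g in counts})
    matrix = [[ogu_table[s].get(g, 0) for g in all_ogus] for s in sorted(ogu_table)]
    print(all_ogus)
    print(matrix)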
  2. The National Space Weather Action Plan, NERC Reliability Standard TPL-007-1 and -2 with the associated FERC Orders 779 and 830 and subsequent actions, the emerging standard TPL-007-3, and Executive Order 13744 have prepared the regulatory framework and roadmap for assessing and mitigating the impact of space weather on critical infrastructure. These actions have resulted in an emerging set of benchmarks against which the statistical probability of damage to critical components, such as power transmission system high-voltage transformers, can be assessed; for the first time, the impacts on the intensity of geomagnetically induced currents (GICs) due to the spatial variability of the geomagnetic field and of the Earth's electrical conductivity structure can be examined systematically. While at present not a strict requirement of the existing reliability standards, there is growing evidence that the strongly three-dimensional nature of the electrical conductivity structure of the North American crust and mantle (hereafter 'ground conductivity') has a first-order impact on GIC intensity, with considerable local and regional variability. The strongly location-dependent intensification and attenuation of the ground electric field due to 3-D ground conductivity variations has an equivalent impact on assessing the risk to critical infrastructure from HEMP (E3 phase) sources of geomagnetic disturbances (GMDs) as it does for natural GMDs. From 2006 to 2018, Oregon State University (OSU), under NSF EarthScope Program support, installed instruments and acquired ground electric and magnetic field time series (magnetotelluric, or MT) data on a grid of station locations spaced ~70 km apart, at 1161 long-period MT stations covering nearly ⅔ of CONUS. The US Geological Survey completed 47 additional MT stations using functionally identical instrumentation, and the two data sets were merged and made available in the public domain. OSU and its project collaborators have also collected data at hundreds of wider-bandwidth, more densely spaced MT stations under other project support, and these have been or will be released for public access in the near future. NSF funding was not available to collect EarthScope MT data over the remaining 1/3 of CONUS in the southern tier of states, in a band from central California in the west to Alabama in the east and extending along the Gulf Coast and Deep South. OSU, with NASA support just received, plans to complete MT station installation in the remainder of California this year, and with additional support both anticipated and proposed, we hope to complete the MT array in the remainder of CONUS. For the first time, this will provide national-scale 3-D electrical conductivity/MT impedance data throughout the US portion of the contiguous North American power grid. Complementary planning and proposal efforts are underway in Canada, including collaborations between OSU, Athabasca University, and other Canadian academic and industry groups. In the present work, we apply algorithms we have developed to make use of real-time streams of US Geological Survey, Natural Resources Canada (and other) magnetic observatory data, together with the EarthScope and other MT data sets, to provide quasi-real-time predictions of the geomagnetically induced voltages at high-voltage transmission system transformers/power buses. This goes beyond the statistical benchmarking process currently encapsulated in the NERC reliability standards.
We seek initially to provide real-time information to power utility control room operators, in the form of a heat map showing which assets are likely experiencing stress due to induced currents. These assessments will be ground-truthed against transmission system sensor data (PMUs, GIC monitors, and voltage waveforms and harmonics where available), and by applying machine learning methods we hope to extend this approach to transmission systems that have sparse or non-existent GIC monitoring sensor infrastructure. Ultimately, by incorporating predictive models of the geomagnetic field that use satellite data as inputs rather than real-time ground magnetic field measurements, a near-term probabilistic assessment of risk to transformers may be possible, ideally providing at least a 15-minute forecast to utility operators. There has been a concerted effort by NOAA to develop a real-time geomagnetically induced ground electric field data product that makes use of our EarthScope MT data and thereby includes the strong impacts of 3-D ground conductivity structure on GICs. Both OSU and the USGS have developed methods to determine the GIC-related voltages at substations by integrating the ground electric fields along power transmission line paths. Under National Science Foundation support, the present team of investigators is taking the next step of applying the GIC-related voltages as inputs to quasi-real-time power flow models of the power transmission grid, in order to obtain realistic and verifiable predictions of the intensity of induced GICs, the reactive power loss due to GICs, and the GIC effects on current and voltage waveforms, such as harmonic distortion. As we work toward integrating predicted induced substation voltages with power flow models, we have modified the RTS-GMLC (Reliability Test System Grid Modernization Lab Consortium) test case (https://github.com/GridMod/RTS-GMLC) by moving its geographic location to central Oregon. With the assistance of LANL, we have the complete AC and DC network of the RTS-GMLC case, and we are working to integrate the complete case information into Julia (using LANL's PowerModels and PowerModelsGMD packages) or into PowerWorld. Along a parallel track, we have performed GIC voltage calculations with our geophysical algorithm for a realistic GMD event (the Halloween event) on the test case, resulting in GIC transmission line voltages that can be added into our power system model. We will discuss our progress in integrating the geophysical estimates of transformer voltages with our DC model using LANL's Julia-based PowerModelsGMD package, running power flow simulations on the test case, and determining the GIC flows and their possible impacts on the power waveforms in the system elements.
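The substation-voltage calculation mentioned above is, at its core, a line integral of the ground electric field along each transmission line path. The toy sketch below (our own illustration with made-up coordinates and a made-up uniform field, not the OSU or USGS production codes) approximates V = ∫ E · dl as a sum over straight segments:

    import numpy as np

    # Toy transmission-line path as a polyline of (x, y) positions in meters.
    path = np.array([[0.0, 0.0], [20_000.0, 5_000.0], [45_000.0, 5_000.0]])

    # Toy horizontal geoelectric field (Ex, Ey) in V/m; a real calculation would
    # evaluate a spatially varying field derived from magnetic observatory data
    # and the 3-D ground conductivity / MT impedance information.
    def geoelectric_field(xy):
        return np.array([2e-3, -1e-3])

    # Approximate V = integral of E . dl by summing E(midpoint) . (segment vector).
    segments = np.diff(path, axis=0)
    midpoints = 0.5 * (path[:-1] + path[1:])
    voltage = sum(float(np.dot(geoelectric_field(m), d))
                  for m, d in zip(midpoints, segments))
    print(f"GIC-related line voltage: {voltage:.1f} V")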
  3. The use of viral sequence data to inform public health interventions has become increasingly common in epidemiology. Such methods typically rely on multiple sequence alignments and phylogenies estimated from the sequence data. Like all estimation techniques, these steps are error-prone, yet the impacts of such imperfections on downstream epidemiological inferences are poorly understood. To address this, we executed multiple commonly used viral phylogenetic analysis workflows on simulated viral sequence data modeling Human Immunodeficiency Virus (HIV), Hepatitis C Virus (HCV), and Ebolavirus, and we computed multiple metrics of accuracy motivated by transmission-clustering techniques. For multiple sequence alignment, MAFFT consistently outperformed MUSCLE and Clustal Omega in both accuracy and runtime. For phylogenetic inference, FastTree 2, IQ-TREE, RAxML-NG, and PhyML had similar topological accuracies, but branch lengths and pairwise distances were consistently most accurate in phylogenies inferred by RAxML-NG. However, FastTree 2 was the fastest by orders of magnitude, and when the other tools were used to optimize branch lengths along a fixed FastTree 2 topology, the resulting phylogenies had accuracies indistinguishable from their original counterparts at a fraction of the runtime.
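The "fixed FastTree 2 topology with re-optimized branch lengths" strategy highlighted above can be sketched as a small driver script. The command-line flags below reflect common usage of MAFFT, FastTree 2, and RAxML-NG and should be treated as assumptions to verify against your installed versions; the file names are placeholders:

    import subprocess

    # 1) Align sequences with MAFFT (writes the alignment to stdout).
    with open("aligned.fasta", "w") as out:
        subprocess.run(["mafft", "--auto", "unaligned.fasta"], stdout=out, check=True)

    # 2) Infer a topology quickly with FastTree 2 (nucleotide data, GTR model).
    with open("fasttree.nwk", "w") as out:
        subprocess.run(["FastTree", "-nt", "-gtr", "aligned.fasta"], stdout=out, check=True)

    # 3) Re-optimize branch lengths and model parameters on the fixed FastTree
    #    topology with RAxML-NG's evaluate mode.
    subprocess.run(["raxml-ng", "--evaluate", "--msa", "aligned.fasta",
                    "--model", "GTR+G", "--tree", "fasttree.nwk"], check=True)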
  4. Phylogenetic placement, used widely in ecological analyses, seeks to add a new species to an existing tree. A deep learning approach was previously proposed to estimate the distance between query and backbone species by building a map from gene sequences to a high-dimensional space that preserves species tree distances; a distance-based placement method is then used to place the queries on that species tree. In this paper, we examine the appropriate geometry for faithfully representing tree distances while embedding gene sequences. Theory predicts that hyperbolic spaces should provide a drastic reduction in distance distortion compared to the conventional Euclidean space. Nevertheless, hyperbolic embedding imposes its own unique challenges related to arithmetic operations, exponentially growing functions, and limited bit precision, and we address these challenges. Our results confirm that hyperbolic embeddings have substantially lower distance errors than Euclidean embeddings. However, these better-estimated distances do not always lead to better phylogenetic placement. We then show that the deep learning framework can be used not just to place queries on a backbone tree but also to update it to obtain a fully resolved tree. With our hyperbolic embedding framework, species trees can be updated remarkably accurately with only a handful of genes.
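For intuition about the geometry discussed above, the toy sketch below (our own illustration, not the paper's embedding pipeline) contrasts Euclidean distance with the Poincaré-ball distance commonly used for hyperbolic embeddings; distances grow rapidly near the boundary of the unit ball, which is what lets tree-like, exponentially growing metrics be represented with low distortion:

    import numpy as np

    def euclidean_distance(u, v):
        return float(np.linalg.norm(u - v))

    def poincare_distance(u, v):
        # Distance in the Poincare ball model of hyperbolic space (points must have
        # norm < 1): d(u, v) = arccosh(1 + 2||u-v||^2 / ((1-||u||^2)(1-||v||^2))).
        nu, nv = np.dot(u, u), np.dot(v, v)
        diff = np.dot(u - v, u - v)
        return float(np.arccosh(1.0 + 2.0 * diff / ((1.0 - nu) * (1.0 - nv))))

    a = np.array([0.10, 0.00])
    b = np.array([0.95, 0.00])   # near the boundary of the unit ball
    c = np.array([0.00, 0.95])   # also near the boundary, in another direction
    print(euclidean_distance(b, c), poincare_distance(b, c))  # hyperbolic distance is far larger
    print(euclidean_distance(a, b), poincare_distance(a, b))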
  5. Analysis of phylogenetic trees has become an essential tool in epidemiology. Likelihood-based methods fit models to phylogenies to draw inferences about the phylodynamics and history of viral transmission. However, these methods are often computationally expensive, which limits the complexity and realism of phylodynamic models and makes them ill-suited for informing policy decisions in real time during rapidly developing outbreaks. Likelihood-free methods using deep learning are pushing the boundaries of inference beyond these constraints. In this paper, we extend, compare, and contrast a recently developed deep learning method for likelihood-free inference from trees. We trained multiple deep neural networks using phylogenies from simulated outbreaks that spread among five locations and found that they achieve close to the same levels of accuracy as Bayesian inference under the true simulation model. We compared the robustness to model misspecification of a trained neural network with that of a Bayesian method and found that both had comparable performance, converging on similar biases. We also implemented an uncertainty quantification method called conformalized quantile regression, whose intervals we demonstrate have patterns of sensitivity to model misspecification similar to those of Bayesian highest posterior density (HPD) intervals and overlap substantially with the HPDs, but with lower precision (they are more conservative). Finally, we trained and tested a neural network against phylogeographic data from a recent study of the SARS-CoV-2 pandemic in Europe and obtained similar estimates of region-specific epidemiological parameters and of the location of the common ancestor in Europe. Along with being as accurate and robust as likelihood-based methods, our trained neural networks are on average over three orders of magnitude faster after training. Our results support the notion that neural networks can be trained with simulated data to accurately mimic the good and bad statistical properties of the likelihood functions of generative phylogenetic models.
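As a generic, minimal sketch of conformalized quantile regression (our own toy regression example, not the paper's phylodynamic setting), one fits lower and upper quantile regressors and then adjusts the interval by a conformity quantile computed on a held-out calibration split:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(2000, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=2000)
    X_train, y_train = X[:1000], y[:1000]
    X_cal, y_cal = X[1000:1500], y[1000:1500]   # calibration split
    X_test = X[1500:]

    alpha = 0.1  # target 90% coverage
    lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_train, y_train)
    hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_train, y_train)

    # Conformity scores: how far calibration points fall outside the raw interval.
    scores = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))
    # Finite-sample corrected (1 - alpha) quantile of the conformity scores.
    q = np.quantile(scores, np.ceil((1 - alpha) * (len(scores) + 1)) / len(scores))

    # Conformalized prediction intervals on new data.
    lower = lo.predict(X_test) - q
    upper = hi.predict(X_test) + q
    print("average interval width:", float(np.mean(upper - lower)))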

     