
This content will become publicly available on May 2, 2023

Title: Assessing in vivo mutation frequencies and creating a high-resolution genome-wide map of fitness costs of Hepatitis C virus
Like many viruses, Hepatitis C Virus (HCV) has a high mutation rate, which helps the virus adapt quickly, but mutations come with fitness costs. Fitness costs can be studied by different approaches, such as experimental or frequency-based approaches. The frequency-based approach is particularly useful for estimating in vivo fitness costs, but it works best when deep sequencing data from many hosts are available. In this study, we applied the frequency-based approach to a large dataset of 195 patients and estimated the fitness costs of mutations at 7957 sites along the HCV genome. We used beta regression and random forest models to better understand how different factors influenced fitness costs. Our results revealed that costs of nonsynonymous mutations were three times higher than those of synonymous mutations, and mutations at nucleotides A or T had higher costs than those at C or G. Genome location had a modest effect, with lower costs for mutations in HVR1 and higher costs for mutations in Core and NS5B. Resistance mutations were, on average, costlier than other mutations. Our results show that in vivo fitness costs of mutations can be site- and virus-specific, reinforcing the utility of constructing in vivo fitness cost maps of viral genomes.
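The abstract does not spell out how mutation frequencies are converted into fitness costs. One standard way to do this, assumed here purely for illustration, is the mutation-selection balance approximation, under which a deleterious mutation with mutation rate mu and selection coefficient s is expected to sit at frequency f ≈ mu / s, so that s ≈ mu / f. The minimal Python sketch below applies that relation to one site; the function names, the cap at 1, and the example numbers are hypothetical and are not taken from the study.

```python
def mean_frequency(per_host_frequencies):
    """Average the observed mutation frequency at one site across hosts."""
    if not per_host_frequencies:
        raise ValueError("need at least one host")
    return sum(per_host_frequencies) / len(per_host_frequencies)


def fitness_cost(mutation_rate, frequency):
    """Estimate a selection coefficient s from f = mu / s (assumed model).

    The estimate is capped at 1.0, so a mutation that is never observed
    (frequency 0) is treated as lethal rather than dividing by zero.
    """
    if mutation_rate <= 0:
        raise ValueError("mutation_rate must be positive")
    if frequency <= 0:
        return 1.0
    return min(1.0, mutation_rate / frequency)


# Hypothetical inputs: mutation frequencies at one site in three hosts
# and an assumed per-site mutation rate.
site_frequencies = [0.0008, 0.0010, 0.0012]
assumed_mutation_rate = 1e-5

f = mean_frequency(site_frequencies)
s = fitness_cost(assumed_mutation_rate, f)
print(f"mean frequency = {f:.4g}, estimated cost = {s:.4g}")
```

Averaging over many hosts before dividing is what makes a large patient dataset valuable in this kind of estimate: a frequency observed in a single host is noisy, and a noisy denominator gives a noisy cost.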
Wahl, Lindi
Journal Name: PLOS Genetics
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract Background Investigation of outbreaks to identify the primary case is crucial for the interruption and prevention of transmission of infectious diseases. These individuals may have a higher risk of participating in near future transmission events when compared to the other patients in the outbreak, so directing more transmission prevention resources towards these individuals is a priority. Although the genetic characterization of intra-host viral populations can aid the identification of transmission clusters, it is not trivial to determine the directionality of transmissions during outbreaks, owing to the complexity of viral evolution. Here, we present a new computational framework, PYCIVO: primary case inference in viral outbreaks. This framework expands upon our earlier work in development of QUENTIN, which builds a probabilistic disease transmission tree based on simulation of evolution of intra-host hepatitis C virus (HCV) variants between cases involved in direct transmission during an outbreak. PYCIVO improves upon QUENTIN by also adding a custom heterogeneity index and identifying the scenario when the primary case may not have been sampled. Results These approaches were validated using a set of 105 sequence samples from 11 distinct HCV transmission clusters identified during outbreak investigations, in which the primary case was epidemiologically verified. Both models can detect the correct primary case in 9 out of 11 transmission clusters (81.8%). However, while QUENTIN issues erroneous predictions on the remaining 2 transmission clusters, PYCIVO issues a null output for these clusters, giving it an effective prediction accuracy of 100%. To further evaluate accuracy of the inference, we created 10 modified transmission clusters in which the primary case had been removed. In this scenario, PYCIVO was able to correctly identify that there was no primary case in 8/10 (80%) of these modified clusters. This model was validated with HCV; however, this approach may be applicable to other microbial pathogens. Conclusions PYCIVO improves upon QUENTIN by also implementing a custom heterogeneity index which empowers PYCIVO to make the important ‘No primary case’ prediction. One or more samples, possibly including the primary case, may not have been sampled, and this designation is meant to account for these scenarios.
  2. Abstract The cost of testing can be a substantial contributor to hepatitis C virus (HCV) elimination program costs in many low- and middle-income countries such as Georgia, resulting in the need for innovative and cost-effective strategies for testing. Our objective was to investigate the most cost-effective testing pathways for scaling-up HCV testing in Georgia. We developed a Markov-based model with a lifetime horizon that simulates the natural history of HCV, and the cost of detection and treatment of HCV. We then created an interactive online tool that uses results from the Markov-based model to evaluate the cost-effectiveness of different HCV testing pathways. We compared the current standard-of-care (SoC) testing pathway and four innovative testing pathways for Georgia. The SoC testing was cost-saving compared to no testing, but all four new HCV testing pathways further increased QALYs and decreased costs. The pathway with the highest patient follow-up, due to on-site testing, resulted in the highest discounted QALYs (123 QALY more than the SoC) and lowest costs ($127,052 less than the SoC) per 10,000 persons screened. The current testing algorithm in Georgia can be replaced with a new pathway that is more effective while being cost-saving.
  3. Background

    Metamodels can address some of the limitations of complex simulation models by formulating a mathematical relationship between input parameters and simulation model outcomes. Our objective was to develop and compare the performance of a machine learning (ML)–based metamodel against a conventional metamodeling approach in replicating the findings of a complex simulation model.


    We constructed 3 ML-based metamodels using random forest, support vector regression, and artificial neural networks and a linear regression-based metamodel from a previously validated microsimulation model of the natural history of hepatitis C virus (HCV) consisting of 40 input parameters. Outcomes of interest included societal costs and quality-adjusted life-years (QALYs), the incremental cost-effectiveness ratio (ICER) of HCV treatment versus no treatment, cost-effectiveness acceptability curve (CEAC), and expected value of perfect information (EVPI). We evaluated metamodel performance using root mean squared error (RMSE) and Pearson’s R² on the normalized data.


    The R² values for the linear regression metamodel for QALYs without treatment, QALYs with treatment, societal cost without treatment, societal cost with treatment, and ICER were 0.92, 0.98, 0.85, 0.92, and 0.60, respectively. The corresponding R² values for our ML-based metamodels were 0.96, 0.97, 0.90, 0.95, and 0.49 for support vector regression; 0.99, 0.83, 0.99, 0.99, and 0.82 for artificial neural network; and 0.99, 0.99, 0.99, 0.99, and 0.98 for random forest. Similar trends were observed for RMSE. The CEAC and EVPI curves produced by the random forest metamodel matched the results of the simulation output more closely than the linear regression metamodel.


    ML-based metamodels generally outperformed traditional linear regression metamodels at replicating results from complex simulation models, with random forest metamodels performing best.


    Decision-analytic models are frequently used by policy makers and other stakeholders to assess the impact of new medical technologies and interventions. However, complex models can impose limitations on conducting probabilistic sensitivity analysis and value-of-information analysis, and may not be suitable for developing online decision-support tools. Metamodels, which accurately formulate a mathematical relationship between input parameters and model outcomes, can replicate complex simulation models and address the above limitation. The machine learning–based random forest model can outperform linear regression in replicating the findings of a complex simulation model. Such a metamodel can be used for conducting cost-effectiveness and value-of-information analyses or developing online decision support tools.

  4. Understanding within-host evolution is critical for predicting viral evolutionary outcomes, yet such studies are currently lacking due to difficulty involving human subjects. Hepatitis C virus (HCV) is an RNA virus with high mutation rates. Its complex evolutionary dynamics and extensive genetic diversity are demonstrated in over 67 known subtypes. In this study, we analyzed within-host mutation frequency patterns of three HCV subtypes, using a large number of samples obtained from treatment-naïve participants by next-generation sequencing. We report that overall mutation frequency patterns are similar among subtypes, yet subtype 3a consistently had lower mutation frequencies and nucleotide diversity, while subtype 1a had the highest. We found that about 50% of genomic sites are highly conserved across subtypes, which are likely under strong purifying selection. We also compared within-host and between-host selective pressures, which revealed that Hyper Variable Region 1 within hosts was under positive selection, but was under slightly negative selection between hosts, which indicates that many mutations created within hosts are removed during the transmission bottleneck. Examining the natural prevalence of known resistance-associated variants showed their consistent existence in the treatment-naïve participants. These results provide insights into the differences and similarities among HCV subtypes that may be used to develop and improve HCV therapies.
  5. Biophysical interactions between proteins and peptides are key determinants of molecular recognition specificity landscapes. However, an understanding of how molecular structure and residue-level energetics at protein−peptide interfaces shape these landscapes remains elusive. We combine information from yeast-based library screening, next-generation sequencing, and structure-based modeling in a supervised machine learning approach to report the comprehensive sequence−energetics−function mapping of the specificity landscape of the hepatitis C virus (HCV) NS3/4A protease, whose function (site-specific cleavages of the viral polyprotein) is a key determinant of viral fitness. We screened a library of substrates in which five residue positions were randomized and measured cleavability of ∼30,000 substrates (∼1% of the library) using yeast display and fluorescence-activated cell sorting followed by deep sequencing. Structure-based models of a subset of experimentally derived sequences were used in a supervised learning procedure to train a support vector machine to predict the cleavability of 3.2 million substrate variants by the HCV protease. The resulting landscape allows identification of previously unidentified HCV protease substrates, and graph-theoretic analyses reveal extensive clustering of cleavable and uncleavable motifs in sequence space. Specificity landscapes of known drug-resistant variants are similarly clustered. The described approach should enable the elucidation and redesign of specificity landscapes of a wide variety of proteases, including human-origin enzymes. Our results also suggest a possible role for residue-level energetics in shaping plateau-like functional landscapes predicted from viral quasispecies theory.