Title: Minos: variant adjudication and joint genotyping of cohorts of bacterial genomes

There are many short-read variant-calling tools, with different strengths and weaknesses. We present a tool, Minos, which combines outputs from arbitrary variant callers, increasing recall without loss of precision. We benchmark on 62 samples from three bacterial species and an outbreak of 385 Mycobacterium tuberculosis samples. Minos also enables joint genotyping; we demonstrate on a large (N=13k) M. tuberculosis cohort, building a map of non-synonymous SNPs and indels in a region where all such variants are assumed to cause rifampicin resistance. We quantify the correlation with phenotypic resistance and then replicate in a second cohort (N=10k).
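To make the recall and precision claim concrete, the sketch below scores two hypothetical callers and their naive union against a toy truth set. It is not Minos's adjudication method, which the abstract does not describe; the variant tuples, caller names, and the `precision_recall` helper are all invented for illustration. The point is only that a plain union tends to raise recall while also pooling false positives, which is the gap an adjudication step is meant to close.

```python
def precision_recall(calls, truth):
    """Return (precision, recall) for a set of variant calls against a truth set."""
    true_positives = len(calls & truth)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical variants as (position, ref, alt); none of these come from the paper.
truth = {(100, "A", "G"), (250, "C", "T"), (400, "G", "GA"), (730, "T", "C")}
caller_a = {(100, "A", "G"), (250, "C", "T"), (512, "G", "A")}
caller_b = {(250, "C", "T"), (400, "G", "GA"), (901, "A", "T")}

# Naive merge: keep every call made by either caller.
union = caller_a | caller_b

for name, calls in [("caller_a", caller_a), ("caller_b", caller_b), ("union", union)]:
    precision, recall = precision_recall(calls, truth)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the union recovers more true variants than either caller alone but inherits both callers' false positives, so its precision is lower than that of each individual caller.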

Publication Date:
Journal Name:
Genome Biology
Springer Science + Business Media
Sponsoring Org:
National Science Foundation
  1. Abstract Background

    Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods.


    Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly-selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction.


    Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under ROC curve (AUC) = 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. In our data, demographics and comorbidities did not affect model performance.


    We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and to further investigate the effect of covariates.
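The abstract above assesses its classifiers by area under the ROC curve. As a minimal sketch of what that number measures, the function below computes AUC as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as half. The `roc_auc` name, the labels, and the scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical test cohort: 1 = IA, 0 = control, with made-up model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(f"AUC = {roc_auc(labels, scores):.3f}")
```

The pairwise form is quadratic in sample size, which is fine for a test set of a few dozen samples; a rank-based formulation would be the usual choice for larger cohorts.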

  2. Abstract

    Tuberculosis (TB) remains a significant cause of mortality worldwide. Metagenomic next-generation sequencing has the potential to reveal biomarkers of active disease, identify coinfection, and improve detection for sputum-scarce or culture-negative cases. We conducted a large-scale comparative study of 428 plasma, urine, and oral swab samples from 334 individuals from TB endemic and non-endemic regions to evaluate the utility of a shotgun metagenomic DNA sequencing assay for tuberculosis diagnosis. We found that the composition of the control population had a strong impact on the measured performance of the diagnostic test: the use of a control population composed of individuals from a TB non-endemic region led to a test with nearly 100% specificity and sensitivity, whereas a control group composed of individuals from TB endemic regions exhibited a high background of nontuberculous mycobacterial DNA, limiting the diagnostic performance of the test. Using mathematical modeling and quantitative comparisons to matched qPCR data, we found that the burden of Mycobacterium tuberculosis DNA constitutes a very small fraction (0.04 or less) of the total abundance of DNA originating from mycobacteria in samples from TB endemic regions. Our findings suggest that the utility of a minimally invasive metagenomic sequencing assay for pulmonary tuberculosis diagnostics is limited by the low burden of M. tuberculosis and an overwhelming biological background of nontuberculous mycobacterial DNA.
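The dependence of measured test performance on the control population can be illustrated with a small calculation. The sketch below holds the case results fixed and swaps the control group; all counts are hypothetical and chosen only to show the mechanism, not to reproduce the study's figures. The `sensitivity_specificity` helper is likewise an illustrative name.

```python
def sensitivity_specificity(case_results, control_results):
    """Each list holds booleans: True means the assay called the sample TB-positive."""
    true_positives = sum(case_results)
    true_negatives = sum(not r for r in control_results)
    return true_positives / len(case_results), true_negatives / len(control_results)


# Hypothetical assay calls (not data from the study).
cases = [True] * 9 + [False] * 1                   # TB patients
controls_non_endemic = [False] * 10                # no mycobacterial background
controls_endemic = [False] * 6 + [True] * 4        # background triggers false positives

print(sensitivity_specificity(cases, controls_non_endemic))
print(sensitivity_specificity(cases, controls_endemic))
```

Sensitivity is identical in both comparisons because the cases do not change; only specificity moves, which is why the choice of control group can dominate the apparent quality of the test.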

  3. Abstract Background

    Blood-based biomarkers for diagnosing active tuberculosis (TB), monitoring treatment response, and predicting risk of progression to TB disease have been reported. However, validation of the biomarkers across multiple independent cohorts is scarce. A robust platform to validate TB biomarkers in different populations with clinical end points is essential to the development of a point-of-care clinical test. NanoString nCounter technology is an amplification-free digital detection platform that directly measures mRNA transcripts with high specificity. Here, we determined whether NanoString could serve as a platform for extensive validation of candidate TB biomarkers.


    The NanoString platform was used for performance evaluation of existing TB gene signatures in a cohort in which signatures were previously evaluated on an RNA-seq dataset. A NanoString codeset that probes 107 genes comprising 12 TB signatures and 6 housekeeping genes (NS-TB107) was developed and applied to total RNA derived from whole blood samples of TB patients and individuals with latent TB infection (LTBI) from South India. The TBSignatureProfiler tool was used to score samples for each signature. An ensemble of machine learning algorithms was used to derive a parsimonious biomarker.


    Gene signatures present in NS-TB107 had statistically significant discriminative power for segregating TB from LTBI. Further analysis of the data yielded a NanoString 6-gene set (NANO6) that when tested on 10 published datasets was highly diagnostic for active TB.


    The NanoString nCounter system provides a robust platform for validating existing TB biomarkers and deriving a parsimonious gene signature with enhanced diagnostic performance.

  4. Abstract Background

    Over the last two decades, the scale-up of vector control and changes in the first-line anti-malarial, from chloroquine (CQ) to sulfadoxine-pyrimethamine (SP) and then to artemether-lumefantrine (AL), have resulted in significant decreases in malaria burden in western Kenya. This study evaluated the long-term effects of control interventions on molecular markers of Plasmodium falciparum drug resistance using parasites obtained from humans and mosquitoes at discrete time points.


    Dried blood spot samples collected in 2012 and 2017 community surveys in Asembo, Kenya were genotyped by Sanger sequencing for markers associated with resistance to SP (Pfdhfr, Pfdhps), CQ, AQ, lumefantrine (Pfcrt, Pfmdr1) and artemisinin (Pfk13). Temporal trends in the prevalence of these markers, including data from 2012 to 2017 as well as published data from 1996, 2001, and 2007 from the same area, were analysed. The same markers from mosquito oocysts collected in 2012 were compared with results from human blood samples.


    The prevalence of the SP dhfr/dhps quintuple mutant haplotype C50I51R59N108I164/S436G437E540A581A613 increased from 19.7% in 1996 to 86.0% in 2012, while the sextuple mutant haplotype C50I51R59N108I164/H436G437E540A581A613 containing Pfdhps-436H increased from 10.5% in 2012 to 34.6% in 2017. Resistant Pfcrt-76T declined from 94.6% in 2007 to 18.3% in 2012 and 0.9% in 2017. Mutant Pfmdr1-86Y decreased across years from 74.8% in 1996 to zero in 2017, while mutant Pfmdr1-184F and wild-type Pfmdr1-D1246 increased from 17.9% and 58.9% in 2007 to 55.9% and 90.1% in 2017, respectively. Pfmdr1 haplotype N86F184S1034N1042D1246 increased from 11.0% in 2007 to 49.6% in 2017. No resistance-associated mutations in Pfk13 were found. Prevalence of Pfdhps-436H was lower while prevalence of Pfcrt-76T was higher in mosquitoes than in human blood samples.


    This study showed an increased prevalence of dhfr/dhps resistance markers over 20 years, with the emergence of the Pfdhps-436H mutant a decade ago in Asembo. The reversal of Pfcrt from the CQ-resistant to the CQ-sensitive genotype occurred following 19 years of CQ withdrawal. No Pfk13 markers associated with artemisinin resistance were detected, but an increase in the Pfmdr1 haplotype N86F184S1034N1042D1246 was observed. The differences in prevalence of Pfdhps-436H and Pfcrt-76T SNPs between the two hosts and the role of mosquitoes in the transmission of drug-resistant parasites require further investigation.

  5. Abstract

    Recent genome-wide studies have begun to identify gene variants, expression profiles, and regulators associated with neuroticism, anxiety disorders, and depression. We conducted a set of experimental cell culture studies of gene regulation by micro RNAs (miRNAs), based on genome-wide transcriptome, proteome, and miRNA expression data from twenty postmortem samples of lateral amygdala from donors with known neuroticism scores. Using Ingenuity Pathway Analysis and TargetScan, we identified a list of mRNA–protein–miRNA sets whose expression patterns were consistent with miRNA-based translational repression, as a function of trait anxiety. Here, we focused on one gene from that list, which is of particular translational significance in Psychiatry: synaptic vesicle glycoprotein 2A (SV2A) is the binding site of the anticonvulsant drug levetiracetam ((S)-α-Ethyl-2-oxo-1-pyrrolidineacetamide), which has shown promise in anxiety disorder treatments. We confirmed that SV2A is associated with neuroticism or anxiety using an original GWAS of a community cohort (N = 1,706), and cross-referencing a published GWAS of multiple cohorts (Ns ranging from 340,569 to 390,278). Postmortem amygdala expression profiling implicated three putative regulatory miRNAs to target SV2A: miR-133a, miR-138, and miR-218. Moving from association to experimental causal testing in cell culture, we used a luciferase assay to demonstrate that miR-133a and miR-218, but not miR-138, significantly decreased relative luciferase activity from the SV2A dual-luciferase construct. In human neuroblastoma cells, transfection with miR-133a and miR-218 reduced both endogenous SV2A mRNA and protein levels, confirming miRNA targeting of the SV2A gene. This study illustrates the utility of combining postmortem gene expression data with GWAS to guide experimental cell culture assays examining gene regulatory mechanisms that may contribute to complex human traits. Identifying specific molecular mechanisms of gene regulation may be useful for future clinical applications in anxiety disorders or other forms of psychopathology.

