
Title: An FPGA Accelerator for Genome Variant Calling
In genome analysis, it is often important to identify variants relative to a reference genome. However, accurately identifying variants that occur with low frequency is computationally intensive. LoFreq is a widely used program that is adept at identifying low-frequency variants. This paper presents an FPGA-based accelerator for LoFreq. The accelerator targets virus analysis, which is particularly challenging compared to human genome analysis because the characteristics of the data to be analyzed are fundamentally different. It achieves speedups of up to 120× on the core computation of LoFreq and up to 32.4× across the entire program.
Journal Name:
2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
Page Range / eLocation ID:
1 to 9
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Radio-frequency quadrupoles (RFQs) are multi-purpose linear particle accelerators that simultaneously bunch and accelerate charged particle beams. They are ubiquitous in accelerator physics, especially as injectors to higher-energy machines, owing to their impressive efficiency. The design and optimization of these devices can be lengthy due to the need to repeatedly perform high-fidelity simulations. Several recent papers have demonstrated that machine learning can be used to build surrogate models (fast-executing replacements of computationally costly beam simulations) for order-of-magnitude computing time speedups. However, while these pilot studies are encouraging, there is room to improve their predictive accuracy. In particular, beam summary statistics such as emittances (an important figure of merit in particle accelerator physics) have historically been challenging to predict. For the first time, we present a surrogate model trained on 200 000 samples that yields < 6% mean average percent error for the predictions of all relevant beam output parameters from defining RFQ design parameters, solving the problem of poor emittance predictions by identifying and including hidden variables which were not accounted for previously. These surrogate models were made possible by using the Julia language and GPU computing; we briefly discuss both. We demonstrate the utility of surrogate modeling by performing a multi-objective optimization using our best model as a callback in the objective function to select an optimal RFQ design. We consider trade-offs in RFQ performance for various choices of Pareto-optimal design variables, which are common issues for any multi-objective optimization scheme. Lastly, we make recommendations for input data preparation, selection, and neural network architectures that pave the way for future development of production-capable surrogate models for RFQs and other particle accelerators.

  2. Background: Transposable element (TE) polymorphisms are important components of population genetic variation. The functional impacts of TEs in gene regulation and generating genetic diversity have been observed in multiple species, but the frequency and magnitude of TE variation are underappreciated. Inexpensive and deep sequencing technology has made it affordable to apply population genetic methods to whole genomes with methods that identify single nucleotide and insertion/deletion polymorphisms. However, identifying TE polymorphisms, particularly transposition events or non-reference insertion sites, can be challenging due to the repetitive nature of these sequences, which hampers both the sensitivity and specificity of analysis tools. Methods: We have developed the tool RelocaTE2 for identification of TE insertion sites at high sensitivity and specificity. RelocaTE2 searches for known TE sequences in whole genome sequencing reads from second generation sequencing platforms such as Illumina. These sequence reads are used as seeds to pinpoint chromosome locations where TEs have transposed. RelocaTE2 detects target site duplication (TSD) of TE insertions, allowing it to report TE polymorphism loci with single base pair precision. Results and Discussion: The performance of RelocaTE2 is evaluated using both simulated and real sequence data. RelocaTE2 demonstrates a high level of sensitivity and specificity, particularly when the sequence coverage is not shallow. In comparison to other tools tested, RelocaTE2 achieves the best balance between sensitivity and specificity. In particular, RelocaTE2 performs best in prediction of TSDs for TE insertions. Even in highly repetitive regions, such as those tested on rice chromosome 4, RelocaTE2 is able to report up to 95% of simulated TE insertions with less than 0.1% false positive rate using 10-fold genome coverage resequencing data. RelocaTE2 provides a robust solution to identify TE insertion sites and can be incorporated into analysis workflows in support of describing the complete genotype from light-coverage genome sequencing.
  3. We present an algorithmic framework for quantum-inspired classical algorithms on close-to-low-rank matrices, generalizing the series of results started by Tang’s breakthrough quantum-inspired algorithm for recommendation systems [STOC’19]. Motivated by quantum linear algebra algorithms and the quantum singular value transformation (SVT) framework of Gilyén et al. [STOC’19], we develop classical algorithms for SVT that run in time independent of input dimension, under suitable quantum-inspired sampling assumptions. Our results give compelling evidence that in the corresponding QRAM data structure input model, quantum SVT does not yield exponential quantum speedups. Since the quantum SVT framework generalizes essentially all known techniques for quantum linear algebra, our results, combined with sampling lemmas from previous work, suffice to generalize all prior results about dequantizing quantum machine learning algorithms. In particular, our classical SVT framework recovers and often improves the dequantization results on recommendation systems, principal component analysis, supervised clustering, support vector machines, low-rank regression, and semidefinite program solving. We also give additional dequantization results on low-rank Hamiltonian simulation and discriminant analysis. Our improvements come from identifying the key feature of the quantum-inspired input model that is at the core of all prior quantum-inspired results: ℓ2-norm sampling can approximate matrix products in time independent of their dimension. We reduce all our main results to this fact, making our exposition concise, self-contained, and intuitive.

  4. INTRODUCTION Thousands of genetic variants have been associated with human diseases and traits through genome-wide association studies (GWASs). Translating these discoveries into improved therapeutics requires discerning which variants among hundreds of candidates are causally related to disease risk. To date, only a handful of causal variants have been confirmed. Here, we leverage 100 million years of mammalian evolution to address this major challenge. RATIONALE We compared genomes from hundreds of mammals and identified bases with unusually few variants (evolutionarily constrained). Constraint is a measure of functional importance that is agnostic to cell type or developmental stage. It can be applied to investigate any heritable disease or trait and is complementary to resources using cell type– and time point–specific functional assays like Encyclopedia of DNA Elements (ENCODE) and Genotype-Tissue Expression (GTEx). RESULTS Using constraint calculated across placental mammals, 3.3% of bases in the human genome are significantly constrained, including 57.6% of coding bases. Most constrained bases (80.7%) are noncoding. Common variants (allele frequency ≥ 5%) and low-frequency variants (0.5% ≤ allele frequency < 5%) are depleted for constrained bases (1.85 versus 3.26% expected by chance, P < 2.2 × 10⁻³⁰⁸). Pathogenic ClinVar variants are more constrained than benign variants (P < 2.2 × 10⁻¹⁶). The most constrained common variants are more enriched for disease single-nucleotide polymorphism (SNP)–heritability in 63 independent GWASs. The enrichment of SNP-heritability in constrained regions is greater (7.8-fold) than previously reported in mammals and is even higher in primates (11.1-fold). It exceeds the enrichment of SNP-heritability in nonsynonymous coding variants (7.2-fold) and fine-mapped expression quantitative trait loci (eQTL)–SNPs (4.8-fold). 
The enrichment peaks near constrained bases, with a log-linear decrease of SNP-heritability enrichment as a function of the distance to a constrained base. Zoonomia constraint scores improve functionally informed fine-mapping. Variants at sites constrained in mammals and primates have greater posterior inclusion probabilities and higher per-SNP contributions. In addition, using both constraint and functional annotations improves polygenic risk score accuracy across a range of traits. Finally, incorporating constraint information into the analysis of noncoding somatic variants in medulloblastomas identifies new candidate driver genes. CONCLUSION Genome-wide measures of evolutionary constraint can help discern which variants are functionally important. This information may accelerate the translation of genomic discoveries into the biological, clinical, and therapeutic knowledge that is required to understand and treat human disease. [Figure: Using evolutionary constraint in genomic studies of human diseases.] (A) Constraint was calculated across 240 mammal species, including 43 primates (teal line). (B) Pathogenic ClinVar variants (N = 73,885) are more constrained across mammals than benign variants (N = 231,642; P < 2.2 × 10⁻¹⁶). (C) More-constrained bases are more enriched for trait-associated variants (63 GWASs). (D) Enrichment of heritability is higher in constrained regions than in functional annotations (left), even in a joint model with 106 annotations (right). (E) Fine-mapping (PolyFun) using a model that includes constraint scores identifies an experimentally validated association at rs1421085. Error bars represent 95% confidence intervals. BMI, body mass index; LF, low frequency; PIP, posterior inclusion probability.
  5. Abstract

    Viruses persist in nature owing to their extreme genetic heterogeneity and large population sizes, which enable them to evade host immune defenses, escape anti-viral drugs, and adapt to new hosts. The persistence of viruses is challenging to study because mutations affect multiple virus genes, interactions among genes in their impacts on virus growth are seldom known, and measures of viral fitness have yet to be standardized. To address these challenges, we employed a data-driven computational model of cell infection by a virus. The infection model accounted for the kinetics of viral gene expression, functional gene-gene interactions, genome replication, and allocation of host cellular resources to produce progeny of vesicular stomatitis virus (VSV), a prototype RNA virus. We used this model to computationally probe how interactions among genes carrying up to 11 deleterious mutations affect different measures of virus fitness: single-cycle growth yields and multi-cycle rates of infection spread. Individual mutations were implemented by perturbing biophysical parameters associated with individual gene functions of the wild-type model. Our analysis revealed synergistic epistasis among deleterious mutations in their effects on virus yield, so adverse effects of single deleterious mutations were amplified by interaction. For the same mutations, multi-cycle infection spread indicated weak or negligible epistasis, where single mutations act alone in their effects on infection spread. These results were robust to simulation under high and low host resource environments. Our work highlights how different types and magnitudes of epistasis can arise for genetically identical virus variants, depending on the fitness measure. More broadly, gene-gene interactions can differently affect how viruses grow and spread.

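Illustrative note on item 3: the fact that ℓ2-norm sampling can approximate a matrix product from a number of samples independent of the inner dimension can be sketched in a few lines. This is not code from that work; it is a minimal sketch of the standard norm-proportional sampling estimator, and the names `sampled_matmul` and `frobenius` are hypothetical. It assumes dense matrices stored as lists of rows and that A has at least one nonzero entry. With columns of A sampled with probability proportional to their squared norms, the estimate is unbiased and its expected squared Frobenius error is at most ‖A‖²·‖B‖²/s for s samples.

```python
import math
import random


def frobenius(M):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in M for x in row))


def sampled_matmul(A, B, s, rng):
    """Estimate A @ B from s sampled column-of-A / row-of-B outer products.

    Index k is drawn with probability proportional to the squared
    l2 norm of column k of A (assumes A is not all zeros).
    """
    n = len(B)                      # shared inner dimension
    m, q = len(A), len(B[0])
    col_norm_sq = [sum(row[k] ** 2 for row in A) for k in range(n)]
    total = sum(col_norm_sq)
    probs = [c / total for c in col_norm_sq]

    picks = rng.choices(range(n), weights=probs, k=s)

    out = [[0.0] * q for _ in range(m)]
    for k in picks:
        # Rescaling by 1 / (s * p_k) makes the estimator unbiased.
        w = 1.0 / (s * probs[k])
        for i in range(m):
            a = A[i][k] * w
            for j in range(q):
                out[i][j] += a * B[k][j]
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    A = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(5)]
    B = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
    approx = sampled_matmul(A, B, s=100, rng=rng)
    exact = [[sum(A[i][k] * B[k][j] for k in range(200)) for j in range(4)]
             for i in range(5)]
    diff = [[approx[i][j] - exact[i][j] for j in range(4)] for i in range(5)]
    print("error:", frobenius(diff))
    print("scale of bound:", frobenius(A) * frobenius(B) / math.sqrt(100))
```

The cost of forming the estimate depends on s and the outer dimensions, not on the inner dimension, once the sampling distribution is available; computing that distribution is what the sampling-access assumption in such input models is meant to provide.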