<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>GenoMiX: Accelerated Simultaneous Analysis of Human Genomics, Microbiome Metagenomics, and Viral Sequences</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>10/19/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10534962</idno>
					<idno type="doi">10.1109/BioCAS58349.2023.10388531</idno>
					
					<author>Tianqi Zhang</author><author>Antonio González</author><author>Niema Moshiri</author><author>Rob Knight</author><author>Tajana Rosing</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Personalized medicine tailors treatments based on the individual characteristics of each patient. Recent work targets comprehensive analysis of patient samples that contain human, viral, and bacterial genomes. Novel technologies like Simul-seq enable simultaneous analysis of such samples. However, existing workflows are slow and disjoint. In this paper, we present GenoMiX, an FPGA-accelerated genomics infrastructure that supports multiple real-world analysis pipelines used in personalized medicine, including phylogenetic assignment for viral pathogens, variant calling that is key for cancer genomics, and microbiome metagenome analysis. We integrate our workflow with Qiita, an open-source framework for managing and analyzing multi-omics datasets. GenoMiX is not only up to 30× faster than comparable CPU-based tools, but it also addresses the challenges associated with handling reference databases of varying sizes, encompassing viruses, human genomes, and microbiomes. Our optimized infrastructure was a key component enabling the success of UCSD's award-winning "Return to Learn" program during the COVID-19 public health emergency in San Diego County.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Personalized medicine tailors treatments to the individual characteristics of each patient. Protocols such as Simul-seq <ref type="bibr">[1]</ref> have enabled the comprehensive collection and analysis of patient samples that contain not just human genomes but also bacterial and viral genomes. This is critical to understanding how human health is affected not only by diseases such as cancer but also by native bacteria and outside pathogens that interact with our bodies and the medicines we take. Recent work in personalized medicine seeks to understand the interaction between these various aspects. However, state-of-the-art tools such as <ref type="bibr">[2]</ref>-<ref type="bibr">[6]</ref> are too slow to address this need.</p><p>Many accelerators have been proposed to date for various components of the analysis pipeline, such as those based on processing in memory (PIM) <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>, near-data computing <ref type="bibr">[9]</ref>, ASICs <ref type="bibr">[10]</ref>, and FPGAs <ref type="bibr">[11]</ref>-<ref type="bibr">[15]</ref>. Most of these are not deployed in actual analysis pipelines since a) they address only a few components of the pipeline, b) their accuracy may not match that of CPU-based tools, and c) they are harder to use and less compatible with existing tools. 
For example, <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref> only accelerate filtering and pairwise alignment, respectively; <ref type="bibr">[9]</ref> accelerates seeding and pre-alignment filtering but neglects the compute-bound Levenshtein distance computation; <ref type="bibr">[11]</ref> runs from seeding to pairwise alignment but does not support the backtrace needed to produce the compact idiosyncratic gapped alignment report (CIGAR); <ref type="bibr">[12]</ref> achieves Bowtie2 <ref type="bibr">[4]</ref>-like alignment accuracy but cannot produce the mapping quality (MAPQ) needed by downstream analysis.</p><p>In this work, we design a unified and fast tool, GenoMiX, that accelerates viral sequence analysis, variant calling in human genomics, and bacterial metagenomic comparative analysis. GenoMiX uses Qiita <ref type="bibr">[16]</ref>, an open-source framework for managing and analyzing multi-omics datasets via a user-friendly webpage, as a front end. As shown in Figure <ref type="figure">1</ref>, the three analysis pipelines share a common backbone: short read alignment and sequence trimming, which we accelerate on FPGA. To provide end-to-end support to medical practitioners, we integrate downstream analysis tools into GenoMiX, such as DeepVariant <ref type="bibr">[17]</ref> for variant calling, Pangolin <ref type="bibr">[18]</ref> for viral analysis, and Unifrac <ref type="bibr">[19]</ref> for microbiome analysis. Phylogenetic assignment generates the viral consensus sequence and classifies it. Variant calling determines whether mismatches or indels in the genome are true variants or errors induced by the sequencing machine. In metagenome community analysis, samples comprising multiple bacterial species are compared using the Unifrac distance as a metric to discover and compare a variety of microbiome communities. 
Run as a plugin of Qiita <ref type="bibr">[16]</ref>, GenoMiX lets users run all three pipelines simultaneously on the same dataset or run only the pipelines they need on separate datasets. This makes the tool adaptable to a variety of use cases, from multi-omics analysis to single-dataset investigations. To summarize, our system is:</p><p>&#8226; The first all-in-one tool that integrates three different analysis pipelines and accelerates them by up to 30&#215; at state-of-the-art accuracy. Each U55c FPGA running GenoMiX provides computing power equivalent to up to 475 CPUs. &#8226; Designed to efficiently handle reference databases of various sizes, covering viruses, human genomes, and microbiomes, with as many as billions of base pairs. &#8226; Integrated into the state-of-the-art Qiita <ref type="bibr">[16]</ref> genomic study management platform. The platform has been used over the last three years in San Diego County as a part of UCSD's award-winning "Return to Learn" program <ref type="bibr">[20]</ref> during the COVID-19 public health emergency.</p><p>[Figure 1 panels, as recoverable from the extracted labels. Phylogenetic assignment pipeline: viral sequence, sequence alignment (Minimap2 [3]), primer trimming (iVar [5]), pile-up &amp; consensus sequence calling (Samtools [18]), phylogenetic assignment (Pangolin [19]). Microbiome metagenome community analysis pipeline: microbiome sample, adapter trimming &amp; QC (Fastp [6]), host filtering, sequence alignment, feature table generation (Woltka [21]), Unifrac distance calculation (Unifrac [22]). Variant calling pipeline: human genome, sequence alignment, variant detection (DeepVariant [17]). GenoMiX components: adapter trimming &amp; QC, primer trimming, and sequence alignment (seeding, PTR access, CAL lookup, refinement, alignment, MAPQ estimation, backtrace). GenoMiX is integrated into Qiita <ref type="bibr">[16]</ref> with plugins.]</p><p>Fig. <ref type="figure">1</ref>. Overview of GenoMiX. GenoMiX is made up of components that accelerate the analysis pipelines. The standard tool chain includes aligners <ref type="bibr">[2]</ref>-<ref type="bibr">[4]</ref>, trimmers <ref type="bibr">[5]</ref>, <ref type="bibr">[6]</ref>, and downstream modules <ref type="bibr">[17]</ref>-<ref type="bibr">[19]</ref>, <ref type="bibr">[21]</ref>, <ref type="bibr">[22]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. GENOMIX DESIGN AND IMPLEMENTATION</head><p>Figure <ref type="figure">1</ref> provides an overview of GenoMiX, which runs as a plugin in Qiita <ref type="bibr">[16]</ref>. GenoMiX accelerates three pipelines with four components. The web front end of Qiita <ref type="bibr">[16]</ref> gives GenoMiX a unified and user-friendly interface for the three analysis pipelines, which are represented in blue boxes: one runs variant calling using DeepVariant <ref type="bibr">[17]</ref> on GPU; another is the viral phylogenetic assignment pipeline used in UCSD's "Return to Learn" Program <ref type="bibr">[20]</ref>; and the third executes Unifrac <ref type="bibr">[19]</ref>, which computes the dissimilarity between microbial communities based on their evolutionary relationships. The three pipelines share common components, represented in red boxes: sequence alignment, adapter trimming and quality control, and primer trimming. We profiled each of the three pipelines and found that trimming and alignment, when run on CPU, are by far the slowest stages, accounting for 67%-98% of the runtime, so GenoMiX accelerates all of them on FPGA, obtaining up to 30&#215; speedup. Adapter and primer trimming, which are detailed in Section II-A, remove low-quality regions and artifact segments. The alignment component of GenoMiX, discussed in Section II-B, uses a seeding- and extension-based method with a 2-stage index table. It not only identifies mapping locations but also generates CIGAR strings and accurately estimates MAPQ values. The structure of the index table is discussed in Section II-C. 
After alignment and trimming are completed, each of the three pipelines runs its own downstream tool (Pangolin <ref type="bibr">[18]</ref> on CPU, and DeepVariant <ref type="bibr">[17]</ref> and Unifrac <ref type="bibr">[19]</ref> on GPU) to obtain the final results, which are then displayed in the Qiita <ref type="bibr">[16]</ref> front end.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. GenoMiX Trimming and Quality Control (QC)</head><p>The GenoMiX trimming and quality control stages improve read alignment accuracy. They remove artifact DNA sequences, such as adapters and primers, and discard low-quality reads and segments, similar to Fastp <ref type="bibr">[6]</ref> and iVar <ref type="bibr">[5]</ref>, which run on CPU. For adapter trimming, GenoMiX uses a sliding window to detect and discard adapter segments when the number of mismatches falls below a specific threshold. To remove low-quality regions, it calculates the moving average of Phred scores, which indicate the sequencing quality of each base. This quality trimming can be set to one of three modes: cut front, cut tail, and cut right. PolyX/G trimming eliminates runs of consecutive identical bases at the read tail that are unlikely to be informative. For example, polyG may occur in the tails of reads from NextSeq platforms, where G indicates no signal. When the input data is paired, GenoMiX can perform base correction: it identifies the overlap region of a read pair and corrects mismatched bases if one base's Phred score is much higher than the other's. GenoMiX also offers many filtering options, such as discarding reads with too many undefined bases, low average Phred scores, insufficient length, or low complexity.</p><p>Primer sequences are known segments used to amplify the target DNA fragment. Unlike adapter trimming, which requires base-by-base matching but involves only tens of sequences, primer trimming deals with hundreds of artifact primer sequences introduced by multiplex amplicon sequencing. GenoMiX handles primer trimming in a way similar to adapter trimming.</p></div>
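<div xmlns="http://www.tei-c.org/ns/1.0"><p>The quality trimming and polyG trimming described above can be sketched in a few lines of Python. This is a minimal CPU-side illustration and not the GenoMiX FPGA kernel: the function names, the Phred+33 encoding, the window size of 4, the mean-quality threshold of 20, and the minimum polyG run of 10 are all assumptions chosen for the example. The cut right mode is interpreted here as scanning from the read start and dropping the first low-quality window together with everything after it.</p>
<p><![CDATA[
```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def cut_right(seq, quality, window=4, min_mean_q=20.0):
    """Scan from the read start; at the first window whose mean quality is
    below the threshold, drop that window and everything to its right."""
    scores = phred_scores(quality)
    if len(scores) < window:
        return seq, quality
    window_sum = sum(scores[:window])
    for start in range(len(scores) - window + 1):
        if start > 0:
            # Slide the window by one base: add the new base, drop the old one.
            window_sum += scores[start + window - 1] - scores[start - 1]
        if window_sum / window < min_mean_q:
            return seq[:start], quality[:start]
    return seq, quality


def trim_poly_g_tail(seq, quality, min_run=10):
    """Remove a trailing run of G bases if it is at least min_run long."""
    run = len(seq) - len(seq.rstrip("G"))
    if run >= min_run:
        return seq[:len(seq) - run], quality[:len(quality) - run]
    return seq, quality


if __name__ == "__main__":
    read = "ACGTACGTAC" + "G" * 12
    qual = "I" * 10 + "#" * 12  # 'I' decodes to Q40, '#' decodes to Q2
    print(cut_right(read, qual))
    print(trim_poly_g_tail(read, qual))
```
]]></p>
<p>The running window sum avoids recomputing the average from scratch at every position, which is the property that makes this kind of trimming a natural fit for streaming over a read one base at a time.</p></div>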
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. GenoMiX Alignment</head><p>In DNA sequencing, short reads often do not match the reference exactly. GenoMiX uses a seeding- and extension-based method to handle mismatches and indels. The seeding stage extracts several k-mers from the read, but these may point to a large number of suboptimal candidate locations. GenoMiX prefilters these candidates to improve the alignment quality.</p><p>Seeding: Seeds are extracted by sampling k-mers at an interval of l bases. Since the read direction is not known in advance, GenoMiX stores the entries for a seed and for its reverse complement in the same bucket to reduce the number of random accesses. The hash function converts a k-mer of length K to a (2K + 1)-bit integer: the lower 2K bits hold the smaller of the 2-bit-per-base encodings of the seed and its reverse complement, and the most significant bit indicates which of the two is smaller. GenoMiX then looks up the candidate locations of the seed in a 2-stage hash table. The hashed value is split into two parts, H_ptr and H_cal, which correspond to the PTR table and the CAL table, respectively. The PTR table is a conflict-free mapping: the H_ptr-th item of the PTR table points to the entry of the CAL table that stores this seed. The CAL table stores (H_cal, position) pairs, where the key is the remaining part of the seed's hash value and the value is the location of the seed in the reference.</p><p>Prefiltering speeds up sequence alignment and improves its quality by removing unlikely matches. GenoMiX identifies the most likely locations where the read may align by subtracting each seed's offset from the beginning of the read from the seed's candidate positions in the reference genome. To account for the possibility of small indels, positions that lie within a certain tolerance of each other are treated as the same. 
Prefiltering discards the positions that only a few seeds point to, so that later stages can focus on more promising candidates. In addition to filtering out the less promising candidate positions, the GenoMiX prefilter also calculates the Hamming distance to check whether the read aligns to the reference with only a few mismatches and no indels. The mismatch tolerance is a parameter that trades off speed against the likelihood of finding the best alignment.</p><p>Extension performs alignment using the Smith-Waterman and Needleman-Wunsch algorithms and produces a CIGAR string as output. Depending on the downstream application, global, local, or semi-global alignment may be used. GenoMiX estimates the mapping quality (MAPQ) in a manner similar to BWA-MEM <ref type="bibr">[2]</ref>.</p></div>
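<div xmlns="http://www.tei-c.org/ns/1.0"><p>The prefiltering step described above can be illustrated with a small Python sketch. It is a simplified software model under stated assumptions, not the GenoMiX hardware design: the function names, the tolerance of 2 bases, and the minimum of 2 supporting seeds are hypothetical values chosen for the example. Each seed hit votes for the read start it implies (reference position minus seed offset in the read), nearby starts are merged to absorb small indels, weakly supported starts are dropped, and a Hamming check tests for an indel-free alignment.</p>
<p><![CDATA[
```python
from collections import Counter


def candidate_starts(seed_hits, tolerance=2, min_votes=2):
    """seed_hits: iterable of (offset_in_read, position_in_reference).

    Each hit votes for the implied read start (position - offset). Starts
    within `tolerance` of a cluster's first start are merged into it, and
    clusters supported by fewer than `min_votes` seeds are discarded."""
    votes = Counter(pos - off for off, pos in seed_hits)
    merged = []  # each item is [representative_start, vote_count]
    for start in sorted(votes):
        if merged and start - merged[-1][0] <= tolerance:
            merged[-1][1] += votes[start]
        else:
            merged.append([start, votes[start]])
    return [(s, v) for s, v in merged if v >= min_votes]


def hamming_within(read, reference, start, max_mismatches):
    """True if the read aligns at `start` with no indels and at most
    `max_mismatches` mismatching bases."""
    if start < 0 or start + len(read) > len(reference):
        return False
    mismatches = 0
    for a, b in zip(read, reference[start:start + len(read)]):
        if a != b:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


if __name__ == "__main__":
    hits = [(0, 100), (4, 104), (8, 109), (0, 500)]
    print(candidate_starts(hits))
    print(hamming_within("CGTACGA", "AAACGTACGTTT", 3, 1))
```
]]></p>
<p>In this sketch, a read that passes the Hamming check would not need the more expensive dynamic-programming extension, which is one way to read the speed versus sensitivity trade-off controlled by the mismatch tolerance.</p></div>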
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. GenoMiX Table Indexing</head><p>GenoMiX limits the CAL row size to improve sequence alignment efficiency. Each CAL row has an H_cal cell and a seed position cell. The seed position is a 32-bit integer, which is large enough to address the whole human genome. The bucket size is also limited to avoid a bottleneck in the CAL table search. A bucket is limited to 32 rows; if a bucket exceeds this limit, its H_cal values are sorted by the number of positions they point to, and half of the positions are randomly discarded. This is repeated until the bucket size is less than 32.</p><p>The GenoMiX table structure is optimized based on the reference length. The default lengths of H_ptr and H_cal are chosen for a reference as large as the human genome. When the reference genome is smaller, such as that of COVID-19, only a small number of CAL entries share the same H_ptr. In this case, GenoMiX shortens H_ptr to reduce the PTR table size and lengthens H_cal to keep the seed length constant.</p></div>
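<div xmlns="http://www.tei-c.org/ns/1.0"><p>The seed hashing and the two-stage lookup can be made concrete with a toy Python model. Several details are assumptions made only for illustration: the base encoding (A=0, C=1, G=2, T=3), the polarity of the strand bit (set when the reverse complement is the smaller encoding), taking H_ptr from the high bits and H_cal from the low bits of the 2K canonical bits, and using a dictionary in place of the pointer array. All function names are hypothetical. The sketch shows why shortening H_ptr and lengthening H_cal leaves the seed length unchanged: the two parts always add up to 2K bits.</p>
<p><![CDATA[
```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode(kmer):
    """Pack a k-mer into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def reverse_complement(kmer):
    return "".join(COMPLEMENT[b] for b in reversed(kmer))


def canonical_hash(kmer):
    """(2K+1)-bit value: the low 2K bits hold the smaller encoding; the top
    bit is set when the reverse complement is the smaller one (assumed)."""
    k = len(kmer)
    fwd = encode(kmer)
    rev = encode(reverse_complement(kmer))
    if rev < fwd:
        return (1 << (2 * k)) | rev
    return fwd


def split_hash(h, k, ptr_bits):
    """Split the 2K canonical bits into (H_ptr, H_cal) and return the strand bit."""
    canonical = h & ((1 << (2 * k)) - 1)
    cal_bits = 2 * k - ptr_bits  # H_ptr and H_cal always total 2K bits
    return canonical >> cal_bits, canonical & ((1 << cal_bits) - 1), h >> (2 * k)


def build_index(reference, k, ptr_bits):
    """Toy two-stage index: H_ptr selects a bucket of (H_cal, position) rows."""
    ptr_table = {}
    for pos in range(len(reference) - k + 1):
        h = canonical_hash(reference[pos:pos + k])
        h_ptr, h_cal, _ = split_hash(h, k, ptr_bits)
        ptr_table.setdefault(h_ptr, []).append((h_cal, pos))
    return ptr_table


def lookup(ptr_table, kmer, ptr_bits):
    """Return the candidate reference positions of a seed and its strand bit."""
    h_ptr, h_cal, strand = split_hash(canonical_hash(kmer), len(kmer), ptr_bits)
    return [pos for cal, pos in ptr_table.get(h_ptr, []) if cal == h_cal], strand


if __name__ == "__main__":
    index = build_index("ACGTT", k=3, ptr_bits=2)
    print(lookup(index, "CGT", ptr_bits=2))
```
]]></p>
<p>Because a seed and its reverse complement share the same canonical bits, they land in the same bucket, so a single bucket read serves both read orientations. The bucket-size cap described in this section is not modeled here.</p></div>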
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. EVALUATION AND RESULTS</head><p>Experimental Setup: Qiita <ref type="bibr">[16]</ref> was used as the front end to drive GenoMiX. Figure <ref type="figure">3</ref> shows a screenshot of Qiita with GenoMiX running COVID-19 analysis, but GenoMiX can run all three types of genomics data analysis: human, bacterial, and viral. The acceleration kernels are written in C++ and compiled with Xilinx Vitis HLS to run on a Xilinx U55c FPGA <ref type="bibr">[23]</ref>, which has 16 GB of high bandwidth memory (HBM) split into 32 pseudo channels interconnected by 8 AXI switch boxes. GenoMiX optimizes the architecture based on the table size to leverage intra- and inter-box connectivity. For example, a viral reference is small, so we store the PTR table in a single HBM channel and the CAL table in internal SRAM, with multiple table copies to enhance throughput. The CAL tables for bacterial reference genome databases, in contrast, are typically so large that they cannot fit in a single FPGA, so GenoMiX divides them into multiple 2 GB subtables to ensure locality within the switch boxes. The host for our FPGA is an AMD EPYC 7F72 CPU with 128 GB of RAM.</p><p>Datasets: Our design was evaluated using real-world datasets, including COVID-19 sequencing data obtained from UCSD's "Return to Learn" Program <ref type="bibr">[20]</ref>, human whole genome sequencing data of NA12878 from the Genome in a Bottle Project, and microbiome metagenomic sequencing data from the EMP500 Project <ref type="bibr">[24]</ref>. The size of the reference genome varies across the datasets, ranging from a single virus genome to a comprehensive microbiome reference database consisting of over 10,000 bacterial and archaeal genomes. Table <ref type="table">I</ref> provides more details on them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Variant Calling</head><p>We tested the accuracy of GenoMiX on the cancer variant detection task using sequencing data from a female subject, NA12878, with the Platinum Genomes small variant truth set <ref type="bibr">[29]</ref>.</p>
<table><head>TABLE I. REFERENCE GENOMES</head>
<row role="label"><cell>Description</cell><cell>Name</cell><cell>Total Length</cell><cell>PTR Table</cell><cell>CAL Table</cell><cell>Ref Table</cell></row>
<row><cell>COVID-19</cell><cell>NC_045512.2 [25]</cell><cell>30 Kbp</cell><cell>256 MB</cell><cell>233 KB</cell><cell>7.5 KB</cell></row>
<row><cell>Chromosome 21</cell><cell>NC_000021.8 [26]</cell><cell>48 Mbp</cell><cell>4 GB</cell><cell>251 MB</cell><cell>12 MB</cell></row>
<row><cell>Whole human genome</cell><cell>GRCh38 [27]</cell><cell>3.1 Gbp</cell><cell>4 GB</cell><cell>16 GB</cell><cell>786 MB</cell></row>
<row><cell>Microbiome database</cell><cell>WoL [28]</cell><cell>33 Gbp</cell><cell>40 GB</cell><cell>195 GB</cell><cell>8.2 GB</cell></row>
</table>
<p>GenoMiX running on a single FPGA has 12&#215; higher throughput than BWA-MEM <ref type="bibr">[2]</ref> running on 16 cores. As shown in Figure <ref type="figure">4</ref>, the difference in end-to-end results between GenoMiX and BWA-MEM <ref type="bibr">[2]</ref> is small. True positives are the variants that the downstream application, DeepVariant <ref type="bibr">[17]</ref>, classifies accurately. Conversely, errors are the variants that DeepVariant detects but misclassifies.</p><p>Fig. 4. Variant calling accuracy. [Bar chart of the number of variants for GenoMiX and BWA-MEM [2], broken down into false positive, true positive, error, and false negative. Bar labels as extracted: 20,318; 20,443; 52,205; 52,205; 484; 483; 656; 517.]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Phylogenetic Assignment</head><p>We tested the COVID-19 phylogenetic assignment task with COVID-19 sequencing data obtained from UCSD's "Return to Learn" Program <ref type="bibr">[20]</ref>. Figure <ref type="figure">5</ref> shows the runtime of GenoMiX compared to the state-of-the-art BWA-MEM <ref type="bibr">[2]</ref>. The baseline runs on 16 cores. GenoMiX is 3&#215; faster than the CPU for alignment and 30&#215; faster for trimming. After GenoMiX acceleration on FPGA, the pileup and consensus sequence generation stages become the pipeline bottlenecks. The results of the accelerated pipeline are very close in accuracy to those of the original toolchain <ref type="bibr">[30]</ref>: the generated consensus sequences differ by only 0.04%. Most of the difference is caused by insufficient sequencing coverage and would not influence the end-to-end results. GenoMiX obtains exactly the same phylogenetic assignment results as the baseline state-of-the-art tool, Minimap2 <ref type="bibr">[3]</ref>.</p><p>[Bar chart of runtime in minutes for CPU, 1 FPGA, and 2 FPGAs, broken down into Minimap2 aligner [3], GenoMiX aligner, iVar trimmer [5], GenoMiX trimmer, pile-up &amp; consensus, and phylogenetic assignment. Annotated speedups: 14&#215; and 1.9&#215;.]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Microbiome Metagenomics Comparative Analysis</head><p>We tested the microbiome metagenomics comparative analysis workloads with the commonly used EMP500 <ref type="bibr">[24]</ref> dataset and the WoL <ref type="bibr">[28]</ref> reference genome database. As shown in Figure <ref type="figure">6</ref>, alignment is the slowest component of the whole pipeline due to the large size of the WoL <ref type="bibr">[28]</ref> database. Bowtie2 <ref type="bibr">[4]</ref>, a state-of-the-art tool often used for alignment of microbial data due to its high accuracy, uses 44 GB of memory when running with the WoL <ref type="bibr">[28]</ref> database on a CPU. The challenge for the GenoMiX aligner is the scale of the microbiome reference genome database: it may contain over 10K bacterial and archaeal genomes, which results in over 190 GB of index tables. However, a single U55c has only 16 GB of onboard memory, which is insufficient to store the index tables. To solve this problem, we divided the original reference into 10 segments and built 10 sets of tables. At least 2 FPGAs are required to finish the alignment. With 3 FPGAs, which together contain about as much memory as Bowtie2 consumes, we achieve 4&#215; speedup with comparable accuracy. The GenoMiX adapter trimmer and QC are 15&#215; faster than Fastp <ref type="bibr">[6]</ref> on CPU. Woltka <ref type="bibr">[22]</ref>, which is used for feature table generation, is relatively simple and has the highest throughput of all stages, even when run on CPU. Unifrac is not shown, as it runs in a few seconds on TB-size short-read datasets. We would need 679 CPU cores or only 29 FPGAs for alignment and trimming to match the throughput of Woltka <ref type="bibr">[22]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. CONCLUSIONS AND FUTURE WORK</head><p>In this paper, we presented GenoMiX, an FPGA-accelerated genomics infrastructure that supports multiple real-world analysis pipelines used in personalized medicine, including phylogenetic assignment for viral pathogens, variant calling that is key for cancer genomics, and microbiome metagenome analysis. GenoMiX has been integrated with Qiita <ref type="bibr">[16]</ref>, an open-source framework for managing and analyzing multi-omics datasets. It is up to 30&#215; faster than comparable CPU-based tools. The accelerated pipeline has been deployed for COVID-19 analysis in San Diego County during the last three years, as a part of UCSD's "Return to Learn" program <ref type="bibr">[20]</ref>. Its end-to-end results are comparable to those generated by mainstream tools.</p><p>Looking ahead, we will investigate the applicability of GenoMiX in other fields of biology and medicine, such as proteomics and transcriptomics. We also aim to make GenoMiX even more modular, enabling researchers to integrate new pipeline components as the field of genomics advances.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. ACKNOWLEDGEMENTS</head></div>
		</body>
		</text>
</TEI>
