skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: CSTs for Terabyte-Sized Data
Generating pangenomic datasets is becoming increasingly common but there are still few tools able to handle them and even fewer accessible to non-specialists. Building compressed suffix trees (CSTs) for pangenomic datasets is still a major challenge but could be enor- mously beneficial to the community. In this paper, we present a method, which we refer to as RePFP-CST, for building CSTs in a manner that is scalable. To accomplish this, we show how to build a CST directly from VCF files without decompressing them, and to prune from the prefix-free parse (PFP) phrase boundaries whose removal reduces the total size of the dictionary and the parse. We show that these improvements reduce the time and space required for the construction of the CST, and the memory footprint of the finished CST, enabling us to build a CST for a terabyte of DNA for the first time in the literature.  more » « less
Award ID(s):
2029552
PAR ID:
10340624
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Data Compression Conference (DCC)
Page Range / eLocation ID:
93 to 102
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Urban traffic status (e.g., traffic speed and volume) is highly dynamic in nature, namely, varying across space and evolving over time. Thus, predicting such traffic dynamics is of great importance to urban development and transportation management. However, it is very challenging to solve this problem due to spatial-temporal dependencies and traffic uncertainties. In this paper, we solve the traffic dynamics prediction problem from Bayesian meta-learning perspective and propose a novel continuous spatial-temporal meta-learner (cST-ML), which is trained on a distribution of traffic prediction tasks segmented by historical traffic data with the goal of learning a strategy that can be quickly adapted to related but unseen traffic prediction tasks. cST-ML tackles the traffic dynamics prediction challenges by advancing the Bayesian black-box meta-learning framework through the following new points: 1) cST-ML captures the dynamics of traffic prediction tasks using variational inference; 2) cST-ML has novel designs in architecture, where CNN and LSTM are embedded to capture the spatial-temporal dependencies between traffic status and traffic related features; 3) novel training and testing algorithms for cST-ML are designed. We also conduct experiments on two real-world traffic datasets (taxi inflow and traffic speed) to evaluate our proposed cST-ML. The experimental results verify that cST-ML can significantly improve the urban traffic prediction performance and outperform all baseline models. 
    more » « less
  2. Urban traffic status (e.g., traffic speed and volume) is highly dynamic in nature, namely, varying across space and evolving over time. Thus, predicting such traffic dynamics is of great importance to urban development and transportation management. However, it is very challenging to solve this problem due to spatial-temporal dependencies and traffic uncertainties. In this article, we solve the traffic dynamics prediction problem from Bayesian meta-learning perspective and propose a novel continuous spatial-temporal meta-learner (cST-ML), which is trained on a distribution of traffic prediction tasks segmented by historical traffic data with the goal of learning a strategy that can be quickly adapted to related but unseen traffic prediction tasks. cST-ML tackles the traffic dynamics prediction challenges by advancing the Bayesian black-box meta-learning framework through the following new points: (1) cST-ML captures the dynamics of traffic prediction tasks using variational inference, and to better capture the temporal uncertainties within tasks, cST-ML performs as a rolling window within each task; (2) cST-ML has novel designs in architecture, where CNN and LSTM are embedded to capture the spatial-temporal dependencies between traffic status and traffic-related features; (3) novel training and testing algorithms for cST-ML are designed. We also conduct experiments on two real-world traffic datasets (taxi inflow and traffic speed) to evaluate our proposed cST-ML. The experimental results verify that cST-ML can significantly improve the urban traffic prediction performance and outperform all baseline models especially when obvious traffic dynamics and temporal uncertainties are presented. 
    more » « less
  3. In the Continuous Steiner Tree problem (CST), we are given as input a set of points (called terminals) in a metric space and ask for the minimum-cost tree connecting them. Additional points (called Steiner points) from the metric space can be introduced as nodes in the solution. In the Discrete Steiner Tree problem (DST), we are given in addition to the terminals, a set of facilities, and any solution tree connecting the terminals can only contain the Steiner points from this set of facilities. Trevisan [SICOMP'00] showed that CST and DST are APX-hard when the input lies in the $$\ell_1$$-metric (and Hamming metric). Chleb\'ik and Chleb\'ikov\'a [TCS'08] showed that DST is NP-hard to approximate to factor of $$96/95\approx 1.01$$ in the graph metric (and consequently $$\ell_\infty$$-metric). Prior to this work, it was unclear if CST and DST are APX-hard in essentially every other popular metric. In this work, we prove that DST is APX-hard in every $$\ell_p$$-metric. We also prove that CST is APX-hard in the $$\ell_{\infty}$$-metric. Finally, we relate CST and DST, showing a general reduction from CST to DST in $$\ell_p$$-metrics. Comment: Abstract shortened. Journal version for TheoretiCS 
    more » « less
  4. This paper is about detecting incorrect arcs in a dependency parse for sentences that contain grammar mistakes. Pruning these arcs results in well-formed parse fragments that can still be useful for downstream applications. We propose two automatic methods that jointly parse the ungrammatical sentence and prune the incorrect arcs: a parser retrained on a parallel corpus of ungrammatical sentences with their corrections, and a sequence-to sequence method. Experimental results show that the proposed strategies are promising for detecting incorrect syntactic dependencies as well as incorrect semantic dependencies. 
    more » « less
  5. There are multiple scales of abstraction from which we can describe the same image, depending on whether we are focusing on fine-grained details or a more global attribute of the image. In brain mapping, learning to automatically parse images to build representations of both small-scale features (e.g., the presence of cells or blood vessels) and global properties of an image (e.g., which brain region the image comes from) is a crucial and open challenge. However, most existing datasets and benchmarks for neuroanatomy consider only a single downstream task at a time. To bridge this gap, we introduce a new dataset, annotations, and multiple downstream tasks that provide diverse ways to readout information about brain structure and architecture from the same image. Our multi-task neuroimaging benchmark (MTNeuro) is built on volumetric, micrometer-resolution X-ray microtomography images spanning a large thalamocortical section of mouse brain, encompassing multiple cortical and subcortical regions. We generated a number of different prediction challenges and evaluated several supervised and self-supervised models for brain-region prediction and pixel-level semantic segmentation of microstructures. Our experiments not only highlight the rich heterogeneity of this dataset, but also provide insights into how self-supervised approaches can be used to learn representations that capture multiple attributes of a single image and perform well on a variety of downstream tasks. Datasets, code, and pre-trained baseline models are provided at: https://mtneuro.github.io/ 
    more » « less