

Title: CSTs for Terabyte-Sized Data
Generating pangenomic datasets is becoming increasingly common, but there are still few tools able to handle them and even fewer accessible to non-specialists. Building compressed suffix trees (CSTs) for pangenomic datasets is still a major challenge but could be enormously beneficial to the community. In this paper, we present RePFP-CST, a scalable method for building CSTs. To accomplish this, we show how to build a CST directly from VCF files without decompressing them, and how to prune from the prefix-free parse (PFP) those phrase boundaries whose removal reduces the total size of the dictionary and the parse. We show that these improvements reduce the time and space required for the construction of the CST, and the memory footprint of the finished CST, enabling us to build a CST for a terabyte of DNA for the first time in the literature.
Award ID(s):
2029552
NSF-PAR ID:
10340624
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Data Compression Conference (DCC)
Page Range / eLocation ID:
93 to 102
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In recent years, pangenomes have received increasing attention from the scientific community for their ability to incorporate population variation information and alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven useful in multiple bioinformatic contexts, for example short-read alignment and multiple-genome alignment. However, standard techniques using suffix trees and FM-indexes do not scale to a pangenomic level. Recently, Gagie et al. [JACM 20] introduced the r-index, a Burrows-Wheeler Transform (BWT)-based index able to handle hundreds of human genomes. Later, Rossi et al. [JCB 22] enabled the computation of MEMs using the r-index, and Boucher et al. [DCC 21] showed how to compute them in a streaming fashion. In this paper, we show how to augment Boucher et al.'s approach to enable the computation of MUMs on the r-index, while preserving the space and time bounds. We add O(r) additional samples of the longest common prefix (LCP) array, where r is the number of equal-letter runs of the BWT. These samples permit the computation of the second longest match of the pattern suffix with respect to the input text, which in turn allows the computation of candidate MUMs. We implemented a proof-of-concept of our approach, which we call mum-phinder, and tested it on real-world datasets. We compared our approach with competing methods that are able to compute MUMs. We observe that our method is up to 8 times smaller but up to 19 times slower when the dataset is not highly repetitive, while on highly repetitive data it is up to 6.5 times slower and uses up to 25 times less memory.
  2. Modeling buildings' heat dynamics is a complex process that depends on various factors, including weather, building thermal capacity, insulation preservation, and residents' behavior. Gray-box models offer an explanation of those dynamics, expressed in a few parameters specific to built environments that can provide compelling insights into the characteristics of building artifacts. In this paper, we present a systematic study of Bayesian approaches to modeling buildings' parameters, and hence their thermal characteristics. We build a Bayesian state-space model that can adapt and incorporate buildings' thermal equations and postulate a generalized solution that can easily incorporate prior knowledge regarding the parameters. We then show that a faster approximate approach using Variational Inference for parameter estimation can yield parameter estimates similar to those of a more time-consuming Markov Chain Monte Carlo (MCMC) approach. We perform extensive evaluations on two datasets to understand the generative process and attest that the Bayesian approach is more interpretable. We further study the effects of prior selection on the model parameters and transfer learning, where we learn parameters from one season and reuse them to fit the model in other seasons. We perform extensive evaluations on controlled and real data traces to enumerate buildings' parameters within a 95% credible interval.
  3. With recent advancements, large language models (LLMs) such as ChatGPT and Bard have shown the potential to disrupt many industries, from customer service to healthcare. Traditionally, humans interact with geospatial data through software (e.g., ArcGIS 10.3) and programming languages (e.g., Python). As a pioneer study, we explore the possibility of using an LLM as an interface to interact with geospatial datasets through natural language. To achieve this, we propose a framework to (1) train an LLM to understand the datasets, (2) generate geospatial SQL queries based on a natural language question, (3) send the SQL query to the backend database, and (4) parse the database response back to human language. As a proof of concept, a case study was conducted on real-world data to evaluate its performance on various queries. The results show that LLMs can be accurate in generating SQL code for most cases, including spatial joins, although there is still room for improvement. As all geospatial data can be stored in a spatial database, we hope that this framework can serve as a proxy to improve the efficiency of spatial data analyses and unlock the possibility of automated geospatial analytics.

     
  4. Zooplankton plays a major role in ocean food webs and biogeochemical cycles, and provides major ecosystem services as a main driver of the biological carbon pump and in sustaining fish communities. Zooplankton is also sensitive to its environment and reacts to its changes. To better understand the importance of zooplankton, and to inform prognostic models that try to represent them, spatially-resolved biomass estimates of key plankton taxa are desirable. In this study we predict, for the first time, the global biomass distribution of 19 zooplankton taxa (1-50 mm Equivalent Spherical Diameter) using observations with the Underwater Vision Profiler 5, a quantitative in situ imaging instrument. After classification of 466,872 organisms from more than 3,549 profiles (0-500 m) obtained between 2008 and 2019 throughout the globe, we estimated their individual biovolumes and converted them to biomass using taxa-specific conversion factors. We then associated these biomass estimates with climatologies of environmental variables (temperature, salinity, oxygen, etc.), to build habitat models using boosted regression trees. The results reveal maximal zooplankton biomass values around 60°N and 55°S as well as minimal values around the oceanic gyres. An increased zooplankton biomass is also predicted for the equator. Global integrated biomass (0-500 m) was estimated at 0.403 PgC. It was largely dominated by Copepoda (35.7%, mostly in polar regions), followed by Eumalacostraca (26.6%) and Rhizaria (16.4%, mostly in the intertropical convergence zone). The machine learning approach used here is sensitive to the size of the training set and generates reliable predictions for abundant groups such as Copepoda (R² ≈ 20-66%) but not for rare ones (Ctenophora, Cnidaria, R² < 5%). Still, this study offers a first protocol to estimate global, spatially resolved zooplankton biomass and community composition from in situ imaging observations of individual organisms.
The underlying dataset covers a period of 10 years while approaches that rely on net samples utilized datasets gathered since the 1960s. Increased use of digital imaging approaches should enable us to obtain zooplankton biomass distribution estimates at basin to global scales in shorter time frames in the future. 
  5. In northwest Florida, advanced manufacturing (AM) jobs far outpace the middle-skilled technician workforce, though AM constitutes almost a quarter of the region’s total employment. From 2018-2028, of the available 4.6 million manufacturing jobs, less than half are likely to be filled due to talent shortages. This widening “skills gap” is attributed to many factors, ranging from new technologies in the AM industry (e.g., artificial intelligence, robotics) to a need for newer recruiting methods, branding, and incentives in AM educational programs. Some professionals have even indicated that manufacturing industries and AM educational programs should be aligned more closely to reflect the needs of the industry. Even in the wake of Covid-19, when over 700,000 manufacturing jobs have been lost due to market conditions, many states still have jobs that go unfilled, further suggesting that there are challenges in filling AM technician positions. In a time when technicians in AM are in high demand and the number of graduates is in low supply, it is critical to identify whether AM education is meeting the needs of new professionals in the workforce and what they believe can be improved in these programs. This is especially true in rural locales, where economies with manufacturing industries are much more reliant on them. In the context of an NSF Advanced Technological Education (ATE), through a multi-method approach, we sought to understand: 1) Which AM competencies did participants report as benefiting them in gaining employment? 2) Which competencies are needed on the job to be a successful AM technician? 3) What are the ways in which AM preparation can be improved to enhance employment outcomes?
This study’s results will expand the research base and curriculum content recommendations for regional AM education, as well as build regional capacity for AM program assessment and improvement by replicating, refining, and disseminating study approaches through further research, annual AM employer and educator meetings, and annual research skill-building academies in which stakeholders transfer research findings to practices and policies that empower rural NW Florida colleges. To date, research efforts have demonstrated that the competency perceptions of faculty, employers, and new professionals are notably misaligned, which presents opportunities for AM program curriculum revision and enhancement. This paper summarizes five years of research output, emphasizing the impactful findings and dissemination products for ASEE community members, as well as opportunities for further research.