Title: Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2
Abstract The de Bruijn graph is a key data structure in modern computational genomics, and construction of its compacted variant resides upstream of many genomic analyses. As the quantity of genomic data grows rapidly, this often forms a computational bottleneck. We present Cuttlefish 2, significantly advancing the state-of-the-art for this problem. On a commodity server, it reduces the graph construction time for 661K bacterial genomes, of size 2.58 Tbp, from 4.5 days to 17–23 h; and it constructs the graph for 1.52 Tbp of white spruce reads in approximately 10 h, while the closest competitor requires 54–58 h, using considerably more memory.
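To make the object concrete, here is a minimal single-threaded Python sketch of what "compaction" means: k-mers become vertices, consecutive k-mers share an edge, and maximal non-branching paths are stitched into strings called unitigs. This is an illustration only, not Cuttlefish 2's algorithm, and for simplicity it ignores reverse complements and isolated cycles.

```python
from collections import defaultdict

def compacted_dbg(sequences, k):
    """Unitigs (maximal non-branching paths) of a node-centric de Bruijn graph."""
    succ, pred, kmers = defaultdict(set), defaultdict(set), set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
            if i + k < len(seq):  # consecutive k-mers are joined by an edge
                a, b = seq[i:i + k], seq[i + 1:i + 1 + k]
                succ[a].add(b)
                pred[b].add(a)

    def internal_edge(u, v):
        # u -> v lies inside a unitig iff it is u's only exit and v's only entry
        return len(succ[u]) == 1 and len(pred[v]) == 1

    unitigs = []
    for s in kmers:
        # start a walk only where it cannot be extended to the left
        if len(pred[s]) == 1 and internal_edge(next(iter(pred[s])), s):
            continue
        path = [s]
        while len(succ[path[-1]]) == 1:
            nxt = next(iter(succ[path[-1]]))
            if not internal_edge(path[-1], nxt):
                break
            path.append(nxt)
        # spell the unitig: first k-mer, then one new base per step
        unitigs.append(path[0] + "".join(p[-1] for p in path[1:]))
    return unitigs

print(sorted(compacted_dbg(["ACGTTAC", "GTTACGA"], k=3)))  # ['CGA', 'CGTTACG']
```

Real tools must do this with hashing and a few bits of state per vertex rather than Python dictionaries of strings, since the inputs above run to terabases.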
Award ID(s):
2029424 1763680
PAR ID:
10370769
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Genome Biology
Volume:
23
Issue:
1
ISSN:
1474-760X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1.
    Abstract Motivation The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem. Results We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata’s state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 h, using ∼29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 h, using ∼84 GB of memory. The only other tool completing these tasks on the hardware took over 23 h using ∼126 GB of memory, and over 16 h using ∼289 GB of memory, respectively. Availability and implementation Cuttlefish is implemented in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish. Supplementary information Supplementary data are available at Bioinformatics online. 
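A rough sketch of the automaton idea above, under loose assumptions (plain strings, no reverse complements): each k-mer vertex keeps one small state per side that is either unseen, a single observed extension base, or branching. After one pass over the input, vertices with a unique extension on both sides are internal to some maximal unitig, and every other vertex terminates one. The actual tool packs such states into a few bits per vertex; everything below is illustrative, not Cuttlefish's API.

```python
from collections import defaultdict

UNSEEN, MULTI = None, "*"   # per-side states besides a single concrete base

def observe(state, base):
    """Fold one observed extension base into a per-side state."""
    if state is UNSEEN:
        return base
    return state if state == base else MULTI

def classify_vertices(sequences, k):
    # state[kmer] = [left-extension state, right-extension state]
    state = defaultdict(lambda: [UNSEEN, UNSEEN])
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            st = state[seq[i:i + k]]
            if i > 0:                      # base just before this occurrence
                st[0] = observe(st[0], seq[i - 1])
            if i + k < len(seq):           # base just after this occurrence
                st[1] = observe(st[1], seq[i + k])
    # a unique extension on both sides means the vertex is internal to some
    # maximal unitig; all other vertices are unitig endpoints
    return {km for km, (l, r) in state.items()
            if l not in (UNSEEN, MULTI) and r not in (UNSEEN, MULTI)}

print(classify_vertices(["ACGTTAC"], k=3))  # {'CGT', 'GTT', 'TTA'} (set order may vary)
```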
  2. In many applications of graph processing, the input data is often generated from an underlying geometric point data set. However, existing high-performance graph processing frameworks assume that the input data is given as a graph. Therefore, to use these frameworks, the user must write or use external programs based on computational geometry algorithms to convert their point data set to a graph, which requires more programming effort and can also lead to performance degradation. In this paper, we present our ongoing work on the GeoGraph framework for shared-memory multicore machines, which seamlessly supports routines for parallel geometric graph construction and parallel graph processing within the same environment. GeoGraph supports graph construction based on k-nearest neighbors, Delaunay triangulation, and β-skeleton graphs. It can then pass these generated graphs to over 25 graph algorithms. GeoGraph contains high-performance parallel primitives and algorithms implemented in C++, and includes a Python interface. We present four examples of using GeoGraph, and some experimental results showing good parallel speedups and improvements over the Higra library. We conclude with a vision of future directions for research in bridging graph and geometric data processing.
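The pattern GeoGraph targets, building a graph from a point set and processing it in the same environment with no external conversion step, can be sketched as follows with a brute-force k-nearest-neighbor construction followed by connected components. GeoGraph itself does this with parallel C++ kernels behind a Python interface; the function names here are illustrative, not its API.

```python
import math

def knn_graph(points, k):
    """Symmetrized k-NN graph as an adjacency list over point indices."""
    n = len(points)
    adj = [set() for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            adj[i].add(j)
            adj[j].add(i)   # make the graph undirected
    return adj

def connected_components(adj):
    """Label each vertex with a component id via iterative DFS."""
    comp, cid = [-1] * len(adj), 0
    for s in range(len(adj)):
        if comp[s] != -1:
            continue
        stack, comp[s] = [s], cid
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if comp[v] == -1:
                    comp[v] = cid
                    stack.append(v)
        cid += 1
    return comp

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(connected_components(knn_graph(points, k=1)))  # [0, 0, 1, 1]
```

A production framework replaces the quadratic neighbor search with spatial data structures and parallelizes both phases, but the end-to-end shape of the computation is the same.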
  3. Abstract Biologists represent data in visual forms, such as graphs, to aid data analysis and communication. However, students struggle to construct effective graphs. Although some studies explore these difficulties, we lack a comprehensive framework of the knowledge and skills needed to construct graphs in biology. In the present article, we describe the development of the Graph Construction Competency Model for Biology (GCCM-Bio), a framework of the components and activities associated with graph construction. We identified four broad knowledge areas for graph construction in biology: data selection, data exploration, graph assembly, and graph reflection. Under each area, we identified activities undertaken when constructing graphs of biological data and refined the GCCM-Bio through focus groups with experts in biology and statistics education. We also ran a scoping literature review to verify that these activities were represented in the graphing literature. The GCCM-Bio could support instructors, curriculum developers, and researchers when designing instruction and assessment of biology graph construction. 
  4. Abstract Electric vehicles (EVs) have emerged as an environmentally friendly alternative to conventional fuel vehicles. Lithium-ion batteries are the major energy source for EVs, but they degrade under dynamic operating conditions. Accurate estimation of battery state of health is important for sustainability: it quantifies battery condition, informs reuse decisions, and helps mitigate capacity degradation, which ultimately affects battery lifespan and energy efficiency. In this paper, a self-attention graph neural network combined with long short-term memory (LSTM) is introduced to exploit temporal and spatial dependencies in battery data. The LSTM layer uses a sliding window to extract temporal dependencies in the battery health factors. Two approaches to the graph construction layer are then developed: health-factor-based graphs, which emphasize the interconnections between individual health factors, and window-based graphs, which exploit temporal features more deeply. The self-attention mechanism is used to compute the adjacency weight matrix, which measures the strength of interactions between nodes in the graph. The impact of the two graph structures on model performance is discussed, and the accuracy and computational cost of the proposed model are compared with those of standalone LSTM and gated recurrent unit (GRU) models.
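The self-attention step described above can be sketched in a few lines of NumPy: node features (one node per health factor, or one per window) are projected to queries and keys, and a row-wise softmax of their scaled dot products yields the learned adjacency weight matrix. Shapes and names below are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

def attention_adjacency(X, Wq, Wk):
    """X: (n_nodes, d_in) node features; returns an (n_nodes, n_nodes) weight matrix."""
    Q, K = X @ Wq, X @ Wk                         # (n, d_k) queries and keys
    scores = Q @ K.T / np.sqrt(K.shape[1])        # scaled dot-product attention
    scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
    A = np.exp(scores)
    return A / A.sum(axis=1, keepdims=True)       # each row sums to 1

rng = np.random.default_rng(0)
n_nodes, d_in, d_k = 5, 8, 4                      # e.g. 5 battery health factors
X = rng.standard_normal((n_nodes, d_in))
Wq = rng.standard_normal((d_in, d_k))             # learned in practice
Wk = rng.standard_normal((d_in, d_k))             # learned in practice
A = attention_adjacency(X, Wq, Wk)
print(A.shape, A.sum(axis=1))                     # (5, 5), rows sum to 1
```

A graph layer would then propagate node features along A before a regression head estimates state of health.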
  5. Abstract Over the past decade, there has been a rapid increase in the development of predictive models at the intersection of molecular ecology, genomics, and global change. The common goal of these ‘genomic forecasting’ models is to integrate genomic data with environmental and ecological data in a model to make quantitative predictions about the vulnerability of populations to climate change. Despite rapid methodological development and the growing number of systems in which genomic forecasts are made, the forecasts themselves are rarely evaluated in a rigorous manner with ground-truth experiments. This study reviews the evaluation experiments that have been done, introduces important terminology regarding the evaluation of genomic forecasting models, and discusses important elements in the design and reporting of ground-truth experiments. To date, experimental evaluations of genomic forecasts have found high variation in the accuracy of forecasts, but it is difficult to compare studies on a common ground due to different approaches and experimental designs. Additionally, some evaluations may be biased toward higher performance because training data and testing data are not independent. In addition to independence between training data and testing data, important elements in the design of an evaluation experiment include the construction and parameterization of the forecasting model, the choice of fitness proxies to measure for test data, the construction of the evaluation model, the choice of evaluation metric(s), the degree of extrapolation to novel environments or genotypes, and the sensitivity, uncertainty and reproducibility of forecasts. Although genomic forecasting methods are becoming more accessible, evaluating their limitations in a particular study system requires careful planning and experimentation. Meticulously designed evaluation experiments can clarify the robustness of the forecasts for application in management. Clear reporting of basic elements of experimental design will improve the rigour of evaluations, and in turn our understanding of why models work in some cases and not others.