skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on December 1, 2025

Title: Revisiting the complexity of and algorithms for the graph traversal edit distance and its variants
Abstract The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly. Ebrahimpour Boroojeny et al. (2018) propose two ILP formulations for GTED and claim that GTED is polynomially solvable because the linear programming relaxation of one of the ILPs always yields optimal integer solutions. The claim that GTED is polynomially solvable is contradictory to the complexity results of existing string-to-graph matching problems. We resolve this conflict in complexity results by proving that GTED is NP-complete and showing that the ILPs proposed by Ebrahimpour Boroojeny et al. do not solve GTED but instead solve for a lower bound of GTED and are not solvable in polynomial time. In addition, we provide the first two, correct ILP formulations of GTED and evaluate their empirical efficiency. These results provide solid algorithmic foundations for comparing genome graphs and point to the direction of heuristics. The source code to reproduce experimental results is available athttps://github.com/Kingsford-Group/gtednewilp/.  more » « less
Award ID(s):
2232121
PAR ID:
10572288
Author(s) / Creator(s):
; ;
Publisher / Repository:
Springer Nature
Date Published:
Journal Name:
Algorithms for Molecular Biology
Volume:
19
Issue:
1
ISSN:
1748-7188
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Motivation: Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets. Results: We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that GTED and FGTED always underestimate the Earth Mover’s Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upper-bound the deviation of FGTED from EMED and to improve FGTED so that it reduces the average error in empirically estimating the similarity between true string sets. On simulated T-cell receptor sequences and actual Hepatitis B virus genomes, we show that the diameter-corrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%. Availability and implementation: Data and source code for reproducing the experiments are available at: https:// github.com/Kingsford-Group/gtedemedtest/. 
    more » « less
  2. Abstract MotivationThe size of a genome graph—the space required to store the nodes, node labels and edges—affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. This raises the need for approaches to construct space-efficient genome graphs. ResultsWe point out similarities in the string encoding mechanisms of genome graphs and the external pointer macro (EPM) compression model. We present a pair of linear-time algorithms that transform between genome graphs and EPM-compressed forms. The algorithms result in an upper bound on the size of the genome graph constructed in terms of an optimal EPM compression. To further reduce the size of the genome graph, we propose the source assignment problem that optimizes over the equivalent choices during compression and introduce an ILP formulation that solves that problem optimally. As a proof-of-concept, we introduce RLZ-Graph, a genome graph constructed based on the relative Lempel–Ziv algorithm. Using RLZ-Graph, across all human chromosomes, we are able to reduce the disk space to store a genome graph on average by 40.7% compared to colored compacted de Bruijn graphs constructed by Bifrost under the default settings. The RLZ-Graph scales well in terms of running time and graph sizes with an increasing number of human genome sequences compared to Bifrost and variation graphs produced by VGtoolkit. AvailabilityThe RLZ-Graph software is available at: https://github.com/Kingsford-Group/rlzgraph. Supplementary informationSupplementary data are available at Bioinformatics online. 
    more » « less
  3. Abstract The polarFregion ionosphere frequently exhibits sporadic variability (e.g., Meek, 1949,https://doi.org/10.1029/JZ054i004p00339; Hill, 1963,https://doi.org/10.1175/1520‐0469(1963)020<0492:SEOLII>2.0.CO;2). Recent satellite data analysis (Noja et al., 2013,https://doi.org/10.1002/rds.20033; Chartier et al., 2018,https://doi.org/10.1002/2017JA024811) showed that the high‐latitudeFregion ionosphere exhibits sporadic enhancements more frequently in January than in July in both the northern and southern hemispheres. The same pattern has been seen in statistics of the degradation and total loss of GPS service onboard low‐Earth orbit satellites (Xiong et al. 2018,https://doi.org/10.5194/angeo‐36‐679‐2018). Here, we confirm the existence of this annual pattern using ground GPS‐based images of TEC from the MIDAS algorithm. Images covering January and July 2014 confirm that the high‐latitude (>70 MLAT)Fregion exhibits a substantially larger range of values in January than in July in both the northern and southern hemispheres. The range of TEC values observed in the polar caps is 38–57 TECU (north‐south) in January versus 25–37 TECU in July. First‐principle modeling using SAMI3 reproduces this pattern, and indicates that it is caused by an asymmetry in plasma levels (30% higher in January than in July across both polar caps), as well as 17% longer O+plasma lifetimes in northern hemisphere winter, compared to southern hemisphere winter. 
    more » « less
  4. Abstract A foundational assumption in paleomagnetism is that the Earth's magnetic field behaves as a geocentric axial dipole (GAD) when averaged over sufficient timescales. Compilations of directional data averaged over the past 5 Ma yield a distribution largely compatible with GAD, but the distribution of paleointensity data over this timescale is incompatible. Reasons for the failure of GAD include: (a) Arbitrary “selection criteria” to eliminate “unreliable” data vary among studies, so the paleointensity database may include biased results. (b) The age distribution of existing paleointensity data varies with latitude, so different latitudinal averages represent different time periods. (c) The time‐averaged field could be truly non‐dipolar. Here, we present a consistent methodology for analyzing paleointensity results and comparing time‐averaged paleointensities from different studies. We apply it to data from Plio/Pleistocene Hawai'ian igneous rocks, sampled from fine‐grained, quickly cooled material (lava flow tops, dike margins and scoria cones) and subjected to the IZZI‐Thellier technique; the data were analyzed using the Bias Corrected Estimation of Paleointensity method of Cych et al. (2021,https://doi.org/10.1029/2021GC009755), which produces accurate paleointensity estimates without arbitrarily excluding specimens from the analysis. We constructed a paleointensity curve for Hawai'i over the Plio/Pleistocene using the method of Livermore et al. (2018,https://doi.org/10.1093/gji/ggy383), which accounts for the age distribution of data. We demonstrate that even with the large uncertainties associated with obtaining a mean field from temporally sparse data, our average paleointensities obtained from Hawai'i and Antarctica (reanalyzed from Asefaw et al., 2021,https://doi.org/10.1029/2020JB020834) are not GAD‐like from 0 to 1.5 Ma but may be prior to that. 
    more » « less
  5. Abstract We report on the mountain top observation of three terrestrial gamma‐ray flashes (TGFs) that occurred during the summer storm season of 2021. To our knowledge, these are the first TGFs observed in a mountaintop environment and the first published European TGFs observed from the ground. A gamma‐ray sensitive detector was located at the base of the Säntis Tower in Switzerland and observed three unique TGF events with coincident radio sferic data characteristic of TGFs seen from space. We will show an example of a “slow pulse” radio signature (Cummer et al., 2011,https://doi.org/10.1029/2011GL048099; Lu et al., 2011,https://doi.org/10.1029/2010JA016141; Pu et al., 2019,https://doi.org/10.1029/2019GL082743; Pu et al., 2020,https://doi.org/10.1029/2020GL089427), a −EIP (Lyu et al., 2016,https://doi.org/10.1002/2016GL070154; Lyu et al., 2021,https://doi.org/10.1029/2021GL093627; Wada et al., 2020,https://doi.org/10.1029/2019JD031730), and a double peak TGF associated with an extraordinarily powerful and complicated positive‐polarity sferic, where each TGF peak is possibly preceded by a short burst of stepped leader emission. 
    more » « less