 Award ID(s):
 2047856
 NSFPAR ID:
 10352695
 Date Published:
 Journal Name:
 EEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
 Page Range / eLocation ID:
 5647 to 5651
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

Motivation: Intrasample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets. Results: We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that GTED and FGTED always underestimate the Earth Mover’s Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upperbound the deviation of FGTED from EMED and to improve FGTED so that it reduces the average error in empirically estimating the similarity between true string sets. On simulated Tcell receptor sequences and actual Hepatitis B virus genomes, we show that the diametercorrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%. Availability and implementation: Data and source code for reproducing the experiments are available at: https:// github.com/KingsfordGroup/gtedemedtest/.more » « less

Abstract Motivation Intrasample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets.
Results We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that GTED and FGTED always underestimate the Earth Mover’s Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upperbound the deviation of FGTED from EMED and to improve FGTED so that it reduces the average error in empirically estimating the similarity between true string sets. On simulated Tcell receptor sequences and actual Hepatitis B virus genomes, we show that the diametercorrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%.
Availability and implementation Data and source code for reproducing the experiments are available at: https://github.com/KingsfordGroup/gtedemedtest/.
Supplementary information Supplementary data are available at Bioinformatics online.

We consider the problem of designing sublinear time algorithms for estimating the cost of minimum] metric traveling salesman (TSP) tour. Specifically, given access to a n × n distance matrix D that specifies pairwise distances between n points, the goal is to estimate the TSP cost by performing only sublinear (in the size of D) queries. For the closely related problem of estimating the weight of a metric minimum spanning tree (MST), it is known that for any epsilon > 0, there exists an O^~(n/epsilon^O(1))time algorithm that returns a (1+epsilon)approximate estimate of the MST cost. This result immediately implies an O^~(n/epsilon^O(1)) time algorithm to estimate the TSP cost to within a (2 + epsilon) factor for any epsilon > 0. However, no o(n^2)time algorithms are known to approximate metric TSP to a factor that is strictly better than 2. On the other hand, there were also no known barriers that rule out existence of (1 + epsilon)approximate estimation algorithms for metric TSP with O^~ (n) time for any fixed epsilon > 0. In this paper, we make progress on both algorithms and lower bounds for estimating metric TSP cost. On the algorithmic side, we first consider the graphic TSP problem where the metric D corresponds to shortest path distances in a connected unweighted undirected graph. We show that there exists an O^~(n) time algorithm that estimates the cost of graphic TSP to within a factor of (2 − epsilon_0) for some epsilon_0 > 0. This is the first sublinear cost estimation algorithm for graphic TSP that achieves an approximation factor less than 2. We also consider another wellstudied special case of metric TSP, namely, (1, 2)TSP where all distances are either 1 or 2, and give an O^~(n ^ 1.5) time algorithm to estimate optimal cost to within a factor of 1.625. Our estimation algorithms for graphic TSP as well as for (1, 2)TSP naturally lend themselves to O^~(n) space streaming algorithms that give an 11/6approximation for graphic TSP and a 1.625approximation for (1, 2)TSP. These results motivate the natural question if analogously to metric MST, for any epsilon > 0, (1 + epsilon)approximate estimates can be obtained for graphic TSP and (1, 2)TSP using O^~ (n) queries. We answer this question in the negative – there exists an epsilon_0 > 0, such that any algorithm that estimates the cost of graphic TSP ((1, 2)TSP) to within a (1 + epsilon_0)factor, necessarily requires (n^2) queries. This lower bound result highlights a sharp separation between the metric MST and metric TSP problems. Similarly to many classical approximation algorithms for TSP, our sublinear time estimation algorithms utilize subroutines for estimating the size of a maximum matching in the underlying graph. We show that this is not merely an artifact of our approach, and that for any epsilon > 0, any algorithm that estimates the cost of graphic TSP or (1, 2)TSP to within a (1 + epsilon)factor, can also be used to estimate the size of a maximum matching in a bipartite graph to within an epsilon n additive error. This connection allows us to translate known lower bounds for matching size estimation in various models to similar lower bounds for metric TSP cost estimation.more » « less

A bstract The identification of interesting substructures within jets is an important tool for searching for new physics and probing the Standard Model at colliders. Many of these substructure tools have previously been shown to take the form of optimal transport problems, in particular the Energy Mover’s Distance (EMD). In this work, we show that the EMD is in fact the natural structure for comparing collider events, which accounts for its recent success in understanding event and jet substructure. We then present a Shape Hunting Algorithm using Parameterized Energy Reconstruction (S haper ), which is a general framework for defining and computing shapebased observables. S haper generalizes N jettiness from point clusters to any extended, parametrizable shape. This is accomplished by efficiently minimizing the EMD between events and parameterized manifolds of energy flows representing idealized shapes, implemented using the dualpotential Sinkhorn approximation of the Wasserstein metric. We show how the geometric language of observables as manifolds can be used to define novel observables with builtin infraredandcollinear safety. We demonstrate the efficacy of the S haper framework by performing empirical jet substructure studies using several examples of new shapebased observables.more » « less

Understanding generalization and robustness of machine learning models funda mentally relies on assuming an appropriate metric on the data space. Identifying such a metric is particularly challenging for nonEuclidean data such as graphs. Here, we propose a pseudometric for attributed graphs, the Tree Mover’s Distance (TMD), and study its relation to generalization. Via a hierarchical optimal transport problem, TMD reflects the local distribution of node attributes as well as the distri bution of local computation trees, which are known to be decisive for the learning behavior of graph neural networks (GNNs). First, we show that TMD captures properties relevant to graph classification: a simple TMDSVM performs competi tively with standard GNNs. Second, we relate TMD to generalization of GNNs under distribution shifts, and show that it correlates well with performance drop under such shifts.more » « less