In modern relational machine learning it is common to encounter large graphs that arise via interactions or similarities between observations in many domains. Further, in many cases the target entities for analysis are actually signals on such graphs. We propose to compare and organize such datasets of graph signals by using an earth mover’s distance (EMD) with a geodesic cost over the underlying graph. Typically, EMD is computed by optimizing over the cost of transporting one probability distribution to another over an underlying metric space. However, this is inefficient when computing the EMD between many signals. Here, we propose an unbalanced graph EMD that efficiently embeds the unbalanced EMD on an underlying graph into an L1 space, whose metric we call unbalanced diffusion earth mover’s distance (UDEMD). Next, we show how this gives distances between graph signals that are robust to noise. Finally, we apply this to organizing patients based on clinical notes, embedding cells modeled as signals on a gene graph, and organizing genes modeled as signals over a large cell graph. In each case, we show that UDEMD-based embeddings find accurate distances that are highly efficient compared to other methods.
more »
« less
A parallel method for earth mover’s distance
We propose a new algorithm to approximate the Earth Mover's distance (EMD). Our main idea is motivated by the theory of optimal transport, in which EMD can be reformulated as a familiar L1 type minimization. We use a regularization which gives us a unique solution for this L1 type problem. The new regularized minimization is very similar to problems which have been solved in the fields of compressed sensing and image processing, where several fast methods are available. In this paper, we adopt a primal-dual algorithm designed there, which uses very simple updates at each iteration and is shown to converge very rapidly. Several numerical examples are provided.
more »
« less
- Award ID(s):
- 1700202
- PAR ID:
- 10090356
- Date Published:
- Journal Name:
- Journal of scientific computing
- Volume:
- 75
- Issue:
- 1
- ISSN:
- 0885-7474
- Page Range / eLocation ID:
- 182-192
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
The identification of interesting substructures within jets is an important tool for searching for new physics and probing the Standard Model at colliders. Many of these substructure tools have previously been shown to take the form of optimal transport problems, in particular the Energy Mover’s Distance (EMD). In this work, we show that the EMD is in fact the natural structure for comparing collider events, which accounts for its recent success in understanding event and jet substructure. We then present a Shape Hunting Algorithm using Parameterized Energy Reconstruction (Shaper), which is a general framework for defining and computing shape-based observables. Shaper generalizes N-jettiness from point clusters to any extended, parametrizable shape. This is accomplished by efficiently minimizing the EMD between events and parameterized manifolds of energy flows representing idealized shapes, implemented using the dual-potential Sinkhorn approximation of the Wasserstein metric. We show how the geometric language of observables as manifolds can be used to define novel observables with built-in infrared-and-collinear safety. We demonstrate the efficacy of the Shaper framework by performing empirical jet substructure studies using several examples of new shape-based observables.more » « less
-
Different to traditional clustering methods that deal with one single type of data, High-Order Co- Clustering (HOCC) aims to cluster multiple types of data simultaneously by utilizing the inter- or/and intra-type relationships across different data types. In existing HOCC methods, data points routinely enter the objective functions with squared residual errors. As a result, outlying data samples can dominate the objective functions, which may lead to incorrect clustering results. Moreover, existing methods usually suffer from soft clustering, where the probabilities to different groups can be very close. In this paper, we propose an L1 -norm symmetric nonnegative matrix tri-factorization method to solve the HOCC problem. Due to the orthogonal constraints and the symmetric L1 -norm formulation in our new objective, conventional auxiliary function approach no longer works. Thus we derive the solution algorithm using the alternating direction method of multipliers. Extensive experiments have been conducted on a real world data set, in which promising empirical results, including less time consumption, strictly orthogonal membership matrix, lower local minima etc., have demonstrated the effectiveness of our proposed method.more » « less
-
Recovering sparse conditional independence graphs from data is a fundamental problem in machine learning with wide applications. A popular formulation of the problem is an L1 regularized maximum likelihood estimation. Many convex optimization algorithms have been designed to solve this formulation to recover the graph structure. Recently, there is a surge of interest to learn algorithms directly based on data, and in this case, learn to map empirical covariance to the sparse precision matrix. However, it is a challenging task in this case, since the symmetric positive definiteness (SPD) and sparsity of the matrix are not easy to enforce in learned algorithms, and a direct mapping from data to precision matrix may contain many parameters. We propose a deep learning architecture, GLAD, which uses an Alternating Minimization (AM) algorithm as our model inductive bias, and learns the model parameters via supervised learning. We show that GLAD learns a very compact and effective model for recovering sparse graphs from data.more » « less
-
The application of empirical mode decomposition (EMD) in the analysis and processing of lightning electric field waveforms acquired by the low-frequency e-field detection array (LFEDA) in China has significantly improved the capabilities of the low-frequency/very-low-frequency (LF/VLF) time-of-arrival technique for studying the lightning discharge processes. However, the inherent mode mixing and the endpoint effect of EMD lead to certain problems, such as an inadequate noise reduction capability, the incorrect matching of multistation waveforms, and the inaccurate extraction of pulse information, which limit the further development of the LFEDA's positioning ability. To solve these problems, the advanced ensemble EMD (EEMD) technique is introduced into the analysis of LF/VLF lightning measurements, and a double-sided bidirectional mirror (DBM) extension method is proposed to overcome the endpoint effect of EMD. EEMD can effectively suppress mode mixing, and the DBM extension method proposed in this article can effectively suppress the endpoint effect, thus greatly improving the accuracy of a simulated signal after a 25-500-kHz bandpass filter. The resulting DBM_EEMD algorithm can be used in the LFEDA system to process and analyze the detected electric field signals to improve the system's lightning location capabilities, especially in terms of accurate extraction and location of weak signals from lightning discharges. In this article, a 3-D image of artificially triggered lightning obtained from an LF/VLF location system is reported for the first time, and methods for further improving the location capabilities of the LF/VLF lightning detection systems are discussed.more » « less
An official website of the United States government

