Coreset selection, a technique for compressing large datasets while preserving model performance, is crucial for modern machine learning. This paper presents a novel method for generating high-quality Wasserstein coresets using the Sinkhorn loss, a tool with notable computational advantages. Existing approaches, however, suffer from numerical instability in Sinkhorn's algorithm. We address this by proposing stable algorithms for computing and differentiating the Sinkhorn optimization problem, including an analytical formula for the derivative of the Sinkhorn loss and a rigorous stability analysis of our method. Extensive experiments demonstrate that our approach significantly outperforms existing methods in sample selection quality and computational efficiency, achieving a smaller Wasserstein distance.

Free, publicly accessible full text available February 1, 2026.
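The instability referred to above is the classical one: the scaling form of Sinkhorn's algorithm repeatedly multiplies entries of exp(-C/eps), which underflows or overflows as the regularization eps shrinks. A standard remedy is to iterate on the dual potentials in the log domain using log-sum-exp updates; relatedly, a standard envelope-theorem identity gives the gradient of the regularized cost OT_eps(a, b) with respect to the weights a as the optimal dual potential f (up to an additive constant), which is one route to an analytical derivative. The sketch below illustrates this generic log-domain stabilization; it is not the paper's specific algorithm, and the function name sinkhorn_log_domain is ours.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log_domain(C, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized OT via log-domain Sinkhorn iterations.

    C : (n, m) cost matrix; a : (n,) and b : (m,) probability weights.
    Updating the dual potentials f, g with log-sum-exp avoids the
    floating-point overflow/underflow of the classical scaling form
    when eps is small. Illustrative sketch, not the paper's method.
    """
    f = np.zeros_like(a)          # dual potential for the source weights
    g = np.zeros_like(b)          # dual potential for the target weights
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iters):
        # f_i = -eps * log sum_j b_j exp((g_j - C_ij) / eps), in log space
        f = -eps * logsumexp((g[None, :] - C) / eps + log_b[None, :], axis=1)
        # g_j = -eps * log sum_i a_i exp((f_i - C_ij) / eps), in log space
        g = -eps * logsumexp((f[:, None] - C) / eps + log_a[:, None], axis=0)
    # Optimal plan in log space, then the linear transport cost under it.
    log_P = (f[:, None] + g[None, :] - C) / eps + log_a[:, None] + log_b[None, :]
    P = np.exp(log_P)
    return float(np.sum(P * C)), P
```

The trade-off is a few logsumexp calls per iteration in exchange for robustness at small eps, which is precisely the regime where a coreset needs a sharp approximation of the Wasserstein distance.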
Generative models based on latent variables, such as generative adversarial networks (GANs) and variational auto-encoders (VAEs), have attracted considerable interest due to their impressive performance in many fields. However, many data types, such as natural images, do not populate the ambient Euclidean space but instead reside on a lower-dimensional manifold. An inappropriate choice of the latent dimension therefore fails to uncover the structure of the data, possibly resulting in mismatched latent representations and poor generative quality. To address these problems, we propose a novel framework called the latent Wasserstein GAN (LWGAN), which fuses the Wasserstein auto-encoder and the Wasserstein GAN so that the intrinsic dimension of the data manifold can be adaptively learned by a modified informative latent distribution. We prove that there exist an encoder network and a generator network such that the intrinsic dimension of the learned encoding distribution equals the dimension of the data manifold. We theoretically establish that our estimated intrinsic dimension is a consistent estimate of the true dimension of the data manifold. Meanwhile, we provide an upper bound on the generalization error of LWGAN, implying that we force the synthetic data distribution to be similar to the real data distribution from a population perspective. Comprehensive empirical experiments verify our framework and show that LWGAN is able to identify the correct intrinsic dimension under several scenarios, while simultaneously generating high-quality synthetic data by sampling from the learned latent distribution. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
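As a concrete illustration of how a latent distribution could adaptively learn an intrinsic dimension, the sketch below uses a Gaussian latent with learnable per-coordinate scales and counts coordinates whose scale stays above a tolerance as active. This mechanism, the class name AdaptiveLatent, and the tolerance are our assumptions for illustration only; the abstract does not specify LWGAN's actual construction of the modified informative latent distribution.

```python
import torch

class AdaptiveLatent(torch.nn.Module):
    """Hypothetical latent distribution with learnable per-coordinate scales.

    Coordinates whose learned scale collapses toward zero contribute
    almost nothing to samples, so the count of coordinates above a
    tolerance serves as an intrinsic-dimension estimate. Illustrative
    mechanism only, not LWGAN's exact construction.
    """

    def __init__(self, ambient_dim: int = 64):
        super().__init__()
        # One learnable log-scale per latent coordinate.
        self.log_scale = torch.nn.Parameter(torch.zeros(ambient_dim))

    def sample(self, n: int) -> torch.Tensor:
        # Reparameterized draw: scale standard Gaussian noise coordinate-wise.
        eps = torch.randn(n, self.log_scale.numel())
        return eps * self.log_scale.exp()

    def intrinsic_dim(self, tol: float = 1e-2) -> int:
        # Number of coordinates still carrying non-negligible variance.
        return int((self.log_scale.exp() > tol).sum().item())

latent = AdaptiveLatent(ambient_dim=64)
z = latent.sample(32)            # (32, 64) latent codes for a generator
print(latent.intrinsic_dim())    # active-coordinate count (64 at initialization)
```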