Many deep generative models, such as variational autoencoders (VAEs) and generative adversarial networks (GANs), learn an immersion mapping from a standard normal distribution in a low-dimensional latent space into a higher-dimensional data space. As such, these mappings are only capable of producing simple data topologies, i.e., those equivalent to an immersion of Euclidean space. In this work, we demonstrate the limitations of such latent space generative models when trained on data distributions with non-trivial topologies. We do this by training these models on synthetic image datasets with known topologies (spheres, torii, etc.). We then show how this results in failures of both data generation as well as data interpolation. Next, we compare this behavior to two classes of deep generative models that in principle allow for more complex data topologies. First, we look at chart autoencoders (CAEs), which construct a smooth data manifold from multiple latent space chart mappings. Second, we explore score-based models, e.g., denoising diffusion probabilistic models, which estimate gradients of the data distribution without resorting to an explicit mapping to a latent space. Our results show that these models do demonstrate improved ability over latent space models in modeling data distributions with complex topologies, however, challenges still remain.
more »
« less
Latent generative landscapes as maps of functional diversity in protein sequence space
Abstract Variational autoencoders are unsupervised learning models with generative capabilities, when applied to protein data, they classify sequences by phylogeny and generate de novo sequences which preserve statistical properties of protein composition. While previous studies focus on clustering and generative features, here, we evaluate the underlying latent manifold in which sequence information is embedded. To investigate properties of the latent manifold, we utilize direct coupling analysis and a Potts Hamiltonian model to construct a latent generative landscape. We showcase how this landscape captures phylogenetic groupings, functional and fitness properties of several systems including Globins,β-lactamases, ion channels, and transcription factors. We provide support on how the landscape helps us understand the effects of sequence variability observed in experimental data and provides insights on directed and natural protein evolution. We propose that combining generative properties and functional predictive power of variational autoencoders and coevolutionary analysis could be beneficial in applications for protein engineering and design.
more »
« less
- Award ID(s):
- 1943442
- PAR ID:
- 10474055
- Publisher / Repository:
- Nature Publishing Group
- Date Published:
- Journal Name:
- Nature Communications
- Volume:
- 14
- Issue:
- 1
- ISSN:
- 2041-1723
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Liu, Jin (Ed.)Generative models rely on the idea that data can be represented in terms of latent variables which are uncorrelated by definition. Lack of correlation among the latent variable support is important because it suggests that the latent-space manifold is simpler to understand and manipulate than the real-space representation. Many types of generative model are used in deep learning,e.g., variational autoencoders (VAEs) and generative adversarial networks (GANs). Based on the idea that the latent space behaves like a vector space Radford et al. (2015), we ask whether we can expand the latent space representation of our data elements in terms of an orthonormal basis set. Here we propose a method to build a set of linearly independent vectors in the latent space of a trained GAN, which we call quasi-eigenvectors. These quasi-eigenvectors have two key properties: i) They span the latent space, ii) A set of these quasi-eigenvectors map to each of the labeled features one-to-one. We show that in the case of the MNIST image data set, while the number of dimensions in latent space is large by design, 98% of the data in real space map to a sub-domain of latent space of dimensionality equal to the number of labels. We then show how the quasi-eigenvectors can be used for Latent Spectral Decomposition (LSD). We apply LSD to denoise MNIST images. Finally, using the quasi-eigenvectors, we construct rotation matrices in latent space which map to feature transformations in real space. Overall, from quasi-eigenvectors we gain insight regarding the latent space topology.more » « less
-
Abstract Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict mutation effects. Despite encouraging results, current model evaluation metrics leave unclear whether GPSMs faithfully reproduce the complex multi-residue mutational patterns observed in natural sequences due to epistasis. Here, we develop a set of sequence statistics to assess the “generative capacity” of three current GPSMs: the pairwise Potts Hamiltonian, the VAE, and the site-independent model. We show that the Potts model’s generative capacity is largest, as the higher-order mutational statistics generated by the model agree with those observed for natural sequences, while the VAE’s lies between the Potts and site-independent models. Importantly, our work provides a new framework for evaluating and interpreting GPSM accuracy which emphasizes the role of higher-order covariation and epistasis, with broader implications for probabilistic sequence models in general.more » « less
-
Deep Generative Networks (DGNs) are extensively employed in Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and their variants to approximate the data manifold and distribution. However, training samples are often distributed non-uniformly on the manifold, due to the cost or convenience of collection. For example, the CelebA dataset contains a large fraction of smiling faces. These inconsistencies will be reproduced when sampling from the trained DGN, which is not always preferred, e.g., for fairness or data augmentation. In response, we develop MaGNET, a novel and theoretically motivated latent space sampler for any pre-trained DGN that produces samples uniformly distributed on the learned manifold. We perform a range of experiments on several datasets and DGNs, e.g., for the state-of-the-art StyleGAN2 trained on the FFHQ dataset, uniform sampling via MaGNET increases distribution precision by 4.1% and recall by 3.0% and decreases gender bias by 41.2%, without requiring labels or retraining. Since uniform sample distribution does not imply uniform semantic distribution, we also explore how semantic attributes of generated samples vary under MaGNET sampling. Colab and codes at bit.ly/magnet-samplingmore » « less
-
We develop data-driven methods incorporating geometric and topological information to learn parsimonious representations of nonlinear dynamics from observations. The approaches learn nonlinear state-space models of the dynamics for general manifold latent spaces using training strategies related to Variational Autoencoders (VAEs). Our methods are referred to as Geometric Dynamic (GD) Variational Autoencoders (GD-VAEs). We learn encoders and decoders for the system states and evolution based on deep neural network architectures that include general Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and other architectures. Motivated by problems arising in parameterized PDEs and physics, we investigate the performance of our methods on tasks for learning reduced dimensional representations of the nonlinear Burgers Equations, Constrained Mechanical Systems, and spatial fields of Reaction-Diffusion Systems. GD-VAEs provide methods that can be used to obtain representations in manifold latent spaces for diverse learning tasks involving dynamics.more » « less
An official website of the United States government

