Title: Infinity Learning: Learning Markov Chains from Aggregate Steady-State Observations
We consider the task of learning a parametric Continuous Time Markov Chain (CTMC) sequence model without examples of sequences, where the training data consists entirely of aggregate steady-state statistics. Making the problem harder, we assume that the states we wish to predict are unobserved in the training data. Specifically, given a parametric model over the transition rates of a CTMC and some known transition rates, we wish to extrapolate its steady-state distribution to states that are unobserved. A technical roadblock to learning a CTMC from its steady state has been that the chain rule used to compute gradients does not work over the arbitrarily long sequences needed to reach the steady state from which the aggregate statistics are sampled. To overcome this optimization challenge, we propose ∞-SGD, a principled stochastic gradient descent method that uses randomly-stopped estimators to avoid the infinite sums required by the steady-state computation, while learning even when only a subset of the CTMC states can be observed. We apply ∞-SGD to a real-world testbed and to synthetic experiments, showcasing its accuracy, its ability to extrapolate the steady-state distribution to unobserved states under unobserved conditions (heavy loads, when trained under light loads), and its success in difficult scenarios where even a tailor-made extension of existing methods fails.
Award ID(s): 1717493, 1738981, 1943364
NSF-PAR ID: 10182363
Journal Name: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 34
Issue: 04
ISSN: 2159-5399
Page Range / eLocation ID: 3922 to 3929
Format(s): Medium: X
Sponsoring Org: National Science Foundation
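To make the randomly-stopped estimator idea in the abstract above concrete, here is a minimal, illustrative sketch (not the paper's ∞-SGD, which additionally differentiates through the model parameters): a Russian-roulette truncation of the telescoping power-iteration series yields an unbiased estimate of a CTMC's steady-state distribution without evaluating any infinite sum. The helper names, the uniformization constant, and the toy generator matrix are assumptions made for illustration only.

```python
import numpy as np

def uniformize(Q):
    """Embed the CTMC generator Q into a DTMC: P = I + Q / lam for some lam >= max_i |q_ii|."""
    lam = 1.1 * np.max(-np.diag(Q))
    return np.eye(Q.shape[0]) + Q / lam

def randomly_stopped_steady_state(Q, pi0, q=0.9, n_samples=2000, seed=0):
    """Unbiased Russian-roulette estimate of pi_inf = lim_n pi0 P^n, using the
    telescoping series pi_inf = pi0 + sum_{n>=1} (pi0 P^n - pi0 P^{n-1}) truncated
    at a geometric stopping time N with P(N >= n) = q**(n-1)."""
    rng = np.random.default_rng(seed)
    P = uniformize(Q)
    total = np.zeros_like(pi0, dtype=float)
    for _ in range(n_samples):
        N = rng.geometric(1.0 - q)            # random truncation point, support {1, 2, ...}
        est, v, surv = pi0.astype(float), pi0.astype(float), 1.0
        for n in range(1, N + 1):
            v_next = v @ P
            est = est + (v_next - v) / surv   # reweight the n-th increment by 1 / P(N >= n)
            v, surv = v_next, surv * q
        total += est
    return total / n_samples

# Toy 3-state birth-death CTMC; rows of the generator sum to zero.
Q = np.array([[-1.0,  1.0,  0.0],
              [ 2.0, -3.0,  1.0],
              [ 0.0,  2.0, -2.0]])
pi0 = np.array([1.0, 0.0, 0.0])
print(randomly_stopped_steady_state(Q, pi0))

# Reference: exact steady state from pi Q = 0 with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(3)])
print(np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)[0])
```

The estimator is unbiased because each increment is divided by its survival probability, so terms beyond the random stopping point contribute their expected value even though they are never computed.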
More Like this
  1. Abstract

    Many-body dynamical models in which Boltzmann statistics can be derived directly from the underlying dynamical laws without invoking the fundamental postulates of statistical mechanics are scarce. Interestingly, one such model is found in econophysics and in chemistry classrooms: the money game, in which players exchange money randomly in a process that resembles elastic intermolecular collisions in a gas, giving rise to the Boltzmann distribution of money owned by each player. Although this model offers a pedagogical example that demonstrates the origins of Boltzmann statistics, such demonstrations usually rely on computer simulations. In fact, a proof of the exponential steady-state distribution in this model has only become available in recent years. Here, we study this random money/energy exchange model and its extensions using a simple mean-field-type approach that examines the properties of the one-dimensional random walk performed by one of its participants. We give a simple derivation of the Boltzmann steady-state distribution in this model. Breaking the time-reversal symmetry of the game by modifying its rules results in non-Boltzmann steady-state statistics. In particular, introducing ‘unfair’ exchange rules, in which a poorer player is more likely to give money to a richer player than to receive money from that richer player, results in an analytically provable Pareto-type power-law distribution of the money in the limit where the number of players is infinite, with a finite fraction of players in the ‘ground state’ (i.e. with zero money). For a finite number of players, however, the game may give rise to a bimodal distribution of money and to bistable dynamics, in which a participant’s wealth jumps between poor and rich states. The latter corresponds to a scenario where the player accumulates nearly all the available money in the game. The time evolution of a player’s wealth in this case can be thought of as a ‘chemical reaction’, where a transition between ‘reactants’ (rich state) and ‘products’ (poor state) involves crossing a large free energy barrier. We thus analyze the trajectories generated from the game using ideas from the theory of transition paths and highlight non-Markovian effects in the barrier-crossing dynamics.
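    The exchange dynamics described above are easy to reproduce numerically. Below is a minimal simulation sketch, assuming the common unit-transfer variant of the game (a randomly chosen player hands one unit of money to another random player whenever the giver is not broke); the parameters and the comparison law are illustrative, not taken from the paper.

```python
import numpy as np

# Minimal simulation of the unit-transfer money game: at each step a random
# "giver" hands one unit of money to a random "receiver" if the giver has money.
# The long-run wealth distribution approaches the geometric (discrete Boltzmann)
# law with the same mean.
rng = np.random.default_rng(0)
n_players, mean_money, n_steps = 1000, 10, 500_000
money = np.full(n_players, mean_money)

for _ in range(n_steps):
    i, j = rng.integers(n_players, size=2)
    if i != j and money[i] > 0:
        money[i] -= 1
        money[j] += 1

values, counts = np.unique(money, return_counts=True)
empirical = counts / n_players
boltzmann = (1 / (mean_money + 1)) * (mean_money / (mean_money + 1)) ** values
print(np.round(empirical[:6], 3))   # empirical P(m) for the smallest observed m
print(np.round(boltzmann[:6], 3))   # geometric/Boltzmann prediction with the same mean
```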

     
  2. Batch Normalization (BN) is essential to effectively train state-of-the-art deep Convolutional Neural Networks (CNN). It normalizes the layer outputs during training using the statistics of each mini-batch. BN accelerates the training procedure by allowing large learning rates to be used safely and alleviates the need for careful initialization of the parameters. In this work, we study BN from the viewpoint of Fisher kernels that arise from generative probability models. We show that, assuming the samples within a mini-batch are drawn from the same probability density function, BN is identical to the Fisher vector of a Gaussian distribution. This means the batch normalizing transform can be explained in terms of kernels that naturally emerge from the probability density function that models the generative process of the underlying data distribution. Consequently, it promises higher discrimination power for the batch-normalized mini-batch. However, given the rectifying non-linearities employed in CNN architectures, the distribution of the layer outputs is asymmetric. Therefore, in order for BN to fully benefit from the aforementioned properties, we propose approximating the underlying data distribution not with a single Gaussian but with a mixture of Gaussian densities. Deriving the Fisher vector for a Gaussian Mixture Model (GMM) reveals that batch normalization can be improved by independently normalizing with respect to the statistics of disentangled sub-populations. We refer to our proposed soft piecewise version of batch normalization as Mixture Normalization (MN). Through an extensive set of experiments on CIFAR-10 and CIFAR-100, using both a 5-layer deep CNN and the modern Inception-V3 architecture, we show that mixture normalization reduces the number of gradient updates required to reach the maximum test accuracy of the batch-normalized model by ∼31%-47% across a variety of training scenarios. Replacing even a few BN modules with MN in the 48-layer deep Inception-V3 architecture is sufficient to obtain not only considerable training acceleration but also better final test accuracy. We show that similar observations hold for 40- and 100-layer deep DenseNet architectures as well. We complement our study by evaluating the application of mixture normalization to Generative Adversarial Networks (GANs), where "mode collapse" hinders the training process. We solely replace a few batch normalization layers in the generator with our proposed mixture normalization. Our experiments using Deep Convolutional GAN (DCGAN) on CIFAR-10 show that the mixture-normalized DCGAN not only provides an acceleration of ∼58% but also reaches a lower (better) "Fréchet Inception Distance" (FID) of 33.35, compared to 37.56 for its batch-normalized counterpart.
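    The soft piecewise idea can be sketched in a few lines. The snippet below is a simplified illustration (not the paper's exact layer): it fits a GMM to a single 1-D channel with scikit-learn's GaussianMixture and normalizes each activation against the statistics of the components it is softly assigned to; the function name, the number of components, and the toy batch are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mixture_normalize(x, k=3, eps=1e-5, seed=0):
    """x: (batch,) activations of one channel. Returns a soft piecewise-normalized copy:
    each sample is standardized against every mixture component, and the results are
    blended with the component responsibilities."""
    gm = GaussianMixture(n_components=k, random_state=seed).fit(x[:, None])
    resp = gm.predict_proba(x[:, None])                 # (batch, k) soft assignments
    means = gm.means_.ravel()                           # (k,) component means
    stds = np.sqrt(gm.covariances_.ravel() + eps)       # (k,) component std deviations (1-D data)
    per_comp = (x[:, None] - means[None, :]) / stds[None, :]
    return (resp * per_comp).sum(axis=1)

# Toy bimodal batch, mimicking the asymmetric post-ReLU activations mentioned above:
# the output is roughly standardized within each sub-population rather than globally.
rng = np.random.default_rng(0)
batch = np.concatenate([rng.normal(0.0, 0.3, 64), rng.normal(4.0, 1.0, 64)])
print(mixture_normalize(batch, k=2)[:5])
```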
  3. Abstract

    Traits that have arisen multiple times yet still remain rare present a curious paradox. A number of these rare traits show a distinct tippy pattern, where they appear widely dispersed across a phylogeny, are associated with short branches and differ between recently diverged sister species. This phylogenetic pattern has classically been attributed to the trait being an evolutionary dead end, where the trait arises due to some short‐term evolutionary advantage, but it ultimately leads species to extinction. While the higher extinction rate associated with a dead-end trait could produce such a tippy pattern, a similar pattern could appear if lineages with the trait speciated more slowly than other lineages, or if the trait was lost more often than it was gained. In this study, we quantify the degree of tippiness of red flowers in the tomato family, Solanaceae, and investigate the macroevolutionary processes that could explain the sparse phylogenetic distribution of this trait. Using a suite of metrics, we confirm that red‐flowered lineages are significantly overdispersed across the tree and form smaller clades than expected under a null model. Next, we fit 22 alternative models using HiSSE (Hidden State Speciation and Extinction), which accommodates asymmetries in speciation, extinction and transition rates that depend on observed and unobserved (hidden) character states. Results of the model fitting indicated significant variation in diversification rates across the family, which is best explained by the inclusion of hidden states. Our best-fitting model differs between the maximum clade credibility tree and analyses incorporating phylogenetic uncertainty, suggesting that the extreme tippiness and rarity of red Solanaceae flowers makes it difficult to distinguish among different underlying processes. However, both of the best models strongly support a bias towards the loss of red flowers. The best-fitting HiSSE model when incorporating phylogenetic uncertainty lends some support to the hypothesis that lineages with red flowers exhibit reduced diversification rates due to elevated extinction rates. Future studies employing simulations or targeting population‐level processes may allow us to determine whether red flowers in Solanaceae or other angiosperm clades are rare and tippy due to a combination of processes, or to asymmetrical transitions alone.

     
  4. Despite, or maybe because of, their astonishing capacity to fit data, neural networks are believed to have difficulties extrapolating beyond the training data distribution. This work shows that, for extrapolations based on finite transformation groups, a model's inability to extrapolate is unrelated to its capacity. Rather, the shortcoming is inherited from a learning hypothesis: examples not explicitly observed (even with infinitely many training examples) have underspecified outcomes in the learner's model. In order to endow neural networks with the ability to extrapolate over group transformations, we introduce a learning framework counterfactually guided by the learning hypothesis that invariance to (known) transformation groups is mandatory even without evidence, unless the learner deems it inconsistent with the training data. Unlike existing invariance-driven methods for (counterfactual) extrapolations, this framework allows extrapolations from a single environment. Finally, we introduce sequence and image extrapolation tasks that validate our framework and showcase the shortcomings of traditional approaches.
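    For orientation only, here is a generic sketch of enforcing invariance to a known finite transformation group during training. It is not the counterfactual framework proposed in this paper, just the plain invariance-penalty baseline such frameworks improve upon; the network, the flip group, the penalty weight, and the dummy data are all illustrative assumptions.

```python
import torch
import torch.nn as nn

def group_invariance_penalty(model, x, group_actions):
    """Mean squared difference between model(x) and model(g(x)) over the given group actions."""
    base = model(x)
    return sum(((model(g(x)) - base) ** 2).mean() for g in group_actions)

# Small classifier on flattened 28x28 images (placeholder architecture).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
# Non-identity element of the horizontal reflection group.
group = [lambda t: torch.flip(t, dims=[-1])]

x = torch.randn(32, 1, 28, 28)                 # dummy batch of images
y = torch.randint(0, 10, (32,))                # dummy labels
task_loss = nn.functional.cross_entropy(model(x), y)
loss = task_loss + 1.0 * group_invariance_penalty(model, x, group)
loss.backward()                                # gradients now also push outputs toward flip-invariance
print(float(loss))
```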