Title: Convex Neural Autoregressive Models: Towards Tractable, Expressive, and Theoretically-Backed Models for Sequential Forecasting and Generation
Three features are crucial for sequential forecasting and generation models: tractability, expressiveness, and theoretical backing. While neural autoregressive models are relatively tractable and offer powerful predictive and generative capabilities, they often have complex optimization landscapes, and their theoretical properties are not well understood. To address these issues, we present convex formulations of autoregressive models with one hidden layer. Specifically, we prove an exact equivalence between these models and constrained, regularized logistic regression by using semi-infinite duality to embed the data matrix into a higher-dimensional space and introducing inequality constraints. To make this formulation tractable, we approximate the constraints using a hinge loss or drop them altogether. Furthermore, we demonstrate faster training and competitive performance of these implementations compared to their neural network counterparts on a variety of data sets. Consequently, we introduce techniques to derive tractable, expressive, and theoretically interpretable models that are nearly equivalent to neural autoregressive models.
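As a rough illustration of the "embed the data matrix, then fit a regularized logistic regression" idea with the inequality constraints dropped, the following is a minimal sketch and not the authors' implementation. It assumes the embedding is built from randomly sampled ReLU activation patterns (masks of the form 1[Xg >= 0]) and that the regularizer is a group-lasso penalty; neither detail is stated in the abstract. All names (`lagged_design`, `embed_features`, `fit_convex_ar`, `predict_proba`) and the toy binary series are hypothetical.

```python
import numpy as np

def lagged_design(series, order):
    """Autoregressive design: row t holds series[t:t+order], target is series[t+order]."""
    T = len(series)
    X = np.stack([series[i:T - order + i] for i in range(order)], axis=1)
    return X, series[order:]

def embed_features(X, gates):
    """Embed X as [D_1 X, ..., D_P X] with D_i = diag(1[X g_i >= 0]) (assumed construction)."""
    masks = (X @ gates >= 0).astype(X.dtype)            # shape (n, P)
    stacked = masks[:, :, None] * X[:, None, :]         # shape (n, P, d)
    return stacked.reshape(X.shape[0], -1)              # shape (n, P * d)

def logistic_loss(z, y):
    """Mean logistic loss for labels y in {0, 1} and logits z."""
    return np.mean(np.logaddexp(0.0, z) - y * z)

def fit_convex_ar(X, y, n_gates=20, lam=1e-3, n_iter=500, seed=0):
    """Group-lasso logistic regression on the embedded data (constraints dropped),
    solved by proximal gradient descent with step size 1/L."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    gates = rng.standard_normal((d, n_gates))           # assumed: random gate vectors
    Phi = embed_features(X, gates)
    # Lipschitz constant of the mean logistic loss gradient is ||Phi||_2^2 / (4n).
    step = 4.0 * n / max(np.linalg.norm(Phi, 2) ** 2, 1e-12)
    W = np.zeros((n_gates, d))                          # one weight group per gate
    for _ in range(n_iter):
        z = Phi @ W.ravel()
        p = 1.0 / (1.0 + np.exp(-z))
        grad = (Phi.T @ (p - y) / n).reshape(n_gates, d)
        V = W - step * grad
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        # Proximal operator of the group-lasso penalty (block soft-thresholding).
        W = np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12)) * V
    return gates, W

def predict_proba(X, gates, W):
    """Probability that the next symbol is 1 under the fitted model."""
    z = embed_features(X, gates) @ W.ravel()
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary series: the next symbol is a noisy XOR of the previous two.
rng = np.random.default_rng(1)
order = 2
series = np.zeros(600)
series[:2] = [1.0, 0.0]
for t in range(2, len(series)):
    p_one = 0.9 if series[t - 1] != series[t - 2] else 0.1
    series[t] = float(rng.random() < p_one)

X, y = lagged_design(series, order)
gates, W = fit_convex_ar(X, y)
probs = predict_proba(X, gates, W)
final_loss = logistic_loss(embed_features(X, gates) @ W.ravel(), y)
```

Because the objective is convex in `W` once the gates are fixed, the proximal gradient loop needs no careful initialization. Approximating the dropped constraints with a hinge penalty would add one more convex term to the same objective.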
Award ID(s):
2037304
NSF-PAR ID:
10290937
Journal Name:
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Page Range / eLocation ID:
3890 to 3894
Sponsoring Org:
National Science Foundation
More Like this
  1. The standard approach to fitting an autoregressive spike train model is to maximize the likelihood for one-step prediction. This maximum likelihood estimation (MLE) often leads to models that perform poorly when generating samples recursively for more than one time step. Moreover, the generated spike trains can fail to capture important features of the data and even show diverging firing rates. To alleviate this, we propose to directly minimize the divergence between recorded neural spike trains and model-generated spike trains using spike train kernels. We develop a method that stochastically optimizes the maximum mean discrepancy induced by the kernel. Experiments performed on both real and synthetic neural data validate the proposed approach, showing that it leads to well-behaved models. Using different combinations of spike train kernels, we show that we can control the trade-off between different features, which is critical for dealing with model mismatch.
  2. Identifying the directed connectivity that underlies networked activity between different cortical areas is critical for understanding the neural mechanisms behind sensory processing. Granger causality (GC) is widely used for this purpose in functional magnetic resonance imaging analysis, but there the temporal resolution is low, making it difficult to capture the millisecond-scale interactions underlying sensory processing. Magnetoencephalography (MEG) has millisecond resolution, but only provides low-dimensional sensor-level linear mixtures of neural sources, which makes GC inference challenging. Conventional methods proceed in two stages: first, cortical sources are estimated from MEG using a source localization technique; second, GC is inferred among the estimated sources. However, the spatiotemporal biases in estimating sources propagate into the subsequent GC analysis stage and may result in both false alarms and missed true GC links. Here, we introduce the Network Localized Granger Causality (NLGC) inference paradigm, which models the source dynamics as latent sparse multivariate autoregressive processes and estimates their parameters directly from the MEG measurements, integrated with source localization, and employs the resulting parameter estimates to produce a precise statistical characterization of the detected GC links. We offer several theoretical and algorithmic innovations within NLGC and further examine its utility via comprehensive simulations and application to MEG data from an auditory task involving tone processing from both younger and older participants. Our simulation studies reveal that NLGC is markedly robust with respect to model mismatch, network size, and low signal-to-noise ratio, whereas the conventional two-stage methods result in high false alarms and mis-detections. We also demonstrate the advantages of NLGC in revealing the cortical network-level characterization of neural activity during tone processing and resting state by delineating task- and age-related connectivity changes.
  3. We studied the learnability of English filler-gap dependencies and the “island” constraints on them by assessing the generalizations made by autoregressive (incremental) language models that use deep learning to predict the next word given preceding context. Using factorial tests inspired by experimental psycholinguistics, we found that models acquire not only the basic contingency between fillers and gaps, but also the unboundedness and hierarchical constraints implicated in the dependency. We evaluated a model’s acquisition of island constraints by demonstrating that its expectation for a filler-gap contingency is attenuated within an island environment. Our results provide empirical evidence against the argument from the poverty of the stimulus for this particular structure.
  4. Large pretrained language models are successful at generating fluent text but are notoriously hard to sample from controllably. In this work, we study constrained sampling from such language models, i.e., generating text that satisfies user-defined constraints while maintaining fluency and the model’s performance in a downstream task. We propose MuCoLa, a sampling procedure that combines the log-likelihood of the language model with arbitrary (differentiable) constraints in a single energy function and then generates samples in a non-autoregressive manner. Specifically, it initializes the entire output sequence with noise and follows a Markov chain defined by Langevin Dynamics using the gradients of this energy. We evaluate MuCoLa on text generation with soft and hard constraints as well as their combinations, obtaining significant improvements over competitive baselines for toxicity avoidance, sentiment control, and keyword-guided generation.
  5. Pierre Alquier (Ed.)
    A systematic approach to finding a variational approximation in an otherwise intractable non-conjugate model is to exploit the general principle of convex duality by minorizing the marginal likelihood in a way that renders the problem tractable. While such approaches are popular in the context of variational inference in non-conjugate Bayesian models, theoretical guarantees on statistical optimality and algorithmic convergence are lacking. Focusing on logistic regression models, we provide mild conditions on the data-generating process to derive non-asymptotic upper bounds to the risk incurred by the variational optima. We demonstrate that these assumptions can be completely relaxed if one considers a slight variation of the algorithm by raising the likelihood to a fractional power. Next, we utilize the theory of dynamical systems to provide convergence guarantees for such algorithms in logistic and multinomial logit regression. In particular, we establish local asymptotic stability of the algorithm without any assumptions on the data-generating process. We explore a special case involving a semi-orthogonal design under which global convergence is obtained. The theory is further illustrated using several numerical studies.