The method of choice for integrating the time-dependent Fokker–Planck equation (FPE) in high dimension is to generate samples from the solution via integration of the associated stochastic differential equation (SDE). Here, we study an alternative scheme based on integrating an ordinary differential equation that describes the flow of probability. Acting as a transport map, this equation deterministically pushes samples from the initial density onto samples from the solution at any later time. Unlike integration of the stochastic dynamics, the method gives direct access to quantities that are challenging to estimate from trajectories alone, such as the probability current, the density itself, and its entropy. The probability flow equation depends on the gradient of the logarithm of the solution (its 'score'), and so is a priori unknown. To resolve this dependence, we model the score with a deep neural network that is learned on the fly by propagating a set of samples according to the instantaneous probability current. We show theoretically that the proposed approach controls the Kullback–Leibler (KL) divergence from the learned solution to the target, while learning on external samples from the SDE does not control either direction of the KL divergence. Empirically, we consider several high-dimensional FPEs from the physics of interacting particle systems. We find that the method accurately matches analytical solutions when they are available, as well as moments computed via Monte Carlo when they are not. Moreover, the method offers compelling predictions for the global entropy production rate that outperform those obtained from learning on stochastic trajectories, and can effectively capture non-equilibrium steady-state probability currents over long time intervals.
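The probability flow idea can be seen in a toy case where the score is known in closed form, so no neural network is needed. This is a minimal sketch, not the paper's learned-score method: for the 1D Ornstein–Uhlenbeck SDE dX = -X dt + sqrt(2) dW started from a Gaussian, the FPE solution stays Gaussian, so the score grad log p_t(x) = -(x - m_t)/s_t^2 is exact, and the probability flow ODE dx/dt = b(x) - grad log p_t(x) transports initial samples onto samples of p_t deterministically. All variable names here are illustrative.

```python
import numpy as np

# For dX = -X dt + sqrt(2) dW with X_0 ~ N(m0, s0^2), the FPE solution is
# N(m_t, s_t^2) with m_t = m0 exp(-t) and s_t^2 = 1 + (s0^2 - 1) exp(-2t),
# so the score grad log p_t(x) = -(x - m_t) / s_t^2 is known exactly.

def probability_flow(x0, t_final, n_steps, m0, s0):
    """Euler integration of the probability flow ODE with the exact score."""
    x = x0.copy()
    dt = t_final / n_steps
    for k in range(n_steps):
        t = k * dt
        m_t = m0 * np.exp(-t)
        s2_t = 1.0 + (s0 ** 2 - 1.0) * np.exp(-2.0 * t)
        score = -(x - m_t) / s2_t
        # velocity field: drift minus (diffusion coefficient times) score, D = 1
        x += dt * (-x - score)
    return x

rng = np.random.default_rng(0)
m0, s0 = 2.0, 0.5
x0 = rng.normal(m0, s0, size=200_000)
xT = probability_flow(x0, t_final=1.0, n_steps=1000, m0=m0, s0=s0)

# compare against the analytical Gaussian solution at t = 1
m1 = m0 * np.exp(-1.0)
s2_1 = 1.0 + (s0 ** 2 - 1.0) * np.exp(-2.0)
print(xT.mean(), m1)   # sample mean tracks m_1
print(xT.var(), s2_1)  # sample variance tracks s_1^2
```

Because the map is deterministic, the same pushed-forward samples also give direct access to the density and its entropy along the way, which is the advantage the abstract highlights over SDE trajectories.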
The modern strategy for training deep neural networks for classification tasks includes optimizing the network's weights even after the training error vanishes, to further push the training loss toward zero. Recently, a phenomenon termed "neural collapse" (NC) has been empirically observed in this training procedure. Specifically, it has been shown that the learned features (the output of the penultimate layer) of within-class samples converge to their mean, and the means of different classes exhibit a certain tight frame structure, which is also aligned with the last layer's weights. Recent papers have shown that minimizers with this structure emerge when optimizing a simplified "unconstrained features model" (UFM) with a regularized cross-entropy loss. In this paper, we further analyze and extend the UFM. First, we study the UFM for the regularized MSE loss and show that the minimizers' features can be more structured than in the cross-entropy case, which also affects the structure of the weights. Then, we extend the UFM by adding another layer of weights as well as a ReLU nonlinearity to the model, and generalize our previous results. Finally, we empirically demonstrate the usefulness of our nonlinear extended UFM in modeling the NC phenomenon that occurs with practical networks.
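The "tight frame structure" in the NC literature is a simplex equiangular tight frame (ETF): K unit vectors whose pairwise cosine similarity is exactly -1/(K-1). A small sketch, with illustrative names only, constructs an exact simplex ETF and verifies both properties; in trained networks these hold only approximately, for the centered and normalized class means.

```python
import numpy as np

def simplex_etf(K):
    """Columns form a simplex ETF of K vectors in R^K."""
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

K = 4
M = simplex_etf(K)
G = M.T @ M  # Gram matrix: 1 on the diagonal, -1/(K-1) off the diagonal

print(np.diag(G))                 # all columns have unit norm
off = G[~np.eye(K, dtype=bool)]
print(off)                        # every pairwise inner product equals -1/(K-1)
```

Maximal equiangular separation is what makes this configuration a natural candidate minimizer for both the cross-entropy and MSE variants of the UFM discussed above.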

Clustering is a fundamental primitive in unsupervised learning that gives rise to a rich class of computationally challenging inference tasks. In this work, we focus on the canonical task of clustering d-dimensional Gaussian mixtures with unknown (and possibly degenerate) covariance. Recent works (Ghosh et al. '20; Mao, Wein '21; Davis, Diaz, Wang '21) have established lower bounds against the class of low-degree polynomial methods and the sum-of-squares (SoS) hierarchy for recovering certain hidden structures planted in Gaussian clustering instances. Prior work on many similar inference tasks portends that such lower bounds strongly suggest the presence of an inherent statistical-to-computational gap for clustering, that is, a parameter regime where the clustering task is statistically possible but no polynomial-time algorithm succeeds. One special case of the clustering task we consider is equivalent to the problem of finding a planted hypercube vector in an otherwise random subspace. We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap, despite the aforementioned low-degree and SoS lower bounds. To achieve this, we give an algorithm based on Lenstra–Lenstra–Lovász (LLL) lattice basis reduction which achieves the statistically optimal sample complexity of d + 1 samples. This result extends the class of problems whose conjectured statistical-to-computational gaps can be "closed" by "brittle" polynomial-time algorithms, highlighting the crucial but subtle role of noise in the onset of statistical-to-computational gaps.
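The lattice-reduction primitive the algorithm builds on can be sketched on its own. Below is a minimal textbook LLL reduction (delta = 3/4) using exact rational arithmetic; it illustrates only the primitive, not the paper's full recovery procedure, and recomputes the Gram–Schmidt data after every basis change for clarity rather than efficiency.

```python
from fractions import Fraction

def lll_reduce(basis, delta=Fraction(3, 4)):
    """Textbook LLL: returns a delta-reduced basis of the same integer lattice."""
    b = [[Fraction(x) for x in v] for v in basis]
    n = len(b)
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))

    def gram_schmidt():
        ortho, mu = [], [[Fraction(0)] * n for _ in range(n)]
        for i in range(n):
            v = b[i][:]
            for j in range(i):
                mu[i][j] = dot(b[i], ortho[j]) / dot(ortho[j], ortho[j])
                v = [x - mu[i][j] * y for x, y in zip(v, ortho[j])]
            ortho.append(v)
        return ortho, mu

    ortho, mu = gram_schmidt()
    k = 1
    while k < n:
        for j in range(k - 1, -1, -1):        # size reduction
            q = round(mu[k][j])
            if q:
                b[k] = [x - q * y for x, y in zip(b[k], b[j])]
                ortho, mu = gram_schmidt()
        # Lovász condition: accept b_k or swap with b_{k-1} and backtrack
        if dot(ortho[k], ortho[k]) >= (delta - mu[k][k - 1] ** 2) * dot(ortho[k - 1], ortho[k - 1]):
            k += 1
        else:
            b[k], b[k - 1] = b[k - 1], b[k]
            ortho, mu = gram_schmidt()
            k = max(k - 1, 1)
    return [[int(x) for x in v] for v in b]

reduced = lll_reduce([[1, 1, 1], [-1, 0, 2], [3, 5, 6]])
print(reduced)  # the reduced basis contains a lattice vector of squared norm 1
```

The "brittleness" alluded to in the abstract comes from exactly this kind of exact-arithmetic reasoning: LLL exploits the integrality of the planted hypercube vector, and small amounts of noise destroy that structure.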

We study the optimization of wide neural networks (NNs) via gradient flow (GF) in setups that allow feature learning while admitting non-asymptotic global convergence guarantees. First, for wide shallow NNs under the mean-field scaling and with a general class of activation functions, we prove that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF. Building upon this analysis, we study a model of wide multi-layer NNs whose second-to-last layer is trained via GF, for which we also prove a linear-rate convergence of the training loss to zero, regardless of the input dimension. We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
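The shallow mean-field setting can be checked numerically on a toy problem. This is a hypothetical setup chosen for illustration, not the paper's exact construction: input dimension d at least the training set size n, a wide hidden layer with mean-field 1/m output scaling, and plain gradient descent as a discretization of gradient flow; the training loss is expected to decay geometrically, consistent with a linear rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5, 10, 512              # d >= n, matching the theorem's assumption
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.normal(size=n)

W = rng.normal(size=(m, d))       # hidden-layer weights
a = rng.normal(size=m)            # output-layer weights
lr = 0.2 * m                      # step size scaled by m to offset the 1/m factor

losses = []
for _ in range(500):
    H = np.tanh(X @ W.T)          # (n, m) hidden activations
    r = H @ a / m - y             # residuals of f(x) = (1/m) sum_i a_i tanh(w_i . x)
    losses.append(0.5 * np.mean(r ** 2))
    grad_a = H.T @ r / (n * m)
    grad_W = ((1.0 - H ** 2) * np.outer(r, a)).T @ X / (n * m)
    a -= lr * grad_a
    W -= lr * grad_W

print(losses[0], losses[-1])      # the final loss is far below the initial loss
```

Because the 1/m scaling lets individual weights move by O(1) amounts, the trained features differ from their initialization, in contrast to the NTK regime where they stay essentially frozen.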