While cross entropy (CE) is the most commonly used loss function for training deep neural networks on classification tasks, many alternative losses have been developed to obtain better empirical performance. Which one is best to use remains unclear, because the answer appears to depend on multiple factors, such as the properties of the dataset, the choice of network architecture, and so on. This paper studies the choice of loss function by examining the last-layer features of deep networks, drawing inspiration from a recent line of work showing that the global optimal solutions of the CE and mean squared error (MSE) losses exhibit a Neural Collapse phenomenon. That is, for sufficiently large networks trained until convergence, (i) all features of the same class collapse to the corresponding class mean and (ii) the means associated with different classes are in a configuration where their pairwise distances are all equal and maximized. We extend these results and show through global solution and landscape analyses that a broad family of loss functions, including the commonly used label smoothing (LS) and focal loss (FL), exhibits Neural Collapse. Hence, all relevant losses (i.e., CE, LS, FL, MSE) produce equivalent features on training data. In particular, under the unconstrained feature model assumption, we provide a global landscape analysis for the LS loss and a local landscape analysis for the FL loss, showing that the only global minimizers are Neural Collapse solutions, while all other critical points are strict saddles whose Hessians exhibit negative curvature directions, globally for the LS loss and locally near the optimal solution for the FL loss. Our experiments further show that Neural Collapse features obtained from all relevant losses (i.e., CE, LS, FL, MSE) lead to largely identical performance on test data as well, provided that the network is sufficiently large and trained until convergence.
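To make properties (i) and (ii) concrete, here is a minimal NumPy sketch, not taken from the paper, that measures within-class collapse and checks whether the globally centered class means form an equal-norm, equal-angle simplex configuration; the feature matrix `H`, label vector `y`, and the particular metrics are illustrative assumptions.

```python
import numpy as np

def neural_collapse_metrics(H, y):
    """H: (N, d) last-layer features, y: (N,) integer class labels.

    Returns rough proxies for (i) within-class collapse and (ii) the
    simplex structure of the class means: equal norms and pairwise
    cosines close to -1/(K-1) after centering by the global mean.
    """
    classes = np.unique(y)
    K = len(classes)
    global_mean = H.mean(axis=0)

    means = np.stack([H[y == c].mean(axis=0) for c in classes])   # (K, d) class means
    centered = means - global_mean

    # (i) within-class scatter relative to between-class scatter -> 0 at collapse
    within = np.mean([np.sum((H[y == c] - means[k]) ** 2)
                      for k, c in enumerate(classes)])
    between = np.sum(centered ** 2)

    # (ii) equal norms and equal, maximally separated pairwise angles
    norms = np.linalg.norm(centered, axis=1)
    cos = (centered @ centered.T) / np.outer(norms, norms)
    off_diag = cos[~np.eye(K, dtype=bool)]

    return {
        "collapse_ratio": within / between,                       # -> 0
        "norm_spread": norms.std() / norms.mean(),                 # -> 0
        "cosine_gap": np.abs(off_diag + 1.0 / (K - 1)).max(),      # -> 0
    }

# toy usage: random features are, of course, far from collapsed
H = np.random.randn(1000, 64)
y = np.random.randint(0, 10, size=1000)
print(neural_collapse_metrics(H, y))
```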
Deconstructing the Goldilocks Zone of Neural Network Initialization
The second-order properties of the training loss have a massive impact on the optimization dynamics of deep learning models. Fort & Scherlis (2019) discovered that a large excess of positive curvature and local convexity of the loss Hessian are associated with highly trainable initial points located in a region coined the "Goldilocks zone". Only a handful of subsequent studies have touched upon this relationship, so it remains largely unexplained. In this paper, we present a rigorous and comprehensive analysis of the Goldilocks zone for homogeneous neural networks. In particular, we derive the fundamental condition resulting in an excess of positive curvature of the loss, explaining and refining its conventionally accepted connection to the initialization norm. Further, we relate the excess of positive curvature to model confidence, low initial loss, and a previously unknown type of vanishing cross-entropy loss gradient. To understand the importance of excessive positive curvature for the trainability of deep networks, we optimize fully-connected and convolutional architectures outside the Goldilocks zone and analyze the emergent behaviors. We find that strong model performance is not perfectly aligned with the Goldilocks zone, calling for further research into this relationship.
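A common scalar proxy for this excess of positive curvature is a trace-to-norm ratio of the loss Hessian, in the spirit of the measure used by Fort & Scherlis. The PyTorch sketch below estimates Tr(H)/||H||_F with Hutchinson-style Hessian-vector products; the toy model, probe count, and data are stand-in assumptions rather than the paper's setup.

```python
import torch

def curvature_excess(model, loss_fn, x, y, n_probes=20):
    """Estimate Tr(H) / ||H||_F for the loss Hessian H w.r.t. the model
    parameters using Rademacher probes; a large positive value signals
    an excess of positive curvature."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    tr_h, tr_h2 = 0.0, 0.0
    for _ in range(n_probes):
        vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]  # +/-1 probes
        hv = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        tr_h += sum((h * v).sum() for h, v in zip(hv, vs)).item()    # E[v^T H v] = Tr(H)
        tr_h2 += sum((h * h).sum() for h in hv).item()               # E[|Hv|^2] = Tr(H^2)
    tr_h, tr_h2 = tr_h / n_probes, tr_h2 / n_probes
    return tr_h / (tr_h2 ** 0.5 + 1e-12)

# toy usage on a small fully-connected classifier at initialization
model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 10))
x, y = torch.randn(128, 20), torch.randint(0, 10, (128,))
print(curvature_excess(model, torch.nn.functional.cross_entropy, x, y))
```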
- Award ID(s):
- 1922658
- PAR ID:
- 10535860
- Publisher / Repository:
- International Conference on Machine Learning 2024
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Goldilocks quantum cellular automata (QCA) have been simulated on quantum hardware and produce emergent small-world correlation networks. In Goldilocks QCA, a single-qubit unitary is applied to each qubit in a one-dimensional chain subject to a balance constraint: a qubit is updated only if its neighbors are in opposite basis states. Here, we prove that a subclass of Goldilocks QCA -- including the one implemented experimentally -- maps onto free fermions and therefore can be classically simulated efficiently. We support this claim with two independent proofs, one involving a Jordan--Wigner transformation and one mapping the integrable six-vertex model to QCA. We compute local conserved quantities of these QCA and predict experimentally measurable expectation values. These calculations can be applied to test large digital quantum computers against known solutions. In contrast, typical Goldilocks QCA have equilibration properties and quasienergy-level statistics that suggest nonintegrability. Still, the latter QCA conserve one quantity useful for error mitigation. Our work provides a parametric quantum circuit with tunable integrability properties with which to test quantum hardware.
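As an illustration of the balance-constrained update described above, the following NumPy statevector sketch applies a single-qubit unitary to a qubit only when its two neighbors hold opposite basis states. The sequential sweep order, open-boundary handling, and choice of gate are assumptions made for the sketch, not the experimentally implemented protocol.

```python
import numpy as np

def goldilocks_sweep(psi, n, U):
    """One sweep of a Goldilocks-style QCA on an n-qubit chain: the 2x2
    unitary U acts on qubit i only when qubits i-1 and i+1 are in
    opposite basis states; boundary qubits are left untouched."""
    for i in range(1, n - 1):
        new = psi.copy()
        for idx in range(2 ** n):
            left  = (idx >> (n - i)) & 1       # basis bit of qubit i-1
            mid   = (idx >> (n - i - 1)) & 1   # basis bit of qubit i
            right = (idx >> (n - i - 2)) & 1   # basis bit of qubit i+1
            if left != right and mid == 0:
                partner = idx | (1 << (n - i - 1))   # same state with qubit i set to 1
                a0, a1 = psi[idx], psi[partner]
                new[idx]     = U[0, 0] * a0 + U[0, 1] * a1
                new[partner] = U[1, 0] * a0 + U[1, 1] * a1
        psi = new
    return psi

# toy usage: 6 qubits, one excitation, Hadamard as the single-qubit gate
n = 6
psi = np.zeros(2 ** n, dtype=complex)
psi[int("000100", 2)] = 1.0
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
for _ in range(5):
    psi = goldilocks_sweep(psi, n, H)
print(round(np.sum(np.abs(psi) ** 2), 6))   # norm is preserved: 1.0
```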
-
In suitably initialized wide networks, small learning rates transform deep neural networks (DNNs) into neural tangent kernel (NTK) machines, whose training dynamics are well approximated by a linear expansion of the network in its weights at initialization. Standard training, however, diverges from its linearization in ways that are poorly understood. We study the relationship between the training dynamics of nonlinear deep networks, the geometry of the loss landscape, and the time evolution of a data-dependent NTK. We do so through a large-scale phenomenological analysis of training, synthesizing diverse measures characterizing loss landscape geometry and NTK dynamics. Across multiple neural architectures and datasets, we find these diverse measures evolve in a highly correlated manner, revealing a universal picture of the deep learning process. In this picture, deep network training exhibits a highly chaotic, rapid initial transient that within 2 to 3 epochs determines the final linearly connected basin of low loss containing the end point of training. During this chaotic transient, the NTK changes rapidly, learning useful features from the training data that enable it to outperform the standard initial NTK by a factor of 3 in less than 3 to 4 epochs. After this rapid chaotic transient, the NTK changes at constant velocity, and its performance matches that of full network training in 15% to 45% of the training time. Overall, our analysis reveals a striking correlation between a diverse set of metrics over training time, governed by a rapid chaotic-to-stable transition in the first few epochs, that together pose challenges and opportunities for the development of more accurate theories of deep learning.
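A rough sketch of how a data-dependent (empirical) NTK can be computed, and its drift from initialization tracked, is given below; the scalarization of the network output, the kernel-distance measure, and the toy training loop are assumptions of this illustration rather than the paper's methodology.

```python
import torch

def empirical_ntk(model, xs):
    """Empirical NTK: Gram matrix of per-example parameter gradients of a
    scalar readout (here, the sum of the logits); a full multi-output NTK
    would keep one block per output dimension."""
    params = [p for p in model.parameters() if p.requires_grad]
    rows = []
    for x in xs:
        out = model(x.unsqueeze(0)).sum()
        g = torch.autograd.grad(out, params)
        rows.append(torch.cat([gi.reshape(-1) for gi in g]))
    G = torch.stack(rows)          # (N, num_params)
    return G @ G.T                 # (N, N) kernel matrix

def kernel_distance(K1, K2):
    """1 - cosine similarity between flattened kernels: one simple way to
    quantify how far the NTK has moved from its value at initialization."""
    v1, v2 = K1.flatten(), K2.flatten()
    return (1 - torch.dot(v1, v2) / (v1.norm() * v2.norm())).item()

# toy usage: kernel at initialization vs. after a few SGD steps
model = torch.nn.Sequential(torch.nn.Linear(10, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 3))
xs, ys = torch.randn(16, 10), torch.randint(0, 3, (16,))
K0 = empirical_ntk(model, xs)

opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(50):
    opt.zero_grad()
    torch.nn.functional.cross_entropy(model(xs), ys).backward()
    opt.step()
print(kernel_distance(K0, empirical_ntk(model, xs)))
```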
-
We construct asymptotically flat, scalar flat extensions of Bartnik data (Σ, γ, H), where γ is a metric of positive Gauss curvature on a two-sphere Σ, and H is a function that is either positive or identically zero on Σ, such that the mass of the extension can be made arbitrarily close to the half area radius of (Σ, γ). In the case of H ≡ 0, the result gives an analog of a theorem of Mantoulidis and Schoen [On the Bartnik mass of apparent horizons, Class. Quantum Grav. 32(20) (2015) 205002, 16 pp.], but with extensions that have vanishing scalar curvature. In the context of initial data sets in general relativity, the result produces asymptotically flat, time-symmetric, vacuum initial data with an apparent horizon (Σ, γ), for any metric γ with positive Gauss curvature, such that the mass of the initial data is arbitrarily close to the optimal value in the Riemannian Penrose inequality. The method we use is the Shi–Tam type metric construction from [Positive mass theorem and the boundary behaviors of compact manifolds with nonnegative scalar curvature, J. Differential Geom. 62(1) (2002) 79–125] and a refined Shi–Tam monotonicity, found by the first named author in [On a localized Riemannian Penrose inequality, Commun. Math. Phys. 292(1) (2009) 271–284].
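For reference, the half area radius and the optimal Penrose bound mentioned above can be written out as follows; these are standard definitions stated for convenience, not formulas quoted from the paper.

```latex
% |\Sigma|_\gamma denotes the area of \Sigma with respect to \gamma.
% Half area radius of (\Sigma, \gamma):
\[
  r_{1/2}(\Sigma,\gamma)
  = \frac{1}{2}\sqrt{\frac{|\Sigma|_\gamma}{4\pi}}
  = \sqrt{\frac{|\Sigma|_\gamma}{16\pi}} .
\]
% Riemannian Penrose inequality (the optimal value the extensions approach):
\[
  m_{\mathrm{ADM}} \;\ge\; \sqrt{\frac{|\Sigma|_\gamma}{16\pi}} .
\]
```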
-
We develop a model for contagion in reinsurance networks by which primary insurers’ losses are spread through the network. Our model handles general reinsurance contracts, such as typical excess of loss contracts. We show that simpler models existing in the literature—namely proportional reinsurance—greatly underestimate contagion risk. We characterize the fixed points of our model and develop efficient algorithms to compute contagion with guarantees on convergence and speed under conditions on network structure. We characterize exotic cases of problematic graph structure and nonlinearities, which cause network effects to dominate the overall payments in the system. Last, we apply our model to data on real-world reinsurance networks. Our simulations demonstrate the following. (1) Reinsurance networks face extreme sensitivity to parameters. A firm can be wildly uncertain about its losses even under small network uncertainty. (2) Our sensitivity results reveal a new incentive for firms to cooperate to prevent fraud, because even small cases of fraud can have outsized effect on the losses across the network. (3) Nonlinearities from excess of loss contracts obfuscate risks and can cause excess costs in a real-world system. This paper was accepted by Baris Ata, stochastic models and simulation.
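A minimal sketch of the kind of monotone fixed-point iteration such a contagion model calls for is shown below; the attachment/limit parametrization and the specific update rule are illustrative assumptions rather than the paper's exact formulation, and convergence relies on conditions on the network of the sort the paper characterizes.

```python
import numpy as np

def contagion_fixed_point(primary, frac, attach, limit, tol=1e-10, max_iter=10_000):
    """Iterate an excess-of-loss contagion map to a fixed point.

    primary[i]   : firm i's primary (exogenous) loss
    frac[i, j]   : fraction of firm j's excess loss ceded to firm i
    attach[i, j] : attachment point of the contract between i and j
    limit[i, j]  : layer limit of that contract

    A firm's total loss is its primary loss plus the payouts it owes on
    contracts it reinsures, where each payout is the ceded fraction of the
    counterparty's loss above the attachment point, capped at the limit.
    """
    L = primary.astype(float).copy()
    for _ in range(max_iter):
        ceded = frac * np.clip(L[None, :] - attach, 0.0, limit)  # (i, j): i pays on j's loss
        L_new = primary + ceded.sum(axis=1)
        if np.max(np.abs(L_new - L)) < tol:
            break
        L = L_new
    return L_new

# toy 3-firm network: firm 2 reinsures a layer of firms 0 and 1
primary = np.array([10.0, 6.0, 0.0])
frac = np.zeros((3, 3)); frac[2, 0] = frac[2, 1] = 0.8
attach = np.full((3, 3), 5.0)
limit = np.full((3, 3), 20.0)
print(contagion_fixed_point(primary, frac, attach, limit))   # firm 2 absorbs 4.8
```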

