Title: On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent
Recent work has highlighted the role of initialization scale in determining the structure of the solutions that gradient methods converge to. In particular, it was shown that large initialization leads to the neural tangent kernel regime solution, whereas small initialization leads to so-called “rich regimes”. However, the initialization structure is richer than the overall scale alone and involves relative magnitudes of different weights and layers in the network. Here we show that these relative scales, which we refer to as initialization shape, play an important role in determining the learned model. We develop a novel technique for deriving the inductive bias of gradient flow and use it to obtain closed-form implicit regularizers for multiple cases of interest.
Award ID(s):
1764032
NSF-PAR ID:
10286846
Journal Name:
Proceedings of Machine Learning Research
Volume:
139
Page Range or eLocation-ID:
468 - 477
ISSN:
2640-3498
Sponsoring Org:
National Science Foundation
More Like this
  1. Understanding the learning dynamics and inductive bias of neural networks (NNs) is hindered by the opacity of the relationship between NN parameters and the function represented. Partially, this is due to symmetries inherent within the NN parameterization, allowing multiple different parameter settings to result in an identical output function, resulting in both an unclear relationship and redundant degrees of freedom. The NN parameterization is invariant under two symmetries: permutation of the neurons and a continuous family of transformations of the scale of weight and bias parameters. We propose taking a quotient with respect to the second symmetry group and reparametrizing ReLU NNs as continuous piecewise linear splines. Using this spline lens, we study learning dynamics in shallow univariate ReLU NNs, finding unexpected insights and explanations for several perplexing phenomena. We develop a surprisingly simple and transparent view of the structure of the loss surface, including its critical and fixed points, Hessian, and Hessian spectrum. We also show that standard weight initializations yield very flat initial functions, and that this flatness, together with overparametrization and the initial weight scale, is responsible for the strength and type of implicit regularization, consistent with previous work. Our implicit regularization results are complementary to recent work, showing that initialization scale critically controls implicit regularization via a kernel-based argument. Overall, removing the weight scale symmetry enables us to prove these results more simply and enables us to prove new results and gain new insights while offering a far more transparent and intuitive picture. Looking forward, our quotiented spline-based approach will extend naturally to the multivariate and deep settings, and alongside the kernel-based view, we believe it will play a foundational role in efforts to understand neural networks.
Videos of learning dynamics using a spline-based visualization are available at http://shorturl.at/tFWZ2.
  2. Compared to the Arctic, seasonal predictions of Antarctic sea ice have received relatively little attention. In this work, we utilize three coupled dynamical prediction systems developed at the Geophysical Fluid Dynamics Laboratory to assess the seasonal prediction skill and predictability of Antarctic sea ice. These systems, based on the FLOR, SPEAR_LO, and SPEAR_MED dynamical models, differ in their coupled model components, initialization techniques, atmospheric resolution, and model biases. Using suites of retrospective initialized seasonal predictions spanning 1992–2018, we investigate the role of these factors in determining Antarctic sea ice prediction skill and examine the mechanisms of regional sea ice predictability. We find that each system is capable of skillfully predicting regional Antarctic sea ice extent (SIE) with skill that exceeds a persistence forecast. Winter SIE is skillfully predicted 11 months in advance in the Weddell, Amundsen and Bellingshausen, Indian, and West Pacific sectors, whereas winter skill is notably lower in the Ross sector. Zonally advected upper ocean heat content anomalies are found to provide the crucial source of prediction skill for the winter sea ice edge position. The recently developed SPEAR systems are more skillful than FLOR for summer sea ice predictions, owing to improvements in sea ice concentration and sea ice thickness initialization. Summer Weddell SIE is skillfully predicted up to 9 months in advance in SPEAR_MED, due to the persistence and drift of initialized sea ice thickness anomalies from the previous winter. Overall, these results suggest a promising potential for providing operational Antarctic sea ice predictions on seasonal timescales.
  3. Children’s automatic speech recognition (ASR) is always difficult due to, in part, the data scarcity problem, especially for kindergarten-aged kids. When data are scarce, the model might overfit to the training data, and hence good starting points for training are essential. Recently, meta-learning was proposed to learn model initialization (MI) for ASR tasks of different languages. This method leads to good performance when the model is adapted to an unseen language. However, MI is vulnerable to overfitting on training tasks (learner overfitting). It is also unknown whether MI generalizes to other low-resource tasks. In this paper, we validate the effectiveness of MI in children’s ASR and attempt to alleviate the problem of learner overfitting. To achieve model-agnostic meta-learning (MAML), we regard children’s speech at each age as a different task. In terms of learner overfitting, we propose a task-level augmentation method by simulating new ages using frequency warping techniques. Detailed experiments are conducted to show the impact of task augmentation on each age for kindergarten-aged speech. As a result, our approach achieves a relative word error rate (WER) improvement of 51% over the baseline system with no augmentation or initialization.
  4. As the greenhouse gas concentrations increase, a warmer climate is expected. However, numerous internal climate processes can modulate the primary radiative warming response of the climate system to rising greenhouse gas forcing. Here the particular internal climate process that we focus on is the Atlantic meridional overturning circulation (AMOC), an important global-scale feature of ocean circulation that serves to transport heat and other scalars, and we address the question of how the mean strength of AMOC can modulate the transient climate response. While the Community Earth System Model version 2 (CESM2) and the Energy Exascale Earth System Model version 1 (E3SM1) have very similar equilibrium/effective climate sensitivity, our analysis suggests that a weaker AMOC contributes in part to the higher transient climate response to a rising greenhouse gas forcing seen in E3SM1 by permitting a faster warming of the upper ocean and a concomitant slower warming of the subsurface ocean. Likewise the stronger AMOC in CESM2 by permitting a slower warming of the upper ocean leads in part to a smaller transient climate response. Thus, while the mean strength of AMOC does not affect the equilibrium/effective climate sensitivity, it is likely to play an important role in determining the transient climate response on the centennial time scale.
  5. Context. We started a multi-scale analysis of star formation in G202.3+2.5, an intertwined filamentary sub-region of the Monoceros OB1 molecular complex, in order to provide observational constraints on current theories and models that attempt to explain star formation globally. In the first paper (Paper I), we examined the distributions of dense cores and protostars and found enhanced star formation activity in the junction region of the filaments. Aims. In this second paper, we aim to unveil the connections between the core and filament evolutions, and between the filament dynamics and the global evolution of the cloud. Methods. We characterise the gas dynamics and energy balance in different parts of G202.3+2.5 using infrared observations from the Herschel and WISE telescopes and molecular tracers observed with the IRAM 30-m and TRAO 14-m telescopes. The velocity field of the cloud is examined and velocity-coherent structures are identified, characterised, and put in perspective with the cloud environment. Results. Two main velocity components are revealed, well separated in radial velocities in the north and merged around the location of intense N₂H⁺ emission in the centre of G202.3+2.5 where Paper I found the peak of star formation activity. We show that the relative position of the two components along the sightline, and the velocity gradient of the N₂H⁺ emission imply that the components have been undergoing collision for ~10⁵ yr, although it remains unclear whether the gas moves mainly along or across the filament axes. The dense gas where N₂H⁺ is detected is interpreted as the compressed region between the two filaments, which corresponds to a high mass inflow rate of ~1 × 10⁻³ M⊙ yr⁻¹ and possibly leads to a significant increase in its star formation efficiency. We identify a protostellar source in the junction region that possibly powers two crossed intermittent outflows.
We show that the H II region around the nearby cluster NGC 2264 is still expanding and its role in the collision is examined. However, we cannot rule out the idea that the collision arises mostly from the global collapse of the cloud. Conclusions. The (sub-)filament-scale observables examined in this paper reveal a collision between G202.3+2.5 sub-structures and its probable role in feeding the cores in the junction region. To shed more light on this link between core and filament evolutions, one must characterise the cloud morphology, its fragmentation, and magnetic field, all at high resolution. We consider the role of the environment in this paper, but a larger-scale study of this region is now necessary to investigate the scenario of a global cloud collapse.