

Title: Automatic Differentiation in Machine Learning: a Survey
Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in machine learning. Automatic differentiation (AD), also called algorithmic differentiation or simply "autodiff", is a family of techniques similar to but more general than backpropagation for efficiently and accurately evaluating derivatives of numeric functions expressed as computer programs. AD is a small but established field with applications in areas including computational fluid dynamics, atmospheric sciences, and engineering design optimization. Until very recently, the fields of machine learning and AD have largely been unaware of each other and, in some cases, have independently discovered each other's results. Despite its relevance, general-purpose AD has been missing from the machine learning toolbox, a situation slowly changing with its ongoing adoption under the names "dynamic computational graphs" and "differentiable programming". We survey the intersection of AD and machine learning, cover applications where AD has direct relevance, and address the main implementation techniques. By precisely defining the main differentiation techniques and their interrelationships, we aim to bring clarity to the usage of the terms "autodiff", "automatic differentiation", and "symbolic differentiation" as these are encountered more and more in machine learning settings.
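As a minimal illustration of the forward mode the survey defines (and of how AD differs from symbolic differentiation), the sketch below propagates dual numbers through an ordinary Python program. The `Dual` class and the function `f` are illustrative choices, not code from the paper.

```python
import math

class Dual:
    """Minimal dual number: a value plus a derivative, propagated by the chain rule."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot) if isinstance(x, Dual) else math.sin(x)

def f(x):
    # An ordinary program: no symbolic expression is ever constructed.
    return x * x + sin(x) * x

# Seeding the input's derivative component with 1 yields f(x) and f'(x)
# in a single forward pass through the program.
x = Dual(2.0, 1.0)
y = f(x)
print(y.val, y.dot)   # f(2) and f'(2) = 2x + x*cos(x) + sin(x) at x = 2
```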
Award ID(s): 1734938
NSF-PAR ID: 10066574
Author(s) / Creator(s):
Date Published:
Journal Name: Journal of Machine Learning Research
Volume: 18
Issue: 153
ISSN: 1533-7928
Page Range / eLocation ID: 1-43
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Computing derivatives is key to many algorithms in scientific computing and machine learning, such as optimization, uncertainty quantification, and stability analysis. Enzyme is an LLVM compiler plugin that performs reverse-mode automatic differentiation (AD) and thus generates high-performance gradients of programs in languages including C/C++, Fortran, Julia, and Rust. Prior to this work, Enzyme and other AD tools were not capable of generating gradients of GPU kernels. Our paper presents a combination of novel techniques that make Enzyme the first fully automatic reverse-mode AD tool to generate gradients of GPU kernels. Since, unlike other tools, Enzyme performs automatic differentiation within a general-purpose compiler, we are able to introduce several novel GPU- and AD-specific optimizations. To show the generality and efficiency of our approach, we compute gradients of five GPU-based HPC applications, executed on NVIDIA and AMD GPUs. All benchmarks run within an order of magnitude of the original program's execution time. Without GPU- and AD-specific optimizations, gradients of GPU kernels either fail to run due to a lack of resources or have infeasible overhead. Finally, we demonstrate that increasing the problem size, by either increasing the number of threads or increasing the work per thread, does not substantially impact the overhead from differentiation.
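    Enzyme itself operates on LLVM IR and real CUDA/ROCm kernels, so the following is only a conceptual NumPy sketch (not Enzyme's interface) of why gradients of data-parallel kernels remain data-parallel: the adjoint of an elementwise operation is elementwise, while the adjoint of a reduction is a broadcast.

```python
import numpy as np

def kernel(x):
    # Forward "kernel": each thread i would compute y[i] = x[i]**2, followed by
    # a sum-reduction (expressed here with NumPy array operations).
    y = x ** 2
    return y.sum()

def kernel_adjoint(x, loss_bar=1.0):
    # Hand-written reverse pass illustrating what a reverse-mode AD tool must
    # generate: the adjoint of the reduction is a broadcast of loss_bar, and the
    # adjoint of the elementwise square is itself elementwise (dy[i]/dx[i] = 2*x[i]),
    # so the parallel structure of the original kernel is preserved.
    y_bar = np.full_like(x, loss_bar)   # broadcast: adjoint of the sum-reduction
    x_bar = y_bar * 2.0 * x             # elementwise: adjoint of y = x**2
    return x_bar

x = np.array([1.0, 2.0, 3.0])
print(kernel(x), kernel_adjoint(x))     # 14.0, [2. 4. 6.]
```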
  2. Abstract

    We have developed a differentiable programming framework for truncated hierarchical B-splines (THB-splines), which can be used for several applications in geometry modeling, such as surface fitting and deformable image registration, and can be easily integrated with geometric deep learning frameworks. Differentiable programming is a paradigm in which automatic differentiation is used to compute the derivatives of an algorithm's outputs with respect to its inputs or parameters. It has been used extensively in machine learning to obtain the gradients required by optimization algorithms such as stochastic gradient descent (SGD). While incorporating differentiable programming with traditional functions is straightforward, it is challenging when the functions are complex, such as splines. In this work, we extend the differentiable programming paradigm to THB-splines. THB-splines offer an efficient approach for complex surface fitting by utilizing a hierarchical tensor structure of B-splines, enabling local adaptive refinement. However, this approach brings challenges, such as a larger computational overhead and the non-trivial implementation of automatic differentiation and parallel evaluation algorithms. We use custom kernel functions for GPU acceleration of the forward and backward evaluations that are necessary for differentiable programming of THB-splines. Our approach not only improves computational efficiency but also significantly enhances the speed of surface evaluation compared to previous methods. Our differentiable THB-splines framework facilitates faster and more accurate surface modeling with local refinement, with several applications in CAD and isogeometric analysis.
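    As a much-simplified stand-in for the THB-spline setting (plain uniform cubic B-splines, one segment, 1-D control points; none of this is code from the paper), the sketch below shows a forward evaluation and the gradient with respect to the control points that a backward pass must supply. Because evaluation is linear in the control points, that gradient is just the basis values, which a finite-difference check confirms.

```python
import numpy as np

def cubic_bspline_basis(t):
    # Uniform cubic B-spline basis functions on one segment, t in [0, 1].
    return np.array([(1 - t)**3,
                     3*t**3 - 6*t**2 + 4,
                     -3*t**3 + 3*t**2 + 3*t + 1,
                     t**3]) / 6.0

def curve_point(ctrl, t):
    # Forward evaluation: the curve point is linear in the control points.
    return cubic_bspline_basis(t) @ ctrl

def grad_wrt_ctrl(ctrl, t):
    # Backward evaluation: because the map is linear, d(point)/d(ctrl_i) is
    # simply the i-th basis value.
    return cubic_bspline_basis(t)

ctrl = np.array([0.0, 1.0, 3.0, 2.0])   # 1-D control points for simplicity
t = 0.4
g = grad_wrt_ctrl(ctrl, t)

# Finite-difference check of the analytic gradient.
eps = 1e-6
fd = np.array([(curve_point(ctrl + eps * np.eye(4)[i], t)
                - curve_point(ctrl, t)) / eps for i in range(4)])
print(np.allclose(g, fd, atol=1e-5))    # True
```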

  3. Abstract

    Algorithmic differentiation (AD), also known as automatic differentiation, is a technology for accurate and efficient evaluation of derivatives of a function given as a computer model. The evaluations of such models are essential building blocks in numerous scientific computing and data analysis applications, including optimization, parameter identification, sensitivity analysis, uncertainty quantification, nonlinear equation solving, and integration of differential equations. We provide an introduction to AD and present its basic ideas and techniques, some of its most important results, the implementation paradigms it relies on, the connection it has to other domains including machine learning and parallel computing, and a few of the major open problems in the area. Topics we discuss include: forward mode and reverse mode of AD, higher‐order derivatives, operator overloading and source transformation, sparsity exploitation, checkpointing, cross‐country mode, and differentiating iterative processes.

    This article is categorized under:

    Algorithmic Development > Scalable Statistical Methods

    Technologies > Data Preprocessing
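
    As a minimal illustration of the reverse mode and the operator-overloading implementation paradigm mentioned above (a toy sketch, not any particular AD tool), the code below records local partial derivatives during the forward pass and accumulates adjoints in a reverse sweep.

```python
import math

class Var:
    """Scalar node for reverse-mode AD via operator overloading: each operation
    records its inputs and local partial derivatives as it executes."""
    def __init__(self, val, parents=()):
        self.val, self.parents, self.grad = val, parents, 0.0

    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.val * other.val, [(self, other.val), (other, self.val)])

    def backward(self, seed=1.0):
        # Reverse sweep: push adjoints from the output back toward the inputs.
        # A production implementation visits each node once in reverse topological
        # order; this naive recursion re-traverses shared subexpressions but
        # accumulates the same totals.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

def sin(x):
    return Var(math.sin(x.val), [(x, math.cos(x.val))])

x = Var(2.0)
y = Var(3.0)
z = x * y + sin(x)            # forward pass builds the computational graph
z.backward()                  # one reverse sweep yields all partial derivatives
print(z.val, x.grad, y.grad)  # dz/dx = y + cos(x), dz/dy = x
```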

  4. Abstract

    The degree of rate control (DRC) quantitatively identifies the kinetically relevant (sometimes known as rate-limiting) steps of a complex reaction network. This concept relies on derivatives, which are commonly implemented numerically, for example with finite differences (FDs). Numerical derivatives are tedious to implement and can be problematic, unstable, or unreliable. In this study, we demonstrate the use of automatic differentiation (AD) in the evaluation of the DRC. AD libraries are increasingly available through modern machine learning frameworks. Compared with FDs, AD provides solutions with higher accuracy at lower computational cost. We demonstrate applications in steady-state and transient kinetics. Furthermore, we illustrate a hybrid local-global sensitivity analysis method, the distributed evaluation of local sensitivity analysis, to assess the importance of kinetic parameters over an uncertain space. This method also benefits from AD to obtain high-quality results efficiently.
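    To make the DRC computation concrete, the sketch below applies forward-mode AD (a minimal dual-number class) to the standard definition X_i = (k_i / r) * dr/dk_i for a hypothetical two-step series rate law; the rate expression and rate constants are illustrative assumptions, not taken from the paper.

```python
class Dual:
    # Minimal forward-mode AD value: carries f and df/dk_i for one seeded k_i.
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    def __truediv__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val / o.val,
                    (self.dot * o.val - self.val * o.dot) / (o.val ** 2))

def rate(k1, k2):
    # Two irreversible first-order steps in series with a steady-state
    # intermediate: an illustrative microkinetic model, not one from the paper.
    return k1 * k2 / (k1 + k2)

def drc(i, k1, k2):
    # Degree of rate control X_i = (k_i / r) * dr/dk_i, with the derivative
    # obtained by seeding the dual part of k_i with 1.
    ks = [Dual(k1, 1.0 if i == 0 else 0.0), Dual(k2, 1.0 if i == 1 else 0.0)]
    r = rate(*ks)
    return ks[i].val / r.val * r.dot

k1, k2 = 0.1, 10.0
print(drc(0, k1, k2), drc(1, k1, k2))  # ~0.990 and ~0.010; the slow step controls the rate
```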

  5.
    The successes of deep learning, variational inference, and many other fields have been aided by specialized implementations of reverse-mode automatic differentiation (AD) to compute gradients of mega-dimensional objectives. The AD techniques underlying these tools were designed to compute exact gradients to numerical precision, but modern machine learning models are almost always trained with stochastic gradient descent. Why spend computation and memory on exact (minibatch) gradients only to use them for stochastic optimization? We develop a general framework and approach for randomized automatic differentiation (RAD), which can allow unbiased gradient estimates to be computed with reduced memory in return for variance. We examine limitations of the general approach, and argue that we must leverage problem specific structure to realize benefits. We develop RAD techniques for a variety of simple neural network architectures, and show that for a fixed memory budget, RAD converges in fewer iterations than using a small batch size for feedforward networks, and in a similar number for recurrent networks. We also show that RAD can be applied to scientific computing, and use it to develop a low-memory stochastic gradient method for optimizing the control parameters of a linear reaction-diffusion PDE representing a fission reactor. 
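    A simplified illustration of the underlying idea (independently sparsifying the stored partial derivatives keeps the chained reverse-mode product unbiased in expectation, trading memory for variance) is sketched below; the keep probability, matrix sizes, and sampling scheme are illustrative and not the specific RAD estimators developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparsify(J, keep_prob):
    # Keep each entry with probability keep_prob and rescale by 1/keep_prob,
    # so E[sparsify(J)] = J while far fewer entries need to be stored.
    mask = rng.random(J.shape) < keep_prob
    return np.where(mask, J / keep_prob, 0.0)

# Reverse mode chains the local Jacobians of a composition together.
J1, J2, J3 = (rng.standard_normal((4, 4)) for _ in range(3))
exact = J1 @ J2 @ J3

# Sparsifying each Jacobian independently keeps the product unbiased:
# E[J1~ @ J2~ @ J3~] = E[J1~] @ E[J2~] @ E[J3~] = J1 @ J2 @ J3.
estimates = [sparsify(J1, 0.5) @ sparsify(J2, 0.5) @ sparsify(J3, 0.5)
             for _ in range(20000)]
print(np.abs(np.mean(estimates, axis=0) - exact).max())  # small; shrinks with more samples
```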