skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Automatic Differentiation in Machine Learning: a Survey
Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in machine learning. Automatic differentiation (AD), also called algorithmic differentiation or simply ``autodiff'', is a family of techniques similar to but more general than backpropagation for efficiently and accurately evaluating derivatives of numeric functions expressed as computer programs. AD is a small but established field with applications in areas including computational fluid dynamics, atmospheric sciences, and engineering design optimization. Until very recently, the fields of machine learning and AD have largely been unaware of each other and, in some cases, have independently discovered each other's results. Despite its relevance, general-purpose AD has been missing from the machine learning toolbox, a situation slowly changing with its ongoing adoption under the names ``dynamic computational graphs'' and ``differentiable programming''. We survey the intersection of AD and machine learning, cover applications where AD has direct relevance, and address the main implementation techniques. By precisely defining the main differentiation techniques and their interrelationships, we aim to bring clarity to the usage of the terms ``autodiff'', ``automatic differentiation'', and ``symbolic differentiation'' as these are encountered more and more in machine learning settings.  more » « less
Award ID(s):
1734938
PAR ID:
10066574
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Journal of machine learning research
Volume:
18
Issue:
153
ISSN:
1533-7928
Page Range / eLocation ID:
1-43
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Aldrich, Jonathan; Salvaneschi, Guido (Ed.)
    Reverse-mode automatic differentiation (autodiff) has been popularized by deep learning, but its ability to compute gradients is also valuable for interactive use cases such as bidirectional computer-aided design, embedded physics simulations, visualizing causal inference, and more. Unfortunately, the web is ill-served by existing autodiff frameworks, which use autodiff strategies that perform poorly on dynamic scalar programs, and pull in heavy dependencies that would result in unacceptable webpage sizes. This work introduces Rose, a lightweight autodiff framework for the web using a new hybrid approach to reverse-mode autodiff, blending conventional tracing and transformation techniques in a way that uses the host language for metaprogramming while also allowing the programmer to explicitly define reusable functions that comprise a larger differentiable computation. We demonstrate the value of the Rose design by porting two differentiable physics simulations, and evaluate its performance on an optimization-based diagramming application, showing Rose outperforming the state-of-the-art in web-based autodiff by multiple orders of magnitude. 
    more » « less
  2. Computing derivatives is key to many algorithms in scientific computing and machine learning such as optimization, uncertainty quantification, and stability analysis. Enzyme is a LLVM compiler plugin that performs reverse-mode automatic differentiation (AD) and thus generates high performance gradients of programs in languages including C/C++, Fortran, Julia, and Rust. Prior to this work, Enzyme and other AD tools were not capable of generating gradients of GPU kernels. Our paper presents a combination of novel techniques that make Enzyme the first fully automatic reversemode AD tool to generate gradients of GPU kernels. Since unlike other tools Enzyme performs automatic differentiation within a general-purpose compiler, we are able to introduce several novel GPU and AD-specific optimizations. To show the generality and efficiency of our approach, we compute gradients of five GPU-based HPC applications, executed on NVIDIA and AMD GPUs. All benchmarks run within an order of magnitude of the original program's execution time. Without GPU and AD-specific optimizations, gradients of GPU kernels either fail to run from a lack of resources or have infeasible overhead. Finally, we demonstrate that increasing the problem size by either increasing the number of threads or increasing the work per thread, does not substantially impact the overhead from differentiation. 
    more » « less
  3. Abstract We have developed a differentiable programming framework for truncated hierarchical B-splines (THB-splines), which can be used for several applications in geometry modeling, such as surface fitting and deformable image registration, and can be easily integrated with geometric deep learning frameworks. Differentiable programming is a novel paradigm that enables an algorithm to be differentiated via automatic differentiation, i.e., using automatic differentiation to compute the derivatives of its outputs with respect to its inputs or parameters. Differentiable programming has been used extensively in machine learning for obtaining gradients required in optimization algorithms such as stochastic gradient descent (SGD). While incorporating differentiable programming with traditional functions is straightforward, it is challenging when the functions are complex, such as splines. In this work, we extend the differentiable programming paradigm to THB-splines. THB-splines offer an efficient approach for complex surface fitting by utilizing a hierarchical tensor structure of B-splines, enabling local adaptive refinement. However, this approach brings challenges, such as a larger computational overhead and the non-trivial implementation of automatic differentiation and parallel evaluation algorithms. We use custom kernel functions for GPU acceleration in forward and backward evaluation that are necessary for differentiable programming of THB-splines. Our approach not only improves computational efficiency but also significantly enhances the speed of surface evaluation compared to previous methods. Our differentiable THB-splines framework facilitates faster and more accurate surface modeling with local refinement, with several applications in CAD and isogeometric analysis. 
    more » « less
  4. Abstract The degree of rate control (DRC) quantitatively identifies the kinetically relevant (sometimes known as rate‐limiting) steps of a complex reaction network. This concept relies on derivatives which are commonly implemented numerically, for example, with finite differences (FDs). Numerical derivatives are tedious to implement, and can be problematic, and unstable or unreliable. In this study, we demonstrate the use of automatic differentiation (AD) in the evaluation of the DRC. AD libraries are increasingly available through modern machine learning frameworks. Compared with the FDs, AD provides solutions with higher accuracy with lower computational cost. We demonstrate applications in steady‐state and transient kinetics. Furthermore, we illustrate a hybrid local‐global sensitivity analysis method, the distributed evaluation of local sensitivity analysis, to assess the importance of kinetic parameters over an uncertain space. This method also benefits from AD to obtain high‐quality results efficiently. 
    more » « less
  5. null (Ed.)
    The successes of deep learning, variational inference, and many other fields have been aided by specialized implementations of reverse-mode automatic differentiation (AD) to compute gradients of mega-dimensional objectives. The AD techniques underlying these tools were designed to compute exact gradients to numerical precision, but modern machine learning models are almost always trained with stochastic gradient descent. Why spend computation and memory on exact (minibatch) gradients only to use them for stochastic optimization? We develop a general framework and approach for randomized automatic differentiation (RAD), which can allow unbiased gradient estimates to be computed with reduced memory in return for variance. We examine limitations of the general approach, and argue that we must leverage problem specific structure to realize benefits. We develop RAD techniques for a variety of simple neural network architectures, and show that for a fixed memory budget, RAD converges in fewer iterations than using a small batch size for feedforward networks, and in a similar number for recurrent networks. We also show that RAD can be applied to scientific computing, and use it to develop a low-memory stochastic gradient method for optimizing the control parameters of a linear reaction-diffusion PDE representing a fission reactor. 
    more » « less