Title: Automatic Differentiation of Programs with Discrete Randomness
Automatic differentiation (AD), a technique for constructing new programs that compute the derivative of an original program, has become ubiquitous throughout scientific computing and deep learning due to the improved performance afforded by gradient-based optimization. However, AD systems have been restricted to the subset of programs that have a continuous dependence on parameters. Programs that have discrete stochastic behaviors governed by distribution parameters, such as flipping a coin with probability p of being heads, pose a challenge to these systems because the connection between the result (heads vs. tails) and the parameter p is fundamentally discrete. In this paper we develop a new reparameterization-based methodology that allows for generating programs whose expectation is the derivative of the expectation of the original program. We showcase how this method gives an unbiased and low-variance estimator which is as automated as traditional AD mechanisms. We demonstrate unbiased forward-mode AD of discrete-time Markov chains, agent-based models such as Conway's Game of Life, and unbiased reverse-mode AD of a particle filter. Our code package is available at https://github.com/gaurav-arya/StochasticAD.jl.
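To make the core idea concrete, here is a minimal sketch (illustrative Python, not the StochasticAD.jl API) of a single-sample unbiased estimator of d/dp E[f(X)] for X ~ Bernoulli(p): the primal sample is coupled with its alternative outcome, and the jump f(1) − f(0) is weighted by the rate at which an infinitesimal increase in p flips a sampled 0 into a 1. The downstream function f and the averaging loop are stand-in assumptions.

```python
import random

def f(x):
    # downstream program consuming the coin flip; a stand-in example
    return 3.0 * x + 1.0

def deriv_estimate(p):
    """One-sample unbiased estimate of d/dp E[f(X)], X ~ Bernoulli(p)."""
    x = 1 if random.random() < p else 0
    if x == 0:
        # an infinitesimal increase in p flips this 0 to a 1 at rate
        # 1/(1-p); weight the resulting jump in f accordingly
        return (f(1) - f(0)) / (1.0 - p)
    return 0.0  # increasing p never flips a sampled 1 to a 0

p, n = 0.3, 200_000
print(sum(deriv_estimate(p) for _ in range(n)) / n)  # ~ 3.0 = f(1) - f(0)
```

Here E[f(X)] = p·f(1) + (1−p)·f(0) = 3p + 1, so the exact derivative is 3, and the estimator's expectation is (1−p)·(f(1)−f(0))/(1−p) = f(1)−f(0). The paper's contribution is automating this coupled-perturbation construction through arbitrary programs, as ordinary AD does for continuous ones.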
Award ID(s): 1835443
PAR ID: 10491375
Author(s) / Creator(s):
Publisher / Repository: Curran Associates, Inc.
Date Published:
Journal Name: Advances in Neural Information Processing Systems
Volume: 35
Page Range / eLocation ID: 10435--10447
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Automatic Differentiation (AD) is instrumental for science and industry. It is a tool to evaluate the derivative of a function specified through a computer program. The range of AD application domains spans from Machine Learning to Robotics to High Energy Physics. Computing gradients with the help of AD is guaranteed to be more precise than the numerical alternative and to incur only a low, constant-factor overhead in arithmetic operations compared to the original function. Moreover, AD applications to domain problems are typically compute-bound. They are often limited by the computational requirements of high-dimensional parameters and thus can benefit from parallel implementations on graphics processing units (GPUs). Clad aims to enable differential analysis for C/C++ and CUDA and is a compiler-assisted AD tool available both as a compiler extension and in ROOT. Moreover, Clad works as a plugin extending the Clang compiler; as a plugin extending the interactive interpreter Cling; and as a Jupyter kernel extension based on xeus-cling. We demonstrate the advantages of parallel gradient computations on GPUs with Clad. We explain how to bring forth a new layer of optimization, and a proportional speedup, by extending Clad to support CUDA. The gradients of well-behaved C++ functions can be automatically executed on a GPU. The library can be easily integrated into existing frameworks or used interactively. Furthermore, we demonstrate the achieved application performance improvements, including (~10x) in ROOT histogram fitting and corresponding performance gains from offloading to GPUs.
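Clad performs this transformation at compile time for C/C++; as a language-neutral illustration (a sketch, not Clad's API), the operator-overloading forward mode below shows why AD is exact and adds only a small constant factor of arithmetic per operation, in contrast to finite differences:

```python
from dataclasses import dataclass

@dataclass
class Dual:
    """A value paired with its derivative; each overloaded operation
    does a constant amount of extra arithmetic alongside the original."""
    val: float
    der: float

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

def f(x):
    return 3.0 * x * x + 2.0 * x + 1.0   # f'(x) = 6x + 2

out = f(Dual(2.0, 1.0))   # seed dx/dx = 1
print(out.val, out.der)   # 17.0 14.0 -- exact, no truncation error
```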
  2. In this paper, we study the “decoding” problem for discrete-time, stochastic hybrid systems with linear dynamics in each mode. Given an output trace of the system, the decoding problem seeks to construct a sequence of modes and states that yield a trace “as close as possible” to the original output trace. The decoding problem generalizes the state estimation problem, and is applicable to hybrid systems with non-determinism. The decoding problem is NP-complete, and can be reduced to solving a mixed-integer linear program (MILP). In this paper, we decompose the decoding problem into two parts: (a) finding a sequence of discrete modes and transitions; and (b) finding corresponding continuous states for the mode/transition sequence. In particular, once a sequence of modes/transitions is fixed, the problem of “filling in” the continuous states reduces to solving a linear program. In order to support the decomposition, we “cover” the set of all possible mode/transition sequences by a finite subset. We use well-known probabilistic arguments to justify a choice of cover with high confidence and design randomized algorithms for finding such covers. Our approach is demonstrated on a series of benchmarks, wherein we observe that a relatively small fraction of the possible mode/transition sequences suffices as a cover. Furthermore, we show that the resulting linear programs can be solved rapidly by exploiting the tree structure of the set cover.
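A minimal sketch of the decomposition under illustrative assumptions: deterministic affine mode dynamics x_{t+1} = A_q x_t + b_q with output y_t = C x_t, a random sample of mode sequences standing in for the paper's probabilistic cover, and least squares standing in for its linear program when filling in the continuous states.

```python
import numpy as np

def fill_in_states(modes, A, b, C, ys):
    """Part (b): with the mode sequence fixed, every state is affine in
    x0 (x_t = M_t @ x0 + m_t), so fitting the output trace is a small
    least-squares problem (a stand-in for the L1 linear program)."""
    n = A[0].shape[0]
    M, m = np.eye(n), np.zeros(n)
    rows, rhs = [], []
    for q, y in zip(modes, ys):
        M, m = A[q] @ M, A[q] @ m + b[q]
        rows.append(C @ M)
        rhs.append(y - C @ m)
    G, r = np.vstack(rows), np.hstack(rhs)
    x0, *_ = np.linalg.lstsq(G, r, rcond=None)
    return x0, np.linalg.norm(G @ x0 - r)

def decode(A, b, C, ys, n_sequences=500, rng=np.random.default_rng(0)):
    """Part (a): sample candidate mode sequences (a stand-in for the
    randomized cover), fill in states for each, keep the closest fit."""
    T, n_modes = len(ys), len(A)
    best = None
    for _ in range(n_sequences):
        modes = tuple(rng.integers(n_modes, size=T))
        x0, err = fill_in_states(modes, A, b, C, ys)
        if best is None or err < best[0]:
            best = (err, modes, x0)
    return best  # (error, mode sequence, initial state)
```

For clarity each candidate sequence is solved independently here; the paper instead exploits shared prefixes among the covered sequences (the tree structure of the cover) to solve the resulting programs rapidly.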
  3. Automatic differentiation (AD) is a technique for augmenting computer programs to compute derivatives. The essence of AD in its forward accumulation mode is to attach perturbations to each number, and propagate these through the computation by overloading the arithmetic operators. When derivatives are nested, the distinct derivative calculations, and their associated perturbations, must be distinguished. This is typically accomplished by creating a unique tag for each derivative calculation and tagging the perturbations. We exhibit a subtle bug, present in fielded implementations which support derivatives of higher-order functions, in which perturbations are confused despite the tagging machinery, leading to incorrect results. The essence of the bug is as follows: a unique tag is needed for each derivative calculation, but in existing implementations unique tags are created when taking the derivative of a function at a point. When taking derivatives of higher-order functions, these need not correspond! We exhibit a simple example: a higher-order function f whose derivative at a point x, namely f′(x), is itself a function which calculates a derivative. This situation arises naturally when taking derivatives of curried functions. Two potential solutions are presented, and their deficiencies discussed. One uses eta expansion to delay the creation of fresh tags in order to put them into one-to-one correspondence with derivative calculations. The other wraps outputs of derivative operators with tag substitution machinery. Both solutions seem very difficult to implement without violating the desirable complexity guarantees of forward AD.
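A self-contained sketch of the standard tagging machinery (illustrative Python, not from the paper). In the nested example d/dx [x · d/dy(x + y)] at x = 1, the inner derivative is 1, so the answer is 1; untagged perturbations would conflate the seed on y with the seed on x and return 2. Note the lifting rule below assumes newer tags are always more deeply nested, which is the very correspondence the paper shows breaks when f′(x) is itself a function that takes derivatives.

```python
import itertools

_fresh = itertools.count()

def _tag(x):
    return x.tag if isinstance(x, Dual) else -1

class Dual:
    """A value with a tangent, labelled by the derivative calculation
    (tag) whose perturbation it carries."""
    def __init__(self, val, tan, tag):
        self.val, self.tan, self.tag = val, tan, tag

    def _pair(self, other):
        # lift both operands to the outermost (largest) tag; an operand
        # with a foreign tag or no tag has zero tangent for this tag
        t = max(_tag(self), _tag(other))
        lift = lambda z: z if _tag(z) == t else Dual(z, 0.0, t)
        return lift(self), lift(other), t

    def __add__(self, other):
        a, b, t = self._pair(other)
        return Dual(a.val + b.val, a.tan + b.tan, t)
    __radd__ = __add__

    def __mul__(self, other):
        a, b, t = self._pair(other)
        return Dual(a.val * b.val, a.val * b.tan + a.tan * b.val, t)
    __rmul__ = __mul__

def deriv(f, x):
    tag = next(_fresh)   # fresh tag per derivative calculation
    out = f(Dual(x, 1.0, tag))
    # extract only the tangent belonging to this calculation's tag
    return out.tan if isinstance(out, Dual) and out.tag == tag else 0.0

# d/dx [ x * d/dy (x + y) ] at x = 1: correctly 1.0 with tags;
# an untagged implementation confuses the two seeds and returns 2.0.
print(deriv(lambda x: x * deriv(lambda y: x + y, 1.0), 1.0))  # 1.0
```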
  4. Semidefinite programs (SDPs) are a staple of today’s systems theory, with applications ranging from robust control to system identification. However, current state-of-the-art solution methods have poor scaling properties, and thus are limited to relatively moderate-size problems. Recently, several approximations have been proposed where the original SDP is relaxed to a sequence of lower-complexity problems (such as linear programs (LPs) or second-order cone programs (SOCPs)). While successful in many cases, there is no guarantee that these relaxations converge to the global optimum of the original program. Indeed, examples exist where these relaxations "get stuck" at suboptimal solutions. To circumvent this difficulty, in this paper we propose an algorithm to solve SDPs based on solving a sequence of LPs or SOCPs, guaranteed to converge in a finite number of steps to an ε-suboptimal solution of the original problem. We further provide a bound on the number of steps required, as a function of ε and the problem data.
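Not the paper's algorithm, but a classical cutting-plane sketch of the same "SDP via a sequence of LPs" idea: each LP replaces the PSD cone by the finitely many scalar cuts vᵀF(x)v ≥ 0 collected so far, and a violating eigenvector of the current solution supplies the next cut.

```python
import numpy as np
from scipy.optimize import linprog

def sdp_via_lps(c, F0, Fs, bounds, tol=1e-7, max_iters=200):
    """Minimize c @ x subject to F(x) = F0 + sum_i x[i]*Fs[i] PSD,
    by solving a sequence of LPs over accumulated eigenvector cuts."""
    A_ub, b_ub = [], []          # linprog convention: A_ub @ x <= b_ub
    x = None
    for _ in range(max_iters):
        res = linprog(c,
                      A_ub=np.array(A_ub) if A_ub else None,
                      b_ub=np.array(b_ub) if b_ub else None,
                      bounds=bounds, method="highs")
        x = res.x
        F = F0 + sum(xi * Fi for xi, Fi in zip(x, Fs))
        lam, V = np.linalg.eigh(F)
        if lam[0] >= -tol:       # F(x) is PSD up to tolerance: done
            return x
        v = V[:, 0]              # eigenvector of most negative eigenvalue
        # cut v @ F(x) @ v >= 0, rewritten as
        # -sum_i (v @ Fs[i] @ v) * x_i <= v @ F0 @ v
        A_ub.append([-(v @ Fi @ v) for Fi in Fs])
        b_ub.append(v @ F0 @ v)
    return x

# Toy SDP: minimize x subject to [[x, 1], [1, x]] PSD; optimum is x = 1.
x = sdp_via_lps(np.array([1.0]),
                F0=np.array([[0.0, 1.0], [1.0, 0.0]]),
                Fs=[np.eye(2)], bounds=[(-10.0, 10.0)])
print(x)  # [1.]
```

Plain cutting planes of this kind can stall near suboptimal points, which is precisely the failure mode the paper's finite-step ε-suboptimality guarantee is designed to rule out.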
  5. Hermenegildo, M.V. (Ed.)
    We describe a new concrete approach to giving predictable error locations for sequential (flow-sensitive) effect systems. Prior implementations of sequential effect systems rely on either computing a bottom-up effect and comparing it to a declaration (e.g., a method annotation) or leaning on constraint-based type inference. These approaches do not necessarily report program locations that precisely indicate where a program may “go wrong” at runtime. Instead of relying on constraint solving, we draw on the notion of a residual from the literature on ordered algebraic structures. Applying residuals to effect quantales (a large class of sequential effect systems) yields an implementation approach which accepts exactly the same programs as the original effect quantale, but for effect-incorrect programs is guaranteed to fail type-checking with predictable error locations tied to evaluation order. We have implemented this idea in a generic effect system implementation framework for Java, and report on experiences applying effect systems from the literature, as well as novel effect systems, to Java programs. We find that the error locations reported by our technique are significantly closer to the program points that lead to failed effect checks.
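A toy instance of the idea (illustrative, not the paper's Java framework): take the effect quantale where effects are nonnegative integers counting, say, resource acquisitions, sequencing is addition, the order is ≤, and the residual e \ f is f − e when f ≥ e and undefined otherwise. Threading the residual of the declared bound through the statements pins the error to the exact statement where checking must fail, rather than one aggregate comparison at the annotation.

```python
def residual(e, f):
    """Largest g with e ; g <= f, i.e. e + g <= f, if one exists."""
    return f - e if f >= e else None

def check(effects, bound):
    """Consume the declared bound statement by statement; the first
    undefined residual identifies the offending statement."""
    remaining = bound
    for i, e in enumerate(effects):
        remaining = residual(e, remaining)
        if remaining is None:
            return f"effect error at statement {i}"
    return "ok"

print(check([1, 2, 4, 1], bound=5))  # effect error at statement 2
print(check([1, 2, 1, 1], bound=5))  # ok
```

A bottom-up check would instead compute the total effect 1+2+4+1 = 8, compare it to the bound 5, and report the failure at the annotation, far from the statement responsible.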