skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: SoD2: Statically Optimizing Dynamic Neural Network Execution
Though many compilation and runtime systems have been developed for DNNs in recent years, the focus has largely been on static DNNs. Dynamic DNNs, where tensor shapes and sizes and even the set of operators used are dependent upon the input and/or execution are becoming common. This paper presents SoD2, a comprehensive framework for optimizing Dynamic DNNs. The basis of our approach is a classification of common operators that form DNNs, and the use of this classification towards a Rank and Dimension Propagation (RDP) method. This framework statically determines the shapes of operators as known constants, symbolic constants, or operations on these. Next, using RDP we enable a series of optimizations, like fused code generation, execution (order) planning, and even runtime memory allocation plan generation. By evaluating the framework on 10 emerging Dynamic DNNs and comparing it against several existing systems, we demonstrate both reductions in execution latency and memory requirements, with RDP-enabled key optimizations responsible for much of the gains.  more » « less
Award ID(s):
2333899
PAR ID:
10520362
Author(s) / Creator(s):
; ;
Publisher / Repository:
ACM
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Today's data science pipelines often rely on user-defined functions (UDFs) written in Python. But interpreted Python code is slow, and Python UDFs cannot be compiled to machine code easily. We present Tuplex, a new data analytics framework that just in-time compiles developers' natural Python UDFs into efficient, end-to-end optimized native code. Tuplex introduces a novel dual-mode execution model that compiles an optimized fast path for the common case, and falls back on slower exception code paths for data that fail to match the fast path's assumptions. Dual-mode execution is crucial to making end-to-end optimizing compilation tractable: by focusing on the common case, Tuplex keeps the code simple enough to apply aggressive optimizations. Thanks to dual-mode execution, Tuplex pipelines always complete even if exceptions occur, and Tuplex's post-facto exception handling simplifies debugging. We evaluate Tuplex with data science pipelines over real-world datasets. Compared to Spark and Dask, Tuplex improves end-to-end pipeline runtime by 5-91x and comes within 1.1-1.7x of a hand-optimized C++ baseline. Tuplex outperforms other Python compilers by 6x and competes with prior, more limited query compilers. Optimizations enabled by dual-mode processing improve runtime by up to 3x, and Tuplex performs well in a distributed setting on serverless functions. 
    more » « less
  2. Intermittently-powered, energy-harvesting devices operate on energy collected from their environment and must operate intermittently as energy is available. Runtime systems for such devices often rely on checkpoints or redo-logs to save execution state between power cycles, causing arbitrary code regions to re-execute on reboot. Anynon-idempotentprogram behavior—behavior that can change on each execution—can lead to incorrect results. This work investigates non-idempotent behavior caused by repeating I/O operations, not addressed by prior work. If such operations affect a control statement or address of a memory update, they can cause programs to take different paths or write to different memory locations on re-executions, resulting in inconsistent memory states. We provide the first characterization of input-dependent idempotence bugs and develop IBIS-S, a program analysis tool for detecting such bugs at compile time, and IBIS-D, a dynamic information flow tracker to detect bugs at runtime. These tools use taint propagation to determine the reach of input. IBIS-S searches for code patterns leading to inconsistent memory updates, while IBIS-D detects concrete memory inconsistencies. We evaluate IBIS on embedded system drivers and applications. IBIS can detect I/O-dependent idempotence bugs, giving few (IBIS-S) or no (IBIS-D) false positives and providing actionable bug reports. These bugs are common in sensor-driven applications and are not fixed by existing intermittent systems. 
    more » « less
  3. It is appealing but challenging to achieve real-time deep neural network (DNN) inference on mobile devices because even the powerful modern mobile devices are considered “resource-constrained” when executing large-scale DNNs. It necessitates the sparse model inference via weight pruning, i.e., DNN weight sparsity, and it is desirable to design a new DNN weight sparsity scheme that can facilitate real-time inference on mobile devices while preserving a high sparse model accuracy. This paper designs a novel mobile inference acceleration framework GRIM that is General to both convolutional neural networks (CNNs) and recurrent neural networks (RNNs) and that achieves Real-time execution and high accuracy, leveraging fine-grained structured sparse model Inference and compiler optimizations for Mobiles. We start by proposing a new fine-grained structured sparsity scheme through the Block-based Column-Row (BCR) pruning. Based on this new fine-grained structured sparsity, our GRIM framework consists of two parts: (a) the compiler optimization and code generation for real-time mobile inference; and (b) the BCR pruning optimizations for determining pruning hyperparameters and performing weight pruning. We compare GRIM with Alibaba MNN, TVM, TensorFlow-Lite, a sparse implementation based on CSR, PatDNN, and ESE (a representative FPGA inference acceleration framework for RNNs), and achieve up to 14.08× speedup. 
    more » « less
  4. Shared memory parallel programming models strive to provide low-overhead execution environments. Task-based programming models, in particular, are well-suited to cope with the ubiquitous multi- and many-core systems since they allow applications to express all available concurrency to a scheduler, which is tasked with exploiting the available hardware resources. It is general consensus that atomic operations should be preferred over locks and mutexes to avoid inter-thread serialization and the resulting loss in efficiency. However, even atomic operations may serialize threads if not used judiciously. In this work, we will discuss several optimizations applied to TTG and the underlying PaRSEC runtime system aiming at removing contentious atomic operations to reduce the overhead of task management to a few hundred clock cycles. The result is an optimized data-flow programming system that seamlessly scales from a single node to distributed execution and which is able to compete with OpenMP in shared memory. 
    more » « less
  5. Batching has a fundamental influence on the efficiency of deep neural network (DNN) execution. However, for dynamic DNNs, efficient batching is particularly challenging as the dataflow graph varies per input instance. As a result, state-of-the-art frameworks use heuristics that result in suboptimal batching decisions. Further, batching puts strict restrictions on memory adjacency and can lead to high data movement costs. In this paper, we provide an approach for batching dynamic DNNs based on finite state machines, which enables the automatic discovery of batching policies specialized for each DNN via reinforcement learning. Moreover, we find that memory planning that is aware of the batching policy can save significant data movement overheads, which is automated by a PQ tree-based algorithm we introduce. Experimental results show that our framework speeds up state-of-the-art frameworks by on average 1.15x, 1.39x, and 2.45x for chain-based, tree-based, and lattice-based DNNs across CPU and GPU. The framework is open-sourced at https://github.com/gulang2019/ED-Batch.git. 
    more » « less