Convolutional neural networks (CNNs) play an important role in today's mobile and edge computing systems for vision-based tasks like object classification and detection. However, state-of-the-art CNN acceleration methods either achieve limited practical latency speed-up on general computing platforms or trade severe accuracy loss for that speed-up. In this paper, we propose a spatial-based dynamic CNN acceleration framework, NeuLens, for mobile and edge platforms. Specifically, we design a novel dynamic inference mechanism, the assemble region-aware convolution (ARAC) supernet, which peels off as many redundant operations inside CNN models as possible based on spatial redundancy and channel slicing. In the ARAC supernet, the CNN inference flow is split into multiple independent micro-flows, and the computational cost of each can be adjusted autonomously based on its tiled-input content and the application's requirements. These micro-flows can be loaded onto hardware such as GPUs as single models. Consequently, the operation reduction translates well into latency speed-up and remains compatible with hardware-level accelerations. Moreover, inference accuracy is well preserved by identifying critical regions of an image and processing them at the original resolution with a large micro-flow. In our evaluation, NeuLens outperforms baseline methods by up to 58% latency reduction at the same accuracy and by up to 67.9% accuracy improvement under the same latency/memory constraints.
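For intuition, the following is a minimal sketch of the tile-routing idea described above, not NeuLens's actual implementation: the image is split into tiles, each tile gets a cheap criticality score, and critical tiles are processed at full resolution by a large micro-flow while redundant tiles are downsampled into a small one. The `large_flow`/`small_flow` models, the gradient-magnitude heuristic, and all thresholds are illustrative assumptions.

```python
# Hypothetical sketch of spatial micro-flow dispatch (not NeuLens's code).
import torch
import torch.nn.functional as F

def tile_criticality(tile: torch.Tensor) -> float:
    """Cheap spatial-redundancy proxy: mean gradient magnitude of the tile."""
    gray = tile.mean(dim=0, keepdim=True)            # (1, H, W)
    dx = (gray[:, :, 1:] - gray[:, :, :-1]).abs().mean()
    dy = (gray[:, 1:, :] - gray[:, :-1, :]).abs().mean()
    return (dx + dy).item()

def run_arac_style_inference(image, large_flow, small_flow,
                             tile=64, threshold=0.05):
    """Critical tiles keep full resolution and a large micro-flow;
    redundant tiles are downsampled and sent to a small one."""
    _, H, W = image.shape
    outputs = []
    for y in range(0, H, tile):
        for x in range(0, W, tile):
            patch = image[:, y:y + tile, x:x + tile]
            if tile_criticality(patch) > threshold:
                outputs.append(large_flow(patch.unsqueeze(0)))
            else:
                small = F.interpolate(patch.unsqueeze(0), scale_factor=0.5,
                                      mode="bilinear", align_corners=False)
                outputs.append(small_flow(small))
    return outputs
```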
Hardware–Software Co-Design for Real-Time Latency–Accuracy Navigation in Tiny Machine Learning Applications
Tiny machine learning (TinyML) applications increasingly operate in dynamically changing deployment scenarios, requiring optimization for both accuracy and latency. Existing methods mainly target a single point in the accuracy/latency trade-off space, which is insufficient because no single static point can be optimal under variable conditions. We draw on a recently proposed weight-shared SuperNet mechanism to serve a stream of queries, each of which activates a different SubNet within the SuperNet. This creates an opportunity to exploit the inherent temporal locality across queries that use the same SuperNet. We propose a hardware–software co-design called SUSHI that introduces a novel SubGraph Stationary optimization. SUSHI consists of a novel field-programmable gate array implementation and a software scheduler that controls, in real time, which SubNets to serve and which SubGraph to cache. SUSHI yields up to a 32% improvement in latency and a 0.98% increase in served accuracy, and saves up to 78.7% of off-chip energy across several neural network architectures.
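As a toy rendering of the SubGraph Stationary idea under invented numbers (this is not SUSHI's FPGA implementation), the sketch below serves each query with the highest-accuracy SubNet that fits its latency budget, models cache hits on stationary SubGraphs as a latency discount, and keeps the SubGraphs most reused by recent queries on chip. All SubNet names, latencies, and the 0.5 ms cache-hit bonus are assumptions.

```python
# Toy SubGraph Stationary scheduler (illustrative only).
from collections import Counter

SUBNETS = {  # name: (latency_ms, accuracy, SubGraphs it executes)
    "small":  (4.0, 0.88, {"stem", "blockA"}),
    "medium": (7.0, 0.91, {"stem", "blockA", "blockB"}),
    "large":  (12.0, 0.94, {"stem", "blockA", "blockB", "blockC"}),
}

class SubGraphStationaryScheduler:
    def __init__(self, window=16, cache_slots=2):
        self.window = window
        self.cache_slots = cache_slots
        self.history = []    # SubNets served recently
        self.cached = set()  # SubGraphs currently held on chip

    def pick_subnet(self, latency_budget_ms):
        """Highest-accuracy SubNet under budget; cache hits cut latency."""
        best = None
        for name, (lat, acc, graphs) in SUBNETS.items():
            effective = lat - 0.5 * len(graphs & self.cached)  # cache-hit bonus
            if effective <= latency_budget_ms and (best is None or acc > best[1]):
                best = (name, acc)
        return best[0] if best else "small"  # fall back to the cheapest SubNet

    def update_cache(self, served):
        """Keep the SubGraphs most frequently reused in the recent window."""
        self.history = (self.history + [served])[-self.window:]
        counts = Counter(g for s in self.history for g in SUBNETS[s][2])
        self.cached = {g for g, _ in counts.most_common(self.cache_slots)}
```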
- Award ID(s):
- 2029004
- PAR ID:
- 10510165
- Publisher / Repository:
- IEEE Computer Society
- Date Published:
- Journal Name:
- IEEE Micro
- Volume:
- 43
- Issue:
- 6
- ISSN:
- 0272-1732
- Page Range / eLocation ID:
- 93 to 101
- Subject(s) / Keyword(s):
- Training, Real Time Systems, Optimization, Neural Networks, System On Chip, Software, Tiny Machine Learning, Machine Learning, Microcontrollers, Software Design, Machine Learning Applications, Neural Network, Deep Neural Network, Search Space, Autonomous Vehicles, Design Space, Self Driving, Temporal Localization, Caching, Data Reuse, Neural Architecture Search, Deployment Phase, Improvement In Latency, Blue Dots, Dot Product, Load Data, Latency Reduction, Cache Hit, Shared Weights
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Low-latency and low-power edge AI is crucial for Virtual Reality and Augmented Reality applications. Recent advances demonstrate that hybrid models, combining convolutional layers (CNNs) and vision transformers (ViTs), often achieve a superior accuracy/performance trade-off on various computer vision and machine learning (ML) tasks. However, hybrid ML models can pose system challenges for latency and energy efficiency because of their diverse dataflow and memory access patterns. In this work, we leverage the architectural heterogeneity of Neural Processing Units (NPUs) and Compute-In-Memory (CIM) and explore diverse execution schemas to execute these hybrid models efficiently. We introduce H4H-NAS, a two-stage Neural Architecture Search (NAS) framework that automates the design of efficient hybrid CNN/ViT models for heterogeneous edge systems featuring both NPU and CIM. We propose a two-phase incremental supernet training scheme in our NAS framework to resolve gradient conflicts between sampled subnets caused by the different block types in a hybrid model search space. Our H4H-NAS approach is also powered by a performance estimator built from NPU performance results measured on real silicon and CIM performance based on industry IPs. H4H-NAS searches hybrid CNN-ViT models at fine granularity and achieves significant (up to 1.34%) top-1 accuracy improvement on ImageNet. Moreover, results from our algorithm/hardware co-design reveal up to 56.08% overall latency and 41.72% energy improvements from introducing heterogeneous computing over baseline solutions. Overall, our framework guides the design of hybrid network architectures and system architectures for NPU+CIM heterogeneous systems.
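For intuition only, here is a hedged sketch of the search phase such a two-stage NAS might run, not H4H-NAS's released code: randomly sample hybrid CNN/ViT subnets, estimate each one's latency from per-block NPU/CIM lookup tables, and keep the accuracy/latency Pareto set. The block choices, latency numbers, and `eval_accuracy` callback are made-up placeholders.

```python
# Hypothetical hybrid CNN/ViT random search with a lookup-table estimator.
import random

BLOCK_CHOICES = ["conv3x3", "conv5x5", "attention"]
NPU_LAT = {"conv3x3": 0.8, "conv5x5": 1.4, "attention": 2.5}  # ms, illustrative
CIM_LAT = {"conv3x3": 1.1, "conv5x5": 1.6, "attention": 1.2}

def estimate_latency(subnet):
    """Place each block on whichever backend (NPU or CIM) runs it faster."""
    return sum(min(NPU_LAT[b], CIM_LAT[b]) for b in subnet)

def search(eval_accuracy, depth=8, samples=200, seed=0):
    """Random search keeping the (latency, accuracy) Pareto frontier."""
    rng = random.Random(seed)
    pareto = []  # list of (latency, accuracy, subnet)
    for _ in range(samples):
        subnet = [rng.choice(BLOCK_CHOICES) for _ in range(depth)]
        lat, acc = estimate_latency(subnet), eval_accuracy(subnet)
        if any(l <= lat and a >= acc for l, a, _ in pareto):
            continue  # dominated by an existing point
        pareto = [(l, a, s) for l, a, s in pareto
                  if not (lat <= l and acc >= a)] + [(lat, acc, subnet)]
    return pareto
```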
-
Neural-network-enabled data analysis in real-time scientific applications imposes stringent requirements on inference latency. Meanwhile, recent deep learning (DL) model designs tend to replace a single branch with multiple branches for higher prediction accuracy and robustness, which makes inter-operator parallelization an effective approach to improving inference latency. However, existing inter-operator parallelization techniques for inference acceleration mainly focus on utilization optimization within a single GPU. With the data size of an input sample and the scale of a DL model ever growing, the limited resources of a single GPU are insufficient to support the parallel execution of large operators. To break this limitation, we study hybrid inter-operator parallelism both across multiple GPUs and within each GPU. In this paper, we design and implement a hierarchical inter-operator scheduler (HIOS) that automatically distributes large operators onto different GPUs and groups small operators in the same GPU for parallel execution. In particular, we propose a novel scheduling algorithm, named HIOS-LP, which consists of inter-GPU operator parallelization through iterative longest-path (LP) mapping and intra-GPU operator parallelization based on a sliding window. In addition to extensive simulation results, experiments with modern convolutional neural network benchmarks demonstrate that HIOS-LP outperforms the state-of-the-art inter-operator scheduling algorithm IOS by up to 17% in real systems.
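A simplified, self-contained rendering of the iterative longest-path idea follows; it is in the spirit of HIOS-LP but is not the authors' implementation, and the intra-GPU sliding-window step is omitted. The scheme repeatedly peels the costliest remaining path off the operator DAG via a topological-order DP and pins that whole path to one GPU, so operators on the critical path avoid cross-GPU transfers. Node names, costs, and the round-robin GPU choice are assumptions for illustration.

```python
# Iterative longest-path peeling over an operator DAG (illustrative).
from collections import deque

def topo_order(nodes, succ):
    """Kahn's algorithm restricted to the operators still unplaced."""
    indeg = {n: 0 for n in nodes}
    for n in nodes:
        for m in succ.get(n, ()):
            if m in indeg:
                indeg[m] += 1
    queue, order = deque(n for n in nodes if indeg[n] == 0), []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ.get(n, ()):
            if m in indeg:
                indeg[m] -= 1
                if indeg[m] == 0:
                    queue.append(m)
    return order

def lp_mapping(nodes, succ, cost, num_gpus):
    """Assign every operator to a GPU by iterative longest-path peeling."""
    placement, gpu, remaining = {}, 0, set(nodes)
    while remaining:
        order = topo_order(remaining, succ)
        best = {n: cost[n] for n in remaining}  # costliest path starting at n
        nxt = {n: None for n in remaining}
        for n in reversed(order):               # successors are settled first
            for m in succ.get(n, ()):
                if m in remaining and cost[n] + best[m] > best[n]:
                    best[n], nxt[n] = cost[n] + best[m], m
        node = max(best, key=best.get)          # head of the longest path
        path = []
        while node is not None:
            path.append(node)
            node = nxt[node]
        for op in path:
            placement[op] = gpu
        remaining -= set(path)
        gpu = (gpu + 1) % num_gpus              # round-robin GPU choice
    return placement
```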
-
Motivated by the transformative capabilities of large language models (LLMs) across various natural language tasks, there has been a growing demand to deploy these models effectively across diverse real-world applications and platforms. However, the challenge of efficiently deploying LLMs has become increasingly pronounced due to varying application-specific performance requirements and the rapid evolution of computational platforms, which feature diverse resource constraints and deployment flows. These varying requirements necessitate LLMs that can adapt their structures (depth and width) for optimal efficiency across different platforms and application specifications. To address this critical gap, we propose AmoebaLLM, a novel framework designed to enable the instant derivation of LLM subnets of arbitrary shapes that achieve the accuracy-efficiency frontier and can be extracted immediately after a one-time fine-tuning. In this way, AmoebaLLM significantly facilitates rapid deployment tailored to various platforms and applications. Specifically, AmoebaLLM integrates three innovative components: (1) a knowledge-preserving subnet selection strategy that features a dynamic-programming approach for depth shrinking and an importance-driven method for width shrinking; (2) a shape-aware mixture of LoRAs to mitigate gradient conflicts among subnets during fine-tuning; and (3) an in-place distillation scheme with loss-magnitude balancing as the fine-tuning objective. Extensive experiments validate that AmoebaLLM not only sets new standards in LLM adaptability but also successfully delivers subnets that achieve state-of-the-art trade-offs between accuracy and efficiency. Our code is available at https://github.com/GATECH-EIC/AmoebaLLM.
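As a toy illustration of importance-driven width shrinking (one of the three components; the depth-shrinking DP and the LoRA mixture are omitted, and none of this is AmoebaLLM's released code), the sketch below scores the intermediate channels of a transformer FFN by weight norm and slices both linear layers down to the top-k channels.

```python
# Hypothetical importance-driven width shrinking of a transformer FFN.
import torch
import torch.nn as nn

def shrink_ffn_width(up: nn.Linear, down: nn.Linear, keep_ratio: float):
    """Keep only the most important intermediate channels of an FFN block."""
    # Importance of channel i: product of its input- and output-weight norms.
    score = up.weight.norm(dim=1) * down.weight.norm(dim=0)
    k = max(1, int(keep_ratio * score.numel()))
    keep = torch.topk(score, k).indices.sort().values

    new_up = nn.Linear(up.in_features, k, bias=up.bias is not None)
    new_down = nn.Linear(k, down.out_features, bias=down.bias is not None)
    with torch.no_grad():
        new_up.weight.copy_(up.weight[keep])        # rows = kept channels
        if up.bias is not None:
            new_up.bias.copy_(up.bias[keep])
        new_down.weight.copy_(down.weight[:, keep])  # columns = kept channels
        if down.bias is not None:
            new_down.bias.copy_(down.bias)
    return new_up, new_down
```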
-
Sketches are single-pass, small-space data summaries that can quickly estimate the cardinality of join queries. However, sketches are not directly applicable to join queries with dynamic filter conditions --- where arbitrary selection predicate(s) are applied --- since a sketch is limited to a fixed selection. While multiple sketches for various selections can be used in combination, they each incur individual storage and maintenance costs. Alternatively, exact sketches can be built at runtime for every selection. To make this process scale, a high degree of parallelism --- available in hardware accelerators such as GPUs --- is required. Therefore, sketch usage for cardinality estimation in query optimization is limited. Following recent work that applies transformers to cardinality estimation, we design a novel learning-based method to approximate the sketch of any arbitrary selection, enabling sketches for join queries with filter conditions. We train a transformer on each table to estimate the sketch of any subset of the table, i.e., any arbitrary selection. Transformers achieve this by learning the joint distribution among table attributes, which is equivalent to a multidimensional sketch. Subsequently, transformers can approximate any sketch, enabling sketches for join cardinality estimation. In turn, estimating joins via approximate sketches allows tables to be modeled individually and thus scales linearly with the number of tables. We evaluate the accuracy and efficacy of approximate sketches on queries with selection predicates consisting of conjunctions of point and range conditions. Approximate sketches achieve similar accuracy to exact sketches with at least one order of magnitude less overhead.
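For background, the classic AGMS sketch this line of work builds on can be illustrated in a few lines. This is the textbook technique, not the paper's transformer-based approximation, and the seeded-PRNG sign hash below is a stand-in for a proper 4-wise independent hash: each table keeps ±1-hashed counters over its join key, and the average per-copy product of two sketches built with the same seeds estimates the join cardinality.

```python
# Textbook AGMS join-cardinality sketch (background illustration).
import random

class AGMSSketch:
    def __init__(self, copies=64, seed=0):
        rng = random.Random(seed)
        self.seeds = [rng.getrandbits(64) for _ in range(copies)]
        self.counters = [0] * copies

    @staticmethod
    def _sign(seed, key):
        """Deterministic pseudo-random +1/-1 for a (seed, key) pair."""
        return 1 if random.Random(seed ^ hash(key)).getrandbits(1) else -1

    def add(self, key):
        for i, s in enumerate(self.seeds):
            self.counters[i] += self._sign(s, key)

def estimate_join(r, s):
    """Average per-copy product; r and s must be built with the same seed."""
    return sum(x * y for x, y in zip(r.counters, s.counters)) / len(r.counters)

# Usage: both sketches share seed=0 so their sign hashes agree.
r, s = AGMSSketch(seed=0), AGMSSketch(seed=0)
for k in [1, 2, 2, 3]:
    r.add(k)
for k in [2, 3, 3]:
    s.add(k)
print(estimate_join(r, s))  # unbiased estimate of the true join size, 4
```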