skip to main content


This content will become publicly available on August 21, 2024

Title: An Investigation on Hardware-Aware Vision Transformer Scaling

Vision Transformer (ViT) has demonstrated promising performance in various computer vision tasks, and recently attracted a lot of research attention. Many recent works have focused on proposing new architectures to improve ViT and deploying it into real-world applications. However, little effort has been made to analyze and understand ViT’s architecture design space and its implication of hardware-cost on different devices. In this work, by simply scaling ViT’s depth, width, input size, and other basic configurations, we show that a scaled vanilla ViT model without bells and whistles can achieve comparable or superior accuracy-efficiency trade-off than most of the latest ViT variants. Specifically, compared to DeiT-Tiny, our scaled model achieves a\(\uparrow 1.9\% \)higher ImageNet top-1 accuracy under the same FLOPs and a\(\uparrow 3.7\% \)better ImageNet top-1 accuracy under the same latency on an NVIDIA Edge GPU TX2. Motivated by this, we further investigate the extracted scaling strategies from the following two aspects: (1) “can these scaling strategies be transferred across different real hardware devices?”; and (2) “can these scaling strategies be transferred to different ViT variants and tasks?”. For (1), our exploration, based on various devices with different resource budgets, indicates that the transferability effectiveness depends on the underlying device together with its corresponding deployment tool; for (2), we validate the effective transferability of the aforementioned scaling strategies obtained from a vanilla ViT model on top of an image classification task to the PiT model, a strong ViT variant targeting efficiency, as well as object detection and video classification tasks. In particular, when transferred to PiT, our scaling strategies lead to a boosted ImageNet top-1 accuracy of from\(74.6\% \)to\(76.7\% \)(\(\uparrow 2.1\% \)) under the same 0.7G FLOPs; and when transferred to the COCO object detection task, the average precision is boosted by\(\uparrow 0.7\% \)under a similar throughput on a V100 GPU.

 
more » « less
Award ID(s):
1937592
NSF-PAR ID:
10487905
Author(s) / Creator(s):
; ; ; ; ; ; ;  
Publisher / Repository:
ACM
Date Published:
Journal Name:
ACM Transactions on Embedded Computing Systems
ISSN:
1539-9087
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this article, we show how a single function,join, can be used to implement parallelbalanced binary search trees(BSTs) simply and efficiently. Based onjoin, our approach applies to multiple balanced tree data structures, and a variety of functions for ordered sets and maps. We describe our technique as an algorithmic framework calledjoin-based algorithms. We show that thejoinfunction fully captures what is needed for rebalancing trees for a variety of tree algorithms, as long as the balancing scheme satisfies certain properties, which we refer to asjoinabletrees. We discuss four balancing schemes that are joinable: AVL trees, red-black trees, weight-balanced trees, and treaps. We present a variety of tree algorithms that apply to joinable trees, includinginsert,delete,union,intersection,difference,split,range,filter, and so on, most of them also parallel. These algorithms are generic across balancing schemes. Many algorithms are optimal in the comparison model, and we provide a general proof to show the efficiency in work for joinable trees. The algorithms are highly parallel, all with polylogarithmic span (parallel dependence). Specifically, the set-set operationsunion,intersection, anddifferencehave work\( O(m\log (\frac{n}{m}+1)) \)and polylogarithmic span for input set sizes\( n \)and\( m\le n \).

    We implemented and tested our algorithms on the four balancing schemes. In general, all four schemes have quite similar performance, but the weight-balanced tree slightly outperforms the others. They have the same speedup characteristics, getting around 73\( \times \)speedup on 72 cores (144 hyperthreads). Experimental results also show that our implementation outperforms existing parallel implementations, and our sequential version achieves close or much better performance than the sequential merging algorithm in C++ Standard Template Library (STL) on various input sizes.

     
    more » « less
  2. We study the problem of approximating maximum Nash social welfare (NSW) when allocatingmindivisible items amongnasymmetric agents with submodular valuations. TheNSWis a well-established notion of fairness and efficiency, defined as the weighted geometric mean of agents’ valuations. For special cases of the problem with symmetric agents and additive(-like) valuation functions, approximation algorithms have been designed using approaches customized for these specific settings, and they fail to extend to more general settings. Hence, no approximation algorithm with a factor independent ofmwas known either for asymmetric agents with additive valuations or for symmetric agents beyond additive(-like) valuations before this work.

    In this article, we extend our understanding of theNSWproblem to far more general settings. Our main contribution is two approximation algorithms for asymmetric agents with additive and submodular valuations. Both algorithms are simple to understand and involve non-trivial modifications of a greedy repeated matchings approach. Allocations of high-valued items are done separately by un-matching certain items and re-matching them by different processes in both algorithms. We show that these approaches achieve approximation factors ofO(n) andO(nlogn) for additive and submodular cases, independent of the number of items. For additive valuations, our algorithm outputs an allocation that also achieves the fairness property of envy-free up to one item (EF1).

    Furthermore, we show that theNSWproblem under submodular valuations is strictly harder than all currently known settings with an\(\frac{\mathrm{e}}{\mathrm{e}-1}\)factor of the hardness of approximation, even for constantly many agents. For this case, we provide a different approximation algorithm that achieves a factor of\(\frac{\mathrm{e}}{\mathrm{e}-1}\), hence resolving it completely.

     
    more » « less
  3. Hybrid traffic which involves both autonomous and human-driven vehicles would be the norm of the autonomous vehicles’ practice for a while. On the one hand, unlike autonomous vehicles, human-driven vehicles could exhibit sudden abnormal behaviors such as unpredictably switching to dangerous driving modes – putting its neighboring vehicles under risks; such undesired mode switching could arise from numbers of human driver factors, including fatigue, drunkenness, distraction, aggressiveness, etc. On the other hand, modern vehicle-to-vehicle (V2V) communication technologies enable the autonomous vehicles to efficiently and reliably share the scarce run-time information with each other [1]. In this paper, we propose, to the best of our knowledge, the first efficient algorithm that can (1) significantly improve trajectory prediction by effectively fusing the run-time information shared by surrounding autonomous vehicles, and can (2) accurately and quickly detect abnormal human driving mode switches or abnormal driving behavior with formal assurance without hurting human drivers’ privacy.

    To validate our proposed algorithm, we first evaluate our proposed trajectory predictor on NGSIM and Argoverse datasets and show that our proposed predictor outperforms the baseline methods. Then through extensive experiments on SUMO simulator, we show that our proposed algorithm has great detection performance in both highway and urban traffic. The best performance achieves detection rate of\(97.3\% \), average detection delay of 1.2s, and 0 false alarm.

     
    more » « less
  4. Minimum flow decomposition (MFD) is the NP-hard problem of finding a smallest decomposition of a network flow/circulationXon a directed graphGinto weighted source-to-sink paths whose weighted sum equalsX. We show that, for acyclic graphs, considering thewidthof the graph (the minimum number of paths needed to cover all of its edges) yields advances in our understanding of its approximability. For the version of the problem that uses only non-negative weights, we identify and characterise a new class ofwidth-stablegraphs, for which a popular heuristic is aO(logVal(X))-approximation (Val(X) being the total flow ofX), and strengthen its worst-case approximation ratio from\(\Omega (\sqrt {m})\)to Ω (m/logm) for sparse graphs, wheremis the number of edges in the graph. We also study a new problem on graphs with cycles, Minimum Cost Circulation Decomposition (MCCD), and show that it generalises MFD through a simple reduction. For the version allowing also negative weights, we give a (⌈ log ‖ X ‖ ⌉ +1)-approximation (‖X‖ being the maximum absolute value ofXon any edge) using a power-of-two approach, combined with parity fixing arguments and a decomposition of unitary circulations (‖X‖ ≤ 1), using a generalised notion of width for this problem. Finally, we disprove a conjecture about the linear independence of minimum (non-negative) flow decompositions posed by Kloster et al. [2018], but show that its useful implication (polynomial-time assignments of weights to a given set of paths to decompose a flow) holds for the negative version.

     
    more » « less
  5. Despite being one of the oldest data structures in computer science, hash tables continue to be the focus of a great deal of both theoretical and empirical research. A central reason for this is that many of the fundamental properties that one desires from a hash table are difficult to achieve simultaneously; thus many variants offering different trade-offs have been proposed.

    This article introduces Iceberg hashing, a hash table that simultaneously offers the strongest known guarantees on a large number of core properties. Iceberg hashing supports constant-time operations while improving on the state of the art for space efficiency, cache efficiency, and low failure probability. Iceberg hashing is also the first hash table to support a load factor of up to1 - o(1)while being stable, meaning that the position where an element is stored only ever changes when resizes occur. In fact, in the setting where keys are Θ (logn) bits, the space guarantees that Iceberg hashing offers, namely that it uses at most\(\log \binom{|U|}{n} + O(n \log \ \text{log} n)\)bits to storenitems from a universeU, matches a lower bound by Demaine et al. that applies to any stable hash table.

    Iceberg hashing introduces new general-purpose techniques for some of the most basic aspects of hash-table design. Notably, our indirection-free technique for dynamic resizing, which we call waterfall addressing, and our techniques for achieving stability and very-high probability guarantees, can be applied to any hash table that makes use of the front-yard/backyard paradigm for hash table design.

     
    more » « less