The rapidly increasing size of deep-learning models has renewed interest in alternatives to digital-electronic computers as a means to dramatically reduce the energy cost of running state-of-the-art neural networks. Optical matrix-vector multipliers are best suited to performing computations with very large operands, which suggests that large Transformer models could be a good target for them. In this paper, we investigate, through a combination of simulations and experiments on prototype optical hardware, the feasibility and potential energy benefits of running Transformer models on future optical accelerators that perform matrix-vector multiplication. We use simulations, with noise models validated by small-scale optical experiments, to show that optical accelerators for matrix-vector multiplication should be able to accurately run a typical Transformer-architecture language model. We demonstrate that optical accelerators can achieve the same (or better) perplexity as digital-electronic processors at 8-bit precision, provided that the optical hardware uses sufficiently many photons per inference, which translates directly to a requirement on optical energy per inference. We study numerically how the requirement on optical energy per inference changes as a function of the Transformer width $$d$$ and find that the optical energy per multiply--accumulate (MAC) scales approximately as $$\frac{1}{d}$$, giving optical accelerators an asymptotic energy advantage over digital systems. We also analyze the total system energy costs for optical accelerators running Transformers, including both optical and electronic costs, as a function of model size. We predict that well-engineered, large-scale optical hardware should be able to achieve a $$100\times$$ energy-efficiency advantage over current digital-electronic processors in running some of the largest current Transformer models, and if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical accelerators could have a $$>8,000\times$$ energy-efficiency advantage. Under plausible assumptions about future improvements to electronics and Transformer quantization techniques (5× cheaper memory access, double the digital-to-analog conversion efficiency, and 4-bit precision), we estimate that the energy advantage for optical processors versus electronic processors operating at 300~fJ/MAC could grow to $$>100,000\times$$.
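The $$\frac{1}{d}$$ scaling claim can be illustrated with a short numerical sketch. The snippet below is not taken from the paper: the reference width `d_ref`, the anchor value `e_opt_ref` for optical energy per MAC, and the use of the 300 fJ/MAC figure as a fixed digital baseline are illustrative assumptions, used only to show how a per-MAC energy advantage that grows linearly with the width $$d$$ would behave.

```python
# Illustrative sketch (assumed constants, not values from the paper): if the
# optical energy per multiply--accumulate (MAC) falls off as ~1/d with the
# Transformer width d, while the digital energy per MAC stays roughly constant,
# then the per-MAC energy advantage of the optical accelerator grows ~linearly in d.

def optical_energy_per_mac(d, e_opt_ref=1e-15, d_ref=1024):
    """Assumed ~1/d scaling, anchored at e_opt_ref joules/MAC for width d_ref."""
    return e_opt_ref * d_ref / d

def energy_advantage(d, e_digital=300e-15):
    """Ratio of a fixed digital cost per MAC (here 300 fJ) to the optical cost."""
    return e_digital / optical_energy_per_mac(d)

for d in (1_024, 16_384, 262_144):
    print(f"d = {d:>7,}: per-MAC advantage ~ {energy_advantage(d):,.0f}x")
```

Running the sketch shows the advantage increasing in direct proportion to $$d$$, which is the asymptotic behaviour the abstract refers to; the absolute numbers depend entirely on the assumed constants.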