skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Search for: All records

Creators/Authors contains: "Krishna, Tushar"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Vector Symbolic Architectures (VSAs), also known as hyperdimensional (HD) computing, are increasingly deployed in cognitive applications due to their simple and efficient operations. The widespread adoption has, in turn, spurred the development of a diverse set of hardware solutions that optimize VSA performance for embedded and edge AI systems. Despite these advances, there remains a lack of comprehensive, unified discussion on the co-design and co-evolution of VSA algorithms and hardware. This survey aims to bridge that gap by linking theoretical, software-level explorations with efficient hardware architectures and emerging technology fabrics for VSAs, providing co-design insights that are accessible to both algorithm and hardware communities. First, we introduce the principles of vector-symbolic computing, including its core mathematical operations and learning paradigms. Second, we provide an in-depth discussion on hardware technologies for VSAs, analyzing analog, mixed-signal, and digital circuit design styles. We compare hardware implementations of VSAs by carrying out detailed analysis of their performance characteristics and tradeoffs, from which we distill design guidelines that are applicable across arbitrary VSA formulations. Third, we discuss a methodology for cross-layer design of VSAs that identifies synergies across layers and explores key ingredients for hardware/software co-design of VSAs. Finally, as a concrete case study of this methodology, we present an in-memory computing hardware design for VSA-based hierarchical cognition, illustrating how the proposed co-design principles translate into efficient architectures. The paper concludes with a discussion of open research challenges and opportunities for future explorations. 
    more » « less
  2. The ideal latency for on-chip network traversal would be the delay incurred from wire traversal alone. Unfortunately, in a realistic modular network, the latency for a packet to traverse the network is significantly higher than this wire delay. The main limiter to achieving lower latency is the modular quantization of network traversal into hops. Beyond this, the physical heterogeneity in real-world systems further complicate the ability to reach ideal wire-only delay. In this work, we propose TNT or Transparent Network Traversal . TNT targets ideal network latency by attempting source to destination network traversal as a single multi-cycle ‘long-hop’, bypassing the quantization effects of intermediate routers via transparent data/information flow. TNT is built in a modular tile-scalable manner via a novel control path performing neighbor-to-neighbor interactions but enabling end-to-end transparent flit traversal. Further, TNT’s fine grained on-the-fly delay tracking allows it to cope with physical NOC heterogeneity across the chip. Analysis on Ligra graph workloads shows that TNT can reduce NOC latency by as much as 43% compared to the state of the art and allows efficiency gains up to 38%. Further, it can achieve more than 3x the benefits of the best/closest alternative research proposal, SMART [ 43 ]. 
    more » « less
  3. In this paper, we consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers). Specifically, we investigate as to how a deep model should be divided into several parallel sub-models, each of which is executed efficiently by a worker. Since latency due to synchronization and data transfer among workers negatively impacts the performance of the parallel implementation, it is desirable to have minimum interdependency among parallel sub-models. To achieve this goal, we propose to rearrange the neurons in the neural network, partition them (without changing the general topology of the neural network), and modify the weights such that the interdependency among sub-models is minimized under the computations and communications constraints of the workers while minimizing its impact on the performance of the model. We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model. To efficiently apply RePurpose, we propose an approach based on L0 optimization and the Munkres assignment algorithm. We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation, both in terms of communication and computational complexity. 
    more » « less
  4. Tiny machine learning (TinyML) applications increasingly operate in dynamically changing deployment scenarios, requiring optimization for both accuracy and latency. Existing methods mainly target a single point in the accuracy/latency tradeoff space, which is insufficient as no single static point can be optimal under variable conditions. We draw on a recently proposed weight-shared SuperNet mechanism to enable serving a stream of queries that activates different SubNets within a SuperNet. This creates an opportunity to exploit the inherent temporal locality of different queries that use the same SuperNet. We propose a hardware–software co-design called SUSHI that introduces a novel SubGraph Stationary optimization. SUSHI consists of a novel field-programmable gate array implementation and a software scheduler that controls which SubNets to serve and which SubGraph to cache in real time. SUSHI yields up to a 32% improvement in latency, 0.98% increase in served accuracy, and achieves up to 78.7% off-chip energy saved across several neural network architectures. 
    more » « less
  5. Song, Dawn; Carbin, Michael; Chen, T (Ed.)