Title: Charm: A Language for Closed-Form High-Level Architecture Modeling
As computer architecture continues to expand beyond software-agnostic microarchitecture to data center organization, reconfigurable logic, heterogeneous systems, application-specific logic, and even radically different technologies such as quantum computing, detailed cycle-level simulation is no longer presupposed. Exploring designs under such complex interacting relationships (e.g., performance, energy, thermal, cost, voltage, frequency, cooling energy, leakage, etc.) calls for a more integrative but higher-level approach. We propose Charm, a domain-specific language supporting Closed-form High-level ARchitecture Modeling. Charm enables mathematical representations of mutually dependent architectural relationships to be specified, composed, checked, evaluated, and reused. The language is interpreted through a combination of symbolic evaluation (e.g., restructuring) and compiler techniques (e.g., memoization and invariant hoisting), generating executable evaluation functions and optimized analysis procedures. Further supporting reuse, a type system constrains architectural quantities and ensures models operate only in a validated domain. Through two case studies, we demonstrate that Charm allows one to define high-level architecture models concisely, maximize reusability, catch unreasonable assumptions and inputs, and significantly speed up design space exploration.
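The abstract describes specifying, composing, and symbolically evaluating mutually dependent closed-form relationships. The following is a minimal Python/SymPy sketch of that workflow under assumed relationships (an Amdahl's-law speedup coupled to a simple power budget); it is an illustration of the idea, not Charm syntax or its implementation.

```python
# A minimal SymPy sketch of closed-form model composition in the spirit of
# Charm; all names and relationships here are illustrative assumptions.
import sympy as sp

# Architectural quantities as positive symbols (a stand-in for Charm's typed,
# domain-constrained quantities).
f, n, p_core, p_budget = sp.symbols('f n p_core p_budget', positive=True)

# Two mutually dependent relationships: parallel speedup over n cores and the
# power those cores draw against a fixed budget.
speedup = 1 / ((1 - f) + f / n)
power = n * p_core

# Compose: solve the power constraint for n and substitute into the speedup
# model, yielding a single closed-form expression.
n_max = sp.solve(sp.Eq(power, p_budget), n)[0]
speedup_at_budget = sp.simplify(speedup.subs(n, n_max))

# Build the evaluation function once (akin to hoisting/memoizing the symbolic
# work) and reuse it across a design sweep.
eval_speedup = sp.lambdify((f, p_core, p_budget), speedup_at_budget, 'math')

for frac in (0.5, 0.9, 0.99):
    assert 0 < frac <= 1, "parallel fraction outside the validated domain"
    print(frac, eval_speedup(frac, 2.0, 64.0))
```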
Award ID(s):
1730449 1740352
PAR ID:
10085370
Journal Name:
2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA)
Page Range / eLocation ID:
152 to 165
Sponsoring Org:
National Science Foundation
More Like this
  1. Emerging embedded systems, such as autonomous robots/vehicles, demand a new system-on-a-chip (SoC) that is ultra-low power (mW or even sub-mW level) but highly robust. Such an SoC typically integrates heterogeneous building blocks for supporting a range of features, each ideally operating in an independent voltage and frequency (V/F) domain [1]. In such an architecture, a network-on-chip (NoC) plays a key role in enabling high-speed and energy-efficient networking. However, it is increasingly challenging to meet a robustness target since each V/F domain uses a significantly different voltage, e.g., from nominal 1V to near-threshold voltage (NTV), and clock frequency, e.g., from hundreds of MHz to sub-MHz. Furthermore, any two clocks may have uncertain and time-varying phase and frequency relationships. These properties significantly worsen robustness, particularly metastability, in an NoC.
  2. When learning vision-language models (VLMs) for the fashion domain, most existing works design new architectures from vanilla BERT with additional objectives, or perform dense multi-task learning with fashion-specific tasks. Though progress has been made, their architectures or objectives are often intricate and their extendibility is limited. By contrast, with a simple architecture (comprising only two unimodal encoders) and just the contrastive objective, popular pre-trained VL models (e.g., CLIP) achieve superior performance in general domains and are easily extended to downstream tasks. However, inheriting such benefits of CLIP in the fashion domain is non-trivial in the presence of the notable domain gap. Empirically, we find that directly finetuning on fashion data leads CLIP to frequently ignore minor yet important details such as logos and composition, which are critical in fashion tasks such as retrieval and captioning. In this work, to maintain CLIP's simple architecture and objective while explicitly attending to fashion details, we propose E2: Easy Regional Contrastive Learning of Expressive Fashion Representations. E2 introduces only a few selection tokens and fusion blocks (just 1.9% additional parameters in total) with only contrastive losses. Despite being lightweight, in our primary focus, cross-modal retrieval, E2 notably outperforms existing fashion VLMs with various fashion-specific objectives. Moreover, thanks to CLIP's widespread use in downstream tasks in general domains (e.g., zero-shot composed image retrieval and image captioning), our model can easily extend these models from the general domain to the fashion domain with notable improvement. To conduct a comprehensive evaluation, we further collect data from Amazon Reviews to build a new dataset for cross-modal retrieval in the fashion domain.
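Since E2 keeps CLIP's two-encoder architecture and purely contrastive objective, a minimal sketch of the symmetric image-text contrastive loss may help; the PyTorch function below is a generic CLIP-style loss with random tensors standing in for encoder outputs, not the paper's code, and it omits E2's selection tokens and fusion blocks.

```python
# Generic CLIP-style symmetric contrastive loss (illustrative sketch only).
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) outputs of the two unimodal encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched image-text pairs lie on the diagonal; apply cross-entropy both ways.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Random embeddings standing in for encoder outputs.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```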
  3. Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine processors (AIE) optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPs of performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: how can we design accelerators that fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify that the biggest system throughput bottleneck results from the mismatch between the massive computation resources of one monolithic accelerator and the many small MM layers in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently on different layers within one application. CHARM includes analytical models that guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate system design, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework for four different deep learning applications, including BERT, ViT, NCF, and MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs inference throughput for BERT, ViT, NCF, and MLP, respectively, which represent 5.40x, 32.51x, 1.00x, and 1.00x throughput gains compared to one monolithic accelerator.
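The core idea, analytical models steering a search over accelerator partitions and layer assignments, can be sketched with a toy roofline-style cost model and an exhaustive search. The constants, layer sizes, and cost model below are illustrative assumptions, not CHARM's actual models.

```python
# Toy analytical model plus exhaustive search, in the spirit of CHARM's design
# space exploration (all constants and the cost model are assumptions).
from itertools import product

PEAK_FLOPS_PER_CORE = 16e9  # assumed fp32 throughput per AI Engine core
BANDWIDTH = 100e9           # assumed off-chip bytes/s available to each accelerator
TOTAL_CORES = 400

def layer_time(flops, bytes_moved, cores):
    """Roofline-style estimate: a layer is either compute- or bandwidth-bound."""
    return max(flops / (cores * PEAK_FLOPS_PER_CORE), bytes_moved / BANDWIDTH)

# (flops, bytes moved) per MM layer: one large layer and two small ones.
layers = [(2e12, 4e9), (5e9, 2e8), (5e9, 2e8)]

best = None
for split in range(40, TOTAL_CORES, 40):                  # partition cores into two accelerators
    accels = (split, TOTAL_CORES - split)
    for assign in product(range(2), repeat=len(layers)):  # assign each layer to one accelerator
        busy = [0.0, 0.0]
        for (flops, nbytes), a in zip(layers, assign):
            busy[a] += layer_time(flops, nbytes, accels[a])
        makespan = max(busy)                              # the two accelerators run concurrently
        if best is None or makespan < best[0]:
            best = (makespan, accels, assign)

print("best makespan (s):", best[0], "cores:", best[1], "layer assignment:", best[2])
```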
  4. We propose a method to teach multiple large language models (LLMs) to collaborate by interleaving their generations at the token level. We model the decision of which LLM generates the next token as a latent variable. By optimizing the marginal likelihood of a training set under our latent variable model, the base LLM automatically learns when to generate itself and when to call on one of the “assistant” language models to generate, all without direct supervision. Token-level collaboration during decoding allows for a fusion of each model’s expertise in a manner tailored to the specific task at hand. Our collaborative decoding is especially useful in cross-domain settings where a generalist base LLM learns to invoke domain expert models. On instruction-following, domain-specific QA, and reasoning tasks, we show that the performance of the joint system exceeds that of the individual models. Through qualitative analysis of the learned latent decisions, we show that models trained with our method exhibit several interesting collaboration patterns, e.g., template-filling.
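A schematic sketch of the decoding loop may clarify the token-level interleaving: at each step a latent choice decides which model emits the next token. Everything below (the stand-in models and the fixed gating probability) is a hypothetical placeholder; in the paper the choice is learned by maximizing marginal likelihood, not hand-set.

```python
# Schematic token-level collaborative decoding (illustrative; the stand-in
# models and the fixed gating probability are hypothetical placeholders).
import random

def collaborative_decode(base_model, assistant, gate, prompt, max_tokens=10):
    tokens = list(prompt)
    for _ in range(max_tokens):
        # Latent decision: which model generates the next token. The paper
        # learns this from data; here the probability is simply supplied.
        p_base = gate(tokens)
        model = base_model if random.random() < p_base else assistant
        tokens.append(model(tokens))
    return tokens

# Toy stand-ins: the assistant acts as a domain expert filling in a slot.
base_model = lambda ctx: "the"
assistant = lambda ctx: "<EXPERT_TOKEN>"
gate = lambda ctx: 0.8  # mostly defer to the base model

print(" ".join(collaborative_decode(base_model, assistant, gate,
                                    ["Q:", "what", "is", "the", "answer", "?"])))
```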
  5. Modern mobile devices feature ever-increasing computational, sensory, and network resources, which can be shared to execute tasks on behalf of nearby devices. Mobile device clouds (MDCs) facilitate such distributed execution by exposing the collective resources of a set of nearby mobile devices through a unified programming interface. However, the true potential of MDCs remains untapped, as they fail to provide practical programming support for developers to execute distributed functionalities. To address this problem, we introduce a microservice-based Programmable MDC architecture (PMDC), highly customized for the unique features of MDC environments. PMDC conveniently provisions functionalities as microservices, which are deployed on MDC devices on demand. PMDC features a novel domain-specific language that provides abstractions for concisely expressing fine-grained control over the procedures of device capability sharing and microservice execution. Furthermore, PMDC introduces a new system component, the microservice gateway, which reconciles the supply of available device capabilities and the demand for microservice execution to distribute microservices within an MDC. Our evaluation shows that MDCs, expressed by developers through the PMDC declarative programming interface, exhibit low energy consumption and high performance.
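As a rough illustration of the gateway's role, the sketch below greedily matches microservice requirements against advertised device capabilities; the field names and the matching policy are assumptions for illustration, not PMDC's DSL or its actual algorithm.

```python
# Toy supply/demand reconciliation in the spirit of a PMDC-style microservice
# gateway (field names and the greedy policy are illustrative assumptions).
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    cpu_mhz: int
    battery_pct: int

@dataclass
class Microservice:
    name: str
    min_cpu_mhz: int
    min_battery_pct: int

def place(microservices, devices):
    """Greedily place each microservice on the first device that meets its needs."""
    placement = {}
    for ms in microservices:
        for dev in devices:
            if dev.cpu_mhz >= ms.min_cpu_mhz and dev.battery_pct >= ms.min_battery_pct:
                placement[ms.name] = dev.name
                break
        else:
            placement[ms.name] = None  # no capable device currently in the MDC
    return placement

devices = [Device("phoneA", 1800, 35), Device("tabletB", 2400, 80)]
services = [Microservice("object-detect", 2000, 50), Microservice("gps-share", 1000, 20)]
print(place(services, devices))
```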