Search for: All records

Award ID contains: 2217003

« Prev Next »

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

DERCA: DetERministic Cycle-Level Accelerator on Reconfigurable Platforms in DNN-Enabled Real-Time Safety-Critical Systems

https://doi.org/10.1109/RTSS66672.2025.00039

Ji, Shixin; Yang, Zhuoping; Chen, Xingzhen; Zhang, Wei; Zhuang, Jinming; Jones, Alex K; Dong, Zheng; Zhou, Peipei (December 2025, IEEE)

Deep neural network (DNN) models are increasingly deployed in real-time, safety-critical systems such as autonomous vehicles, driving the need for specialized AI accelerators. However, most existing accelerators support only non-preemptive execution or limited preemptive scheduling at the coarse granularity of DNN layers. This restriction leads to frequent priority inversion due to the scarcity of preemption points, resulting in unpredictable execution behavior and, ultimately, system failure. To address these limitations and improve the real-time performance of AI accelerators, we propose DERCA, a novel accelerator architecture that supports fine-grained, intra-layer flexible preemptive scheduling with cycle-level determinism. DERCA incorporates an on-chip Earliest Deadline First (EDF) scheduler to reduce both scheduling latency and variance, along with a customized dataflow design that enables intralayer preemption points (PPs) while minimizing the overhead associated with preemption. Leveraging the limited preemptive task model, we perform a comprehensive predictability analysis of DERCA, enabling formal schedulability analysis and optimized placement of preemption points within the constraints of limited preemptive scheduling. We implement DERCA on the AMD ACAP VCK190 reconfigurable platform. Experimental results show that DERCA outperforms state-of-the-art designs using non-preemptive and layer-wise preemptive dataflows, with less than 5 % overhead in worst-case execution time (WCET) and only 6% additional resource utilization. DERCA is open-sourced on GitHub: https://github.com/arc-research-lab/DERCA
more » « less
Free, publicly-accessible full text available December 2, 2026
AGILE: Lightweight and Efficient Asynchronous GPU-SSD Integration

https://doi.org/10.1145/3712285.3759778

Yang, Zhuoping; Zhuang, Jinming; Chen, Xingzhen; Jones, Alex; Zhou, Peipei (November 2025, ACM)

GPUs are critical for compute-intensive applications, yet emerging workloads such as recommender systems, graph analytics, and data analytics often exceed GPU memory capacity. Existing solutions allow GPUs to use CPU DRAM or SSDs as external memory, and the GPU-centric approach enables GPU threads to directly issue NVMe requests, further avoiding CPU intervention. However, current GPU-centric approaches adopt synchronous I/O, forcing threads to stall during long communication delays. We propose AGILE, a lightweight asynchronous GPU-centric I/O library that eliminates deadlock risks and integrates a flexible HBM-based software cache. AGILE overlaps computation and I/O, improving performance by up to 1.88 × across workloads with diverse computation-to-communication ratios. Compared to BaM on DLRM, AGILE achieves up to 1.75 × speedup through efficient design and overlapping; on graph applications, AGILE reduces software cache overhead by up to 3.12 × and NVMe I/O overhead by up to 2.85 × ; AGILE also lowers per-thread register usage by up to 1.32 ×.
more » « less
Free, publicly-accessible full text available November 15, 2026
ART: Customizing Accelerators for DNN-Enabled Real-Time Safety-Critical Systems

https://doi.org/10.1145/3716368.3735215

Ji, Shixin; Chen, Xingzhen; Zhuang, Jinming; Zhang, Wei; Yang, Zhuoping; Schultz, Sarah; Song, Yukai; Hu, Jingtong; Jones, Alex; Dong, Zheng; et al (June 2025, ACM)

Real-time systems are widely applied in different areas like autonomous vehicles, where safety is the key metric. However, on the FPGA platform, most of the prior accelerator frameworks omit discussing the schedulability in such real-time safety-critical systems, leaving deadlines unmet, which can lead to catastrophic system failures. To address this, we propose the ART framework, a hardware-software co-design approach that transforms baseline accelerators into “real-time guaranteed" accelerators. On the software side, ART performs schedulability analysis and preemption point placement, optimizing task scheduling to meet deadlines and enhance throughput. On the hardware side, ART integrates the Global Earliest Deadline First (GEDF) scheduling algorithm, implements preemption, and conducts source code transformation to transform baseline HLS-based accelerators into designs targeted for real-time systems capable of saving and resuming tasks. ART also includes integration, debugging, and testing tools for full-system implementation. We demonstrate the methodology of ART on two kinds of popular accelerator models and evaluate on AMD Versal VCK190 platform, where ART meets schedulability requirements that baseline accelerators fail. ART is lightweight, utilizing <0.5% resources. With about 100 lines of user input, ART generates about 2.5k lines of accelerator code, making it a push-button solution.
more » « less
Free, publicly-accessible full text available June 29, 2026
PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

Xue, Qiyao; Yin, Xiangyu; Yang, Boyuan; Gao, Wei (June 2025, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.)

Text-to-video (T2V) generation has been recently enabled by transformer-based diffusion models, but current T2V models lack capabilities in adhering to the real-world common knowledge and physical rules, due to their limited understanding of physical realism and deficiency in temporal modeling. Existing solutions are either data-driven or require extra model inputs, but cannot be generalizable to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that expands the current T2V model’s capability of video generation to out-of-distribution domains, by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models’ adherence to real-world physical rules by 2.3x, and achieves 35% improvement compared to T2V prompt enhancers.
more » « less
Free, publicly-accessible full text available June 11, 2026
Tackling Intertwined Data and Device Heterogeneities in Federated Learning with Unlimited Staleness

https://doi.org/10.1609/aaai.v39i20.35405

Wang, Haoming; Gao, Wei (April 2025, Proceedings of the AAAI Conference on Artificial Intelligence)

Federated Learning (FL) can be affected by data and device heterogeneities, caused by clients' different local data distributions and latencies in uploading model updates (i.e., staleness). Traditional schemes consider these heterogeneities as two separate and independent aspects, but this assumption is unrealistic in practical FL scenarios where these heterogeneities are intertwined. In these cases, traditional FL schemes are ineffective, and a better approach is to convert a stale model update into a unstale one. In this paper, we present a new FL framework that ensures the accuracy and computational efficiency of this conversion, hence effectively tackling the intertwined heterogeneities that may cause unlimited staleness in model updates. Our basic idea is to estimate the distributions of clients' local training data from their uploaded stale model updates, and use these estimations to compute unstale client model updates. In this way, our approach does not require any auxiliary dataset nor the clients' local models to be fully trained, and does not incur any additional computation or communication overhead at client devices. We compared our approach with the existing FL strategies on mainstream datasets and models, and showed that our approach can improve the trained model accuracy by up to 25% and reduce the number of required training epochs by up to 35%.
more » « less
Free, publicly-accessible full text available April 11, 2026
ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines

https://doi.org/10.1145/3706628.3708870

Zhuang, Jinming; Xiang, Shaojie; Chen, Hongzheng; Zhang, Niansong; Yang, Zhuoping; Mao, Tony; Zhang, Zhiru; Zhou, Peipei (February 2025, ACM)

As AI continues to grow, modern applications are becoming more data- and compute-intensive, driving the development of specialized AI chips to meet these demands. One example is AMD's AI Engine (AIE), a dedicated hardware system that includes a 2D array of high-frequency very-long instruction words (VLIW) vector processors to provide high computational throughput and reconfigurability. However, AIE's specialized architecture presents tremendous challenges in programming and compiler optimization. Existing AIE programming frameworks lack a clean abstraction to represent multi-level parallelism in AIE; programmers have to figure out the parallelism within a kernel, manually do the partition, and assign sub-tasks to different AIE cores to exploit parallelism. These significantly lower the programming productivity. Furthermore, some AIE architectures include FPGAs to provide extra flexibility, but there is no unified intermediate representation (IR) that captures these architectural differences. As a result, existing compilers can only optimize the AIE portions of the code, overlooking potential FPGA bottlenecks and leading to suboptimal performance. To address these limitations, we introduce ARIES, an agile multi-level intermediate representation (MLIR) based compilation flow for reconfigurable devices with AIEs. ARIES introduces a novel programming model that allows users to map kernels to separate AIE cores, exploiting task- and tile-level parallelism without restructuring code. It also includes a declarative scheduling interface to explore instruction-level parallelism within each core. At the IR level, we propose a unified MLIR-based representation for AIE architectures, both with or without FPGA, facilitating holistic optimization and better portability across AIE device families. For the General Matrix Multiply (GEMM) benchmark, ARIES achieves 4.92 TFLOPS, 15.86 TOPS, and 45.94 TOPS throughput under FP32, INT16, and, INT8 data types on Versal VCK190 respectively. Compared with the state-of-the-art (SOTA) work CHARM for AIE, ARIES improves the throughput by 1.17x, 1.59x, and 1.47x correspondingly. For ResNet residual layer, ARIES achieves up to 22.58x speedup compared with optimized SOTA work Riallto on Ryzen-AI NPU. ARIES is open-sourced on GitHub: https://github.com/arc-research-lab/Aries.
more » « less
Free, publicly-accessible full text available February 27, 2026
Towards Accelerator Customization in Real-time Safety-critical Systems

https://doi.org/10.1145/3706628.3708841

Ji, Shixin; Chen, Xingzhen; Zhang, Wei; Yang, Zhuoping; Zhuang, Jinming; Schultz, Sarah; Song, Yukai; Hu, Jingtong; Jones, Alex K; Dong, Zheng; et al (February 2025, ACM)

Free, publicly-accessible full text available February 27, 2026
MTrain: Enable Efficient CNN Training on Heterogeneous FPGA-Based Edge Servers

https://doi.org/10.1109/TCAD.2025.3541486

Tang, Yue; Jones, Alex K; Xiong, Jinjun; Zhou, Peipei; Hu, Jingtong (January 2025, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

FPGA-based edge servers are used in many applications in smart cities, hospitals, retail, etc. Equipped with heterogeneous FPGA-based accelerator cards, the servers can be implemented with multiple tasks including efficient video prepossessing, machine learning algorithm acceleration, etc. These servers are required to implement inference during the daytime while re-training the model during the night to adapt to new environments, domains, or new users. During the re-training, conventionally, the incoming data are transmitted to the cloud, and then the updated machine learning models will be transferred back to the edge server. Such a process is inefficient and cannot protect users’ privacy, so it is desirable for the models to be directly trained on the edge servers. Deploying convolutional neural network (CNN) training on heterogeneous resource-constrained FPGAs is challenging since it needs to consider both the complex data dependency of the training process and the communication bottleneck among different FPGAs. Previous multi-accelerator training algorithms select optimal scheduling strategies for data parallelism, tensor parallelism, and pipeline parallelism. However, pipeline parallelism cannot deal with batch normalization (BN) which is an essential CNN operator, while purely applying data parallelism and tensor parallelism suffers from resource under-utilization and intensive communication costs. In this work, we propose MTrain, a novel multi-accelerator training scheduling strategy that transfers the training process into a multi-branch workflow, thus independent sub-operations of different branches are executed on different training accelerators in parallelism for better utilization and reduced communication overhead. Experimental results show that we can achieve efficient CNN training on heterogeneous FPGA-based edge servers with 1.07x-2.21x speedup under 15 GB/s peer-to-peer bandwidth compared to the state-of-the-art work.
more » « less
Full Text Available
Perceptual-Centric Image Super-Resolution using Heterogeneous Processors on Mobile Devices

https://doi.org/10.1145/3636534.3690698

Huang, Kai; Yin, Xiangyu; Gu, Tao; Gao, Wei (December 2024, in Proceedings of the 30th ACM International Conference on Mobile Computing and Networking (MobiCom), 2024.)

Image super-resolution (SR) is widely used on mobile devices to enhance user experience. However, neural networks used for SR are computationally expensive, posing challenges for mobile devices with limited computing power. A viable solution is to use heterogeneous processors on mobile devices, especially the specialized hardware AI accelerators, for SR computations, but the reduced arithmetic precision on AI accelerators can lead to degraded perceptual quality in upscaled images. To address this limitation, in this paper we present SR For Your Eyes (FYE-SR), a novel image SR technique that enhances the perceptual quality of upscaled images when using heterogeneous processors for SR computations. FYESR strategically splits the SR model and dispatches different layers to heterogeneous processors, to meet the time constraint of SR computations while minimizing the impact of AI accelerators on image quality. Experiment results show that FYE-SR outperforms the best baselines, improving perceptual image quality by up to 2x, or reducing SR computing latency by up to 5.6x with on-par image quality.
more » « less
Full Text Available
FiberFlex: Real-time FPGA-based Intelligent and Distributed Fiber Sensor System for Pedestrian Recognition

https://doi.org/10.1145/3690389

Li, Yuqi; Zhao, Kehao; Zhao, Jieru; Wang, Qirui; Zhong, Shuda; Lalam, Nageswara; Wright, Ruishu; Zhou, Peipei; Chen, Kevin_P (November 2024, ACM Transactions on Reconfigurable Technology and Systems)

In recent years, security monitoring of public places and critical infrastructure has heavily relied on the widespread use of cameras, raising concerns about personal privacy violations. To balance the need for effective security monitoring with the protection of personal privacy, we explore the potential of optical fiber sensors for this application. This article proposes FiberFlex, an intelligent and distributed fiber sensor system. Ultizing Field Programmable Gate Arrays (FPGA) high-level synthesis (HLS) acceleration, FiberFlex offers real-time pedestrian detection by co-designing the entire pipeline of optical signal acquisition, processing, and recognition networks based on the principles of optical fiber sensing. As a promising alternative to traditional camera-based monitoring systems, FiberFlex achieves pedestrian detection by analyzing the vibration patterns caused by pedestrian footsteps, enabling security monitoring while preserving individual privacy. FiberFlex comprises three modules:First, fiber-optic sensing system: A fiber-optic distributed acoustic sensing (DAS) system is built and used to measure the ground vibration waves generated by people walking.Second, algorithms: We first collect the training data by measuring the ground vibration waves, label the data, and use the data to train the neural network models to perform pedestrian recognition.Third, hardware accelerators: We use HLS tools to design hardware modules on FPGA for data collection and pre-processing and integrate them with the downstream neural network accelerators to perform in-line real-time pedestrian detection. The final detection results are sent back from FPGA to the host CPU. We implement our system FiberFlex with the in-house built DAS system and AMD/Xilinx Kintex7 FPGA KC705 board and verify the whole system using the real-world collected data. We conduct recognition tests on five test subjects of varying ages, heights, and weights in a fixed sensing area. Each subject experienced 20 real-time recognition tests using their daily walking habits, and the subjects were given adequate rest between tests. After 100 tests on five test subjects, the overall real-time recognition accuracy exceeded\(88.0\%\). The whole system uses 55 W of power, 33 W in the optical DAS system and 22 W in the FPGA. Relying on its end-to-end interdisciplinary design, FiberFlex seamlessly combines fiber-optic sensors with FPGA accelerators to enable low-power real-time security monitoring without compromising privacy, making it a valuable addition to the existing security monitoring network. According to FiberFlex, more valuable research can be conducted in the future, such as fall monitoring for the elderly, migration of identification networks between different application scenarios, and improvement of anti-interference performance in more complex environments. In future perception networks, where the “eyes” are not feasible, let’s use fiber optic touch instead.
more » « less

« Prev Next »