Stream processing, which involves real-time computation of data as it is created or received, is vital for various applications, specifically wireless communication. The evolving protocols, the requirement for high-throughput, and the challenges of handling diverse processing patterns make it demanding. Traditional platforms grapple with meeting real-time throughput and latency requirements due to large data volume, sequential and indeterministic data arrival, and variable data rates, leading to inefficiencies in memory access and parallel processing. We present Canalis, a throughput-optimized framework designed to address these challenges, ensuring high-performance while achieving low energy consumption. Canalis is a hardware-software co-designed system. It includes a programmable spatial architecture, Flux Stream Processing Unit (FluxSPU), proposed by this work to enhance data throughput and energy efficiency. FluxSPU is accompanied by a software stack that eases the programming process. We evaluated Canalis with eight distinct benchmarks. When compared to CPU and GPU in mobile SoC to demonstrate the effectiveness of domain specialization, Canalis achieves an average speedup of 13.4
This content will become publicly available on September 23, 2025
Recognizing food types through sensor signals for unseen users remains remarkably challenging, despite extensive recent studies. The efficacy of prior machine learning techniques is dwarfed by giant variations of data collected from multiple participants, partly because users have varied chewing habits and wear sensor devices in various manners. This work treats the problem as an instance of the domain adaptation problem, where each user represents a domain. We develop the first multi-source domain adaptation (MSDA) method for food-typing recognition, which consists of three major components: stratified normalization, a multi-source domain adaptor, and adaptive ensemble learning. New techniques are developed for each component. Using a real-world dataset comprised of 15 participants, we demonstrate that our method achieves
- Award ID(s):
- 2008557
- PAR ID:
- 10559639
- Publisher / Repository:
- ACM
- Date Published:
- Journal Name:
- ACM Transactions on Computing for Healthcare
- ISSN:
- 2637-8051
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
and 6.6\(\times\) , and energy savings of 189.8\(\times\) and 283.9\(\times\) , respectively. In contrast to equivalent ASICs of the benchmarks, the average energy overhead of Canalis is within 2.4\(\times\) , successfully maintaining generalizations without incurring significant overhead.\(\times\) -
FPGAs have been shown to operate reliably within harsh radiation environments by employing single-event upset (SEU) mitigation techniques, such as configuration scrubbing, triple-modular redundancy, error correction coding, and radiation aware implementation techniques. The effectiveness of these techniques, however, is limited when using complex system-level designs that employ complex I/O interfaces with single-point failures. In previous work, a complex SoC system running Linux applied several of these techniques only to obtain an improvement of 14
in mean time to failure (MTTF). A detailed post-radiation fault analysis found that the limitations in reliability were due to the DDR interface, the global clock network, and interconnect. This article applied a number of design-specific SEU mitigation techniques to address the limitations in reliability of this design. These changes include triplicating the global clock, optimizing the placement of the reduction output voters and input flip-flops, and employing a mapping technique called “striping.” The application of these techniques improved MTTF of the mitigated design by a factor of 1.54\(\times\) and thus provides a 22.8X\(\times\) MTTF improvement over the unmitigated design. A post-radiation fault analysis using BFAT was also performed to find the remaining design vulnerabilities.\(\times\) -
Current computing device authentication often presents accessibility barriers for people with
upper extremity impairments (UEI) . In this article, we present a framework calledAccessible image-Association-based Authentication for Computing devices (A3C) , a novel recognition-based graphical authentication framework specifically designed for people with UEI to authenticate to their computing devices. A3C requires users to provide a set of primary images the user knows that are recognizable to them and subsequently associate each primary image with a secondary image. To evaluate the efficacy of the A3C framework, we instantiated the framework by implementing a version of A3C calledA3C-FA , which uses images of faces of people the user knows as the primary image and animal images as the secondary image. We then performed three studies to evaluate A3C-FA: a shoulder-surfing attack study (N ), a close-adversary attack study ( 319\(=\) N ), and a usability study with people with UEI ( 268\(=\) N ). We found that A3C was robust against both shoulder-surfing and close-adversary attacks. We also performed a detailed study to evaluate the accessibility of A3C-FA. Our participants reported that A3C-FA was more usable and more secure than the authentication approaches with which they were familiar. Based on these findings, we suggest four areas of future research to further improve the design of the A3C framework. 14\(=\) -
In this article, we show how a single function,
join , can be used to implement parallelbalanced binary search trees (BSTs ) simply and efficiently. Based onjoin , our approach applies to multiple balanced tree data structures, and a variety of functions for ordered sets and maps. We describe our technique as an algorithmic framework calledjoin-based algorithms . We show that thejoin function fully captures what is needed for rebalancing trees for a variety of tree algorithms, as long as the balancing scheme satisfies certain properties, which we refer to asjoinable trees. We discuss four balancing schemes that are joinable: AVL trees, red-black trees, weight-balanced trees, and treaps. We present a variety of tree algorithms that apply to joinable trees, includinginsert ,delete ,union ,intersection ,difference ,split ,range ,filter , and so on, most of them also parallel. These algorithms are generic across balancing schemes. Many algorithms are optimal in the comparison model, and we provide a general proof to show the efficiency in work for joinable trees. The algorithms are highly parallel, all with polylogarithmic span (parallel dependence). Specifically, the set-set operationsunion ,intersection , anddifference have work and polylogarithmic span for input set sizes\( O(m\log (\frac{n}{m}+1)) \) and\( n \) .\( m\le n \) We implemented and tested our algorithms on the four balancing schemes. In general, all four schemes have quite similar performance, but the weight-balanced tree slightly outperforms the others. They have the same speedup characteristics, getting around 73
speedup on 72 cores (144 hyperthreads). Experimental results also show that our implementation outperforms existing parallel implementations, and our sequential version achieves close or much better performance than the sequential merging algorithm in C++ Standard Template Library (STL) on various input sizes.\( \times \) -
Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic with AI Engine processors optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can provide up to 6.4 TFLOPS performance for 32-bit floating-point (FP32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. We observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises:
How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify the biggest system throughput bottleneck resulting from the mismatch between massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to compose
multiple diverse MM accelerator architectures working concurrently on different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate system designs, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework on four different deep learning applications in FP32, INT16, and INT8 data types, including BERT, ViT, NCF, and MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPS, 1.61 TFLOPS, 1.74 TFLOPS, and 2.94 TFLOPS inference throughput for BERT, ViT, NCF, and MLP in FP32 data type, respectively, which obtain 5.29 , 32.51\(\times\) , 1.00\(\times\) , and 1.00\(\times\) throughput gains compared to one monolithic accelerator. CHARM achieves the maximum throughput of 1.91 TOPS, 1.18 TOPS, 4.06 TOPS, and 5.81 TOPS in the INT16 data type for the four applications. The maximum throughput achieved by CHARM in the INT8 data type is 3.65 TOPS, 1.28 TOPS, 10.19 TOPS, and 21.58 TOPS, respectively. We have open-sourced our tools, including detailed step-by-step guides to reproduce all the results presented in this paper and to enable other users to learn and leverage CHARM framework and tools in their end-to-end systems:\(\times\) .https://github.com/arc-research-lab/CHARM