NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

FIDESlib: fully-fledged open-source FHE library for efficient CKKS on GPUs

Agullo-Domingo, Carlos; Vera-Lopez, Oscar; Guzelhan, Seyda; Daksha, Lohit; El_Jerari, Aymane; Shivdikar, Kaustubh; Agrawal, Rashmi; Kaeli, David; Joshi, Ajay; Abellan_Miguel, Jose_Luis (July 2025, IEEE International Symposium on Performance Analysis of Software and Systems)

Word-wise Fully Homomorphic Encryption (FHE) schemes, such as CKKS, are gaining significant traction due to their ability to provide post-quantum-resistant, privacy preserving approximate computing—an especially desirable feature in the Machine-Learning-as-a-Service (MLaaS) paradigm. In this work, we introduce FIDESlib, the first open-source server-side CKKS GPU library that is fully interoperable with well-established client-side OpenFHE operations. Unlike other existing open-source GPU libraries, FIDESlib provides the first implementation featuring heavily optimized GPU kernels for all CKKS primitives, including bootstrapping. Our library also integrates robust benchmarking and testing, ensuring it remains adaptable to further optimization. Comparing our scheme against Phantom (the previously top open-source CKK library, we show that FIDESlib offers superior performance and scalability. For bootstrapping, FIDESlib achieves no less than 70× speedup over the AVX-optimized OpenFHE implementation. FIDESlib is available on Github.
more » « less
Free, publicly-accessible full text available July 7, 2026
PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM

https://doi.org/10.1109/HPCA61900.2025.00116

Son, Hyojun; Jonatan, Gilbert; Wu, Xiangyu; Cho, Haeyoon; Shivdikar, Kaustubh; Abellán, José L; Joshi, Ajay; Kaeli, David; Kim, John (March 2025, Proceedings)

Processing-in-memory (PIM), where compute is moved closer to memory or data, has been explored to accelerate emerging workloads. Different PIM-based systems have been announced, each offering a unique microarchitectural organization of their compute units, ranging from fixed functional units to programmable general-purpose compute cores near memory. However, one fundamental limitation of PIM is that each compute unit can only access its local memory; access to “remote” memory must occur through the host CPU – potentially limiting application performance scalability. In this work, we first characterize the scalability of real PIM architectures using the UPMEM PIM system. We analyze how the overhead of communicating through the host (instead of providing direct communication between the PIM compute units) can become a bottleneck for collective communications that are commonly used in many workloads. To overcome this inter-PIM bank communication, we propose PIMnet – a PIM interconnection network for PIM banks that provides direct connectivity between compute units and removes the overhead of communicating through the host. PIMnet exploits bandwidth parallelism where communication across the different PIM bank/chips can occur in parallel to maximize communication performance. PIMnet also matches the DRAM packaging hierarchy with a multi-tier network architecture. Unlike traditional interconnection networks, PIMnet is a PIM controlled network where communication is managed by the PIM logic, optimizing collective communications and minimizing the hardware overhead of PIMnet. Our evaluation of PIMnet shows that it provides up to 85× speedup on collective communications and achieves a 11.8× improvement on real applications compared to the baseline PIM.
more » « less
Free, publicly-accessible full text available March 1, 2026
Secure Machine Learning Hardware: Challenges and Progress

Lee, Kyungmi; Ashok, Maitreyi; Maji, Saurav; Agrawal, Rashmi; Joshi, Ajay; Yan, Mengjia; Emer, Joel S; Chandrakasan, Anantha P (February 2025, IEEE circuits and systems magazine)

With the rising adoption of deep neural networks (DNNs) for commercial and high-stakes applications that process sensitive user data and make critical decisions, security concerns are paramount. An adversary can undermine the confidentiality of user input or a DNN model, mislead a DNN to make wrong predictions, or even render a machine learning application unavailable to valid requests. While security vulnerabilities that enable such exploits can exist across multiple levels of the technology stack that supports machine learning applications, the hardware-level vulnerabilities can be particularly problematic. In this article, we provide a comprehensive review of the hardware-level vulnerabilities affecting domain-specific DNN inference accelerators and recent progress in secure hardware design to address these. As domain-specific DNN accelerators have a number of differences compared to general-purpose processors and cryptographic accelerators where the hardware-level vulnerabilities have been thoroughly investigated, there are unique challenges and opportunities for secure machine learning hardware. We first categorize the hardware-level vulnerabilities into three scenarios based on an adversary’s capability: 1) an adversary can only attack the off-chip components, such as the off-chip DRAM and the data bus; 2) an adversary can directly attack the on-chip structures in a DNN accelerator; and 3) an adversary can insert hardware trojans during the manufacturing and design process. For each category, we survey recent studies on attacks that pose practical security challenges to DNN accelerators. Then, we present recent advances in the defense solutions for DNN accelerators, addressing those security challenges with circuit-, architecture-, and algorithm-level techniques.
more » « less
Free, publicly-accessible full text available February 6, 2026
Secure Machine Learning Hardware: Challenges and Progress

Lee, Kyungmi; Ashok, Maitreyi; Maji, Saurav; Agrawal, Rashmi; Joshi, Ajay; Yan, Mengjia; Emer, Joel S; Chandrakasan, Anantha P (February 2025, IEEE circuits and systems magazine)

Free, publicly-accessible full text available February 6, 2026
IOMMU Deferred Invalidation Vulnerability: Exploit and Defense

Rajapaksha, Chathura; Delshadtehrani, Leila; Muri, Richard; Egele, Manuel; Joshi, Ajay (April 2024, Proceedings of the Design, Automation & Test in Europe Conference (DATE))

Full Text Available
Scalability Limitations of Processing-in-Memory using Real System Evaluations

https://doi.org/10.1145/3639046

Jonatan, Gilbert; Cho, Haeyoon; Son, Hyojun; Wu, Xiangyu; Livesay, Neal; Mora, Evelio; Shivdikar, Kaustubh; Abellán, José L; Joshi, Ajay; Kaeli, David; et al (February 2024, Proceedings of the ACM on Measurement and Analysis of Computing Systems)

Processing-in-memory (PIM), where the compute is moved closer to the memory or the data, has been widely explored to accelerate emerging workloads. Recently, different PIM-based systems have been announced by memory vendors to minimize data movement and improve performance as well as energy efficiency. One critical component of PIM is the large amount of compute parallelism provided across many PIM nodes'' or the compute units near the memory. In this work, we provide an extensive evaluation and analysis of real PIM systems based on UPMEM PIM. We show that while there are benefits of PIM, there are also scalability challenges and limitations as the number of PIM nodes increases. In particular, we show how collective communications that are commonly found in many kernels/workloads can be problematic for PIM systems. To evaluate the impact of collective communication in PIM architectures, we provide an in-depth analysis of two workloads on the UPMEM PIM system that utilize representative common collective communication patterns -- AllReduce and All-to-All communication. Specifically, we evaluate 1) embedding tables that are commonly used in recommendation systems that require AllReduce and 2) the Number Theoretic Transform (NTT) kernel which is a critical component of Fully Homomorphic Encryption (FHE) that requires All-to-All communication. We analyze the performance benefits of these workloads and show how they can be efficiently mapped to the PIM architecture through alternative data partitioning. However, since each PIM compute unit can only access its local memory, when communication is necessary between PIM nodes (or remote data is needed), communication between the compute units must be done through the host CPU, thereby severely hampering application performance. To increase the scalability (or applicability) of PIM to future workloads, we make the case for how future PIM architectures need efficient communication or interconnection networks between the PIM nodes that require both hardware and software support.
more » « less
Full Text Available
Processing-in-Memory using Optically-Addressed Phase Change Memory

https://doi.org/10.1109/ISLPED58423.2023.10244409

Yang, Guowei; Demirkiran, Cansu; Kizilates, Zeynep; Ocampo, Carlos; Coskun, Ayse; Joshi, Ajay (August 2023, ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED) 2023)

Today’s Deep Neural Network (DNN) inference systems contain hundreds of billions of parameters, resulting in significant latency and energy overheads during inference due to frequent data transfers between compute and memory units. Processing-in-Memory (PiM) has emerged as a viable solution to tackle this problem by avoiding the expensive data movement. PiM approaches based on electrical devices suffer from throughput and energy efficiency issues. In contrast, Optically-addressed Phase Change Memory (OPCM) operates with light and achieves much higher throughput and energy efficiency compared to its electrical counterparts. This paper introduces a system-level design that takes the OPCM programming overhead into consideration, and identifies that the programming cost dominates the DNN inference on OPCM-based PiM architectures. We explore the design space of this system and identify the most energy-efficient OPCM array size and batch size. We propose a novel thresholding and reordering technique on the weight blocks to further reduce the programming overhead. Combining these optimizations, our approach achieves up to 65.2x higher throughput than existing photonic accelerators for practical DNN workloads.
more » « less
Full Text Available
SIGFuzz: A Framework for Discovering Microarchitectural Timing Side Channels

https://doi.org/10.23919/DATE56975.2023.10136966

Rajapaksha, Chathura; Delshadtehrani, Leila; Egele, Manuel; Joshi, Ajay (April 2023, Design, Automation & Test in Europe Conference & Exhibition (DATE))

Full Text Available
GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption

https://doi.org/10.1145/3613424.3614279

Shivdikar, Kaustubh; Bao, Yuhui; Agrawal, Rashmi; Shen, Michael; Jonatan, Gilbert; Mora, Evelio; Ingare, Alexander; Livesay, Neal; AbellÁN, JosÉ L; Kim, John; et al (October 2023, ACM)

Full Text Available
ProcessorFuzz: Processor Fuzzing with Control and Status Registers Guidance

https://doi.org/10.1109/HOST55118.2023.10133714

Canakci, Sadullah; Rajapaksha, Chathura; Delshadtehrani, Leila; Nataraja, Anoop; Taylor, Michael Bedford; Egele, Manuel; Joshi, Ajay (May 2023, IEEE)

« Prev Next »

Search for: All records