NSF PAR Search | NSF Public Access Repository


AutoRAC: Automated Processing-in-Memory Accelerator Design for Recommender Systems

https://doi.org/10.1145/3716368.3735229

Cheng, Feng; Zhang, Tunhou; Zhang, Junyao; Ku, Jonathan Hao-Cheng; Wang, Yitu; Yang, Xiaoxuan; Li, Hai; Chen, Yiran (June 2025, ACM)

Free, publicly-accessible full text available June 29, 2026
FedProphet: Memory-Efficient Federated Adversarial Training via Theoretic-Robustness and Low-Inconsistency Cascade Learning

Tang, Minxue; Wang, Yitu; Zhang, Jingyang; DiValentin, Louis; Ding, Aolin; Hass, Amin; Chen, Yiran; Li, Hai (May 2025, MLSys)

Free, publicly-accessible full text available May 12, 2026
Improving the Efficiency of In-Memory-Computing Macro with a Hybrid Analog-Digital Computing Mode for Lossless Neural Network Inference

https://doi.org/10.1145/3649329.3658472

Zheng, Qilin; Li, Ziru; Ku, Jonathan; Wang, Yitu; Taylor, Brady; Fan, Deliang; Chen, Yiran (June 2024, ACM)

Full Text Available
NDSEARCH: Accelerating Graph-Traversal-Based Approximate Nearest Neighbor Search through Near Data Processing

https://doi.org/10.1109/ISCA59077.2024.00035

Wang, Yitu; Li, Shiyu; Zheng, Qilin; Song, Linghao; Li, Zongwang; Chang, Andrew; Li, Hai “Helen”; Chen, Yiran (June 2024, IEEE)

Approximate nearest neighbor search (ANNS) is a key retrieval technique for vector databases and many data center applications, such as person re-identification and recommendation systems. It is also fundamental to retrieval-augmented generation (RAG) for large language models (LLMs). Among all ANNS algorithms, graph-traversal-based ANNS achieves the highest recall rate. However, as the dataset grows, the graph may require hundreds of gigabytes of memory, exceeding the main memory capacity of a single workstation node. Although the graph can be partitioned and a solid-state drive (SSD) used as backing storage, the limited SSD I/O bandwidth severely degrades system performance. To address this challenge, we present NDSEARCH, a hardware-software co-designed near-data processing (NDP) solution for ANNS processing. NDSEARCH consists of a novel in-storage computing architecture, namely SearSSD, that supports the ANNS kernels and leverages logic unit (LUN)-level parallelism inside the NAND flash chips. NDSEARCH also includes a processing model that is customized for NDP and cooperates with SearSSD. The processing model enables two-level scheduling to improve data locality and exploit the internal bandwidth in NDSEARCH, and a speculative searching mechanism to further accelerate the ANNS workload. Our results show that NDSEARCH improves throughput by up to 31.7×, 14.6×, 7.4×, and 2.9× over CPU, GPU, a state-of-the-art SmartSSD-only design, and DeepStore, respectively. NDSEARCH also achieves two orders of magnitude higher energy efficiency than CPU and GPU.
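For readers unfamiliar with graph-traversal-based ANNS, the following is a minimal, generic sketch of best-first search over a proximity graph. It is not the NDSEARCH processing model or the SearSSD kernel. The function names, the `ef` candidate-list size, and the toy graph layout are illustrative assumptions. Each lookup of `neighbors[node]` and `vectors[nb]` stands for the kind of irregular access that becomes I/O-bound once the graph no longer fits in main memory.

```python
import heapq

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_graph_search(vectors, neighbors, query, entry, k=1, ef=8):
    """Generic best-first search on a proximity graph (illustrative only).

    vectors:   list of points, indexed by node id
    neighbors: dict mapping node id -> list of adjacent node ids
    ef:        hypothetical bound on the number of results tracked
    """
    visited = {entry}
    d0 = sq_dist(vectors[entry], query)
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    results = [(-d0, entry)]     # max-heap (negated distance): best ef so far
    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop once the closest candidate is worse than the worst kept result.
        if len(results) >= ef and d > -results[0][0]:
            break
        for nb in neighbors[node]:          # irregular adjacency-list access
            if nb in visited:
                continue
            visited.add(nb)
            dn = sq_dist(vectors[nb], query)  # irregular vector fetch
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    ranked = sorted((-d, n) for d, n in results)
    return [n for _, n in ranked][:k]
```

On a five-node chain graph with points at 0, 1, 2, 3, 4 and a query at 3.2, starting from node 0, the search walks along the chain and returns nodes 3 and 4 for `k=2`.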
Full Text Available
NDRec: A Near-Data Processing System for Training Large-Scale Recommendation Models

https://doi.org/10.1109/TC.2024.3365939

Li, Shiyu; Wang, Yitu; Hanson, Edward; Chang, Andrew; Seok_Ki, Yang; Li, Hai; Chen, Yiran (May 2024, IEEE Transactions on Computers)

Full Text Available
A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models

Guo, Cong; Cheng, Feng; Du, Zhixu; Kiessling, James; Ku, Jonathan; Li, Shiyu; Li, Zhixu; Ma, Mingyuan; Molom-Ochir, Tergel; Morris, Benjamin; et al. (February 2025, IEEE Circuits and Systems Magazine)

Free, publicly-accessible full text available February 6, 2026
An Efficient Memory System Design with Specialized Caching Mechanism for Recommendation Inference

https://doi.org/10.1145/3609384

Wang, Yitu; Li, Shiyu; Zheng, Qilin; Chang, Andrew; Li, Hai; Chen, Yiran (October 2023, ACM Transactions on Embedded Computing Systems)

Recommendation systems have been widely embedded into many Internet services. For example, Meta’s deep learning recommendation model (DLRM) shows high predictive accuracy of click-through rate in processing large-scale embedding tables. The SparseLengthSum (SLS) kernel dominates the inference time of the DLRM due to intensive irregular memory accesses to the embedding vectors. Some prior works directly adopt near-data processing (NDP) solutions to obtain higher memory bandwidth to accelerate SLS. However, their inferior memory hierarchy induces a low performance-cost ratio and fails to fully exploit the data locality. Although some software-managed cache policies were proposed to improve the cache hit rate, the incurred cache miss penalty is unacceptable considering the high overheads of executing the corresponding programs and the communication between the host and the accelerator. To address these issues, we propose EMS-i, an efficient memory system design that integrates Solid State Drive (SSD) into the memory hierarchy using Compute Express Link (CXL) for recommendation system inference. We specialize the caching mechanism according to the characteristics of various DLRM workloads and propose a novel prefetching mechanism to further improve the performance. In addition, we carefully design the inference kernel and develop a customized mapping scheme for the SLS operation, considering the multi-level parallelism in SLS and the data locality within a batch of queries. Compared to the state-of-the-art NDP solutions, EMS-i achieves up to 10.9× speedup over RecSSD and performance comparable to RecNMP with 72% energy savings. EMS-i also saves up to 8.7× and 6.6× memory cost w.r.t. RecSSD and RecNMP, respectively.
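As background on why the SLS kernel is memory-bound, here is a minimal pure-Python sketch of a sparse-lengths-sum style operation: each query gathers a variable number of embedding rows and sums them. This is a generic illustration, not the EMS-i inference kernel or its mapping scheme. The function name and argument layout (`table`, `indices`, `lengths`) are assumptions chosen for the example.

```python
def sparse_lengths_sum(table, indices, lengths):
    """Sum groups of embedding rows (illustrative SLS-style gather-reduce).

    table:   list of embedding vectors (rows), all of the same dimension
    indices: flat list of row ids for all queries, concatenated
    lengths: number of row ids belonging to each query, in order
    """
    dim = len(table[0])
    out = []
    pos = 0
    for n in lengths:
        acc = [0.0] * dim
        # Each idx is an effectively random row of a very large table,
        # so this loop is dominated by irregular memory accesses.
        for idx in indices[pos:pos + n]:
            row = table[idx]
            for j, v in enumerate(row):
                acc[j] += v
        out.append(acc)
        pos += n
    return out
```

With a three-row table `[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]`, indices `[0, 2, 1]`, and lengths `[2, 1]`, the first query sums rows 0 and 2 and the second query returns row 1. The arithmetic per fetched element is a single addition, which is why bandwidth and locality, rather than compute, determine performance.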
Full Text Available
Accelerating Sparse Attention with a Reconfigurable Non-volatile Processing-In-Memory Architecture

https://doi.org/10.1109/DAC56929.2023.10247908

Zheng, Qilin; Li, Shiyu; Wang, Yitu; Li, Ziru; Chen, Yiran; Li, Hai Helen (July 2023, IEEE)

Full Text Available
Exploring Bit-Slice Sparsity in Deep Neural Networks for Efficient ReRAM-Based Deployment

Zhang, Jingyang; Yang, Huanrui; Chen, Fan; Wang, Yitu; Li, Hai (December 2019, The 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing co-located with NeurIPS 2019)

Emerging resistive random-access memory (ReRAM) has recently been intensively investigated to accelerate the processing of deep neural networks (DNNs). Due to the in-situ computation capability, analog ReRAM crossbars yield significant throughput improvement and energy reduction compared to traditional digital methods. However, the power-hungry analog-to-digital converters (ADCs) prevent the practical deployment of ReRAM-based DNN accelerators on end devices with limited chip area and power budget. We observe that due to the limited bit density of ReRAM cells, DNN weights are bit-sliced and correspondingly stored on multiple ReRAM bitlines. The current accumulated on the bitlines, which results from the stored weights, directly dictates the overhead of ADCs. As such, bitwise weight sparsity, rather than the sparsity of the full weight, is desirable for efficient ReRAM deployment. In this work, we propose bit-slice ℓ1, the first algorithm to induce bit-slice sparsity during the training of dynamic fixed-point DNNs. Experimental results show that our approach achieves 2× sparsity improvement compared to previous algorithms. The resulting sparsity allows the ADC resolution to be reduced to 1-bit for the most significant bit-slice and down to 3-bit for the other bits, which significantly speeds up processing and reduces power and area overhead.
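To make the distinction between full-weight sparsity and bit-slice sparsity concrete, the sketch below splits unsigned fixed-point weights into slices and measures the fraction of zero slices. It is only an illustration of the concept, not the bit-slice ℓ1 training algorithm. The 8-bit weight width, the 2-bit slice width, and the function names are assumptions made for the example.

```python
def bit_slices(w, total_bits=8, slice_bits=2):
    """Split a non-negative integer weight into slices, most significant first.

    total_bits and slice_bits are illustrative assumptions; total_bits must
    be a multiple of slice_bits.
    """
    mask = (1 << slice_bits) - 1
    shifts = range(total_bits - slice_bits, -1, -slice_bits)
    return [(w >> s) & mask for s in shifts]

def slice_sparsity(weights, total_bits=8, slice_bits=2):
    # Fraction of slices that are zero across all weights.
    slices = [s for w in weights for s in bit_slices(w, total_bits, slice_bits)]
    return sum(1 for s in slices if s == 0) / len(slices)
```

For instance, the weights 64 (`01 00 00 00`) and 85 (`01 01 01 01`) are both nonzero, so full-weight sparsity is 0 for this pair, yet 64 has three zero slices while 85 has none, giving a slice sparsity of 3/8. Slices that are zero contribute no bitline current, which is the property the abstract ties to ADC overhead.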
Full Text Available
ReBoc: Accelerating Block-Circulant Neural Networks in ReRAM

https://doi.org/10.23919/DATE48585.2020.9116422

Wang, Yitu; Chen, Fan; Song, Linghao; Richard Shi, C. -J.; Li, Hai Helen; Chen, Yiran (March 2020, 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE))

Deep neural networks (DNNs) have emerged as a key component in various applications. However, the ever-growing DNN size hinders efficient processing on hardware. To tackle this problem, on the algorithmic side, compressed DNN models are explored, of which block-circulant DNN models are memory-efficient and hardware-friendly; on the hardware side, resistive random-access memory (ReRAM) based accelerators are promising for in-situ processing of DNNs. In this work, we design an accelerator named ReBoc for accelerating block-circulant DNNs in ReRAM to reap the benefits of lightweight models and efficient in-situ processing simultaneously. We propose a novel mapping scheme which utilizes Horizontal Weight Slicing and Intra-Crossbar Weight Duplication to map block-circulant DNN models onto ReRAM crossbars with significantly improved crossbar utilization. Moreover, two specific techniques, namely Input Slice Reusing and Input Tile Sharing, are introduced to take advantage of the circulant calculation feature in block-circulant DNNs to reduce data access and buffer size. In ReBoc, a DNN model is executed within an intra-layer processing pipeline and achieves 96× and 8.86× power efficiency improvement compared to the state-of-the-art FPGA and ASIC accelerators for block-circulant neural networks, respectively. Compared to ReRAM-based DNN accelerators, ReBoc achieves on average 4.1× speedup and 2.6× energy reduction.
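As background on why block-circulant models are memory-efficient, the sketch below computes a matrix-vector product in which every n×n block of the weight matrix is circulant and is therefore stored as a single length-n vector instead of n² values. This is a plain reference computation, not ReBoc's crossbar mapping or any of its slicing and sharing techniques. The function names and the convention that each block is defined by its first column are assumptions for the example.

```python
def circulant_matvec(c, x):
    """Multiply the circulant matrix with first column c by vector x.

    Entry (i, j) of the matrix is c[(i - j) mod n], so only n values are
    stored for an n x n block (first-column convention assumed here).
    """
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]

def block_circulant_matvec(blocks, x, n):
    """blocks[p][q] is the defining vector of the block at block-row p,
    block-column q; x has length n * (number of block columns)."""
    y = []
    for row in blocks:
        acc = [0.0] * n
        for q, c in enumerate(row):
            part = circulant_matvec(c, x[q * n:(q + 1) * n])
            acc = [a + b for a, b in zip(acc, part)]
        y.extend(acc)
    return y
```

With `c = [1, 2, 3]`, multiplying by `[1, 0, 0]` returns the first column `[1, 2, 3]`, and multiplying by `[0, 1, 0]` returns its rotation `[3, 1, 2]`. Every row of a block reuses the same stored values in shifted order, which is the structure that input-reuse schemes can exploit.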
Full Text Available