Search for: All records

Creators/Authors contains: "Hwu, Wen-mei"

Total Resources: 9

Resource Type
Conference Paper: 7
Conference Proceeding: 0
Dataset: 0
Journal Article: 2
Workshop Report: 0

Availability
Full Text / Resource Available: 9
Citation Only: 0

An efficient GPU implementation and scaling for higher-order 3D stencils

https://doi.org/10.1016/j.ins.2021.11.042

Anjum, Omer; Almasri, Mohammad; de Gonzalo, Simon Garcia; Hwu, Wen-mei (March 2022, Information Sciences)

Full Text Available

Open Relation Modeling: Learning to Define Relations between Entities

https://doi.org/10.18653/v1/2022.findings-acl.26

Huang, Jie; Chang, Kevin; Xiong, Jinjun; Hwu, Wen-mei (January 2022, Findings of the Association for Computational Linguistics: ACL 2022)

Full Text Available

Exploring HW/SW Co-Design for Video Analysis on CPU-FPGA Heterogeneous Systems

https://doi.org/10.1109/TCAD.2021.3093398

Zhang, Xiaofan; Ma, Yuan; Xiong, Jinjun; Hwu, Wen-mei; Kindratenko, Volodymyr; Chen, Deming (June 2021, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)
Full Text Available

Measuring Fine-Grained Domain Relevance of Terms: A Hierarchical Core-Fringe Approach

https://doi.org/10.18653/v1/2021.acl-long.282

Huang, Jie; Chang, Kevin; Xiong, JinJun; Hwu, Wen-mei (January 2021, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing)

Full Text Available

FReaC Cache: Folded-logic Reconfigurable Computing in the Last Level Cache

https://doi.org/10.1109/MICRO50266.2020.00021

Dhar, Ashutosh; Wang, Xiaohao; Franke, Hubertus; Xiong, Jinjun; Huang, Jian; Hwu, Wen-mei; Kim, Nam Sung; Chen, Deming (October 2020, IEEE/ACM International Symposium on Microarchitecture (MICRO))
Full Text Available

Exploring Semantic Capacity of Terms

https://doi.org/10.18653/v1/2020.emnlp-main.684

Huang, Jie; Wang, Zilong; Chang, Kevin; Hwu, Wen-mei; Xiong, JinJun (January 2020, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP))

Full Text Available

Node-Aware Stencil Communication for Heterogeneous Supercomputers

https://doi.org/10.1109/IPDPSW50202.2020.00136

Pearson, Carl; Hidayetoglu, Mert; Almasri, Mohammad; Anjum, Omer; Chung, I-Hsin; Xiong, Jinjun; Hwu, Wen-Mei W. (May 2020, 2020 IEEE International Parallel and Distributed Processing Symposium Workshops)

High-performance distributed computing systems increasingly feature nodes that have multiple CPU sockets and multiple GPUs. The communication bandwidth between these components is non-uniform. Furthermore, these systems can expose different communication capabilities between these components. For communication-heavy applications, using these capabilities optimally is both challenging and essential for performance. Bespoke codes with optimized communication may be non-portable across run-time/software/hardware configurations, and existing stencil frameworks neglect optimized communication. This work presents node-aware approaches for automatic data placement and communication implementation for 3D stencil codes on multi-GPU nodes with non-homogeneous communication performance and capabilities. Benchmarking results on the Summit system show that placement choices can yield a 20% improvement in single-node exchange, and that communication specialization can yield a further 6x improvement in single-node exchange time and a 16% improvement at 1536 GPUs.
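As a rough illustration of why placement matters on a node with non-uniform links, the sketch below brute-forces the assignment of stencil subdomains to GPUs that minimizes a simple exchange-cost estimate (bytes exchanged divided by link bandwidth). This is not the method from the paper: the cost model, the function names `exchange_cost` and `best_placement`, and the example matrices are all assumptions made for illustration.

```python
from itertools import permutations


def exchange_cost(placement, traffic, bandwidth):
    """Estimated exchange time: sum of traffic / bandwidth over subdomain pairs.

    placement[i] is the (hypothetical) GPU index hosting subdomain i.
    """
    n = len(placement)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and traffic[i][j] > 0:
                total += traffic[i][j] / bandwidth[placement[i]][placement[j]]
    return total


def best_placement(traffic, bandwidth):
    """Exhaustively search all subdomain-to-GPU assignments (small nodes only)."""
    n = len(traffic)
    return min(permutations(range(n)),
               key=lambda p: exchange_cost(p, traffic, bandwidth))


# Illustrative inputs (assumed): subdomain pairs (0,1) and (2,3) exchange the
# most data; GPU pairs (0,2) and (1,3) share the fastest links.
traffic = [[0, 100, 1, 1], [100, 0, 1, 1], [1, 1, 0, 100], [1, 1, 100, 0]]
bandwidth = [[0, 10, 50, 10], [10, 0, 10, 50], [50, 10, 0, 10], [10, 50, 10, 0]]
placement = best_placement(traffic, bandwidth)
```

Exhaustive search is only practical for the handful of GPUs in one node; it is used here to keep the sketch short, not because it scales.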
Full Text Available

DeepStore: In-Storage Acceleration for Intelligent Queries

https://doi.org/10.1145/3352460.3358320

Mailthody, Vikram Sharma; Qureshi, Zaid; Liang, Weixin; Feng, Ziyan; de Gonzalo, Simon Garcia; Li, Youjie; Franke, Hubertus; Xiong, Jinjun; Huang, Jian; Hwu, Wen-mei (October 2019, Proceedings of the 52nd IEEE/ACM International Symposium on Microarchitecture (MICRO'19))

Recent advancements in deep learning techniques facilitate intelligent-query support in diverse applications, such as content-based image retrieval and audio texturing. Unlike conventional key-based queries, these intelligent queries lack efficient indexing and require complex compute operations for feature matching. To achieve high-performance intelligent querying against massive datasets, modern computing systems employ GPUs in conjunction with solid-state drives (SSDs) for fast data access and parallel data processing. However, our characterization of various intelligent-query workloads developed with deep neural networks (DNNs) shows that storage I/O bandwidth is still the major bottleneck, contributing 56% to 90% of the query execution time. To this end, we present DeepStore, an in-storage accelerator architecture for intelligent queries. It consists of (1) energy-efficient in-storage accelerators designed specifically for supporting DNN-based intelligent queries under the resource constraints of modern SSD controllers; (2) a similarity-based in-storage query cache that exploits the temporal locality of user queries for further performance improvement; and (3) a lightweight in-storage runtime system working as the query engine, which provides a simple software abstraction to support different types of intelligent queries. DeepStore exploits SSD parallelism with design space exploration to maximize the energy efficiency of the in-storage accelerators. We validate the DeepStore design with an SSD simulator and evaluate it with a variety of vision-, text-, and audio-based intelligent queries. Compared with the state-of-the-art GPU+SSD approach, DeepStore improves query performance by up to 17.7× and energy efficiency by up to 78.6×.
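The idea of a similarity-based query cache, item (2) above, can be sketched in a few lines. This is a minimal host-side illustration rather than DeepStore's in-storage design: the class name `SimilarityQueryCache`, the cosine-similarity test, the LRU eviction policy, and the default capacity and threshold are all assumptions.

```python
import math
from collections import OrderedDict


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SimilarityQueryCache:
    """Hypothetical cache: a query hits if it is similar enough to a cached one."""

    def __init__(self, capacity=4, threshold=0.95):
        self.capacity = capacity
        self.threshold = threshold
        self.entries = OrderedDict()  # feature tuple -> cached result

    def lookup(self, feature):
        best_key, best_sim = None, self.threshold
        for key in self.entries:
            sim = cosine(key, feature)
            if sim >= best_sim:
                best_key, best_sim = key, sim
        if best_key is None:
            return None  # miss: the caller runs the full query
        self.entries.move_to_end(best_key)  # refresh recency on a hit
        return self.entries[best_key]

    def insert(self, feature, result):
        key = tuple(feature)
        self.entries[key] = result
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

A caller would try `lookup` first and fall back to the full feature-matching query on a miss, then `insert` the new result so that similar later queries can reuse it.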
Full Text Available

FlatFlash: Exploiting the Byte-Accessibility of SSDs within a Unified Memory-Storage Hierarchy

https://doi.org/10.1145/3297858.3304061

Abulila, Ahmed; Mailthody, Vikram Sharma; Qureshi, Zaid; Huang, Jian; Kim, Nam Sung; Xiong, Jinjun; Hwu, Wen-mei (April 2019, Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems)

Using flash-based solid state drives (SSDs) as main memory has been proposed as a practical solution for scaling memory capacity for data-intensive applications. However, almost all existing approaches rely on the paging mechanism to move data between SSDs and host DRAM. This inevitably incurs significant performance overhead and extra I/O traffic. Thanks to the byte-addressability supported by the PCIe interconnect and the internal memory in SSD controllers, it is feasible to access SSDs at both byte and block granularity today. Exploiting the benefits of SSD's byte-accessibility in today's memory-storage hierarchy is, however, challenging as it lacks systems support and abstractions for programs. In this paper, we present FlatFlash, an optimized unified memory-storage hierarchy, to efficiently use byte-addressable SSD as part of the main memory. We extend the virtual memory management to provide a unified memory interface so that programs can access data across SSD and DRAM at byte granularity seamlessly. We propose a lightweight, adaptive page promotion mechanism between SSD and DRAM to gain benefits from both the byte-addressable large SSD and fast DRAM concurrently and transparently, while avoiding unnecessary page movements. Furthermore, we propose an abstraction of byte-granular data persistence to exploit the persistent nature of SSDs, upon which we rethink the design primitives of crash consistency of several representative software systems that require data persistence, such as file systems and databases. Our evaluation with a variety of applications demonstrates that, compared to the current unified memory-storage systems, FlatFlash improves the performance for memory-intensive applications by up to 2.3x, reduces the tail latency for latency-critical applications by up to 2.8x, scales the throughput for transactional databases by up to 3.0x, and decreases the metadata persistence overhead for file systems by up to 18.9x. FlatFlash also improves the cost-effectiveness by up to 3.8x compared to DRAM-only systems, while enhancing the SSD lifetime significantly.
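The page promotion idea can be illustrated with a simple access counter: pages are served directly from the SSD until they prove hot, and only then moved to DRAM. The sketch below is a deliberate simplification and not FlatFlash's mechanism; it uses a fixed threshold where the paper describes an adaptive one, and the class name `PromotionTracker` and its return labels are hypothetical.

```python
class PromotionTracker:
    """Hypothetical model: promote an SSD page to DRAM after repeated accesses."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # fixed here; assumed adaptive in a real system
        self.counts = {}            # SSD page -> accesses seen so far
        self.in_dram = set()        # pages already promoted

    def access(self, page):
        if page in self.in_dram:
            return "dram"           # served from DRAM
        self.counts[page] = self.counts.get(page, 0) + 1
        if self.counts[page] >= self.threshold:
            self.in_dram.add(page)  # hot page: move it to DRAM
            del self.counts[page]
            return "promoted"
        return "ssd"                # cold page: byte-granular access on the SSD


tracker = PromotionTracker(threshold=3)
history = [tracker.access(7) for _ in range(4)]
```

Pages touched only once or twice never trigger a move, which is the sense in which a scheme like this avoids unnecessary page movements.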
Full Text Available