TMModel: Modeling Texture Memory and Mobile GPU Performance to Accelerate DNN Computations

Guan, Jiexiong (ORCID:0000000152749169); Hu, Zhenqing (ORCID:000900074892755X); Antonopoulos, Christos D (ORCID:000000026486062X); Bellas, Nikolaos (ORCID:0000000295229136); Lalis, Spyros (ORCID:0000000322323559); Smirni, Evgenia (ORCID:000000018754581X); Zhou, Gang (ORCID:0000000244259837); Agrawal, Gagan (ORCID:0000000226091428); Ren, Bin (ORCID:0000000241165237)

doi:10.1145/3721145.3725774


This content will become publicly available on June 8, 2026


The demand for Deep Neural Network (DNN) execution (including both inference and training) on mobile systems-on-chip (SoCs) has surged, driven by factors such as the need for real-time latency, privacy, and reduced vendor costs. Mainstream mobile GPUs (e.g., Qualcomm Adreno GPUs) usually have a 2.5D L1 texture cache that offers throughput superior to that of on-chip memory. However, the performance characteristics of such a 2.5D cache remain poorly understood, which limits the optimization potential. This paper introduces TMModel, a framework with three components: 1) a set of micro-benchmarks and a novel performance assessment methodology to characterize a poorly documented architecture with 2D memory, 2) a complete analytical performance model configurable for different data access pattern(s), tiling size(s), and other GPU execution parameters for a given operator (and associated size and shape), and 3) a compilation framework incorporating this model and generating optimized code with low overhead. TMModel is validated both on a set of DNN kernels and for training complete models on a mobile GPU, and compared against both popular mobile DNN frameworks and another GPU performance model. Evaluation results demonstrate that TMModel outperforms all baselines, achieving 1.48× to 3.61× speedup on individual kernels and 1.83× to 66.1× speedup for end-to-end on-device training, with only 0.25% to 18.5% of the tuning cost of the baselines.

Award ID(s): 2403088 2230944

PAR ID: 10638756

Author(s) / Creator(s): Guan, Jiexiong; Hu, Zhenqing; Antonopoulos, Christos D; Bellas, Nikolaos; Lalis, Spyros; Smirni, Evgenia; Zhou, Gang; Agrawal, Gagan; Ren, Bin

Publisher / Repository: ACM

Date Published: 2025-06-08

Page Range / eLocation ID: 205 to 220

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Conference Paper:
https://doi.org/10.1145/3721145.3725774
