MTrain: Enable Efficient CNN Training on Heterogeneous FPGA-Based Edge Servers

Tang, Yue; Jones, Alex K; Xiong, Jinjun; Zhou, Peipei; Hu, Jingtong

doi:10.1109/TCAD.2025.3541486

This content will become publicly available on January 1, 2026

FPGA-based edge servers are used in many applications in smart cities, hospitals, retail, and other settings. Equipped with heterogeneous FPGA-based accelerator cards, these servers can run multiple tasks, including efficient video preprocessing and machine learning algorithm acceleration. They are required to perform inference during the daytime and re-train the model at night to adapt to new environments, domains, or users. Conventionally, during re-training the incoming data are transmitted to the cloud, and the updated machine learning models are then transferred back to the edge server. This process is inefficient and cannot protect users’ privacy, so it is desirable to train the models directly on the edge servers. Deploying convolutional neural network (CNN) training on heterogeneous, resource-constrained FPGAs is challenging because it must account for both the complex data dependencies of the training process and the communication bottleneck among different FPGAs. Previous multi-accelerator training algorithms select optimal scheduling strategies for data parallelism, tensor parallelism, and pipeline parallelism. However, pipeline parallelism cannot handle batch normalization (BN), an essential CNN operator, while applying only data parallelism and tensor parallelism suffers from resource under-utilization and intensive communication costs. In this work, we propose MTrain, a novel multi-accelerator training scheduling strategy that transforms the training process into a multi-branch workflow, so that independent sub-operations of different branches are executed on different training accelerators in parallel for better utilization and reduced communication overhead. Experimental results show that we can achieve efficient CNN training on heterogeneous FPGA-based edge servers, with a 1.07x-2.21x speedup under 15 GB/s peer-to-peer bandwidth compared to the state-of-the-art work.

Award ID(s): 2328972; 2324864; 2213701; 2217003; 2536952

PAR ID: 10578865

Author(s) / Creator(s): Tang, Yue; Jones, Alex K; Xiong, Jinjun; Zhou, Peipei; Hu, Jingtong

Publisher / Repository: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Date Published: 2025-01-01

Journal Name: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

ISSN: 0278-0070

Page Range / eLocation ID: 1 to 1

Subject(s) / Keyword(s): edge server; heterogeneous FPGAs; CNN training

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Journal Article:
https://doi.org/10.1109/TCAD.2025.3541486