FPGA-based edge servers are used in many applications in smart cities, hospitals, retail, etc. Equipped with heterogeneous FPGA-based accelerator cards, the servers can be implemented with multiple tasks including efficient video prepossessing, machine learning algorithm acceleration, etc. These servers are required to implement inference during the daytime while re-training the model during the night to adapt to new environments, domains, or new users. During the re-training, conventionally, the incoming data are transmitted to the cloud, and then the updated machine learning models will be transferred back to the edge server. Such a process is inefficient and cannot protect users’ privacy, so it is desirable for the models to be directly trained on the edge servers. Deploying convolutional neural network (CNN) training on heterogeneous resource-constrained FPGAs is challenging since it needs to consider both the complex data dependency of the training process and the communication bottleneck among different FPGAs. Previous multi-accelerator training algorithms select optimal scheduling strategies for data parallelism, tensor parallelism, and pipeline parallelism. However, pipeline parallelism cannot deal with batch normalization (BN) which is an essential CNN operator, while purely applying data parallelism and tensor parallelism suffers from resource under-utilization and intensive communication costs. In this work, we propose MTrain, a novel multi-accelerator training scheduling strategy that transfers the training process into a multi-branch workflow, thus independent sub-operations of different branches are executed on different training accelerators in parallelism for better utilization and reduced communication overhead. Experimental results show that we can achieve efficient CNN training on heterogeneous FPGA-based edge servers with 1.07x-2.21x speedup under 15 GB/s peer-to-peer bandwidth compared to the state-of-the-art work.
more »
« less
Usas: A Sustainable Continuous-Learning Framework for Edge Servers
Edge servers have recently become very popular for performing localized analytics, especially on video, as they reduce data traffic and protect privacy. However, due to their resource constraints, these servers often employ compressed models, which are typically prone to data drift. Consequently, for edge servers to provide cloud-comparable quality, they must also perform continuous learning to mitigate this drift. However, at expected deployment scales, performing continuous training on every edge server is not sustainable due to their aggregate power demands on grid supply and associated sustainability footprints. To address these challenges, we propose Us.as,´ an approach combining algorithmic adjustments, hardware-software co-design, and morphable acceleration hardware to enable the training of workloads on these edge servers to be powered by renewable, but intermittent, solar power that can sustainably scale alongside data sources. Our evaluation of Us.as on a real-world´ traffic dataset indicates that our continuous learning approach simultaneously improves both accuracy and efficiency: Us.as´ offers a 4.96% greater mean accuracy than prior approaches while our morphable accelerator that adapts to solar variance can save up to {234.95kWH, 2.63MWH}/year/edge-server compared to a {DNN accelerator, data center scale GPU}, respectively.
more »
« less
- Award ID(s):
- 1822923
- PAR ID:
- 10552532
- Publisher / Repository:
- IEEE
- Date Published:
- ISBN:
- 979-8-3503-9313-2
- Page Range / eLocation ID:
- 891 to 907
- Subject(s) / Keyword(s):
- Energy harvesting Smart cities Continuous learning
- Format(s):
- Medium: X
- Location:
- Edinburgh, United Kingdom
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
null (Ed.)Edge Computing (EC) has seen a continuous rise in its popularity as it provides a solution to the latency and communication issues associated with edge devices transferring data to remote servers. EC achieves this by bringing the cloud closer to edge devices. Even though EC does an excellent job of solving the latency and communication issues, it does not solve the privacy issues associated with users transferring personal data to the nearby edge server. Federated Learning (FL) is an approach that was introduced to solve the privacy issues associated with data transfers to distant servers. FL attempts to resolve this issue by bringing the code to the data, which goes against the traditional way of sending the data to remote servers. In FL, the data stays on the source device, and a Machine Learning (ML) model used to train the local data is brought to the end device instead. End devices train the ML model using local data and then send the model updates back to the server for aggregation. However, this process of asking random devices to train a model using its local data has potential risks such as a participant poisoning the model using malicious data for training to produce bogus parameters. In this paper, an approach to mitigate data poisoning attacks in a federated learning setting is investigated. The application of the approach is highlighted, and the practical and secure nature of this approach is illustrated as well using numerical results.more » « less
-
Various hardware accelerators have been developed for energy-efficient and real-time inference of neural networks on edge devices. However, most training is done on high-performance GPUs or servers, and the huge memory and computing costs prevent training neural networks on edge devices. This paper proposes a novel tensor-based training framework, which offers orders-of-magnitude memory reduction in the training process. We propose a novel rank-adaptive tensorized neural network model, and design a hardware-friendly low-precision algorithm to train this model. We present an FPGA accelerator to demonstrate the benefits of this training method on edge devices. Our preliminary FPGA implementation achieves 59× speedup and 123× energy reduction compared to embedded CPU, and 292× memory reduction over a standard full-size training.more » « less
-
We introduce the Differentiable Weightless Neural Network (DWN), a model based on interconnected lookup tables. Training of DWNs is enabled by a novel Extended Finite Difference technique for approximate differentiation of binary values. We propose Learnable Mapping, Learnable Reduction, and Spectral Regularization to further improve the accuracy and efficiency of these models. We evaluate DWNs in three edge computing contexts: (1) an FPGA-based hardware accelerator, where they demonstrate superior latency, throughput, energy efficiency, and model area compared to state-of-the-art solutions, (2) a low-power microcontroller, where they achieve preferable accuracy to XGBoost while subject to stringent memory constraints, and (3) ultra-low-cost chips, where they consistently outperform small models in both accuracy and projected hardware area. DWNs also compare favorably against leading approaches for tabular datasets, with higher average rank. Overall, our work positions DWNs as a pioneering solution for edge-compatible high-throughput neural networks.more » « less
-
Artificial Intelligence (AI) is moving towards the edge. Training an AI model for edge computing on a centralized server increases latency, and the privacy of edge users is jeopardized due to private data transfer through a less secure communication channels. Additionally, existing high-power computing systems are battling with memory and data transfer bottlenecks between the processor and memory. Federated Learning (FL) is a collaborative AI learning paradigm for distributed local devices that operates without transferring local data. Local participant devices share the updated network parameters with the central server instead of sending the original data. The central server updates the global AI model and deploys the model to the local clients. As the local data resides only on the edge, these devices need to be protected from cyberattacks. The Federated Intrusion Detection System (FIDS) could be a viable system to protect edge devices as opposed to a centralized protection system. However, on-device training of the model in resource constrained devices may suffer from excessive power drain, in addition to memory and area overhead. In this work we present a memristor based system for AI training on edge devices. Memristor devices are ideal candidates for processing in memory, as their dynamic resistance properties allow them to perform multiply-add operations in parallel in the analog domain with extreme efficiency. Alternatively, existing CMOS-based PIM systems are typically developed for edge inference based on pretrained weights, and are not equipped for on-chip training. We show the effectiveness of the system, where successful learning and recognition is achieved completely within edge devices. The classification accuracy of the memristor system shows negligible loss when compared a software implementation. To the best of our knowledge, this first demonstration of a memristor based federated learning system. We demonstrate the effectiveness of this system as an intrusion detection platform for edge devices, although given the flexibility of the learning algorithm, it could be used to enhance many types of on board leaning and classification applications.more » « less
An official website of the United States government

