Various hardware accelerators have been developed for energy-efficient and real-time inference of neural networks on edge devices. However, most training is done on high-performance GPUs or servers, and the huge memory and computing costs prevent training neural networks on edge devices. This paper proposes a novel tensor-based training framework, which offers orders-of-magnitude memory reduction in the training process. We propose a novel rank-adaptive tensorized neural network model, and design a hardware-friendly low-precision algorithm to train this model. We present an FPGA accelerator to demonstrate the benefits of this training method on edge devices. Our preliminary FPGA implementation achieves a 59× speedup and a 123× energy reduction compared to an embedded CPU, and a 292× memory reduction over standard full-size training.
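To make the mechanism concrete, below is a minimal sketch of the kind of tensor-train (TT) factorization such a framework builds on: a dense weight matrix is replaced by small cores whose total size scales with the TT-rank rather than with the full matrix. The shapes, the rank, and the two-core layout are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

# Illustrative tensor-train (TT) factorization of one dense layer.
# A 1024 -> 256 weight matrix (262,144 parameters) is replaced by two
# small cores whose sizes scale with the TT-rank r, not the full matrix.

i1, i2 = 32, 32   # input dim  1024 = 32 * 32
o1, o2 = 16, 16   # output dim  256 = 16 * 16
r = 4             # TT-rank (a rank-adaptive scheme would tune this)

rng = np.random.default_rng(0)
core1 = 0.01 * rng.standard_normal((i1, o1, r))   # (32, 16, 4)
core2 = 0.01 * rng.standard_normal((r, i2, o2))   # (4, 32, 16)

def tt_forward(x):
    """Apply the factorized layer to a batch x of shape (batch, 1024)."""
    b = x.shape[0]
    x = x.reshape(b, i1, i2)
    t = np.einsum('bij,iar->bjar', x, core1)   # contract input mode 1
    y = np.einsum('bjar,rjc->bac', t, core2)   # contract input mode 2
    return y.reshape(b, o1 * o2)

x = rng.standard_normal((8, i1 * i2))
print(tt_forward(x).shape)                         # (8, 256)
print(i1 * i2 * o1 * o2, core1.size + core2.size)  # 262144 vs. 4096
```

Training updates only the small cores, which is where the orders-of-magnitude memory savings in the training process come from.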
Model Recovery at the Edge Under Resource Constraints for Physical AI
Model Recovery (MR) enables safe, explainable decision-making in mission-critical autonomous systems (MCAS) by learning governing dynamical equations, but its deployment on edge devices is hindered by the iterative nature of neural ordinary differential equations (NODEs), which are inefficient on FPGAs. Memory and energy consumption are the main concerns when running MR on edge devices in real time. We propose MERINDA, a novel FPGA-accelerated MR framework that replaces iterative solvers with a parallelizable neural architecture equivalent to NODEs. MERINDA achieves nearly 11× lower DRAM usage and 2.2× faster runtime compared to mobile GPUs. Experiments reveal an inverse relationship between memory and energy at fixed accuracy, highlighting MERINDA's suitability for resource-constrained, real-time MCAS. The implementation and datasets are publicly available at github.com/ImpactLabASU/ECAI2025.
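The bottleneck MERINDA removes can be sketched in a few lines: a NODE must evaluate its vector field sequentially at every integration step, while a surrogate that maps (x0, t) directly to x(t) evaluates each query in one parallelizable pass. The tiny MLP vector field, the Euler solver, and the one-shot surrogate below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((2, 16)), np.zeros(16)
W2, b2 = rng.standard_normal((16, 2)), np.zeros(2)

def f(x):
    """Learned vector field dx/dt = f(x), here a tiny MLP."""
    return np.tanh(x @ W1 + b1) @ W2 + b2

def node_solve(x0, t_end, steps=100):
    """Iterative Euler integration: 'steps' dependent evaluations,
    a poor fit for FPGA pipelining."""
    x, dt = x0.copy(), t_end / steps
    for _ in range(steps):
        x = x + dt * f(x)
    return x

V1, c1 = rng.standard_normal((3, 16)), np.zeros(16)
V2, c2 = rng.standard_normal((16, 2)), np.zeros(2)

def surrogate(x0, t):
    """One-shot map (x0, t) -> x(t): a single parallelizable pass."""
    z = np.concatenate([x0, [t]])
    return np.tanh(z @ V1 + c1) @ V2 + c2

x0 = np.array([1.0, 0.0])
print(node_solve(x0, 1.0))  # 100 serial steps
print(surrogate(x0, 1.0))   # untrained here; would be fit to match
```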
- Award ID(s): 2436801
- PAR ID: 10650903
- Publisher / Repository: IOS Press
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Anomaly detection in real time using autoencoders implemented on edge devices is exceedingly challenging due to limited hardware, energy, and computational resources. We show that these limitations can be addressed by designing an autoencoder with low-resolution non-volatile memory-based synapses and employing an effective quantized neural network learning algorithm. We further propose nanoscale ferromagnetic racetracks with engineered notches hosting magnetic domain walls (DW) as exemplary non-volatile memory-based autoencoder synapses, where limited-state (5-state) synaptic weights are manipulated by spin-orbit torque (SOT) current pulses to write different magnetoresistance states. The anomaly detection performance of the proposed autoencoder model is evaluated on the NSL-KDD dataset. Training of the autoencoder is performed with awareness of the limited resolution and DW device stochasticity, which yields anomaly detection performance comparable to an autoencoder with floating-point-precision weights. While the limited number of quantized states and the inherent stochastic nature of DW synaptic weights in nanoscale devices are typically known to degrade performance, our hardware-aware training algorithm is shown to leverage these imperfect device characteristics to improve anomaly detection accuracy (90.98%) over that obtained with floating-point synaptic weights, which are extremely memory intensive. Furthermore, our DW-based approach demonstrates a reduction of at least three orders of magnitude in weight updates during training compared to the floating-point approach, implying a significant reduction in operation energy for our method. This work could stimulate the development of extremely energy-efficient non-volatile multi-state-synapse-based processors that can perform real-time training and inference at the edge with unsupervised data.
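A hedged sketch of the quantization step in such hardware-aware training: during the forward pass, weights are snapped to a small set of device states and perturbed with write noise, while full-precision shadow weights receive the gradient updates (a standard straight-through-estimator recipe). The five levels and the Gaussian noise model below are placeholders, not the measured domain-wall device characteristics.

```python
import numpy as np

LEVELS = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # 5 magnetoresistance states
rng = np.random.default_rng(0)

def quantize(w, write_noise=0.05):
    """Snap weights to the nearest device state, then add stochastic
    write noise mimicking imperfect SOT-driven domain-wall motion."""
    idx = np.abs(w[..., None] - LEVELS).argmin(axis=-1)
    return LEVELS[idx] + write_noise * rng.standard_normal(w.shape)

# The forward pass uses quantize(w); the optimizer updates the
# full-precision shadow copy w.
w = rng.standard_normal((4, 4))
print(quantize(w))
```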
-
Computationally efficient, camera-based, real-time human position tracking on low-end edge devices would enable numerous applications, including privacy-preserving video redaction and analysis. Unfortunately, running most deep neural network based models in real time requires expensive hardware, making widespread deployment difficult, particularly on edge devices. Shifting inference to the cloud increases the attack surface, generally requiring that users trust cloud servers, and increases demands on wireless networks in deployment venues. Our goal is to determine how far edge video redaction efficiency can be taken, with a particular interest in enabling, for the first time, low-cost, real-time deployments with inexpensive commodity hardware. We present an efficient solution to the human detection (and redaction) problem based on singular value decomposition (SVD) background removal and describe a novel time- and energy-efficient sensor-fusion algorithm that leverages human position information in real-world coordinates to enable real-time visual human detection and tracking at the edge. These ideas are evaluated using a prototype built from (resource-constrained) commodity hardware representative of commonly used low-cost IoT edge devices. The speed and accuracy of the system are evaluated via a deployment study, and it is compared with the most advanced relevant alternatives. The multi-modal system operates at a frame rate ranging from 20 FPS to 60 FPS, achieves a wIoU0.3 score (see Section 5.4) ranging from 0.71 to 0.79, and successfully performs complete redaction of privacy-sensitive pixels with a success rate of 91%–99% in human head regions and 77%–91% in upper-body regions, depending on the number of individuals present in the field of view. These results demonstrate that it is possible to achieve adequate efficiency to enable real-time redaction on inexpensive, commodity edge hardware.
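The SVD background-removal idea can be illustrated compactly: stack frames into a pixels-by-time matrix, keep the dominant singular component(s) as the static background, and treat the residual as candidate foreground. Frame sizes, rank, and threshold here are illustrative; the deployed system adds the sensor-fusion and tracking stages described above.

```python
import numpy as np

T, H, W = 30, 48, 64                      # 30 frames of 48x64 video
frames = np.random.rand(T, H, W)          # stand-in for a camera buffer

M = frames.reshape(T, H * W).T            # pixels x time matrix
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 1                                     # rank of the static background
background = (U[:, :k] * s[:k]) @ Vt[:k]  # low-rank part: the scene
foreground = np.abs(M - background)       # residual: moving people

mask = foreground.reshape(H, W, T) > 0.25 # candidate redaction mask
print(mask.shape, round(mask.mean(), 3))
```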
-
Non-Intrusive Load Monitoring (NILM) remains a critical issue in both commercial and residential energy management, with a key challenge being the requirement for individual appliance-specific deep learning models. These models often disregard the interconnected nature of loads and usage patterns stemming from diverse user behavior. To address this, we introduce GraphNILM, an innovative end-to-end model that leverages graph neural networks to deliver appliance-level energy usage analysis for an entire home. In its initial phase, GraphNILM employs Gaussian random variables to depict the graph edges; it later enhances prediction accuracy by substituting these edges with observations of appliance interrelationships. The model strips the individual load energy from the aggregated mains energy all at once, reducing memory usage, especially when more than three loads are involved, and thus presents a time- and space-efficient solution for real-world implementation. Comprehensive testing on popular NILM datasets confirms that our model outperforms existing benchmarks in both accuracy and memory consumption, suggesting considerable promise for future deployment on edge devices.
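As a rough illustration of the graph formulation (not GraphNILM's actual architecture), the sketch below initializes edges with Gaussian random weights, as the abstract describes, and splits a mains reading across all appliances in a single message-passing step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_loads, d = 5, 8

A = rng.normal(size=(n_loads, n_loads))   # Gaussian random edges
A = (A + A.T) / 2                         # symmetric adjacency
H = rng.standard_normal((n_loads, d))     # per-load node features
W = rng.standard_normal((d, 1))           # shared readout weights

def disaggregate(aggregate_power):
    """One graph-convolution step, then a softmax split of the mains
    reading across all appliances at once."""
    msg = np.tanh(A @ H)                  # neighbor aggregation
    scores = (msg @ W).ravel()
    shares = np.exp(scores) / np.exp(scores).sum()
    return shares * aggregate_power

print(disaggregate(3200.0))               # watts per appliance
```

In the model's later phase, A would be replaced by observed appliance interrelationships rather than random draws.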
-
"Extreme edge" devices, such as smart sensors, are a uniquely challenging environment for the deployment of machine learning. The tiny energy budgets of these devices lie beyond what is feasible for conventional deep neural networks, particularly in high-throughput scenarios, requiring us to rethink how we approach edge inference. In this work, we propose ULEEN, a model and FPGA-based accelerator architecture based on weightless neural networks (WNNs). WNNs eliminate energy-intensive arithmetic operations, instead using table lookups to perform computation, which makes them theoretically well suited for edge inference. However, WNNs have historically suffered from poor accuracy and excessive memory usage. ULEEN incorporates algorithmic improvements and a novel training strategy inspired by binary neural networks (BNNs) to make significant strides in addressing these issues. We compare ULEEN against BNNs in software and hardware using the four MLPerf Tiny datasets and MNIST. Our FPGA implementations of ULEEN accomplish classification at 4.0–14.3 million inferences per second, improving area-normalized throughput by an average of 3.6× and steady-state energy efficiency by an average of 7.1× compared to the FPGA-based Xilinx FINN BNN inference platform. While ULEEN is not a universally applicable machine learning model, we demonstrate that it can be an excellent choice for certain applications in energy- and latency-critical edge environments.
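The arithmetic-free lookup computation behind WNNs can be shown with a bare-bones RAM-node classifier; ULEEN's BNN-inspired training and architectural improvements are not reproduced here, and all sizes are illustrative.

```python
import numpy as np

N_BITS, TUPLE, N_CLASSES = 64, 8, 10
N_NODES = N_BITS // TUPLE                 # 8 RAM nodes per class
tables = np.zeros((N_CLASSES, N_NODES, 2 ** TUPLE), dtype=np.int32)

def addresses(x_bits):
    """Split a 64-bit binary input into one address per RAM node."""
    chunks = x_bits.reshape(N_NODES, TUPLE)
    return chunks @ (1 << np.arange(TUPLE))

def train_one(x_bits, label):
    tables[label, np.arange(N_NODES), addresses(x_bits)] += 1

def predict(x_bits):
    """Score each class as a sum of table lookups: no multiplies."""
    addr = addresses(x_bits)
    return tables[:, np.arange(N_NODES), addr].sum(axis=1).argmax()

rng = np.random.default_rng(0)
x = rng.integers(0, 2, N_BITS)
train_one(x, label=3)
print(predict(x))  # 3
```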