Distributed machine learning (ML) can bring more computational resources to bear than single-machine learning, thus enabling reductions in training time. Distributed learning partitions models and data over many machines, allowing model and dataset sizes beyond the available compute power and memory of a single machine. In practice, though, distributed ML is challenging when distribution is mandatory rather than chosen by the practitioner. In such scenarios, data could unavoidably be separated among workers due to limited memory capacity per worker or because of data privacy issues. In these settings, existing distributed methods either fail because transfer costs across workers dominate, or do not apply at all. We propose a new approach to distributed fully connected neural network learning, called independent subnet training (IST), to handle these cases. In IST, the original network is decomposed into a set of narrow subnetworks with the same depth. These subnetworks are trained locally, after which parameters are exchanged to produce new subnets and the training cycle repeats. Such a naturally "model parallel" approach limits memory usage by storing only a portion of the network parameters on each device. Additionally, workers are not required to share data (i.e., subnet training is local and independent), and communication volume and frequency are reduced by decomposing the original network into independent subnets. These properties let IST cope with distributed data, slow interconnects, or limited device memory, making it a suitable approach for cases of mandatory distribution. We show experimentally that IST results in training times that are much lower than those of common distributed learning approaches.
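The decomposition described in the abstract can be pictured with a small sketch. The following is an illustration only, not the authors' implementation: it assumes a two-layer fully connected network, a random disjoint split of hidden units across workers, and a placeholder scaling step where local training would happen. All function names (`partition_hidden_units`, `extract_subnet`, `write_back`) are hypothetical.

```python
import numpy as np

def partition_hidden_units(num_hidden, num_workers, rng):
    """Randomly split hidden-unit indices into disjoint groups, one per worker."""
    perm = rng.permutation(num_hidden)
    return np.array_split(perm, num_workers)

def extract_subnet(W1, W2, idx):
    """Copy the slice of a two-layer network that one worker would hold."""
    return W1[:, idx].copy(), W2[idx, :].copy()

def write_back(W1, W2, idx, sub_W1, sub_W2):
    """Write a worker's updated slice back into the full parameters."""
    W1[:, idx] = sub_W1
    W2[idx, :] = sub_W2

rng = np.random.default_rng(0)
d_in, hidden, d_out, workers = 8, 12, 3, 4
W1 = rng.normal(size=(d_in, hidden))   # input -> hidden weights
W2 = rng.normal(size=(hidden, d_out))  # hidden -> output weights
W1_before = W1.copy()

groups = partition_hidden_units(hidden, workers, rng)
subnets = [extract_subnet(W1, W2, idx) for idx in groups]

# Placeholder for local training (an assumption): each worker touches
# only its own slice, so no parameters are shared between workers.
for idx, (sub_W1, sub_W2) in zip(groups, subnets):
    sub_W1 *= 0.9
    sub_W2 *= 0.9
    write_back(W1, W2, idx, sub_W1, sub_W2)

print([s[0].shape for s in subnets])
```

Each worker in this sketch holds only `hidden / workers` hidden units, which is the sense in which the subnets are narrow but keep the same depth. Drawing a fresh partition on each cycle would correspond to the "produce new subnets" step.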
ART: Abstraction Refinement-Guided Training for Provably Correct Neural Networks
Artificial Neural Networks (ANNs) have demonstrated remarkable utility in various challenging machine learning applications. While formally verified properties of their behaviors are highly desired, they have proven notoriously difficult to derive and enforce. Existing approaches typically formulate this problem as a post facto analysis process. In this paper, we present a novel learning framework that ensures such formal guarantees are enforced by construction. Our technique enables training provably correct networks with respect to a broad class of safety properties, a capability that goes well beyond existing approaches, without compromising much accuracy. Our key insight is that we can integrate an optimization-based abstraction refinement loop into the learning process and operate over dynamically constructed partitions of the input space that consider accuracy and safety objectives synergistically. The refinement procedure iteratively splits the input space from which training data is drawn, guided by the efficacy with which such partitions enable safety verification. We have implemented our approach in a tool (ART) and applied it to enforce general safety properties on the unmanned aviator collision avoidance system ACAS Xu dataset and the Collision Detection dataset. Importantly, we empirically demonstrate that realizing safety does not come at the price of much accuracy. Our results show that abstraction refinement provides a meaningful pathway for building machine learning networks that are both accurate and correct.
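To make the idea of splitting the input space "guided by the efficacy with which such partitions enable safety verification" more concrete, here is a minimal sketch. It is not the ART tool or its algorithm: it omits training entirely and assumes plain interval bound propagation as the abstraction, bisection along the widest input dimension as the refinement step, and a toy safety property `output[0] <= threshold`. The names `interval_affine`, `output_bounds`, and `refine` are hypothetical.

```python
import numpy as np

def interval_affine(lo, hi, W, b):
    """Sound bounds on W @ x + b for every x in the box [lo, hi]."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def output_bounds(lo, hi, layers):
    """Propagate a box through affine layers with ReLU between them."""
    for i, (W, b) in enumerate(layers):
        lo, hi = interval_affine(lo, hi, W, b)
        if i < len(layers) - 1:
            lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)
    return lo, hi

def refine(lo, hi, layers, threshold, max_boxes=64):
    """Bisect boxes until each proves output[0] <= threshold or the budget runs out."""
    pending, proved = [(lo, hi)], []
    while pending and len(pending) + len(proved) < max_boxes:
        box_lo, box_hi = pending.pop()
        _, out_hi = output_bounds(box_lo, box_hi, layers)
        if out_hi[0] <= threshold:
            proved.append((box_lo, box_hi))
            continue
        # The bound is too loose on this box: split its widest dimension.
        dim = int(np.argmax(box_hi - box_lo))
        mid = 0.5 * (box_lo[dim] + box_hi[dim])
        left_hi, right_lo = box_hi.copy(), box_lo.copy()
        left_hi[dim], right_lo[dim] = mid, mid
        pending += [(box_lo, left_hi), (right_lo, box_hi)]
    return proved, pending

# Toy network computing |x| = relu(x) + relu(-x) on the input box [-1, 1].
layers = [
    (np.array([[1.0], [-1.0]]), np.zeros(2)),
    (np.array([[1.0, 1.0]]), np.zeros(1)),
]
proved, pending = refine(np.array([-1.0]), np.array([1.0]), layers, threshold=1.5)
print(len(proved), len(pending))
```

On the whole box the interval bound on the output is 2, so the property cannot be proved. After one bisection at 0, each half yields a bound of 1 and both are proved. Any boxes left in `pending` when the budget is exhausted remain unverified.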
- PAR ID: 10220417
- Editor(s): Ivrii, Alexander; Strichman, Ofer
- Date Published:
- Journal Name: Formal Methods in Computer-Aided Design
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Online deep reinforcement learning (deep RL)-based systems are being increasingly deployed in a variety of safety-critical applications. Due to the dynamic nature of the environments they work in, onboard reinforcement learning (RL) hardware is vulnerable to soft errors from radiation, thermal effects and electrical noise that corrupt the results of computations. Existing approaches to online error resilience in machine learning systems have relied on the availability of large training datasets to configure resilience parameters. This is not always feasible for online RL systems. Similarly, other approaches involving specialized hardware or modifications to training algorithms are difficult to implement for onboard RL applications. In contrast, we present a novel error resilience approach for online RL that leverages running statistics of neuron output values collected across the (real-time) RL training process to configure error detection thresholds (called checks) for the deep RL forward pass. Similarly, we formulate checks on the deep RL backward pass using running statistical thresholds on reduced-dimension checksums of online learning weight updates to rapidly detect and correct errors in online deep RL training. In this methodology, statistical concentration bounds leveraging running statistics are used to diagnose neuron outputs or weights as erroneous. The use of running statistics allows the checks to adapt to changes caused by continual online RL training. Erroneous neurons are set to zero (suppressed) in the forward pass. Erroneous weight updates are frozen, allowing non-erroneous weight updates to proceed and allowing online learning without rerunning training episodes. Our approach is compared against the state of the art and validated on several RL algorithms as well as a hardware validation platform.
-
Online reinforcement learning (RL)-based systems are being increasingly deployed in a variety of safety-critical applications ranging from drone control to medical robotics. These systems typically use RL onboard rather than relying on remote operation from high-performance datacenters. Due to the dynamic nature of the environments they work in, onboard RL hardware is vulnerable to soft errors from radiation, thermal effects and electrical noise that corrupt the results of computations. Existing approaches to online error resilience in machine learning systems have relied on the availability of large training datasets to configure resilience parameters, which is not necessarily feasible for online RL systems. Similarly, other approaches involving specialized hardware or modifications to training algorithms are difficult to implement for onboard RL applications. In contrast, we present a novel error resilience approach for online RL that makes use of running statistics collected across the (real-time) RL training process to configure error detection thresholds without the need to access a reference training dataset. In this methodology, statistical concentration bounds leveraging running statistics are used to diagnose neuron outputs as erroneous. These erroneous neurons are then set to zero (suppressed). Our approach is compared against the state of the art and validated on several RL algorithms involving the use of multiple concentration bounds on CPU as well as GPU hardware.
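The forward-pass check described above can be sketched as follows. This is a hedged illustration rather than the authors' method: the abstract does not say which concentration bounds are used, so the sketch assumes a Chebyshev-style rule (flag an output more than `k` running standard deviations from its running mean) with statistics maintained by Welford's online algorithm. The class name `RunningNeuronCheck` and the multiplier `k` are hypothetical.

```python
import numpy as np

class RunningNeuronCheck:
    """Per-neuron running mean/variance (Welford) with a Chebyshev-style threshold."""

    def __init__(self, num_neurons, k=6.0):
        self.count = 0
        self.mean = np.zeros(num_neurons)
        self.m2 = np.zeros(num_neurons)  # running sum of squared deviations
        self.k = k                       # hypothetical threshold multiplier

    def update(self, outputs):
        """Fold one vector of neuron outputs into the running statistics."""
        self.count += 1
        delta = outputs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (outputs - self.mean)

    def suppress(self, outputs):
        """Return a copy with outputs outside mean +/- k * std set to zero."""
        if self.count < 2:
            return outputs.copy()  # not enough history to estimate a variance
        std = np.sqrt(self.m2 / (self.count - 1))
        bad = np.abs(outputs - self.mean) > self.k * std
        return np.where(bad, 0.0, outputs)

rng = np.random.default_rng(0)
check = RunningNeuronCheck(num_neurons=4)
for _ in range(500):
    check.update(rng.normal(loc=1.0, scale=0.1, size=4))  # fault-free outputs

corrupted = np.array([1.0, 1.0, 1e6, 1.0])  # third neuron hit by a simulated soft error
cleaned = check.suppress(corrupted)
print(cleaned)
```

Because the statistics are updated as training proceeds, the threshold tracks the drift in neuron outputs, and no reference training dataset is needed to configure it.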
-
Cooperatively avoiding collision is a critical functionality for robots navigating in dense human crowds, failure of which could lead to either overaggressive or overcautious behavior. A necessary condition for cooperative collision avoidance is to couple the prediction of the agents’ trajectories with the planning of the robot’s trajectory. However, it is unclear whether trajectory-based cooperative collision avoidance captures the correct agent attributes. In this work we migrate from trajectory-based coupling to a formalism that couples agent preference distributions. In particular, we show that preference distributions (probability density functions representing agents’ intentions) can capture higher-order statistics of agent behaviors, such as willingness to cooperate. Thus, coupling in distribution space exploits more information about inter-agent cooperation than coupling in trajectory space. We therefore introduce a general objective for coupled prediction and planning in distribution space, and propose an iterative best response optimization method based on variational analysis with guaranteed sufficient decrease. Based on this analysis, we develop a sampling-based motion planning framework called DistNav that runs in real time on a laptop CPU. We evaluate our approach on challenging scenarios from both real-world datasets and simulation environments, and benchmark against a wide variety of model-based and machine learning-based approaches. The safety and efficiency statistics of our approach outperform all other models. Finally, we find that DistNav is competitive with human safety and efficiency performance.
-
The widespread growth of additive manufacturing (AM), a field with a complex informatic “digital thread”, has helped fuel the creation of design repositories, where multiple users can upload, distribute, and download a variety of candidate designs for a variety of situations. Additionally, advancements in additive manufacturing process development, design frameworks, and simulation are increasing what is possible to fabricate with AM, further growing the richness of such repositories. Machine learning offers new opportunities to combine these design repository components’ rich geometric data with their associated process and performance data to train predictive models capable of automatically assessing build metrics related to AM part manufacturability. Although design repositories that can be used to train these machine learning constructs are expanding, our understanding of what makes a particular design repository useful as a machine learning training dataset is minimal. In this study we use a metamodel to predict the extent to which individual design repositories can train accurate convolutional neural networks. To facilitate the creation and refinement of this metamodel, we constructed a large artificial design repository and subsequently split it into sub-repositories. We then analyzed metadata regarding the size, complexity, and diversity of the sub-repositories for use as independent variables predicting accuracy and the computational effort required for training convolutional neural networks. The networks each predict one of three additive manufacturing build metrics: (1) part mass, (2) support material mass, and (3) build time. Our results suggest that metamodels predicting the convolutional neural network coefficient of determination, as opposed to computational effort, were most accurate. Moreover, the size of a design repository, the average complexity of its constituent designs, and the average and spread of design spatial diversity were the best predictors of convolutional neural network accuracy.

