To improve the resilience of distributed training to worst-case, or Byzantine node failures, several recent approaches have replaced gradient averaging with robust aggregation methods. Such techniques can have high computational costs, often quadratic in the number of compute nodes, and only have limited robustness guarantees. Other methods have instead used redundancy to guarantee robustness, but can only tolerate limited number of Byzantine failures. In this work, we present DETOX, a Byzantine-resilient distributed training framework that combines algorithmic redundancy with robust aggregation. DETOX operates in two steps, a filtering step that uses limited redundancy to significantly reduce the effect of Byzantine nodes, and a hierarchical aggregation step that can be used in tandem with any state-of-the-art robust aggregation method. We show theoretically that this leads to a substantial increase in robustness, and has a per iteration runtime that can be nearly linear in the number of compute nodes. We provide extensive experiments over real distributed setups across a variety of large-scale machine learning tasks, showing that DETOX leads to orders of magnitude accuracy and speedup improvements over many state-of-the-art Byzantine-resilient approaches.
more »
« less
Asymptotic Network Independence and Step-Size for a Distributed Subgradient Method
We consider whether distributed subgradient methods can achieve a linear speedup over a centralized subgradient method. While it might be hoped that distributed network of n nodes that can compute n times more subgradients in parallel compared to a single node might, as a result, be n times faster, existing bounds for distributed optimization methods are often consistent with a slowdown rather than speedup compared to a single node. We show that a distributed subgradient method has this “linear speedup” property when using a class of square-summable-but-not-summable step-sizes which include 1/t^β when β ∈ (1/2,1); for such step-sizes, we show that after a transient period whose size depends on the spectral gap of the network, the method achieves a performance guarantee that does not depend on the network or the number of nodes. We also show that the same method can fail to have this “asymptotic network independence” property under the optimally decaying step-size 1/t^{1/2} and, as a consequence, can fail to provide a linear speedup compared to a single node with 1/t^{1/2} step-size.
more »
« less
- PAR ID:
- 10349575
- Date Published:
- Journal Name:
- Journal of machine learning research
- Volume:
- 23
- Issue:
- 69
- ISSN:
- 1532-4435
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
This paper introduces DISPERSE, a distributed scalable architecture for delivery of content and services that provides resilience against node failure through location-independent storage and replication of content. Current content delivery networks (CDNs) have, at least to some degree, a centralized structure thus susceptible to a single point of failure. DISPERSE addresses this limitation by implementing a fully de-centralized structure. DISPERSE is a two-layer architecture: the first layer (front-end layer) exposes services (e.g., Web, SFTP) to clients; the second layer (back-end layer) provides reliable distributed storage of content and application state. Content in DISPERSE's back-end layer is stored and exchanged as Named Data Network (NDN) content objects. This allows DISPERSE to implement fine-grained, location-independent, fully decentralized content replication mechanisms. We validate the performance of DISPERSE under two node failure scenarios. In the first scenario, content can be stored in any DISPERSE node, and all nodes are equally likely to fail. In this scenario, we use non-linear optimization techniques to determine the optimal number of content copies under availability and latency constraints. In the second scenario, different nodes fail with different probabilities, and content is stored in nodes according to its value, node failure probability, and resource availability. This scenario is addressed as an instance of the minimum cost flow problem. Our results show that DISPERSE reduces the failure of content retrieval by five orders of magnitude compared to common CDN implementations, without significantly increasing content retrieval delay. Further, numerical results show that DISPERSE improves content availability by a factor of 1.3x-2.3x when deploying the minimum cost flow algorithm.more » « less
-
We consider the problem of finding the maximally influential node in random networks where each node influences every other node with constant yet unknown probability. We develop an online algorithm that learns the relative influences of the nodes. It relaxes the assumption in the existing literature that a central observer can monitor the influence spread globally. The proposed algorithm delegates the online updates to the nodes on the network; hence requires only local observations at the nodes. We show that using an explore-then-commit learning strategy, the cumulative regret accumulated by the algorithm over horizon T approaches O(T2/3) for a network with a large number of nodes. Additionally, we show that, for fixed T, the worst case-regret grows linearly with the number n of nodes in the graph. Numerical experiments illustrate this linear dependence for Chung-Lu models. The experiments also demonstrate that ε-greedy learning strategies can achieve similar performance to the explore-then-commit strategy on Chung-Lu models.more » « less
-
Distributed transmit beamforming (DTBF) can allow a swarm of unmanned aerial vehicles (UAVs) to send a common message to a distant target. DTBF among N nodes can provide N 2 times the received power compared to a single node and can reduce interference by confining the signal in a certain direction. However, DTBF requires time, frequency, and phase synchronization. Here, we focus on the issue of phase incoherence at the distributed transmit nodes from two sources—different local oscillators (LOs) and hovering position movement—and how to counteract their impact at the receiver via local decisions, namely, rotation. To investigate how the UAV body and its rotation can affect phase coherency, we conduct controlled in-field experiments where we control the phase offset at two distributed antennas and measure the received signal level at four antenna positions on a drone for various rotation angles. We show that significant improvements can be achieved at the receiver through rotation. We also show that there exists an optimal combination of UAV rotation angle and antenna position on the drone to mitigate the effects of phase incoherence among the distributed transmitters. Finally, we demonstrate an interesting trade-off where, due to the heterogeneous nature of the UAV body, rotation angles that yield maximum beamforming gains might not result in the best average (or minimum) beamformed signal level across all possible phase errors at the distributed transmitters.more » « less
-
Si, Hang; Shepherd, Kendrick M; Zhang, Yongjie Jessica (Ed.)Warping large volume meshes has applications in biomechanics, aerodynamics, image processing, and cardiology. However, warping large, real-world meshes is computationally expensive. Existing parallel implementations of mesh warping algorithms do not take advantage of shared-memory and one-sided communication features available in the MPI-3 standard. We describe our parallelization of the finite element-based mesh warping algorithm for tetrahedral meshes. Our implementation is portable across shared and distributed memory architectures, as it takes advantage of shared memory and one-sided communication to precompute neighbor lists in parallel. We then deform a mesh by solving a Poisson boundary value problem and the resulting linear system, which has multiple right-hand sides, in parallel. Our results demonstrate excellent efficiency and strong scalability on up to 32 cores on a single node. Furthermore, we show a 33.9% increase in speedup with 256 cores distributed uniformly across 64 nodes versus our largest single node speedup while observing sublinear speedups overall.more » « less
An official website of the United States government

