This letter studies how a stochastic gradient algorithm (SG) can be controlled to hide the estimate of the local stationary point from an eavesdropper. Such problems are of significant interest in distributed optimization settings such as federated learning and inventory management. A learner queries and incentivizes a stochastic oracle to obtain noisy gradient measurements and performs SG. The oracle probabilistically returns either a noisy gradient of the function or a non-informative measurement, depending on the oracle state and the incentive. The learner's query and incentive are visible to an eavesdropper who wishes to estimate the stationary point. This letter formulates the problem of the learner performing covert optimization, by dynamically incentivizing the stochastic oracle and obfuscating the eavesdropper, as a finite-horizon Markov decision process (MDP). Using interval-dominance conditions on the cost and transition probability structure, we show that the optimal policy for the MDP has a monotone threshold structure. We propose searching for the optimal stationary policy with the threshold structure using a stochastic approximation algorithm and a multi-armed bandit approach. The effectiveness of our methods is numerically demonstrated on a covert federated learning hate-speech classification task.
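As a rough illustration of the threshold structure described above, the following Python sketch parameterizes the learner's policy by a single threshold on an assumed scalar state (the number of informative gradients obtained so far) and tunes that threshold with a Kiefer-Wolfowitz-style stochastic approximation. The simulator, the cost weights, and constants such as HORIZON and QUERIES_NEEDED are illustrative assumptions, not the letter's actual oracle or eavesdropper model.

```python
import numpy as np

# Minimal sketch of a monotone threshold querying policy and a
# stochastic-approximation search for the threshold. The simulator,
# cost weights, and constants below are illustrative assumptions,
# not the letter's actual oracle/eavesdropper model.

HORIZON = 50          # finite horizon of the MDP (assumed)
QUERIES_NEEDED = 20   # informative gradients needed to learn well (assumed)

def threshold_policy(state, theta):
    """Query to learn while progress is below the threshold, else obfuscate."""
    return "learn" if state < theta else "obfuscate"

def episode_cost(theta, rng):
    """Illustrative cost: remaining learning gap plus eavesdropper exposure."""
    state, exposure = 0, 0
    for _ in range(HORIZON):
        if threshold_policy(state, theta) == "learn":
            exposure += 1                     # learning queries leak information
            if rng.random() < 0.7:            # oracle returns an informative gradient
                state += 1
    return 10.0 * max(QUERIES_NEEDED - state, 0) + exposure

def sa_threshold_search(iters=500, theta0=5.0, seed=0):
    """Kiefer-Wolfowitz style finite-difference stochastic approximation."""
    rng = np.random.default_rng(seed)
    theta = theta0
    for k in range(1, iters + 1):
        step, c = 2.0 / (k + 10), 0.5
        grad = (episode_cost(theta + c, rng) - episode_cost(theta - c, rng)) / (2 * c)
        theta = float(np.clip(theta - step * grad, 0.0, HORIZON))
    return theta

print("estimated threshold:", sa_threshold_search())
```

In this toy version the single scalar threshold is the quantity being searched over, mirroring the low-dimensional search that the abstract attributes to the stochastic approximation and multi-armed bandit procedures.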
Controlling Federated Learning for Covertness
A learner aims to minimize a function f by repeatedly querying a distributed oracle that provides noisy gradient evaluations. At the same time, the learner seeks to hide arg min f from a malicious eavesdropper that observes the learner's queries. This paper considers the problem of covert or learner-private optimization, where the learner has to dynamically choose between learning and obfuscation by exploiting the stochasticity of the oracle's responses. The problem of controlling the stochastic gradient algorithm for covert optimization is modeled as a Markov decision process, and we show that the dynamic programming operator has a supermodular structure, implying that the optimal policy has a monotone threshold structure. A computationally efficient policy gradient algorithm is proposed to search for the optimal querying policy without knowledge of the transition probabilities. As a practical application, our methods are demonstrated on a hate-speech classification task in a federated setting, where an eavesdropper can use the optimal weights to generate toxic content that is more easily misclassified. Numerical results show that when the learner uses the optimal policy, an eavesdropper can achieve a validation accuracy of only 52% with no information and 69% when it has a public dataset with 10% positive samples, compared to 83% when the learner employs a greedy policy.
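To illustrate the kind of model-free search the abstract describes, the sketch below uses a soft (sigmoid) threshold parameterization of the querying policy and a REINFORCE-style policy gradient update, which needs only sampled episode costs and no knowledge of the transition probabilities. The episode simulator, oracle success probability, and cost weights are assumed for illustration and are not taken from the paper.

```python
import numpy as np

# Minimal sketch of a model-free policy gradient search over a soft
# (sigmoid) threshold querying policy. The episode simulator, oracle
# success probability, and cost weights are assumed for illustration.

rng = np.random.default_rng(0)
HORIZON, QUERIES_NEEDED = 50, 20

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_episode(theta):
    """Return the episode cost and the score (d/dtheta of the log-probability)."""
    state, exposure, score = 0, 0, 0.0
    for _ in range(HORIZON):
        p_obf = sigmoid(state - theta)        # probability of obfuscating
        if rng.random() < p_obf:              # obfuscating query
            score += -(1.0 - p_obf)           # d/dtheta log p_obf
        else:                                 # learning query, visible to eavesdropper
            score += p_obf                    # d/dtheta log (1 - p_obf)
            exposure += 1
            if rng.random() < 0.7:            # oracle returns an informative gradient
                state += 1
    cost = 10.0 * max(QUERIES_NEEDED - state, 0) + exposure
    return cost, score

theta, lr, baseline = 10.0, 0.01, 0.0
for t in range(5000):
    cost, score = run_episode(theta)
    baseline = cost if t == 0 else 0.99 * baseline + 0.01 * cost
    theta -= lr * (cost - baseline) * score   # REINFORCE step on the expected cost
print("learned soft threshold:", round(theta, 2))
```

As the sigmoid sharpens around the learned value, this soft parameterization approximates the hard monotone threshold policy that the abstract's supermodularity argument shows is optimal.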
- Award ID(s): 2312198
- NSF-PAR ID: 10518931
- Publisher / Repository: https://www.jmlr.org/tmlr/papers/
- Date Published:
- Journal Name: Transactions on Machine Learning Research
- ISSN: 2835-8856
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- We consider the framework of non-stationary stochastic optimization (Besbes et al., 2015) with squared error losses and noisy gradient feedback, where the dynamic regret of an online learner against a time-varying comparator sequence is studied. Motivated by the theory of non-parametric regression, we introduce a new variational constraint that enforces the comparator sequence to belong to a discrete k-th order Total Variation ball of radius C_n. This variational constraint models comparators that have piecewise polynomial structure, which has many relevant practical applications (Tibshirani, 2014). By establishing connections to the theory of wavelet-based non-parametric regression, we design a polynomial-time algorithm that achieves the nearly optimal dynamic regret of Õ(n^{1/(2k+3)} C_n^{2/(2k+3)}). The proposed policy is adaptive to the unknown radius C_n. Further, we show that the same policy is minimax optimal for several other non-parametric families of interest.
- Over-parametrization is an important technique in training neural networks. In both theory and practice, training a larger network allows the optimization algorithm to avoid bad local optimal solutions. In this paper we study a closely related tensor decomposition problem: given an l-th order tensor in (R^d)^{⊗l} of rank r (where r ≪ d), can variants of gradient descent find a rank-m decomposition where m > r? We show that in a lazy training regime (similar to the NTK regime for neural networks) one needs at least m = Ω(d^{l-1}), while a variant of gradient descent can find an approximate tensor when m = O*(r^{2.5l} log d). Our results show that gradient descent on an over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data.
- We consider the problem of efficiently routing jobs that arrive into a central queue to a system of heterogeneous servers. Unlike in homogeneous systems, a threshold policy, which routes jobs to the slow server(s) when the queue length exceeds a certain threshold, is known to be optimal for the one-fast-one-slow two-server system. But an optimal policy for the multi-server system is unknown and non-trivial to find. While Reinforcement Learning (RL) has been recognized to have great potential for learning policies in such cases, our problem has an exponentially large state space, rendering standard RL inefficient. In this work, we propose ACHQ, an efficient policy gradient-based algorithm with a low-dimensional soft threshold policy parameterization that leverages the underlying queueing structure. We provide stationary-point convergence guarantees for the general case and, despite the low-dimensional parameterization, prove that ACHQ converges to an approximate global optimum for the special case of two servers. Simulations demonstrate an improvement in expected response time of up to ∼30% over the greedy policy that routes to the fastest available server.
- We consider optimal transport-based distributionally robust optimization (DRO) problems with locally strongly convex transport cost functions and affine decision rules. Under conventional convexity assumptions on the underlying loss function, we obtain structural results about the value function, the optimal policy, and the worst-case optimal transport adversarial model. These results expose a rich structure embedded in the DRO problem (e.g., strong convexity even if the non-DRO problem is not strongly convex, and a suitable scaling of the Lagrangian for the DRO constraint), which is crucial for the design of efficient algorithms. As a consequence of these results, one can develop efficient optimization procedures that have the same sample and iteration complexity as a natural non-DRO benchmark algorithm, such as stochastic gradient descent.