Motivated by an attempt to understand the formation and development of (human) language, we introduce a "distributed compression" problem. In our problem a sequence of pairs of players from a set of $K$ players are chosen and tasked to communicate messages drawn from an unknown distribution $Q$. Arguably languages are created and evolve to compress frequently occurring messages, and we focus on this aspect. The only knowledge that players have about the distribution $Q$ comes from previously drawn samples, but these samples differ from player to player. The only common knowledge between the players is restricted to a common prior distribution $P$ and some constant number of bits of information (such as a learning algorithm). Letting $T_\varepsilon$ denote the number of iterations it would take for a typical player to obtain an $\varepsilon$-approximation to $Q$ in total variation distance, we ask whether $T_\varepsilon$ iterations suffice to compress the messages down roughly to their entropy, and give a partial positive answer.
We show that a natural uniform algorithm can compress the communication down to an average cost per message of $O(H(Q) + \log(D(P\|Q) + O(1)))$ in $\tilde{O}(T_\varepsilon)$ iterations while allowing for $O(\varepsilon)$-error, where $D(\cdot\|\cdot)$ denotes the KL-divergence between distributions. For large divergences this compares favorably with the static algorithm that ignores all samples and compresses down to $H(Q) + D(P\|Q)$ bits, while not requiring the $T_\varepsilon \cdot K$ iterations it would take players to develop optimal but separate compressions for each pair of players. Along the way we introduce a "data-structural" view of the task of communicating with a natural language and show that our natural algorithm can also be implemented by an efficient data structure, whose storage is comparable to the storage requirements of $Q$ and whose query complexity is comparable to the lengths of the messages to be compressed. Our results give a plausible mathematical analogy to the mechanisms by which human languages get created and evolve, and in particular highlight the possibility of coordination towards a joint task (agreeing on a language) while engaging in distributed learning.
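To make the baseline quantities concrete, the short sketch below (plain Python; the distributions and function names are our own illustrative choices, not from the paper) compares the average cost of an idealized code built from the common prior $P$ against one built from the true message distribution $Q$: coding against the prior pays roughly the entropy plus a divergence term per message, while coding against $Q$ itself pays roughly $H(Q)$.

    import math

    def entropy_bits(q):
        # Shannon entropy H(Q): ideal average code length when the code matches Q.
        return -sum(p * math.log2(p) for p in q.values() if p > 0)

    def cross_entropy_bits(q, p):
        # Average code length when messages ~ Q are encoded with a code built for P.
        return -sum(q[x] * math.log2(p[x]) for x in q if q[x] > 0)

    # Hypothetical 4-symbol message distributions (illustrative values only).
    Q = {"a": 0.70, "b": 0.15, "c": 0.10, "d": 0.05}   # true, initially unknown, distribution
    P = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}   # shared prior

    h_q = entropy_bits(Q)                 # what players can hope for after learning Q
    static = cross_entropy_bits(Q, P)     # cost of the static, prior-only scheme
    print(f"H(Q)            = {h_q:.3f} bits/message")
    print(f"prior-only cost = {static:.3f} bits/message")
    print(f"divergence gap  = {static - h_q:.3f} bits/message")  # the KL term from the abstract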
Collaborative PAC Learning
We consider a collaborative PAC learning model, in which $k$ players attempt to learn the same underlying concept. We ask how much more information is required to learn an accurate classifier for all players simultaneously. We refer to the ratio between the sample complexity of collaborative PAC learning and its non-collaborative (single-player) counterpart as the overhead. We design learning algorithms with $O(\ln(k))$ and $O(\ln^2(k))$ overhead in the personalized and centralized variants of our model. This gives an exponential improvement upon the naïve algorithm that does not share information among players. We complement our upper bounds with an $\Omega(\ln(k))$ overhead lower bound, showing that our results are tight up to a logarithmic factor.
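As a rough illustration of why sharing samples can beat the naïve per-player approach, here is a small, hedged simulation (our own toy construction for one-dimensional thresholds, not the algorithms from the paper): in each round the still-unsatisfied players pool fresh samples, a single hypothesis is learned on the pool, and every player for whom it is already $\varepsilon$-accurate drops out.

    import random

    random.seed(0)
    TRUE_T = 0.5                                  # hypothetical shared target concept: x >= 0.5
    label = lambda x: int(x >= TRUE_T)

    def learn_threshold(samples):
        # ERM for 1-D thresholds: split between the largest negative and smallest positive point.
        neg = [x for x, y in samples if y == 0]
        pos = [x for x, y in samples if y == 1]
        return ((max(neg) if neg else 0.0) + (min(pos) if pos else 1.0)) / 2.0

    def error(t_hat, draw, n=2000):
        # Monte Carlo estimate of the error of threshold t_hat on one player's distribution.
        return sum(int(x >= t_hat) != label(x) for x in (draw() for _ in range(n))) / n

    # k players with different marginals over [0, 1] (illustrative Beta distributions).
    k = 8
    players = [(lambda a=a: random.betavariate(1 + a, 9 - a)) for a in range(1, 9)]

    eps, m = 0.05, 200                            # accuracy target and per-round budget per player
    active, rounds = set(range(k)), 0
    while active:
        rounds += 1
        pool = [(x, label(x)) for i in active for x in (players[i]() for _ in range(m))]
        h = learn_threshold(pool)                 # one hypothesis learned from the shared pool
        active = {i for i in active if error(h, players[i]) > eps}

    print(f"all {k} players are eps-accurate after {rounds} round(s) of pooled learning")

The naïve baseline would instead run the single-player learner once per player, paying the full single-player sample complexity $k$ times.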
NSF-PAR ID: 10057475
Journal Name: NIPS 2017
Sponsoring Org: National Science Foundation
More Like this


With the dramatic growth of data in both amount and scale, distributed machine learning has become an important tool for carrying out tasks such as prediction and classification over massive data. However, due to practical physical constraints and the potential privacy leakage of data, it is often infeasible to aggregate raw data from all data owners for the learning purpose. To tackle this problem, distributed privacy-preserving learning approaches have been introduced to learn over all distributed data without exposing the real information. However, existing approaches have limitations in complicated distributed systems. On the one hand, traditional privacy-preserving learning approaches rely on heavy cryptographic primitives on training data, so the learning speed is dramatically slowed down by the computation overhead. On the other hand, the complicated system architecture becomes a barrier in practical distributed systems. In this paper, we propose an efficient privacy-preserving machine learning scheme for hierarchical distributed systems. We modify and improve the collaborative learning algorithm. The proposed scheme not only reduces the overhead of the learning process but also provides comprehensive protection for each layer of the hierarchical distributed system. In addition, based on an analysis of collaborative convergence in different learning groups, we propose an asynchronous strategy to further improve the learning efficiency of the hierarchical distributed system. Finally, extensive experiments on real-world data are conducted to evaluate the privacy, efficacy, and efficiency of the proposed schemes.
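The hierarchical aggregation pattern this abstract targets can be sketched in a few lines. The toy below is our own illustration (no cryptography or privacy protection is included, and the model, data, and layer structure are made up): devices train locally, each group's server averages its devices' models, and a top-level server averages the group models; the asynchronous strategy mentioned above would let a group push its aggregate as soon as it stabilizes rather than waiting for all groups.

    import random

    def local_update(w, data, lr=0.1):
        # One pass of gradient steps for scalar least squares: loss = 0.5 * (w*x - y)^2.
        for x, y in data:
            w -= lr * (w * x - y) * x
        return w

    def average(models):
        return sum(models) / len(models)

    random.seed(1)
    TRUE_W = 3.0
    def make_device_data(n=50):
        xs = [random.uniform(-1, 1) for _ in range(n)]
        return [(x, TRUE_W * x + random.gauss(0, 0.1)) for x in xs]

    groups = [[make_device_data() for _ in range(4)] for _ in range(3)]   # 3 groups x 4 devices

    global_w = 0.0
    for _ in range(5):                                  # synchronous rounds, for simplicity
        group_models = []
        for group in groups:
            device_models = [local_update(global_w, data) for data in group]   # layer 1: devices
            group_models.append(average(device_models))                        # layer 2: group server
        global_w = average(group_models)                                        # layer 3: top server
    print(f"estimated w = {global_w:.3f} (true w = {TRUE_W})")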

We consider the problem of accurately recovering a matrix B of size M by M, which represents a probability distribution over M^2 outcomes, given access to an observed matrix of "counts" generated by taking independent samples from the distribution B. How can structural properties of the underlying matrix B be leveraged to yield computationally efficient and information-theoretically optimal reconstruction algorithms? When can accurate reconstruction be accomplished in the sparse data regime? This basic problem lies at the core of a number of questions that are currently being considered by different communities, including building recommendation systems and collaborative filtering in the sparse data regime, community detection in sparse random graphs, learning structured models such as topic models or hidden Markov models, and the efforts from the natural language processing community to compute "word embeddings". Many aspects of this problem, both in terms of learning and property testing/estimation and on both the algorithmic and information-theoretic sides, remain open. Our results apply to the setting where B has a low-rank structure. For this setting, we propose an efficient (and practically viable) algorithm that accurately recovers the underlying M by M matrix using O(M) samples (where we assume the rank is a constant). This linear sample complexity is optimal, up to constant factors, in an extremely strong sense: even testing basic properties of the underlying matrix (such as whether it has rank 1 or 2) requires Ω(M) samples. Additionally, we provide an even stronger lower bound showing that distinguishing whether a sequence of observations was drawn from the uniform distribution over M observations versus being generated by a well-conditioned hidden Markov model with two hidden states requires Ω(M) observations, while our positive results for recovering B immediately imply that O(M) observations suffice to learn such an HMM. This lower bound precludes sublinear-sample hypothesis tests for basic properties, such as identity or uniformity, as well as sublinear-sample estimators for quantities such as the entropy rate of HMMs.
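For intuition about the setting (not the paper's algorithm), the sketch below builds a hypothetical rank-2 probability matrix B, draws a number of samples that is linear in M, and compares the raw empirical count matrix with a naive rank-truncated SVD of it; the point of the paper is that this sparse regime calls for a more careful estimator, but the snippet shows the objects involved.

    import numpy as np

    rng = np.random.default_rng(0)
    M, r = 200, 2

    # Hypothetical rank-r probability matrix B: a mixture of r outer products of distributions.
    U = rng.dirichlet(np.ones(M), size=r)          # r distributions over rows
    V = rng.dirichlet(np.ones(M), size=r)          # r distributions over columns
    mix = rng.dirichlet(np.ones(r))
    B = sum(mix[i] * np.outer(U[i], V[i]) for i in range(r))   # entries of B sum to 1

    # Draw a linear-in-M number of samples and form the empirical (count) matrix.
    n = 20 * M
    counts = rng.multinomial(n, B.ravel()).reshape(M, M)
    B_emp = counts / n

    # Naive baseline: truncate the SVD of the empirical matrix at rank r.
    # (Illustrative only; the paper's algorithm is more refined in this sparse regime.)
    Us, S, Vt = np.linalg.svd(B_emp, full_matrices=False)
    B_rank_r = (Us[:, :r] * S[:r]) @ Vt[:r]

    print(f"L1 error of raw empirical matrix: {np.abs(B_emp - B).sum():.3f}")
    print(f"L1 error of rank-{r} truncation  : {np.abs(B_rank_r - B).sum():.3f}")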

We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change their exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of local switching cost. Our main contribution, Q-learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDPs that achieves sublinear regret and whose local switching cost in K episodes is O(H^3 SA log K), and we provide a lower bound of Ω(HSA) on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting [13], which yields nontrivial results that improve upon prior work in certain aspects.
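To illustrate the limited-adaptivity idea, here is a hedged sketch in the spirit of, but not identical to, the paper's UCB2-based scheme (the toy chain MDP, constants, and bonus form are our own choices): optimistic tabular Q-learning keeps learning every episode, but the acting policy is only refreshed at exponentially spaced episodes, so the number of policy switches over K episodes grows like log K rather than K.

    import math, random

    random.seed(0)
    S, A, H, K = 5, 2, 5, 2000        # states, actions, horizon, episodes (toy sizes)

    def step(s, a):
        # Toy chain MDP: action 1 moves right with prob 0.9, action 0 stays; reward at the right end.
        s2 = min(s + 1, S - 1) if (a == 1 and random.random() < 0.9) else s
        return s2, float(s2 == S - 1)

    Q = [[[float(H)] * A for _ in range(S)] for _ in range(H)]       # optimistic initialization
    counts = [[[0] * A for _ in range(S)] for _ in range(H)]
    policy = [[0] * S for _ in range(H)]                             # frozen acting policy
    switch_points = {2 ** i for i in range(int(math.log2(K)) + 1)}   # doubling schedule
    switches = 0

    for k in range(1, K + 1):
        if k in switch_points:        # a "policy switch": re-derive the greedy policy from Q
            policy = [[max(range(A), key=lambda a: Q[h][s][a]) for s in range(S)] for h in range(H)]
            switches += 1
        s = 0
        for h in range(H):
            a = policy[h][s]
            s2, r = step(s, a)
            counts[h][s][a] += 1
            t = counts[h][s][a]
            alpha = (H + 1) / (H + t)                                  # step size
            bonus = 0.1 * math.sqrt(H ** 3 * math.log(S * A * K) / t)  # UCB-style exploration bonus
            v_next = min(float(H), max(Q[h + 1][s2])) if h + 1 < H else 0.0
            Q[h][s][a] = (1 - alpha) * Q[h][s][a] + alpha * (r + v_next + bonus)
            s = s2

    print(f"{switches} policy switches over {K} episodes (vs. up to {K} for a fully adaptive learner)")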