Title: How Deep Sparse Networks Avoid the Curse of Dimensionality: Efficiently Computable Functions are Compositionally Sparse
The main claim of this perspective is that compositional sparsity of the target function, which corresponds to the task to be learned, is the key principle underlying machine learning. Mhaskar and Poggio (2016) proved that sparsity of compositional target functions (when the functions are on the reals, the constituent functions are also required to be smooth) naturally leads to sparse deep networks for approximation and thus for optimization. This is the case for most CNNs in current use, in which the known sparse graph of the target function is reflected in the sparse connectivity of the network. When the graph of the target function is unknown, I conjecture that transformers are able to implement a flexible version of sparsity, selecting through the self-attention layers which input tokens interact in the MLP layer. Surprisingly, the assumption of compositional sparsity of the target function is not restrictive in practice, since I prove here that for computable functions (if on the reals, with Lipschitz continuous derivatives) compositional sparsity is equivalent to efficient computability, that is, Turing computability in polynomial time.
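To make the central notion concrete: a function of many variables is compositionally sparse when it is built from constituent functions that each depend on only a few arguments. Below is a minimal, hypothetical Python sketch (the constituent function g and the tree structure are illustrative choices, not from the paper) of a compositionally sparse function of 8 variables built from 7 two-variable constituents arranged as a binary tree. A deep network whose connectivity mirrors this tree can approximate the function with complexity that scales with the number of constituents rather than exponentially in the input dimension.

```python
import numpy as np

def g(a, b):
    # A smooth two-variable constituent function (illustrative choice).
    return np.tanh(a + 2.0 * b)

def compositionally_sparse_f(x):
    """Evaluate a function of 8 variables as a binary tree of
    two-variable constituents: 4 + 2 + 1 = 7 applications of g.
    Each constituent sees only 2 of its layer's inputs, never all 8."""
    h1 = [g(x[2 * i], x[2 * i + 1]) for i in range(4)]    # layer 1: 4 nodes
    h2 = [g(h1[2 * i], h1[2 * i + 1]) for i in range(2)]  # layer 2: 2 nodes
    return g(h2[0], h2[1])                                # root node

x = np.random.randn(8)
print(compositionally_sparse_f(x))
```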
Award ID(s):
2134108
PAR ID:
10565465
Author(s) / Creator(s):
Publisher / Repository:
Center for Brains, Minds and Machines (CBMM)
Date Published:
Format(s):
Medium: X
Institution:
Massachusetts Institute of Technology
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we investigate the Rademacher complexity of deep sparse neural networks, where each neuron receives a small number of inputs. We prove generalization bounds for multilayered sparse ReLU neural networks, including convolutional neural networks. These bounds differ from previous ones in that they consider the norms of the convolutional filters instead of the norms of the associated Toeplitz matrices, independently of weight sharing between neurons. As we show theoretically, these bounds may be orders of magnitude better than standard norm-based generalization bounds; empirically, they are almost non-vacuous in estimating generalization in various simple classification problems. Taken together, these results suggest that compositional sparsity of the underlying target function is critical to the success of deep neural networks.
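The distinction drawn in the entry above between filter norms and Toeplitz-matrix norms can be seen numerically. Here is a small numpy sketch (an illustration of the gap, not the paper's bound) that builds the Toeplitz matrix of a 1-D convolution and shows that its Frobenius norm exceeds the filter norm by a factor of sqrt(number of output positions), the inflation introduced by counting shared weights once per neuron.

```python
import numpy as np

def conv_toeplitz(w, n):
    """Toeplitz matrix T of a 1-D 'valid' convolution with filter w,
    so that T @ x == np.convolve(x, w[::-1], mode='valid')."""
    k = len(w)
    m = n - k + 1
    T = np.zeros((m, n))
    for i in range(m):
        T[i, i:i + k] = w  # each row is a shifted copy of the filter
    return T

n, k = 256, 5
w = np.random.randn(k)
T = conv_toeplitz(w, n)

# The filter norm is independent of the input length, while the
# Frobenius norm of the Toeplitz matrix grows with the number of
# output positions: ||T||_F = sqrt(n - k + 1) * ||w||_2.
print(np.linalg.norm(w))                       # filter norm
print(np.linalg.norm(T))                       # Toeplitz norm (much larger)
print(np.sqrt(n - k + 1) * np.linalg.norm(w))  # matches ||T||_F
```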
  2. Classical results in sparse recovery guarantee the exact reconstruction of s-sparse signals under assumptions on the dictionary that are either too strong or NP-hard to check. Moreover, such results may be pessimistic in practice since they are based on a worst-case analysis. In this paper, we consider the sparse recovery of signals defined over a graph, for which the dictionary takes the form of an incidence matrix. We derive necessary and sufficient conditions for sparse recovery, which depend on properties of the cycles of the graph that can be checked in polynomial time. We also derive support-dependent conditions for sparse recovery that depend only on the intersection of the cycles of the graph with the support of the signal. Finally, we exploit sparsity properties on the measurements and the structure of incidence matrices to propose a specialized sub-graph-based recovery algorithm that outperforms the standard l1-minimization approach.
  3. Classical results in sparse representation guarantee the exact recovery of sparse signals under assumptions on the dictionary that are either too strong or NP-hard to check. Moreover, such results may be too pessimistic in practice since they are based on a worst-case analysis. In this paper, we consider the sparse recovery of signals defined over a graph, for which the dictionary takes the form of an incidence matrix. We show that in this case necessary and sufficient conditions can be derived in terms of properties of the cycles of the graph, which can be checked in polynomial time. Our analysis further allows us to derive location-dependent conditions for recovery that depend only on the cycles of the graph that intersect the support of the signal. Finally, we exploit sparsity properties on the measurements to propose a specialized sub-graph-based recovery algorithm that outperforms standard l1-minimization.
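Both of the two preceding entries use l1-minimization over an incidence-matrix dictionary as the baseline. Here is a minimal sketch of that baseline, assuming scipy is available (the graph, the edge orientation convention, and the signal are illustrative): recover a sparse edge signal x from node measurements y = Bx by solving the standard linear-programming form of min ||x||_1 subject to Bx = y.

```python
import numpy as np
from scipy.optimize import linprog

def incidence_matrix(n_nodes, edges):
    """Oriented node-edge incidence matrix B: the column for edge (u, v)
    has -1 at row u and +1 at row v."""
    B = np.zeros((n_nodes, len(edges)))
    for e, (u, v) in enumerate(edges):
        B[u, e] = -1.0
        B[v, e] = 1.0
    return B

# A small graph: a 6-cycle with one chord (6 nodes, 7 edges).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
B = incidence_matrix(6, edges)

# A 2-sparse signal on the edges and its node measurements y = B @ x.
m = len(edges)
x_true = np.zeros(m)
x_true[1], x_true[4] = 3.0, -2.0
y = B @ x_true

# l1-minimization as a linear program: write x = p - q with p, q >= 0,
# then minimize sum(p) + sum(q) subject to B @ (p - q) == y.
res = linprog(c=np.ones(2 * m),
              A_eq=np.hstack([B, -B]), b_eq=y,
              bounds=[(0, None)] * (2 * m))
x_hat = res.x[:m] - res.x[m:]
print(np.round(x_hat, 6))  # should match x_true when recovery succeeds
```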
  4. Neural networks have demonstrated impressive success in various domains, raising the question of what fundamental principles underlie the effectiveness of the best AI systems and quite possibly of human intelligence. This perspective argues that compositional sparsity, or the property that a compositional function has “few” constituent functions, each depending on only a small subset of inputs, is a key principle underlying successful learning architectures. Surprisingly, all functions that are efficiently Turing computable have a compositionally sparse representation. Furthermore, deep networks that are also sparse can exploit this general property to avoid the “curse of dimensionality”. This framework suggests interesting implications about the role that machine learning may play in mathematics.
  5. A core component present in many successful neural network architectures is an MLP block of two fully connected layers with a non-linear activation in between. An intriguing phenomenon observed empirically, including in transformer architectures, is that after training, the activations in the hidden layer of this MLP block tend to be extremely sparse on any given input. Unlike traditional forms of sparsity, where there are neurons/weights that can be deleted from the network, this form of “dynamic” activation sparsity appears to be harder to exploit to obtain more efficient networks. Motivated by this, we initiate a formal study of the PAC learnability of MLP layers that exhibit activation sparsity. We present a variety of results showing that such classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts. Our hope is that a better theoretical understanding of sparsely activated networks will lead to methods that can exploit activation sparsity in practice.
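The activation sparsity discussed in the entry above is straightforward to measure. Here is a small numpy sketch (the layer sizes and random initialization are illustrative, not from the paper) that computes the fraction of hidden units a ReLU MLP block activates per input; at random initialization this fraction is near one half, whereas the phenomenon described in the abstract is that trained networks activate far fewer units on any given input.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random MLP block: d -> h -> d with a ReLU in between (illustrative
# sizes; transformer MLP blocks commonly use h = 4 * d).
d, h = 64, 256
W1 = rng.normal(size=(h, d)) / np.sqrt(d)
b1 = rng.normal(size=h)
W2 = rng.normal(size=(d, h)) / np.sqrt(h)

def mlp_block(x):
    a = np.maximum(0.0, W1 @ x + b1)  # hidden activations after ReLU
    return W2 @ a, a

# Fraction of hidden units active per input, averaged over random inputs.
X = rng.normal(size=(1000, d))
active = np.mean([(mlp_block(x)[1] > 0).mean() for x in X])
print(f"mean fraction of active hidden units: {active:.3f}")
```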