NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Learning Neural Networks with Sparse Activations

Awasthi, Pranjal; Dikkala, Nishanth; Kamath, Pritish; Meka, Raghu (June 2024, Proceedings of the 37th Conference on Learning Theory (COLT 2024))

A core component present in many successful neural network architectures, is an MLP block of two fully connected layers with a non-linear activation in between. An intriguing phenomenon observed empirically, including in transformer architectures, is that, after training, the activations in the hidden layer of this MLP block tend to be extremely sparse on any given input. Unlike traditional forms of sparsity, where there are neurons/weights which can be deleted from the network, this form of {\em dynamic} activation sparsity appears to be harder to exploit to get more efficient networks. Motivated by this we initiate a formal study of PAC learnability of MLP layers that exhibit activation sparsity. We present a variety of results showing that such classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts. Our hope is that a better theoretical understanding of {\em sparsely activated} networks would lead to methods that can exploit activation sparsity in practice.
more » « less
Full Text Available
Learning Ising Models from One or Multiple Samples

Dagan, Yuval; Daskalakis, Constantinos; Dikkala, Nishanth; Kandiros, Anthimos Vardis (January 2021, the 53rd ACM Symposium on Theory of Computing (STOC))
null (Ed.)
Full Text Available
Regression from Dependent Observations

Daskalakis, Constantinos; Dikkala, Nishanth; Panageas, Ioannis (January 2019, 51st Annual ACM Symposium on the Theory of Computing)

The standard linear and logistic regression models assume that the response variables are independent, but share the same linear relationship to their corresponding vectors of covariates. The assumption that the response variables are independent is, however, too strong. In many applications, these responses are collected on nodes of a network, or some spatial or temporal domain, and are dependent. Examples abound in financial and meteorological applications, and dependencies naturally arise in social networks through peer effects. Regression with dependent responses has thus received a lot of attention in the Statistics and Economics literature, but there are no strong consistency results unless multiple independent samples of the vectors of dependent responses can be collected from these models. We present computationally and statistically efficient methods for linear and logistic regression models when the response variables are dependent on a network. Given one sample from a networked linear or logistic regression model and under mild assumptions, we prove strong consistency results for recovering the vector of coefficients and the strength of the dependencies, recovering the rates of standard regression under independent observations. We use projected gradient descent on the negative log-likelihood, or negative log-pseudolikelihood, and establish their strong convexity and consistency using concentration of measure for dependent random variables.
more » « less
Full Text Available
Generalization and Learning Under Dobrushin's Condition

Dagan, Yuval; Daskalakis, Constantinos; Dikkala, Nishanth; Jayanti, Siddhartha (January 2019, 32nd Annual Conference on Learning Theory)

Statistical learning theory has largely focused on learning and generalization given inde-pendent and identically distributed (i.i.d.) samples. Motivated by applications involving time-series data, there has been a growing literature on learning and generalization in settings where data is sampled from an ergodic process. This work has also developed complexity measures,which appropriately extend the notion of Rademacher complexity to bound the generalization error and learning rates of hypothesis classes in this setting. Rather than time-series data, our work is motivated by settings where data is sampled on a network or a spatial domain, and thus do not fit well within the framework of prior work. We provide learning and generaliza-tion bounds for data that are complexly dependent, yet their distribution satisfies the standardDobrushin’s condition. Indeed, we show that the standard complexity measures of Gaussian and Rademacher complexities and VC dimension are sufficient measures of complexity for the purposes of bounding the generalization error and learning rates of hypothesis classes in our setting. Moreover, our generalization bounds only degrade by constant factors compared to their i.i.d. analogs, and our learnability bounds degrade by log factors in the size of the trainingset.
more » « less
Full Text Available
Testing Symmetric Markov Chains from a Single Trajectory

Daskalakis, Constantinos; Dikkala, Nishanth; Gravin, Nick (July 2018, Conference on Learning Theory (COLT))

The paper’s abstract in valid LaTeX, without non-standard macros or \cite commands. Classical distribution testing assumes access to i.i.d. samples from the distribution that is being tested. We initiate the study of Markov chain testing, assuming access to a {\em single trajectory of a Markov Chain.} In particular, we observe a single trajectory X0,…,Xt,… of an unknown, symmetric, and finite state Markov Chain M. We do not control the starting state X0, and we cannot restart the chain. Given our single trajectory, the goal is to test whether M is identical to a model Markov Chain M′, or far from it under an appropriate notion of difference. We propose a measure of difference between two Markov chains, motivated by the early work of Kazakos [78], which captures the scaling behavior of the total variation distance between trajectories sampled from the Markov chains as the length of these trajectories grows. We provide efficient testers and information-theoretic lower bounds for testing identity of symmetric Markov chains under our proposed measure of difference, which are tight up to logarithmic factors if the hitting times of the model chain M′ is O~(n) in the size of the state space n.
more » « less
Full Text Available
From Soft Classifiers to Hard Decisions: How fair can we be?

https://doi.org/10.1145/3287560.3287561

Canetti, Ran; Cohen, Aloni; Dikkala, Nishanth; Ramnarayan, Govind; Scheffler, Sarah; Smith, Adam (January 2019, Proceedings of the Conference on Fairness, Accountability, and Transparency - FAT* '19)

Full Text Available
Testing Ising Models

Daskalakis, Constantinos; Dikkala, Nishanth; Kamath, Gautam (January 2018, Proceedings of the annual ACM-SIAM Symposium on Discrete Algorithms)

Given samples from an unknown multivariate distribution p, is it possible to distinguish whether p is the product of its marginals versus p being far from every product distribution? Similarly, is it possible to distinguish whether p equals a given distribution q versus p and q being far from each other? These problems of testing independence and goodness-of-fit have received enormous attention in statistics, information theory, and theoretical computer science, with sample-optimal algorithms known in several interesting regimes of parameters [BFF+01, Pan08, VV17, ADK15, DK16]. Unfortunately, it has also been understood that these problems become intractable in large dimensions, necessitating exponential sample complexity. Motivated by the exponential lower bounds for general distributions as well as the ubiquity of Markov Random Fields (MRFs) in the modeling of high-dimensional distributions, we initiate the study of distribution testing on structured multivariate distributions, and in particular the prototypical example of MRFs: the Ising Model. We demonstrate that, in this structured setting, we can avoid the curse of dimensionality, obtaining sample and time efficient testers for independence and goodness-of-fit. One of the key technical challenges we face along the way is bounding the variance of functions of the Ising model.
more » « less
Full Text Available
HOGWILD!-Gibbs can be PanAccurate

Daskalakis, Constantinos; Dikkala, Nishanth; Jayanti, Siddhartha (January 2018, 32nd Annual Conference on Neural Information Processing Systems)

Asynchronous Gibbs sampling has been recently shown to be fast-mixing and an accurate method for estimating probabilities of events on a small number of variables of a graphical model satisfying Dobrushin's condition~\cite{DeSaOR16}. We investigate whether it can be used to accurately estimate expectations of functions of {\em all the variables} of the model. Under the same condition, we show that the synchronous (sequential) and asynchronous Gibbs samplers can be coupled so that the expected Hamming distance between their (multivariate) samples remains bounded by O(τlogn), where n is the number of variables in the graphical model, and τ is a measure of the asynchronicity. A similar bound holds for any constant power of the Hamming distance. Hence, the expectation of any function that is Lipschitz with respect to a power of the Hamming distance, can be estimated with a bias that grows logarithmically in n. Going beyond Lipschitz functions, we consider the bias arising from asynchronicity in estimating the expectation of polynomial functions of all variables in the model. Using recent concentration of measure results, we show that the bias introduced by the asynchronicity is of smaller order than the standard deviation of the function value already present in the true model. We perform experiments on a multi-processor machine to empirically illustrate our theoretical findings.
more » « less
Full Text Available
Testing Symmetric Markov Chains from a Single Trajectory

Daskalakis, Constantinos; Dikkala, Nishanth; Gravin, Nick (January 2018, 31st Annual Conference on Learning Theory)

Classical distribution testing assumes access to i.i.d. samples from the distribution that is being tested. We initiate the study of Markov chain testing, assuming access to a single trajectory of a Markov Chain. In particular, we observe a single trajectory X0,...,Xt,... of an unknown, symmetric, and finite state Markov Chain M. We do not control the starting state X0, and we cannot restart the chain. Given our single trajectory, the goal is to test whether M is identical to a model Markov Chain M0 , or far from it under an appropriate notion of difference. We propose a measure of difference between two Markov chains, motivated by the early work of Kazakos [Kaz78], which captures the scaling behavior of the total variation distance between trajectories sampled from the Markov chains as the length of these trajectories grows. We provide efficient testers and information-theoretic lower bounds for testing identity of symmetric Markov chains under our proposed measure of difference, which are tight up to logarithmic factors if the hitting times of the model chain M0 is O(n) in the size of the state space n.
more » « less
Full Text Available
HOGWILD!-Gibbs can be PanAccurate

Daskalakis, Constantinos; Dikkala, Nishanth; Jayanti, Siddhartha (January 2018, Journal of neural information processing)

Asynchronous Gibbs sampling has been recently shown to be fast-mixing and an accurate method for estimating probabilities of events on a small number of variables of a graphical model satisfying Dobrushin's condition~\cite{DeSaOR16}. We investigate whether it can be used to accurately estimate expectations of functions of {\em all the variables} of the model. Under the same condition, we show that the synchronous (sequential) and asynchronous Gibbs samplers can be coupled so that the expected Hamming distance between their (multivariate) samples remains bounded by O(τlogn), where n is the number of variables in the graphical model, and τ is a measure of the asynchronicity. A similar bound holds for any constant power of the Hamming distance. Hence, the expectation of any function that is Lipschitz with respect to a power of the Hamming distance, can be estimated with a bias that grows logarithmically in n. Going beyond Lipschitz functions, we consider the bias arising from asynchronicity in estimating the expectation of polynomial functions of all variables in the model. Using recent concentration of measure results, we show that the bias introduced by the asynchronicity is of smaller order than the standard deviation of the function value already present in the true model. We perform experiments on a multi-processor machine to empirically illustrate our theoretical findings.
more » « less
Full Text Available

« Prev Next »

Search for: All records