This content will become publicly available on July 23, 2024

Title: Hiding Data Helps: On the Benefits of Masking for Sparse Coding
Sparse coding, which refers to modeling a signal as sparse linear combinations of the elements of a learned dictionary, has proven to be a successful (and interpretable) approach in applications such as signal processing, computer vision, and medical imaging. While this success has spurred much work on provable guarantees for dictionary recovery when the learned dictionary is the same size as the ground-truth dictionary, work on the setting where the learned dictionary is larger (or over-realized) with respect to the ground truth is comparatively nascent. Existing theoretical results in this setting have been constrained to the case of noise-less data. We show in this work that, in the presence of noise, minimizing the standard dictionary learning objective can fail to recover the elements of the ground-truth dictionary in the over-realized regime, regardless of the magnitude of the signal in the data-generating process. Furthermore, drawing from the growing body of work on self-supervised learning, we propose a novel masking objective for which recovering the ground-truth dictionary is in fact optimal as the signal increases for a large class of data-generating processes. We corroborate our theoretical results with experiments across several parameter regimes showing that our proposed objective also enjoys better empirical performance than the standard reconstruction objective.
Award ID(s):
2307106
NSF-PAR ID:
10477449
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Proceedings of Machine Learning Research
Date Published:
Journal Name:
Proceedings of the 40th International Conference on Machine Learning
Volume:
202
Page Range / eLocation ID:
5600--5615
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Sparse coding refers to modeling a signal as sparse linear combinations of the elements of a learned dictionary. Sparse coding has proven to be a successful and interpretable approach in many applications, such as signal processing, computer vision, and medical imaging. While this success has spurred much work on sparse coding with provable guarantees, work on the setting where the learned dictionary is larger (or over-realized) with respect to the ground truth is comparatively nascent. Existing theoretical results in the over-realized regime are limited to the case of noise-less data. In this paper, we show that for over-realized sparse coding in the presence of noise, minimizing the standard dictionary learning objective can fail to recover the ground-truth dictionary, regardless of the magnitude of the signal in the data-generating process. Furthermore, drawing from the growing body of work on self-supervised learning, we propose a novel masking objective and we prove that minimizing this new objective can recover the ground-truth dictionary. We corroborate our theoretical results with experiments across several parameter regimes, showing that our proposed objective enjoys better empirical performance than the standard reconstruction objective. 
  2. Data from cellular networks have proven to be one of the most promising ways to understand large-scale human mobility for various ubiquitous computing applications, owing to the high penetration of cellphones and low collection cost. Existing mobility models driven by cellular network data suffer from sparse spatial-temporal observations because user locations are recorded with cellphone activities, e.g., calls, texts, or internet access. In this paper, we design a human mobility recovery system called CellSense that takes sparse cellular billing data (CBR) as input and outputs dense continuous records, closing the sensing gap that arises when cellular networks are used as sensing systems for human mobility. There is limited work on this kind of recovery system at large scale: although it is straightforward to design a recovery system based on regression models, it is very challenging to evaluate these models at large scale due to the lack of ground-truth data. In this paper, we explore a new opportunity based on the upgrade of cellular infrastructures to obtain cellular network signaling data as ground truth; these data log the interaction between cellphones and cellular towers at the signal level (e.g., attaching, detaching, paging) even without billable activities. Based on the signaling data, we design CellSense for human mobility recovery by integrating collective mobility patterns with individual mobility modeling, which achieves a 35.3% improvement over state-of-the-art models. The key application of our recovery model is to take the regular sparse CBR data that a researcher already has and recover the data missing due to sensing gaps, producing dense cellular data with which to train a machine learning model for their use case, e.g., next-location prediction.
  3. Many studies of brain aging and neurodegenerative disorders such as Alzheimer’s and Parkinson’s diseases require rapid counts of high signal:noise (S:N) stained brain cells such as neurons and neuroglia (microglial cells) on tissue sections. To increase throughput efficiency of this work, we have combined deep learned (DL) neural networks and computerized stereology (DL-stereology) for automatic cell counts with low error (<10%) compared to time-intensive manual counts. To date, however, this approach has been limited to sections with a single high S:N immunostain for neurons (NeuN) or microglial cells (Iba-1). The present study expands this approach to protocols that combine immunostains with counterstains, e.g., cresyl violet (CV). In our method, a stain separation technique called Sparse Non-negative Matrix Factorization (SNMF) converts a dual-stained color image to a single gray image showing only the principal immunostain. Validation testing was done using semi-automatic and automatic stereology-based counts of sections immunostained for neurons or microglia with CV counterstaining from the neocortex of a transgenic mouse model of tauopathy (Tg4510 mouse) and controls. Cell count results with principal stain gray images show an average error rate of 16.78% and 28.47% for the semi-automatic approach and 8.51% and 9.36% for the fully-automatic DL-stereology approach for neurons and microglia, respectively, as compared to manual cell counts (ground truth). This work indicates that stain separation by SNMF can support high throughput, fully automatic DL-stereology based counts of neurons and microglia on counterstained tissue sections.
  4. Unsupervised denoising is a crucial challenge in real-world imaging applications. Unsupervised deep-learning methods have demonstrated impressive performance on benchmarks based on synthetic noise. However, no metrics are available to evaluate these methods in an unsupervised fashion. This is highly problematic for the many practical applications where ground-truth clean images are not available. In this work, we propose two novel metrics: the unsupervised mean squared error (MSE) and the unsupervised peak signal-to-noise ratio (PSNR), which are computed using only noisy data. We provide a theoretical analysis of these metrics, showing that they are asymptotically consistent estimators of the supervised MSE and PSNR. Controlled numerical experiments with synthetic noise confirm that they provide accurate approximations in practice. We validate our approach on real-world data from two imaging modalities: videos in raw format and transmission electron microscopy. Our results demonstrate that the proposed metrics enable unsupervised evaluation of denoising methods based exclusively on noisy data. 
  5. Ultrasound B-Mode images are created from data obtained from each element in the transducer array in a process called beamforming. The goal of beamforming is to enhance signals from specified spatial locations while reducing signal from all other locations. On clinical systems, beamforming is accomplished with the delay-and-sum (DAS) algorithm. DAS is efficient but fails in patients with high noise levels, so various adaptive beamformers have been proposed. Recently, deep learning methods have been developed for this task. With deep learning methods, beamforming is typically framed as a regression problem, where clean, ground-truth data is known, and usually simulated. For in vivo data, however, it is extremely difficult to collect ground truth information, and deep networks trained on simulated data underperform when applied to in vivo data, due to domain shift between simulated and in vivo data. In this work, we show how to correct for domain shift by learning deep network beamformers that leverage both simulated data and unlabeled in vivo data, via a novel domain adaptation scheme. A challenge in our scenario is that domain shift exists for both the noisy input and the clean output. We address this challenge by extending cycle-consistent generative adversarial networks, where we leverage maps between synthetic simulation and real in vivo domains to ensure that the learned beamformers capture the distribution of both noisy and clean in vivo data. We obtain consistent in vivo image quality improvements compared to existing beamforming techniques when applying our approach to simulated anechoic cysts and in vivo liver data.