skip to main content


Title: Deep Mixture of Experts via Shallow Embedding
Larger networks generally have greater representational power at the cost of increased computational complexity. Sparsifying such networks has been an active area of research but has been generally limited to static regularization or dynamic approaches using reinforcement learning. We explore a mixture of experts (MoE) approach to deep dynamic routing, which activates certain experts in the network on a per-example basis. Our novel DeepMoE architecture increases the representational power of standard convolutional networks by adaptively sparsifying and recalibrating channel-wise features in each convolutional layer. We employ a multi-headed sparse gating network to determine the selection and scaling of channels for each input, leveraging exponential combinations of experts within a single convolutional network. Our proposed architecture is evaluated on four benchmark datasets and tasks, and we show that Deep-MoEs are able to achieve higher accuracy with lower computation than standard convolutional networks.  more » « less
Award ID(s):
1846431
NSF-PAR ID:
10166380
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
Uncertainty in artificial intelligence
ISSN:
1525-3384
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Deep learning has been broadly applied to imaging in scattering applications. A common framework is to train a descattering network for image recovery by removing scattering artifacts. To achieve the best results on a broad spectrum of scattering conditions, individual “expert” networks need to be trained for each condition. However, the expert’s performance sharply degrades when the testing condition differs from the training. An alternative brute-force approach is to train a “generalist” network using data from diverse scattering conditions. It generally requires a larger network to encapsulate the diversity in the data and a sufficiently large training set to avoid overfitting. Here, we propose anadaptive learningframework, termed dynamic synthesis network (DSN), whichdynamicallyadjusts the model weights andadaptsto different scattering conditions. The adaptability is achieved by a novel “mixture of experts” architecture that enables dynamically synthesizing a network by blending multiple experts using a gating network. We demonstrate the DSN in holographic 3D particle imaging for a variety of scattering conditions. We show in simulation that our DSN provides generalization across acontinuumof scattering conditions. In addition, we show that by training the DSN entirely on simulated data, the network can generalize to experiments and achieve robust 3D descattering. We expect the same concept can find many other applications, such as denoising and imaging in scattering media. Broadly, our dynamic synthesis framework opens up a new paradigm for designing highlyadaptivedeep learning and computational imaging techniques.

     
    more » « less
  2. Topological data analysis (TDA) is a branch of computational mathematics, bridging algebraic topology and data science, that provides compact, noise-robust representations of complex structures. Deep neural networks (DNNs) learn millions of parameters associated with a series of transformations defined by the model architecture resulting in high-dimensional, difficult to interpret internal representations of input data. As DNNs become more ubiquitous across multiple sectors of our society, there is increasing recognition that mathematical methods are needed to aid analysts, researchers, and practitioners in understanding and interpreting how these models' internal representations relate to the final classification. In this paper we apply cutting edge techniques from TDA with the goal of gaining insight towards interpretability of convolutional neural networks used for image classification. We use two common TDA approaches to explore several methods for modeling hidden layer activations as high-dimensional point clouds, and provide experimental evidence that these point clouds capture valuable structural information about the model's process. First, we demonstrate that a distance metric based on persistent homology can be used to quantify meaningful differences between layers and discuss these distances in the broader context of existing representational similarity metrics for neural network interpretability. Second, we show that a mapper graph can provide semantic insight as to how these models organize hierarchical class knowledge at each layer. These observations demonstrate that TDA is a useful tool to help deep learning practitioners unlock the hidden structures of their models. 
    more » « less
  3. Transfer learning from the model trained on large datasets to customized downstream tasks has been widely used as the pre-trained model can greatly boost the generalizability. However, the increasing sizes of pre-trained models also lead to a prohibitively large memory footprints for downstream transferring, making them unaffordable for personal devices. Previous work recognizes the bottleneck of the footprint to be the activation, and hence proposes various solutions such as injecting specific lite modules. In this work, we present a novel memory-efficient transfer framework called Back Razor, that can be plug-and-play applied to any pre-trained network without changing its architecture. The key idea of Back Razor is asymmetric sparsifying: pruning the activation stored for back-propagation, while keeping the forward activation dense. It is based on the observation that the stored activation, that dominates the memory footprint, is only needed for backpropagation. Such asymmetric pruning avoids affecting the precision of forward computation, thus making more aggressive pruning possible. Furthermore, we conduct the theoretical analysis for the convergence rate of Back Razor, showing that under mild conditions, our method retains the similar convergence rate as vanilla SGD. Extensive transfer learning experiments on both Convolutional Neural Networks and Vision Transformers with classification, dense prediction, and language modeling tasks show that Back Razor could yield up to 97% sparsity, saving 9.2x memory usage, without losing accuracy. The code is available at: https://github.com/VITA-Group/BackRazor_Neurips22. 
    more » « less
  4. The brain has long been divided into distinct areas based upon its local microstructure, or patterned composition of cells, genes, and proteins. While this taxonomy is incredibly useful and provides an essential roadmap for comparing two brains, there is also immense anatomical variability within areas that must be incorporated into models of brain architecture. In this work we leverage the expressive power of deep neural networks to create a data-driven model of intra- and inter-brain area variability. To this end, we train a convolutional neural network that learns relevant microstructural features directly from brain imagery. We then extract features from the network and fit a simple classifier to them, thus creating a simple, robust, and interpretable model of brain architecture. We further propose and show preliminary results for the use of features from deep neural networks in conjunction with unsupervised learning techniques to find fine-grained structure within brain areas. We apply our methods to micron-scale X-ray microtomography images spanning multiple regions in the mouse brain and demonstrate that our deep feature-based model can reliably discriminate between brain areas, is robust to noise, and can be used to reveal anatomically relevant patterns in neural architecture that the network wasn't trained to find. 
    more » « less
  5. Deep convolutional neural networks have revolutionized many machine learning and computer vision tasks, however, some remaining key challenges limit their wider use. These challenges include improving the network's robustness to perturbations of the input image and the limited ``field of view'' of convolution operators. We introduce the IMEXnet that addresses these challenges by adapting semi-implicit methods for partial differential equations. Compared to similar explicit networks, such as residual networks, our network is more stable, which has recently shown to reduce the sensitivity to small changes in the input features and improve generalization. The addition of an implicit step connects all pixels in each channel of the image and therefore addresses the field of view problem while still being comparable to standard convolutions in terms of the number of parameters and computational complexity. We also present a new dataset for semantic segmentation and demonstrate the effectiveness of our architecture using the NYU Depth dataset. 
    more » « less