Title: MoNet: Moments Embedding Network.
Bilinear pooling has recently been proposed as a feature encoding layer, which can be used after the convolutional layers of a deep network to improve performance in multiple vision tasks. Unlike conventional global average pooling or a fully connected layer, bilinear pooling gathers 2nd-order information in a translation-invariant fashion. However, a serious drawback of this family of pooling layers is their dimensionality explosion. Approximate pooling methods with compact properties have been explored to resolve this weakness. Additionally, recent results have shown that significant performance gains can be achieved by adding 1st-order information and applying matrix normalization to regularize unstable higher-order information. However, combining compact pooling with matrix normalization and other-order information has not been explored until now. In this paper, we unify the bilinear pooling and global Gaussian embedding layers through the empirical moment matrix. In addition, we propose a novel sub-matrix square-root layer, which can be used to normalize the output of the convolution layer directly and to mitigate the dimensionality problem with off-the-shelf compact pooling methods. Our experiments on three widely used fine-grained classification datasets illustrate that our proposed architecture, MoNet, can achieve similar or better performance than the state-of-the-art G2DeNet. Furthermore, when combined with a compact pooling technique, MoNet obtains comparable performance with encoded features that have 96% fewer dimensions.
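For intuition, the empirical moment matrix that unifies bilinear pooling and Gaussian embedding can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the abstract, not the authors' implementation; the function names and the eigendecomposition-based square root are assumptions (the paper's sub-matrix square-root layer instead normalizes the convolutional output directly, precisely so that compact pooling can still be applied afterwards).

    # Illustrative sketch only; shapes and names are assumptions, not the authors' code.
    import numpy as np

    def moment_matrix(X):
        """X: (n, c) array of n local conv descriptors.
        Returns the (c+1, c+1) empirical moment matrix: prepending a
        constant 1 to each descriptor makes the 1st-order mean and the
        2nd-order bilinear statistic appear as blocks of one matrix."""
        n, _ = X.shape
        X_aug = np.hstack([np.ones((n, 1)), X])   # augmented descriptors [1, x]
        # M[0, 1:] holds the mean (1st order); M[1:, 1:] is bilinear pooling (2nd order)
        return X_aug.T @ X_aug / n                # average of outer products

    def matrix_sqrt(M, eps=1e-10):
        """Matrix square root of a symmetric PSD matrix via eigendecomposition,
        the usual route to the matrix normalization the abstract mentions."""
        w, V = np.linalg.eigh(M)
        return (V * np.sqrt(np.clip(w, eps, None))) @ V.T

    # Toy usage: 49 spatial locations, 8 channels
    X = np.random.randn(49, 8)
    embedding = matrix_sqrt(moment_matrix(X)).ravel()  # flattened moment embedding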
Award ID(s):
1646121
PAR ID:
10069443
Author(s) / Creator(s):
Date Published:
Journal Name:
2018 IEEE Conf. on Computer Vision and Pattern Recognition.
Page Range / eLocation ID:
3175-3183
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Self-supervised training methods for transformers have demonstrated remarkable performance across various domains. Previous transformer-based models, such as masked autoencoders (MAE), typically utilize a single normalization layer for both the [CLS] symbol and the tokens. In this paper we propose a simple modification that employs separate normalization layers for the tokens and the [CLS] symbol to better capture their distinct characteristics and enhance downstream task performance. Our method aims to alleviate the potential negative effects of using the same normalization statistics for both token types, which may not be optimally aligned with their individual roles. We empirically show that with a separate normalization layer the [CLS] embeddings better encode the global contextual information and are distributed more uniformly in their anisotropic space. When replacing the conventional normalization layer with the two separate layers, we observe an average 2.7% performance improvement across the image, natural language, and graph domains.
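The separate-normalization idea above can be sketched in a few lines of PyTorch; the module name, shapes, and [CLS]-at-position-0 convention are assumptions rather than the paper's code. Since LayerNorm already computes statistics per token, the practical separation lies in the learned affine parameters fitted to [CLS] versus the other tokens.

    # Hedged sketch of separate normalization for [CLS] vs. the other tokens.
    import torch
    import torch.nn as nn

    class SeparateNorm(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.cls_norm = nn.LayerNorm(dim)  # parameters dedicated to [CLS]
            self.tok_norm = nn.LayerNorm(dim)  # parameters dedicated to the tokens

        def forward(self, x):
            # x: (batch, 1 + num_tokens, dim), [CLS] assumed at position 0
            return torch.cat([self.cls_norm(x[:, :1]),
                              self.tok_norm(x[:, 1:])], dim=1)

    y = SeparateNorm(768)(torch.randn(2, 197, 768))  # e.g. ViT-B: [CLS] + 196 patches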
  2. Lensfree holographic microscopy is a compact and cost-effective modality for imaging large fields of view with high resolution. When combined with automated image processing, it can be used for biomolecular sensing, where biochemically functionalized micro- and nano-beads label biomolecules of interest. Neural networks for image feature classification provide faster and more robust sensing results than traditional image processing approaches. While neural networks have been widely applied to other types of image classification problems, and even to image reconstruction in lensfree holographic microscopy, it is unclear what type of network architecture performs best for the small-object image classification problems involved in holography-based sensors. Here, we apply a shallow convolutional neural network to this task and thoroughly investigate how different layers and hyperparameters affect network performance. Layers include dropout, convolutional, normalization, pooling, and activation. Hyperparameters include dropout fraction, filter number and size, stride, and padding. We ultimately achieve a network accuracy of ∼83% and find that the choice of activation layer is most important for maximizing accuracy. We hope that these results can be helpful for researchers developing neural networks for similar classification tasks.
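The layer types investigated above can be illustrated with a shallow PyTorch classifier; every hyperparameter below (filter count, kernel size, patch size, class count) is a placeholder assumption, not the paper's tuned configuration.

    # Hedged sketch of a shallow CNN for small image-patch classification.
    import torch
    import torch.nn as nn

    shallow_cnn = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),  # convolutional layer
        nn.BatchNorm2d(16),                                    # normalization layer
        nn.ReLU(),                                             # activation layer
        nn.MaxPool2d(2),                                       # pooling layer
        nn.Dropout(p=0.25),                                    # dropout layer
        nn.Flatten(),
        nn.Linear(16 * 16 * 16, 2),  # e.g. bead vs. background on 32x32 patches
    )

    logits = shallow_cnn(torch.randn(8, 1, 32, 32))  # batch of 8 grayscale patches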
  3. In this paper, we propose to employ a bank of modality-dedicated Convolutional Neural Networks (CNNs) and to fuse, train, and optimize them together for person classification tasks. A modality-dedicated CNN is used for each modality to extract modality-specific features. We demonstrate that, rather than spatial fusion at the convolutional layers, the fusion can be performed on the outputs of the fully connected layers of the modality-specific CNNs without any loss of performance and with a significant reduction in the number of parameters. We show that, using multiple CNNs with multimodal fusion at the feature level, we significantly outperform systems that use a unimodal representation. We study weighted-feature, bilinear, and compact bilinear feature-level fusion algorithms for multimodal biometric person identification. Finally, we propose a generalized compact bilinear fusion algorithm that deploys both the weighted-feature and compact bilinear schemes. We provide results for the proposed algorithms on three challenging databases: CMU Multi-PIE, BioCop, and BIOMDATA.
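A minimal sketch of fusing at the fully-connected outputs, covering the weighted-feature and bilinear variants mentioned above (PyTorch; the dimensions and fixed weights are illustrative assumptions, not the paper's settings). The flattened outer product also makes plain why bilinear fusion explodes in size, which is what compact bilinear approximations address.

    # Hedged sketch of feature-level fusion of modality-dedicated CNN outputs.
    import torch

    f1 = torch.randn(4, 256)  # FC-layer features from the modality-1 CNN
    f2 = torch.randn(4, 256)  # FC-layer features from the modality-2 CNN

    # Weighted feature fusion: scalar weights (learnable in practice) + concatenation
    w1, w2 = 0.6, 0.4
    fused_weighted = torch.cat([w1 * f1, w2 * f2], dim=1)           # (4, 512)

    # Bilinear fusion: flattened outer product of the two feature vectors
    fused_bilinear = torch.einsum('bi,bj->bij', f1, f2).flatten(1)  # (4, 65536)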
  4. Self-supervised training methods for transformers have demonstrated remarkable performance across various domains. Previous transformer-based models, such as masked autoencoders (MAE), typically utilize a single normalization layer for both the class token [CLS] and the other tokens. In this paper we propose a new yet simple normalization method that separately normalizes the embedding vectors corresponding to normal tokens and to the [CLS] token, in order to better capture their distinct characteristics and enhance downstream task performance. Our empirical study shows that the [CLS] embeddings learned with our separate normalization layer better encode the global contextual information and are distributed more uniformly in their anisotropic space. When the conventional normalization layer is replaced with the two separate layers, we observe an average 2.7% performance improvement in learning tasks from the image, natural language, and graph domains.