

Title: Generalized Bilinear Deep Convolutional Neural Networks for Multimodal Biometric Identification
In this paper, we propose to employ a bank of modality-dedicated Convolutional Neural Networks (CNNs) and to fuse, train, and optimize them together for person classification tasks. A modality-dedicated CNN is used for each modality to extract modality-specific features. We demonstrate that, rather than spatial fusion at the convolutional layers, the fusion can be performed on the outputs of the fully-connected layers of the modality-specific CNNs without any loss of performance and with a significant reduction in the number of parameters. We show that, using multiple CNNs with multimodal fusion at the feature level, we significantly outperform systems that use unimodal representations. We study weighted feature, bilinear, and compact bilinear feature-level fusion algorithms for multimodal biometric person identification. Finally, we propose a generalized compact bilinear fusion algorithm that deploys both the weighted feature fusion and compact bilinear schemes. We provide results for the proposed algorithms on three challenging databases: CMU Multi-PIE, BioCop, and BIOMDATA.
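As a rough illustration of the compact bilinear machinery the abstract refers to, the sketch below approximates the bilinear (outer-product) interaction between two modality embeddings with count sketches and an FFT-domain circular convolution, then concatenates weighted first-order features with the second-order term. All names, dimensions, and the fixed weights are illustrative assumptions, not the paper's code.

```python
import torch

class CountSketch(torch.nn.Module):
    """Fixed random count-sketch projection, the building block of
    compact bilinear pooling (hash buckets and signs are not trained)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.out_dim = out_dim
        self.register_buffer("h", torch.randint(out_dim, (in_dim,)))
        self.register_buffer("s", (torch.randint(0, 2, (in_dim,)) * 2 - 1).float())

    def forward(self, x):  # x: (batch, in_dim)
        out = torch.zeros(x.size(0), self.out_dim, device=x.device, dtype=x.dtype)
        return out.index_add_(1, self.h, x * self.s)

def compact_bilinear(x, y, sketch_x, sketch_y):
    """Approximate the flattened outer product of x and y by circular
    convolution of two count sketches, computed in the Fourier domain."""
    fx = torch.fft.rfft(sketch_x(x))
    fy = torch.fft.rfft(sketch_y(y))
    return torch.fft.irfft(fx * fy, n=sketch_x.out_dim)

# Hypothetical "generalized" fusion: weighted first-order features
# concatenated with the compact bilinear second-order term.
face, iris = torch.randn(8, 512), torch.randn(8, 512)
cs_f, cs_i = CountSketch(512, 4096), CountSketch(512, 4096)
alpha, beta = 0.5, 0.5  # modality weights; assumed fixed here for brevity
fused = torch.cat([alpha * face, beta * iris,
                   compact_bilinear(face, iris, cs_f, cs_i)], dim=1)
```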
Award ID(s):
1650474
NSF-PAR ID:
10091242
Author(s) / Creator(s):
Date Published:
Journal Name:
IEEE International Conference on Image Processing (ICIP)
Page Range / eLocation ID:
763 to 767
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we propose a deep multimodal fusion network to fuse multiple modalities (face, iris, and fingerprint) for person identification. The proposed deep multimodal fusion algorithm consists of multiple streams of modality-specific Convolutional Neural Networks (CNNs), which are jointly optimized at multiple feature abstraction levels. Multiple features are extracted at several different convolutional layers from each modality-specific CNN for joint feature fusion, optimization, and classification. Features extracted at different convolutional layers of a modality-specific CNN represent the input at several different levels of abstraction. We demonstrate that efficient multimodal classification can be accomplished with a significant reduction in the number of network parameters by exploiting these multi-level abstract representations extracted from all the modality-specific CNNs. We demonstrate an increase in multimodal person identification performance by utilizing the proposed multi-level abstract representations in our multimodal fusion, rather than using only the features from the last layer of each modality-specific CNN. We show that our deep multimodal CNNs with fusion at several different feature abstraction levels can significantly outperform unimodal representations in accuracy. We also demonstrate that the joint optimization of all the modality-specific CNNs outperforms the score- and decision-level fusion of independently optimized CNNs.
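A minimal sketch of the multi-stream, multi-level fusion idea described above: each modality-dedicated CNN exposes features from more than one convolutional depth, and a single jointly trained classifier consumes all of them. Layer sizes, the three-modality channel layout, and all names are assumptions for illustration, not the authors' architecture.

```python
import torch
from torch import nn

class ModalityCNN(nn.Module):
    """Small modality-dedicated CNN exposing two abstraction levels."""
    def __init__(self, in_ch):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1),
                                    nn.ReLU(), nn.MaxPool2d(2))
        self.block2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1),
                                    nn.ReLU(), nn.MaxPool2d(2))
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        f1 = self.block1(x)                 # early (low-level) features
        f2 = self.block2(f1)                # late (high-level) features
        return self.pool(f1).flatten(1), self.pool(f2).flatten(1)

class MultiLevelFusionNet(nn.Module):
    """Joint net: fuse early+late features from every modality stream,
    so one loss optimizes all streams and the classifier together."""
    def __init__(self, num_classes, modal_channels=(3, 1, 1)):  # face, iris, fingerprint
        super().__init__()
        self.streams = nn.ModuleList(ModalityCNN(c) for c in modal_channels)
        self.classifier = nn.Linear((32 + 64) * len(modal_channels), num_classes)

    def forward(self, inputs):              # inputs: list of per-modality tensors
        feats = [f for x, s in zip(inputs, self.streams) for f in s(x)]
        return self.classifier(torch.cat(feats, dim=1))

net = MultiLevelFusionNet(num_classes=100)
xs = [torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)]
logits = net(xs)                            # -> (2, 100)
```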
  2. Deformable Convolutional Networks (DCN) have been proposed as a powerful tool to boost the representation power of Convolutional Neural Networks (CNN) in computer vision tasks via adaptive sampling of the input feature map. Much like vision transformers, DCNs utilize a more flexible inductive bias than standard CNNs and have also been shown to improve the performance of particular models. For example, drop-in DCN layers were shown to increase the AP score of Mask RCNN by 10.6 points while introducing only 1% additional parameters and FLOPs, improving the state-of-the-art model at the time of publication. However, despite evidence that more DCN layers placed earlier in the network can further improve performance, we have not seen this trend continue with further scaling of deformations in CNNs, unlike for vision transformers. Benchmarking experiments show that a realistically sized DCN layer (64×64 spatial resolution, 64 input/output channels) incurs a 4× slowdown on a GPU platform, discouraging the more ubiquitous use of deformations in CNNs. These slowdowns are caused by the irregular input-dependent access patterns of the bilinear interpolation operator, which has a disproportionately low arithmetic intensity (AI) compared to the rest of the DCN. To address the disproportionate slowdown of DCNs and enable their expanded use in CNNs, we propose DefT, a series of workload-aware optimizations for DCN kernels. DefT identifies performance bottlenecks in DCNs and fuses specific operators that are observed to limit DCN AI. Our approach also uses statistical information of DCN workloads to adapt the workload tiling to the DCN layer dimensions, minimizing costly out-of-boundary input accesses. Experimental results show that DefT mitigates up to half of the DCN slowdown over the state-of-the-art PyTorch implementation. This translates to a layerwise speedup of up to 134% and a 46% reduction in normalized training time on a fully DCN-enabled ResNet model.
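The layerwise slowdown this abstract describes can be probed with torchvision's reference deformable convolution (DefT itself is not assumed to be publicly available). The CPU timing below is only a sketch of the comparison at the layer size quoted in the abstract; on a GPU one would additionally need torch.cuda.synchronize() around the timers.

```python
import time
import torch
from torch import nn
from torchvision.ops import DeformConv2d

# "Realistically sized" layer from the abstract: 64x64 map, 64 in/out channels.
x = torch.randn(1, 64, 64, 64)
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
dcn = DeformConv2d(64, 64, kernel_size=3, padding=1)
# One (dy, dx) offset per kernel tap and output location: 2*3*3 = 18 channels.
offset = torch.randn(1, 18, 64, 64)

def bench(fn, iters=50):
    with torch.no_grad():
        fn()                               # warm-up
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        return (time.perf_counter() - t0) / iters

print(f"standard conv: {bench(lambda: conv(x)) * 1e3:.2f} ms/iter")
print(f"deformable   : {bench(lambda: dcn(x, offset)) * 1e3:.2f} ms/iter")
```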
  3. Bilinear pooling has recently been proposed as a feature encoding layer, which can be used after the convolutional layers of a deep network to improve performance in multiple vision tasks. Unlike conventional global average pooling or a fully connected layer, bilinear pooling gathers 2nd-order information in a translation-invariant fashion. However, a serious drawback of this family of pooling layers is their dimensionality explosion. Approximate pooling methods with compact properties have been explored to resolve this weakness. Additionally, recent results have shown that significant performance gains can be achieved by adding 1st-order information and applying matrix normalization to regularize unstable higher-order information. However, combining compact pooling with matrix normalization and other-order information has not been explored until now. In this paper, we unify bilinear pooling and the global Gaussian embedding layers through the empirical moment matrix. In addition, we propose a novel sub-matrix square-root layer, which can be used to normalize the output of the convolution layer directly and mitigate the dimensionality problem with off-the-shelf compact pooling methods. Our experiments on three widely used fine-grained classification datasets illustrate that our proposed architecture, MoNet, can achieve similar or better performance than the state-of-the-art G2DeNet. Furthermore, when combined with a compact pooling technique, MoNet obtains comparable performance with encoded features that have 96% fewer dimensions.
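For reference, classical bilinear pooling with the common signed square-root and L2 normalization looks roughly as follows; MoNet's sub-matrix square-root layer replaces the element-wise normalization step and is not reproduced here. The example also makes the dimensionality explosion the abstract mentions concrete.

```python
import torch
import torch.nn.functional as F

def bilinear_pool(feat, eps=1e-8):
    """Classical translation-invariant 2nd-order (bilinear) pooling.

    feat: conv feature map of shape (B, C, H, W)
    returns: (B, C*C) descriptor -- for C = 512 that is a 262,144-D
    vector, the "dimensionality explosion" compact methods address.
    """
    b, c, h, w = feat.shape
    x = feat.flatten(2)                                # (B, C, H*W)
    gram = torch.bmm(x, x.transpose(1, 2)) / (h * w)   # (B, C, C) 2nd-order stats
    v = gram.flatten(1)
    v = torch.sign(v) * torch.sqrt(v.abs() + eps)      # signed-sqrt regularization
    return F.normalize(v)                              # L2 normalization

desc = bilinear_pool(torch.randn(2, 512, 7, 7))        # -> (2, 262144)
```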
  4. Bilinear pooling has been recently proposed as a feature encoding layer, which can be used after the convolutional layers of a deep network, to improve performance in mul- tiple vision tasks. Different from conventional global aver- age pooling or fully connected layer, bilinear pooling gath- ers 2nd order information in a translation invariant fash- ion. However, a serious drawback of this family of pooling layers is their dimensionality explosion. Approximate pool- ing methods with compact properties have been explored towards resolving this weakness. Additionally, recent re- sults have shown that significant performance gains can be achieved by adding 1st order information and applying ma- trix normalization to regularize unstable higher order in- formation. However, combining compact pooling with ma- trix normalization and other order information has not been explored until now. In this paper, we unify bilinear pool- ing and the global Gaussian embedding layers through the empirical moment matrix. In addition, we propose a novel sub-matrix square-root layer, which can be used to normal- ize the output of the convolution layer directly and mitigate the dimensionality problem with off-the-shelf compact pool- ing methods. Our experiments on three widely used fine- grained classification datasets illustrate that our proposed architecture, MoNet, can achieve similar or better perfor- mance than with the state-of-art G 2 DeNet. Furthermore, when combined with compact pooling technique, MoNet ob- tains comparable performance with encoded features with 96% less dimensions. 
    more » « less
  4. Abstract

    Boiling is a high-performance heat dissipation process that is central to electronics cooling and power generation. The past decades have witnessed significantly improved and better-controlled boiling heat transfer using structured surfaces, whereas the physical mechanisms that dominate structure-enhanced boiling remain contested. Experimental characterization of boiling has been challenging due to the high dimensionality, stochasticity, and dynamicity of the boiling process. To tackle these issues, this paper presents a coupled multimodal sensing and data fusion platform to characterize boiling states and heat fluxes and identify the key transport parameters in different boiling stages. Pool boiling tests of water on multi-tier copper structures are performed under both steady-state and transient heat loads, during which multimodal, multidimensional signals are recorded, including temperature profiles, optical imaging, and acoustic signals via contact acoustic emission (AE) sensors, hydrophones immersed in the liquid pool, and condenser microphones outside the boiling chamber. The physics-based analysis focuses on i) extracting dynamic characteristics of boiling from time lags between acoustic-optical-thermal signals, ii) analyzing the energy balance between thermal diffusion, bubble growth, and acoustic dissipation, and iii) decoupling the response signals for different physical processes, e.g., low-to-mid-frequency AE induced by thermal expansion of liquids and bubble ebullition. Separate multimodal sensing tests, namely a single-phase liquid test and a single-bubble-dynamics test, are performed to reinforce the analysis, which confirms an AE peak at 1.5 kHz corresponding to bubble ebullition. The data-driven analysis focuses on enabling the early fusion of acoustic and optical signals for improved boiling state and flux predictions. Unlike single-modality analysis or commonly used late fusion algorithms that concatenate processed signals in dense layers, the current work performs the fusion process in the deep feature domain using a multi-layer perceptron regression model. This early fusion algorithm is shown to lead to more accurate and robust predictions. The coupled multimodal sensing and data fusion platform is promising for enabling reliable thermal monitoring and advancing the understanding of dominant transport mechanisms during boiling.
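A hedged sketch of what "early fusion in the deep feature domain" could look like: per-modality deep features are concatenated before a shared MLP trunk, rather than concatenating per-modality predictions in a late dense layer. The feature dimensions, number of boiling states, and all names are assumptions, not the paper's model.

```python
import torch
from torch import nn

class EarlyFusionMLP(nn.Module):
    """Fuse acoustic and optical deep features *before* the regressor,
    instead of concatenating per-modality outputs in a late dense layer."""
    def __init__(self, acoustic_dim=128, optical_dim=256, hidden=64, n_states=4):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(acoustic_dim + optical_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.flux_head = nn.Linear(hidden, 1)           # heat-flux regression
        self.state_head = nn.Linear(hidden, n_states)   # boiling-state logits

    def forward(self, acoustic_feat, optical_feat):
        z = self.trunk(torch.cat([acoustic_feat, optical_feat], dim=1))
        return self.flux_head(z), self.state_head(z)

flux, state = EarlyFusionMLP()(torch.randn(4, 128), torch.randn(4, 256))
```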
