skip to main content

Title: Generalized Bilinear Deep Convolutional Neural Networks for Multimodal Biometric Identification
In this paper, we propose to employ a bank of modality-dedicated Convolutional Neural Networks (CNNs), fuse, train, and optimize them together for person classification tasks. A modality-dedicated CNN is used for each modality to extract modality-specific features. We demonstrate that, rather than spatial fusion at the convolutional layers, the fusion can be performed on the outputs of the fully-connected layers of the modality-specific CNNs without any loss of performance and with significant reduction in the number of parameters. We show that, using multiple CNNs with multimodal fusion at the feature-level, we significantly outperform systems that use unimodal representation. We study weighted feature, bilinear, and compact bilinear feature-level fusion algorithms for multimodal biometric person identification. Finally, We propose generalized compact bilinear fusion algorithm to deploy both the weighted feature fusion and compact bilinear schemes. We provide the results for the proposed algorithms on three challenging databases: CMU Multi-PIE, BioCop, and BIOMDATA.
Authors:
; ; ;
Award ID(s):
1650474
Publication Date:
NSF-PAR ID:
10091242
Journal Name:
IEEE International Conference on Image Processing (ICIP)
Page Range or eLocation-ID:
763 to 767
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we propose a deep multimodal fusion network to fuse multiple modalities (face, iris, and fingerprint) for person identification. The proposed deep multimodal fusion algorithm consists of multiple streams of modality-specific Convolutional Neural Networks (CNNs), which are jointly optimized at multiple feature abstraction levels. Multiple features are extracted at several different convolutional layers from each modality-specific CNN for joint feature fusion, optimization, and classification. Features extracted at different convolutional layers of a modality-specific CNN represent the input at several different levels of abstract representations. We demonstrate that an efficient multimodal classification can be accomplished with a significant reduction in the number of network parameters by exploiting these multi-level abstract representations extracted from all the modality-specific CNNs. We demonstrate an increase in multimodal person identification performance by utilizing the proposed multi-level feature abstract representations in our multimodal fusion, rather than using only the features from the last layer of each modality-specific CNNs. We show that our deep multi-modal CNNs with multimodal fusion at several different feature level abstraction can significantly outperform the unimodal representation accuracy. We also demonstrate that the joint optimization of all the modality-specific CNNs excels the score and decision level fusions of independently optimized CNNs.
  2. Deformable Convolutional Networks (DCN) have been proposed as a powerful tool to boost the representation power of Convolutional Neural Networks (CNN) in computer vision tasks via adaptive sampling of the input feature map. Much like vision transformers, DCNs utilize a more flexible inductive bias than standard CNNs and have also been shown to improve performance of particular models. For example, drop-in DCN layers were shown to increase the AP score of Mask RCNN by 10.6 points while introducing only 1% additional parameters and FLOPs, improving the state-of-the art model at the time of publication. However, despite evidence that more DCN layers placed earlier in the network can further improve performance, we have not seen this trend continue with further scaling of deformations in CNNs, unlike for vision transformers. Benchmarking experiments show that a realistically sized DCN layer (64H×64W, 64 in-out channel) incurs a 4× slowdown on a GPU platform, discouraging the more ubiquitous use of deformations in CNNs. These slowdowns are caused by the irregular input-dependent access patterns of the bilinear interpolation operator, which has a disproportionately low arithmetic intensity (AI) compared to the rest of the DCN. To address the disproportionate slowdown of DCNs and enable their expanded usemore »in CNNs, we propose DefT, a series of workload-aware optimizations for DCN kernels. DefT identifies performance bottlenecks in DCNs and fuses specific operators that are observed to limit DCN AI. Our approach also uses statistical information of DCN workloads to adapt the workload tiling to the DCN layer dimensions, minimizing costly out-of-boundary input accesses. Experimental results show that DefT mitigates up to half of DCN slowdown over the current-art PyTorch implementation. This translates to a layerwise speedup of up to 134% and a reduction of normalized training time of 46% on a fully DCN-enabled ResNet model.« less
  3. Bilinear pooling has been recently proposed as a feature encoding layer, which can be used after the convolutional layers of a deep network, to improve performance in mul- tiple vision tasks. Different from conventional global aver- age pooling or fully connected layer, bilinear pooling gath- ers 2nd order information in a translation invariant fash- ion. However, a serious drawback of this family of pooling layers is their dimensionality explosion. Approximate pool- ing methods with compact properties have been explored towards resolving this weakness. Additionally, recent re- sults have shown that significant performance gains can be achieved by adding 1st order information and applying ma- trix normalization to regularize unstable higher order in- formation. However, combining compact pooling with ma- trix normalization and other order information has not been explored until now. In this paper, we unify bilinear pool- ing and the global Gaussian embedding layers through the empirical moment matrix. In addition, we propose a novel sub-matrix square-root layer, which can be used to normal- ize the output of the convolution layer directly and mitigate the dimensionality problem with off-the-shelf compact pool- ing methods. Our experiments on three widely used fine- grained classification datasets illustrate that our proposed architecture, MoNet,more »can achieve similar or better perfor- mance than with the state-of-art G 2 DeNet. Furthermore, when combined with compact pooling technique, MoNet ob- tains comparable performance with encoded features with 96% less dimensions.« less
  4. Bilinear pooling has been recently proposed as a feature encoding layer, which can be used after the convolutional layers of a deep network, to improve performance in multiple vision tasks. Different from conventional global average pooling or fully connected layer, bilinear pooling gathers 2nd order information in a translation invariant fashion. However, a serious drawback of this family of pooling layers is their dimensionality explosion. Approximate pooling methods with compact properties have been explored towards resolving this weakness. Additionally, recent results have shown that significant performance gains can be achieved by adding 1st order information and applying matrix normalization to regularize unstable higher order information. However, combining compact pooling with matrix normalization and other order information has not been explored until now. In this paper, we unify bilinear pooling and the global Gaussian embedding layers through the empirical moment matrix. In addition, we propose a novel sub-matrix square-root layer, which can be used to normalize the output of the convolution layer directly and mitigate the dimensionality problem with off-the-shelf compact pooling methods. Our experiments on three widely used finegrained classification datasets illustrate that our proposed architecture, MoNet, can achieve similar or better performance than with the state-of-art G2DeNet. Furthermore, whenmore »combined with compact pooling technique, MoNet obtains comparable performance with encoded features with 96% less dimensions.« less
  5. In this paper, we propose a secure multibiometric system that uses deep neural networks and error-correction coding. We present a feature-level fusion framework to generate a secure multibiometric template from each user’s multiple biometrics. Two fusion architectures, fully connected architecture and bilinear architecture, are implemented to develop a robust multibiometric shared representation. The shared representation is used to generate a cancelable biometric template that involves the selection of a different set of reliable and discriminative features for each user. This cancelable template is a binary vector and is passed through an appropriate error-correcting decoder to find a closest codeword and this codeword is hashed to generate the final secure template. The efficacy of the proposed approach is shown using a multimodal database where we achieve state-of-the-art matching performance, along with cancelability and security.