Title: Aligned to the Object, Not to the Image: A Unified Pose-Aligned Representation for Fine-Grained Recognition
Dramatic appearance variation due to pose constitutes a great challenge in fine-grained recognition, one which recent methods using attention mechanisms or second-order statistics fail to adequately address. Modern CNNs typically lack an explicit understanding of object pose and are instead confused by entangled pose and appearance. In this paper, we propose a unified object representation built from pose-aligned regions of varied spatial sizes. Rather than representing an object by regions aligned to image axes, the proposed representation characterizes appearance relative to the object's pose using pose-aligned patches whose features are robust to variations in pose, scale and viewing angle. We propose an algorithm that performs pose estimation and forms the unified object representation as the concatenation of pose-aligned region features, which is then fed into a classification network. The proposed algorithm attains state-of-the-art results on two fine-grained datasets, notably 89.2% on the widely-used CUB-200 dataset and 87.9% on the much larger NABirds dataset. Our success relative to competing methods shows the critical importance of disentangling pose and appearance for continued progress in fine-grained recognition.
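The idea of sampling regions relative to the object's pose rather than the image axes, then concatenating the per-region features, can be illustrated with a minimal sketch. This is not the paper's implementation: the keypoint names, the function names `pose_aligned_patch` and `unified_representation`, the nearest-neighbour sampling, and the use of raw pixel values in place of CNN features are all assumptions made for illustration, and the keypoints are assumed to be given rather than estimated.

```python
import numpy as np

def pose_aligned_patch(image, p, q, out_size=8, scale=1.0):
    """Sample an out_size x out_size patch whose x-axis follows the segment p -> q.

    image: 2-D array (H, W). p, q: keypoints as (x, y) pixel coordinates.
    The patch side length is scale * |q - p|, so it follows the part's size.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    axis = q - p
    length = np.linalg.norm(axis)
    if length == 0:
        raise ValueError("keypoints must be distinct")
    center = (p + q) / 2.0
    u = axis / length                # unit vector along the part
    v = np.array([-u[1], u[0]])      # unit vector perpendicular to it
    half = scale * length / 2.0
    t = np.linspace(-half, half, out_size)
    gx, gy = np.meshgrid(t, t)       # offsets along u (columns) and v (rows)
    xs = center[0] + gx * u[0] + gy * v[0]
    ys = center[1] + gx * u[1] + gy * v[1]
    # Nearest-neighbour lookup, clamped to the image bounds (an assumption;
    # bilinear sampling would be the more usual choice).
    xi = np.clip(np.rint(xs).astype(int), 0, image.shape[1] - 1)
    yi = np.clip(np.rint(ys).astype(int), 0, image.shape[0] - 1)
    return image[yi, xi]

def unified_representation(image, keypoints, pairs, out_size=8):
    """Concatenate flattened pose-aligned patches, one per keypoint pair.

    Raw pixels stand in for the learned region features described above.
    """
    feats = [
        pose_aligned_patch(image, keypoints[a], keypoints[b], out_size).ravel()
        for a, b in pairs
    ]
    return np.concatenate(feats)
```

Because each patch is defined in the coordinate frame of a keypoint pair, rotating the image together with its keypoints leaves the sampled patch unchanged in this sketch, which is the sense in which such regions are aligned to the object rather than to the image.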
Award ID(s):
1651832
PAR ID:
10092596
Journal Name:
2019 IEEE Winter Conference on Applications of Computer Vision (WACV)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Facial activity is the most direct signal for perceiving emotional states in people. Emotion analysis from facial displays has attracted increasing attention because of its wide range of applications, from human-centered computing to neuropsychiatry. Recently, image representation based on sparse coding has shown promising results in facial expression recognition. In this paper, we introduce a novel image representation for facial expression analysis. Specifically, we propose to use the histograms of nonnegative sparse coded image features to represent a facial image. In order to capture fine appearance variations caused by facial expression, a logarithmic transformation is further applied to each nonnegative sparse coded feature. In addition, the proposed Histograms of Log-Transformed Nonnegative Sparse Coding (HLNNSC) features are calculated and organized in a pyramid-like structure such that the spatial relationships among the features are captured and utilized to enhance the performance of facial expression recognition. Extensive experiments on the Cohn-Kanade database show that the proposed approach yields a significant improvement in facial expression recognition and outperforms the other sparse coding based baseline approaches. Furthermore, experimental results on the GEMEP-FERA2011 dataset demonstrate that the proposed approach is promising for recognition in less controlled and thus more challenging environments.
  2. 3D object recognition accuracy can be improved by learning the multi-scale spatial features from 3D spatial geometric representations of objects such as point clouds, 3D models, surfaces, and RGB-D data. Current deep learning approaches learn such features either from structured data representations (voxel grids and octrees) or from unstructured representations (graphs and point clouds). Learning features from structured representations is limited by restrictions on resolution and tree depth, while unstructured representations create a challenge due to non-uniformity among data samples. In this paper, we propose an end-to-end multi-level learning approach on a multi-level voxel grid to overcome these drawbacks. To demonstrate the utility of the proposed multi-level learning, we use a multi-level voxel representation of 3D objects to perform object recognition. The multi-level voxel representation consists of a coarse voxel grid that contains volumetric information of the 3D object. In addition, each voxel in the coarse grid that contains a portion of the object boundary is subdivided into multiple fine-level voxel grids. The performance of our multi-level learning algorithm for object recognition is comparable to dense voxel representations while using significantly lower memory.
  3. A training process for facial expression recognition is usually performed sequentially in three individual stages: feature learning, feature selection, and classifier construction. Extensive empirical studies are needed to search for an optimal combination of feature representation, feature set, and classifier to achieve good recognition performance. This paper presents a novel Boosted Deep Belief Network (BDBN) for performing the three training stages iteratively in a unified loopy framework. Through the proposed BDBN framework, a set of features that is effective in characterizing expression-related facial appearance/shape changes can be learned and selected to form a boosted strong classifier in a statistical way. As learning continues, the strong classifier is improved iteratively and, more importantly, the discriminative capabilities of the selected features are strengthened according to their relative importance to the strong classifier via a joint fine-tuning process in the BDBN framework. Extensive experiments on two public databases showed that the BDBN framework yielded dramatic improvements in facial expression analysis.
  4. We propose FineGAN, a novel unsupervised GAN framework, which disentangles the background, object shape, and object appearance to hierarchically generate images of fine-grained object categories. To disentangle the factors without any supervision, our key idea is to use information theory to associate each factor to a latent code, and to condition the relationships between the codes in a specific way to induce the desired hierarchy. Through extensive experiments, we show that FineGAN achieves the desired disentanglement to generate realistic and diverse images belonging to fine-grained classes of birds, dogs, and cars. Using FineGAN's automatically learned features, we also cluster real images as a first attempt at solving the novel problem of unsupervised fine-grained object category discovery. 
  5. In recent years, face recognition systems have achieved exceptional success due to promising advances in deep learning architectures. However, they still fail to achieve the expected accuracy when matching profile images against a gallery of frontal images. Current approaches either perform pose normalization (i.e., frontalization) or disentangle pose information for face recognition. We instead propose a new approach to utilize pose as auxiliary information via an attention mechanism. In this paper, we hypothesize that pose-attended information using an attention mechanism can guide contextual and distinctive feature extraction from profile faces, which in turn benefits representation learning in an embedded domain. To achieve this, first, we design a unified coupled profile-to-frontal face recognition network. It learns the mapping from faces to a compact embedding subspace via a class-specific contrastive loss. Second, we develop a novel pose attention block (PAB) to specially guide the pose-agnostic feature extraction from profile faces. To be more specific, PAB is designed to explicitly help the network to focus on important features along both “channel” and “spatial” dimensions while learning discriminative yet pose-invariant features in an embedding subspace. To validate the effectiveness of our proposed method, we conduct experiments on both controlled and in-the-wild benchmarks including Multi-PIE, CFP, and IJB-C, and show superiority over the state-of-the-art.