

Search for: All records

Award ID contains: 1749376


  1. We propose BAPose, a novel bottom-up approach that achieves state-of-the-art results for multi-person pose estimation. Our end-to-end trainable framework leverages a disentangled multi-scale waterfall architecture and incorporates adaptive convolutions to infer keypoints more precisely in crowded scenes with occlusions. The multi-scale representations, obtained by the disentangled waterfall module in BAPose, leverage the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Our results on the challenging COCO and CrowdPose datasets demonstrate that BAPose is an efficient and robust framework for multi-person pose estimation, significantly improving state-of-the-art accuracy.
  2. We propose UniPose+, a unified framework for 2D and 3D human pose estimation in images and videos. The UniPose+ architecture leverages multi-scale feature representations to increase the effectiveness of backbone feature extractors, with no significant increase in network size and no postprocessing. Current pose estimation methods rely heavily on statistical postprocessing or predefined anchor poses for joint localization. The UniPose+ framework incorporates contextual information across scales and joint localization with Gaussian heatmap modulation at the decoder output to estimate 2D and 3D human pose in a single stage with state-of-the-art accuracy, without relying on predefined anchor poses. The multi-scale representations allowed by the waterfall module in the UniPose+ framework leverage the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Our results on multiple datasets demonstrate that UniPose+, with an HRNet, ResNet or SENet backbone and waterfall module, is a robust and efficient architecture for single-person 2D and 3D pose estimation in single images and videos.
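As an illustration of the heatmap representation mentioned in the UniPose+ entry (item 2), the following is a minimal sketch of how a joint location is commonly encoded as a 2D Gaussian heatmap and recovered by taking the peak. It is not the authors' modulation step: the function names, the heatmap size, and the value of `sigma` are assumptions chosen for the example.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=2.0):
    """Return a height x width grid with a Gaussian bump centred at (cx, cy).

    sigma is a hypothetical spread value, not taken from the paper.
    """
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def decode_keypoint(heatmap):
    """Recover (x, y, score) as the location and value of the heatmap peak."""
    best_x, best_y, best_val = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best_val:
                best_x, best_y, best_val = x, y, val
    return best_x, best_y, best_val


# Encode a hypothetical joint at pixel (20, 31) and decode it again.
heatmap = gaussian_heatmap(width=64, height=48, cx=20, cy=31)
x, y, score = decode_keypoint(heatmap)
print(x, y, score)
```

A network trained with such targets outputs one heatmap per joint, so localization reduces to a peak search rather than regression against predefined anchor poses.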
  3. We propose GourmetNet, a single-pass, end-to-end trainable network for food segmentation that achieves state-of-the-art performance. Food segmentation is an important problem as the first step for nutrition monitoring, food volume and calorie estimation. Our novel architecture incorporates both channel attention and spatial attention information in an expanded multi-scale feature representation using our advanced Waterfall Atrous Spatial Pooling module. GourmetNet refines the feature extraction process by merging features from multiple levels of the backbone through the two attention modules. The refined features are processed with the advanced multi-scale waterfall module that combines the benefits of cascade filtering and pyramid representations without requiring a separate decoder or post-processing. Our experiments on two food datasets show that GourmetNet significantly outperforms existing state-of-the-art methods.
  4. Tescher, Andrew G. ; Ebrahimi, Touradj (Ed.)
    Vehicle pose estimation is useful for applications such as self-driving cars, traffic monitoring, and scene analysis. Recent developments in computer vision and deep learning have achieved significant progress in human pose estimation, but little of this work has been applied to vehicle pose. We propose VehiPose, an efficient architecture for vehicle pose estimation, based on a multi-scale deep learning approach that achieves high-accuracy vehicle pose estimation while maintaining manageable network complexity and modularity. The VehiPose architecture combines an encoder-decoder architecture with a waterfall atrous convolution module for multi-scale feature representation. Our approach aims to reduce the loss due to successive pooling layers and preserve the multi-scale contextual and spatial information in the encoder feature representations. The waterfall module generates multi-scale features, as it leverages the efficiency of progressive filtering while maintaining wider fields-of-view through the concatenation of multiple features. This multi-scale approach results in a robust vehicle pose estimation architecture that incorporates contextual information across scales and performs the localization of vehicle keypoints in an end-to-end trainable network.
  5. The PoseASL dataset consists of color and depth videos collected from ASL signers at the Linguistic and Assistive Technologies Laboratory under the direction of Matt Huenerfauth, as part of a collaborative research project with researchers at the Rochester Institute of Technology, Boston University, and the University of Pennsylvania.
    Access: After becoming an authorized user of Databrary, please contact Matt Huenerfauth if you have difficulty accessing this volume.
    We have collected a new dataset consisting of color and depth videos of fluent American Sign Language signers performing sequences of ASL signs and sentences. Given interest among sign-recognition and other computer-vision researchers in red-green-blue-depth (RGBD) video, we release this dataset for use by the research community. In addition to the video files, we share depth data files from a Kinect v2 sensor, as well as additional motion-tracking files produced through post-processing of this data.
    Organization of the Dataset: The dataset is organized into sub-folders, with codenames such as "P01" or "P16". These codenames refer to specific human signers who were recorded in this dataset. Please note that there was no participant P11 or P14; those numbers were accidentally skipped during the process of making appointments to collect video stimuli.
    Task: During the recording session, the participant was met by a member of our research team who was a native ASL signer. No other individuals were present during the data collection session. After signing the informed consent and video release document, participants responded to a demographic questionnaire. Next, the data-collection session consisted of English word stimuli and cartoon videos. The recording session began with showing participants stimuli consisting of slides that displayed English words and photos of items, and participants were asked to produce the sign for each (PDF included in materials subfolder).
    Next, participants viewed three videos of short animated cartoons, which they were asked to recount in ASL:
    - Canary Row, Warner Brothers Merrie Melodies 1950 (the 7-minute video divided into seven parts)
    - Mr. Koumal Flies Like a Bird, Studio Animovaneho Filmu 1969
    - Mr. Koumal Battles his Conscience, Studio Animovaneho Filmu 1971
    The word list and cartoons were selected as they are identical to the stimuli used in the collection of the Nicaraguan Sign Language video corpora; see: Senghas, A. (1995). Children’s Contribution to the Birth of Nicaraguan Sign Language. Doctoral dissertation, Department of Brain and Cognitive Sciences, MIT.
    Demographics: All 14 of our participants were fluent ASL signers. As screening, we asked our participants: "Did you use ASL at home growing up, or did you attend a school as a very young child where you used ASL?" All the participants responded affirmatively to this question. A total of 14 DHH participants were recruited on the Rochester Institute of Technology campus. Participants included 7 men and 7 women, aged 21 to 35 (median = 23.5). All of our participants reported that they began using ASL when they were 5 years old or younger, with 8 reporting ASL use since birth, and 3 others reporting ASL use since age 18 months.
    Filetypes: *.avi, *_dep.bin: The PoseASL dataset was captured using a Kinect 2.0 RGBD camera. The output of this camera system includes multiple channels: RGB, depth, skeleton joints (25 joints for every video frame), and HD face (1,347 points). The video resolution is 1920 x 1080 pixels for the RGB channel and 512 x 424 pixels for the depth channel. Due to limitations in the acceptable filetypes for sharing on Databrary, it was not permitted to share the binary *_dep.bin files directly produced by the Kinect v2 camera system on the Databrary platform. If your research requires the original binary *_dep.bin files, then please contact Matt Huenerfauth.
    *_face.txt, *_HDface.txt, *_skl.txt: To make it easier for future researchers to make use of this dataset, we have also performed some post-processing of the Kinect data. To extract the skeleton coordinates from the RGB videos, we used the OpenPose system, which is capable of detecting body, hand, facial, and foot keypoints of multiple people on single images in real time. The output of OpenPose includes estimation of 70 keypoints for the face, including eyes, eyebrows, nose, mouth and face contour. The software also estimates 21 keypoints for each of the hands (Simon et al., 2017), including 3 keypoints for each finger, as shown in Figure 2. Additionally, there are 25 keypoints estimated for the body pose (and feet) (Cao et al., 2017; Wei et al., 2016).
    Reporting Bugs or Errors: Please contact Matt Huenerfauth to report any bugs or errors that you identify in the corpus. We appreciate your help in improving the quality of the corpus over time by identifying any errors.
    Acknowledgement: This material is based upon work supported by the National Science Foundation under award 1749376: "Collaborative Research: Multimethod Investigation of Articulatory and Perceptual Constraints on Natural Language Evolution."
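For readers working with the PoseASL entry (item 5), the sketch below enumerates the participant sub-folder codenames and splits one frame of keypoints into the groups the description lists (25 body, 70 face, 21 per hand). Several things here are assumptions, not documented facts: the range P01 to P16 is inferred from 14 participants with two skipped numbers, and the flat `(x, y, confidence)` layout and group order are hypothetical, since the description does not specify the on-disk format of the `*_skl.txt` or `*_face.txt` files.

```python
# Participant numbers the description says were accidentally skipped.
SKIPPED = {11, 14}

# Assumed range P01..P16 (inferred: 14 participants plus 2 skipped numbers).
participant_codes = [f"P{n:02d}" for n in range(1, 17) if n not in SKIPPED]

# Keypoint counts per group, as stated in the dataset description.
KEYPOINT_COUNTS = {"body": 25, "face": 70, "left_hand": 21, "right_hand": 21}


def split_keypoints(flat, counts=KEYPOINT_COUNTS, values_per_point=3):
    """Split a flat list of numbers into named groups of keypoint tuples.

    Hypothetical layout: groups appear in the order of `counts`, and each
    keypoint is stored as `values_per_point` consecutive numbers.
    """
    expected = sum(counts.values()) * values_per_point
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    groups = {}
    offset = 0
    for name, n_points in counts.items():
        end = offset + n_points * values_per_point
        chunk = flat[offset:end]
        groups[name] = [
            tuple(chunk[i:i + values_per_point])
            for i in range(0, len(chunk), values_per_point)
        ]
        offset = end
    return groups


# One synthetic frame: 137 keypoints x 3 values each.
frame = [0.0] * (sum(KEYPOINT_COUNTS.values()) * 3)
groups = split_keypoints(frame)
print(participant_codes)
print({name: len(points) for name, points in groups.items()})
```

Before relying on this layout, check it against an actual file from the volume; the length check in `split_keypoints` will raise an error if the assumed counts do not match.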
  6. We propose UniPose, a unified framework for human pose estimation, based on our “Waterfall” Atrous Spatial Pooling architecture, that achieves state-of-the-art results on several pose estimation metrics. Current pose estimation methods utilizing standard CNN architectures rely heavily on statistical postprocessing or predefined anchor poses for joint localization. UniPose incorporates contextual segmentation and joint localization to estimate the human pose in a single stage, with high accuracy, without relying on statistical postprocessing methods. The Waterfall module in UniPose leverages the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Additionally, our method is extended to UniPoseLSTM for multi-frame processing and achieves state-of-the-art results for temporal pose estimation in video. Our results on multiple datasets demonstrate that UniPose, with a ResNet backbone and Waterfall module, is a robust and efficient architecture for pose estimation, obtaining state-of-the-art results in single-person pose detection for both single images and videos.
  7. Monocular depth estimation is an important task in scene understanding with applications to pose, segmentation and autonomous navigation. Deep learning methods relying on multilevel features are currently used for extracting local information that is used to infer depth from a single RGB image. We present an efficient architecture that utilizes the features from multiple levels with fewer connections compared to previous networks. Our model achieves comparable scores for monocular depth estimation with lower memory requirements and computational burden.
  8. We propose a new efficient architecture for semantic segmentation, based on a “Waterfall” Atrous Spatial Pooling architecture, that achieves a considerable accuracy increase while decreasing the number of network parameters and memory footprint. The proposed Waterfall architecture leverages the efficiency of progressive filtering in the cascade architecture while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Additionally, our method does not rely on a postprocessing stage with Conditional Random Fields, which further reduces complexity and required training time. We demonstrate that the Waterfall approach with a ResNet backbone is a robust and efficient architecture for semantic segmentation, obtaining state-of-the-art results on the Pascal VOC and Cityscapes datasets with a significant reduction in the number of parameters.
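The trade-off described in the Waterfall entry (item 8), cascade efficiency versus pyramid fields-of-view, can be made concrete with some simple arithmetic. The sketch below compares branch-by-branch fields-of-view and 3x3 convolution weight counts for a parallel (pyramid) layout and a cascaded (waterfall-style) layout. The dilation rates, channel widths, stride of 1, and the omission of biases and 1x1 convolutions are all assumptions made for illustration; none of these figures come from the abstract.

```python
def effective_kernel(kernel_size, dilation):
    """Span in pixels covered by one dilated (atrous) convolution kernel."""
    return dilation * (kernel_size - 1) + 1


def pyramid_fields_of_view(rates, kernel_size=3):
    """Parallel branches: each branch sees the input through one dilated conv."""
    return [effective_kernel(kernel_size, r) for r in rates]


def waterfall_fields_of_view(rates, kernel_size=3):
    """Cascaded branches: each conv filters the previous branch's output,
    so fields-of-view accumulate (stride 1 assumed)."""
    fov, result = 1, []
    for r in rates:
        fov += effective_kernel(kernel_size, r) - 1
        result.append(fov)
    return result


def conv_weights(c_in, c_out, kernel_size=3):
    """Weight count of one convolution layer, ignoring biases."""
    return kernel_size * kernel_size * c_in * c_out


RATES = (6, 12, 18, 24)   # assumed dilation rates
C_IN, C_OUT = 2048, 256   # hypothetical backbone and branch channel widths

pyramid_fov = pyramid_fields_of_view(RATES)
waterfall_fov = waterfall_fields_of_view(RATES)

# Pyramid: every branch reads the wide backbone features.
pyramid_params = len(RATES) * conv_weights(C_IN, C_OUT)
# Waterfall: only the first branch reads the backbone; the rest read a branch.
waterfall_params = conv_weights(C_IN, C_OUT) + (len(RATES) - 1) * conv_weights(C_OUT, C_OUT)

print(pyramid_fov, waterfall_fov)
print(pyramid_params, waterfall_params)
```

Under these assumptions the cascaded layout needs fewer weights because only its first branch operates on the wide backbone features, while its later branches still reach fields-of-view at least as large as the parallel ones.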