skip to main content

Search for: All records

Creators/Authors contains: "Sarkar, Sudeep"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Robot grasp typically follows five stages: object detection, object localisation, object pose estimation, grasp pose estimation, and grasp planning. We focus on object pose estimation. Our approach relies on three pieces of information: multiple views of the object, the camera’s extrinsic parameters at those viewpoints, and 3D CAD models of objects. The first step involves a standard deep learning backbone (FCN ResNet) to estimate the object label, semantic segmentation, and a coarse estimate of the object pose with respect to the camera. Our novelty is using a refinement module that starts from the coarse pose estimate and refines it by optimisation through differentiable rendering. This is a purely vision-based approach that avoids the need for other information such as point cloud or depth images. We evaluate our object pose estimation approach on the ShapeNet dataset and show improvements over the state of the art. We also show that the estimated object pose results in 99.65% grasp accuracy with the ground truth grasp candidates on the Object Clutter Indoor Dataset (OCID) Grasp dataset, as computed using standard practice.
    Free, publicly-accessible full text available October 23, 2023
  2. Event perception tasks such as recognizing and localizing actions in streaming videos are essential for scaling to real-world application contexts. We tackle the problem of learning actor-centered representations through the notion of continual hierarchical predictive learning to localize actions in streaming videos without the need for training labels and outlines for the objects in the video. We propose a framework driven by the notion of hierarchical predictive learning to construct actor-centered features by attention-based contextualization. The key idea is that predictable features or objects do not attract attention and hence do not contribute to the action of interest. Experiments on three benchmark datasets show that the approach can learn robust representations for localizing actions using only one epoch of training, i.e., a single pass through the streaming video. We show that the proposed approach outperforms unsupervised and weakly supervised baselines while offering competitive performance to fully supervised approaches. Additionally, we extend the model to multi-actor settings to recognize group activities while localizing the multiple, plausible actors. We also show that it generalizes to out-of-domain data with limited performance degradation.
    Free, publicly-accessible full text available October 1, 2023
  3. Graph-based representations are becoming increasingly popular for representing and analyzing video data, especially in object tracking and scene understanding applications. Accordingly, an essential tool in this approach is to generate statistical inferences for graphical time series associated with videos. This paper develops a Kalman-smoothing method for estimating graphs from noisy, cluttered, and incomplete data. The main challenge here is to find and preserve the registration of nodes (salient detected objects) across time frames when the data has noise and clutter due to false and missing nodes. First, we introduce a quotient-space representation of graphs that incorporates temporal registration of nodes, and we use that metric structure to impose a dynamical model on graph evolution. Then, we derive a Kalman smoother, adapted to the quotient space geometry, to estimate dense, smooth trajectories of graphs. We demonstrate this framework using simulated data and actual video graphs extracted from the Multiview Extended Video with Activities (MEVA) dataset. This framework successfully estimates graphs despite the noise, clutter, and missed detections.
    Free, publicly-accessible full text available October 1, 2023
  4. Practical robot applications that require grasping often fail due to failed grasp planning, and a good grasp quality measure is the key to successful grasp planning. In this paper, we developed a novel grasp quality measure that quantifies and evaluates grasp quality in real-time. To quantify the grasp quality, we compute a set of object movement features from analyzing the interaction between the gripper and the object’s projections in the image space. The normalizations and weights of the features are tuned to make practical and intuitive grasp quality predictions. To evaluate our grasp quality measure, we conducted a real robot grasping experiment with 1000 robot grasp trials on ten household objects to examine the relationship between our grasp scores and the actual robot grasping results. The results show that the average grasp success rate increases, and the average amount of undesired object movement decrease as the calculated grasp score increases, which validates our quality measure. We achieved a 100% grasp success rate from 100 grasps of the ten objects when using our grasp quality measure in planning top quality grasps.
  5. Vedaldi, A. ; Bischof, H. ; Brox, T. ; Frahm, JM. (Ed.)
    The problem of action localization involves locating the action in the video, both over time and spatially in the image. The current dominant approaches use supervised learning to solve this problem. They require large amounts of annotated training data, in the form of frame-level bounding box annotations around the region of interest. In this paper, we present a new approach based on continual learning that uses feature-level predictions for self-supervision. It does not require any training annotations in terms of frame-level bounding boxes. The approach is inspired by cognitive models of visual event perception that propose a prediction-based approach to event understanding. We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video and use this model to predict high-level features for the future frames. The prediction errors are used to learn the parameters of the models continuously. This self-supervised framework is not complicated as other approaches but is very effective in learning robust visual representations for both labeling and localization. It should be noted that the approach outputs in a streaming fashion, requiring only a single pass through the video, making it amenable for real-time processing. Wemore »demonstrate this on three datasets - UCF Sports, JHMDB, and THUMOS’13 and show that the proposed approach outperforms weakly-supervised and unsupervised baselines and obtains competitive performance compared to fully supervised baselines. Finally, we show that the proposed framework can generalize to egocentric videos and achieve state-of-the-art results on the unsupervised gaze prediction task.« less
  6. The role of perceptual organization in motion analysis has heretofore been minimal. In this work we demonstrate that the use of perceptual organization principles of temporal coherence (common fate) and spatial proximity can result in a robust motion segmentation algorithm that is able to handle drastic illumination changes, occlusion events, and multiple moving objects, without the use of object models. The adopted algorithm does not employ the traditional frame by frame motion analysis, but rather treats the image sequence as a single 3D spatio-temporal block of data. We describe motion using spatio-temporal surfaces, which we, in turn, describe as compositions of finite planar patches. These planar patches, referred to as temporal envelopes, capture the local nature of the motions. We detect these temporal envelopes using 3D-edge detection followed by Hough transform, and represent them with convex hulls. We present a graph-based method to group these temporal envelopes arising from one object based on Gestalt organizational principles. A probabilistic Bayesian network quantifies the saliencies of the relationships between temporal envelopes. We present results on sequences with multiple moving persons, significant occlusions, and scene illumination changes.
  7. In this paper we propose a method for increasing precision and reliability of elasticity analysis in complicated burn scar cases. The need for a technique that would help physicians by objectively assessing elastic properties of scars, motivated our original algorithm. This algorithm successfully employed active contours for tracking and finite element models for strain analysis. However, the previous approach considered only one normal area and one abnormal area within the region of interest, and scar shapes which were somewhat simplified. Most burn scars have rather complicated shapes and may include multiple regions with different elastic properties. Hence, we need a method capable of adequately addressing these characteristics. The new method can split the region into more than two localities with different material properties, select and quantify abnormal areas, and apply different forces if it is necessary for a better shape description of the scar. The method also demonstrates the application of scale and mesh refinement techniques in this important domain. It is accomplished by increasing the number of Finite Element Method (FEM) areas as well as the number of elements within the area. The method is successfully applied to elastic materials and real burn scar cases. We demonstrate all ofmore »the proposed techniques and investigate the behavior of elasticity function in a 3-D space. Recovered properties of elastic materials are compared with those obtained by a conventional mechanics-based approach. Scar ratings achieved with the method are correlated against the judgments of physicians.« less