skip to main content

Title: Using Geometric Features to Represent Near-Contact Behavior in Robotic Grasping
In this paper we define two feature representations for grasping. These representations capture hand-object geometric relationships at the near-contact stage - before the fingers close around the object. Their benefits are: 1) They are stable under noise in both joint and pose variation. 2) They are largely hand and object agnostic, enabling direct comparison across different hand morphologies. 3) Their format makes them suitable for direct application of machine learning techniques developed for images. We validate the representations by: 1) Demonstrating that they can accurately predict the distribution of ε-metric values generated by kinematic noise. I.e., they capture much of the information inherent in contact points and force vectors without the corresponding instabilities. 2) Training a binary grasp success classifier on a real-world data set consisting of 588 grasps.
; ; ; ;
Award ID(s):
Publication Date:
Journal Name:
International Conference on Robotics and Applications
Page Range or eLocation-ID:
272772-277772 to 2777
Sponsoring Org:
National Science Foundation
More Like this
  1. We propose a visually-grounded library of behaviors approach for learning to manipulate diverse objects across varying initial and goal configurations and camera placements. Our key innovation is to disentangle the standard image-to-action mapping into two separate modules that use different types of perceptual input:(1) a behavior selector which conditions on intrinsic and semantically-rich object appearance features to select the behaviors that can successfully perform the desired tasks on the object in hand, and (2) a library of behaviors each of which conditions on extrinsic and abstract object properties, such as object location and pose, to predict actions to execute over time. The selector uses a semantically-rich 3D object feature representation extracted from images in a differential end-to-end manner. This representation is trained to be view-invariant and affordance-aware using self-supervision, by predicting varying views and successful object manipulations. We test our framework on pushing and grasping diverse objects in simulation as well as transporting rigid, granular, and liquid food ingredients in a real robot setup. Our model outperforms image-to-action mappings that do not factorize static and dynamic object properties. We further ablate the contribution of the selector's input and show the benefits of the proposed view-predictive, affordance-aware 3D visual object representations.
  2. While current vision algorithms excel at many challenging tasks, it is unclear how well they understand the physical dynamics of real-world environments. Here we introduce Physion, a dataset and benchmark for rigorously evaluating the ability to predict how physical scenarios will evolve over time. Our dataset features realistic simulations of a wide range of physical phenomena, including rigid and soft-body collisions, stable multi-object configurations, rolling, sliding, and projectile motion, thus providing a more comprehensive challenge than previous benchmarks. We used Physion to benchmark a suite of models varying in their architecture, learning objective, input-output structure, and training data. In parallel, we obtained precise measurements of human prediction behavior on the same set of scenarios, allowing us to directly evaluate how well any model could approximate human behavior. We found that vision algorithms that learn object-centric representations generally outperform those that do not, yet still fall far short of human performance. On the other hand, graph neural networks with direct access to physical state information both perform substantially better and make predictions that are more similar to those made by humans. These results suggest that extracting physical representations of scenes is the main bottleneck to achieving human-level and human-like physical understandingmore »in vision algorithms. We have publicly released all data and code to facilitate the use of Physion to benchmark additional models in a fully reproducible manner, enabling systematic evaluation of progress towards vision algorithms that understand physical environments as robustly as people do.« less
  3. Dynamic social interaction networks are an important abstraction to model time-stamped social interactions such as eye contact, speaking and listening between people. These networks typically contain informative while subtle patterns that reflect people’s social characters and relationship, and therefore attract the attentions of a lot of social scientists and computer scientists. Previous approaches on extracting those patterns primarily rely on sophisticated expert knowledge of psychology and social science, and the obtained features are often overly task-specific. More generic models based on representation learning of dynamic networks may be applied, but the unique properties of social interactions cause severe model mismatch and degenerate the quality of the obtained representations. Here we fill this gap by proposing a novel framework, termed TEmporal network-DIffusion Convolutional networks (TEDIC), for generic representation learning on dynamic social interaction networks. We make TEDIC a good fit by designing two components: 1) Adopt diffusion of node attributes over a combination of the original network and its complement to capture long-hop interactive patterns embedded in the behaviors of people making or avoiding contact; 2) Leverage temporal convolution networks with hierarchical set-pooling operation to flexibly extract patterns from different-length interactions scattered over a long time span. The design also endowsmore »TEDIC with certain self-explaining power. We evaluate TEDIC over five real datasets for four different social character prediction tasks including deception detection, dominance identification, nervousness detection and community detection. TEDIC not only consistently outperforms previous SOTA’s, but also provides two important pieces of social insight. In addition, it exhibits favorable societal characteristics by remaining unbiased to people from different regions. Our project website is:« less
  4. Drilling and milling operations are material removal processes involved in everyday conventional productions, especially in the high-speed metal cutting industry. The monitoring of tool information (wear, dynamic behavior, deformation, etc.) is essential to guarantee the success of product fabrication. Many methods have been applied to monitor the cutting tools from the information of cutting force, spindle motor current, vibration, as well as sound acoustic emission. However, those methods are indirect and sensitive to environmental noises. Here, the in-process imaging technique that can capture the cutting tool information while cutting the metal was studied. As machinists judge whether a tool is worn-out by the naked eye, utilizing the vision system can directly present the performance of the machine tools. We proposed a phase shifted strobo-stereoscopic method (Figure 1) for three-dimensional (3D) imaging. The stroboscopic instrument is usually applied for the measurement of fast-moving objects. The operation principle is as follows: when synchronizing the frequency of the light source illumination and the motion of object, the object appears to be stationary. The motion frequency of the target is transferring from the count information of the encoder signals from the working rotary spindle. If small differences are added to the frequency, the objectmore »appears to be slowly moving or rotating. This effect can be working as the source for the phase-shifting; with this phase information, the target can be whole-view 3D reconstructed by 360 degrees. The stereoscopic technique is embedded with two CCD cameras capturing images that are located bilateral symmetrically in regard to the target. The 3D scene is reconstructed by the location information of the same object points from both the left and right images. In the proposed system, an air spindle was used to secure the motion accuracy and drilling/milling speed. As shown in Figure 2, two CCDs with 10X objective lenses were installed on a linear rail with rotary stages to capture the machine tool bit raw picture for further 3D reconstruction. The overall measurement process was summarized in the flow chart (Figure 3). As the count number of encoder signals is related to the rotary speed, the input speed (unit of RPM) was set as the reference signal to control the frequency (f0) of the illumination of the LED. When the frequency was matched with the reference signal, both CCDs started to gather the pictures. With the mismatched frequency (Δf) information, a sequence of images was gathered under the phase-shifted process for a whole-view 3D reconstruction. The study in this paper was based on a 3/8’’ drilling tool performance monitoring. This paper presents the principle of the phase-shifted strobe-stereoscopic 3D imaging process. A hardware set-up is introduced, , as well as the 3D imaging algorithm. The reconstructed image analysis under different working speeds is discussed, the reconstruction resolution included. The uncertainty of the imaging process and the built-up system are also analyzed. As the input signal is the working speed, no other information from other sources is required. This proposed method can be applied as an on-machine or even in-process metrology. With the direct method of the 3D imaging machine vision system, it can directly offer the machine tool surface and fatigue information. This presented method can supplement the blank for determining the performance status of the machine tools, which further guarantees the fabrication process.« less
  5. Taxonomies consist of machine-interpretable semantics and provide valuable knowledge for many web applications. For example, online retailers (e.g., Amazon and eBay) use taxonomies for product recommendation, and web search engines (e.g., Google and Bing) leverage taxonomies to enhance query understanding. Enormous efforts have been made on constructing taxonomies eithermanually or semi-automatically. However, with the fast-growing volume of web content, existing taxonomies will become outdated and fail to capture emerging knowledge. Therefore, in many applications, dynamic expansions of an existing taxonomy are in great demand. In this paper, we study how to expand an existing taxonomy by adding a set of new concepts. We propose a novel self-supervised framework, named TaxoExpan, which automatically generates a set of ⟨query concept, anchor concept⟩ pairs from the existing taxonomy as training data. Using such self-supervision data, TaxoExpan learns a model to predict whether a query concept is the direct hyponym of an anchor concept. We develop two innovative techniques in TaxoExpan: (1) a position-enhanced graph neural network that encodes the local structure of an anchor concept in the existing taxonomy, and (2) a noise-robust training objective that enables the learned model to be insensitive to the label noise in the self-supervision data. Extensive experimentsmore »on three large-scale datasets from different domains demonstrate both the effectiveness and the efficiency of TaxoExpan for taxonomy expansion.« less