skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Hyperbolic Contrastive Learning for Visual Representations beyond Objects
Although self-/un-supervised methods have led to rapid progress in visual representation learning, these methods generally treat objects and scenes using the same lens. In this paper, we focus on learning representations for objects and scenes that preserve the structure among them. Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure based on their compositionality. To exploit such a structure, we propose a contrastive learning framework where a Euclidean loss is used to learn object representations and a hyperbolic loss is used to encourage representations of scenes to lie close to representations of their constituent objects in a hyperbolic space. This novel hyperbolic objective encourages the scene-object hypernymy among the representations by optimizing the magnitude of their norms. We show that when pretraining on the COCO and OpenImages datasets, the hyperbolic loss improves downstream performance of several baselines across multiple datasets and tasks, including image classification, object detection, and semantic segmentation. We also show that the properties of the learned representations allow us to solve various vision tasks that involve the interaction between scenes and objects in a zero-shot fashion.  more » « less
Award ID(s):
2213335 1910132
PAR ID:
10468282
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
IEEE
Date Published:
ISSN:
10636919
ISBN:
979-8-3503-0129-8
Format(s):
Medium: X
Location:
Vancouver Canada
Sponsoring Org:
National Science Foundation
More Like this
  1. We propose a method for autonomously learning an object-centric representation of a continuous and high-dimensional environment that is suitable for planning. Such representations can immediately be transferred between tasks that share the same types of objects, resulting in agents that require fewer samples to learn a model of a new task. We first demonstrate our approach on a 2D crafting domain consisting of numerous objects where the agent learns a compact, lifted representation that generalises across objects. We then apply it to a series of Minecraft tasks to learn object-centric representations and object types---directly from pixel data---that can be leveraged to solve new tasks quickly. The resulting learned representations enable the use of a task-level planner, resulting in an agent capable of transferring learned representations to form complex, long-term plans. 
    more » « less
  2. Social recommendation has achieved great success in many domains including e-commerce and location-based social networks. Existing methods usually explore the user-item interactions or user-user connections to predict users’ preference behaviors. However, they usually learn both user and item representations in Euclidean space, which has large limitations for exploring the latent hierarchical property in the data. In this article, we study a novel problem of hyperbolic social recommendation, where we aim to learn the compact but strong representations for both users and items. Meanwhile, this work also addresses two critical domain-issues, which are under-explored. First, users often make trade-offs with multiple underlying aspect factors to make decisions during their interactions with items. Second, users generally build connections with others in terms of different aspects, which produces different influences with aspects in social network. To this end, we propose a novel graph neural network (GNN) framework with multiple aspect learning, namely, HyperSoRec. Specifically, we first embed all users, items, and aspects into hyperbolic space with superior representations to ensure their hierarchical properties. Then, we adapt a GNN with novel multi-aspect message-passing-receiving mechanism to capture different influences among users. Next, to characterize the multi-aspect interactions of users on items, we propose an adaptive hyperbolic metric learning method by introducing learnable interactive relations among different aspects. Finally, we utilize the hyperbolic translational distance to measure the plausibility in each user-item pair for recommendation. Experimental results on two public datasets clearly demonstrate that our HyperSoRec not only achieves significant improvement for recommendation performance but also shows better representation ability in hyperbolic space with strong robustness and reliability. 
    more » « less
  3. Objects and object complexes in 3D, as well as those in 2D, have many possible representations. Among them skeletal representations have special advantages and some limitations. For the special form of skeletal representation called “s-reps,” these advantages include strong suitability for representing slabular object populations and statistical applications on these populations. Accomplishing these statistical applications is best if one recognizes that s-reps live on a curved shape space. Here we will lay out the definition of s-reps, their advantages and limitations, their mathematical properties, methods for fitting s-reps to single- and multi-object boundaries, methods for measuring the statistics of these object and multi-object representations, and examples of such applications involving statistics. While the basic theory, ideas, and programs for the methods are described in this paper and while many applications with evaluations have been produced, there remain many interesting open opportunities for research on comparisons to other shape representations, new areas of application and further methodological developments, many of which are explicitly discussed here. 
    more » « less
  4. Matni, N; Morari, M; Pappas, G.J. (Ed.)
    One of the long-term objectives of Machine Learning is to endow machines with the capacity of structuring and interpreting the world as we do. This is particularly challenging in scenes involving time series, such as video sequences, since seemingly different data can correspond to the same underlying dynamics. Recent approaches seek to decompose video sequences into their composing objects, attributes and dynamics in a self-supervised fashion, thus simplifying the task of learning suitable features that can be used to analyze each component. While existing methods can successfully disentangle dynamics from other components, there have been relatively few efforts in learning parsimonious representations of these underlying dynamics. In this paper, motivated by recent advances in non-linear identification, we propose a method to decompose a video into moving objects, their attributes and the dynamic modes of their trajectories. We model video dynamics as the output of a Koopman operator to be learned from the available data. In this context, the dynamic information contained in the scene is encapsulated in the eigenvalues and eigenvectors of the Koopman operator, providing an interpretable and parsimonious representation. We show that such decomposition can be used for instance to perform video analytics, predict future frames or generate synthetic video. We test our framework in a variety of datasets that encompass different dynamic scenarios, while illustrating the novel features that emerge from our dynamic modes decomposition: Video dynamics interpretation and user manipulation at test-time. We successfully forecast challenging object trajectories from pixels, achieving competitive performance while drawing useful insights. 
    more » « less
  5. 3D object recognition accuracy can be improved by learning the multi-scale spatial features from 3D spatial geometric representations of objects such as point clouds, 3D models, surfaces, and RGB-D data. Current deep learning approaches learn such features either using structured data representations (voxel grids and octrees) or from unstructured representations (graphs and point clouds). Learning features from such structured representations is limited by the restriction on resolution and tree depth while unstructured representations creates a challenge due to non-uniformity among data samples. In this paper, we propose an end-to-end multi-level learning approach on a multi-level voxel grid to overcome these drawbacks. To demonstrate the utility of the proposed multi-level learning, we use a multi-level voxel representation of 3D objects to perform object recognition. The multi-level voxel representation consists of a coarse voxel grid that contains volumetric information of the 3D object. In addition, each voxel in the coarse grid that contains a portion of the object boundary is subdivided into multiple fine-level voxel grids. The performance of our multi-level learning algorithm for object recognition is comparable to dense voxel representations while using significantly lower memory. 
    more » « less