Title: Invariance Through Latent Alignment
A robot’s deployment environment often involves perceptual changes that differ from what it has experienced during training. Standard practices such as data augmentation attempt to bridge this gap by augmenting source images in an effort to extend the support of the training distribution to better cover what the agent might experience at test time. In many cases, however, it is impossible to know the test-time distribution shift a priori, making these schemes infeasible. In this paper, we introduce a general approach, called Invariance through Latent Alignment (ILA), that improves the test-time performance of a visuomotor control policy in deployment environments with unknown perceptual variations. ILA performs unsupervised adaptation at deployment time by matching the distribution of latent features on the target domain to the agent’s prior experience, without relying on paired data. Although simple, we show that this idea leads to surprising improvements on a variety of challenging adaptation scenarios, including changes in lighting conditions, the content in the scene, and camera poses. We present results on calibrated control benchmarks in simulation (the distractor control suite) and on a physical robot under a sim-to-real setup. Video and code are available at: https://invariance-through-latent-alignment.github.io
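As a rough illustration of what matching latent feature distributions without paired data can look like, the sketch below fits a per-dimension affine map that brings the mean and standard deviation of target-domain latents in line with those of source-domain latents. This simple moment matching is a stand-in chosen for brevity, not the ILA procedure described in the paper, and the function names (`fit_moment_alignment`, `align`) and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np


def fit_moment_alignment(source_latents, target_latents, eps=1e-8):
    """Fit a per-dimension affine map from target latents to source statistics.

    Only summary statistics of each set are used, so the two sets do not
    need to be paired or even have the same number of samples.
    """
    src_mean = source_latents.mean(axis=0)
    src_std = source_latents.std(axis=0)
    tgt_mean = target_latents.mean(axis=0)
    tgt_std = target_latents.std(axis=0)
    scale = src_std / (tgt_std + eps)      # rescale each latent dimension
    shift = src_mean - scale * tgt_mean    # then re-center it
    return scale, shift


def align(latents, scale, shift):
    """Apply the fitted affine map to a batch of latent vectors."""
    return latents * scale + shift


# Hypothetical data: "source" stands in for latents seen during training,
# "target" for latents produced under a perceptual shift at deployment.
rng = np.random.default_rng(0)
source = rng.normal(loc=0.0, scale=1.0, size=(2000, 8))
target = rng.normal(loc=0.5, scale=2.0, size=(1500, 8))  # unpaired, different size

scale, shift = fit_moment_alignment(source, target)
aligned = align(target, scale, shift)
```

After the map is applied, the aligned target latents share the source latents' per-dimension mean and standard deviation, so a policy head trained on source latents sees inputs with familiar first- and second-order statistics. Richer shifts would generally need a more expressive alignment than this affine map.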
Award ID(s):
1830660
PAR ID:
10346055
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of Robotics: Science and Systems (RSS)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Many applications of representation learning, such as privacy preservation, algorithmic fairness, and domain adaptation, desire explicit control over the semantic information being discarded. This goal is formulated as satisfying two objectives: maximizing utility for predicting a target attribute while simultaneously being invariant (independent) to a known semantic attribute. Solutions to invariant representation learning (IRepL) problems lead to a trade-off between utility and invariance when the two objectives compete. While existing works study bounds on this trade-off, two questions remain outstanding: 1) What is the exact trade-off between utility and invariance? and 2) What are the encoders (mapping the data to a representation) that achieve the trade-off, and how can we estimate them from training data? This paper addresses these questions for IRepL in reproducing kernel Hilbert spaces (RKHSs). Under the assumption that the distribution of a low-dimensional projection of high-dimensional data is approximately normal, we derive a closed-form solution for the global optima of the underlying optimization problem for encoders in RKHSs. This yields closed formulae for a near-optimal trade-off, the corresponding optimal representation dimensionality, and the corresponding encoder(s). We also numerically quantify the trade-off on representative problems and compare it to the trade-offs achieved by baseline IRepL algorithms.
  2. Vision–language models learn visual concepts from the supervision of natural language. This can significantly enhance the generalizability of real-time intelligent sensing, such as analyzing camera-captured real-time images for visually impaired users. However, adapting vision–language models to distribution shifts at test time, caused by factors such as lighting or weather changes, remains challenging. In particular, most existing test-time adaptation methods rely on gradient-based fine-tuning and backpropagation, making them computationally expensive and unsuitable for real-time applications. To address this challenge, the Training-Free Dynamic Adapter (TDA) has recently been introduced as a lightweight alternative that uses a dynamic key–value cache and pseudo-label refinement for test-time adaptation without backpropagation. Building on this, we propose TDA-L, a new framework that integrates Low-Rank Adaptation (LoRA) to reduce the size of feature representations and the related computational overhead at test time using pre-learned low-rank matrices. TDA-L applies LoRA transformations to both query and cached features during inference, cost-efficiently improving robustness to distribution shifts while maintaining the training-free nature of TDA. Experimental results on seven benchmarks show that TDA-L maintains accuracy while achieving lower latency, less memory consumption, and higher throughput, making it well-suited for AI-based real-time sensing.
  3. Recent work on perceptual learning for speech has suggested that while high-variability training typically results in generalization, low-variability exposure can sometimes be sufficient for cross-talker generalization. We tested predictions of a similarity-based account, according to which generalization depends on training-test talker similarity rather than on exposure to variability. We compared perceptual adaptation to second-language (L2) speech following single- or multiple-talker training with a round-robin design in which four L2 English talkers from four different first-language (L1) backgrounds served as both training and test talkers. After exposure to 60 L2 English sentences in one training session, cross-talker/cross-accent generalization was possible (but not guaranteed) following either multiple- or single-talker training, with variation across training-test talker pairings. Contrary to predictions of the similarity-based account, adaptation was not consistently better for identical than for mismatched training-test talker pairings, and generalization patterns were asymmetrical across training-test talker pairs. Acoustic analyses also revealed a dissociation between phonetic similarity and cross-talker/cross-accent generalization. Notably, variation in adaptation and generalization was related to variation in training phase intelligibility. Together with prior evidence, these data suggest that perceptual learning for speech may benefit from some combination of exposure to talker variability, training-test similarity, and high training phase intelligibility.
  4. Recent advancements in wearable physiological sensing and artificial intelligence have enabled remarkable progress in monitoring workers’ health on construction sites. However, scalable application is still challenging. One of the major complications for deployment has been the distribution shift observed in the physiological data obtained through sensors. This study develops a deep adversarial domain adaptation framework to adapt to out-of-distribution data (ODD) in a wearable physiological device based on photoplethysmography (PPG). The domain adaptation framework is developed and validated with reference to a PPG-based heart rate predictor. A heart rate predictor module comprising a feature-generating encoder and a predictor is initially trained with data from a given training domain. An unsupervised adversarial domain adaptation method is then implemented for the test domain. In the domain adaptation process, the encoder network is adapted to generate domain-invariant features for the test domain using discriminator-based adversarial optimization. The results demonstrate that this approach can effectively accomplish domain adaptation, as evidenced by a 27.68% reduction in heart rate prediction error for the test domain. The proposed framework offers potential for scaled adaptation on the jobsite by addressing the ODD problem.
  5. This paper presents a novel zero-shot learning approach towards personalized speech enhancement through the use of a sparsely active ensemble model. Optimizing speech denoising systems towards a particular test-time speaker can improve performance and reduce run-time complexity. However, test-time model adaptation may be challenging if collecting data from the test-time speaker is not possible. To this end, we propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training set speakers. The gating module inexpensively estimates test-time speaker characteristics in the form of an embedding vector and selects the most appropriate specialist module for denoising the test signal. Grouping the training set speakers into non-overlapping semantically similar groups is non-trivial and ill-defined. To do this, we first train a Siamese network using noisy speech pairs to maximize or minimize the similarity of its output vectors depending on whether the utterances derive from the same speaker or not. Next, we perform k-means clustering on the latent space formed by the averaged embedding vectors per training set speaker. In this way, we designate speaker groups and train specialist modules optimized around partitions of the complete training set. Our experiments show that ensemble models made up of low-capacity specialists can outperform high-capacity generalist models with greater efficiency and improved adaptation towards unseen test-time speakers. 