Visual event perception tasks such as action localization have primarily focused on supervised learning settings under a static observer, i.e., where the camera is static and cannot be controlled by an algorithm. These methods are often restricted by the quality, quantity, and diversity of annotated training data and often fail to generalize to out-of-domain samples. In this work, we tackle the problem of active action localization, where the goal is to localize an action while controlling the geometric and physical parameters of an active camera to keep the action in the field of view, without requiring training data. We formulate an energy-based mechanism that combines predictive learning and reactive control to perform active action localization without rewards, which can be sparse or non-existent in real-world environments. We perform extensive experiments in both simulated and real-world environments on two tasks: active object tracking and active action localization. We demonstrate that the proposed approach can generalize to different tasks and environments in a streaming fashion, without explicit rewards or training. We show that the proposed approach outperforms unsupervised baselines and obtains competitive performance compared to approaches trained with reinforcement learning.
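The abstract does not give implementation details, but the combination of predictive learning with reactive control can be illustrated by a minimal sketch: a feature predictor whose squared prediction error serves as the energy driving online updates, paired with a proportional controller that re-centers the camera on the localized action. The class, the linear predictor, and the control law below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class ActiveLocalizer:
    """Minimal sketch of energy-based active localization (assumed setup):
    prediction error acts as the energy that drives online learning, and a
    proportional controller keeps the localized action in the field of view."""

    def __init__(self, feat_dim=128, lr=1e-3, gain=0.1):
        self.W = np.zeros((feat_dim, feat_dim))  # linear feature predictor (illustrative)
        self.lr = lr                             # online learning rate
        self.gain = gain                         # proportional control gain

    def step(self, feat_prev, feat_curr, action_center, image_center):
        # Energy: squared error between predicted and observed features.
        pred = self.W @ feat_prev
        err = feat_curr - pred
        energy = 0.5 * float(err @ err)

        # Predictive learning: gradient step on the energy, no rewards needed.
        self.W += self.lr * np.outer(err, feat_prev)

        # Reactive control: pan/tilt command proportional to the offset of the
        # localized action from the image center.
        command = self.gain * (np.asarray(action_center, float)
                               - np.asarray(image_center, float))
        return energy, command
```

Each incoming frame yields one energy evaluation, one parameter update, and one camera command, consistent with the streaming, reward-free setting described above.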
Hamiltonian learning using machine-learning models trained with continuous measurements
We build upon recent work on the use of machine-learning models to estimate Hamiltonian parameters using continuous weak measurement of qubits as input. We consider two settings for training our model: (1) supervised learning, where the weak-measurement training record can be labeled with known Hamiltonian parameters, and (2) unsupervised learning, where no labels are available. The first has the advantage of not requiring an explicit representation of the quantum state, thus potentially scaling very favorably to a larger number of qubits. The second requires the implementation of a physical model to map the Hamiltonian parameters to a measurement record, which we implement using an integrator of the physical model combined with a recurrent neural network that provides a model-free correction at every time step to account for small effects not captured by the physical model. We test our construction on a system of two qubits and demonstrate accurate prediction of multiple physical parameters in both the supervised and unsupervised settings. We demonstrate that the model benefits from larger training sets, establishing that it is “learning,” and we show robustness to errors in the assumed physical model by achieving accurate parameter estimation in the presence of unanticipated single-particle relaxation.
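For the unsupervised setting, the description suggests interleaving a physical-model integrator with a recurrent network that corrects each step. The sketch below illustrates that structure under stated assumptions: the linear "physics" step, layer sizes, and all names are placeholders rather than the paper's two-qubit model.

```python
import torch
import torch.nn as nn

class HybridIntegrator(nn.Module):
    """Sketch: a physical-model integrator advances the state and a GRU adds a
    model-free correction at every time step (structure assumed from the abstract)."""

    def __init__(self, state_dim=8, hidden_dim=32):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.cell = nn.GRUCell(state_dim, hidden_dim)
        self.readout = nn.Linear(hidden_dim, state_dim)

    def physics_step(self, state, params, dt=0.01):
        # Placeholder Euler step of an assumed linear model dx/dt = A(params) x;
        # the paper instead integrates a two-qubit physical model.
        A = params.reshape(state.shape[-1], state.shape[-1])
        return state + dt * state @ A.T

    def forward(self, state0, params, n_steps):
        state = state0
        h = torch.zeros(state0.shape[0], self.hidden_dim)
        record = []
        for _ in range(n_steps):
            state = self.physics_step(state, params)  # physical model
            h = self.cell(state, h)                   # recurrent correction state
            state = state + self.readout(h)           # model-free correction
            record.append(state)
        return torch.stack(record, dim=1)             # simulated measurement record
```

In the unsupervised setting, the Hamiltonian parameters would be a learnable tensor optimized jointly with the network so that the simulated record matches the measured one.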
- Award ID(s): 1936388
- PAR ID: 10552826
- Publisher / Repository: American Physical Society
- Date Published:
- Journal Name: Physical Review Applied
- Volume: 22
- Issue: 4
- ISSN: 2331-7019
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
In Activities of Daily Living (ADL) research, which has gained prominence due to the burgeoning aging population, the challenge of acquiring sufficient ground truth data for model training is a significant bottleneck. This obstacle necessitates a pivot towards unsupervised representation learning methodologies, which do not require large labeled datasets. Existing research examined the tradeoff between a fully supervised model and an unsupervised pre-trained model and found that the unsupervised version outperformed in most cases. However, that investigation did not use sufficiently large Human Activity Recognition (HAR) datasets, with both datasets limited to three dimensions. This poster extends the investigation by employing a large multivariate time series HAR dataset and experimenting with different combinations of critical training parameters, such as batch size and learning rate, to observe the performance tradeoff. Our findings reveal that the pre-trained model is comparable to fully supervised classification on a larger multivariate time series HAR dataset. This discovery underscores the potential of unsupervised representation learning in ADL extraction and highlights the importance of model configuration in optimizing performance.
-
We present a novel fine-tuning algorithm in a deep hybrid architecture for semi-supervised text classification. During each increment of the online learning process, the fine-tuning algorithm serves as a top-down mechanism for pseudo-jointly modifying model parameters following a bottom-up generative learning pass. The resulting model, trained under what we call the Bottom-Up-Top-Down learning algorithm, is shown to outperform a variety of competitive models and baselines trained across a wide range of splits between supervised and unsupervised training data.
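The abstract leaves the architecture unspecified, but the per-increment pattern of a bottom-up generative pass followed by top-down fine-tuning can be sketched generically with an autoencoder plus a classifier head. Everything below, including the reconstruction objective and layer sizes, is an illustrative assumption, not the paper's hybrid architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative components; the paper's deep hybrid architecture differs.
encoder = nn.Sequential(nn.Linear(300, 64), nn.ReLU())
decoder = nn.Linear(64, 300)
classifier = nn.Linear(64, 2)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters(),
                        *classifier.parameters()], lr=1e-3)

def increment(x_unlabeled, x_labeled, y):
    # Bottom-up generative pass: unsupervised reconstruction of the input.
    gen_loss = F.mse_loss(decoder(encoder(x_unlabeled)), x_unlabeled)
    opt.zero_grad()
    gen_loss.backward()
    opt.step()

    # Top-down fine-tuning pass: supervised update on the labeled split.
    sup_loss = F.cross_entropy(classifier(encoder(x_labeled)), y)
    opt.zero_grad()
    sup_loss.backward()
    opt.step()
    return gen_loss.item(), sup_loss.item()

# One online increment with random stand-in data (text features assumed
# here to be 300-dimensional vectors).
increment(torch.randn(16, 300), torch.randn(4, 300), torch.randint(0, 2, (4,)))
```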
-
Semantic segmentation algorithms, such as UNet, that rely on convolutional neural network (CNN)-based architectures have shown promise for anthropogenic geomorphic feature extraction when using land surface parameters (LSPs) derived from digital terrain models (DTMs) as input predictor variables, owing to their ability to capture local textures and spatial context. However, the operationalization of these supervised classification methods is limited by a lack of large volumes of quality training data. This study explores the use of transfer learning, where information learned from another, often much larger, dataset is used to potentially reduce the need for a large, problem-specific training dataset. Two anthropogenic geomorphic feature extraction problems are explored: the extraction of agricultural terraces and the mapping of surface coal mine reclamation-related valley fill faces. Light detection and ranging (LiDAR)-derived DTMs were used to generate LSPs. We developed custom transfer parameters by attempting to predict geomorphon-based landforms using a large dataset of digital terrain data provided by the United States Geological Survey’s 3D Elevation Program (3DEP). We also explored the use of pre-trained ImageNet parameters and initializing models using parameters learned from the other mapping task investigated. The geomorphon-based transfer learning resulted in the poorest performance, while the ImageNet-based parameters generally improved performance in comparison to a random parameter initialization, even when the encoder was frozen or not trained. Transfer learning between the different geomorphic datasets offered minimal benefits. We suggest that pre-trained models developed using large, image-based datasets may be of value for anthropogenic geomorphic feature extraction from LSPs, even given the data and task disparities. More specifically, ImageNet-based parameters should be considered as an initialization state for the encoder component of semantic segmentation architectures applied to anthropogenic geomorphic feature extraction, even when using non-RGB image-based predictor variables such as LSPs. The value of transfer learning between the different geomorphic mapping tasks may have been limited by smaller sample sizes, which highlights the need for continued research into unsupervised and semi-supervised learning methods, especially given the large volume of digital terrain data available despite the lack of associated labels.
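One takeaway, initializing the encoder of a segmentation network with ImageNet weights even for non-RGB LSP inputs and even with the encoder frozen, can be sketched with torchvision as follows. The tiny decoder head and the three-channel LSP stack are simplifying assumptions, not the study's UNet configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen ImageNet-initialized encoder (resnet18 as a stand-in backbone).
encoder = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in encoder.parameters():
    p.requires_grad = False                      # frozen, per the reported experiment
backbone = nn.Sequential(*list(encoder.children())[:-2])  # drop avgpool and fc

class TinySegHead(nn.Module):
    """Minimal decoder: 1x1 conv plus upsampling back to input resolution."""
    def __init__(self, in_ch=512, n_classes=2):
        super().__init__()
        self.head = nn.Conv2d(in_ch, n_classes, kernel_size=1)

    def forward(self, feats, out_size):
        return nn.functional.interpolate(self.head(feats), size=out_size,
                                         mode="bilinear", align_corners=False)

head = TinySegHead()
x = torch.randn(1, 3, 256, 256)                  # three LSP rasters stacked as channels
feats = backbone(x)                              # (1, 512, 8, 8) for resnet18
logits = head(feats, x.shape[-2:])               # (1, 2, 256, 256) class logits
```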
-
The problem of action localization involves locating the action in the video, both over time and spatially in the image. The current dominant approaches use supervised learning to solve this problem and require large amounts of annotated training data in the form of frame-level bounding box annotations around the region of interest. In this paper, we present a new approach based on continual learning that uses feature-level predictions for self-supervision. It does not require any training annotations in terms of frame-level bounding boxes. The approach is inspired by cognitive models of visual event perception that propose a prediction-based approach to event understanding. We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video and use this model to predict high-level features for the future frames. The prediction errors are used to learn the parameters of the models continuously. This self-supervised framework is less complicated than other approaches but is very effective in learning robust visual representations for both labeling and localization. Notably, the approach operates in a streaming fashion, requiring only a single pass through the video, making it amenable to real-time processing. We demonstrate this on three datasets: UCF Sports, JHMDB, and THUMOS’13, and show that the proposed approach outperforms weakly-supervised and unsupervised baselines and obtains competitive performance compared to fully supervised baselines. Finally, we show that the proposed framework can generalize to egocentric videos and achieve state-of-the-art results on the unsupervised gaze prediction task.
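A minimal sketch of the prediction-based self-supervision described above, assuming simplified components: a small CNN encoder produces per-frame features, a single LSTM cell (standing in for the paper's LSTM stack with attention) predicts the next frame's features, and the prediction error is backpropagated continually during a single streaming pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveStream(nn.Module):
    """Sketch of prediction-based self-supervision: a CNN encodes frames and an
    LSTM cell predicts the next frame's features. Sizes are illustrative."""

    def __init__(self, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTMCell(feat_dim, feat_dim)

def streaming_pass(model, frames, optimizer):
    """Single pass over the video; parameters are updated continually."""
    h = c = torch.zeros(1, model.lstm.hidden_size)
    prev_frame = None
    for frame in frames:                             # streaming: one pass only
        if prev_frame is not None:
            feat = model.encoder(prev_frame.unsqueeze(0))
            h, c = model.lstm(feat, (h, c))          # predict next-frame features
            with torch.no_grad():
                target = model.encoder(frame.unsqueeze(0))  # observed features
            loss = F.mse_loss(h, target)             # prediction error as loss
            optimizer.zero_grad()
            loss.backward()                          # continual online update
            optimizer.step()
            h, c = h.detach(), c.detach()            # truncate backprop through time
        prev_frame = frame

model = PredictiveStream()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
streaming_pass(model, [torch.randn(3, 64, 64) for _ in range(8)], opt)
```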