Title: On the Role of Prediction in Streaming Hierarchical Learning
In today's world, AI systems must make sense of large volumes of data as they unfold in real time, whether video from surveillance and monitoring cameras, streams of egocentric footage, or sequences in other domains such as text or audio. The ability to break these continuous data streams into meaningful events, discover nested structures, and predict what might happen next at different levels of abstraction is crucial for applications ranging from passive surveillance systems to sensory-motor autonomous learning. However, most existing models rely heavily on large annotated datasets with fixed data distributions and offline epoch-based training, which makes them impractical for the unpredictability and scale of dynamic real-world environments. This dissertation tackles these challenges by introducing a set of predictive models designed to process streaming data efficiently, segment events, and build sequential memory models without supervision or data storage.

First, we present a single-layer predictive model that segments long, unstructured video streams by detecting temporal events and spatially localizing objects in each frame. Applied to wildlife monitoring footage, the model processes continuous, high-frame-rate video and detects and tracks events without supervision. It operates in an online streaming manner, performing training and inference simultaneously without storing or revisiting the processed data. This approach removes the need for manual labeling, making it well suited to long-duration, real-world video footage.

Building on this, we introduce STREAMER, a multi-layered architecture that extends the single-layer model into a hierarchical predictive framework. STREAMER segments events at different levels of abstraction, capturing the compositional structure of activities in egocentric videos. By dynamically adapting to various timescales, it creates a hierarchy of nested events and forms increasingly complex and abstract representations of the input data.

Finally, we propose the Predictive Attractor Model (PAM), which builds biologically plausible memory models of sequential data. Inspired by neuroscience, PAM uses sparse distributed representations and local learning rules to avoid catastrophic forgetting, allowing it to continually learn and make predictions without overwriting previous knowledge. Unlike many traditional models, PAM can generate multiple potential future outcomes conditioned on the same context, which allows it to handle uncertainty in generative tasks.

Together, these models form a unified framework of predictive learning that addresses multiple challenges in event understanding and temporal data analysis. By using prediction as the core mechanism, they segment continuous data streams into events, discover hierarchical structures across multiple levels of abstraction, learn semantic event representations, and model sequences without catastrophic forgetting.
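To make the core mechanism concrete, the following is a minimal illustrative sketch (not the dissertation's actual model) of prediction-driven event segmentation: a predictor trains and infers in a single pass over a feature stream, and a boundary is flagged whenever the prediction error spikes relative to its running average. All names, dimensions, thresholds, and the toy feature stream are assumptions for illustration.

```python
import torch
import torch.nn as nn

class StreamingPredictor(nn.Module):
    """Predicts the next frame's feature vector from a running context."""
    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, feat_dim)

    def forward(self, feat, h):
        h = self.rnn(feat, h)
        return self.head(h), h

def segment_stream(frame_features, ratio=2.0, lr=1e-3):
    """Single pass: train and infer simultaneously, never storing frames.
    Emits an event boundary when the error exceeds ratio * running mean."""
    model = StreamingPredictor()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    h = torch.zeros(1, 256)
    pred, boundaries, errors = None, [], []
    for t, feat in enumerate(frame_features):   # feat: (1, 512) tensor
        if pred is not None:
            err = nn.functional.mse_loss(pred, feat)
            errors.append(err.item())
            if err.item() > ratio * (sum(errors) / len(errors)):
                boundaries.append(t)            # error spike => boundary
            opt.zero_grad(); err.backward(); opt.step()  # learn from error
        pred, h = model(feat, h.detach())       # predict the next frame
    return boundaries

# Toy stream with a distribution shift halfway through:
stream = [torch.randn(1, 512) for _ in range(50)]
stream += [torch.randn(1, 512) + 3.0 for _ in range(50)]
print(segment_stream(stream))  # expected to flag a boundary near t = 50
```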
Award ID(s):
1956050
PAR ID:
10632602
Author(s) / Creator(s):
Publisher / Repository:
Digital Commons at University of South Florida
Date Published:
Format(s):
Medium: X
Institution:
University of South Florida
Sponsoring Org:
National Science Foundation
More Like this
  1. Sequential memory, the ability to form and accurately recall a sequence of events or stimuli in the correct order, is a fundamental prerequisite for biological and artificial intelligence, as it underpins numerous cognitive functions (e.g., language comprehension, planning, episodic memory formation). However, existing sequential memory methods suffer from catastrophic forgetting, limited capacity, slow iterative learning procedures, low-order Markov memory, and, most importantly, the inability to represent and generate multiple valid future possibilities stemming from the same context. Inspired by biologically plausible neuroscience theories of cognition, we propose Predictive Attractor Models (PAM), a novel sequence memory architecture with desirable generative properties. PAM is a streaming model that learns a sequence in an online, continuous manner by observing each input only once. Additionally, we find that PAM avoids catastrophic forgetting by uniquely representing past context through lateral inhibition in cortical minicolumns, which prevents new memories from overwriting previously learned knowledge. PAM generates future predictions by sampling from a union set of predicted possibilities; this generative ability is realized through an attractor model trained alongside the predictor. We show that PAM is trained with local computations through Hebbian plasticity rules in a biologically plausible framework. Other desirable traits (e.g., noise tolerance, CPU-based learning, capacity scaling) are discussed throughout the paper. Our findings suggest that PAM represents a significant step forward in the pursuit of biologically plausible and computationally efficient sequential memory models, with broad implications for cognitive science and artificial intelligence research. Illustration videos and code are available on our project page: https://ramymounir.com/publications/pam.
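    As a rough illustration of the union-style prediction described above, the toy sketch below learns pairwise SDR transitions with a local Hebbian rule and predicts the union of plausible successors. It deliberately omits PAM's lateral inhibition and context-dependent encoding, and all names, dimensions, and thresholds are hypothetical.

    ```python
    import numpy as np

    class HebbianSequenceMemory:
        """Toy SDR transition memory (illustrative; not PAM's actual model)."""
        def __init__(self, n_bits=2048, n_active=40):
            self.n_bits, self.n_active = n_bits, n_active
            self.W = np.zeros((n_bits, n_bits))          # transition weights

        def encode(self, token):
            """Random SDR, stable within a run (stand-in for a real encoder)."""
            rng = np.random.default_rng(hash(token) % (2**32))
            sdr = np.zeros(self.n_bits, dtype=bool)
            sdr[rng.choice(self.n_bits, self.n_active, replace=False)] = True
            return sdr

        def learn_step(self, prev_sdr, next_sdr, lr=1.0):
            """Local Hebbian rule: strengthen only prev->next co-activations."""
            self.W[np.ix_(prev_sdr, next_sdr)] += lr

        def predict(self, sdr, frac=0.5):
            """Union of plausible successors: bits with enough support."""
            support = self.W[sdr].sum(axis=0)
            return support >= frac * self.n_active

    mem = HebbianSequenceMemory()
    for seq in (["A", "B", "C"], ["A", "B", "D"]):   # same context, two futures
        for prev, nxt in zip(seq, seq[1:]):
            mem.learn_step(mem.encode(prev), mem.encode(nxt))
    pred = mem.predict(mem.encode("B"))
    print(int(pred.sum()))   # ~2 * n_active: the union covers both C and D
    ```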
  2. Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J.-M. (Eds.)
    The problem of action localization involves locating the action in the video, both over time and spatially in the image. The current dominant approaches use supervised learning to solve this problem. They require large amounts of annotated training data, in the form of frame-level bounding box annotations around the region of interest. In this paper, we present a new approach based on continual learning that uses feature-level predictions for self-supervision. It does not require any training annotations in the form of frame-level bounding boxes. The approach is inspired by cognitive models of visual event perception that propose a prediction-based approach to event understanding. We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video, and we use this model to predict high-level features for future frames. The prediction errors are used to learn the parameters of the models continuously. This self-supervised framework is less complicated than other approaches yet very effective in learning robust visual representations for both labeling and localization. Notably, the approach operates in a streaming fashion, requiring only a single pass through the video, which makes it amenable to real-time processing. We demonstrate this on three datasets (UCF Sports, JHMDB, and THUMOS’13) and show that the proposed approach outperforms weakly-supervised and unsupervised baselines and obtains competitive performance compared to fully supervised baselines. Finally, we show that the proposed framework can generalize to egocentric videos and achieve state-of-the-art results on the unsupervised gaze prediction task.
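    A pared-down sketch of the feature-prediction objective follows. The paper's model uses stacked LSTMs with attention; this illustration substitutes a single GRU cell shared across spatial locations, and the frame source, dimensions, and variable names are placeholders.

    ```python
    import torch
    import torch.nn as nn

    def video_stream(n=8, size=64):
        """Hypothetical stand-in for a real frame source."""
        for _ in range(n):
            yield torch.rand(1, 3, size, size)

    encoder = nn.Sequential(                     # stand-in CNN encoder
        nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    )
    rnn = nn.GRUCell(128, 128)                   # shared across locations
    head = nn.Linear(128, 128)                   # state -> predicted feature
    params = [*encoder.parameters(), *rnn.parameters(), *head.parameters()]
    opt = torch.optim.Adam(params, lr=1e-4)

    hidden, pred = None, None
    for frame in video_stream():
        feat = encoder(frame)                    # (1, 128, h, w)
        _, c, h, w = feat.shape
        tokens = feat.flatten(2).squeeze(0).t()  # (h*w, 128) spatial tokens
        if pred is not None:
            sq_err = (pred - tokens.detach()).pow(2)
            err_map = sq_err.mean(-1).view(h, w) # high error ~= action region
            loss = sq_err.mean()                 # prediction error drives learning
            opt.zero_grad(); loss.backward(); opt.step()
        hidden = torch.zeros(h * w, 128) if hidden is None else hidden.detach()
        hidden = rnn(tokens, hidden)             # per-location temporal state
        pred = head(hidden)                      # predict next frame's tokens

    print(err_map.shape)  # per-location errors; thresholding would localize
    ```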
  3. When an agent acquires new information, ideally it would immediately be capable of using that information to understand its environment. This is not possible using conventional deep neural networks, which suffer from catastrophic forgetting when they are incrementally updated, with new knowledge overwriting established representations. A variety of approaches have been developed to mitigate catastrophic forgetting in the incremental batch learning scenario, where a model learns from a series of large collections of labeled samples. However, in this setting, inference is only possible after a batch has been accumulated, which prohibits many applications. An alternative paradigm, known as streaming learning, is online learning in a single pass through the training dataset on a resource-constrained budget. Streaming learning has been much less studied in the deep learning community. In streaming learning, an agent learns instances one by one and can be tested at any time, rather than only after learning a large batch. Here, we revisit streaming linear discriminant analysis, which has been widely used in the data mining research community. By combining streaming linear discriminant analysis with deep learning, we are able to outperform both incremental batch learning and streaming learning algorithms on ImageNet ILSVRC-2012 and on CORe50, a dataset that involves learning to classify from temporally ordered samples.
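    The sketch below illustrates the flavor of streaming LDA over fixed deep features: per-class running means plus a shared streaming covariance, updated one sample at a time. The covariance update, shrinkage value, and toy data are simplifications for illustration, not the paper's exact formulation.

    ```python
    import numpy as np

    class StreamingLDA:
        """Streaming LDA head over fixed features (one update per sample)."""
        def __init__(self, dim, n_classes, shrink=1e-4):
            self.mu = np.zeros((n_classes, dim))   # per-class running means
            self.count = np.zeros(n_classes)
            self.Sigma = np.eye(dim)               # shared covariance estimate
            self.n_seen, self.shrink = 0, shrink

        def fit_one(self, x, y):
            """O(dim^2) update from a single (feature, label) pair."""
            self.n_seen += 1
            delta = x - self.mu[y]                 # deviation from class mean
            self.Sigma += (np.outer(delta, delta) - self.Sigma) / self.n_seen
            self.count[y] += 1
            self.mu[y] += delta / self.count[y]

        def predict(self, x):
            """LDA decision rule with a shrinkage-regularized inverse."""
            P = np.linalg.inv(self.Sigma + self.shrink * np.eye(len(x)))
            W = self.mu @ P                        # (n_classes, dim) weights
            b = -0.5 * np.einsum('cd,cd->c', W, self.mu)
            return int(np.argmax(W @ x + b))

    clf = StreamingLDA(dim=64, n_classes=10)
    rng = np.random.default_rng(0)
    for _ in range(2000):                          # features would come from a
        y = int(rng.integers(10))                  # frozen CNN backbone in practice
        clf.fit_one(rng.normal(size=64) + y, y)    # toy class-shifted features
    print(clf.predict(rng.normal(size=64) + 3))    # most likely prints 3
    ```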
  4. Real-time interactive video streaming applications like cloud-based video games, AR, and VR require high-quality video streams and extremely low end-to-end interaction delays. These requirements make the QoE extremely sensitive to packet losses. Due to the inter-dependency between compressed frames, packet losses stall the video decode pipeline until the lost packets are retransmitted (resulting in stutters and higher delays) or the decoder state is reset using IDR-frames (lower video quality for a given bandwidth). Prism is a hybrid predictive-reactive packet loss recovery scheme that uses a split-stream video coding technique to meet the needs of ultra-low-latency video streaming applications. Prism's approach enables aggressive loss prediction, rapid loss recovery, and high video quality post-recovery, with zero overhead during normal operation, avoiding the pitfalls of existing approaches. Our evaluation on real video game footage shows that Prism reduces the penalty of using I-frames for recovery by 81%, while achieving 30% lower delay than pure retransmission-based recovery.
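    For intuition only, the toy function below models the latency trade-off a purely reactive recovery scheme faces (retransmit versus IDR reset). It is not Prism's algorithm, whose split-stream predictive design aims to sidestep this dilemma altogether; all timings and names are made up.

    ```python
    def recover(detect_ms: float, rtt_ms: float, deadline_ms: float) -> str:
        """Pick a recovery action for a lost packet under a latency deadline."""
        if detect_ms + rtt_ms <= deadline_ms:
            return "retransmit"   # brief stall, full quality preserved
        return "IDR reset"        # resume immediately, lower quality per bit

    for rtt in (20, 60, 120):     # e.g., LAN vs. regional vs. cross-country
        print(rtt, "ms RTT ->", recover(detect_ms=10, rtt_ms=rtt, deadline_ms=100))
    ```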
  5. We present a novel self-supervised approach for hierarchical representation learning and segmentation of perceptual inputs in a streaming fashion. Our research addresses how to semantically group streaming inputs into chunks at various levels of a hierarchy while simultaneously learning, for each chunk, robust global representations throughout the domain. To achieve this, we propose STREAMER, an architecture that is trained layer-by-layer, adapting to the complexity of the input domain. In our approach, each layer is trained with two primary objectives: making accurate predictions into the future and providing necessary information to other levels for achieving the same objective. The event hierarchy is constructed by detecting prediction error peaks at different levels, where a detected boundary triggers a bottom-up information flow. At an event boundary, the encoded representation of inputs at one layer becomes the input to a higher-level layer. Additionally, we design a communication module that facilitates top-down and bottom-up exchange of information during the prediction process. Notably, our model is fully self-supervised and trained in a streaming manner, enabling a single pass on the training data. This means that the model encounters each input only once and does not store the data. We evaluate the performance of our model on the egocentric EPIC-KITCHENS dataset, specifically focusing on temporal event segmentation. Furthermore, we conduct event retrieval experiments using the learned representations to demonstrate the high quality of our video event representations. 
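    The schematic below sketches the layer-stacking idea under simplifying assumptions: each layer treats its recurrent state as the prediction of its next input, treats a prediction-error peak as a boundary, and pools the buffered inputs into a representation for the layer above. Learning and the cross-layer communication module are omitted, and all names and thresholds are illustrative.

    ```python
    import torch

    class Layer:
        def __init__(self, dim=256):
            self.predictor = torch.nn.GRUCell(dim, dim)
            self.h = torch.zeros(1, dim)           # doubles as the prediction
            self.buffer, self.prev_err = [], None  # inputs of current event

        def step(self, x):
            """Return a pooled event representation at a boundary, else None."""
            err = torch.nn.functional.mse_loss(self.h, x).item()
            is_peak = self.prev_err is not None and err > 1.5 * self.prev_err
            self.prev_err = err
            self.buffer.append(x)
            out = None
            if is_peak:                            # error peak => event boundary
                out = torch.stack(self.buffer).mean(0)
                self.buffer = []
            with torch.no_grad():                  # schematic: no learning shown
                self.h = self.predictor(x, self.h)
            return out

    layers = [Layer() for _ in range(3)]
    for x in torch.randn(100, 1, 256):             # stand-in feature stream
        inp = x
        for layer in layers:                       # boundary triggers bottom-up flow
            inp = layer.step(inp)
            if inp is None:
                break
    ```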