This content will become publicly available on June 12, 2026

Title: MooBot: RAG-based Video Querying System for Dairy Cattle Behavior and Health Insights
We introduce MooBot, a RAG-based video querying system powered by GPT-4o, designed to bridge the gap between what complex cattle video data can provide and what dairy farmers need through a natural language web interface. MooBot applies computer vision inference to barn videos to detect cows, identify individuals, and classify their behaviors, transforming visual data into a structured schema of useful insights. Our results demonstrate MooBot's potential for making video-derived insights more accessible in precision livestock farming, bringing advanced computer vision analytics within reach of farmers without requiring technical expertise.
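The abstract describes the pipeline only at a high level. As a minimal, hypothetical sketch (the schema, record contents, and retrieval step below are illustrative, not taken from the paper), a MooBot-style system might store vision-model outputs in a structured schema, retrieve the records relevant to a question, and hand them to GPT-4o via the OpenAI chat-completions API:

```python
# Hypothetical sketch of a MooBot-style RAG pipeline. The schema, record
# contents, and retrieval step are illustrative, not taken from the paper;
# only the GPT-4o call uses a real API (openai>=1.0, OPENAI_API_KEY set).
from dataclasses import dataclass
from openai import OpenAI

@dataclass
class BehaviorRecord:
    cow_id: str        # identity from the individual-identification model
    behavior: str      # e.g. "feeding", "lying", "standing"
    start_s: float     # interval within the barn video, in seconds
    end_s: float

def retrieve(records: list[BehaviorRecord], query: str) -> list[BehaviorRecord]:
    """Naive keyword retrieval: keep records whose cow id or behavior
    appears in the question. A real system would likely use embeddings."""
    q = query.lower()
    return [r for r in records if r.cow_id in q or r.behavior in q]

def answer(records: list[BehaviorRecord], query: str) -> str:
    context = "\n".join(
        f"cow {r.cow_id}: {r.behavior} from {r.start_s:.0f}s to {r.end_s:.0f}s"
        for r in retrieve(records, query)
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided barn-video records."},
            {"role": "user",
             "content": f"Records:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content

# In the real system these records would come from the vision models.
records = [BehaviorRecord("417", "feeding", 120, 540),
           BehaviorRecord("417", "lying", 600, 4200)]
print(answer(records, "How long did cow 417 spend feeding?"))
```

In the actual system, the records would be populated by the detection, identification, and behavior-classification models rather than hard-coded.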
Award ID(s):
2435327
PAR ID:
10612104
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Computer Vision for Animal Behavior Tracking and Modeling
Date Published:
Format(s):
Medium: X
Location:
Nashville, TN
Sponsoring Org:
National Science Foundation
More Like this
  1. Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design, and training recipes. The research community has responded by distilling from black-box models to label training data, achieving strong benchmark results; however, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks, focusing on the ability to reason about the "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code, and models.
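As a purely illustrative aside on how a benchmark like PLM-VideoBench might be consumed, here is a hedged sketch of an exact-match evaluation loop for video question answering; the VideoQA format and EchoModel are hypothetical stand-ins, not the released PLM protocol:

```python
# Illustrative only: a minimal exact-match evaluation loop in the spirit of a
# video QA benchmark. VideoQA and EchoModel are hypothetical stand-ins; the
# real PLM-VideoBench protocol ships with the released PLM code and data.
from dataclasses import dataclass

@dataclass
class VideoQA:
    video_path: str
    question: str   # probes the "what", "where", "when", or "how" of a video
    answer: str

def evaluate(model, examples: list[VideoQA]) -> float:
    """Fraction of questions answered with an exact (case-folded) match."""
    correct = sum(
        model.answer(ex.video_path, ex.question).strip().lower()
        == ex.answer.strip().lower()
        for ex in examples
    )
    return correct / len(examples)

class EchoModel:
    """Trivial stand-in model so the loop runs end to end."""
    def answer(self, video_path: str, question: str) -> str:
        return "kitchen"

examples = [VideoQA("clip0.mp4", "Where does the action take place?", "kitchen")]
print(evaluate(EchoModel(), examples))  # -> 1.0
```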
  2. Langran, E; Rutledge, D (Ed.)
    As part of a professional development experience, in-service teachers were provided with lessons and activities emphasizing computer vision. Lessons focused on image processing, machine learning for visual data, and applications in robotics and visual media. Personal Construct Theory (Kelly, 1955) was used to answer the following research question: after middle school teachers engaged in professional development emphasizing computer vision, what changes in thinking occurred in relation to computer vision? Results showed changes in thinking in the form of a more refined understanding of computer vision and its role in STEM education.
  3. Computer vision on low-power edge devices enables applications such as search-and-rescue and security. State-of-the-art computer vision algorithms, such as Deep Neural Networks (DNNs), are too large for inference on low-power edge devices. To improve efficiency, some existing approaches parallelize DNN inference across multiple edge devices. However, these techniques introduce significant communication and synchronization overheads or are unable to balance workloads across devices. This paper demonstrates that the hierarchical DNN architecture is well suited to parallel processing on multiple edge devices. We design a novel method that creates a parallel inference pipeline for computer vision problems that use hierarchical DNNs. The method balances loads across the collaborating devices and reduces communication costs, facilitating the processing of multiple video frames simultaneously with higher throughput. Our experiments consider a representative computer vision problem where image recognition is performed on each video frame, running on multiple Raspberry Pi 4Bs. With four collaborating low-power edge devices, our approach achieves 3.21× higher throughput, 68% less energy consumption per device per frame, and a 58% decrease in memory use compared with existing single-device hierarchical DNNs.
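To make the pipeline-parallel idea in item 3 concrete, here is a hedged sketch in which each thread stands in for one edge device running one level of a hierarchical DNN; the stage functions and two-level split are toy placeholders, and the paper's load-balancing method is not reproduced:

```python
# Hedged sketch of pipeline-parallel inference for a hierarchical DNN. Each
# thread stands in for one edge device running one level of the hierarchy;
# the stage functions are toy placeholders.
import queue
import threading

def make_stage(fn, inbox, outbox):
    """Run fn on items from inbox until a None 'poison pill' arrives."""
    def run():
        while True:
            item = inbox.get()
            if item is None:
                outbox.put(None)   # pass the shutdown signal downstream
                return
            outbox.put(fn(item))
    return threading.Thread(target=run, daemon=True)

coarse = lambda frame_id: (frame_id, "animal")   # level 1: coarse category
fine = lambda x: (x[0], x[1] + "/dog")           # level 2: refine the label

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
for stage in (make_stage(coarse, q0, q1), make_stage(fine, q1, q2)):
    stage.start()

for frame_id in range(4):   # frames enter back-to-back, keeping both
    q0.put(frame_id)        # "devices" busy on different frames at once
q0.put(None)

while (out := q2.get()) is not None:
    print("frame", out[0], "->", out[1])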
  4. Video-based eye trackers have increasing potential to improve on-screen magnification for low-vision computer users. Yet little is known about the viability of eye tracking hardware for gaze-guided magnification. We employed a magnification prototype to assess eye tracking quality for low-vision users as they performed reading and search tasks. We show that a high degree of tracking loss prevents current video-based eye tracking from capturing gaze input for low-vision users. Our findings show that current technologies were not made with low-vision users in mind, and we offer suggestions for improving gaze tracking for diverse eye input.
  5. Our brains are "prediction machines": we continuously compare our surroundings with predictions from internal models our brains generate. This is demonstrated by our basic low-level sensory systems, which predict environmental changes as we move through space and time. We predict at higher cognitive levels as well: we can predict how the laws of physics affect people, places, and things, and even predict the end of someone's sentence. In our work, we sought to create an artificial model that mimics early, low-level biological predictive behavior in a computer vision system. Our predictive vision model uses spatiotemporal sequence memories learned from deep sparse coding. The model is implemented using a biologically inspired architecture: one that utilizes sequence memories, lateral inhibition, and top-down feedback in a generative framework. Our model learns the causes of the data in a completely unsupervised manner, simply by observing and learning about the world. Spatiotemporal features are learned by minimizing a reconstruction error convolved over space and time, and can subsequently be used for recognition, classification, and future video prediction. Our experiments show that we are able to accurately predict what will happen in the future; furthermore, we can use our predictions to detect anomalous, unexpected events in both synthetic and real video sequences.
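As a rough illustration of the objective described in item 5 (a reconstruction error convolved over space and time, plus a sparsity penalty), here is a hedged NumPy/SciPy sketch; the shapes and penalty weight are invented, and the paper's lateral inhibition and top-down feedback are omitted:

```python
# Hedged sketch of a convolutional sparse-coding objective: reconstruct a
# video clip as a sum of spatiotemporal dictionary elements convolved with
# sparse activation maps. Shapes and the penalty weight are illustrative.
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 16, 16))           # (time, height, width) clip
dicts = rng.standard_normal((4, 3, 5, 5))          # 4 spatiotemporal features
codes = rng.standard_normal((4, 8, 16, 16)) * 0.1  # activation map per feature

def objective(video, dicts, codes, lam=0.1):
    """E = ||video - sum_k dict_k * code_k||^2 + lam * ||codes||_1."""
    recon = sum(fftconvolve(codes[k], dicts[k], mode="same")
                for k in range(len(dicts)))
    return np.sum((video - recon) ** 2) + lam * np.abs(codes).sum()

print("initial objective:", objective(video, dicts, codes))
```

Minimizing this energy over the codes (and, in the outer loop, over the dictionary) yields the spatiotemporal features the abstract describes; the learned features can then drive recognition and future-frame prediction.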