Title: COMPASS: a formal framework and aggregate dataset for generalized surgical procedure modeling
Purpose: We propose a formal framework for the modeling and segmentation of minimally invasive surgical tasks using a unified set of motion primitives (MPs) to enable more objective labeling and the aggregation of different datasets.

Methods: We model dry-lab surgical tasks as finite state machines, representing how the execution of MPs as the basic surgical actions results in the change of surgical context, which characterizes the physical interactions among tools and objects in the surgical environment. We develop methods for labeling surgical context based on video data and for the automatic translation of context to MP labels. We then use our framework to create the COntext and Motion Primitive Aggregate Surgical Set (COMPASS), including six dry-lab surgical tasks from three publicly available datasets (JIGSAWS, DESK, and ROSMA), with kinematic and video data and context and MP labels.

Results: Our context labeling method achieves near-perfect agreement between consensus labels from crowd-sourcing and expert surgeons. Segmentation of tasks into MPs results in the creation of the COMPASS dataset, which nearly triples the amount of data for modeling and analysis and enables the generation of separate transcripts for the left and right tools.

Conclusion: The proposed framework results in high-quality labeling of surgical data based on context and fine-grained MPs. Modeling surgical tasks with MPs enables the aggregation of different datasets and the separate analysis of left and right hands for bimanual coordination assessment. Our formal framework and aggregate dataset can support the development of explainable and multi-granularity models for improved surgical process analysis, skill assessment, error detection, and autonomy.
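To make the framework's central idea more concrete, here is a small, purely illustrative sketch of how a change in surgical context could be translated into motion primitive labels. The Context fields, the MP names, and the translation rules below are simplified stand-ins chosen for this example, not the paper's actual state machine or label vocabulary.

```python
# Minimal illustrative sketch (not the authors' implementation): a context is a
# hypothetical snapshot of tool/object interactions, and a context transition is
# mapped to motion primitive (MP) labels by simple hand-written rules.

from dataclasses import dataclass

@dataclass(frozen=True)
class Context:
    left_hold: str        # e.g. "nothing" or "needle" (placeholder vocabulary)
    right_hold: str
    needle_in_tissue: bool

def translate(prev: Context, curr: Context) -> list[str]:
    """Map one context transition to MP labels (illustrative rules only)."""
    mps = []
    if prev.right_hold == "nothing" and curr.right_hold == "needle":
        mps.append("Grasp(R, needle)")
    if prev.right_hold == "needle" and curr.right_hold == "nothing":
        mps.append("Release(R, needle)")
    if not prev.needle_in_tissue and curr.needle_in_tissue:
        mps.append("Push(R, needle, tissue)")
    return mps

# Example transition: the right tool picks up the needle.
before = Context("nothing", "nothing", False)
after = Context("nothing", "needle", False)
print(translate(before, after))   # ['Grasp(R, needle)']
```

In the paper, the translation is defined over each task's finite state machine rather than ad hoc rules; the sketch only shows the general shape of a context-to-MP mapping.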
Award ID(s): 2146295, 1829004
NSF-PAR ID: 10411983
Author(s) / Creator(s):
Date Published:
Journal Name: International Journal of Computer Assisted Radiology and Surgery
ISSN: 1861-6429
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    Advances in visual perceptual tasks have been mainly driven by the amount and types of annotations in large-scale datasets. Researchers have focused on fully-supervised settings to train models using offline epoch-based schemes. Despite the evident advancements, the limitations and cost of manually annotated datasets have hindered further development for event perceptual tasks, such as detection and localization of objects and events in videos. The problem is more apparent in zoological applications due to the scarcity of annotations and the length of videos; most annotated videos are at most ten minutes long. Inspired by cognitive theories, we present a self-supervised perceptual prediction framework to tackle the problem of temporal event segmentation by building a stable representation of event-related objects. The approach is simple but effective. We rely on LSTM predictions of high-level features computed by a standard deep learning backbone. For spatial segmentation, the stable representation of the object is used by an attention mechanism to filter the input features before the prediction step. The self-learned attention maps effectively localize the object as a side effect of perceptual prediction. We demonstrate our approach on long videos from continuous wildlife video monitoring, spanning multiple days at 25 FPS. We aim to facilitate automated ethogramming by detecting and localizing events without the need for labels. Our approach is trained in an online manner on streaming input and requires only a single pass through the video, with no separate training set. Given the lack of long and realistic datasets (i.e., ones that include real-world challenges), we introduce a new wildlife video dataset, nest monitoring of the Kagu (a flightless bird from New Caledonia), to benchmark our approach. Our dataset features a 10-day video (over 23 million frames) of continuous monitoring of the Kagu in its natural habitat. We annotate every frame with bounding boxes and event labels. Additionally, each frame is annotated with time-of-day and illumination conditions. We will make the dataset, which is the first of its kind, and the code available to the research community. We find that the approach significantly outperforms other self-supervised, traditional (e.g., Optical Flow, Background Subtraction), and NN-based (e.g., PA-DPC, DINO, iBOT) baselines, and performs on par with supervised boundary detection approaches (i.e., PC). At a recall rate of 80%, our best-performing model detects one false positive activity every 50 min of training. On average, we at least double the performance of self-supervised approaches for spatial segmentation. Additionally, we show that our approach is robust to various environmental conditions (e.g., moving shadows). We also benchmark the framework on other datasets (i.e., Kinetics-GEBD, TAPOS) from different domains to demonstrate its generalizability. The data and code are available on our project page: https://aix.eng.usf.edu/research_automated_ethogramming.html
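    As a rough, illustrative sketch of the perceptual-prediction idea described above (not the authors' code), the snippet below has an LSTM predict the next frame's backbone features and flags frames where the prediction error spikes as candidate event boundaries; the architecture sizes, threshold rule, and random input are placeholder assumptions.

```python
# Illustrative sketch only: an LSTM predicts the next frame's features online;
# spikes in prediction error are treated as candidate event boundaries.

import torch
import torch.nn as nn

class FeaturePredictor(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.lstm = nn.LSTMCell(feat_dim, hidden)
        self.head = nn.Linear(hidden, feat_dim)

    def forward(self, feat, state):
        h, c = self.lstm(feat, state)
        return self.head(h), (h, c)

def detect_boundaries(features, threshold=1.5):
    """Single online pass over per-frame features; returns candidate boundary frame indices."""
    model = FeaturePredictor(features.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    state, errors, boundaries = None, [], []
    for t in range(features.shape[0] - 1):
        pred, state = model(features[t:t + 1], state)
        loss = nn.functional.mse_loss(pred, features[t + 1:t + 2])
        opt.zero_grad()
        loss.backward()
        opt.step()
        state = tuple(s.detach() for s in state)   # keep the recurrence, drop the graph
        errors.append(loss.item())
        if loss.item() > threshold * (sum(errors) / len(errors)):   # error spike
            boundaries.append(t + 1)
    return boundaries

# Per-frame features from a frozen backbone would go here; random data just shows the call.
print(detect_boundaries(torch.randn(200, 512)))
```

    A real deployment would stream features from a frozen backbone over video; the attention-based spatial localization described in the abstract is omitted here for brevity.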

     
  2. Abstract

    Background

    Analysing kinematic and video data can help identify potentially erroneous motions that lead to sub‐optimal surgeon performance and safety‐critical events in robot‐assisted surgery.

    Methods

    We develop a rubric for identifying task and gesture‐specific executional and procedural errors and evaluate dry‐lab demonstrations of suturing and needle passing tasks from the JIGSAWS dataset. We characterise erroneous parts of demonstrations by labelling video data, and use distribution similarity analysis and trajectory averaging on kinematic data to identify parameters that distinguish erroneous gestures.

    Results

    Executional error frequency varies by task and gesture, and correlates with skill level. Some predominant error modes in each gesture are distinguishable by analysing error‐specific kinematic parameters. Procedural errors could lead to lower performance scores and increased demonstration times but also depend on surgical style.

    Conclusions

    This study provides insights into context‐dependent errors that can be used to design automated error detection mechanisms and improve training and skill assessment.
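    The distribution similarity analysis mentioned in the Methods can be illustrated with a small sketch: a two-sample Kolmogorov-Smirnov test comparing one kinematic parameter between normal and erroneous executions of a gesture. The parameter name, sample values, and choice of test here are placeholders rather than the paper's exact setup.

```python
# Illustrative only: compare the distribution of a kinematic parameter between
# normal and erroneous gesture instances with a two-sample KS test (SciPy).

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
normal_grasp_angle = rng.normal(0.6, 0.05, size=120)     # placeholder samples
erroneous_grasp_angle = rng.normal(0.8, 0.10, size=40)   # placeholder samples

stat, p_value = ks_2samp(normal_grasp_angle, erroneous_grasp_angle)
print(f"KS statistic = {stat:.2f}, p = {p_value:.3g}")
# A large statistic (small p-value) suggests this parameter separates the two groups.
```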

     
  3. Abstract

    Computational fluid dynamics (CFD) modeling of left ventricle (LV) flow combined with patient medical imaging data has shown great potential in obtaining patient-specific hemodynamics information for functional assessment of the heart. A typical model construction pipeline usually starts with segmentation of the LV by manual delineation, followed by mesh generation and registration techniques using separate software tools. However, such approaches usually require significant time and human effort in the model generation process, limiting large-scale analysis. In this study, we propose an approach toward fully automating the model generation process for CFD simulation of LV flow to significantly reduce LV CFD model generation time. Our modeling framework leverages a novel combination of techniques, including deep-learning based segmentation, geometry processing, and image registration, to reliably reconstruct CFD-suitable LV models with little-to-no user intervention. We utilized an ensemble of two-dimensional (2D) convolutional neural networks (CNNs) for automatic segmentation of cardiac structures from three-dimensional (3D) patient images, and our segmentation approach outperformed recent state-of-the-art segmentation techniques when evaluated on benchmark data containing both magnetic resonance (MR) and computed tomography (CT) cardiac scans. We demonstrate that, through a combination of segmentation and geometry processing, we were able to robustly create CFD-suitable LV meshes from segmentations for 78 out of 80 test cases. Although the focus of this study is on image-to-mesh generation, we demonstrate the feasibility of this framework in supporting LV hemodynamics modeling by performing CFD simulations from two representative time-resolved patient-specific image datasets.
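    As a rough illustration of the ensemble segmentation step mentioned above (not the authors' code), the sketch below averages per-slice class probabilities from several 2D segmentation models over a 3D volume and takes the arg-max label; the model interface and toy inputs are assumptions made for this example.

```python
# Illustrative sketch: combine an ensemble of 2D segmentation models by averaging
# their per-slice class probabilities over a 3D cardiac volume.

import numpy as np

def ensemble_segment(volume, models, num_classes):
    """volume: (D, H, W) array; each model maps an (H, W) slice to (num_classes, H, W) probabilities."""
    depth, height, width = volume.shape
    labels = np.zeros_like(volume, dtype=np.int64)
    for z in range(depth):
        probs = np.zeros((num_classes, height, width))
        for model in models:
            probs += model(volume[z])          # accumulate per-slice class probabilities
        labels[z] = probs.argmax(axis=0)       # average-probability decision per voxel
    return labels

# Toy stand-in "models" that return random probabilities, just to show the call pattern.
toy_models = [lambda s: np.random.dirichlet(np.ones(4), size=s.shape).transpose(2, 0, 1)
              for _ in range(3)]
print(ensemble_segment(np.zeros((8, 64, 64)), toy_models, num_classes=4).shape)
```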
  4. In the domains of dataset construction and crowdsourcing, a notable challenge is to aggregate labels from a heterogeneous set of labelers, each of whom is potentially an expert in some subset of tasks (and less reliable in others). To reduce the costs of hiring human labelers or training automated labeling systems, it is of interest to minimize the number of labelers while ensuring the reliability of the resulting dataset. We model this as the problem of performing K-class classification using the predictions of smaller classifiers, each trained on a subset of [K], and derive bounds on the number of classifiers needed to accurately infer the true class of an unlabeled sample under both adversarial and stochastic assumptions. By exploiting a connection to the classical set cover problem, we produce a near-optimal scheme for designing such configurations of classifiers, which recovers the well-known one-vs.-one classification approach as a special case. Experiments with the MNIST and CIFAR-10 datasets demonstrate the favorable accuracy (compared to a centralized classifier) of our aggregation scheme applied to classifiers trained on subsets of the data. These results suggest a new way to automatically label data or adapt an existing set of local classifiers to larger-scale multiclass problems.
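    A minimal sketch of the aggregation idea, under simplifying assumptions of my own (each subset classifier is perfect when the true class lies in its subset, and aggregation is plain majority voting): with all size-2 subsets the configuration reduces to one-vs.-one classification.

```python
# Illustrative sketch: aggregate predictions from small classifiers, each
# responsible only for a subset of the K classes, by majority vote.

from collections import Counter
from itertools import combinations

def aggregate(subset_predictions):
    """subset_predictions: list of (subset, predicted_class) pairs; returns the majority class."""
    votes = Counter(pred for subset, pred in subset_predictions)
    return votes.most_common(1)[0][0]

# Toy example with K = 4 classes and one-vs.-one (size-2) subsets. Pretend each
# pairwise classifier is correct when the true class is in its subset, and guesses otherwise.
K, true_class = 4, 2
predictions = []
for subset in combinations(range(K), 2):
    pred = true_class if true_class in subset else subset[0]   # hypothetical behaviour
    predictions.append((subset, pred))

print(aggregate(predictions))   # 2 -- the true class wins the vote
```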
  5. Activity recognition is central to many motion analysis applications ranging from health assessment to gaming. However, the need to obtain sufficiently large amounts of labeled data has limited the development of personalized activity recognition models. Semi-supervised learning has traditionally been a promising approach in many application domains to alleviate reliance on large amounts of labeled data by learning the label information from a small set of seed labels. Nonetheless, existing approaches perform poorly in highly dynamic settings, such as wearable systems, because some algorithms rely on predefined hyper-parameters or distribution models that need to be tuned for each user or context. To address these challenges, we introduce LabelForest, a novel non-parametric semi-supervised learning framework for activity recognition. LabelForest has two algorithms at its core: (1) a spanning forest algorithm for sample selection and label inference; and (2) a silhouette-based filtering method to finalize label augmentation for machine learning model training. Our thorough analysis on three human activity datasets demonstrates that LabelForest achieves a labeling accuracy of 90.1% in the presence of a skewed label distribution in the seed data. Compared to self-training and other sequential learning algorithms, LabelForest achieves up to 56.9% and 175.3% improvement in accuracy on balanced and unbalanced seed data, respectively.
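    A heavily simplified sketch in the spirit of the two components described above (not the LabelForest implementation): labels grow outward from seed samples along a minimum spanning forest obtained by cutting unusually long edges, and augmented labels with low silhouette scores are discarded. The distance metric, edge-cut rule, and threshold are assumptions made for this example.

```python
# Illustrative sketch: spanning-forest label propagation from seeds, followed by
# silhouette-based filtering of the augmented labels.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, breadth_first_order
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_samples

def label_forest_sketch(X, seed_idx, seed_labels, min_silhouette=0.1):
    dist = cdist(X, X)
    mst = minimum_spanning_tree(dist).toarray()
    edge_weights = mst[mst > 0]
    mst[mst > edge_weights.mean() + edge_weights.std()] = 0   # cut long edges -> forest
    mst = mst + mst.T                                          # symmetric adjacency
    labels = np.full(len(X), -1)
    labels[seed_idx] = seed_labels
    for i, lab in zip(seed_idx, seed_labels):                  # propagate within each tree
        order, _ = breadth_first_order(mst, i, directed=False, return_predecessors=True)
        for node in order:
            if labels[node] == -1:
                labels[node] = lab
    keep = labels != -1
    scores = silhouette_samples(X[keep], labels[keep])
    labels[np.where(keep)[0][scores < min_silhouette]] = -1    # drop low-confidence labels
    return labels

# Two well-separated toy clusters with one seed label each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
print(label_forest_sketch(X, seed_idx=[0, 20], seed_labels=[0, 1]))
```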