Abstract
Background: Analysing kinematic and video data can help identify potentially erroneous motions that lead to sub-optimal surgeon performance and safety-critical events in robot-assisted surgery.
Methods: We develop a rubric for identifying task- and gesture-specific executional and procedural errors and evaluate dry-lab demonstrations of suturing and needle-passing tasks from the JIGSAWS dataset. We characterise erroneous parts of demonstrations by labelling video data, and use distribution similarity analysis and trajectory averaging on kinematic data to identify parameters that distinguish erroneous gestures.
Results: Executional error frequency varies by task and gesture, and correlates with skill level. Some predominant error modes in each gesture are distinguishable by analysing error-specific kinematic parameters. Procedural errors could lead to lower performance scores and increased demonstration times, but also depend on surgical style.
Conclusions: This study provides insights into context-dependent errors that can be used to design automated error detection mechanisms and improve training and skill assessment.
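As a rough illustration of the distribution similarity analysis described above, the sketch below compares one kinematic parameter between normal and erroneous executions of a gesture using a two-sample Kolmogorov-Smirnov test. The parameter name, toy data, and choice of test statistic are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch: compare one kinematic parameter between normal and
# erroneous executions of a gesture. The KS two-sample test stands in for
# "distribution similarity analysis"; the data and test choice are
# illustrative, not the paper's exact method.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Toy stand-ins for, e.g., gripper-angle samples from labeled video segments.
normal_samples = rng.normal(loc=0.0, scale=1.0, size=500)
error_samples = rng.normal(loc=0.6, scale=1.3, size=120)

result = ks_2samp(normal_samples, error_samples)
print(f"KS statistic={result.statistic:.3f}, p={result.pvalue:.3g}")
# A large statistic (small p-value) suggests this parameter helps
# distinguish erroneous executions of the gesture.
```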
COMPASS: a formal framework and aggregate dataset for generalized surgical procedure modeling
Purpose: We propose a formal framework for the modeling and segmentation of minimally invasive surgical tasks using a unified set of motion primitives (MPs) to enable more objective labeling and the aggregation of different datasets.
Methods: We model dry-lab surgical tasks as finite state machines, representing how the execution of MPs as the basic surgical actions results in the change of surgical context, which characterizes the physical interactions among tools and objects in the surgical environment. We develop methods for labeling surgical context based on video data and for automatic translation of context to MP labels. We then use our framework to create the COntext and Motion Primitive Aggregate Surgical Set (COMPASS), including six dry-lab surgical tasks from three publicly available datasets (JIGSAWS, DESK, and ROSMA), with kinematic and video data and context and MP labels.
Results: Our context labeling method achieves near-perfect agreement between consensus labels from crowd-sourcing and expert surgeons. Segmentation of tasks into MPs results in the creation of the COMPASS dataset, which nearly triples the amount of data for modeling and analysis and enables the generation of separate transcripts for the left and right tools.
Conclusion: The proposed framework results in high-quality labeling of surgical data based on context and fine-grained MPs. Modeling surgical tasks with MPs enables the aggregation of different datasets and the separate analysis of left and right hands for bimanual coordination assessment. Our formal framework and aggregate dataset can support the development of explainable and multi-granularity models for improved surgical process analysis, skill assessment, error detection, and autonomy.
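A minimal sketch of the finite-state-machine view described above: surgical context (the physical interactions among tools and objects) changes only when a motion primitive executes, so each observed context transition can be translated to an MP label. The context encoding, states, and transition rules below are invented for illustration and are not the actual COMPASS definitions.

```python
# Hypothetical sketch of context-to-motion-primitive translation. Context is
# modeled as a small dict of tool/object relations; each context transition
# maps to a motion primitive (MP) label. States and rules are illustrative.

def translate(prev_ctx, ctx):
    """Map one context transition to an MP label (None if no change)."""
    if prev_ctx == ctx:
        return None
    if not prev_ctx["grasped"] and ctx["grasped"]:
        return "Grasp(needle)"
    if prev_ctx["grasped"] and not ctx["grasped"]:
        return "Release(needle)"
    if ctx["in_tissue"] and not prev_ctx["in_tissue"]:
        return "Push(needle, tissue)"
    return "Touch"  # fallback for unmodeled interactions

# Context transcript for one tool, e.g., labeled frame by frame from video.
contexts = [
    {"grasped": False, "in_tissue": False},
    {"grasped": True,  "in_tissue": False},
    {"grasped": True,  "in_tissue": True},
    {"grasped": False, "in_tissue": True},
]

mps = [translate(a, b) for a, b in zip(contexts, contexts[1:])]
print(mps)  # ['Grasp(needle)', 'Push(needle, tissue)', 'Release(needle)']
```

Keeping one such transcript per tool is what enables the separate left- and right-hand analyses the abstract mentions.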
- PAR ID: 10411983
- Journal Name: International Journal of Computer Assisted Radiology and Surgery
- ISSN: 1861-6429
- Sponsoring Org: National Science Foundation
More Like this
Stormwater runoff microplastics: Polymer types, particle size, and factors controlling loading rates
Stormwater runoff is a pathway of entry for microplastics (MPs, plastics <5 mm) into aquatic ecosystems. The objectives of this study were to determine MP size, morphology, chemical composition, and loading across urban storm events. Particles were extracted from stormwater samples collected at outfall locations using wet peroxide oxidation and cellulose digestion, followed by analysis via attenuated total reflectance (ATR) FTIR. Observed concentrations were 0.99 ± 1.10 MP/L for the 500–1000 μm size range and 0.41 ± 0.30 MP/L for the 1000–5000 μm size range. Seventeen different polymer types were observed. MP particle sizes measured using an FTIR-microscope camera indicated non-target-size particles based on sieve-size classification, highlighting a potential source of error in studies reporting concentration by size class. A maximum MP load of 38.3 MP/m² of upstream catchment was calculated. MP loadings had moderate correlations with both rainfall accumulation and intensity (Kendall τ = 0.54 and 0.42, respectively, both p ≤ 0.005). First flush (i.e. rapid wash-off of pollutants from watershed surfaces during the early stages of rainfall) was not always observed, and antecedent dry days were not correlated with MP abundance, likely due to the short dry periods between sampling events. Overall, the results presented provide data for risk assessment and mitigation strategies.
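The two calculations in this abstract, per-event MP load normalized by catchment area and Kendall's τ against rainfall, can be reproduced in form (not in data) with a short script. All numbers below are made-up placeholders, and scipy's kendalltau is one common implementation of the statistic.

```python
# Hypothetical sketch: area-normalized MP load per storm event, and
# Kendall's tau between loads and rainfall. Numbers are placeholders,
# not the study's measurements.
from scipy.stats import kendalltau

def mp_load_per_area(conc_mp_per_L, runoff_volume_L, catchment_m2):
    """MP load per m^2 of upstream catchment for one storm event."""
    return conc_mp_per_L * runoff_volume_L / catchment_m2

events = [(0.5, 6.0e4), (1.2, 2.1e5), (0.8, 9.5e4), (2.0, 1.9e5)]
loads = [mp_load_per_area(c, v, 1.0e4) for c, v in events]
rain_mm = [2.0, 15.5, 9.1, 22.8]  # per-event rainfall accumulation

tau, p = kendalltau(loads, rain_mm)
print(f"loads={loads}")
print(f"Kendall tau={tau:.2f}, p={p:.3f}")
```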
Computational fluid dynamics (CFD) modeling of left ventricle (LV) flow combined with patient medical imaging data has shown great potential in obtaining patient-specific hemodynamics information for functional assessment of the heart. A typical model construction pipeline usually starts with segmentation of the LV by manual delineation, followed by mesh generation and registration techniques using separate software tools. However, such approaches usually require significant time and human effort in the model generation process, limiting large-scale analysis. In this study, we propose an approach toward fully automating the model generation process for CFD simulation of LV flow to significantly reduce LV CFD model generation time. Our modeling framework leverages a novel combination of techniques including deep-learning-based segmentation, geometry processing, and image registration to reliably reconstruct CFD-suitable LV models with little-to-no user intervention. We utilized an ensemble of two-dimensional (2D) convolutional neural networks (CNNs) for automatic segmentation of cardiac structures from three-dimensional (3D) patient images, and our segmentation approach outperformed recent state-of-the-art segmentation techniques when evaluated on benchmark data containing both magnetic resonance (MR) and computed tomography (CT) cardiac scans. We demonstrate that, through a combination of segmentation and geometry processing, we were able to robustly create CFD-suitable LV meshes from segmentations for 78 out of 80 test cases. Although the focus of this study is on image-to-mesh generation, we demonstrate the feasibility of this framework in supporting LV hemodynamics modeling by performing CFD simulations on two representative time-resolved patient-specific image datasets.
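One plausible reading of the ensemble step described above is per-pixel averaging of softmax probabilities across several 2D CNN segmenters before taking the argmax. The sketch below (PyTorch assumed; the paper's actual framework, architecture, and weights may differ) shows that pattern with a placeholder network.

```python
# Hypothetical sketch of ensembling 2D CNN segmenters (PyTorch assumed; the
# actual architecture and weights in the paper may differ).
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Placeholder 2D CNN producing per-pixel class logits."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, n_classes, 1),
        )

    def forward(self, x):
        return self.net(x)

def ensemble_segment(models, image):
    """Average softmax probabilities over models, then take per-pixel argmax."""
    with torch.no_grad():
        probs = torch.stack([m(image).softmax(dim=1) for m in models]).mean(0)
    return probs.argmax(dim=1)  # (batch, H, W) label map

models = [TinySegNet().eval() for _ in range(3)]  # stand-in ensemble
slice_2d = torch.randn(1, 1, 128, 128)            # one MR/CT slice
labels = ensemble_segment(models, slice_2d)
print(labels.shape)  # torch.Size([1, 128, 128])
```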
Convolutional Neural Network (CNN) based image segmentation has made great progress in recent years. However, video object segmentation remains a challenging task due to its high computational complexity. Most of the previous methods employ a two-stream CNN framework to handle spatial and motion features separately. In this paper, we propose an end-to-end encoder-decoder style 3D CNN to aggregate spatial and temporal information simultaneously for video object segmentation. To efficiently process video, we propose 3D separable convolution for the pyramid pooling module and decoder, which dramatically reduces the number of operations while maintaining performance. Moreover, we also extend our framework to video action segmentation by adding an extra classifier to predict the action label for actors in videos. Extensive experiments on several video datasets demonstrate the superior performance of the proposed approach for action and object segmentation compared to the state-of-the-art.
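"3D separable convolution" is commonly realized by factorizing a full k×k×k convolution into a spatial (1, k, k) convolution followed by a temporal (k, 1, 1) convolution; the sketch below shows that factorization in PyTorch. This is one plausible realization, not necessarily the paper's exact module (another common reading is depthwise plus pointwise separation).

```python
# Hypothetical sketch of a 3D separable convolution: a full k*k*k conv
# factorized into a spatial (1,k,k) conv followed by a temporal (k,1,1) conv.
import torch
import torch.nn as nn

class Separable3dConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, k, k),
                                 padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch, (k, 1, 1),
                                  padding=(k // 2, 0, 0))

    def forward(self, x):  # x: (batch, channels, time, H, W)
        return self.temporal(torch.relu(self.spatial(x)))

x = torch.randn(1, 8, 16, 32, 32)
y = Separable3dConv(8, 16)(x)
print(y.shape)  # torch.Size([1, 16, 16, 32, 32])
# Weight count: a full 3x3x3 conv needs 27*in*out weights; the factorized
# version needs 9*in*out + 3*out*out, the kind of operation reduction the
# abstract describes.
```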
Advances in visual perceptual tasks have been mainly driven by the amount, and types, of annotations of large-scale datasets. Researchers have focused on fully-supervised settings to train models using offline epoch-based schemes. Despite the evident advancements, limitations and cost of manually annotated datasets have hindered further development for event perceptual tasks, such as detection and localization of objects and events in videos. The problem is more apparent in zoological applications due to the scarcity of annotations and the length of videos; most videos are at most ten minutes long. Inspired by cognitive theories, we present a self-supervised perceptual prediction framework to tackle the problem of temporal event segmentation by building a stable representation of event-related objects. The approach is simple but effective. We rely on LSTM predictions of high-level features computed by a standard deep learning backbone. For spatial segmentation, the stable representation of the object is used by an attention mechanism to filter the input features before the prediction step. The self-learned attention maps effectively localize the object as a side effect of perceptual prediction. We demonstrate our approach on long videos from continuous wildlife video monitoring, spanning multiple days at 25 FPS. We aim to facilitate automated ethogramming by detecting and localizing events without the need for labels. Our approach is trained in an online manner on streaming input and requires only a single pass through the video, with no separate training set. Given the lack of long and realistic datasets (i.e., those that include real-world challenges), we introduce a new wildlife video dataset, nest monitoring of the Kagu (a flightless bird from New Caledonia), to benchmark our approach. Our dataset features video from 10 days (over 23 million frames) of continuous monitoring of the Kagu in its natural habitat. We annotate every frame with bounding boxes and event labels. Additionally, each frame is annotated with time-of-day and illumination conditions. We will make the dataset, which is the first of its kind, and the code available to the research community. We find that the approach significantly outperforms other self-supervised, traditional (e.g., Optical Flow, Background Subtraction), and NN-based (e.g., PA-DPC, DINO, iBOT) baselines, and performs on par with supervised boundary detection approaches (i.e., PC). At a recall rate of 80%, our best performing model detects one false positive activity every 50 min of training. On average, we at least double the performance of self-supervised approaches for spatial segmentation. Additionally, we show that our approach is robust to various environmental conditions (e.g., moving shadows). We also benchmark the framework on other datasets (i.e., Kinetics-GEBD, TAPOS) from different domains to demonstrate its generalizability. The data and code are available on our project page: https://aix.eng.usf.edu/research_automated_ethogramming.html
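A minimal sketch of the perceptual-prediction idea described above: an LSTM predicts the next frame's backbone features, and spikes in prediction error mark candidate event boundaries. The backbone stand-in, feature size, and boundary threshold are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical sketch of self-supervised perceptual prediction for temporal
# event segmentation: an LSTM predicts the next frame's high-level features,
# and large prediction error signals a candidate event boundary.
import torch
import torch.nn as nn

feat_dim, hidden = 512, 256
lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
head = nn.Linear(hidden, feat_dim)  # maps LSTM state to predicted features
loss_fn = nn.MSELoss(reduction="none")

# Stand-in for per-frame features from a frozen CNN backbone: (1, T, feat_dim).
features = torch.randn(1, 120, feat_dim)

out, _ = lstm(features[:, :-1])  # predict features at t+1 from frames <= t
pred = head(out)
err = loss_fn(pred, features[:, 1:]).mean(dim=-1).squeeze(0)  # per-frame error

# Frames whose error jumps well above the mean are candidate boundaries;
# the online-training version would also update the LSTM on this error.
threshold = err.mean() + 2 * err.std()
boundaries = (err > threshold).nonzero(as_tuple=True)[0]
print(boundaries.tolist())
```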