

Title: Analysis of executional and procedural errors in dry‐lab robotic surgery experiments
Abstract

Background

Analysing kinematic and video data can help identify potentially erroneous motions that lead to sub‐optimal surgeon performance and safety‐critical events in robot‐assisted surgery.

Methods

We develop a rubric for identifying task and gesture‐specific executional and procedural errors and evaluate dry‐lab demonstrations of suturing and needle passing tasks from the JIGSAWS dataset. We characterise erroneous parts of demonstrations by labelling video data, and use distribution similarity analysis and trajectory averaging on kinematic data to identify parameters that distinguish erroneous gestures.
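
As a rough illustration of the kind of analysis described above, the sketch below compares the distribution of a single kinematic parameter between normal and erroneous instances of a gesture and computes an average trajectory. The choice of a two-sample Kolmogorov–Smirnov test and the parameter names are illustrative assumptions, not the exact procedure used in the paper.

```python
# Minimal sketch of distribution similarity analysis and trajectory averaging
# for one kinematic parameter (e.g., gripper angle) of a gesture.
# The KS statistic and variable names are illustrative assumptions.
import numpy as np
from scipy import stats

def ks_similarity(normal_values, erroneous_values):
    """Two-sample KS test on one kinematic parameter across gesture instances."""
    result = stats.ks_2samp(normal_values, erroneous_values)
    return result.statistic, result.pvalue

def average_trajectory(segments, n_points=100):
    """Resample variable-length 1-D gesture segments to a common length and average."""
    resampled = [
        np.interp(np.linspace(0, 1, n_points),
                  np.linspace(0, 1, len(seg)), seg)
        for seg in segments
    ]
    return np.mean(resampled, axis=0)

# Example: flag a parameter as distinguishing if its distributions differ
# significantly between normal and erroneous instances of a gesture.
# stat, p = ks_similarity(normal_gripper_angle, erroneous_gripper_angle)
# if p < 0.05: print("gripper angle separates normal vs. erroneous gestures")
```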

Results

Executional error frequency varies by task and gesture, and correlates with skill level. Some predominant error modes in each gesture are distinguishable by analysing error‐specific kinematic parameters. Procedural errors could lead to lower performance scores and increased demonstration times but also depend on surgical style.

Conclusions

This study provides insights into context‐dependent errors that can be used to design automated error detection mechanisms and improve training and skill assessment.

 
Award ID(s): 1829004
NSF-PAR ID: 10446454
Publisher / Repository: Wiley Blackwell (John Wiley & Sons)
Journal Name: The International Journal of Medical Robotics and Computer Assisted Surgery
Volume: 18
Issue: 3
ISSN: 1478-5951
Sponsoring Org: National Science Foundation
More Like this
  1. Despite significant developments in the design of surgical robots and automated techniques for objective evaluation of surgical skills, there are still challenges in ensuring safety in robot-assisted minimally-invasive surgery (RMIS). This paper presents a runtime monitoring system for the detection of executional errors during surgical tasks through the analysis of kinematic data. The proposed system incorporates dual Siamese neural networks and knowledge of surgical context, including surgical tasks and gestures, their distributional similarities, and common error modes, to learn the differences between normal and erroneous surgical trajectories from small training datasets. We evaluate the performance of error detection using Siamese networks compared to single CNN and LSTM networks trained with different levels of contextual knowledge and training data, using the dry-lab demonstrations of the Suturing and Needle Passing tasks from the JIGSAWS dataset. Our results show that gesture-specific, task-nonspecific Siamese networks obtain micro F1 scores of 0.94 (Siamese-CNN) and 0.95 (Siamese-LSTM), and perform better than single CNN (0.86) and LSTM (0.87) networks. These Siamese networks also outperform gesture-nonspecific, task-specific Siamese-CNN and Siamese-LSTM models for Suturing and Needle Passing.
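
    The sketch below outlines one plausible form of such a Siamese model: an LSTM encoder applied to two kinematic trajectory windows, trained with a contrastive loss so that windows from the same class (normal or erroneous) map close together. Layer sizes, the input feature count, and the loss are illustrative assumptions rather than the architecture reported in the paper.

```python
# Minimal sketch of a Siamese LSTM that scores whether two kinematic
# trajectory windows come from the same class (normal vs. erroneous).
# Layer sizes, feature count, and the contrastive loss are assumptions.
import torch
import torch.nn as nn

class SiameseLSTM(nn.Module):
    def __init__(self, n_features=38, hidden=64, embed=32):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, embed)

    def embed_traj(self, x):          # x: (batch, time, n_features)
        _, (h, _) = self.encoder(x)   # final hidden state
        return self.head(h[-1])

    def forward(self, x1, x2):
        z1, z2 = self.embed_traj(x1), self.embed_traj(x2)
        return torch.norm(z1 - z2, dim=1)   # distance; small => same class

def contrastive_loss(distance, same_label, margin=1.0):
    # same_label is a float tensor: 1.0 when both windows share a label, 0.0 otherwise
    pos = same_label * distance.pow(2)
    neg = (1 - same_label) * torch.clamp(margin - distance, min=0).pow(2)
    return (pos + neg).mean()
```
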
  2. Electron Backscatter Diffraction (EBSD) is a widely used approach for characterising the microstructure of various materials. However, it is difficult to accurately distinguish similar (body centred cubic and body centred tetragonal, with small tetragonality) phases in steels using standard EBSD software. One method to tackle the problem of phase distinction is to measure the tetragonality of the phases, which can be done using simulated patterns and cross‐correlation techniques to detect distortion away from a perfectly cubic crystal lattice. However, small errors in the determination of microscope geometry (the so‐called pattern or projection centre) can cause significant errors in tetragonality measurement and lead to erroneous results. This paper utilises a new approach for accurate pattern centre determination via a strain minimisation routine across a large number of grains in dual phase steels. Tetragonality maps are then produced and used to identify phase and estimate local carbon content. The technique is implemented using both kinetically simulated and dynamically simulated patterns to determine their relative accuracy. Tetragonality maps, and subsequent phase maps, based on dynamically simulated patterns in a point‐by‐point and grain average comparison are found to consistently produce more precise and accurate results, with close to 90% accuracy for grain phase identification, when compared with an image‐quality identification method. The error in tetragonality measurements appears to be of the order of 1%, thus producing a commensurate ∼0.2% error in carbon content estimation. Such an error makes the technique unsuitable for estimation of total carbon content of most commercial steels, which often have carbon levels below 0.1%. However, even in the DP steel for this study (0.1 wt.% carbon) it can be used to map carbon in regions with higher accumulation (such as in martensite with nonhomogeneous carbon content).

    Lay Description

    Electron Backscatter Diffraction (EBSD) is a widely used approach for characterising the microstructure of various materials. However, it is difficult to accurately distinguish similar (BCC and BCT) phases in steels using standard EBSD software due to the small difference in crystal structure. One method to tackle the problem of phase distinction is to measure the tetragonality, or apparent ‘strain’ in the crystal lattice, of the phases. This can be done by comparing experimental EBSD patterns with simulated patterns via cross‐correlation techniques, to detect distortion away from a perfectly cubic crystal lattice. However, small errors in the determination of microscope geometry (the so‐called pattern or projection centre) can cause significant errors in tetragonality measurement and lead to erroneous results. This paper utilises a new approach for accurate pattern centre determination via a strain minimisation routine across a large number of grains in dual phase steels. Tetragonality maps are then produced and used to identify phase and estimate local carbon content. The technique is implemented using both simple kinetically simulated and more complex dynamically simulated patterns to determine their relative accuracy. Tetragonality maps, and subsequent phase maps, based on dynamically simulated patterns in a point‐by‐point and grain average comparison are found to consistently produce more precise and accurate results, with close to 90% accuracy for grain phase identification, when compared with an image‐quality identification method. The error in tetragonality measurements appears to be of the order of 1%, thus producing a commensurate error in carbon content estimation. Such an error makes an estimate of total carbon content particularly unsuitable for low-carbon steels, although maps of local carbon content may still be revealing.

    Application of the method developed in this paper will lead to better understanding of the complex microstructures of steels, and the potential to design microstructures that deliver higher strength and ductility for common applications, such as vehicle components.
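
    A minimal sketch of the final mapping step, converting measured tetragonality into an estimated local carbon content, is given below. It uses the commonly quoted empirical relation for martensite, c/a ≈ 1 + 0.045 × (wt% C); the coefficient and the simple threshold used to label a point BCT are illustrative assumptions, not the calibration used in the paper.

```python
# Minimal sketch of mapping measured tetragonality (c/a) to an estimated
# local carbon content and a phase label. The empirical coefficient and
# the threshold are illustrative assumptions.
import numpy as np

CA_PER_WTPC = 0.045  # approximate change in c/a ratio per wt% carbon (empirical)

def carbon_from_tetragonality(c_over_a):
    """Estimate wt% carbon from a measured c/a ratio (per map point)."""
    return np.maximum(c_over_a - 1.0, 0.0) / CA_PER_WTPC

def classify_phase(c_over_a, threshold=1.005):
    """Label a point BCT (martensite-like) if tetragonality exceeds a threshold."""
    return np.where(c_over_a > threshold, "BCT", "BCC")

# With a ~1% error in c/a, the propagated carbon error is roughly
# 0.01 / 0.045 ≈ 0.2 wt%, consistent with the error estimate quoted above.
```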

     
  3. Abstract

    Importance

    The study highlights the potential of large language models, specifically GPT-3.5 and GPT-4, in processing complex clinical data and extracting meaningful information with minimal training data. By developing and refining prompt-based strategies, we can significantly enhance the models’ performance, making them viable tools for clinical NER tasks and possibly reducing the reliance on extensive annotated datasets.

    Objectives

    This study quantifies the capabilities of GPT-3.5 and GPT-4 for clinical named entity recognition (NER) tasks and proposes task-specific prompts to improve their performance.

    Materials and Methods

    We evaluated these models on 2 clinical NER tasks: (1) to extract medical problems, treatments, and tests from clinical notes in the MTSamples corpus, following the 2010 i2b2 concept extraction shared task, and (2) to identify nervous system disorder-related adverse events from safety reports in the vaccine adverse event reporting system (VAERS). To improve the GPT models' performance, we developed a clinical task-specific prompt framework that includes (1) baseline prompts with task description and format specification, (2) annotation guideline-based prompts, (3) error analysis-based instructions, and (4) annotated samples for few-shot learning. We assessed each prompt's effectiveness and compared the models to BioClinicalBERT.
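
    The sketch below shows one way the four prompt components could be assembled into a single prompt string. The wording of each component and the entity types are illustrative placeholders, not the exact prompts used in the study.

```python
# Minimal sketch of assembling the four prompt components described above.
# Component wording and entity types are illustrative placeholders.
def build_prompt(note_text,
                 guideline_excerpt="",
                 error_based_rules="",
                 fewshot_examples=()):
    parts = [
        # (1) baseline: task description and output format
        "Extract all medical problems, treatments, and tests from the "
        "clinical note below. Return one entity per line as <type>: <text>.",
    ]
    if guideline_excerpt:          # (2) annotation guideline-based prompt
        parts.append("Follow these annotation guidelines:\n" + guideline_excerpt)
    if error_based_rules:          # (3) instructions derived from error analysis
        parts.append("Avoid these common mistakes:\n" + error_based_rules)
    for example_note, example_output in fewshot_examples:  # (4) few-shot samples
        parts.append(f"Note:\n{example_note}\nEntities:\n{example_output}")
    parts.append(f"Note:\n{note_text}\nEntities:")
    return "\n\n".join(parts)
```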

    Results

    Using baseline prompts, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.634 and 0.804 for MTSamples and 0.301 and 0.593 for VAERS. Additional prompt components consistently improved model performance. When all 4 components were used, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.794 and 0.861 for MTSamples and 0.676 and 0.736 for VAERS, demonstrating the effectiveness of our prompt framework. Although these results trail BioClinicalBERT (F1 of 0.901 for the MTSamples dataset and 0.802 for VAERS), they are very promising considering that few training samples are needed.

    Discussion

    The study’s findings suggest a promising direction in leveraging LLMs for clinical NER tasks. However, while the performance of GPT models improved with task-specific prompts, there's a need for further development and refinement. LLMs like GPT-4 show potential in achieving close performance to state-of-the-art models like BioClinicalBERT, but they still require careful prompt engineering and understanding of task-specific knowledge. The study also underscores the importance of evaluation schemas that accurately reflect the capabilities and performance of LLMs in clinical settings.

    Conclusion

    While direct application of GPT models to clinical NER tasks falls short of optimal performance, our task-specific prompt framework, incorporating medical knowledge and training samples, significantly enhances GPT models' feasibility for potential clinical applications.

     
  4. Abstract

    Snowpack provides the majority of predictive information for water supply forecasts (WSFs) in snow-dominated basins across the western United States. Drought conditions typically accompany decreased snowpack and lowered runoff efficiency, negatively impacting WSFs. Here, we investigate the relationship between snow water equivalent (SWE) and April–July streamflow volume (AMJJ-V) during drought in small headwater catchments, using observations from 31 USGS streamflow gauges and 54 SNOTEL stations. A linear regression approach is used to evaluate forecast skill under different historical climatologies used for model fitting, as well as with different forecast dates. Experiments are constructed in which extreme hydrological drought years are withheld from model training, that is, years with AMJJ-V below the 15th percentile. Subsets of the remaining years are used for model fitting to understand how the climatology of different training subsets impacts forecasts of extreme drought years. We generally report overprediction in drought years. However, training the forecast model on drier years, that is, below-median years in the (P15, P57.5] percentile range, minimizes residuals by an average of 10% in drought year forecasts, relative to a baseline case, with the highest median skill obtained in mid- to late April for colder regions. We report similar findings using a modified Natural Resources Conservation Service (NRCS) procedure in nine large basins of the Upper Colorado River basin (UCRB), highlighting the importance of the snowpack–streamflow relationship in streamflow predictability. We propose an “adaptive sampling” approach of dynamically selecting training years based on antecedent SWE conditions, showing error reductions of up to 20% in historical drought years relative to the period of record. These alternate training protocols provide opportunities for addressing the challenges of future drought risk to water supply planning.
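
    The sketch below illustrates the basic regression forecast and the “adaptive sampling” idea of selecting training years by antecedent SWE. The nearest-neighbour selection rule and variable names are illustrative assumptions, not the exact protocol used in the study.

```python
# Minimal sketch of an SWE-based water supply forecast with adaptive
# selection of training years. Variable names and the nearest-neighbour
# rule are illustrative assumptions.
import numpy as np

def fit_wsf(swe_train, volume_train):
    """Least-squares fit of April-July streamflow volume on peak SWE."""
    slope, intercept = np.polyfit(swe_train, volume_train, deg=1)
    return slope, intercept

def adaptive_training_years(swe_history, swe_current, k=15):
    """Pick the k historical years whose SWE is closest to this year's SWE."""
    order = np.argsort(np.abs(np.asarray(swe_history) - swe_current))
    return order[:k]

# Forecast for a drought year: train only on climatologically similar years.
# idx = adaptive_training_years(swe_all_years, swe_this_year)
# slope, intercept = fit_wsf(swe_all_years[idx], volume_all_years[idx])
# forecast = slope * swe_this_year + intercept
```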

    Significance Statement

    Seasonal water supply forecasts based on the relationship between peak snowpack and water supply exhibit unique errors in drought years due to low snow and streamflow variability, presenting a major challenge for water supply prediction. Here, we assess the reliability of snow-based streamflow predictability in drought years using a fixed forecast date or fixed model training period. We critically evaluate different training protocols to assess predictive performance and identify sources of error during historical drought years. We also propose and test an “adaptive sampling” application that dynamically selects training years based on antecedent SWE conditions, helping to overcome persistent errors and providing new insights and strategies for snow-guided forecasts.

     
  5. Abstract

    Introduction

    Remote military operations require rapid response times for effective relief and critical care. Yet, the military theater is under austere conditions, so communication links are unreliable and subject to physical and virtual attacks and degradation at unpredictable times. Immediate medical care at these austere locations requires semi-autonomous teleoperated systems, which enable the completion of medical procedures even under interrupted networks while isolating the medics from the dangers of the battlefield. However, to achieve autonomy for complex surgical and critical care procedures, robots require extensive programming or massive libraries of surgical skill demonstrations to learn effective policies using machine learning algorithms. Although such datasets are achievable for simple tasks, providing a large number of demonstrations for surgical maneuvers is not practical. This article presents a method for learning from demonstration, combining knowledge from demonstrations to eliminate reward shaping in reinforcement learning (RL). In addition to reducing the data required for training, the self-supervised nature of RL, in conjunction with expert knowledge-driven rewards, produces more generalizable policies tolerant to dynamic environment changes. A multimodal representation for interaction enables learning complex contact-rich surgical maneuvers. The effectiveness of the approach is shown using the cricothyroidotomy task, as it is a standard procedure seen in critical care to open the airway. In addition, we also provide a method for segmenting the teleoperator’s demonstration into subtasks and classifying the subtasks using sequence modeling.

    Materials and Methods

    A database of demonstrations for the cricothyroidotomy task was collected, comprising six fundamental maneuvers referred to as surgemes. The dataset was collected by teleoperating a collaborative robotic platform—SuperBaxter, with modified surgical grippers. Then, two learning models are developed for processing the dataset—one for automatic segmentation of the task demonstrations into a sequence of surgemes and the second for classifying each segment into labeled surgemes. Finally, a multimodal off-policy RL with rewards learned from demonstrations was developed to learn the surgeme execution from these demonstrations.
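
    The sketch below illustrates the two-stage pipeline in simplified form: splitting a demonstration into candidate surgeme segments and classifying each segment with a sequence model. The velocity-based boundary heuristic and the LSTM classifier are illustrative assumptions, not the models reported in the article.

```python
# Minimal sketch of segmenting a teleoperated demonstration into surgemes
# and classifying each segment. The boundary heuristic and classifier are
# illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn

def segment_by_velocity(traj, threshold=0.02, min_len=20):
    """Cut the demonstration where the end-effector transitions into a pause."""
    speed = np.linalg.norm(np.diff(traj, axis=0), axis=1)   # traj: (T, 3) positions
    pause = speed < threshold
    boundaries = [0] + [i for i in range(1, len(pause))
                        if pause[i] and not pause[i - 1]] + [len(traj)]
    return [traj[a:b] for a, b in zip(boundaries[:-1], boundaries[1:])
            if b - a >= min_len]

class SurgemeClassifier(nn.Module):
    """LSTM over multimodal interaction features; one label per segment."""
    def __init__(self, n_features, n_surgemes=6, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_surgemes)

    def forward(self, x):                 # x: (batch, time, n_features)
        _, (h, _) = self.lstm(x)
        return self.out(h[-1])            # logits over the six surgemes
```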

    Results

    The task segmentation model has an accuracy of 98.2%. The surgeme classification model using the proposed interaction features achieved a classification accuracy of 96.25% averaged across all surgemes compared to 87.08% without these features and 85.4% using a support vector machine classifier. Finally, the robot execution achieved a task success rate of 93.5% compared to baselines of behavioral cloning (78.3%) and a twin-delayed deep deterministic policy gradient with shaped rewards (82.6%).

    Conclusions

    Results indicate that the proposed interaction features for the segmentation and classification of surgical tasks improve classification accuracy. The proposed method for learning surgemes from demonstrations exceeds popular methods for skill learning. The effectiveness of the proposed approach demonstrates the potential for future remote telemedicine on battlefields.

     