Title: Deep Learning for Autonomous Surgical Guidance Using 3‐Dimensional Images From Forward‐Viewing Endoscopic Optical Coherence Tomography
ABSTRACT A three‐dimensional convolutional neural network (3D‐CNN) was developed for the analysis of volumetric optical coherence tomography (OCT) images to enhance endoscopic guidance during percutaneous nephrostomy. The model was benchmarked using a 10‐fold nested cross‐validation procedure and achieved an average test accuracy of 90.57% across a dataset of 10 porcine kidneys. This performance significantly exceeded that of 2D‐CNN models, which attained average test accuracies ranging from 85.63% to 88.22% using 1, 10, or 100 radial sections extracted from the 3D OCT volumes. The 3D‐CNN (~12 million parameters) was also benchmarked against three state‐of‐the‐art volumetric architectures: the 3D Vision Transformer (3D‐ViT, ~45 million parameters), 3D‐DenseNet121 (~12 million parameters), and the Multi‐plane and Multi‐slice Transformer (M3T, ~29 million parameters). While these models achieved comparable inference accuracy, the 3D‐CNN exhibited lower inference latency (33 ms) than 3D‐ViT (86 ms), 3D‐DenseNet121 (58 ms), and M3T (93 ms), a critical advantage for real‐time surgical guidance applications. These results demonstrate the 3D‐CNN's capability as a powerful and practical tool for computer‐aided diagnosis in OCT‐guided surgical interventions.
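The article itself does not include code, but the workflow described above, classifying whole OCT volumes with a compact 3D-CNN and timing single-volume inference, can be sketched in a few lines of PyTorch. The layer widths, the 64x64x64 input size, and the three-class output below are illustrative assumptions, not the published ~12-million-parameter architecture.

# Illustrative sketch only: a small 3D-CNN classifier for volumetric OCT data,
# with a simple wall-clock latency measurement. Layer widths, input size
# (1 x 64 x 64 x 64), and the three output classes are assumptions, not the
# authors' published model.
import time
import torch
import torch.nn as nn

class Small3DCNN(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1), nn.BatchNorm3d(32), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.BatchNorm3d(64), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.BatchNorm3d(128), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = Small3DCNN().eval()
volume = torch.randn(1, 1, 64, 64, 64)          # one OCT volume (batch, channel, D, H, W)
with torch.no_grad():
    for _ in range(5):                          # warm-up iterations before timing
        model(volume)
    start = time.perf_counter()
    logits = model(volume)
    latency_ms = (time.perf_counter() - start) * 1000
print(f"predicted class {logits.argmax(1).item()}, latency {latency_ms:.1f} ms")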
Award ID(s):
2238648; 2132161
PAR ID:
10621438
Publisher / Repository:
Wiley‑VCH Verlag GmbH & Co
Date Published:
Journal Name:
Journal of Biophotonics
ISSN:
1864-063X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Vision transformers (ViTs) have dominated computer vision in recent years. However, ViTs are computationally expensive and not well suited for mobile devices, which has led to the prevalence of convolutional neural network (CNN) and ViT-based hybrid models for mobile vision applications. Recently, Vision GNN (ViG) and CNN hybrid models have also been proposed for mobile vision tasks. However, all of these methods remain slower than pure CNN-based models. In this work, we propose Multi-Level Dilated Convolutions to devise a purely CNN-based mobile backbone. Using Multi-Level Dilated Convolutions allows for a larger theoretical receptive field than standard convolutions. Different levels of dilation also allow for interactions between the short-range and long-range features in an image. Experiments show that our proposed model outperforms state-of-the-art (SOTA) mobile CNN, ViT, ViG, and hybrid architectures in terms of accuracy and/or speed on image classification, object detection, instance segmentation, and semantic segmentation. Our fastest model, RapidNet-Ti, achieves 76.3% top-1 accuracy on ImageNet-1K with 0.9 ms inference latency on an iPhone 13 mini NPU, which is faster and more accurate than MobileNetV2x1.4 (74.7% top-1 with 1.0 ms latency). Our work shows that pure CNN architectures can beat SOTA hybrid and ViT models in terms of accuracy and speed when designed properly.
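RapidNet's exact block design is not reproduced in this summary; the PyTorch sketch below only illustrates the general idea of multi-level dilated convolutions: parallel depthwise 3x3 convolutions with different dilation rates whose outputs are fused by a pointwise convolution. The channel counts, dilation rates, and residual fusion are assumptions for illustration.

# Hedged sketch of a multi-level dilated convolution block: parallel depthwise
# 3x3 convolutions with different dilation rates enlarge the receptive field
# while staying purely convolutional. The exact RapidNet block layout is not
# claimed here; channel split and dilation rates are illustrative.
import torch
import torch.nn as nn

class MultiLevelDilatedBlock(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d,
                      groups=channels, bias=False)   # depthwise, one dilation level per branch
            for d in dilations
        ])
        self.fuse = nn.Sequential(                   # pointwise fusion of all dilation levels
            nn.Conv2d(channels * len(dilations), channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )

    def forward(self, x):
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.fuse(out)                    # residual connection keeps short-range detail

block = MultiLevelDilatedBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)       # torch.Size([1, 64, 56, 56])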
  2. Abstract Understanding three‐dimensional (3D) root traits is essential to improve water uptake, increase nitrogen capture, and raise carbon sequestration from the atmosphere. However, quantifying 3D root traits by reconstructing 3D root models for deeper field‐grown roots remains a challenge due to the unknown tradeoff between 3D root‐model quality and 3D root‐trait accuracy. Therefore, we performed two computational experiments. We first compared the 3D model quality generated by five state‐of‐the‐art open‐source 3D model reconstruction pipelines on 12 contrasting genotypes of field‐grown maize roots. These pipelines included COLMAP, COLMAP+PMVS (Patch‐based Multi‐View Stereo), VisualSFM, Meshroom, and OpenMVG+MVE (Multi‐View Environment). The COLMAP pipeline achieved the best performance regarding 3D model quality versus computational time and number of images needed. In the second test, we compared the accuracy of 3D root‐trait measurements generated by the Digital Imaging of Root Traits 3D pipeline (DIRT/3D) using COLMAP‐based 3D reconstruction with our current DIRT/3D pipeline, which uses a VisualSFM‐based 3D reconstruction, on the same dataset of 12 genotypes, with 5–10 replicates per genotype. The results revealed that (1) the average number of images needed to build a denser 3D model was reduced from 3000–3600 (DIRT/3D [VisualSFM‐based 3D reconstruction]) to around 360 for computational test 1 and around 600 for computational test 2 (DIRT/3D [COLMAP‐based 3D reconstruction]); (2) denser 3D models helped improve the accuracy of the 3D root‐trait measurements; (3) reducing the number of images helps resolve data storage problems. The updated DIRT/3D (COLMAP‐based 3D reconstruction) pipeline enables quicker image collection without compromising the accuracy of 3D root‐trait measurements.
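As a rough illustration of how a COLMAP-based reconstruction step can be scripted (the actual DIRT/3D integration is not shown in this summary), the snippet below drives COLMAP's automatic reconstructor from Python. The directory names are placeholders, and COLMAP is assumed to be installed and on the PATH.

# Rough sketch of driving COLMAP from Python for a root-crown image set, as one
# way a COLMAP-based reconstruction step could be scripted. The workspace and
# image directories are placeholders.
import subprocess
from pathlib import Path

workspace = Path("reconstruction_workspace")     # placeholder output directory
images = Path("root_images")                     # placeholder folder of ~360-600 images
workspace.mkdir(exist_ok=True)

subprocess.run(
    [
        "colmap", "automatic_reconstructor",     # feature extraction, matching, mapping, dense MVS
        "--workspace_path", str(workspace),
        "--image_path", str(images),
    ],
    check=True,
)
print("COLMAP reconstruction written to", workspace)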
  3. Faggioli, G; Ferro, N; Galuščáková, P; de, A (Ed.)
    This working note documents the participation of CS_Morgan in the ImageCLEFmedical 2024 Caption subtasks, focusing on the Caption Prediction and Concept Detection challenges. The primary objectives included training, validating, and testing multimodal Artificial Intelligence (AI) models intended to automate the generation of captions and the identification of multiple concepts in radiology images. The dataset used is a subset of the Radiology Objects in COntext version 2 (ROCOv2) dataset and contains image-caption pairs and corresponding Unified Medical Language System (UMLS) concepts. To address the caption prediction challenge, different variants of the Large Language and Vision Assistant (LLaVA) models were experimented with, tailoring them to the medical domain. Additionally, a lightweight Large Multimodal Model (LMM) and MoonDream2, a small Vision Language Model (VLM), were explored; the former is the instruct variant of the Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS (IDEFICS) 9B obtained through quantization. Besides LMMs, conventional encoder-decoder models such as Vision Generative Pre-trained Transformer 2 (visionGPT2) and Convolutional Neural Network-Transformer (CNN-Transformer) architectures were considered. Altogether, this enabled 10 submissions for the caption prediction task, with the first submission, LLaVA 1.6 on the Mistral 7B weights, securing 2nd position among the participants. This model was adapted using 40.1M parameters and achieved the best performance on the test data across the metrics BERTScore (0.628059), ROUGE (0.250801), BLEU-1 (0.209298), BLEURT (0.317385), METEOR (0.092682), CIDEr (0.245029), and RefCLIPScore (0.815534). For the concept detection task, our single submission, based on the ConvMixer architecture, a hybrid approach leveraging the advantages of CNNs and Transformers, ranked 9th with an F1-score of 0.107645. Overall, the evaluations on the test data for the caption prediction submissions suggest that LMMs, quantized LMMs, and small VLMs, when adapted and selectively fine-tuned using fewer parameters, have ample potential for understanding the medical concepts present in images.
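For readers unfamiliar with the concept-detection metric quoted above, the snippet below shows one conventional way to compute a multi-label F1 score over predicted UMLS concept sets with scikit-learn; the concept IDs and predictions are invented placeholders, not challenge data, and the challenge's exact averaging convention is not claimed here.

# Minimal sketch of multi-label F1 scoring for concept detection. The concept
# IDs and predictions below are invented placeholders.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

ground_truth = [{"C0040405", "C0817096"}, {"C0024109"}]          # per-image concept sets
predictions = [{"C0040405"}, {"C0024109", "C0817096"}]           # per-image predicted sets

mlb = MultiLabelBinarizer()
y_true = mlb.fit_transform(ground_truth)
y_pred = mlb.transform(predictions)

print("samples-averaged F1:", f1_score(y_true, y_pred, average="samples"))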
  4. Kidney cancer has a high mortality rate because of the difficulty of early diagnosis and its high rate of metastatic dissemination at the time of treatment. Surgical resection of tumors is the most effective treatment for renal cancer patients. However, precise assessment of tumor margins is a challenge during surgical resection. The objective of this study is to demonstrate an optical imaging tool for precisely distinguishing kidney tumor borders and identifying tumor zones from normal tissues, to assist surgeons in accurately resecting tumors from kidneys during surgery. 30 samples from six human kidneys were imaged using polarization-sensitive optical coherence tomography (PS-OCT). Cross-sectional, en face, and spatial information of kidney samples was obtained for microenvironment reconstruction. Polarization parameters (phase retardation, optic axis direction, and degree of polarization uniformity (DOPU)) and Stokes parameters (Q, U, and V) were utilized for multiparameter analysis. To verify the detection accuracy of PS-OCT, H&E histology staining and the Dice coefficient were utilized to quantify the performance of PS-OCT in identifying tumor borders and regions. In this study, tumor borders were clearly identified by PS-OCT imaging, which outperformed conventional intensity-based OCT. With H&E histological staining as the gold standard, PS-OCT precisely identified tumor regions and tissue distributions at different locations and depths based on polarization and Stokes parameters. Compared to the traditional attenuation coefficient quantification method, PS-OCT demonstrated enhanced contrast between normal and cancerous tissues due to birefringence effects. Our results demonstrate that PS-OCT is promising for providing imaging guidance during the surgical resection of kidney tumors and has the potential to be used in other kidney procedures in the clinic, such as renal biopsy.
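The Dice coefficient used above to quantify agreement between PS-OCT-derived tumor regions and H&E annotations is a simple overlap measure; a minimal NumPy sketch follows, with random binary masks standing in for real segmentations.

# Minimal sketch of the Dice coefficient used to quantify overlap between a
# PS-OCT-derived tumor mask and an H&E-based reference mask. The random binary
# masks are placeholders for real segmentations.
import numpy as np

def dice_coefficient(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice = 2|A intersect B| / (|A| + |B|) for binary masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    total = mask_a.sum() + mask_b.sum()
    return 2.0 * intersection / total if total else 1.0

rng = np.random.default_rng(0)
pred = rng.integers(0, 2, size=(256, 256)).astype(bool)   # stand-in PS-OCT tumor mask
ref = rng.integers(0, 2, size=(256, 256)).astype(bool)    # stand-in H&E reference mask
print(f"Dice: {dice_coefficient(pred, ref):.3f}")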
  5. Ensuring high-quality prints in additive manufacturing is a critical challenge due to variability in materials, process parameters, and equipment. Machine learning models are increasingly being employed for real-time quality monitoring, enabling the detection and classification of defects such as under-extrusion and over-extrusion. Vision Transformers (ViTs), with their global self-attention mechanisms, offer a promising alternative to traditional convolutional neural networks (CNNs). This paper presents a transformer-based approach to print quality recognition in additive manufacturing technologies, with a focus on fused filament fabrication (FFF), leveraging advanced self-supervised representation learning techniques to enhance the robustness and generalizability of ViTs. We show that the ViT model effectively classifies printing quality into different levels of extrusion, achieving exceptional performance across varying dataset scales and noise levels. Training evaluations show a steady decrease in cross-entropy loss, with prediction accuracy, precision, recall, and the harmonic mean of precision and recall (F1 score) approaching 1 within 40 epochs, demonstrating excellent performance across all classes. The macro and micro F1 scores further emphasize the ability of the ViT to handle both class imbalance and instance-level accuracy effectively. Our results also demonstrate that the ViT outperforms the CNN in all scenarios, particularly in noisy conditions and with small datasets. Comparative analysis reveals the ViT's advantages, particularly in leveraging global self-attention and robust feature extraction, which enhance its ability to generalize effectively and remain resilient with limited data. These findings underline the potential of the transformer-based approach as a scalable, interpretable, and reliable solution for real-time quality monitoring in FFF, addressing key challenges in additive manufacturing defect detection and ensuring process efficiency.
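The paper's self-supervised pretraining recipe is not reproduced in this summary; as a minimal supervised stand-in, the sketch below adapts a torchvision ViT-B/16 to a three-class extrusion-quality problem (under-extrusion, good quality, over-extrusion). The class set, the random batch standing in for camera frames, and the hyperparameters are assumptions.

# Minimal supervised fine-tuning sketch: adapt a torchvision ViT-B/16 to classify
# print images into under-extrusion, good quality, and over-extrusion. The class
# set, data, and hyperparameters are assumptions; the paper's self-supervised
# pretraining stage is not reproduced here.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

NUM_CLASSES = 3                                   # under-, good, over-extrusion (assumed)
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, NUM_CLASSES)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a random batch standing in for camera frames.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (8,))
logits = model(images)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
print(f"cross-entropy loss: {loss.item():.3f}")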