- 
We present a solution to image-based cell counting with dot annotations for both 2D and 3D cases. Current approaches have two major limitations: 1) inability to provide precise locations when cells overlap; and 2) reliance on costly labeled data. To address these two issues, we first adopt the inverse distance kernel, which yields separable density maps for better localization. Second, we take advantage of unlabeled data by self-supervised learning with focal consistency loss, which we propose for our pixel-wise task. These two contributions complement each other. Together, our framework compares favorably against state-of-the-art methods, including methods using full annotations, on 2D and 3D benchmarks, while significantly reducing the amount of labeled data needed for training. In addition, we provide a tool to expedite the labeling process for dot annotations. Finally, we make the source code and labeling tool publicly available.
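As a rough illustration of the inverse distance kernel idea above, the sketch below builds a target map from dot annotations in which each pixel holds 1 / (1 + α·d), with d the Euclidean distance to the nearest annotated dot. The exact kernel form and the α scaling are not given here, so both are assumptions; the point is only that nearby cells keep distinct peaks.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def inverse_distance_map(shape, dots, alpha=0.2):
    """Build a density-like target map from dot annotations.

    Each pixel gets 1 / (1 + alpha * d), where d is the Euclidean distance
    to the nearest annotated dot, so peaks stay separable even when cells
    are close together.
    """
    mask = np.ones(shape, dtype=bool)
    ys, xs = zip(*dots)
    mask[np.array(ys), np.array(xs)] = False   # zeros at dot locations
    d = distance_transform_edt(mask)           # distance to nearest dot
    return 1.0 / (1.0 + alpha * d)

# toy example: two nearby cells in a 32x32 image
dm = inverse_distance_map((32, 32), dots=[(10, 10), (10, 14)])
print(dm.shape, dm.max())                      # peaks of 1.0 at each dot
```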
- 
Segmentation of echocardiograms plays an essential role in the quantitative analysis of the heart and helps diagnose cardiac diseases. Over the past decade, deep learning-based approaches have significantly improved the performance of echocardiogram segmentation. Most deep learning-based methods assume that the image to be processed is rectangular. However, echocardiogram images are typically formed within a sector of a circle, leaving a significant region of the overall rectangular image with no data, a consequence of the ultrasound imaging methodology. This large non-imaging region can influence the training of deep neural networks. In this paper, we propose to use polar transformation to help train deep learning algorithms. Using the r-θ transformation, a significant portion of the non-imaging background is removed, allowing the neural network to focus on the heart image. The segmentation model is trained on both x-y and r-θ images. During inference, the predictions from the x-y and r-θ images are combined using max-voting. We verify the efficacy of our method on the CAMUS dataset with a variety of segmentation networks, encoder networks, and loss functions. The experimental results demonstrate the effectiveness and versatility of our proposed method for improving the segmentation results.
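A minimal sketch of an x-y to r-θ resampling of a single ultrasound frame, assuming the transducer apex location, a 180° angular range, and the output resolution; the paper's exact polar mapping may differ.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def to_polar(img, center, max_radius, out_shape=(256, 256)):
    """Resample an x-y ultrasound frame onto an r-theta grid.

    Rows of the output index radius, columns index angle, so the
    sector-shaped imaging region fills most of the output and the
    blank corners of the original rectangle are discarded.
    """
    n_r, n_theta = out_shape
    r = np.linspace(0, max_radius, n_r)
    theta = np.linspace(0, np.pi, n_theta)       # assumed 180-degree sector
    rr, tt = np.meshgrid(r, theta, indexing="ij")
    ys = center[0] + rr * np.sin(tt)             # row coordinates
    xs = center[1] + rr * np.cos(tt)             # column coordinates
    return map_coordinates(img, [ys, xs], order=1, mode="constant")

# toy frame with the transducer apex assumed at the top-middle pixel
frame = np.random.rand(512, 512).astype(np.float32)
polar = to_polar(frame, center=(0, 256), max_radius=511)
print(polar.shape)   # (256, 256)
```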
- 
Image-based cell counting is a fundamental yet challenging task with wide applications in biological research. In this paper, we propose a novel unified deep network framework designed to solve this problem for various cell types in both 2D and 3D images. Specifically, we first propose SAU-Net for cell counting by extending the segmentation network U-Net with a Self-Attention module. Second, we design an extension of Batch Normalization (BN) to facilitate the training process for small datasets. In addition, a new 3D benchmark dataset based on the existing mouse blastocyst (MBC) dataset is developed and released to the community. Our SAU-Net achieves state-of-the-art results on four benchmark 2D datasets: the synthetic fluorescence microscopy (VGG) dataset, the Modified Bone Marrow (MBM) dataset, the human subcutaneous adipose tissue (ADI) dataset, and the Dublin Cell Counting (DCC) dataset, as well as on the new 3D dataset, MBC. The BN extension is validated using extensive experiments on the 2D datasets, since GPU memory constraints preclude the use of 3D datasets. The source code is available at https://github.com/mzlr/sau-net.
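The sketch below shows a generic SAGAN-style 2D self-attention block of the kind that can be inserted at a U-Net bottleneck. It is not the exact SAU-Net module (those details are in the released code at https://github.com/mzlr/sau-net); the channel reduction factor and residual gating here are common defaults, assumed for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over the spatial positions of a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key   = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))   # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)          # (b, hw, c/8)
        k = self.key(x).flatten(2)                             # (b, c/8, hw)
        attn = F.softmax(torch.bmm(q, k), dim=-1)              # (b, hw, hw)
        v = self.value(x).flatten(2)                           # (b, c, hw)
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                            # residual connection

# e.g. wrap the U-Net bottleneck features
feat = torch.randn(2, 64, 16, 16)
print(SelfAttention2d(64)(feat).shape)   # torch.Size([2, 64, 16, 16])
```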
- 
There is considerable interest in AI systems that can assist a cardiologist in diagnosing echocardiograms and can also be used to train residents in classifying echocardiograms. Prior work has focused on the analysis of a single frame. Classifying echocardiograms at the video level is challenging due to intra-frame and inter-frame noise. We propose a two-stream deep network which learns from the spatial context and optical flow for the classification of echocardiography videos. Each stream contains two parts: a Convolutional Neural Network (CNN) for spatial features and a bi-directional Long Short-Term Memory (LSTM) network with Attention for temporal features. The features from these two streams are fused for classification. We verify our experimental results on a dataset of 170 (80 normal and 90 abnormal) videos that have been manually labeled by trained cardiologists. Our method provides an overall accuracy of 91.18%, with a sensitivity of 94.11% and a specificity of 88.24%.
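A toy sketch of the two-stream layout described above: a small stand-in CNN per frame, a bi-directional LSTM with temporal attention per stream, and late fusion of the RGB and optical-flow streams. The backbone, feature sizes, clip length, and class count here are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamEncoder(nn.Module):
    """One stream: per-frame CNN features -> bi-directional LSTM -> attention pooling."""
    def __init__(self, in_ch, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, clip):                     # clip: (batch, time, ch, H, W)
        b, t = clip.shape[:2]
        f = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        h, _ = self.lstm(f)                      # (b, t, 2*hidden)
        w = F.softmax(self.attn(h), dim=1)       # temporal attention weights
        return (w * h).sum(dim=1)                # (b, 2*hidden)

class TwoStreamClassifier(nn.Module):
    """Fuse spatial (RGB) and optical-flow streams for video-level classification."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.rgb = StreamEncoder(in_ch=3)
        self.flow = StreamEncoder(in_ch=2)       # 2-channel flow (dx, dy)
        self.head = nn.Linear(2 * 128, n_classes)

    def forward(self, rgb_clip, flow_clip):
        return self.head(torch.cat([self.rgb(rgb_clip), self.flow(flow_clip)], dim=1))

model = TwoStreamClassifier()
logits = model(torch.randn(2, 8, 3, 64, 64), torch.randn(2, 8, 2, 64, 64))
print(logits.shape)                              # torch.Size([2, 2])
```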
- 
Humans understand videos from both the visual and audio aspects of the data. In this work, we present a self-supervised cross-modal representation approach for learning audio-visual correspondence (AVC) for videos in the wild. After the learning stage, we explore retrieval in both cross-modal and intra-modal manners with the learned representations. We verify our experimental results on the VGGSound dataset, and our approach achieves promising results.
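One common way to train such an audio-visual correspondence model is a binary objective over matched and mismatched video/audio pairs; the sketch below uses that formulation with toy encoders. The encoder shapes, the negative-sampling scheme, and the loss are assumptions, not necessarily the paper's objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in encoders; real backbones would be a video CNN and an audio CNN.
video_enc = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 64))
audio_enc = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 64))

def avc_loss(video_feats, audio_feats):
    """Binary correspondence loss: matched pairs are positives; rolling the
    audio batch by one position gives mismatched negatives."""
    v = F.normalize(video_enc(video_feats), dim=1)
    a = F.normalize(audio_enc(audio_feats), dim=1)
    pos = (v * a).sum(dim=1)                          # matched similarity
    neg = (v * a.roll(shifts=1, dims=0)).sum(dim=1)   # mismatched similarity
    logits = torch.cat([pos, neg])
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(logits, labels)

loss = avc_loss(torch.randn(16, 512), torch.randn(16, 128))
loss.backward()
print(float(loss))
```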
- 
Embodied Question Answering (EQA) is a relatively new task where an agent is asked to answer questions about its environment from egocentric perception. EQA as introduced in [8] makes the fundamental assumption that every question, e.g. “what color is the car?”, has exactly one target (“car”) being inquired about. This assumption puts a direct limitation on the abilities of the agent. We present a generalization of EQA – Multi-Target EQA (MT-EQA). Specifically, we study questions that have multiple targets in them, such as “Is the dresser in the bedroom bigger than the oven in the kitchen?”, where the agent has to navigate to multiple locations (“dresser in bedroom”, “oven in kitchen”) and perform comparative reasoning (“dresser” bigger than “oven”) before it can answer a question. Such questions require the development of entirely new modules or components in the agent. To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module. The program generator converts the given question into sequential executable sub-programs; the navigator guides the agent to multiple locations pertinent to the navigation-related sub-programs; and the controller learns to select relevant observations along its path. These observations are then fed to the VQA module to predict the answer. We perform detailed analysis for each of the model components and show that our joint model can outperform previous methods and strong baselines by a significant margin. Project page: https://embodiedqa.org.
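Purely as an illustration of the control flow, the skeleton below wires up the four modules named above with hard-coded toy outputs. The sub-program format, module interfaces, and return values are assumptions made for readability; in the actual system each module is a learned neural component.

```python
# Illustrative skeleton of the modular MT-EQA pipeline described above.

def program_generator(question):
    """Map a multi-target question to sequential executable sub-programs (toy output)."""
    return [("nav", "dresser in bedroom"),
            ("nav", "oven in kitchen"),
            ("compare_size", ("dresser", "oven"))]

def navigator(target):
    """Drive the agent toward the target; return observations along the path (toy)."""
    return [f"obs of {target}, step {i}" for i in range(3)]

def controller(observations):
    """Select the observations most relevant to answering the question."""
    return observations[-1]          # toy rule: keep the final, closest view

def vqa_module(selected_obs, question):
    """Predict the answer from the selected observations (toy constant)."""
    return "yes"

def answer(question):
    selected = []
    for op, arg in program_generator(question):
        if op == "nav":
            selected.append(controller(navigator(arg)))
    return vqa_module(selected, question)

print(answer("Is the dresser in the bedroom bigger than the oven in the kitchen?"))
```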
- 
As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world. Often correlated during natural events, these two modalities combine to jointly affect human perception. In this paper, we pose the task of generating sound given visual input. Such capabilities could help enable applications in virtual reality (generating sound for virtual scenes automatically) or provide additional accessibility to images or videos for people with visual impairments. As a first step in this direction, we apply learning-based methods to generate raw waveform samples given input video frames. We evaluate our models on a dataset of videos containing a variety of sounds (such as ambient sounds and sounds from people/animals). Our experiments show that the generated sounds are fairly realistic and have good temporal synchronization with the visual inputs.
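A toy sketch of the frames-to-waveform idea: encode each frame, run a recurrent model over time, and decode a block of raw audio samples per frame. The sample rate, block size, and architecture below are assumptions for illustration; the paper's generative models are more sophisticated.

```python
import torch
import torch.nn as nn

class Frames2Waveform(nn.Module):
    """Toy sketch: encode video frames, run a GRU over time, and decode a block
    of raw audio samples per frame (here 1600 samples ~ 0.1 s at an assumed 16 kHz)."""
    def __init__(self, samples_per_frame=1600, hidden=256):
        super().__init__()
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=4, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, hidden))
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.decode = nn.Linear(hidden, samples_per_frame)

    def forward(self, frames):                    # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        f = self.frame_enc(frames.flatten(0, 1)).view(b, t, -1)
        h, _ = self.rnn(f)
        wav = torch.tanh(self.decode(h))          # samples in [-1, 1]
        return wav.reshape(b, -1)                 # (batch, time * samples_per_frame)

model = Frames2Waveform()
audio = model(torch.randn(2, 8, 3, 64, 64))
print(audio.shape)                                # torch.Size([2, 12800])
```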
- 
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-the-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks. Demo and code are provided.
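The score-fusion step can be pictured as below: a language encoding predicts softmax weights over the subject, location, and relationship modules, and the overall region score is the weighted sum of the per-module scores. The dimensions and the linear weight predictor are illustrative stand-ins for MAttNet's learned components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModularScorer(nn.Module):
    """Toy sketch of modular score fusion: language features predict a weight
    for each of the subject / location / relationship modules, and the overall
    region score is the weighted sum of the module scores."""
    def __init__(self, lang_dim=256):
        super().__init__()
        self.weight_fc = nn.Linear(lang_dim, 3)   # one weight per module

    def forward(self, lang_feat, subj_score, loc_score, rel_score):
        w = F.softmax(self.weight_fc(lang_feat), dim=-1)           # (batch, 3)
        scores = torch.stack([subj_score, loc_score, rel_score], dim=-1)
        return (w * scores).sum(dim=-1)                             # (batch,)

scorer = ModularScorer()
overall = scorer(torch.randn(4, 256),
                 torch.randn(4), torch.randn(4), torch.randn(4))
print(overall.shape)                                                # torch.Size([4])
```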
- 
Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution. The present volume brings together five research articles from several different corners of the area: multilingual multimodal image description (Frank et al.), multimodal machine translation (Madhyastha et al., Frank et al.), image caption generation (Madhyastha et al., Tanti et al.), visual scene understanding (Silberer et al.), and multimodal learning of high-level attributes (Sorodoc et al.). In this article, we touch upon all of these topics as we review work involving images and text under the three main headings of image description (Section 2), visually grounded referring expression generation (REG) and comprehension (Section 3), and visual question answering (VQA) (Section 4).
- 
This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset. Instead of generic and commonly used representations trained on the ImageNet classification task, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction. We also present a method for localizing phrases from candidate answers in order to provide spatial support for feature extraction. We map each of these features, together with candidate answers, to a joint embedding space through normalized canonical correlation analysis (nCCA). Finally, we solve an optimization problem to learn to combine scores from nCCA models trained on multiple cues to select the best answer. Extensive experimental results show a significant improvement over the previous state of the art and confirm that answering questions from a wide range of types benefits from examining a variety of image cues and carefully choosing the spatial support for feature extraction.
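A rough sketch of CCA-based answer scoring, using scikit-learn's standard CCA as a stand-in for the normalized CCA (nCCA) used in the paper: image-cue features and answer-text features are projected into a shared space, and the candidate with the highest cosine similarity to the image projection is selected. All feature dimensions and the random training data are placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
img_feats = rng.normal(size=(200, 64))      # image-cue features (training pairs)
txt_feats = rng.normal(size=(200, 32))      # matching answer-text features

# Fit a shared embedding space from paired image/text features.
cca = CCA(n_components=16)
cca.fit(img_feats, txt_feats)

def best_answer(image_feat, candidate_feats):
    """Project the image and each candidate answer into the shared space and
    return the index of the candidate with the highest cosine similarity."""
    n = len(candidate_feats)
    x_proj, y_proj = cca.transform(np.repeat(image_feat[None, :], n, axis=0),
                                   candidate_feats)
    x_proj /= np.linalg.norm(x_proj, axis=1, keepdims=True)
    y_proj /= np.linalg.norm(y_proj, axis=1, keepdims=True)
    return int(np.argmax((x_proj * y_proj).sum(axis=1)))

answer_idx = best_answer(rng.normal(size=64), rng.normal(size=(4, 32)))
print(answer_idx)
```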