Search for: All records

Award ID contains: 1633295

« Prev Next »

Total Resources

11

Resource Type
Conference Paper

7

Conference Proceeding

0

Dataset

0

Journal Article

4

Workshop Report

0

Availability
Full Text / Resource Available

11

Citation Only

0

Save Results
Excel (limit 2000)
CSV (limit 5000)
XML (limit 5000)

Have feedback or suggestions for a way to improve these results?
!

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

SAU-Net: A Unified Network for Cell Counting in 2D and 3D Microscopy Images

https://doi.org/10.1109/TCBB.2021.3089608

Guo, Yue ; Krupa, Oleh ; Stein, Jason ; Wu, Guorong ; Krishnamurthy, Ashok ( June 2021 , IEEE/ACM Transactions on Computational Biology and Bioinformatics)

Image-based cell counting is a fundamental yet challenging task with wide applications in biological research. In this paper, we propose a novel unified deep network framework designed to solve this problem for various cell types in both 2D and 3D images. Specifically, we first propose SAU-Net for cell counting by extending the segmentation network U-Net with a Self-Attention module. Second, we design an extension of Batch Normalization (BN) to facilitate the training process for small datasets. In addition, a new 3D benchmark dataset based on the existing mouse blastocyst (MBC) dataset is developed and released to the community. Our SAU-Net achieves state-of-the-art results on four benchmark 2D datasets - synthetic fluorescence microscopy (VGG) dataset, Modified Bone Marrow (MBM) dataset, human subcutaneous adipose tissue (ADI) dataset, and Dublin Cell Counting (DCC) dataset, and the new 3D dataset, MBC. The BN extension is validated using extensive experiments on the 2D datasets, since GPU memory constraints preclude use of 3D datasets. The source code is available at https://github.com/mzlr/sau-net.
more » « less
Full Text Available
Two-Stream Attention Spatio-Temporal Network For Classification Of Echocardiography Videos

https://doi.org/10.1109/ISBI48211.2021.9433773

Feng, Zishun ; Sivak, Joseph A. ; Krishnamurthy, Ashok K. ( April 2021 , International Symposium on Biomedical Imaging 2021)

There is considerable interest in AI systems that can assist a cardiologist to diagnose echocardiograms, and can also be used to train residents in classifying echocardiograms. Prior work has focused on the analysis of a single frame. Classifying echocardiograms at the video-level is challenging due to intra-frame and inter-frame noise. We propose a two-stream deep network which learns from the spatial context and optical flow for the classification of echocardiography videos. Each stream contains two parts: a Convolutional Neural Network (CNN) for spatial features and a bi-directional Long Short-Term Memory (LSTM) network with Attention for temporal. The features from these two streams are fused for classification. We verify our experimental results on a dataset of 170 (80 normal and 90 abnormal) videos that have been manually labeled by trained cardiologists. Our method provides an overall accuracy of 91:18%, with a sensitivity of 94:11% and a specificity of 88:24%.
more » « less
Full Text Available
Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos

Feng, Z ; Tu, M ; Xia, R ; Wang, Y ; Krishnamurthy, A ( December 2020 , IEEE International Conference on Big Data)
null (Ed.)
Humans understand videos from both the visual and audio aspects of the data. In this work, we present a self supervised cross modal representation approach for learning audio visual correspondence (AVC) for videos in the wild. After the learning stage, we explore retrieval in both cross modal and intra modal manner with the learned representations. We verify our experimental results on the VGGSound dataset and our approach achieves promising results.
more » « less
Full Text Available
Multi-Target Embodied Question Answering

Yu, Licheng ; Chen, Xinlei ; Gkioxari, Georgia ; Bansal, Mohit ; Berg, Tamara L ; Batra, Dhruv ( January 2019 , IEEE Conference on Computer Vision and Pattern Recognition)

Embodied Question Answering (EQA) is a relatively new task where an agent is asked to answer questions about its environment from egocentric perception. EQA as introduced in [8] makes the fundamental assumption that every question, e.g. “what color is the car?”, has exactly one target (“car”) being inquired about. This assumption puts a direct limitation on the abilities of the agent. We present a generalization of EQA – Multi-Target EQA (MT-EQA). Specifically, we study questions that have multiple targets in them, such as “Is the dresser in the bedroom bigger than the oven in the kitchen?”, where the agent has to navigate to multiple locations (“dresser in bedroom”, “oven in kitchen”) and perform comparative reasoning (“dresser” bigger than “oven”) before it can answer a question. Such questions require the development of entirely new modules or components in the agent. To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module. The program generator converts the given question into sequential executable sub-programs; the navigator guides the agent to multiple locations pertinent to the navigation-related sub-programs; and the controller learns to select relevant observations along its path. These observations are then fed to the VQA module to predict the answer. We perform detailed analysis for each of the model components and show that our joint model can outperform previous methods and strong baselines by a significant margin. Project page: https://embodiedqa.org.
more » « less
Full Text Available
MAttNet: Modular Attention Network for Referring Expression Comprehension

Yu, Licheng ; Lin, Zhe ; Shen, Xiaohui ; Yang, Jimei ; Lu, Xin ; Bansal, Mohit ; Berg, Tamara L. ( June 2018 , IEEE Conference on Computer Vision and Pattern Recognition)

In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-the-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks. Demo and code are provided.
more » « less
Full Text Available
Visual to Sound: Generating Natural Sound for Videos in the Wild

Zhou, Yipin ; Wang, Zhaowen ; Fang, Chen ; Bui, Trung ; Berg, Tamara L. ( June 2018 , IEEE Conference on Computer Vision and Pattern Recognition)

As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world. Often correlated during natural events, these two modalities combine to jointly affect human perception. In this paper, we pose the task of generating sound given visual input. Such capabilities could help enable applications in virtual reality (generating sound for virtual scenes automatically) or provide additional accessibility to images or videos for people with visual impairments. As a first step in this direction, we apply learning-based methods to generate raw waveform samples given input video frames. We evaluate our models on a dataset of videos containing a variety of sounds (such as ambient sounds and sounds from people/animals). Our experiments show that the generated sounds are fairly realistic and have good temporal synchronization with the visual inputs.
more » « less
Full Text Available
From image to language and back again

https://doi.org/10.1017/S1351324918000086

BELZ, A. ; BERG, T.L. ; YU, L. ( May 2018 , Natural Language Engineering)

Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution. The present volume brings together five research articles from several different corners of the area: multilingual multimodal image description (Frank et al. ), multimodal machine translation (Madhyastha et al. , Frank et al. ), image caption generation (Madhyastha et al. , Tanti et al. ), visual scene understanding (Silberer et al. ), and multimodal learning of high-level attributes (Sorodoc et al. ). In this article, we touch upon all of these topics as we review work involving images and text under the three main headings of image description (Section 2), visually grounded referring expression generation (REG) and comprehension (Section 3), and visual question answering (VQA) (Section 4).
more » « less
Full Text Available
Combining Multiple Cues for Visual Madlibs Question Answering

https://doi.org/10.1007/s11263-018-1096-0

Tommasi, Tatiana ; Mallya, Arun ; Plummer, Bryan ; Lazebnik, Svetlana ; Berg, Alexander C. ; Berg, Tamara L. ( April 2018 , International Journal of Computer Vision)

This paper presents an approach for answering ﬁll-in-the-blank multiple choice questions from the Visual Madlibs dataset.Instead of generic and commonly used representations trained on the ImageNet classiﬁcation task, our approach employs acombination of networks trained for specialized tasks such as scene recognition, person activity classiﬁcation, and attributeprediction. We also present a method for localizing phrases from candidate answers in order to provide spatial support forfeature extraction. We map each of these features, together with candidate answers, to a joint embedding space throughnormalized canonical correlation analysis (nCCA). Finally, we solve an optimization problem to learn to combine scoresfrom nCCA models trained on multiple cues to select the best answer. Extensive experimental results show a signiﬁcantimprovement over the previous state of the art and conﬁrm that answering questions from a wide range of types beneﬁts fromexamining a variety of image cues and carefully choosing the spatial support for feature extraction.
more » « less
Full Text Available
TVQA: Localized, Compositional Video Question Answering

Lei, Jie ; Yu, Licheng ; Bansal, Mohit ; Berg, Tamara L. ( January 2018 , Empirical Methods in Natural Language Processing)

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a largescale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. Questions are designed to be compositional in nature, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. We provide analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task. The dataset is publicly available at http://tvqa.cs.unc.edu.
more » « less
Full Text Available
Hierarchically-Attentive RNN for Album Summarization and Storytelling

https://doi.org/10.18653/v1/D17-1101

Yu, Licheng ; Bansal, Mohit ; Berg, Tamara ( January 2017 , Empirical Methods in Natural Language Processing)

We address the problem of end-to-end visual storytelling. Given a photo album, our model first selects the most representative (summary) photos, and then composes a natural language story for the album. For this task, we make use of the Visual Storytelling dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) to: encode the album photos, select representative (summary) photos, and compose the story. Automatic and human evaluations show our model achieves better performance on selection, generation, and retrieval than baselines.
more » « less
Full Text Available

« Prev Next »