skip to main content


This content will become publicly available on January 4, 2025

Title: Zero-shot Building Attribute Extraction from Large-Scale Vision and Language Models.
Existing building recognition methods, exemplified by BRAILS, utilize supervised learning to extract information from satellite and street-view images for classification and segmentation. However, each task module requires human-annotated data, hindering the scalability and robustness to regional variations and annotation imbalances. In response, we propose a new zero-shot workflow for building attribute extraction that utilizes large-scale vision and language models to mitigate reliance on external annotations. The proposed workflow contains two key components: image-level captioning and segment-level captioning for the building images based on the vocabularies pertinent to structural and civil engineering. These two components generate descriptive captions by computing feature representations of the image and the vocabularies, and facilitating a semantic match between the visual and textual representations. Consequently, our framework offers a promising avenue to enhance AI-driven captioning for building attribute extraction in the structural and civil engineering domains, ultimately reducing reliance on human annotations while bolstering performance and adaptability.  more » « less
Award ID(s):
2131111
NSF-PAR ID:
10502343
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
Date Published:
Page Range / eLocation ID:
8647-8656
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Training a deep learning model with a large annotated dataset is still a dominant paradigm in automatic whole slide images (WSIs) processing for digital pathology. However, obtaining manual annotations is a labor-intensive task, and an error-prone to inter and intra-observer variability. In this study, we offer an online deep learning-based clustering workflow for annotating and analysis of different types of tissues from histopathology images. Inspired by learning and optimal transport theory, our proposed model consists of two stages. In the first stage, our model learns tissue-specific discriminative representations by contrasting the features in the latent space at two levels, the instance- and the clusterlevel. This is done by maximizing the similarities of the projections of positive pairs (views of the same image) while minimizing those of negative ones (views of the rest of the images). In the second stage, our framework extends the standard cross-entropy minimization to an optimal transport problem and solves it using the Sinkhorn-Knopp algorithm to produce the cluster assignments. Moreover, our proposed method enforces consistency between the produced assignments obtained from views of the same image. Our framework was evaluated on three common histopathological datasets: NCT-CRC, LC2500, and Kather STAD. Experiments show that our proposed framework can identify different tissues in annotation-free conditions with competitive results. It achieved an accuracy of 0.9364 in human lung patched WSIs and 0.8464 in images of human colorectal tissues outperforming state of the arts contrastive-based methods. 
    more » « less
  2. Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones to better capture multimodal interactions. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is released at https://github.com/microsoft/FIBER. 
    more » « less
  3. Building an annotated damage image database is the first step to support AI-assisted hurricane impact analysis. Up to now, annotated datasets for model training are insufficient at a local level despite abundant raw data that have been collected for decades. This paper provides a systematic approach for establishing an annotated hurricane-damaged building image database to support AI-assisted damage assessment and analysis. Optimal rectilinear images were generated from panoramic images collected from Hurricane Harvey, Texas 2017. Then, deep learning models, including Amazon Web Service (AWS) Rekognition and Mask R-CNN (Region Based Convolutional Neural Networks), were retrained on the data to develop a pipeline for building detection and structural component extraction. A web-based dashboard was developed for building data management and processed image visualization along with detected structural components and their damage ratings. The proposed AI-assisted labeling tool and trained models can intelligently and rapidly assist potential users such as hazard researchers, practitioners, and government agencies on natural disaster damage management. 
    more » « less
  4. There are multiple scales of abstraction from which we can describe the same image, depending on whether we are focusing on fine-grained details or a more global attribute of the image. In brain mapping, learning to automatically parse images to build representations of both small-scale features (e.g., the presence of cells or blood vessels) and global properties of an image (e.g., which brain region the image comes from) is a crucial and open challenge. However, most existing datasets and benchmarks for neuroanatomy consider only a single downstream task at a time. To bridge this gap, we introduce a new dataset, annotations, and multiple downstream tasks that provide diverse ways to readout information about brain structure and architecture from the same image. Our multi-task neuroimaging benchmark (MTNeuro) is built on volumetric, micrometer-resolution X-ray microtomography images spanning a large thalamocortical section of mouse brain, encompassing multiple cortical and subcortical regions. We generated a number of different prediction challenges and evaluated several supervised and self-supervised models for brain-region prediction and pixel-level semantic segmentation of microstructures. Our experiments not only highlight the rich heterogeneity of this dataset, but also provide insights into how self-supervised approaches can be used to learn representations that capture multiple attributes of a single image and perform well on a variety of downstream tasks. Datasets, code, and pre-trained baseline models are provided at: https://mtneuro.github.io/ 
    more » « less
  5. People are able to describe images using thousands of languages, but languages share only one visual world. The aim of this work is to use the learned intermediate visual representations from a deep convolutional neural network to transfer information across languages for which paired data is not available in any form. Our work proposes using backpropagation-based decoding coupled with transformer-based multilingual-multimodal language models in order to obtain translations between any languages used during training. We particularly show the capabilities of this approach in the translation of German-Japanese and Japanese-German sentence pairs, given a training data of images freely associated with text in English, German, and Japanese but for which no single image contains annotations in both Japanese and German. Moreover, we demonstrate that our approach is also generally useful in the multilingual image captioning task when sentences in a second language are available at test time. The results of our method also compare favorably in the Multi30k dataset against recently proposed methods that are also aiming to leverage images as an intermediate source of translations. 
    more » « less