Search for: All records

Award ID contains: 2050731

  1. Recent advances in instruction tuning have led to the development of state-of-the-art Large Multimodal Models (LMMs). Given the novelty of these models, the impact of visual adversarial attacks on LMMs has not been thoroughly examined. We conduct a comprehensive study of the robustness of various LMMs against different adversarial attacks, evaluated across tasks including image classification, image captioning, and Visual Question Answering (VQA). We find that, in general, LMMs are not robust to visual adversarial inputs. However, our findings suggest that context provided to the model via prompts, such as questions in a QA pair, helps to mitigate the effects of visual adversarial inputs. Notably, the LMMs evaluated demonstrated remarkable resilience to such attacks on the ScienceQA task, with only an 8.10% drop in performance compared to their visual counterparts, which dropped 99.73%. We also propose a new approach to real-world image classification, which we term query decomposition. By incorporating existence queries into our input prompt, we observe diminished attack effectiveness and improvements in image classification accuracy. This research highlights a previously underexplored facet of LMM robustness and sets the stage for future work aimed at strengthening the resilience of multimodal systems in adversarial environments.
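    A minimal sketch of the query-decomposition idea described above, assuming a hypothetical ask_lmm(image, prompt) wrapper around an instruction-tuned LMM that returns a text answer (the wrapper and the prompt wording are illustrative, not the paper's released code):

    ```python
    # Query decomposition for LMM image classification (illustrative sketch).
    # `ask_lmm(image, prompt)` is a hypothetical callable returning the model's text answer.

    def classify_with_query_decomposition(image, candidate_labels, ask_lmm):
        """Ask one existence query per candidate label, then pick a label whose
        existence query is answered 'yes'."""
        hits = {}
        for label in candidate_labels:
            # A narrow yes/no existence query adds textual context, which the
            # study reports can blunt the effect of visual adversarial inputs.
            answer = ask_lmm(image, f"Is there a {label} in this image? Answer yes or no.")
            hits[label] = answer.strip().lower().startswith("yes")
        matched = [label for label, hit in hits.items() if hit]
        if matched:
            return matched[0]
        # Fall back to a direct classification query if no existence query fires.
        return ask_lmm(image, f"Which of the following is shown: {', '.join(candidate_labels)}?")
    ```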
  2. Kehtarnavaz, Nasser; Shirvaikar, Mukul V (Ed.)
    Recent diffusion-based generative models employ methods such as one-shot fine-tuning of an image diffusion model for video generation. However, this leads to long video generation times and suboptimal efficiency. To avoid this long generation time, zero-shot text-to-video models eliminate the fine-tuning step entirely and can generate novel videos from a text prompt alone. While the zero-shot generation method greatly reduces generation time, many models rely on inefficient cross-frame attention processors, hindering the diffusion model's utilization for real-time video generation. We address this issue by introducing more efficient attention processors to a video diffusion model. Specifically, we use attention processors (i.e., xFormers, FlashAttention, and HyperAttention) that are highly optimized for efficiency and hardware parallelization. We then apply these processors to a video generator and test with both older diffusion models such as Stable Diffusion 1.5 and newer, high-quality models such as Stable Diffusion XL. Our results show that using efficient attention processors alone can reduce generation time by around 25%, with no change in video quality. Combined with the use of higher-quality models, this use of efficient attention processors in zero-shot generation presents a substantial efficiency and quality increase, greatly expanding the video diffusion model's application to real-time video generation.
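    A minimal sketch of the attention-processor swap using the Hugging Face diffusers API; the image pipeline and checkpoint below are only stand-ins, since the paper's zero-shot text-to-video generator is not reproduced here:

    ```python
    # Swapping in efficient attention processors (illustrative sketch).
    import torch
    from diffusers import StableDiffusionPipeline
    from diffusers.models.attention_processor import AttnProcessor2_0

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Option 1: xFormers memory-efficient attention (requires the xformers package).
    pipe.enable_xformers_memory_efficient_attention()

    # Option 2: PyTorch 2.x scaled-dot-product attention, which dispatches to
    # FlashAttention kernels where the hardware supports them.
    pipe.unet.set_attn_processor(AttnProcessor2_0())

    image = pipe("a time-lapse of clouds over a mountain").images[0]
    ```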
  3. As solar photovoltaic (PV) has emerged as a dominant player in the energy market, there has been an exponential surge in solar deployment and investment within this sector. With the rapid growth of solar energy adoption, accurate and efficient detection of PV panels has become crucial for effective solar energy mapping and planning. This paper presents the application of the Mask2Former model for segmenting PV panels from a diverse, multi-resolution dataset of satellite and aerial imagery. Our primary objective is to harness Mask2Former's deep learning capabilities to achieve precise segmentation of PV panels in real-world scenarios. We fine-tune the pre-existing Mask2Former model on a carefully curated multi-resolution dataset and a crowdsourced dataset of satellite and aerial images, showcasing its superiority over other deep learning models like U-Net and DeepLabv3+. Most notably, Mask2Former establishes a new state of the art in semantic segmentation by achieving over 95% IoU scores. Our research contributes significantly to the advancement of solar energy mapping and sets a benchmark for future studies in this field.
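    A hedged sketch of setting up Mask2Former for semantic segmentation with the Hugging Face transformers API; the base checkpoint and the two-class PV-panel label map are assumptions for illustration, not the fine-tuned model from the paper:

    ```python
    # Mask2Former setup and forward pass for PV-panel segmentation (illustrative sketch).
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

    checkpoint = "facebook/mask2former-swin-small-ade-semantic"  # assumed base model
    id2label = {0: "background", 1: "pv_panel"}                  # assumed label map

    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = Mask2FormerForUniversalSegmentation.from_pretrained(
        checkpoint,
        id2label=id2label,
        label2id={v: k for k, v in id2label.items()},
        ignore_mismatched_sizes=True,  # reinitializes the class head for fine-tuning
    )

    image = Image.open("aerial_tile.png").convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Per-pixel class map at the original image resolution.
    seg_map = processor.post_process_semantic_segmentation(
        outputs, target_sizes=[image.size[::-1]]
    )[0]
    panel_mask = seg_map == 1
    ```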
  4. In this work, we propose an ensemble modeling approach for multimodal action recognition. We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO dataset. Based on the underlying principle of focal loss, which captures the relationship between tail (scarce) classes and their prediction difficulties, we propose an exponentially decaying variant of focal loss for our current task. It initially emphasizes learning from the hard misclassified examples and gradually adapts to the entire range of examples in the dataset. This annealing process encourages the model to strike a balance between focusing on the sparse set of hard samples and leveraging the information provided by the easier ones. Additionally, we opt for a late fusion strategy to combine the resultant probability distributions from the RGB and Depth modalities for final action prediction. Experimental evaluations on the MECCANO dataset demonstrate the effectiveness of our approach. Notably, our method also secured first place in the multimodal action recognition challenge at ICIAP 2023.
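    A minimal sketch of a focal loss whose focusing parameter decays exponentially over training, plus late fusion of per-modality class probabilities; the decay schedule, hyperparameters, and fusion weights are assumptions rather than the paper's exact formulation:

    ```python
    # Exponentially decaying focal loss and RGB/Depth late fusion (illustrative sketch).
    import math
    import torch
    import torch.nn.functional as F

    def decaying_focal_loss(logits, targets, epoch, gamma0=2.0, decay=0.1):
        """Cross-entropy modulated by (1 - p_t)^gamma with gamma = gamma0 * exp(-decay * epoch).
        Early epochs emphasize hard, misclassified examples; as gamma shrinks the
        loss approaches plain cross-entropy over the full dataset."""
        gamma = gamma0 * math.exp(-decay * epoch)
        ce = F.cross_entropy(logits, targets, reduction="none")
        p_t = torch.exp(-ce)                      # probability of the true class
        return ((1.0 - p_t) ** gamma * ce).mean()

    def late_fusion_predict(rgb_logits, depth_logits, w_rgb=0.5):
        """Average the per-modality probability distributions for the final action label."""
        probs = w_rgb * rgb_logits.softmax(-1) + (1.0 - w_rgb) * depth_logits.softmax(-1)
        return probs.argmax(-1)
    ```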
  5. Answering questions about complex situations in videos requires not only capturing the presence of actors, objects, and their relations but also the evolution of these relationships over time. A situation hyper-graph is a representation that describes situations as scene sub-graphs for video frames and hyper-edges for connected sub-graphs, and has been proposed to capture all such information in a compact structured form. In this work, we propose an architecture for Video Question Answering (VQA) that enables answering questions related to video content by predicting situation hyper-graphs, coined Situation Hyper-Graph based Video Question Answering (SHG-VQA). To this end, we train a situation hyper-graph decoder to implicitly identify graph representations with actions and object/human-object relationships from the input video clip, and we use cross-attention between the predicted situation hyper-graphs and the question embedding to predict the correct answer. The proposed method is trained in an end-to-end manner and optimized by a VQA loss with the cross-entropy function and a Hungarian matching loss for the situation graph prediction. The effectiveness of the proposed architecture is extensively evaluated on two challenging benchmarks: AGQA and STAR. Our results show that learning the underlying situation hyper-graphs helps the system to significantly improve its performance on the novel challenges of video question-answering tasks. Code will be available at https://github.com/aurooj/SHG-VQA.
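    A hedged sketch of the two-part training objective described above: a cross-entropy VQA loss combined with a Hungarian-matched set loss over predicted situation-graph elements; the cost definition and loss weight are illustrative, and the released SHG-VQA code defines the actual matcher:

    ```python
    # Combined VQA + Hungarian-matched graph loss (illustrative sketch).
    import torch
    import torch.nn.functional as F
    from scipy.optimize import linear_sum_assignment

    def hungarian_set_loss(pred_logits, target_labels):
        """pred_logits: (num_queries, num_classes); target_labels: (num_targets,) LongTensor.
        Match each ground-truth graph element to the prediction giving it the highest
        probability, then apply cross-entropy on the matched pairs."""
        probs = pred_logits.softmax(-1)                 # (Q, C)
        cost = -probs[:, target_labels]                 # (Q, T) negative matching score
        rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
        rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
        return F.cross_entropy(pred_logits[rows], target_labels[cols])

    def shg_vqa_loss(answer_logits, answer_ids, graph_logits, graph_labels, w_graph=1.0):
        """VQA cross-entropy plus the situation-graph set-prediction loss."""
        return F.cross_entropy(answer_logits, answer_ids) + \
               w_graph * hungarian_set_loss(graph_logits, graph_labels)
    ```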
  6. In this article, we propose a deep-learning-based semantic segmentation model that identifies and segments defects in electroluminescence (EL) images of silicon photovoltaic (PV) cells. The proposed model can differentiate between cracks, contact interruptions, cell interconnect failures, and contact corrosion for both multicrystalline and monocrystalline silicon cells. Our model utilizes a DeepLabv3 segmentation architecture with a ResNet-50 backbone. It was trained on 17,064 EL images, including 256 physically realistic simulated images of PV cells generated to address class imbalance. While performing semantic segmentation for five defect classes, this model achieves a weighted F1-score of 0.95, an unweighted F1-score of 0.69, a pixel-level global accuracy of 95.4%, and a mean intersection over union score of 57.3%. In addition, we introduce the UCF EL Defect dataset, a large-scale dataset consisting of 17,064 EL images, which will be publicly available for use by the PV and computer vision research communities.
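    A minimal sketch of the segmentation backbone described above, a DeepLabv3 model with a ResNet-50 encoder from torchvision; the class count, label layout, and grayscale handling are assumptions for illustration:

    ```python
    # DeepLabv3 + ResNet-50 for EL defect segmentation (illustrative sketch).
    import torch
    from torchvision.models.segmentation import deeplabv3_resnet50

    NUM_CLASSES = 6  # background + five defect classes (assumed layout)
    model = deeplabv3_resnet50(weights=None, num_classes=NUM_CLASSES)

    # EL images are single-channel; replicating the channel to 3 keeps the stock
    # ResNet-50 stem usable (one possible choice, not necessarily the paper's).
    el_batch = torch.rand(4, 1, 512, 512).repeat(1, 3, 1, 1)

    model.eval()
    with torch.no_grad():
        logits = model(el_batch)["out"]      # (N, NUM_CLASSES, H, W)
    pred = logits.argmax(dim=1)              # per-pixel defect class map
    ```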