Long-Form Answers to Visual Questions from Blind and Low Vision People

Huh, Mina; Xu, Fangyuan; Peng, Yi-Hao; Chen, Congyan; Murugu, Hansika; Gurari, Danna; Choi, Eunsol; Pavel, Amy

Vision language models can now generate long-form answers to questions about images, known as long-form visual question answers (LFVQA). We contribute VizWiz-LF, a dataset of long-form answers to visual questions posed by blind and low vision (BLV) users. VizWiz-LF contains 4.2k long-form answers to 600 visual questions, collected from human expert describers and six VQA models. We develop and annotate functional roles of sentences in LFVQA and demonstrate that long-form answers contain information beyond the direct answer to the question, such as explanations and suggestions. We further evaluate long-form answers through automatic evaluation and human evaluation with BLV and sighted people. BLV people perceive both human-written and generated long-form answers as plausible, but generated answers often hallucinate incorrect visual details, especially for unanswerable visual questions (e.g., blurry or irrelevant images). To reduce hallucinations, we evaluate the ability of VQA models to abstain from answering unanswerable questions across multiple prompting strategies.

Award ID(s): 2521091

PAR ID: 10617277

Author(s) / Creator(s): Huh, Mina; Xu, Fangyuan; Peng, Yi-Hao; Chen, Congyan; Murugu, Hansika; Gurari, Danna; Choi, Eunsol; Pavel, Amy

Publisher / Repository: Conference on Language Modeling

Date Published: 2024-10-09

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text: Accepted Manuscript 1.0

Conference Paper: The DOI is not currently available.
