Title: What's in an ALT Tag? Exploring Caption Content Priorities through Collaborative Captioning
Evaluating the quality of accessible image captions with human raters is difficult: a visually impaired user may be unable to judge how comprehensive a caption is, whereas a sighted assistant may not know what information a user will need from a caption. To explore how image captioners and caption consumers assess caption content, we conducted a series of collaborative captioning sessions in which six pairs, each consisting of a blind person and their sighted partner, worked together to discuss, create, and evaluate image captions. By making captioning a collaborative task, we were able to observe captioning strategies, elicit questions and answers about image captions, and explore blind users’ caption preferences. Our findings provide insight into the process of creating good captions and serve as a case study for cross-ability collaboration between blind and sighted people.
Award ID(s):
1652907
NSF-PAR ID:
10323218
Author(s) / Creator(s):
;
Date Published:
Journal Name:
ACM Transactions on Accessible Computing
Volume:
15
Issue:
1
ISSN:
1936-7228
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Few VR applications and games implement captioning of speech and audio cues, which either inhibits or prevents access to these applications by deaf or hard of hearing (DHH) users, new language learners, and other caption users. Additionally, few or no guidelines exist on how to implement live captioning on VR headsets and how it may differ from traditional television captioning. To help fill the gap in information about user preferences among VR captioning styles, we conducted a study with eight DHH participants to test three caption movement behaviors (head-locked, lag, and appear-locked) while watching live-captioned, single-speaker presentations in VR. Participants answered a series of Likert scale and open-ended questions about their experience. Participants’ preferences were split, but most participants reported feeling comfortable with using live captions in VR and enjoyed the experience. When participants ranked the caption behaviors, there was almost an equal divide among the three types tested. IPQ results indicated each behavior had similar immersion ratings; however, participants found head-locked and lag captions more user-friendly than appear-locked captions. We suggest that participants may vary in caption preference depending on how they use captions, and that providing opportunities for caption customization is best.
  2. Few VR applications and games implement captioning of speech and audio cues, which either inhibits or prevents access to these applications by deaf or hard of hearing (DHH) users, new language learners, and other caption users. Additionally, few or no guidelines exist on how to implement live captioning on VR headsets and how it may differ from traditional television captioning. To help fill the gap in information about user preferences among VR captioning styles, we conducted a study with eight DHH participants to test three caption movement behaviors (head-locked, lag, and appear-locked) while watching live-captioned, single-speaker presentations in VR. Participants answered a series of Likert scale and open-ended questions about their experience. Participants’ preferences were split, but most participants reported feeling comfortable with using live captions in VR and enjoyed the experience. When participants ranked the caption behaviors, there was almost an equal divide among the three types tested. IPQ results indicated each behavior had similar immersion ratings; however, participants found head-locked and lag captions more user-friendly than appear-locked captions. We suggest that participants may vary in caption preference depending on how they use captions, and that providing opportunities for caption customization is best.
  3. Deep neural networks have achieved great success on the image captioning task. However, most existing models depend heavily on paired image-sentence datasets, which are very expensive to acquire. In this paper, we make the first attempt to train an image captioning model in an unsupervised manner. Instead of relying on manually labeled image-sentence pairs, our proposed model requires only an image set, a sentence corpus, and an existing visual concept detector. The sentence corpus is used to teach the captioning model how to generate plausible sentences. Meanwhile, the knowledge in the visual concept detector is distilled into the captioning model to guide it to recognize the visual concepts in an image. To further encourage the generated captions to be semantically consistent with the image, the image and caption are projected into a common latent space so that they can reconstruct each other. Because existing sentence corpora are mainly designed for linguistic research and thus make little reference to image content, we crawl a large-scale image description corpus of two million natural sentences to facilitate the unsupervised image captioning scenario. Experimental results show that our proposed model is able to produce quite promising results without any caption annotations.
  4. Working memory plays an important role in human activities across academic, professional, and social settings. Working memory is defined as the memory extensively involved in goal-directed behaviors in which information must be retained and manipulated to ensure successful task execution. The aim of this research is to understand the effect of image captioning with image description on an individual’s working memory. A study was conducted with eight neutral images depicting situations relatable to daily life, such that each image could have a positive or negative description associated with the outcome of the situation in the image. The study consisted of three rounds, where the first and second rounds each involved two parts and the third round consisted of one part. Each image was captioned a total of five times across the entire study. The findings highlighted that only 25% of participants were able to recall the captions they had written for an image after a span of 9–15 days; when comparing the recall rate of the captions, 50% of participants were able to recall the image caption from the previous round in the present round; and of the positive and negative descriptions associated with the image, 65% of participants recalled the former rather than the latter.
  5. Deaf and hard of hearing (DHH) individuals regularly rely on captioning while watching live TV. Live TV captioning is evaluated by regulatory agencies using various caption evaluation metrics. However, these metrics are often not informed by the preferences of DHH users or by how meaningful the captions are. There is a need to construct caption evaluation metrics that take the relative importance of words in a transcript into account. We conducted a correlation analysis between two types of word embeddings and human-annotated word-importance scores in an existing corpus. We found that normalized contextualized word embeddings generated using BERT correlated better with manually annotated importance scores than word2vec-based word embeddings. We make available a pairing of word embeddings and their human-annotated importance scores. We also provide proof-of-concept utility by training word importance models, achieving an F1-score of 0.57 in the 6-class word importance classification task.