Few VR applications and games implement captioning of speech and audio cues, which either inhibits or prevents access of their application by deaf or hard of hearing (DHH) users, new language learners, and other caption users. Additionally, little to no guidelines exist on how to implement live captioning on VR headsets and how it may differ from traditional television captioning. To help fill the void of information behind user preferences of different VR captioning styles, we conducted a study with eight DHH participants to test three caption movement behaviors (head-locked, lag, and appear- locked) while watching live-captioned, single-speaker presentations in VR. Participants answered a series of Likert scale and open-ended questions about their experience. Participants’ preferences were split, but most participants reported feeling comfortable with using live captions in VR and enjoyed the experience. When participants ranked the caption behaviors, there was almost an equal divide between the three types tested. IPQ results indicated each behavior had similar immersion ratings, however participants found head-locked and lag captions more user-friendly than appear-locked captions. We suggest that participants may vary in caption preference depending on how they use captions, and that providing opportunities for caption customization is best.
more »
« less
What's in an ALT Tag? Exploring Caption Content Priorities through Collaborative Captioning
Evaluating the quality of accessible image captions with human raters is difficult, as it may be difficult for a visually impaired user to know how comprehensive a caption is, whereas a sighted assistant may not know what information a user will need from a caption. To explore how image captioners and caption consumers assess caption content, we conducted a series of collaborative captioning sessions in which six pairs, consisting of a blind person and their sighted partner, worked together to discuss, create, and evaluate image captions. By making captioning a collaborative task, we were able to observe captioning strategies, to elicit questions and answers about image captions, and to explore blind users’ caption preferences. Our findings provide insight about the process of creating good captions and serve as a case study for cross-ability collaboration between blind and sighted people.
more »
« less
- Award ID(s):
- 1652907
- PAR ID:
- 10323218
- Date Published:
- Journal Name:
- ACM Transactions on Accessible Computing
- Volume:
- 15
- Issue:
- 1
- ISSN:
- 1936-7228
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Few VR applications and games implement captioning of speech and audio cues, which either inhibits or prevents access of their application by deaf or hard of hearing (DHH) users, new language learners, and other caption users. Additionally, little to no guidelines exist on how to implement live captioning on VR headsets and how it may differ from traditional television captioning. To help fill the void of information behind user preferences of different VR captioning styles, we conducted a study with eight DHH participants to test three caption movement behaviors (head-locked, lag, and appearlocked) while watching live-captioned, single-speaker presentations in VR. Participants answered a series of Likert scale and open-ended questions about their experience. Participants’ preferences were split, but most participants reported feeling comfortable with using live captions in VR and enjoyed the experience. When participants ranked the caption behaviors, there was almost an equal divide between the three types tested. IPQ results indicated each behavior had similar immersion ratings, however participants found head-locked and lag captions more user-friendly than appear-locked captions. We suggest that participants may vary in caption preference depending on how they use captions, and that providing opportunities for caption customization is bestmore » « less
-
Deep neural networks have achieved great successes on the image captioning task. However, most of the existing models depend heavily on paired image-sentence datasets, which are very expensive to acquire. In this paper, we make the first attempt to train an image captioning model in an unsupervised manner. Instead of relying on manually labeled image-sentence pairs, our proposed model merely requires an image set, a sentence corpus, and an existing visual concept detector. The sentence corpus is used to teach the captioning model how to generate plausible sentences. Meanwhile, the knowledge in the visual concept detector is distilled into the captioning model to guide the model to recognize the visual concepts in an image. In order to further encourage the generated captions to be semantically consistent with the image, the image and caption are projected into a common latent space so that they can reconstruct each other. Given that the existing sentence corpora are mainly designed for linguistic research and are thus with little reference to image contents, we crawl a large-scale image description corpus of two million natural sentences to facilitate the unsupervised image captioning scenario. Experimental results show that our proposed model is able to produce quite promising results without any caption annotations.more » « less
-
Deaf and hard of hearing individuals regularly rely on captioning while watching live TV. Live TV captioning is evaluated by regulatory agencies using various caption evaluation metrics. However, caption evaluation metrics are often not informed by preferences of DHH users or how meaningful the captions are. There is a need to construct caption evaluation metrics that take the relative importance of words in transcript into account. We conducted correlation analysis between two types of word embeddings and human-annotated labelled word-importance scores in existing corpus. We found that normalized contextualized word embeddings generated using BERT correlated better with manually annotated importance scores than word2vec-based word embeddings. We make available a pairing of word embeddings and their human-annotated importance scores. We also provide proof-of-concept utility by training word importance models, achieving an F1-score of 0.57 in the 6-class word importance classification task.more » « less
-
Caption text conveys salient auditory information to deaf or hard-of-hearing (DHH) viewers. However, the emotional information within the speech is not captured. We developed three emotive captioning schemas that map the output of audio-based emotion detection models to expressive caption text that can convey underlying emotions. The three schemas used typographic changes to the text, color changes, or both. Next, we designed a Unity framework to implement these schemas and used it to generate stimuli videos. In an experimental evaluation with 28 DHH viewers, we compared DHH viewers’ ability to understand emotions and their subjective judgments across the three captioning schemas. We found no significant difference in participants’ ability to understand the emotion based on the captions or their subjective preference ratings. Open-ended feedback revealed factors contributing to individual differences in preferences among the participants and challenges with automatically generated emotive captions that motivate future work.more » « less