Title: Flexible and scalable annotation tool to develop scene understanding datasets
Recent progress in data-driven vision- and language-based tasks demands training datasets enriched with multiple modalities representing human intelligence. The link between text and image data is one of the crucial modalities for developing AI models. Developing such datasets in the video domain requires considerable effort from researchers and annotators (experts and non-experts). Researchers redesign annotation tools to extract knowledge from annotators to answer new research questions, and the whole process repeats for each new question, which is time-consuming. Yet over the last decade there has been little change in how researchers and annotators interact with the annotation process. We revisit the annotation workflow and propose the concept of an adaptable and scalable annotation tool. The concept emphasizes its users' interactivity to make annotation process design seamless and efficient. Researchers can conveniently add new modalities to, or augment, existing datasets using the tool. Annotators can efficiently link free-form text to image objects. For conducting human-subject experiments at any scale, the tool supports data collection for attaining group ground truth. We conducted a case study using a prototype tool between two groups with 74 non-expert participants. We find that interactive linking of free-form text to image objects feels intuitive and evokes a thought process that results in high-quality annotation. The new design shows a ≈35% improvement in data annotation quality. On UX evaluation, we received above-average positive feedback from 25 people regarding convenience, UI assistance, usability, and satisfaction.
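The abstract does not describe the tool's data model. As a rough illustration only, a link between a free-form text span and an image object, plus a majority-style aggregation toward group ground truth, might look like the following sketch; every name and field here is hypothetical, not taken from the paper:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TextImageLink:
    """Hypothetical record linking a free-form text span to an image object."""
    annotator_id: str
    frame_id: str    # video frame the object appears in
    object_id: str   # identifier of the annotated image object
    text: str        # free-form text entered by the annotator
    span: tuple      # (start, end) character offsets into `text`


def group_ground_truth(links, min_votes=2):
    """Hypothetical aggregation: keep (frame, object, phrase) triples
    that at least `min_votes` annotators independently agree on."""
    votes = Counter(
        (link.frame_id, link.object_id,
         link.text[link.span[0]:link.span[1]].lower())
        for link in links
    )
    return {key for key, count in votes.items() if count >= min_votes}
```

Actual aggregation in the tool may be more sophisticated (e.g., weighting annotators or resolving paraphrases); this only shows the shape of a vote-based group ground truth.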
Award ID(s):
2145565
PAR ID:
10418231
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Workshop on Human-In-the-Loop Data Analytics (HILDA ’22)
Page Range / eLocation ID:
1 to 7
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Vlachos, Andreas; Augenstein, Isabelle (Ed.)
    Large-scale, high-quality corpora are critical for advancing research in coreference resolution. However, existing datasets vary in their definition of coreferences and have been collected via complex and lengthy guidelines that are curated for linguistic experts. These concerns have sparked a growing interest among researchers to curate a unified set of guidelines suitable for annotators with various backgrounds. In this work, we develop a crowdsourcing-friendly coreference annotation methodology, ezCoref, consisting of an annotation tool and an interactive tutorial. We use ezCoref to re-annotate 240 passages from seven existing English coreference datasets (spanning fiction, news, and multiple other domains) while teaching annotators only cases that are treated similarly across these datasets. Surprisingly, we find that reasonable quality annotations were already achievable (90% agreement between the crowd and expert annotations) even without extensive training. On carefully analyzing the remaining disagreements, we identify the presence of linguistic cases that our annotators unanimously agree upon but lack unified treatments (e.g., generic pronouns, appositives) in existing datasets. We propose the research community should revisit these phenomena when curating future unified annotation guidelines. 
  2. A growing swath of NLP research is tackling problems related to generating long text, including tasks such as open-ended story generation, summarization, dialogue, and more. However, we currently lack appropriate tools to evaluate these long outputs of generation models: classic automatic metrics such as ROUGE have been shown to perform poorly, and newer learned metrics do not necessarily work well for all tasks and domains of text. Human rating and error analysis remain a crucial component of any evaluation of long text generation. In this paper, we introduce FALTE, a web-based annotation toolkit designed to address this shortcoming. Our tool allows researchers to collect fine-grained judgments of text quality from crowdworkers using an error taxonomy specific to the downstream task. Using the task interface, annotators can select text spans and assign error labels to them in an incremental paragraph-level annotation workflow. The latter functionality is designed to break the document-level task into smaller units and reduce cognitive load on the annotators. Our tool has previously been used to run a large-scale annotation study evaluating the coherence of long generated summaries, demonstrating its utility.
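The FALTE abstract describes span selections carrying error labels inside a paragraph-level workflow, but not its internal format. A minimal sketch of what such a record could look like, with all names and fields hypothetical:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpanLabel:
    """One error label attached to a character span within a paragraph."""
    paragraph_idx: int  # which paragraph of the document
    start: int          # span start, character offset within the paragraph
    end: int            # span end (exclusive)
    error_type: str     # label from the task-specific error taxonomy


@dataclass
class ParagraphAnnotation:
    """All labels one annotator assigns within a single paragraph unit."""
    paragraph_idx: int
    text: str
    labels: list = field(default_factory=list)

    def add_label(self, start, end, error_type):
        # Reject spans that fall outside the paragraph text.
        if not (0 <= start < end <= len(self.text)):
            raise ValueError("span out of range")
        self.labels.append(
            SpanLabel(self.paragraph_idx, start, end, error_type)
        )
```

Grouping labels per paragraph mirrors the abstract's point that paragraph-level units keep each annotation decision small and reduce cognitive load.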
  3. With the increased popularity of electronic textbooks, there is a growing interest in developing a new generation of “intelligent textbooks,” which have the ability to guide readers according to their learning goals and current knowledge. Intelligent textbooks extend regular textbooks by integrating machine-manipulable knowledge, and the most popular type of integrated knowledge is a list of relevant concepts mentioned in the textbooks. With these concepts, multiple intelligent operations, such as content linking, content recommendation, or student modeling, can be performed. However, existing automatic keyphrase extraction methods, even supervised ones, cannot deliver sufficient accuracy to be practically useful in this task. Manual annotation by experts has been demonstrated to be a preferred approach for producing high-quality labeled data for training supervised models. However, most researchers in the education domain still consider the concept annotation process as an ad-hoc activity rather than a carefully executed task, which can result in low-quality annotated data. Using the annotation of concepts for the Introduction to Information Retrieval textbook as a case study, this paper presents a knowledge engineering method to obtain reliable concept annotations. As demonstrated by the data we collected, the inter-annotator agreement gradually increased along with our procedure, and the concept annotations we produced led to better results in document linking and student modeling tasks. The contributions of our work include a validated knowledge engineering procedure, a codebook for technical concept annotation, and a set of concept annotations for the target textbook, which could be used as a gold standard in further intelligent textbook research. 
  4. Kim, Yoon_Jeon; Swiecki, Zachari (Ed.)
    Identifying and annotating student use of debugging strategies when solving computer programming problems can be a meaningful tool for studying and better understanding the development of debugging skills, which may lead to the design of effective pedagogical interventions. However, this process can be challenging when dealing with large datasets, especially when the strategies of interest are rare but important. This difficulty lies not only in the scale of the dataset but also in operationalizing these rare phenomena within the data. Operationalization requires annotators to first define how these rare phenomena manifest in the data and then obtain a sufficient number of positive examples to validate that this definition is reliable by accurately measuring Inter-Rater Reliability (IRR). This paper presents a method that leverages Large Language Models (LLMs) to efficiently exclude computer programming episodes that are unlikely to exhibit a specific debugging strategy. By using LLMs to filter out irrelevant programming episodes, this method focuses human annotation efforts on the most pertinent parts of the dataset, enabling experts to operationalize the coding scheme and reach IRR more efficiently. 
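The abstract above does not name a specific IRR statistic. Cohen's kappa is a common choice for two annotators labelling the same episodes; a self-contained sketch:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators
    who each label the same n items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both labelled independently at their own rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1:  # degenerate case: both annotators use a single label
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

For rare strategies like those described above, kappa is informative precisely because raw percent agreement is inflated when most episodes are negatives; the `p_e` correction accounts for that imbalance.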
  5. This paper explores the application of sensemaking theory to support non-expert crowds in intricate data annotation tasks. We investigate the influence of procedural context and data context on the annotation quality of novice crowds, defining procedural context as completing multiple related annotation tasks on the same data point, and data context as annotating multiple data points with semantic relevance. We conducted a controlled experiment involving 140 non-expert crowd workers, who generated 1400 event annotations across various procedural and data context levels. Assessments of annotations demonstrate that high procedural context positively impacts annotation quality, although this effect diminishes with lower data context. Notably, assigning multiple related tasks to novice annotators yields comparable quality to expert annotations, without costing additional time or effort. We discuss the trade-offs associated with procedural and data contexts and draw design implications for engaging non-experts in crowdsourcing complex annotation tasks. 