skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Avoiding Overlap in Data Augmentation for AMR-to-Text Generation
Leveraging additional unlabeled data to boost model performance is common practice in machine learning and natural language processing. For generation tasks, if there is overlap between the additional data and the target text evaluation data, then training on the additional data is training on answers of the test set. This leads to overly-inflated scores with the additional data compared to real-world testing scenarios and problems when comparing models. We study the AMR dataset and Gigaword, which is popularly used for improving AMR-to-text generators, and find significant overlap between Gigaword and a subset of the AMR dataset. We propose methods for excluding parts of Gigaword to remove this overlap, and show that our approach leads to a more realistic evaluation of the task of AMR-to-text generation. Going forward, we give simple best-practice recommendations for leveraging additional data in AMR-to-text generation.  more » « less
Award ID(s):
2019805
PAR ID:
10494178
Author(s) / Creator(s):
;
Publisher / Repository:
Association for Computational Linguistics
Date Published:
Journal Name:
Proceedings of the conference Association for Computational Linguistics Meeting
ISSN:
0736-587X
Page Range / eLocation ID:
1043 to 1048
Format(s):
Medium: X
Location:
Online
Sponsoring Org:
National Science Foundation
More Like this
  1. Bias amplification is a phenomenon in which models exacerbate biases or stereotypes present in the training data. In this paper, we study bias amplification in the text-to-image domain using Stable Diffusion by comparing gender ratios in training vs. generated images. We find that the model appears to amplify gender-occupation biases found in the training data (LAION) considerably. However, we discover that amplification can be largely attributed to discrepancies between training captions and model prompts. For example, an inherent difference is that captions from the training data often contain explicit gender information while our prompts do not, which leads to a distribution shift and consequently inflates bias measures. Once we account for distributional differences between texts used for training and generation when evaluating amplification, we observe that amplification decreases drastically. Our findings illustrate the challenges of comparing biases in models and their training data, as well as evaluation more broadly, and highlight how confounding factors can impact analyses. 
    more » « less
  2. Wang, N. (Ed.)
    In education, intelligent learning environments allow students to choose how to tackle open-ended tasks while monitoring performance and behavior, allowing for the creation of adaptive support to help students overcome challenges. Timely feedback is critical to aid students’ progression toward learning and improved problem-solving. Feedback on text-based student responses can be delayed when teachers are overloaded with work. Automated evaluation can provide quick student feedback while easing the manual evaluation burden for teachers in areas with a high teacher-to-student ratio. Current methods of evaluating student essay responses to questions have included transformer-based natural language processing models with varying degrees of success. One main challenge in training these models is the scarcity of data for student-generated data. Larger volumes of training data are needed to create models that perform at a sufficient level of accuracy. Some studies have vast data, but large quantities are difficult to obtain when educational studies involve student-generated text. To overcome this data scarcity issue, text augmentation techniques have been employed to balance and expand the data set so that models can be trained with higher accuracy, leading to more reliable evaluation and categorization of student answers to aid teachers in the student’s learning progression. This paper examines the text-generating AI model, GPT-3.5, to determine if prompt-based text-generation methods are viable for generating additional text to supplement small sets of student responses for machine learning model training. We augmented student responses across two domains using GPT-3.5 completions and used that data to train a multilingual BERT model. Our results show that text generation can improve model performance on small data sets over simple self-augmentation. 
    more » « less
  3. Uniform Meaning Representation (UMR) is the next phase of semantic formalism following Abstract Meaning Representation (AMR), with added focus on inter-sentential relations allowing the representational scope of UMR to cover a full document. This, in turn, greatly increases the complexity of its parsing task with the additional requirement of capturing document-level linguistic phenomena such as coreference, modal and temporal dependencies. In order to establish a strong baseline despite the small size of recently released UMR v1.0 corpus, we introduce a pipeline model that does not require any training. At the core of our method is a two-track strategy of obtaining UMR’s sentence and document graphs separately, with the document-level triples being compiled at the token level and the sentence graph being converted from AMR graphs. By leveraging alignment between AMR and its sentence, we are able to generate the first automatic English UMR parses. 
    more » « less
  4. null (Ed.)
    The task of long-form question answering (LFQA) involves retrieving documents relevant to a given question and using them to generate a paragraph-length answer. While many models have recently been proposed for LFQA, we show in this paper that the task formulation raises fundamental challenges regarding evaluation and dataset creation that currently preclude meaningful modeling progress. To demonstrate these challenges, we first design a new system that relies on sparse attention and contrastive retriever learning to achieve state-of-the-art performance on the ELI5 LFQA dataset. While our system tops the public leaderboard, a detailed analysis reveals several troubling trends: (1) our system’s generated answers are not actually grounded in the documents that it retrieves; (2) ELI5 contains significant train / validation overlap, as at least 81% of ELI5 validation questions occur in paraphrased form in the training set; (3) ROUGE-L is not an informative metric of generated answer quality and can be easily gamed; and (4) human evaluations used for other text generation tasks are unreliable for LFQA. We offer suggestions to mitigate each of these issues, which we hope will lead to more rigorous LFQA research and meaningful progress in the future. 
    more » « less
  5. Despite the fact that robotic platforms can provide both consistent practice and objective assessments of users over the course of their training, there are relatively few instances where physical human–robot interaction has been significantly more effective than unassisted practice or human-mediated training. This article describes a hybrid shared control robot, which enhances task learning through kinesthetic feedback. The assistance assesses user actions using a task-specific evaluation criterion and selectively accepts or rejects them at each time instant. Through two human subject studies (total [Formula: see text]), we show that this hybrid approach of switching between full transparency and full rejection of user inputs leads to increased skill acquisition and short-term retention compared with unassisted practice. Moreover, we show that the shared control paradigm exhibits features previously shown to promote successful training. It avoids user passivity by only rejecting user actions and allowing failure at the task. It improves performance during assistance, providing meaningful task-specific feedback. It is sensitive to initial skill of the user and behaves as an “assist-as-needed” control scheme, adapting its engagement in real time based on the performance and needs of the user. Unlike other successful algorithms, it does not require explicit modulation of the level of impedance or error amplification during training and it is permissive to a range of strategies because of its evaluation criterion. We demonstrate that the proposed hybrid shared control paradigm with a task-based minimal intervention criterion significantly enhances task-specific training. 
    more » « less