Avoiding Overlap in Data Augmentation for AMR-to-Text Generation

Du, Wenchao; Flanigan, Jeffrey

doi:10.18653/v1/2021.acl-short.132

Citation Details

Leveraging additional unlabeled data to boost model performance is common practice in machine learning and natural language processing. For generation tasks, if there is overlap between the additional data and the target text evaluation data, then training on the additional data is training on answers of the test set. This leads to overly-inflated scores with the additional data compared to real-world testing scenarios and problems when comparing models. We study the AMR dataset and Gigaword, which is popularly used for improving AMR-to-text generators, and find significant overlap between Gigaword and a subset of the AMR dataset. We propose methods for excluding parts of Gigaword to remove this overlap, and show that our approach leads to a more realistic evaluation of the task of AMR-to-text generation. Going forward, we give simple best-practice recommendations for leveraging additional data in AMR-to-text generation.
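
The abstract does not spell out how overlapping portions of Gigaword are excluded, so the following is only a generic illustration of the idea, not the authors' procedure. It is a minimal sketch that assumes whitespace tokenization and a 4-gram overlap test; the function names (`overlap_fraction`, `filter_overlapping`) and the threshold are hypothetical choices made for this example.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def tokenize(sentence: str) -> List[str]:
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    return sentence.lower().split()


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    # All contiguous n-grams of a token list (empty if the list is shorter than n).
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_ngrams(eval_sentences: Iterable[str], n: int) -> Set[NGram]:
    # Union of n-grams over every evaluation (test) sentence.
    seen: Set[NGram] = set()
    for sentence in eval_sentences:
        seen |= ngrams(tokenize(sentence), n)
    return seen


def overlap_fraction(sentence: str, eval_ngrams: Set[NGram], n: int) -> float:
    # Fraction of the sentence's n-grams that also occur in the evaluation data.
    grams = ngrams(tokenize(sentence), n)
    if not grams:
        # Sentences shorter than n tokens have no n-grams; treated as non-overlapping.
        return 0.0
    return len(grams & eval_ngrams) / len(grams)


def filter_overlapping(
    extra_sentences: Iterable[str],
    eval_sentences: Iterable[str],
    n: int = 4,
    max_overlap: float = 0.0,
) -> List[str]:
    # Keep only additional-data sentences whose overlap with the evaluation
    # data does not exceed max_overlap (hypothetical threshold).
    eval_ngrams = build_eval_ngrams(eval_sentences, n)
    return [
        s for s in extra_sentences
        if overlap_fraction(s, eval_ngrams, n) <= max_overlap
    ]
```

With `max_overlap=0.0`, any additional sentence sharing even one 4-gram with the evaluation text is dropped. A looser threshold keeps more data at the cost of some leakage, and coarser exclusion (for example, removing whole source documents) would be a different design than the sentence-level filter shown here.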

Award ID(s): 2019805

PAR ID: 10494178

Author(s) / Creator(s): Du, Wenchao; Flanigan, Jeffrey

Publisher / Repository: Association for Computational Linguistics

Date Published: 2021-01-01

Journal Name: Proceedings of the conference Association for Computational Linguistics Meeting

ISSN: 0736-587X

Page Range / eLocation ID: 1043 to 1048

Format(s): Medium: X

Location: Online

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.18653/v1/2021.acl-short.132